Compressing Gradients by Exploiting Temporal Correlation in Momentum-SGD
Communication is a growing bottleneck in decentralized optimization. Bigger models and growing datasets make decentralized computation important, and they rapidly increase the amount of information that must be exchanged. Compression techniques have been introduced to reduce this communication load, but none has considered leveraging the temporal correlation that exists between consecutive vector updates. An important example is distributed momentum-SGD, where the low-pass-filtering effect of momentum enhances this temporal correlation.
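To make the low-pass-filtering claim concrete, the following is a minimal sketch under a toy assumption that is not taken from the paper: stochastic gradients are modeled as i.i.d. zero-mean Gaussian vectors, and momentum is the plain recursion v_t = beta * v_{t-1} + g_t. The function names, dimensions, and the value of `beta` are illustrative choices. The sketch compares the average cosine similarity between consecutive raw gradients with that between consecutive momentum vectors.

```python
import numpy as np


def mean_consecutive_cosine(vectors):
    """Average cosine similarity between each row and the row before it."""
    prev, curr = vectors[:-1], vectors[1:]
    dots = np.sum(prev * curr, axis=1)
    norms = np.linalg.norm(prev, axis=1) * np.linalg.norm(curr, axis=1)
    return float(np.mean(dots / norms))


def simulate(beta=0.9, steps=500, dim=1000, burn_in=50, seed=0):
    """Return (cosine of raw gradients, cosine of momentum vectors)."""
    rng = np.random.default_rng(seed)
    # Assumption: gradients are i.i.d. zero-mean Gaussian noise (a toy model,
    # chosen so that the raw gradients have no temporal correlation at all).
    grads = rng.standard_normal((steps, dim))

    momentum = np.zeros_like(grads)
    v = np.zeros(dim)
    for t in range(steps):
        # Momentum acts as a first-order low-pass filter on the gradients.
        v = beta * v + grads[t]
        momentum[t] = v

    # Discard the initial transient before the filter reaches steady state.
    return (
        mean_consecutive_cosine(grads[burn_in:]),
        mean_consecutive_cosine(momentum[burn_in:]),
    )


cos_grad, cos_momentum = simulate()
print(f"raw gradients:    {cos_grad:.3f}")
print(f"momentum vectors: {cos_momentum:.3f}")
```

In this toy model the raw gradients are essentially uncorrelated from one step to the next, while consecutive momentum vectors have a cosine similarity close to `beta`. Real gradients are not i.i.d., so the numbers will differ in practice; the sketch only illustrates why momentum tends to increase the temporal correlation that a compressor could exploit.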