Abstract: Distributed data-parallel algorithms aim to accelerate the training of deep neural networks by parallelizing the computation of large mini-batch gradient updates across multiple nodes. Approaches that synchronize nodes using exact distributed averaging (e.g., via AllReduce) are sensitive to stragglers and communication delays. The PushSum gossip algorithm is robust to these issues, but only performs approximate distributed averaging. PushSum diffuses information over a peer-to-peer network connecting the nodes, and is closely related to graph filtering. In this talk I will discuss our recent work on Stochastic Gradient Push (SGP) for supervised learning, which combines PushSum with stochastic gradient updates, as well as a recent extension that includes a slow momentum update to improve optimization and generalization. We prove that these methods converge to a stationary point of smooth, non-convex objectives at the same sub-linear rate as SGD. Moreover, the nodes are guaranteed to achieve a consensus, and these methods achieve a linear speedup with respect to the number of compute nodes. We empirically validate the performance of SGP on image classification and machine translation tasks. The talk is based on joint work with Mido Assran, Nicolas Ballas, Nicolas Loizou, and Jianyu Wang.
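To give a feel for the PushSum averaging step at the heart of SGP, here is a toy scalar simulation (a minimal sketch, not the authors' implementation; the function name `push_sum_average` and the uniform splitting weights are illustrative assumptions). Each node keeps a numerator and a weight, pushes shares of both to its out-neighbors, and the de-biased ratio converges to the network-wide average:

```python
import numpy as np

def push_sum_average(values, out_neighbors, num_iters=100):
    """Toy PushSum gossip averaging (illustrative, not the paper's code).

    values: initial scalar held at each node.
    out_neighbors: out_neighbors[i] = nodes that node i pushes to.
    Each node splits its (numerator, weight) pair uniformly among itself
    and its out-neighbors; the ratio x_i / w_i converges to the global
    mean when the directed communication graph is strongly connected.
    """
    n = len(values)
    x = np.array(values, dtype=float)  # PushSum numerators
    w = np.ones(n)                     # PushSum weights
    for _ in range(num_iters):
        new_x = np.zeros(n)
        new_w = np.zeros(n)
        for i in range(n):
            targets = [i] + list(out_neighbors[i])  # node keeps one share
            share = 1.0 / len(targets)              # uniform split (assumption)
            for j in targets:
                new_x[j] += share * x[i]
                new_w[j] += share * w[i]
        x, w = new_x, new_w
    return x / w  # de-biased estimates; every entry approaches mean(values)

# Directed ring on 5 nodes: node i pushes only to node i+1 (mod 5).
ring = [[(i + 1) % 5] for i in range(5)]
estimates = push_sum_average([1.0, 2.0, 3.0, 4.0, 5.0], ring)
```

In SGP, each node would interleave a local stochastic gradient step on its own mini-batch with one such gossip round, so no global AllReduce barrier is needed.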
Bio: Michael Rabbat received the B.Sc. from the University of Illinois, Urbana-Champaign, in 2001, the M.Sc. from Rice University, Houston, TX, in 2003, and the Ph.D. from the University of Wisconsin, Madison, in 2006, all in electrical engineering. He is currently a Research Scientist in Facebook's Artificial Intelligence Research group (FAIR). From 2007 to 2018 he was a professor in the Department of Electrical and Computer Engineering at McGill University, Montreal, Canada. During the 2013--2014 academic year he held visiting positions at Télécom Bretagne, Brest, France, the Inria Bretagne-Atlantique Research Centre, Rennes, France, and KTH Royal Institute of Technology, Stockholm, Sweden. His research interests include optimization, distributed algorithms, and signal processing for machine learning.