Minimizing Staleness and Communication Overhead in Distributed SGD for Collaborative Filtering

Abstract
Distributed asynchronous stochastic gradient descent (ASGD) algorithms that approximate low-rank matrix factorizations for collaborative filtering perform one or more synchronizations per epoch, where staleness decreases as the number of synchronizations grows. However, a large number of synchronizations hinders the scalability of the algorithm. We propose a parallel ASGD algorithm, $\eta$-PASGD, for efficiently handling $\eta$ synchronizations per epoch in a scalable fashion. The proposed algorithm puts an upper limit of $K$ on $\eta$ for a $K$-processor system, so that performing $\eta=K$ synchronizations per epoch eliminates staleness completely. The rating data used in collaborative filtering are usually represented as sparse matrices. This sparsity allows both staleness and communication overhead to be reduced combinatorially by intelligently distributing the data among processors. We analyze the staleness and the total communication volume incurred during an epoch of $\eta$-PASGD. Following this analysis, we propose a hypergraph partitioning model that encapsulates the reduction of staleness and volume while minimizing the maximum number of synchronizations required for stale-free SGD. This encapsulation is achieved with a novel cutsize metric realized via a new recursive-bipartitioning-based algorithm. Experiments on up to 512 processors show the importance of the proposed partitioning method in improving staleness, volume, RMSE and parallel runtime.
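To make the synchronization idea concrete, the following is a minimal sketch (not the authors' $\eta$-PASGD implementation) of SGD matrix factorization in which an epoch is split into $\eta$ synchronization intervals. Within each interval, $K$ simulated workers update private, possibly stale copies of the item-factor matrix, and their updates are merged at each synchronization point. All function and parameter names here are hypothetical illustrations.

```python
import numpy as np

def eta_sync_sgd_sketch(R, rank=2, K=2, eta=2, epochs=200, lr=0.05, reg=0.01, seed=0):
    """Toy serial simulation of SGD matrix factorization with eta
    synchronizations per epoch across K simulated workers.
    Hypothetical illustration of the staleness/synchronization trade-off,
    not the paper's eta-PASGD algorithm."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    W = 0.1 * rng.standard_normal((m, rank))   # user factors
    H = 0.1 * rng.standard_normal((n, rank))   # item factors
    nz = list(zip(*np.nonzero(R)))             # observed (user, item) ratings
    for _ in range(epochs):
        order = rng.permutation(len(nz))
        # eta synchronization intervals per epoch
        for interval in np.array_split(order, eta):
            H_old = H.copy()
            # each worker reads a stale copy of H at the start of the interval
            local_H = [H.copy() for _ in range(K)]
            for w, part in enumerate(np.array_split(interval, K)):
                for idx in part:
                    i, j = nz[idx]
                    wi = W[i].copy()
                    hj = local_H[w][j].copy()
                    e = R[i, j] - wi @ hj
                    W[i] += lr * (e * hj - reg * wi)
                    local_H[w][j] += lr * (e * wi - reg * hj)
            # synchronization point: merge the workers' accumulated updates
            H = H_old + sum(lh - H_old for lh in local_H)
    err = [(R[i, j] - W[i] @ H[j]) ** 2 for i, j in nz]
    return float(np.sqrt(np.mean(err)))
```

With more intervals per epoch (larger $\eta$), the workers' copies of `H` are refreshed more often and are therefore less stale, at the cost of more merge steps, which mirrors the trade-off the abstract describes.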
Funding Information
  • Scientific and Technological Research Council of Türkiye (EEEAG-119E035)
  • National Center for High Performance Computing of Türkiye (4009972021)
