DOI: 10.1145/3078597.3078602
Research article

CuMF_SGD: Parallelized Stochastic Gradient Descent for Matrix Factorization on GPUs

Published: 26 June 2017

Abstract

Stochastic gradient descent (SGD) is widely used by many machine learning algorithms. It is efficient for big data applications due to its low algorithmic complexity. SGD is inherently serial, and its parallelization is not trivial; parallelizing SGD efficiently on many-core architectures such as GPUs is a significant challenge. In this paper, we present cuMF_SGD, a parallelized SGD solution for matrix factorization on GPUs. We first design high-performance GPU computation kernels that accelerate individual SGD updates by exploiting model parallelism. We then design efficient schemes that parallelize SGD updates by exploiting data parallelism. Finally, we scale cuMF_SGD to large data sets that cannot fit into one GPU's memory. Evaluations on three public data sets show that cuMF_SGD outperforms existing solutions, including a 64-node CPU system, by a large margin using only one GPU card.
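
To make the parallelization idea concrete, the sketch below shows the standard per-rating SGD update for matrix factorization (error = r_uv - p_u.q_v, then a regularized gradient step on both factor vectors) written as a CUDA kernel in which one warp handles one rating and the 32 latent features of p_u and q_v are spread across the 32 warp lanes, i.e. model parallelism within a single update. This is only an illustrative sketch under assumed names and a fixed factor size of K = 32; it is not the authors' actual kernel, and concurrent warps may race on shared rows, which lock-free (Hogwild-style) SGD tolerates.

// Assumptions: K == 32 latent features (one per warp lane), blockDim.x is a
// multiple of 32, P and Q are row-major num_users x K and num_items x K arrays.
#include <cuda_runtime.h>

__global__ void mf_sgd_update(const int   *user_idx,   // user id of each rating
                              const int   *item_idx,   // item id of each rating
                              const float *ratings,    // observed rating values
                              float *P, float *Q,      // factor matrices, row-major
                              int num_ratings, int K,
                              float lr, float lambda)
{
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane    = threadIdx.x % 32;
    if (warp_id >= num_ratings) return;

    int u = user_idx[warp_id];
    int v = item_idx[warp_id];
    float p = P[u * K + lane];          // this lane's feature of p_u
    float q = Q[v * K + lane];          // this lane's feature of q_v

    // Dot product p_u . q_v via a warp-shuffle reduction (no shared memory).
    float prod = p * q;
    for (int offset = 16; offset > 0; offset /= 2)
        prod += __shfl_down_sync(0xffffffff, prod, offset);
    float err = ratings[warp_id] - __shfl_sync(0xffffffff, prod, 0);

    // Regularized SGD step on both factor vectors, one feature per lane.
    P[u * K + lane] = p + lr * (err * q - lambda * p);
    Q[v * K + lane] = q + lr * (err * p - lambda * q);
}

Data parallelism then amounts to launching enough warps to cover many ratings at once, and scaling beyond one GPU's memory requires partitioning the rating matrix so that disjoint blocks of users and items can be processed in turn.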


Published In

HPDC '17: Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing
June 2017
254 pages
ISBN: 978-1-4503-4699-3
DOI: 10.1145/3078597
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 June 2017

Author Tags

  • gpgpu
  • matrix factorization
  • parallel computing

Qualifiers

  • Research-article

Conference

HPDC '17
