DOI: 10.5555/3524938.3525300 · Research article · Free access

Communication-efficient distributed stochastic AUC maximization with deep neural networks

Published: 13 July 2020

Abstract

In this paper, we study distributed algorithms for large-scale AUC maximization with a deep neural network as the predictive model. Although distributed learning techniques have been investigated extensively in deep learning, they are not directly applicable to stochastic AUC maximization with deep neural networks because it differs markedly from standard loss-minimization problems (e.g., cross-entropy). To address this challenge, we propose and analyze a communication-efficient distributed optimization algorithm based on a non-convex concave reformulation of AUC maximization, in which each worker communicates both the primal variable and the dual variable to the parameter server only after multiple steps of local gradient-based updates. Compared with the naive parallel version of an existing algorithm, which computes stochastic gradients on individual machines and averages them to update the model parameters, our algorithm requires far fewer communication rounds while still achieving a linear speedup in theory. To the best of our knowledge, this is the first work that solves the non-convex concave min-max problem for AUC maximization with deep neural networks in a communication-efficient distributed manner while maintaining the theoretical linear-speedup property. Our experiments on several benchmark datasets demonstrate the effectiveness of our algorithm and confirm our theory.
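
For reference, the non-convex concave reformulation mentioned above builds on the well-known square-loss min-max formulation of AUC maximization, with a deep network $h_{\mathbf{w}}$ as the score function; the notation below is illustrative rather than copied from the paper:

$$
\min_{\mathbf{w},\,a,\,b}\ \max_{\alpha\in\mathbb{R}}\ \mathbb{E}_{(\mathbf{x},y)}\Big[(1-p)\big(h_{\mathbf{w}}(\mathbf{x})-a\big)^2\,\mathbb{I}[y=1] + p\big(h_{\mathbf{w}}(\mathbf{x})-b\big)^2\,\mathbb{I}[y=-1] + 2(1+\alpha)\big(p\,h_{\mathbf{w}}(\mathbf{x})\,\mathbb{I}[y=-1]-(1-p)\,h_{\mathbf{w}}(\mathbf{x})\,\mathbb{I}[y=1]\big) - p(1-p)\,\alpha^2\Big],
$$

where $p=\Pr(y=1)$ and $a,b,\alpha$ are auxiliary scalars. With a non-linear $h_{\mathbf{w}}$ the objective is non-convex in the primal variables $(\mathbf{w},a,b)$ but concave in the dual variable $\alpha$, which is the min-max structure the distributed algorithm exploits.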
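The communication pattern described in the abstract, several local primal-descent/dual-ascent steps on each worker followed by averaging of both the primal and dual variables at a parameter server, can be illustrated with the toy sketch below. It simulates the workers sequentially on a simple quadratic min-max objective with hypothetical step sizes and sampling; it is not the paper's algorithm, and all convergence-critical details are omitted.

```python
# Toy sketch of periodic-averaging local primal-dual SGD for a min-max problem:
# each "worker" runs several local gradient descent (primal) / ascent (dual)
# steps between communication rounds, and a parameter server averages both
# variables. Illustrative only; not the paper's algorithm.
import numpy as np

rng = np.random.default_rng(0)
K = 4          # number of workers
I = 8          # local steps between communication rounds
R = 50         # communication rounds
eta_w, eta_a = 0.05, 0.05   # primal / dual step sizes
d = 10

# Toy stochastic min-max objective:
#   min_w max_a  E_z[ 0.5*||w - z||^2 + a*(c^T w) - 0.5*a^2 ]
c = rng.normal(size=d)

def grad_w(w, a, z):
    return (w - z) + a * c          # stochastic gradient w.r.t. primal variable w

def grad_a(w, a, z):
    return c @ w - a                # stochastic gradient w.r.t. dual variable a

w_server, a_server = np.zeros(d), 0.0
for r in range(R):
    w_locals, a_locals = [], []
    for k in range(K):              # each worker starts from the server copy
        w, a = w_server.copy(), a_server
        for _ in range(I):          # multiple local updates, no communication
            z = rng.normal(size=d)  # worker's stochastic sample
            w = w - eta_w * grad_w(w, a, z)   # primal descent step
            a = a + eta_a * grad_a(w, a, z)   # dual ascent step
        w_locals.append(w)
        a_locals.append(a)
    # one communication round: the server averages primal and dual variables
    w_server = np.mean(w_locals, axis=0)
    a_server = float(np.mean(a_locals))

print("final primal norm:", np.linalg.norm(w_server), "dual:", a_server)
```

In a real deployment the K workers would run in parallel (e.g., separate processes communicating through all-reduce or a parameter server), and the objective would be the AUC min-max reformulation above rather than this toy quadratic.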

Supplementary Material

Supplemental material: 3524938.3525300_supp.pdf



Published In

ICML'20: Proceedings of the 37th International Conference on Machine Learning
JMLR.org, July 2020, 11702 pages
