DOI: 10.5555/2999611.2999654
Article

Accelerated mini-batch stochastic dual coordinate ascent

Published: 05 December 2013

Abstract

Stochastic dual coordinate ascent (SDCA) is an effective technique for solving regularized loss minimization problems in machine learning. This paper considers an extension of SDCA under the mini-batch setting that is often used in practice. Our main contribution is to introduce an accelerated mini-batch version of SDCA and prove a fast convergence rate for this method. We discuss an implementation of our method over a parallel computing system, and compare the results to both the vanilla stochastic dual coordinate ascent and to the accelerated deterministic gradient descent method of Nesterov [2007].
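The abstract describes mini-batch dual coordinate updates only at a high level; the sketch below is a minimal, non-accelerated mini-batch SDCA loop for the L2-regularized SVM (hinge loss), written to make the setting concrete. It is not the paper's accelerated method, which adds a Nesterov-style momentum sequence and a tuned batch step size on top of such updates; the conservative 1/b averaging of the simultaneous updates, and every function and variable name, are assumptions made for this illustration.

import numpy as np

def minibatch_sdca_hinge(X, y, lam=0.01, batch_size=8, epochs=20, seed=0):
    """X: (n, d) feature matrix, y: (n,) labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)                      # dual variables, one per example
    w = np.zeros(d)                          # primal vector, kept equal to X^T alpha / (lam * n)
    sq_norms = np.einsum("ij,ij->i", X, X)   # precomputed ||x_i||^2

    for _ in range(epochs):
        for _ in range(max(1, n // batch_size)):
            idx = rng.choice(n, size=batch_size, replace=False)
            delta = np.zeros(batch_size)
            for k, i in enumerate(idx):
                # Closed-form SDCA step for the hinge loss phi_i(z) = max(0, 1 - y_i z),
                # evaluated against the same w for every example in the batch.
                residual = 1.0 - y[i] * (X[i] @ w)
                proj = np.clip(alpha[i] * y[i] + lam * n * residual / max(sq_norms[i], 1e-12),
                               0.0, 1.0)
                delta[k] = proj * y[i] - alpha[i]
            # Conservative aggregation: average the simultaneous updates (scale by 1/b)
            # so the batch step cannot overshoot. The paper instead analyzes a tuned
            # batch step size together with an acceleration (momentum) sequence.
            alpha[idx] += delta / batch_size
            w += X[idx].T @ (delta / batch_size) / (lam * n)
    return w, alpha

# Toy usage on synthetic, roughly separable data (hypothetical example).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]))
w, _ = minibatch_sdca_hinge(X, y, lam=0.1)
print("training accuracy:", np.mean(np.sign(X @ w) == y))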

References

[1]
Alekh Agarwal and John C Duchi. Distributed delayed stochastic optimization. In 51st IEEE Conference on Decision and Control (CDC), pages 5451-5452, 2012.
[2]
Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, and John Langford. A reliable effective terascale linear learning system. arXiv preprint arXiv:1110.4198, 2011.
[3]
Maria-Florina Balcan, Avrim Blum, Shai Fine, and Yishay Mansour. Distributed learning, communication complexity and privacy. arXiv preprint arXiv:1204.3514, 2012.
[4]
Joseph K Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin. Parallel coordinate descent for l1-regularized loss minimization. In ICML, 2011.
[5]
Andrew Cotter, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Better mini-batch algorithms via accelerated gradient methods. arXiv preprint arXiv:1106.4574, 2011.
[6]
Hal Daume III, Jeff M Phillips, Avishek Saha, and Suresh Venkatasubramanian. Protocols for learning classifiers on distributed data. arXiv preprint arXiv:1202.6078, 2012.
[7]
Ofer Dekel. Distribution-calibrated hierarchical classification. In NIPS, 2010.
[8]
Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. The Journal of Machine Learning Research, 13:165-202, 2012.
[9]
John Duchi, Alekh Agarwal, and Martin J Wainwright. Distributed dual averaging in networks. Advances in Neural Information Processing Systems, 23, 2010.
[10]
Olivier Fercoq and Peter Richtárik. Smooth minimization of nonsmooth functions with parallel coordinate descent methods. arXiv preprint arXiv:1309.5885, 2013.
[11]
Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. arXiv preprint arXiv:1202.6258, 2012.
[12]
Phil Long and Rocco Servedio. Algorithms and hardness results for parallel large margin learning. In NIPS, 2011.
[13]
Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M Hellerstein. Graphlab: A new framework for parallel machine learning. arXiv preprint arXiv:1006.4990, 2010.
[14]
Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M Hellerstein. Distributed graphlab: A framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment, 5(8):716-727, 2012.
[15]
Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, 2005.
[16]
Yurii Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion Paper, 2007.
[17]
Feng Niu, Benjamin Recht, Christopher Ré, and Stephen J Wright. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. arXiv preprint arXiv:1106.5730, 2011.
[18]
Peter Richtárik and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, pages 1-38, 2012a.
[19]
Peter Richtárik and Martin Takáč. Parallel coordinate descent methods for big data optimization. arXiv preprint arXiv:1212.0873, 2012b.
[20]
Peter Richtárik and Martin Takáč. Distributed coordinate descent method for learning with big data. arXiv preprint arXiv:1310.2059, 2013.
[21]
Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567-599, Feb 2013a.
[22]
Shai Shalev-Shwartz and Tong Zhang. Accelerated mini-batch stochastic dual coordinate ascent. arXiv preprint, 2013b.
[23]
Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In ICML, pages 807-814, 2007.
[24]
Martin Takáč, Avleen Bijral, Peter Richtárik, and Nathan Srebro. Mini-batch primal and dual methods for SVMs. arXiv preprint, 2013.



Published In

NIPS'13: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1
December 2013
3236 pages

Publisher

Curran Associates Inc.

Red Hook, NY, United States

