Stochastic primal-dual coordinate method for regularized empirical risk minimization

Published: 01 January 2017

Abstract

We consider a generic convex optimization problem associated with regularized empirical risk minimization of linear predictors. The problem structure allows us to reformulate it as a convex-concave saddle point problem. We propose a stochastic primal-dual coordinate (SPDC) method, which alternates between maximizing over a randomly chosen dual variable and minimizing over the primal variables. An extrapolation step on the primal variables is performed to obtain an accelerated convergence rate. We also develop a mini-batch version of the SPDC method, which facilitates parallel computing, and an extension with weighted sampling probabilities on the dual variables, which yields better complexity than uniform sampling on unnormalized data. Both theoretically and empirically, we show that the SPDC method has comparable or better performance than several state-of-the-art optimization methods.
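
To make the alternating structure concrete, below is a minimal Python/NumPy sketch of an SPDC-style loop for ridge-regularized least squares (squared loss plus an L2 regularizer), where both the dual and primal proximal steps have closed forms. The step sizes tau and sigma, the extrapolation parameter theta, and the function name spdc_ridge are illustrative choices consistent with the description above, not the exact constants or notation of the paper.

    # Sketch of an SPDC-style update loop for ridge-regularized least squares:
    #   min_x (1/n) * sum_i 0.5*(a_i^T x - b_i)^2 + (lam/2)*||x||^2
    # reformulated as a convex-concave saddle point problem over (x, y).
    import numpy as np

    def spdc_ridge(A, b, lam, num_epochs=50, seed=0):
        """A: (n, d) data matrix, b: (n,) targets, lam: ridge parameter."""
        rng = np.random.default_rng(seed)
        n, d = A.shape
        R = np.max(np.linalg.norm(A, axis=1))       # largest row norm
        gamma = 1.0                                 # squared loss is 1-smooth
        tau = np.sqrt(gamma / (n * lam)) / (2 * R)  # primal step size (illustrative)
        sigma = np.sqrt(n * lam / gamma) / (2 * R)  # dual step size (illustrative)
        theta = 1.0 - 1.0 / (n + 2 * R * np.sqrt(n / (lam * gamma)))  # extrapolation

        x = np.zeros(d)        # primal variables
        x_bar = x.copy()       # extrapolated primal point
        y = np.zeros(n)        # dual variables, one per example
        u = A.T @ y / n        # running average (1/n) * sum_i y_i * a_i

        for _ in range(num_epochs * n):
            i = rng.integers(n)
            a_i = A[i]

            # Dual step: maximize over the randomly chosen dual coordinate y_i,
            # i.e. the prox of the conjugate of 0.5*(z - b_i)^2, in closed form.
            y_new = (sigma * (a_i @ x_bar - b_i) + y[i]) / (sigma + 1.0)

            # Primal step: prox of (lam/2)||x||^2 around a gradient-like step
            # built from the running average plus the dual increment.
            v = u + (y_new - y[i]) * a_i
            x_new = (x - tau * v) / (1.0 + lam * tau)

            # Extrapolation step on the primal variables.
            x_bar = x_new + theta * (x_new - x)

            # Maintain the running average of y_i * a_i.
            u = u + (y_new - y[i]) * a_i / n
            y[i] = y_new
            x = x_new

        return x

Each iteration touches a single row of A, so the per-iteration cost is O(d); the mini-batch variant mentioned above would instead draw several indices per iteration and average their dual increments.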



Published In

The Journal of Machine Learning Research, Volume 18, Issue 1
January 2017
8830 pages
ISSN:1532-4435
EISSN:1533-7928

Publisher

JMLR.org

Publication History

Published: 01 January 2017
Published in JMLR Volume 18, Issue 1

Author Tags

  1. computational complexity
  2. convex-concave saddle point problems
  3. empirical risk minimization
  4. primal-dual algorithms
  5. randomized algorithms

Qualifiers

  • Article
