Stochastic primal-dual coordinate method for regularized empirical risk minimization

Published: 01 January 2017

Abstract

We consider a generic convex optimization problem associated with regularized empirical risk minimization of linear predictors. The problem structure allows us to reformulate it as a convex-concave saddle point problem. We propose a stochastic primal-dual coordinate (SPDC) method, which alternates between maximizing over a randomly chosen dual variable and minimizing over the primal variables. An extrapolation step on the primal variables is performed to obtain an accelerated convergence rate. We also develop a mini-batch version of the SPDC method, which facilitates parallel computing, and an extension with weighted sampling probabilities on the dual variables, which yields better complexity than uniform sampling on unnormalized data. Both theoretically and empirically, we show that the SPDC method has comparable or better performance than several state-of-the-art optimization methods.
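
To make the alternating structure concrete, below is a minimal Python/NumPy sketch of an SPDC-style loop for ridge-regularized least squares (squared loss plus an L2 regularizer), where both the dual and primal proximal steps have closed forms. The step sizes tau and sigma, the extrapolation parameter theta, and the function name spdc_ridge are illustrative choices consistent with the description above, not the exact constants or notation of the paper.

    # Sketch of an SPDC-style update loop for ridge-regularized least squares:
    #   min_x (1/n) * sum_i 0.5*(a_i^T x - b_i)^2 + (lam/2)*||x||^2
    # reformulated as a convex-concave saddle point problem over (x, y).
    import numpy as np

    def spdc_ridge(A, b, lam, num_epochs=50, seed=0):
        """A: (n, d) data matrix, b: (n,) targets, lam: ridge parameter."""
        rng = np.random.default_rng(seed)
        n, d = A.shape
        R = np.max(np.linalg.norm(A, axis=1))       # largest row norm
        gamma = 1.0                                 # squared loss is 1-smooth
        tau = np.sqrt(gamma / (n * lam)) / (2 * R)  # primal step size (illustrative)
        sigma = np.sqrt(n * lam / gamma) / (2 * R)  # dual step size (illustrative)
        theta = 1.0 - 1.0 / (n + 2 * R * np.sqrt(n / (lam * gamma)))  # extrapolation

        x = np.zeros(d)        # primal variables
        x_bar = x.copy()       # extrapolated primal point
        y = np.zeros(n)        # dual variables, one per example
        u = A.T @ y / n        # running average (1/n) * sum_i y_i * a_i

        for _ in range(num_epochs * n):
            i = rng.integers(n)
            a_i = A[i]

            # Dual step: maximize over the randomly chosen dual coordinate y_i,
            # i.e. the prox of the conjugate of 0.5*(z - b_i)^2, in closed form.
            y_new = (sigma * (a_i @ x_bar - b_i) + y[i]) / (sigma + 1.0)

            # Primal step: prox of (lam/2)||x||^2 around a gradient-like step
            # built from the running average plus the dual increment.
            v = u + (y_new - y[i]) * a_i
            x_new = (x - tau * v) / (1.0 + lam * tau)

            # Extrapolation step on the primal variables.
            x_bar = x_new + theta * (x_new - x)

            # Maintain the running average of y_i * a_i.
            u = u + (y_new - y[i]) * a_i / n
            y[i] = y_new
            x = x_new

        return x

Each iteration touches a single row of A, so the per-iteration cost is O(d); the mini-batch variant mentioned above would instead draw several indices per iteration and average their dual increments.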



Published In

The Journal of Machine Learning Research, Volume 18, Issue 1
January 2017
8830 pages
ISSN:1532-4435
EISSN:1533-7928

Publisher

JMLR.org

Publication History

Published: 01 January 2017
Published in JMLR Volume 18, Issue 1

Author Tags

  1. computational complexity
  2. convex-concave saddle point problems
  3. empirical risk minimization
  4. primal-dual algorithms
  5. randomized algorithms

Qualifiers

  • Article
