
Parallelizing stochastic gradient descent for least squares regression: mini-batching, averaging, and model misspecification

Published: 01 January 2017

Abstract

This work characterizes the benefits of averaging techniques widely used in conjunction with stochastic gradient descent (SGD). In particular, it presents a sharp analysis of: (1) mini-batching, which averages many stochastic gradient samples both to reduce the variance of the stochastic gradient estimate and to parallelize SGD, and (2) tail-averaging, which averages the final few iterates of SGD to decrease the variance of SGD's final iterate. This work presents sharp finite-sample generalization error bounds for these schemes for the stochastic approximation problem of least squares regression.
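As an illustration of these two averaging schemes, here is a minimal Python sketch under simplifying assumptions; it is not the paper's exact procedure, and the function name, step size, and tail fraction are illustrative choices.

import numpy as np

def minibatch_tail_averaged_sgd(X, y, step_size, batch_size, tail_fraction=0.5):
    """Sketch: mini-batch SGD for least squares, returning a tail-averaged iterate."""
    n, d = X.shape
    theta = np.zeros(d)
    iterates = []
    # One pass over the data in mini-batches of size batch_size.
    for start in range(0, n - batch_size + 1, batch_size):
        Xb = X[start:start + batch_size]
        yb = y[start:start + batch_size]
        # Mini-batching: average the per-sample gradients of the squared loss.
        grad = Xb.T @ (Xb @ theta - yb) / batch_size
        theta = theta - step_size * grad
        iterates.append(theta)
    # Tail-averaging: average only the last fraction of the iterates,
    # which suppresses the variance of the final iterate.
    tail_start = int((1.0 - tail_fraction) * len(iterates))
    return np.mean(iterates[tail_start:], axis=0)

For example, theta_hat = minibatch_tail_averaged_sgd(X, y, step_size=0.01, batch_size=16) returns the tail-averaged iterate after a single pass over (X, y).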
Furthermore, this work establishes the precise, problem-dependent extent to which mini-batching can be used to yield provable near-linear parallelization speedups over SGD with batch size one. This characterization is used to understand the relationship between the learning rate and the batch size when considering the excess risk of the final iterate of an SGD procedure. Next, the mini-batching characterization is used to provide a highly parallelizable SGD method that achieves the minimax risk with nearly the same number of serial updates as batch gradient descent, improving significantly over existing SGD-style methods. Following this, a non-asymptotic excess risk bound is provided for model averaging, a communication-efficient parallelization scheme.
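A minimal sketch of the model-averaging (parameter-mixing) scheme mentioned above, written serially for clarity; the names are illustrative, and it assumes each worker runs single-sample SGD over its own data shard before a single final averaging step.

import numpy as np

def model_averaged_sgd(X, y, step_size, num_workers):
    """Sketch: each worker runs SGD on its own shard; final parameters are averaged once."""
    shards_X = np.array_split(X, num_workers)
    shards_y = np.array_split(y, num_workers)
    models = []
    for Xs, ys in zip(shards_X, shards_y):
        theta = np.zeros(X.shape[1])
        for xi, yi in zip(Xs, ys):
            # Single-sample SGD step for least squares on the local shard.
            theta -= step_size * (xi @ theta - yi) * xi
        models.append(theta)
    # One round of communication: average the workers' final parameters.
    return np.mean(models, axis=0)

The single averaging step at the end is what makes the scheme communication-efficient relative to mini-batch SGD, which synchronizes after every update.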
Finally, this work sheds light on fundamental differences in SGD's behavior when dealing with mis-specified models in the non-realizable least squares problem. It shows that the maximal stepsizes ensuring minimax risk in the mis-specified case must depend on the noise properties.
The analysis tools used in this paper generalize the operator view of averaged SGD (Défossez and Bach, 2015) and develop a novel analysis for bounding these operators to characterize the generalization error. These techniques are of broader interest in analyzing computational aspects of stochastic approximation.
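As a rough illustration of that operator view (the notation below is generic and not necessarily the paper's), the SGD error recursion for least squares with stepsize \gamma can be written as

\theta_{t+1} - \theta^{*} = \bigl(I - \gamma\, x_t x_t^{\top}\bigr)\bigl(\theta_t - \theta^{*}\bigr) + \gamma\, \epsilon_t x_t, \qquad \epsilon_t = y_t - \langle x_t, \theta^{*} \rangle,

so that generalization error bounds for the (tail-)averaged iterates follow from bounding products and averages of the random linear operators I - \gamma\, x_t x_t^{\top}.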

References

[1]
Alekh Agarwal, Peter L. Bartlett, Pradeep Ravikumar, and Martin J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 2012.
[2]
Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. CoRR, abs/1603.05953, 2016.
[3]
Dan Anbar. On Optimal Estimation Methods Using Stochastic Approximation Procedures. University of California, 1971.
[4]
Francis Bach and Eric Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Neural Information Processing Systems (NIPS) 24, 2011.
[5]
Francis R. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Journal of Machine Learning Research (JMLR), volume 15, 2014.
[6]
Francis R. Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Neural Information Processing Systems (NIPS) 26, 2013.
[7]
Rajendra Bhatia. Positive Definite Matrices. Princeton Series in Applied Mathematics. Princeton University Press, 2007.
[8]
Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Neural Information Processing Systems (NIPS) 20, 2007.
[9]
Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.
[10]
Joseph K. Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin. Parallel coordinate descent for L1-regularized loss minimization. In International Conference on Machine Learning (ICML), 2011.
[11]
Augustin Louis Cauchy. Méthode générale pour la résolution des systèmes d'équations simultanées. C. R. Acad. Sci. Paris, 1847.
[12]
Andrew Cotter, Ohad Shamir, Nati Srebro, and Karthik Sridharan. Better mini-batch algorithms via accelerated gradient methods. In Neural Information Processing Systems (NIPS) 24, 2011.
[13]
Aaron Defazio. A simple practical accelerated method for finite sums. In Neural Information Processing Systems (NIPS) 29, 2016.
[14]
Aaron Defazio, Francis R. Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Neural Information Processing Systems (NIPS) 27, 2014.
[15]
Alexandre Défossez and Francis R. Bach. Averaged least-mean-squares: Bias-variance trade-offs and optimal sampling distributions. In Artificial Intelligence and Statistics (AISTATS), 2015.
[16]
Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research (JMLR), volume 13, 2012.
[17]
Aymeric Dieuleveut and Francis Bach. Non-parametric stochastic approximation with large step sizes. The Annals of Statistics, 2015.
[18]
John C. Duchi, Sorathan Chaturapruek, and Christopher Ré. Asynchronous stochastic convex optimization. CoRR, abs/1508.00882, 2015.
[19]
Vaclav Fabian. Asymptotically efficient stochastic approximation; the RM case. Annals of Statistics, 1(3), 1973.
[20]
Roy Frostig, Rong Ge, Sham Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In International Conference on Machine Learning (ICML), 2015a.
[21]
Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. Competing with the empirical risk minimizer in a single pass. In Conference on Learning Theory (COLT), 2015b.
[22]
Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[23]
Prateek Jain, Chi Jin, Sham M. Kakade, Praneeth Netrapalli, and Aaron Sidford. Streaming PCA: Matching matrix Bernstein and near-optimal finite sample guarantees for Oja's algorithm. In Conference on Learning Theory (COLT), 2016a.
[24]
Prateek Jain, Sham M Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Parallelizing stochastic approximation through mini-batching and tail-averaging. arXiv preprint arXiv:1610.03774, 2016b.
[25]
Prateek Jain, Sham M Kakade, Rahul Kidambi, Praneeth Netrapalli, Venkata Krishna Pillutla, and Aaron Sidford. A Markov chain theory approach to characterizing the minimax optimality of stochastic gradient descent (for least squares). arXiv preprint arXiv:1710.09430, 2017a.
[26]
Prateek Jain, Sham M Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Accelerating stochastic gradient descent. arXiv preprint arXiv:1704.08227, 2017b.
[27]
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Neural Information Processing Systems (NIPS) 26, 2013.
[28]
Harold J. Kushner and Dean S. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, 1978.
[29]
Harold J. Kushner and G. Yin. Asymptotic properties of distributed and communicating stochastic approximation algorithms. SIAM Journal on Control and Optimization, 25(5):1266-1290, 1987.
[30]
Harold J. Kushner and G. Yin. Stochastic approximation and recursive algorithms and applications. Springer-Verlag, 2003.
[31]
Erich L. Lehmann and George Casella. Theory of Point Estimation. Springer Texts in Statistics. Springer, 1998.
[32]
Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J. Smola. Efficient mini-batch training for stochastic optimization. In Knowledge Discovery and Data Mining (KDD), 2014.
[33]
Hongzhou Lin, Julien Mairal, and Zaïd Harchaoui. A universal catalyst for first-order optimization. In Neural Information Processing Systems (NIPS), 2015.
[34]
Gideon Mann, Ryan T. McDonald, Mehryar Mohri, Nathan Silberman, and Dan Walker. Efficient large-scale distributed training of conditional maximum entropy models. In Neural Information Processing Systems (NIPS) 22, 2009.
[35]
Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182, 2017.
[36]
Deanna Needell, Nathan Srebro, and Rachel Ward. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Mathematical Programming, volume 155, 2016.
[37]
Arkadi S. Nemirovsky and David B. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley, 1983.
[38]
Yurii E. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN SSSR, 269, 1983.
[39]
Feng Niu, Benjamin Recht, Christopher Ré, and Stephen J. Wright. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In Neural Information Processing Systems (NIPS) 24, 2011.
[40]
Boris T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4, 1964.
[41]
Boris T. Polyak and Anatoli B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J Control Optim, volume 30, 1992.
[42]
Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathematical Statistics, volume 22, 1951.
[43]
Jonathan Rosenblatt and Boaz Nadler. On the optimality of averaging in distributed statistical learning. CoRR, abs/1407.2724, 2014.
[44]
Nicolas Le Roux, Mark Schmidt, and Francis R. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Neural Information Processing Systems (NIPS) 25, 2012.
[45]
David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical Report, ORIE, Cornell University, 1988.
[46]
Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. CoRR, abs/1209.1873, 2012.
[47]
Shai Shalev-Shwartz and Tong Zhang. Accelerated mini-batch stochastic dual coordinate ascent. In Neural Information Processing Systems (NIPS) 26, 2013a.
[48]
Shai Shalev-Shwartz and Tong Zhang. Accelerated mini-batch stochastic dual coordinate ascent. In Neural Information Processing Systems (NIPS) 26, 2013b.
[49]
Samuel L Smith, Pieter-Jan Kindermans, and Quoc V Le. Don't decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.
[50]
Martin Takác, Avleen Singh Bijral, Peter Richtárik, and Nati Srebro. Mini-batch primal and dual methods for SVMs. In International Conference on Machine Learning (ICML), volume 28, 2013.
[51]
Martin Takác, Peter Richtárik, and Nati Srebro. Distributed mini-batch SDCA. CoRR, abs/1507.08322, 2015.
[52]
Aad W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 2000.
[53]
Yuchen Zhang and Lin Xiao. DiSCO: Distributed optimization for self-concordant empirical loss. In International Conference on Machine Learning (ICML), 2015.
[54]
Yuchen Zhang, John C. Duchi, and Martin Wainwright. Divide and conquer ridge regression: A distributed algorithm with minimax optimal rates. Journal of Machine Learning Research (JMLR), volume 16, 2015.
[55]
Martin A. Zinkevich, Alex Smola, Markus Weimer, and Lihong Li. Parallelized stochastic gradient descent. In Neural Information Processing Systems (NIPS) 24, 2011.


Information

Published In

The Journal of Machine Learning Research  Volume 18, Issue 1
January 2017
8830 pages
ISSN:1532-4435
EISSN:1533-7928

Publisher

JMLR.org

Publication History

Revised: 01 March 2018
Published: 01 January 2017
Published in JMLR Volume 18, Issue 1

Author Tags

  1. agnostic learning
  2. batchsize doubling
  3. heteroscedastic noise
  4. iterate averaging
  5. least squares regression
  6. mini batch SGD
  7. mis-specified models
  8. model averaging
  9. parallelization
  10. parameter mixing
  11. stochastic approximation
  12. stochastic gradient descent
  13. suffix averaging

Qualifiers

  • Article


Cited By

  • (2023) First order methods with Markovian noise. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 44820-44835. DOI: 10.5555/3666122.3668063. Online publication date: 10-Dec-2023.
  • (2023) Finite-sample analysis of learning high-dimensional single ReLU neuron. Proceedings of the 40th International Conference on Machine Learning, pp. 37919-37951. DOI: 10.5555/3618408.3619987. Online publication date: 23-Jul-2023.
  • (2022) The power and limitation of pretraining-finetuning for linear regression under covariate shift. Proceedings of the 36th International Conference on Neural Information Processing Systems, pp. 33041-33053. DOI: 10.5555/3600270.3602664. Online publication date: 28-Nov-2022.
  • (2022) Risk bounds of multi-pass SGD for least squares in the interpolation regime. Proceedings of the 36th International Conference on Neural Information Processing Systems, pp. 12909-12920. DOI: 10.5555/3600270.3601208. Online publication date: 28-Nov-2022.
  • (2021) Streaming linear system identification with reverse experience replay. Proceedings of the 35th International Conference on Neural Information Processing Systems, pp. 30140-30152. DOI: 10.5555/3540261.3542568. Online publication date: 6-Dec-2021.
  • (2021) The benefits of implicit regularization from SGD in least squares problems. Proceedings of the 35th International Conference on Neural Information Processing Systems, pp. 5456-5468. DOI: 10.5555/3540261.3540678. Online publication date: 6-Dec-2021.
  • (2020) Is local SGD better than minibatch SGD? Proceedings of the 37th International Conference on Machine Learning, pp. 10334-10343. DOI: 10.5555/3524938.3525895. Online publication date: 13-Jul-2020.
  • (2020) Communication-efficient distributed stochastic AUC maximization with deep neural networks. Proceedings of the 37th International Conference on Machine Learning, pp. 3864-3874. DOI: 10.5555/3524938.3525300. Online publication date: 13-Jul-2020.
  • (2020) The implicit regularization of stochastic gradient flow for least squares. Proceedings of the 37th International Conference on Machine Learning, pp. 233-244. DOI: 10.5555/3524938.3524961. Online publication date: 13-Jul-2020.
  • (2020) Robust, accurate stochastic optimization for variational inference. Proceedings of the 34th International Conference on Neural Information Processing Systems, pp. 10961-10973. DOI: 10.5555/3495724.3496644. Online publication date: 6-Dec-2020.
