On the Relation Between the Sharpest Directions of DNN Loss and the SGD Step Length

Jastrzębski, Stanisław; Kenton, Zachary; Ballas, Nicolas; Fischer, Asja; Bengio, Yoshua; Storkey, Amos


arXiv:1807.05031v2 (stat)

[Submitted on 13 Jul 2018 (v1), revised 13 Dec 2018 (this version, v2), latest version 23 Dec 2019 (v6)]

Abstract: The training of deep neural networks with Stochastic Gradient Descent (SGD) with a large learning rate or a small batch size typically ends in flat regions of the weight space, as indicated by small eigenvalues of the Hessian of the training loss. This was found to correlate with good final generalization performance. In this paper we extend previous work by investigating the curvature of the loss surface along the whole training trajectory, rather than only at the endpoint. We find that initially SGD visits increasingly sharp regions, reaching a maximum sharpness determined by both the learning rate and the batch size of SGD. At this peak value SGD starts to fail to minimize the loss along the directions in the loss surface corresponding to the largest curvature (the sharpest directions). To further investigate the effect of these dynamics on the training process, we study a variant of SGD that uses a reduced learning rate along the sharpest directions, which we show can improve training speed while finding both a sharper and better generalizing solution, compared to vanilla SGD. Overall, our results show that the SGD dynamics in the subspace of the sharpest directions influence the regions that SGD steers to (where a larger learning rate or a smaller batch size results in wider regions visited), the overall training speed, and the generalization ability of the final model.
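The idea of a reduced learning rate along the sharpest directions can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes a fixed quadratic loss 0.5 * w^T H w with a known Hessian, uses full-batch gradient descent as a stand-in for SGD, and introduces hypothetical names (`sharpest_directions`, `scaled_step`, `run`) and parameters (`gamma` for the learning-rate multiplier in the sharpest subspace, `k` for its dimension). On a quadratic, a step along an eigenvector with curvature λ is stable only if the learning rate is below 2/λ, which gives one simple reading of why the learning rate bounds the sharpness that can be minimized.

```python
import numpy as np

def sharpest_directions(hessian, k):
    """Return the k largest eigenvalues and their eigenvectors (symmetric Hessian)."""
    eigvals, eigvecs = np.linalg.eigh(hessian)  # eigenvalues in ascending order
    return eigvals[-k:], eigvecs[:, -k:]

def scaled_step(w, grad, top_vecs, lr, gamma):
    """One descent step; the learning rate is multiplied by gamma inside span(top_vecs)."""
    g_sharp = top_vecs @ (top_vecs.T @ grad)  # gradient component in the sharpest subspace
    g_rest = grad - g_sharp                   # component in the remaining directions
    return w - lr * (gamma * g_sharp + g_rest)

def run(hessian, w0, lr, gamma, k, steps):
    """Full-batch descent on the toy loss 0.5 * w^T H w; returns the final loss."""
    _, top_vecs = sharpest_directions(hessian, k)
    w = w0.copy()
    for _ in range(steps):
        grad = hessian @ w  # exact gradient of the quadratic loss
        w = scaled_step(w, grad, top_vecs, lr, gamma)
    return 0.5 * w @ hessian @ w

# Toy Hessian (an assumption for illustration): one sharp direction, two flat ones.
H = np.diag([100.0, 1.0, 0.1])
w0 = np.ones(3)
initial_loss = 0.5 * w0 @ H @ w0

lr = 0.021  # slightly above 2 / 100, the stability limit along the sharpest direction
loss_plain = run(H, w0, lr, gamma=1.0, k=1, steps=200)    # same rate everywhere
loss_reduced = run(H, w0, lr, gamma=0.1, k=1, steps=200)  # 10x smaller rate in sharpest direction
```

With `gamma=1.0` the iterate oscillates with growing amplitude along the sharpest eigenvector, so the loss increases; with `gamma=0.1` the same base learning rate becomes stable in that direction and the loss decreases. The toy problem says nothing about generalization or about stochastic gradients, which are the subject of the paper's experiments.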

Comments:	Short version accepted into Modern Trends in Nonconvex Optimization for Machine Learning workshop at International Conference on Machine Learning 2018
Subjects:	Machine Learning (stat.ML); Machine Learning (cs.LG)
Cite as:	arXiv:1807.05031 [stat.ML]
	(or arXiv:1807.05031v2 [stat.ML] for this version)
	https://doi.org/10.48550/arXiv.1807.05031
Journal reference:	International Conference on Learning Representations (ICLR) 2019

Submission history

From: Stanisław Jastrzębski
[v1] Fri, 13 Jul 2018 12:17:41 UTC (2,692 KB)
[v2] Thu, 13 Dec 2018 21:50:08 UTC (3,811 KB)
[v3] Wed, 13 Mar 2019 06:57:09 UTC (3,706 KB)
[v4] Fri, 15 Mar 2019 16:10:19 UTC (3,706 KB)
[v5] Tue, 9 Jul 2019 19:37:11 UTC (3,706 KB)
[v6] Mon, 23 Dec 2019 12:50:22 UTC (3,706 KB)
