DOI: 10.1145/3267809.3275463

Fast Distributed Deep Learning via Worker-adaptive Batch Sizing

Published: 11 October 2018

Abstract

In heterogeneous or shared clusters, distributed learning is slowed down by straggling workers. In this work, we propose LB-BSP, a new synchronization scheme that eliminates stragglers by adapting each worker's training load (batch size) to its processing capability. For training in shared production clusters, a prerequisite for setting the workers' batch sizes is knowing their processing speeds before each iteration starts. To this end, we adopt NARX, an extended recurrent neural network that accounts for both historical speeds and their driving factors, such as CPU and memory usage, when making predictions.
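To make worker-adaptive batch sizing concrete, the sketch below illustrates the proportional-split idea, assuming each worker's per-sample processing speed for the next iteration has already been predicted; the function name and rounding rule are illustrative and not taken from the paper.

```python
# Hypothetical sketch (not the authors' implementation): split a fixed global
# batch across workers in proportion to their predicted speeds, so every
# worker is expected to reach the synchronization barrier at about the same time.

def assign_batch_sizes(predicted_speeds, global_batch_size):
    """predicted_speeds: predicted samples/sec for each worker's next iteration."""
    total_speed = sum(predicted_speeds)
    sizes = [int(global_batch_size * s / total_speed) for s in predicted_speeds]
    # Hand the rounding remainder to the fastest workers, one sample each.
    remainder = global_batch_size - sum(sizes)
    fastest_first = sorted(range(len(sizes)),
                           key=lambda i: predicted_speeds[i], reverse=True)
    for i in fastest_first[:remainder]:
        sizes[i] += 1
    return sizes

# Example: two full-speed workers and one straggler running at half speed.
print(assign_batch_sizes([100.0, 100.0, 50.0], global_batch_size=256))
# [103, 102, 51] -- each worker's expected iteration time is roughly equal.
```

In this sketch the global batch size is held constant, so the straggler simply processes fewer samples per iteration while faster workers pick up the difference.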



Information

Published In

SoCC '18: Proceedings of the ACM Symposium on Cloud Computing
October 2018
546 pages
ISBN: 9781450360111
DOI: 10.1145/3267809
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Distributed deep learning
  2. batch size
  3. load balancing

Qualifiers

  • Poster
  • Research
  • Refereed limited

Conference

SoCC '18
Sponsor:
SoCC '18: ACM Symposium on Cloud Computing
October 11 - 13, 2018
Carlsbad, CA, USA

Acceptance Rates

Overall Acceptance Rate 169 of 722 submissions, 23%


