
Identifying the Root Causes of Wait States in Large-Scale Parallel Applications

Published: 20 July 2016

Abstract

Driven by growing application requirements and accelerated by current trends in microprocessor design, the number of processor cores on modern supercomputers is increasing from generation to generation. However, load or communication imbalance prevents many codes from taking advantage of the available parallelism, as delays of single processes may spread wait states across the entire machine. Moreover, when employing complex point-to-point communication patterns, wait states may propagate along far-reaching cause-effect chains that are hard to track manually and that complicate an assessment of the actual costs of an imbalance. Building on earlier work by Meira, Jr., et al., we present a scalable approach that identifies program wait states and attributes their costs in terms of resource waste to their original cause. By replaying event traces in parallel both forward and backward, we can identify the processes and call paths responsible for the most severe imbalances, even for runs with hundreds of thousands of processes.

References

[1]
Accelerated Strategic Computing Initiative. 1995. The ASCI SWEEP3D Benchmark Code. (1995). http://www.ccs3.lanl.gov/pal/software/sweep3d/sweep3d_readme.html.
[2]
Laksono Adhianto, Sinchan Banerjee, Michael W. Fagan, Mark Krentel, Gabriel Marin, John Mellor-Crummey, and Nathan R. Tallent. 2010. HPCToolkit: Tools for performance analysis of optimized parallel programs. Concurrency and Computation: Practice and Experience (April 2010).
[3]
Daniel Becker, Rolf Rabenseifner, Felix Wolf, and John C. Linford. 2009. Scalable timestamp synchronization for event traces of message-passing applications. Parallel Computing 35, 12 (2009), 595--607.
[4]
David Böhme, Bronis R. de Supinski, Markus Geimer, Martin Schulz, and Felix Wolf. 2012. Scalable critical-path based performance analysis. In Proceedings of the 26th IEEE International Parallel & Distributed Processing Symposium (IPDPS). 1330--1340.
[5]
David Böhme, Markus Geimer, Felix Wolf, and Lukas Arnold. 2010. Identifying the root causes of wait states in large-scale parallel applications. In Proceedings of the 39th International Conference on Parallel Processing (ICPP). IEEE Computer Society, 90--100. Best Paper Award.
[6]
Maria Calzarossa, Luisa Massari, and Daniele Tessera. 2004. A methodology towards automatic performance analysis of parallel applications. Parallel Computing 30, 2 (Feb. 2004), 211--223.
[7]
Todd Gamblin, Bronis R. de Supinski, Martin Schulz, Rob Fowler, and Daniel A. Reed. 2008. Scalable load-balance measurement for SPMD codes. In Proceedings of the ACM/IEEE Conference on Supercomputing (SC’08).
[8]
Markus Geimer, Felix Wolf, Brian J. N. Wylie, and Bernd Mohr. 2009. A scalable tool architecture for diagnosing wait states in massively-parallel applications. Parallel Computing 35, 7 (2009), 375--388.
[9]
Michael Geissler, S. Rykovanov, Jörg Schreiber, Jürgen Meyer ter Vehn, and G. D. Tsakiris. 2007. 3D simulations of surface harmonic generation with few-cycle laser pulses. New Journal of Physics 9, 7 (2007), 218.
[10]
Michael Geissler, Jörg Schreiber, and Jürgen Meyer ter Vehn. 2006. Bubble acceleration of electrons with few-cycle laser pulses. New Journal of Physics 8, 9 (2006), 186.
[11]
John C. Hayes, Michael L. Norman, Robert A. Fiedler, James O. Bordner, Pak Shing Li, Stephen E. Clark, Asif Ud-Doula, and Mordecai-Mark MacLow. 2006. Simulating radiating and magnetized flows in multi-dimensions with ZEUS-MP. Astrophysical Journal Supplement 165 (2006), 188--228.
[12]
Marc-André Hermanns, Manfred Miklosch, David Böhme, and Felix Wolf. 2013. Understanding the formation of wait states in applications with one-sided communication. In Proceedings of the 20th European MPI Users’ Group Meeting (EuroMPI’13). ACM, New York, NY, 73--78.
[13]
Adolfy Hoisie, Olaf Lubeck, and Harvey Wasserman. 1999. Performance analysis of wavefront algorithms on very-large scale distributed systems. In Proceedings of the Workshop on Wide Area Networks and High Performance Computing, Lecture Notes in Control and Information Sciences, Vol. 249. Springer Berlin/Heidelberg, 171--187.
[14]
Jeffrey K. Hollingsworth. 1996. An online computation of critical path profiling. In Proceedings of the 1st ACM SIGMETRICS Symposium on Parallel and Distributed Tools. 11--20.
[15]
Hassan M. Jafri. 2007. Measuring causal propagation of overhead of inefficiencies in parallel applications. In Proceedings of the 19th IASTED International Conference on Parallel and Distributed Computing and Systems. Cambridge, MA, 237--243.
[16]
Allen D. Malony, Sameer S. Shende, and Alan Morris. 2005. Phase-based parallel performance profiling. In Proceedings of the Conference on Parallel Computing (ParCo, Malaga, Spain) (NIC Series), Vol. 33. John von Neumann Institute for Computing, 203--210.
[17]
Wagner Meira, Jr., Thomas J. LeBlanc, and Virgílio A. F. Almeida. 1998. Using cause-effect analysis to understand the performance of distributed programs. In Proceedings of the SIGMETRICS Symposium on Parallel and Distributed Tools (SPDT’98). ACM, New York, NY, 101--111.
[18]
Wagner Meira, Jr., Thomas J. LeBlanc, and Alexandros Poulos. 1996. Waiting time analysis and performance visualization in Carnival. In Proceedings of the SIGMETRICS Symposium on Parallel and Distributed Tools (SPDT’96). ACM, New York, NY, 1--10.
[19]
Oleg Morajko, Anna Morajko, Tomas Margalef, and Emilio Luque. 2008. On-line performance modeling for MPI applications. In Proceedings of the 14th Euro-Par Conference, Lecture Notes in Computer Science, Vol. 5168. Springer, 68--77.
[20]
Martin Schulz. 2005. Extracting critical path graphs from MPI applications. In Proceedings of the IEEE Cluster Conference. Boston, MA.
[21]
Martin Schulz, Greg Bronevetsky, and Bronis R. de Supinski. 2008. On the performance of transparent MPI piggyback messages. In Proceedings of the 15th European PVM/MPI Users’ Group Meeting, Lecture Notes in Computer Science, Vol. 5205. Springer, 194--201.
[22]
David Sundaram-Stukel and Mary K. Vernon. 1999. Predictive analysis of a wavefront application using LogGP. In Proceedings of the 7th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Vol. 34. 141--150.
[23]
Zoltán Szebenyi, Felix Wolf, and Brian J. N. Wylie. 2009. Space-efficient time-series call-path profiling of parallel applications. In Proceedings of the ACM/IEEE Conference on Supercomputing (SC’09).
[24]
Zoltán Szebenyi, Brian J. N. Wylie, and Felix Wolf. 2008. Scalasca parallel performance analyses of SPEC MPI2007 applications. In Proceedings of the 1st SPEC International Performance Evaluation Workshop, Lecture Notes in Computer Science, Vol. 5119. Springer, 99--123.
[25]
Nathan R. Tallent, Laksono Adhianto, and John Mellor-Crummey. 2010. Scalable identification of load imbalance in parallel executions using call path profiles. In Supercomputing 2010. New Orleans, LA.
[26]
University Corporation for Atmospheric Research (UCAR). 2012. The Community Earth System Model. (Feb. 2012). http://www.cesm.ucar.edu/.
[27]
Jeffrey Vetter (Ed.). 2007. Report of the Workshop on Software Development Tools for Petascale Computing. (August 2007). U.S. Department of Energy, http://www.csm.ornl.gov/workshops/Petascale07/sdtpc_workshop_report.pdf.
[28]
Brian J. N. Wylie. 2012. Parallel performance measurement and analysis scaling lessons. SC’12 Workshop on Extreme-Scale Performance Tools.
[29]
Brian J. N. Wylie, David Böhme, Bernd Mohr, Zoltán Szebenyi, and Felix Wolf. 2010. Performance analysis of Sweep3D on Blue Gene/P with the Scalasca toolset. In Proceedings of the 24th International Parallel & Distributed Processing Symposium and Workshops (IPDPS). IEEE Computer Society.

Published In

ACM Transactions on Parallel Computing  Volume 3, Issue 2
August 2016
154 pages
ISSN:2329-4949
EISSN:2329-4957
DOI:10.1145/2974644
© 2016 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 July 2016
Accepted: 01 May 2016
Revised: 01 October 2014
Received: 01 July 2013
Published in TOPC Volume 3, Issue 2

Author Tags

  1. MPI
  2. OpenMP
  3. Performance analysis
  4. event tracing
  5. load imbalance
  6. root-cause analysis

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • G8 Research Councils Initiative on Multilateral Research
  • Deutsche Forschungsgemeinschaft (German Research Foundation)
  • U.S. Department of Energy by Lawrence Livermore National Laboratory
  • Interdisciplinary Program on Application Software towards Exascale Computing for Global Scale Issues
  • Helmholtz Association of German Research Centers
