DOI: 10.1145/3624062.3624145

ZeroSum: User Space Monitoring of Resource Utilization and Contention on Heterogeneous HPC Systems

Published: 12 November 2023

Abstract

Heterogeneous High Performance Computing (HPC) systems are highly specialized, complex, powerful, and expensive systems. Efficient utilization of these systems requires monitoring tools to confirm that users have configured their jobs, workflows, and applications correctly to consume the limited allocations they have been awarded. Historically, system monitoring tools have been designed for – and only available to – system administrators and facilities personnel to ensure that the system is healthy, utilized, and operating within acceptable parameters. However, there is a demand for user space monitoring capabilities to address this configuration validation and optimization problem. In this paper, we describe a prototype tool, ZeroSum, designed to provide user space monitoring of application processes, lightweight processes (threads), and hardware resources on heterogeneous, distributed HPC systems. ZeroSum is designed to be used either as a limited-use porting tool or as an always-on monitoring library.
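
The abstract names what ZeroSum observes (processes, lightweight processes, and hardware resources), but this page does not include implementation details. As a rough, hypothetical illustration of what user space monitoring can look like on Linux – not ZeroSum's actual design – the C++17 sketch below runs a background thread that periodically samples per-thread CPU time (utime + stime) from the /proc/self/task/<tid>/stat pseudo-files; the sampling interval and all identifiers are illustrative assumptions.

    // Illustrative sketch only (not ZeroSum's implementation): user space
    // sampling of per-thread CPU time via the Linux /proc pseudo-filesystem.
    #include <atomic>
    #include <chrono>
    #include <filesystem>
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <thread>

    // Read user + system CPU jiffies for one thread from /proc/self/task/<tid>/stat.
    static long read_thread_jiffies(const std::filesystem::path& stat_path) {
        std::ifstream in(stat_path);
        std::string line;
        if (!std::getline(in, line)) return -1;
        // The comm field may contain spaces, so skip past its closing ')'.
        const auto pos = line.rfind(')');
        if (pos == std::string::npos) return -1;
        std::istringstream rest(line.substr(pos + 2));
        std::string field;
        long utime = 0, stime = 0;
        // Fields 14 and 15 of the stat line are utime and stime.
        for (int i = 3; i <= 15 && (rest >> field); ++i) {
            if (i == 14) utime = std::stol(field);
            if (i == 15) stime = std::stol(field);
        }
        return utime + stime;
    }

    int main() {
        std::atomic<bool> done{false};
        // Background sampling thread, in the spirit of an always-on monitor.
        std::thread monitor([&done]() {
            while (!done.load()) {
                for (const auto& task :
                     std::filesystem::directory_iterator("/proc/self/task")) {
                    long jiffies = read_thread_jiffies(task.path() / "stat");
                    std::cout << "tid " << task.path().filename().string()
                              << " cpu jiffies: " << jiffies << "\n";
                }
                std::this_thread::sleep_for(std::chrono::seconds(1));
            }
        });
        // ... application work would run here ...
        std::this_thread::sleep_for(std::chrono::seconds(3));
        done = true;
        monitor.join();
        return 0;
    }

An always-on monitor of the kind the abstract describes would typically run such a sampler at low frequency, ideally pinned to an otherwise idle core, to limit interference with the application being observed.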

Supplemental Material

MP4 File
Recording of "ZeroSum: User Space Monitoring of Resource Utilization and Contention on Heterogeneous HPC Systems" presentation at HUST-23.

Information

Published In

SC-W '23: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis
November 2023
2180 pages
ISBN: 9798400707858
DOI: 10.1145/3624062
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. parallel computing
  2. resource contention
  3. resource utilization
  4. scheduling

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research

Conference

SC-W 2023
