DOI: 10.1145/2503210.2503230
research-article

Enabling comprehensive data-driven system management for large computational facilities

Published: 17 November 2013

Abstract

This paper presents a tool chain, based on the open-source tool TACC_Stats, for systematic and comprehensive job-level measurement of resource use on large cluster computers. It also describes the tool chain's incorporation into XDMoD, a reporting and analytics framework for resource management that aims to meet the information needs of users, application developers, systems administrators, systems management, and funding managers. Accounting, scheduler, and event logs are integrated with system performance data from TACC_Stats, which periodically records resource use, including many hardware counters, for each job running on each node. System-level metrics are obtained by aggregating the node-level (job-level) data. Analysis of these data yields many types of standard and custom reports, as well as a limited predictive capability that has not previously been available for open-source, Linux-based software systems. The paper presents case studies of information that can be applied to effective resource management. We believe this to be the first fully comprehensive system for supporting the information needs of all stakeholders in HPC systems based on open-source software.
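
The aggregation step mentioned in the abstract, in which periodic per-node, per-job samples are rolled up into job-level and then system-level metrics, can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Sample` record, its fields, and both function names are hypothetical and are not taken from TACC_Stats or XDMoD, whose actual data formats and interfaces are not described on this page.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One periodic reading for one job on one node (hypothetical schema)."""
    job_id: str
    node: str
    timestamp: int          # seconds since the epoch
    cpu_user_frac: float    # fraction of CPU time spent in user mode, 0..1
    mem_used_gb: float      # memory in use on the node at sample time


def aggregate_by_job(samples):
    """Roll node-level samples up to one summary dictionary per job."""
    grouped = defaultdict(list)
    for s in samples:
        grouped[s.job_id].append(s)

    summary = {}
    for job_id, rows in grouped.items():
        summary[job_id] = {
            # Number of distinct nodes the job was observed on.
            "nodes": len({r.node for r in rows}),
            # Mean over every sample from every node of the job.
            "mean_cpu_user_frac": sum(r.cpu_user_frac for r in rows) / len(rows),
            # Largest per-node memory reading seen for the job.
            "peak_mem_used_gb": max(r.mem_used_gb for r in rows),
        }
    return summary


def aggregate_system(job_summary):
    """Combine job summaries into a node-weighted system-level view."""
    busy_nodes = sum(j["nodes"] for j in job_summary.values())
    if busy_nodes == 0:
        return {"busy_nodes": 0, "mean_cpu_user_frac": 0.0}
    weighted = sum(j["mean_cpu_user_frac"] * j["nodes"] for j in job_summary.values())
    return {"busy_nodes": busy_nodes, "mean_cpu_user_frac": weighted / busy_nodes}
```

Weighting each job by its node count is one possible design choice; it keeps a single-node job from counting as much as a job spanning hundreds of nodes. A production system would also need to handle nodes shared between jobs and samples that straddle job boundaries, which this sketch ignores.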




Published In

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2013
1123 pages
ISBN: 9781450323789
DOI: 10.1145/2503210
General Chair: William Gropp
Program Chair: Satoshi Matsuoka
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Research-article

Conference

SC13

Acceptance Rates

SC '13 Paper Acceptance Rate 91 of 449 submissions, 20%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
