DOI: 10.5555/3018823.3018829
research-article

A scalable observation system for introspection and in situ analytics

Published: 13 November 2016

Abstract

SOS is a new model for the online in situ characterization and analysis of complex high-performance computing applications. SOS employs a data framework with distributed information management and structured query and access capabilities. The primary design objectives of SOS are flexibility, scalability, and programmability. SOS provides a complete framework that can be configured with and used directly by an application, allowing for a detailed workflow analysis of scientific applications. This paper describes the model of SOS and the experiments used to validate and explore the performance characteristics of its implementation in SOSflow. Experimental results demonstrate that SOS is capable of observation, introspection, feedback and control of complex high-performance applications, and that it has desirable scaling properties.

Published In

ESPT '16: Proceedings of the 5th Workshop on Extreme-Scale Programming Tools
November 2016
54 pages
ISBN:9781509039180

Publisher

IEEE Press

Author Tags

  1. exascale
  2. hpc
  3. in situ
  4. introspection
  5. monalytics
  6. monitoring
  7. scientific workflow
  8. sos
  9. sosflow

Qualifiers

  • Research-article

Conference

SC16

Acceptance Rates

Overall Acceptance Rate 2 of 4 submissions, 50%
