
Best practices and lessons from deploying and operating a sustained-petascale system: the Blue Waters experience

Published: 11 November 2018

Abstract

Building and operating versatile extreme-scale computing systems that work productively for a range of frontier research domains presents many challenges and opportunities. The solutions created, experiences acquired, and lessons learned, while rarely published, could drive the development of new methods and practices and raise the bar for all organizations supporting research, scholarship, and education. This paper describes the methods and procedures developed for deploying, supporting, and continuously improving the Blue Waters system and its services over the last five years. As the project behind the first US sustained-petascale computing platform available to the open-science community, Blue Waters pioneered a variety of unique practices, which we share here so they can be adopted and further improved by the community. We present our support and service methodologies, along with the leadership practices employed to ensure that the system stays highly efficient and productive. We also provide return-on-investment summaries related to deploying and operating the system.

Information

    Published In

    SC '18: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis
    November 2018
    932 pages

    In-Cooperation

    • IEEE CS

    Publisher

    IEEE Press

    Author Tags

    1. HPC center
    2. best practices
    3. system management

    Qualifiers

    • Research-article

    Conference

    SC18

    Acceptance Rates

    Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%
