Research Article
DOI: 10.1145/2491661.2481432

Evaluating the feasibility of using memory content similarity to improve system resilience

Published: 10 June 2013

Abstract

Building the next generation of extreme-scale distributed systems will require overcoming several challenges related to system resilience. As the number of processors in these systems grows, the failure rate increases proportionally. One of the most common sources of failure in large-scale systems is memory errors. In this paper, we propose a novel run-time system for transparently exploiting memory content similarity to improve system resilience by reducing the rate at which memory errors lead to node failure. We evaluate the feasibility of this approach by examining memory snapshots collected from eight HPC applications. Based on the characteristics of the similarity that we uncover in these applications, we conclude that our proposed approach shows promise for addressing system resilience in large-scale systems.
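The core analysis the abstract describes can be illustrated with a minimal sketch (not the authors' actual tooling): scan a raw memory snapshot for byte-identical pages by hashing each page, the same basic technique kernel same-page merging (KSM) uses to find merge candidates. The page size and snapshot layout here are illustrative assumptions.

```python
import hashlib
from collections import defaultdict

PAGE_SIZE = 4096  # assume 4 KiB pages

def find_identical_pages(snapshot: bytes):
    """Group page indices in a raw memory snapshot by content hash.

    Pages whose contents are byte-identical land in the same bucket;
    any bucket with more than one member is a candidate for recovering
    a corrupted page from one of its duplicates.
    """
    buckets = defaultdict(list)
    for offset in range(0, len(snapshot) - PAGE_SIZE + 1, PAGE_SIZE):
        digest = hashlib.sha1(snapshot[offset:offset + PAGE_SIZE]).digest()
        buckets[digest].append(offset // PAGE_SIZE)
    # Keep only buckets containing duplicate pages.
    return {h: idxs for h, idxs in buckets.items() if len(idxs) > 1}

# Toy snapshot: three pages, the first and third identical (all zeros).
snap = bytes(PAGE_SIZE) + bytes([1]) * PAGE_SIZE + bytes(PAGE_SIZE)
dups = find_identical_pages(snap)
assert list(dups.values()) == [[0, 2]]
```

A run-time exploiting this similarity would maintain such groupings online (rather than over offline snapshots) so that when an uncorrectable error strikes one page, its contents can be restored from a duplicate instead of failing the node.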



Published In

ROSS '13: Proceedings of the 3rd International Workshop on Runtime and Operating Systems for Supercomputers
June 2013, 75 pages
ISBN: 9781450321464
DOI: 10.1145/2491661

Publisher

Association for Computing Machinery, New York, NY, United States


Qualifiers

• Research article

Conference

ICS'13

Acceptance Rates

ROSS '13 paper acceptance rate: 9 of 18 submissions, 50%
Overall acceptance rate: 58 of 169 submissions, 34%
