DOI: 10.1145/2663165.2663325

Affinity-aware checkpoint restart

Published: 08 December 2014

Abstract

For a number of HPC applications, current checkpointing techniques used to overcome faults result in inferior performance after restart from a checkpoint. This is due to a lack of page and core affinity awareness in the checkpoint/restart (C/R) mechanism: application tasks originally pinned to cores may be restarted on different cores, and on non-uniform memory architectures (NUMA), which are quite common today, memory pages associated with tasks on one NUMA node may be associated with a different NUMA node after restart. This work contributes a novel design technique for C/R mechanisms that preserves task-to-core maps and NUMA-node-specific page affinities across restarts. Experimental results with BLCR, a C/R mechanism, enhanced with affinity awareness demonstrate significant performance benefits of 37%-73% for the NAS Parallel Benchmark codes and 6-12% for NAMD with negligible overheads, whereas restarts without affinity awareness take up to nearly four times longer to execute on 16 cores.
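As an illustration of the task-to-core part of this idea only, the following is a minimal sketch that records a task's CPU affinity mask next to a checkpoint and re-applies it on restart. It is not BLCR's implementation or the paper's design: the JSON sidecar file and the helper names `save_affinity`, `restore_affinity`, and `roundtrip_demo` are assumptions made for this example. It relies on the Linux-only `os.sched_getaffinity` and `os.sched_setaffinity` calls and does not cover NUMA page placement.

```python
import json
import os
import tempfile


def save_affinity(path, pid=0):
    """Write the CPU ids the task may run on to a sidecar file (hypothetical format)."""
    cpus = sorted(os.sched_getaffinity(pid))
    with open(path, "w") as f:
        json.dump({"cpus": cpus}, f)
    return cpus


def restore_affinity(path, pid=0):
    """Re-apply the CPU mask recorded at checkpoint time."""
    with open(path) as f:
        cpus = json.load(f)["cpus"]
    os.sched_setaffinity(pid, cpus)
    return cpus


def roundtrip_demo():
    """Pin to one core, 'checkpoint', lose the pinning, then 'restart'."""
    original = os.sched_getaffinity(0)
    pinned = {min(original)}
    path = os.path.join(tempfile.mkdtemp(), "affinity.json")
    try:
        os.sched_setaffinity(0, pinned)    # task pinned before the checkpoint
        save_affinity(path)                # recorded alongside the checkpoint
        os.sched_setaffinity(0, original)  # affinity-unaware restart loses the pinning
        restore_affinity(path)             # affinity-aware restart re-pins the task
        return os.sched_getaffinity(0) == pinned
    finally:
        os.sched_setaffinity(0, original)  # leave the process as we found it


if __name__ == "__main__":
    print("affinity preserved across restart:", roundtrip_demo())
```

Preserving page affinity would additionally require recording which NUMA node each page resided on and placing pages accordingly at restart, which this sketch does not attempt.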

Published In

Middleware '14: Proceedings of the 15th International Middleware Conference
December 2014
334 pages
ISBN:9781450327855
DOI:10.1145/2663165
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

  • Orange
  • Conseil Régional d'Aquitaine
  • LaBRI
  • Raytheon BBN Technologies
  • ACM: Association for Computing Machinery
  • Red Hat JBoss Middleware
  • Bordeaux: City of Bordeaux
  • USENIX Assoc
  • GDR ASR: GDR Architecture, Systèmes et Réseaux
  • IBM
  • HP
  • IFIP

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. NUMA
  2. checkpoint and restart
  3. efficiency
  4. fault tolerance
  5. multi-core
  6. system software

Qualifiers

  • Research-article

Acceptance Rates

Middleware '14 Paper Acceptance Rate 27 of 144 submissions, 19%;
Overall Acceptance Rate 203 of 948 submissions, 21%
