research-article

Affinity-aware checkpoint restart

Authors:

Eric Roman

Middleware '14: Proceedings of the 15th International Middleware Conference

Pages 121 - 132

https://doi.org/10.1145/2663165.2663325

Published: 08 December 2014

Abstract

Current checkpointing techniques used to tolerate faults in HPC applications can leave a number of applications running slower after a restart from a checkpoint. The cause is that the checkpoint/restart (C/R) mechanism is unaware of page and core affinity: application tasks originally pinned to cores may be restarted on different cores, and on non-uniform memory access (NUMA) architectures, which are common today, memory pages associated with tasks on one NUMA node may be associated with a different NUMA node after restart. This work contributes a novel design technique for C/R mechanisms that preserves task-to-core maps and NUMA-node-specific page affinities across restarts. Experimental results with BLCR, a C/R mechanism, enhanced with affinity awareness demonstrate significant performance benefits of 37%-73% for the NAS Parallel Benchmark codes and 6%-12% for NAMD, with negligible overheads. Without affinity-aware restarts, execution times on 16 cores are up to nearly four times longer.
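The task-to-core part of the idea can be illustrated with a minimal user-space sketch. It is not BLCR's implementation, which the abstract does not detail. It assumes a Linux host, where Python's standard library exposes `os.sched_getaffinity` and `os.sched_setaffinity`. The helper names `capture_affinity`, `plan_restore`, and `restore_affinity`, the JSON record format, and the fallback policy are hypothetical choices made for illustration. NUMA page placement is not covered here.

```python
import json
import os


def capture_affinity(pid: int = 0) -> str:
    """Serialize the cores the task may currently run on (checkpoint time)."""
    cpus = sorted(os.sched_getaffinity(pid))  # Linux-only stdlib call
    return json.dumps({"cpus": cpus})


def plan_restore(saved: str, present: set) -> set:
    """Choose the cores to pin to at restart.

    Prefer the saved cores that exist on the restart host. If none of
    them exist, fall back to every present core instead of failing
    (an assumed policy, not one taken from the paper).
    """
    wanted = set(json.loads(saved)["cpus"])
    usable = wanted & set(present)
    return usable if usable else set(present)


def restore_affinity(saved: str, pid: int = 0) -> set:
    """Re-pin the task from the saved record and return the resulting mask."""
    present = set(range(os.cpu_count() or 1))
    target = plan_restore(saved, present)
    # May raise OSError if a cpuset or container forbids all of these cores.
    os.sched_setaffinity(pid, target)
    return os.sched_getaffinity(pid)
```

A checkpoint wrapper would store the string returned by `capture_affinity()` next to the checkpoint image and pass it to `restore_affinity()` as the first step after restart. Keeping `plan_restore` free of system calls makes the pinning policy easy to check on its own.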

Cited By

S. Shohdy, A. Vishnu, and G. Agrawal (2016). Fault Tolerant Frequent Pattern Mining. 2016 IEEE 23rd International Conference on High Performance Computing (HiPC), pages 12-21. Online publication date: Dec-2016.
https://doi.org/10.1109/HiPC.2016.012
D. Vogt, A. Miraglia, G. Portokalidis, H. Bos, A. Tanenbaum, C. Giuffrida, R. Lea, S. Gopalakrishnan, E. Tilevich, A. Murphy, and M. Blackstock (2015). Speculative Memory Checkpointing. Proceedings of the 16th Annual Middleware Conference, pages 197-209. Online publication date: 24-Nov-2015.
https://dl.acm.org/doi/10.1145/2814576.2814802

Index Terms

Affinity-aware checkpoint restart
1. Software and its engineering
  1. Software organization and properties
    1. Extra-functional properties
      1. Software fault tolerance

Published In

Middleware '14: Proceedings of the 15th International Middleware Conference

December 2014

334 pages

ISBN:9781450327855

DOI:10.1145/2663165

General Chair:
Laurent Réveillère
LaBRI, University of Bordeaux, France
,
Program Chairs:
Lucy Cherkasova
HP Labs, USA
,
François Taïani
Université de Rennes 1 / IRISA, France

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Orange
Conseil Régional d'Aquitaine
LaBRI: LaBRI
Raytheon BBN Technologies: Raytheon BBN Technologies
ACM: Association for Computing Machinery
Red Hat JBoss Middleware: Red Hat JBoss Middleware
Bordeaux: City of Bordeaux
USENIX Assoc: USENIX Assoc
GDR ASR: GDR Architecture, Systèmes et Réseaux
IBM: IBM
HP: HP
IFIP

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

Middleware '14

Middleware '14: 15th International Middleware Conference

December 8 - 12, 2014

Bordeaux, France

Acceptance Rates

Middleware '14 Paper Acceptance Rate 27 of 144 submissions, 19%;

Overall Acceptance Rate 203 of 948 submissions, 21%
