research-article

Algorithm-based fault tolerance for dense matrix factorizations

Authors:

Aurelien Bouteiller,

George Bosilca,

Thomas Herault,

Jack Dongarra

ACM SIGPLAN Notices, Volume 47, Issue 8

Pages 225 - 234

https://doi.org/10.1145/2370036.2145845

Published: 25 February 2012

Abstract

Dense matrix factorizations, such as LU, Cholesky and QR, are widely used in scientific applications that require solving systems of linear equations, eigenvalue problems and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). This paper proposes a new hybrid approach, based on Algorithm-Based Fault Tolerance (ABFT), to help matrix factorization algorithms survive fail-stop failures. We consider extreme conditions, such as the absence of any reliable component and the possibility of losing both data and checksum in a single failure. We present a generic solution for protecting the right factor, where the updates are applied, of all the above-mentioned factorizations. For the left factor, where the panel has been applied, we propose a scalable checkpointing algorithm. This algorithm features a high degree of checkpointing parallelism and cooperatively utilizes the checksum storage left over from the right factor protection. The fault-tolerant algorithms derived from this hybrid solution are applicable to a wide range of dense matrix factorizations, with minor modifications. Theoretical analysis shows that the fault tolerance overhead decreases sharply as the number of computing units and the problem size scale up. Experimental results for LU and QR factorization on the Kraken (Cray XT5) supercomputer validate the theoretical evaluation and confirm negligible overhead, both with and without errors.
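As a rough illustration of the checksum idea that ABFT rests on, the following Python sketch appends a row-sum checksum column to a small matrix, applies one elimination step to the augmented matrix, and then rebuilds a lost column from the checksum. It is a minimal sketch under simplifying assumptions, not the algorithm proposed in the paper: it uses a single checksum column on one process, ignores data distribution, the left factor and the loss of the checksum itself, and all function names and matrix values are hypothetical.

```python
from fractions import Fraction


def with_checksum(a):
    """Return a copy of `a` with one extra column holding each row's sum."""
    return [[Fraction(x) for x in row] + [Fraction(sum(row))] for row in a]


def eliminate(aug, k):
    """Apply one Gaussian-elimination step below pivot (k, k).

    The row operations run over the full augmented rows, so the checksum
    column is updated by the same arithmetic as the data columns.
    """
    pivot = aug[k][k]
    for i in range(k + 1, len(aug)):
        factor = aug[i][k] / pivot
        aug[i] = [x - factor * y for x, y in zip(aug[i], aug[k])]


def recover_column(aug, lost):
    """Rebuild data column `lost` as checksum minus the surviving columns."""
    n_data = len(aug[0]) - 1
    for row in aug:
        survivors = sum(row[j] for j in range(n_data) if j != lost)
        row[lost] = row[-1] - survivors


# Hypothetical 3x3 example; values chosen only for illustration.
aug = with_checksum([[2, 1, 1], [4, 3, 3], [8, 7, 9]])
eliminate(aug, 0)
expected = [row[:] for row in aug]

# Simulate a fail-stop failure that wipes out data column 2.
for row in aug:
    row[2] = None

recover_column(aug, lost=2)
```

Because the elimination step is a sequence of row operations, each row's checksum entry remains the sum of that row's data entries after the update, which is what makes the recovery possible without rolling back. Exact rational arithmetic is used here only to keep the comparison exact; a floating-point version would recover the column up to rounding error.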

Index Terms

Algorithm-based fault tolerance for dense matrix factorizations
1. Mathematics of computing
  1. Mathematical software

Published In

ACM SIGPLAN Notices Volume 47, Issue 8

PPOPP '12

August 2012

334 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/2370036

PPoPP '12: Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
February 2012
352 pages
ISBN:9781450311601
DOI:10.1145/2145816
General Chair: J. Ramanujam, Louisiana State University, USA
Program Chair: P. Sadayappan, The Ohio State University, USA

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States
