research-article

Algorithm-based fault tolerance for dense matrix factorizations

Published: 25 February 2012

Abstract

Dense matrix factorizations, such as LU, Cholesky and QR, are widely used in scientific applications that require solving systems of linear equations, eigenvalue problems, and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). This paper proposes a new hybrid approach, based on Algorithm-Based Fault Tolerance (ABFT), to help matrix factorization algorithms survive fail-stop failures. We consider extreme conditions, such as the absence of any reliable component and the possibility of losing both data and checksum from a single failure. We present a generic solution for protecting the right factor, where the updates are applied, in all of the above-mentioned factorizations. For the left factor, where the panel has been applied, we propose a scalable checkpointing algorithm. This algorithm features a high degree of checkpointing parallelism and cooperatively utilizes the checksum storage left over from the right factor protection. The fault-tolerant algorithms derived from this hybrid solution are applicable to a wide range of dense matrix factorizations, with minor modifications. Theoretical analysis shows that the fault tolerance overhead decreases sharply as the number of computing units and the problem size grow. Experimental results of LU and QR factorization on the Kraken (Cray XT5) supercomputer validate the theoretical evaluation and confirm negligible overhead, both with and without errors.
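
The checksum idea behind ABFT can be illustrated with a small single-process sketch. It is not the distributed algorithm proposed in the paper. It only shows, under simplifying assumptions (LU elimination without pivoting, one checksum column, exact rational arithmetic), that applying the trailing update to a checksum column keeps each row's checksum equal to the sum of its data entries, so a lost data column can be rebuilt. The function names `with_checksum`, `eliminate_step` and `recover_column` are hypothetical and chosen for illustration.

```python
from fractions import Fraction


def with_checksum(a):
    """Return a copy of square matrix `a` with one extra column of row sums."""
    return [[Fraction(x) for x in row] + [Fraction(sum(row))] for row in a]


def eliminate_step(m, k):
    """Apply elimination step k (no pivoting) to the augmented matrix in place.

    The update also touches the checksum column (index n), so every row keeps
    the invariant: checksum == sum of the row's n data entries.
    Multipliers are not stored; eliminated entries simply become zero.
    """
    n = len(m)
    for i in range(k + 1, n):
        factor = m[i][k] / m[k][k]
        for j in range(k, n + 1):
            m[i][j] -= factor * m[k][j]


def recover_column(m, lost):
    """Rebuild data column `lost` from the checksum column and the survivors."""
    n = len(m)
    for row in m:
        row[lost] = row[n] - sum(row[j] for j in range(n) if j != lost)


a = [[4, 3, 2], [8, 7, 9], [4, 6, 5]]
n = len(a)
m = with_checksum(a)

eliminate_step(m, 0)

# Simulate a fail-stop failure that wipes out data column 2 mid-factorization.
saved = [row[2] for row in m]
for row in m:
    row[2] = None
recover_column(m, 2)
recovered = [row[2] for row in m]

# Finish the factorization; the data part of m now holds the upper factor U.
eliminate_step(m, 1)
u = [row[:n] for row in m]
invariant_holds = all(row[n] == sum(row[:n]) for row in m)
```

This sketch tolerates the loss of a single data column only. The harder cases named in the abstract (no reliable component, losing data and checksum together, and protecting the left factor by checkpointing) are outside what it demonstrates.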

    Published In

    ACM SIGPLAN Notices, Volume 47, Issue 8
    PPOPP '12
    August 2012
    334 pages
    ISSN: 0362-1340
    EISSN: 1558-1160
    DOI: 10.1145/2370036

    PPoPP '12: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
    February 2012
    352 pages
    ISBN: 9781450311601
    DOI: 10.1145/2145816
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. ABFT
    2. LU
    3. QR
    4. fail-stop failure
    5. fault-tolerance