Algorithm-Based Fault Tolerance for Dense Matrix Factorizations, Multiple Failures and Accuracy

Published: 18 February 2015

Abstract

Dense matrix factorizations, such as LU, Cholesky, and QR, are widely used in scientific applications that require solving systems of linear equations, eigenvalue problems, and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). This article proposes a new hybrid approach, based on Algorithm-Based Fault Tolerance (ABFT), to help matrix factorization algorithms survive fail-stop failures. We consider extreme conditions, such as the absence of any reliable node and the possibility of losing both data and checksum from a single failure. We present a generic solution for protecting the right factor, where the updates are applied, for all of the above-mentioned factorizations. For the left factor, where the panel has been applied, we propose a scalable checkpointing algorithm. This algorithm features a high degree of checkpointing parallelism and cooperatively utilizes the checksum storage left over from the right factor protection. The fault-tolerant algorithms derived from this hybrid solution are applicable to a wide range of dense matrix factorizations, with minor modifications. Theoretical analysis shows that the fault tolerance overhead decreases as the number of computing units and the problem size scale up. Experimental results of LU and QR factorization on the Kraken (Cray XT5) supercomputer validate the theoretical evaluation and confirm negligible overhead, both with and without errors. The applicability to multiple failures and the accuracy after multiple recoveries are also considered.
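To make the checksum idea behind ABFT concrete, the following is a minimal single-process sketch, not the distributed algorithm of the article. It assumes a plain row-checksum encoding (one extra column holding each row's sum) and one step of Gaussian elimination without pivoting. The function names `add_checksum`, `eliminate_step`, and `recover_column` are hypothetical and chosen for illustration only. Because the elimination step applies the same row operations to the checksum column as to the data, each row still sums to its checksum entry afterwards, so a lost data column can be rebuilt by subtraction.

```python
# Illustrative sketch only: row-checksum ABFT on a small dense matrix.
# All function names here are hypothetical, not taken from the article.
from fractions import Fraction


def add_checksum(a):
    """Append one checksum column holding the sum of each row."""
    return [row + [sum(row)] for row in a]


def eliminate_step(m, k):
    """One step of Gaussian elimination (no pivoting) applied to ALL columns,
    checksum included, so each row still sums to its checksum entry."""
    for i in range(k + 1, len(m)):
        factor = m[i][k] / m[k][k]
        m[i] = [x - factor * y for x, y in zip(m[i], m[k])]


def recover_column(m, lost):
    """Rebuild data column `lost` as checksum minus the surviving data columns."""
    n = len(m[0]) - 1  # number of data columns
    for row in m:
        row[lost] = row[n] - sum(row[j] for j in range(n) if j != lost)


# Exact rational arithmetic keeps the example free of rounding effects.
a = [[Fraction(v) for v in row] for row in [[4, 3, 2], [8, 7, 9], [4, 6, 5]]]
m = add_checksum(a)
eliminate_step(m, 0)                 # update the trailing matrix

expected = [row[:] for row in m]     # reference copy, used only for checking
for row in m:
    row[2] = None                    # simulate a fail-stop loss of column 2
recover_column(m, 2)
assert m == expected
```

In this sketch the eliminated entries in column 0 simply become zero. An in-place factorization would instead store the multipliers of the left factor there, and those entries are not covered by the checksum relation, which is consistent with the abstract treating the left factor separately through checkpointing.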

    Published In

    ACM Transactions on Parallel Computing  Volume 1, Issue 2
    Special Issue on PPOPP 2012
    January 2015
    224 pages
    ISSN:2329-4949
    EISSN:2329-4957
    DOI:10.1145/2737841
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 18 February 2015
    Accepted: 01 June 2014
    Revised: 01 April 2014
    Received: 01 July 2013
    Published in TOPC Volume 1, Issue 2

    Author Tags

    1. ABFT
    2. fault-tolerance
    3. high performance computing
    4. linear algebra

    Qualifiers

    • Research-article
    • Research
    • Refereed
