High-performance bidiagonal reduction using tile algorithms on homogeneous multicore architectures

Published: 03 May 2013 Publication History

Abstract

This article presents a new high-performance bidiagonal reduction (BRD) for homogeneous multicore architectures. It extends the high-performance tridiagonal reduction implemented by the same authors [Luszczek et al., IPDPS 2011] to the BRD case. The BRD is the first step toward computing the singular value decomposition of a matrix, which is one of the most important algorithms in numerical linear algebra due to its broad impact in computational science. The high performance of the BRD described in this article comes from the combination of four important features: (1) tile algorithms with tile data layout, which provide an efficient data representation in main memory; (2) a two-stage reduction approach that casts most of the computation during the first stage (reduction to band form) into calls to Level 3 BLAS and reduces the memory traffic during the second stage (reduction from band to bidiagonal form) by using high-performance kernels optimized for cache reuse; (3) a data dependence translation layer that maps the general algorithm with column-major data layout into the tile data layout; and (4) a dynamic runtime system that efficiently schedules the newly implemented kernels across the processing units and ensures that the data dependencies are not violated. A detailed analysis examines the critical impact on the total execution time of the tile size, which also corresponds to the matrix bandwidth after the first-stage reduction. The performance results show a significant improvement over currently established alternatives. The new high-performance BRD achieves up to a 30-fold speedup on a 16-core Intel Xeon machine for a 12000 × 12000 matrix against state-of-the-art open-source and commercial numerical software packages, namely LAPACK compiled with optimized, multithreaded BLAS from MKL, and Intel MKL version 10.2.
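To make concrete what a bidiagonal reduction computes, the following is a minimal sketch of the classic one-stage Golub-Kahan reduction with Householder reflectors, written in NumPy. It is a textbook illustration only, not the two-stage tile algorithm described in the abstract; the function names `householder` and `bidiagonalize` are hypothetical and chosen for this example. The point it shows is that the orthogonal transformations leave the singular values unchanged, which is why the BRD can serve as the first step of an SVD.

```python
import numpy as np


def householder(x):
    """Return a unit vector v such that (I - 2 v v^T) x is a multiple of e_1.

    Hypothetical helper for this sketch. For a zero input the zero vector is
    returned, which makes the reflection the identity.
    """
    v = np.array(x, dtype=float)  # copy, so the caller's data is not modified
    norm_x = np.linalg.norm(v)
    if norm_x == 0.0:
        return v
    # Add the norm with the sign of x[0] to avoid cancellation.
    v[0] += np.copysign(norm_x, v[0])
    return v / np.linalg.norm(v)


def bidiagonalize(a):
    """Reduce an m x n matrix (m >= n) to upper bidiagonal form.

    Textbook one-stage reduction (assumption: dense column-oriented storage,
    no tiling, no blocking). Returns only the reduced matrix B = U^T A V.
    """
    b = np.array(a, dtype=float)
    m, n = b.shape
    for k in range(n):
        # Left reflector: annihilate the entries below the diagonal in column k.
        v = householder(b[k:, k])
        b[k:, k:] -= 2.0 * np.outer(v, v @ b[k:, k:])
        if k < n - 2:
            # Right reflector: annihilate the entries to the right of the
            # superdiagonal in row k.
            w = householder(b[k, k + 1:])
            b[k:, k + 1:] -= 2.0 * np.outer(b[k:, k + 1:] @ w, w)
    return b


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((6, 4))
    b = bidiagonalize(a)
    # The singular values of A and B agree up to rounding error.
    print(np.linalg.svd(a, compute_uv=False))
    print(np.linalg.svd(b, compute_uv=False))
```

Each step of this sketch applies reflectors to the whole trailing matrix, which is the memory-bound pattern that a two-stage approach is meant to avoid by going through a band form first.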

References

[1]
Agullo, E., Dongarra, J., Nath, R., and Tomov, S. 2010. Autotuned dense QR factorization for multicore architectures. Tech. rep. RR-7526, Institut National de Recherche en Informatique et en Automatique (INRIA). arXiv:1102.5328.
[2]
Agullo, E., Hadri, B., Ltaief, H., and Dongarra, J. 2009. Comparative study of one-sided factorizations with multiple software packages on multi-core hardware. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC'09). ACM, New York, 1--12.
[3]
Anderson, E., Bai, Z., Bischof, C., Blackford, S. L., Demmel, J. W., Dongarra, J. J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., and Sorensen, D. C. 1999. LAPACK Users' Guide 3rd Ed. SIAM, Philadelphia, PA.
[4]
Anderson, E. and Dongarra, J. J. 1990. Evaluating block algorithm variants in LAPACK. In Parallel Processing for Scientific Computing, J. Dongarra et al. Eds., SIAM, Philadelphia, PA., 3--8.
[5]
Barlow, J. L., Bosner, N., and Drmač, Z. 2005. A new stable bidiagonal reduction algorithm. Linear Algebra Appl. 397, 1, 35--84.
[6]
Bientinesi, P., Igual, F., Kressner, D., and Quintana-Ortí, E. 2010. Reduction to condensed forms for symmetric eigenvalue problems on multi-core architectures. Parallel Process. Appl. Math. 6067, 387--395.
[7]
Bischof, C. H., Lang, B., and Sun, X. 2000. Algorithm 807: The SBR Toolbox—Software for successive band reduction. ACM Trans. Math. Softw. 26, 4, 602--616.
[8]
Blackford, L. S., Choi, J., Cleary, A., D'Azevedo, E. F., Demmel, J. W., Dhillon, I. S., Dongarra, J. J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D. W., and Whaley, R. C. 1997. ScaLAPACK Users' Guide. SIAM, Philadelphia, PA.
[9]
Bosilca, G., Bouteiller, A., Danalis, A., Faverge, M., Haidar, A., Herault, T., Kurzak, J., Langou, J., Lemarinier, P., Ltaief, H., Luszczek, P., YarKhan, A., and Dongarra, J. 2011. Flexible development of dense linear algebra algorithms on massively parallel architectures with DPLASMA. In Proceedings of the 12th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC-11). ACM, New York.
[10]
Bosner, N. and Barlow, J. L. 2007. Block and parallel versions of one-sided bidiagonalization. SIAM J. Matrix Anal. Appl. 29, 3, 927--953.
[11]
Buttari, A., Dongarra, J., Kurzak, J., Langou, J., Luszczek, P., and Tomov, S. 2006. The impact of multicore on math software. In Proceedings of the 8th International Workshop on Applied Parallel Computing. State of the Art in Scientific Computing (PARA). B. Kågström, et al. Eds., Lecture Notes in Computer Science, vol. 4699 Springer, Berlin, 1--10.
[12]
Buttari, A., Langou, J., Kurzak, J., and Dongarra, J. J. 2008. Parallel tiled QR factorization for multicore architectures. Concurrency Comput. Pract. Exper. 20, 13, 1573--1590. http://dx.doi.org/10.1002/cpe.1301.
[13]
Buttari, A., Langou, J., Kurzak, J., and Dongarra, J. J. 2009. A class of parallel tiled linear algebra algorithms for multicore architectures. Parallel Comput. Syst. Appl. 35, 38--53. http://dx.doi.org/10.1016/j.parco.2008.10.002.
[14]
Choi, J., Dongarra, J. J., Ostrouchov, S., Petitet, A., Walker, D. W., and Whaley, R. C. 1996. The design and implementation of the ScaLAPACK LU, QR, and Cholesky factorization routines. Sci. Program. 5, 173--184.
[15]
Dackland, K., Elmroth, E., Kågström, B., and Van Loan, C. 1992. Design and evaluation of parallel block algorithms: LU factorization on an IBM 3090 VF/600J. In Proceedings of the 5th SIAM Conference on Parallel Processing for Scientific Computing. SIAM, Philadelphia, PA, 3--10.
[16]
D'Azevedo, E. and Luszczek, P. 2003. A framework for check-pointed fault-tolerant out-of-core linear algebra. In Proceedings of the SIAM Conference on Computational Science and Engineering (CSE03). SIAM, Philadelphia, PA.
[17]
Deift, P., Demmel, J. W., Li, L.-C., and Tomei, C. 1991. The bidiagonal singular value decomposition and Hamiltonian mechanics. SIAM J. Numer. Anal. 28, 5, 1463--1516. (LAPACK Working Note #11).
[18]
Demmel, J. W. and Kahan, W. 1990. Accurate singular values of bidiagonal matrices. SIAM J. Sci. Stat. Comput. 11, 5, 873--912. (Also LAPACK Working Note #3).
[19]
Dongarra, J. 2010. PLASMA Users' Guide, Parallel Linear Algebra Software for Multicore Architectures, Version 2.3. University of Tennessee.
[20]
Fernando, V. and Parlett, B. 1994. Accurate singular values and differential qd algorithms. Numer. Math. 67, 191--229.
[21]
Golub, G. H. and Reinsch, C. 1970. Singular value decomposition and least squares solutions. Numer. Math. 14, 403--420.
[22]
Golub, G. H. and Van Loan, C. F. 1996. Matrix Computations 3rd Ed. Johns Hopkins University Press, Baltimore, MD.
[23]
Grosser, B. and Lang, B. 1999. Efficient parallel reduction to bidiagonal form. Parallel Comput. 25, 8, 969--986.
[24]
Gu, M. and Eisenstat, S. 1995. A divide-and-conquer algorithm for the bidiagonal SVD. SIAM J. Matrix Anal. Appl. 16, 79--92.
[25]
Gustavson, F. G. 2000. New generalized matrix data structures lead to a variety of high-performance algorithms. In Proceedings of the IFIP WG 2.5 Working Conference on Software Architectures for Scientific Computing Applications. Kluwer Academic, Amsterdam, 211--234.
[26]
Haidar, A., Ltaief, H., Yarkhan, A., and Dongarra, J. 2011. Analysis of dynamically scheduled tile algorithms for dense linear algebra on multicore architectures. Concurrency and Computation: Practice and Experience. Tech. rep. UT-CS-11-666, University of Tennessee.
[27]
Hennessy, J. L. and Patterson, D. A. 2012. Computer Architecture: A Quantitative Approach 5th Ed. Morgan Kaufmann.
[28]
Hotelling, H. 1933. Analysis of a complex of statistical variables into principal components. J. Edu. Psych. 24, 417--441, 498--520.
[29]
Hotelling, H. 1935. Simplified calculation of principal components. Psychometrika 1, 27--35.
[30]
Householder, A. S. 1958. Unitary triangularization of a nonsymmetric matrix. J. ACM 5, 4. DOI 10.1145/320941.320947.
[31]
Jessup, E. R. and Sorensen, D. 1994. A parallel algorithm for computing the singular value decomposition of a matrix. SIAM J. Matrix Anal. Appl. 15, 530--548.
[32]
Kågström, B., Kressner, D., Quintana-Ortí, E., and Quintana-Ortí, G. 2008. Blocked algorithms for the reduction to Hessenberg-triangular form revisited. BIT Numer. Math. 48, 563--584.
[33]
Ltaief, H., Kurzak, J., and Dongarra, J. 2010. Parallel two-sided matrix reduction to band bidiagonal form on multicore architectures. IEEE Trans. Parallel Distrib. Syst. 417--423.
[34]
Luszczek, P., Ltaief, H., and Dongarra, J. 2011. Two-stage tridiagonal reduction for dense symmetric matrices using tile algorithms on multicore architectures. In Proceedings of IPDPS 2011. ACM, New York.
[35]
MKL. 2011. Intel, Math Kernel Library (MKL). http://www.intel.com/software/products/mkl/. Version 10.2.
[36]
Moore, B. C. 1981. Principal component analysis in linear systems: Controllability, observability, and model reduction. IEEE Trans. Autom. Control AC-26, 1.
[37]
Perez, J., Badia, R., and Labarta, J. 2008. A dependency-aware task-based programming environment for multi-core architectures. In Proceedings of the IEEE International Conference on Cluster Computing. IEEE, Los Alamitos, CA. 142--151.
[38]
Ralha, R. 2003. One-sided reduction to bidiagonal form. Linear Algebra Appl. 358, 219--238.
[39]
SMPSs Team. 2008. SMP Superscalar (SMPSs) User's Manual. Version 2.3.
[40]
Stewart, G. W. 2000. The decompositional approach to matrix computation. Comput. Sci. Eng. 2, 1, 50--59.
[41]
Trefethen, L. N. and Bau, D. 1997. Numerical Linear Algebra. SIAM, Philadelphia, PA.
[42]
Yi, Q., Kennedy, K., You, H., Seymour, K., and Dongarra, J. 2004. Automatic blocking of QR and LU factorizations for locality. In Proceedings of the 2nd ACM SIGPLAN Workshop on Memory System Performance (MSP'04). ACM, New York.

    Published In

    ACM Transactions on Mathematical Software  Volume 39, Issue 3
    April 2013
    149 pages
    ISSN:0098-3500
    EISSN:1557-7295
    DOI:10.1145/2450153
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 03 May 2013
    Accepted: 01 June 2012
    Revised: 01 March 2012
    Received: 01 May 2011
    Published in TOMS Volume 39, Issue 3

    Author Tags

    1. Bidiagonal reduction
    2. bulge chasing
    3. data translation layer
    4. dynamic scheduling
    5. high performance kernels
    6. tile algorithms
    7. two-stage approach

    Qualifiers

    • Research-article
    • Research
    • Refereed
