The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme Scale

Published: 01 January 2018

Abstract

The computation of the singular value decomposition, or SVD, has a long history with many improvements over the years, both in its implementations and algorithmically. Here, we survey the evolution of SVD algorithms for dense matrices, discussing the motivation and performance impacts of changes. There are two main branches of dense SVD methods: bidiagonalization and Jacobi. Bidiagonalization methods started with the implementation by Golub and Reinsch in Algol60, which was subsequently ported to Fortran in the EISPACK library, and was later more efficiently implemented in the LINPACK library, targeting contemporary vector machines. To address cache-based memory hierarchies, the SVD algorithm was reformulated to use Level 3 BLAS in the LAPACK library. To address new architectures, ScaLAPACK was introduced to take advantage of distributed computing, and MAGMA was developed for accelerators such as GPUs. Algorithmically, the divide and conquer and MRRR algorithms were developed to reduce the number of operations. Still, these methods remained memory bound, so two-stage algorithms were developed to reduce memory operations and increase the computational intensity, with efficient implementations in PLASMA, DPLASMA, and MAGMA. Jacobi methods started with the two-sided method of Kogbetliantz and the one-sided method of Hestenes. They have likewise had many developments, including parallel and block versions and preconditioning to improve convergence. In this paper, we investigate the impact of these changes by testing various historical and current implementations on a common, modern multicore machine and a distributed computing platform. We show that algorithmic and implementation improvements have increased the speed of the SVD by several orders of magnitude, while using up to 40 times less energy.
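
To make the one-sided Jacobi idea mentioned above concrete, here is a minimal pure-Python sketch. It is not one of the implementations benchmarked in the paper. The function name, the tolerance `tol`, and the sweep limit `max_sweeps` are assumptions chosen for illustration. It applies plane rotations to pairs of columns until all columns are mutually orthogonal. The column norms of the rotated matrix are then the singular values.

```python
import math


def one_sided_jacobi_singular_values(a, tol=1e-12, max_sweeps=30):
    """Singular values of a dense m-by-n matrix given as a list of rows (m >= n).

    Illustrative sketch only: unblocked, serial, cyclic-by-rows ordering,
    no preconditioning, and no accumulation of singular vectors.
    """
    g = [row[:] for row in a]  # work on a copy; columns are rotated in place
    m, n = len(g), len(g[0])
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                # Entries of the 2x2 Gram matrix of columns p and q.
                alpha = sum(g[i][p] ** 2 for i in range(m))
                beta = sum(g[i][q] ** 2 for i in range(m))
                gamma = sum(g[i][p] * g[i][q] for i in range(m))
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue  # columns already numerically orthogonal
                rotated = True
                # Smaller-magnitude root of t^2 + 2*zeta*t - 1 = 0.
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                # Rotate the two columns so that they become orthogonal.
                for i in range(m):
                    gp, gq = g[i][p], g[i][q]
                    g[i][p] = c * gp - s * gq
                    g[i][q] = s * gp + c * gq
        if not rotated:
            break  # a full sweep without rotations: converged
    norms = [math.sqrt(sum(g[i][j] ** 2 for i in range(m))) for j in range(n)]
    return sorted(norms, reverse=True)


sigma = one_sided_jacobi_singular_values([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Because every step is an orthogonal rotation of two columns, the Frobenius norm of the matrix is preserved, which gives a cheap sanity check: the squared singular values should sum to the sum of squared entries of the input.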

Published In

SIAM Review, Volume 60, Issue 4
DOI:10.1137/siread.60.4

Publisher

Society for Industrial and Applied Mathematics
United States

Author Tags

1. singular value decomposition
2. SVD
3. bidiagonal matrix
4. QR iteration
5. divide and conquer
6. bisection
7. MRRR
8. Jacobi method
9. Kogbetliantz method
10. Hestenes method

AMS Subject Classifications

1. 15A18
2. 15A23
3. 65Y05

Qualifiers

• Research-article
