
A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines

Published: 26 June 2021

Abstract

This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The matrices are grouped into uniformly sized groups, with just one group if all the matrices are of equal size. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance many-core platforms, including multicore and many-core CPU processors, GPUs, coprocessors, and other hardware accelerators with floating-point compute capability. As well as the standard single and double precisions, the standard also includes half and quadruple precision. Half precision in particular is used in many very large-scale applications, such as those associated with machine learning.



Published In

ACM Transactions on Mathematical Software, Volume 47, Issue 3 (September 2021), 251 pages
ISSN: 0098-3500
EISSN: 1557-7295
DOI: 10.1145/3472960

Publisher

Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 26 June 2021
    Accepted: 01 October 2020
    Revised: 01 July 2020
    Received: 01 October 2019
    Published in TOMS Volume 47, Issue 3

    Author Tags

    1. BLAS
    2. batched BLAS

    Qualifiers

    • Research-article
    • Research
    • Refereed
