
A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines

Published: 26 June 2021

Abstract

This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The matrices are grouped into uniformly sized groups, with just one group if all the matrices are of equal size. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance many-core platforms, including multicore and many-core CPU processors, GPUs, coprocessors, and other hardware accelerators with floating-point compute capability. As well as the standard single and double precisions, the standard also includes half and quadruple precision. Half precision in particular is used in many very large-scale applications, such as those associated with machine learning.



Published In

ACM Transactions on Mathematical Software, Volume 47, Issue 3 (September 2021), 251 pages
ISSN: 0098-3500
EISSN: 1557-7295
DOI: 10.1145/3472960

Publisher

Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 26 June 2021
    Accepted: 01 October 2020
    Revised: 01 July 2020
    Received: 01 October 2019
    Published in TOMS Volume 47, Issue 3

    Author Tags

    1. BLAS
    2. batched BLAS

    Qualifiers

    • Research-article
    • Research
    • Refereed
