KBLAS: An Optimized Library for Dense Matrix-Vector Multiplication on GPU Accelerators

Published: 10 May 2016

Abstract

KBLAS is an open-source, high-performance library that provides optimized kernels for a subset of Level 2 BLAS functionalities on CUDA-enabled GPUs. Since the performance of dense matrix-vector multiplication is hindered by the overhead of memory accesses, a double-buffering optimization technique is employed to overlap data motion with computation. After identifying a proper set of tuning parameters, KBLAS runs efficiently on various GPU architectures without code rewriting while remaining compliant with the standard BLAS API. A further optimization ensures coalesced memory access when dealing with submatrices, which is especially important for high-level dense linear algebra algorithms. All KBLAS kernels have been extended to multi-GPU environments, which required introducing new APIs. For general matrices, KBLAS is very competitive with existing state-of-the-art kernels and provides smoother performance across a wide range of matrix dimensions. For symmetric and Hermitian matrices, KBLAS outperforms existing state-of-the-art implementations on all matrix sizes, achieving asymptotic speedups of up to 50% and 60% over the best competitor on single-GPU and multi-GPU systems, respectively. The performance results also validate our performance model. For wider dissemination, a subset of KBLAS high-performance kernels has been integrated into NVIDIA's standard BLAS implementation (cuBLAS), starting from version 6.0.
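To make the double-buffering idea concrete, below is a minimal CUDA sketch of a double-buffered matrix-vector multiply (y = A*x, column-major), assuming one thread per row and a column-tile width BLK that divides n. The kernel name gemv_db, the macro BLK, and the simplifications are illustrative assumptions and do not reflect the actual KBLAS kernels or API: while threads compute on the current tile of x held in shared memory, the next tile is prefetched into a second buffer.

    // Illustrative sketch only: double-buffered y = A*x for column-major A.
    // While the current tile of x in shared memory is consumed, the next
    // tile is prefetched into the second buffer, overlapping data motion
    // with computation.
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <vector>

    #define BLK 64  // column-tile width; a tunable parameter in the spirit of the paper

    __global__ void gemv_db(int m, int n, const double* A, int lda,
                            const double* x, double* y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        __shared__ double xs[2][BLK];        // two buffers for tiles of x

        double acc = 0.0;
        int nb = n / BLK;                    // assume BLK divides n for brevity

        if (threadIdx.x < BLK)               // preload first tile into buffer 0
            xs[0][threadIdx.x] = x[threadIdx.x];
        __syncthreads();

        for (int t = 0; t < nb; ++t) {
            int cur = t & 1, nxt = cur ^ 1;
            if (t + 1 < nb && threadIdx.x < BLK)   // prefetch next tile
                xs[nxt][threadIdx.x] = x[(t + 1) * BLK + threadIdx.x];
            if (row < m)
                for (int j = 0; j < BLK; ++j)      // consume current tile
                    acc += A[row + (size_t)(t * BLK + j) * lda] * xs[cur][j];
            __syncthreads();                 // make prefetch visible before reuse
        }
        if (row < m) y[row] = acc;
    }

    int main()
    {
        const int m = 512, n = 512;          // n is a multiple of BLK
        std::vector<double> A((size_t)m * n, 1.0), x(n, 1.0), y(m, 0.0);
        double *dA, *dx, *dy;
        cudaMalloc(&dA, A.size() * sizeof(double));
        cudaMalloc(&dx, x.size() * sizeof(double));
        cudaMalloc(&dy, y.size() * sizeof(double));
        cudaMemcpy(dA, A.data(), A.size() * sizeof(double), cudaMemcpyHostToDevice);
        cudaMemcpy(dx, x.data(), x.size() * sizeof(double), cudaMemcpyHostToDevice);
        gemv_db<<<(m + BLK - 1) / BLK, BLK>>>(m, n, dA, m, dx, dy);
        cudaMemcpy(y.data(), dy, y.size() * sizeof(double), cudaMemcpyDeviceToHost);
        printf("y[0] = %.1f (expected %d)\n", y[0], n);  // all-ones data => y[i] = n
        cudaFree(dA); cudaFree(dx); cudaFree(dy);
        return 0;
    }

In an actual library, quantities such as the tile width would be among the per-architecture tuning parameters the abstract alludes to; this sketch fixes them for brevity.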




    Published In

ACM Transactions on Mathematical Software, Volume 42, Issue 3
June 2016
208 pages
ISSN: 0098-3500
EISSN: 1557-7295
DOI: 10.1145/2935754

    Publisher

Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 10 May 2016
    Accepted: 01 August 2015
    Revised: 01 May 2015
    Received: 01 September 2014
    Published in TOMS Volume 42, Issue 3


    Author Tags

    1. Basic linear algebra subroutines
    2. CUDA optimizations
    3. GPU accelerators
    4. memory-bound kernels

    Qualifiers

    • Research-article
    • Research
    • Refereed
