Towards batched linear solvers on accelerated hardware platforms

Published: 24 January 2015

Abstract

As hardware evolves, an increasingly effective approach to developing energy-efficient, high-performance solvers is to design them to work on many small and independent problems. Indeed, many applications already need this functionality, especially on GPUs, which are currently about four to five times more energy efficient than multicore CPUs per floating-point operation. In this paper, we describe the development of the main one-sided factorizations, LU, QR, and Cholesky, for sets of small dense matrices that are processed in parallel. We refer to such algorithms as batched factorizations. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-contained execution. Note that this is similar in functionality to the LAPACK and the hybrid MAGMA algorithms for large-matrix factorizations, but it differs from the straightforward approach in which each of the GPU's streaming multiprocessors factorizes a single problem at a time. We illustrate how our performance analysis, together with profiling and tracing tools, guided the development of batched factorizations that achieve up to a 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library on a two-socket Intel Sandy Bridge server. Compared to the batched LU factorization featured in NVIDIA's CUBLAS library for GPUs, we achieve up to a 2.5-fold speedup on the K40 GPU.
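
To make the batched interface concrete: the CUBLAS baseline the abstract compares against is exposed as cublasDgetrfBatched, which LU-factorizes an entire batch of small matrices in a single GPU-contained call. The sketch below is illustrative only; the batch size (1000), matrix order (32), and synthetic data are assumptions for demonstration, and it shows the "many small independent problems" interface rather than the paper's own batched-BLAS-based factorizations.

```c
/* Minimal sketch (assumed sizes): one GPU-contained call that LU-factorizes
 * a whole batch of small matrices via cuBLAS's cublasDgetrfBatched, the
 * baseline the paper compares against. */
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define BATCH 1000 /* number of independent small problems (assumed) */
#define ORDER 32   /* order of each square matrix (assumed) */

int main(void) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    /* One contiguous device slab holding all matrices, column-major. */
    size_t elems = (size_t)ORDER * ORDER * BATCH;
    double *dA;
    cudaMalloc((void **)&dA, elems * sizeof(double));

    /* Random host data, made diagonally dominant so LU is well behaved. */
    double *hA = (double *)malloc(elems * sizeof(double));
    for (size_t i = 0; i < elems; ++i)
        hA[i] = (double)rand() / RAND_MAX;
    for (int b = 0; b < BATCH; ++b)
        for (int j = 0; j < ORDER; ++j)
            hA[(size_t)b * ORDER * ORDER + (size_t)j * ORDER + j] += ORDER;
    cudaMemcpy(dA, hA, elems * sizeof(double), cudaMemcpyHostToDevice);

    /* cublasDgetrfBatched takes a device array of per-matrix pointers. */
    double **hAarray = (double **)malloc(BATCH * sizeof(double *));
    for (int b = 0; b < BATCH; ++b)
        hAarray[b] = dA + (size_t)b * ORDER * ORDER;
    double **dAarray;
    cudaMalloc((void **)&dAarray, BATCH * sizeof(double *));
    cudaMemcpy(dAarray, hAarray, BATCH * sizeof(double *),
               cudaMemcpyHostToDevice);

    int *dPiv, *dInfo; /* pivot indices and per-matrix status codes */
    cudaMalloc((void **)&dPiv, (size_t)ORDER * BATCH * sizeof(int));
    cudaMalloc((void **)&dInfo, BATCH * sizeof(int));

    /* Factor all BATCH matrices at once: A_i = P_i * L_i * U_i. */
    cublasStatus_t st =
        cublasDgetrfBatched(handle, ORDER, dAarray, ORDER, dPiv, dInfo, BATCH);
    cudaDeviceSynchronize();
    printf("cublasDgetrfBatched status: %d\n", (int)st);

    cudaFree(dA); cudaFree(dAarray); cudaFree(dPiv); cudaFree(dInfo);
    free(hA); free(hAarray);
    cublasDestroy(handle);
    return 0;
}
```

Compile with nvcc and link against -lcublas. The design point the paper targets is visible here: a single launch covers the whole batch, rather than issuing 1000 separate factorizations whose per-call overhead would dominate at these small matrix sizes.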

Cited By

  • (2023) Cache Optimization and Performance Modeling of Batched, Small, and Rectangular Matrix Multiplication on Intel, AMD, and Fujitsu Processors. ACM Transactions on Mathematical Software 49(3), 1-29. DOI: 10.1145/3595178. Online publication date: 19-Sep-2023.
  • (2016) On the Development of Variable Size Batched Computation for Heterogeneous Parallel Architectures. 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 1249-1258. DOI: 10.1109/IPDPSW.2016.190. Online publication date: May-2016.
  • (2021) A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines. ACM Transactions on Mathematical Software 47(3), 1-23. DOI: 10.1145/3431921. Online publication date: 26-Jun-2021.

Published In

ACM SIGPLAN Notices, Volume 50, Issue 8 (PPoPP '15)
August 2015, 290 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2858788
Editor: Andy Gill

PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
January 2015, 290 pages
ISBN: 9781450332057
DOI: 10.1145/2688500
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 January 2015
Published in SIGPLAN Volume 50, Issue 8

Author Tags

  1. batched factorization
  2. hardware accelerators
  3. numerical linear algebra
  4. numerical software libraries
  5. one-sided factorization algorithms

Qualifiers

  • Abstract

Article Metrics

  • Downloads (last 12 months): 85
  • Downloads (last 6 weeks): 19
Reflects downloads up to 07 Nov 2024.
