research-article

Open access

Optimization of Triangular and Banded Matrix Operations Using 2d-Packed Layouts

Authors:

Toufik Baroudi,

Vincent Loechner

ACM Transactions on Architecture and Code Optimization (TACO), Volume 14, Issue 4

Article No.: 55, Pages 1 - 19

https://doi.org/10.1145/3162016

Published: 18 December 2017

Abstract

Over the past few years, multicore systems have become increasingly powerful and thereby very useful in high-performance computing. However, many applications, such as some linear algebra algorithms, still cannot take full advantage of these systems. This is mainly due to the shortage of optimization techniques dealing with irregular control structures. In particular, the well-known polyhedral model fails to optimize loop nests whose bounds and/or array references are not affine functions. This is more likely to occur when handling sparse matrices in their packed formats. In this article, we propose using 2d-packed layouts and simple affine transformations to enable optimization of triangular and banded matrix operations. The benefit of our proposal is shown through an experimental study over a set of linear algebra benchmarks.
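
To make the abstract's point about non-affine array references concrete, the following is a minimal sketch and not the layout defined in the article. It contrasts the usual linear packed offset of a lower-triangular matrix, which is quadratic in the row index, with one possible fold of the triangle into a rectangular 2D array whose index expressions are affine on each of two row ranges. The function names and the particular fold (which assumes an even matrix order n) are hypothetical and chosen only for illustration.

```python
"""Sketch: linear packed storage versus one possible 2D fold of a lower triangle."""


def linear_packed_index(i: int, j: int) -> int:
    """Row-major packed offset of element (i, j), with j <= i, in a 1D array.

    The term i * (i + 1) // 2 is quadratic in i, so this array reference
    is not an affine function of the loop indices.
    """
    return i * (i + 1) // 2 + j


def folded_2d_index(i: int, j: int, n: int) -> tuple:
    """Hypothetical fold (an assumption, not taken from the article).

    For even n, the lower triangle is stored in an (n // 2) x (n + 1)
    rectangle. Rows i >= n // 2 keep their columns; each shorter row
    i < n // 2 fills the unused tail of the rectangle row that already
    holds row n - 1 - i. Both branches are affine in (i, j) for fixed n.
    """
    assert n % 2 == 0 and 0 <= j <= i < n
    half = n // 2
    if i >= half:
        return i - half, j
    return half - 1 - i, n - i + j


def check_fold(n: int) -> bool:
    """Return True if the fold maps the triangle one-to-one onto the rectangle."""
    cells = {folded_2d_index(i, j, n) for i in range(n) for j in range(i + 1)}
    in_bounds = all(0 <= r < n // 2 and 0 <= c <= n for r, c in cells)
    return in_bounds and len(cells) == n * (n + 1) // 2 == (n // 2) * (n + 1)
```

With a fold of this kind, a loop nest over the triangle can be split at row n // 2 so that each part has affine bounds and affine 2D subscripts, which is the property polyhedral tools require. The fold shown here wastes no storage for even n; other orders and banded matrices would need a different mapping.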

Index Terms

1. Mathematics of computing
  1. Mathematical software
    1. Mathematical software performance
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Source code generation
  2. Software organization and properties
    1. Extra-functional properties
      1. Software performance

Published In

ACM Transactions on Architecture and Code Optimization Volume 14, Issue 4

December 2017

600 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3154814

Editor:
Koen De Bosschere
Ghent University

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 December 2017

Accepted: 01 November 2017

Revised: 01 November 2017

Received: 01 May 2017

Published in TACO Volume 14, Issue 4

Qualifiers

Research-article
Research
Refereed
