research-article
Open access

Optimization of Triangular and Banded Matrix Operations Using 2d-Packed Layouts

Published: 18 December 2017

Abstract

Over the past few years, multicore systems have become increasingly powerful and thereby very useful in high-performance computing. However, many applications, such as some linear algebra algorithms, still cannot take full advantage of these systems. This is mainly due to the shortage of optimization techniques dealing with irregular control structures. In particular, the well-known polyhedral model fails to optimize loop nests whose bounds and/or array references are not affine functions. This is more likely to occur when handling sparse matrices in their packed formats. In this article, we propose using 2d-packed layouts and simple affine transformations to enable optimization of triangular and banded matrix operations. The benefit of our proposal is shown through an experimental study over a set of linear algebra benchmarks.
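To make the affine versus non-affine distinction concrete, the following minimal sketch contrasts two ways of storing a lower triangular matrix of order n. The first is the conventional row-major packed (1D) storage, whose index is quadratic in the row number and therefore not affine. The second is a hypothetical folding of the triangle into a rectangular 2D array, in which each index is a piecewise affine function of (i, j, n). The folding and the names packed_index, folded_index, and check_layouts are assumptions made for illustration only; they are not claimed to be the layout defined in the article.

```python
def packed_index(i, j):
    """Row-major packed (1D) index of element (i, j), j <= i, of a lower
    triangular matrix. The term i*(i+1)/2 is quadratic in i, so the array
    reference is not an affine function of the loop indices."""
    return i * (i + 1) // 2 + j


def folded_index(i, j, n):
    """Hypothetical 2D folding (an assumption, not the article's layout).

    Row i of the triangle holds i+1 elements. Long rows (i >= n//2) are
    stored left-aligned in rectangle row n-1-i; short rows (i < n//2) are
    stored in rectangle row i, to the right of the long row they share it
    with. Each branch is affine in (i, j, n)."""
    if i >= n // 2:
        return (n - 1 - i, j)
    return (i, n - i + j)


def check_layouts(n):
    """Return True if both mappings place the n*(n+1)/2 elements of the
    triangle in distinct slots and the folded one stays inside a
    ceil(n/2) x (n+1) rectangle."""
    rows = (n + 1) // 2
    seen_1d, seen_2d = set(), set()
    for i in range(n):
        for j in range(i + 1):
            seen_1d.add(packed_index(i, j))
            r, c = folded_index(i, j, n)
            if not (0 <= r < rows and 0 <= c <= n):
                return False
            seen_2d.add((r, c))
    count = n * (n + 1) // 2
    return len(seen_1d) == count and len(seen_2d) == count


if __name__ == "__main__":
    for n in (1, 4, 5, 8):
        print(n, check_layouts(n))
```

In this sketch the rectangle uses about n(n+1)/2 slots, close to the packed 1D storage, while each access is described by affine expressions guarded by an affine condition, which is the kind of reference that polyhedral tools can analyze.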

Published In

ACM Transactions on Architecture and Code Optimization, Volume 14, Issue 4
December 2017
600 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/3154814
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 December 2017
Accepted: 01 November 2017
Revised: 01 November 2017
Received: 01 May 2017
Published in TACO Volume 14, Issue 4

Author Tags

  1. 2d-packed layouts
  2. Polyhedral model
  3. code optimization and parallelization
  4. sparse matrices

Qualifiers

  • Research-article
  • Research
  • Refereed
