Research Article | Public Access
DOI: 10.1145/3559009.3569668

Custom High-Performance Vector Code Generation for Data-Specific Sparse Computations

Published: 27 January 2023

Abstract

Sparse computations, such as sparse matrix-dense vector multiplication (SpMV), are notoriously hard to optimize due to their irregularity and memory-boundedness. Proposed solutions range from hardware support, such as gather-scatter instructions, to software approaches, such as generalized and dedicated sparse formats used together with specialized executor programs for different hardware targets. These computations are often performed on read-only sparse structures: while the data values themselves may change, the sparsity structure does not. Indeed, sparse formats such as CSR typically make it expensive to insert or remove nonzero elements in the representation, and the common use case is to leave the sparsity structure untouched across possibly repeated computations on the same structure.
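
For concreteness, the generic executor implied by such formats is the familiar CSR SpMV kernel, sketched below in textbook form (an illustrative sketch, not code from the paper). The col_idx indirection array is what causes the irregular, gather-like accesses into the dense vector x:

    /*
     * Generic CSR SpMV: y = A * x.
     * row_ptr has n_rows + 1 entries; col_idx and val have one entry per
     * nonzero. The indirect load x[col_idx[k]] is the gather-like access
     * that makes this loop hard to vectorize efficiently.
     */
    #include <stddef.h>

    void spmv_csr(size_t n_rows, const size_t *row_ptr, const size_t *col_idx,
                  const double *val, const double *x, double *y)
    {
        for (size_t i = 0; i < n_rows; ++i) {
            double acc = 0.0;
            for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                acc += val[k] * x[col_idx[k]];   /* indirection into x */
            y[i] = acc;
        }
    }
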
In this work, we exploit the possibility of generating a specialized executor program dedicated to the particular sparsity structure of an input matrix. This creates opportunities to remove indirection arrays and synthesize regular, vectorizable code for such computations, but it also introduces challenges in code size, instruction generation, and efficient SIMD vectorization. We present novel techniques and extensive experimental results for efficiently generating SIMD vector code for data-specific sparse computations, and we study the limits of our techniques, in terms of applicability and performance, compared to state-of-practice high-performance libraries such as Intel MKL.
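
To make the idea concrete, the following hedged sketch shows what a generated, data-specific executor could look like for a small hypothetical matrix whose sparsity structure is fixed at code-generation time (made-up row contents, AVX intrinsics chosen only for illustration; this is not the paper's actual generator). The CSR indirection arrays disappear, a contiguous run of nonzeros becomes a plain vector load, and isolated nonzeros stay scalar:

    /*
     * Hypothetical data-specific executor for one fixed matrix:
     * row 0 has nonzeros at columns 2..5 (a contiguous run of 4),
     * row 1 has isolated nonzeros at columns 0 and 7.
     * Column indices are compile-time constants, so CSR's row_ptr/col_idx
     * arrays are gone. Illustrative sketch only, not the paper's generator.
     */
    #include <immintrin.h>

    void spmv_specialized(const double *val, const double *x, double *y)
    {
        /* Row 0: 4-wide contiguous run -> vector multiply, then reduce. */
        __m256d v  = _mm256_loadu_pd(&val[0]);
        __m256d xv = _mm256_loadu_pd(&x[2]);
        __m256d p  = _mm256_mul_pd(v, xv);
        __m128d lo = _mm256_castpd256_pd128(p);
        __m128d hi = _mm256_extractf128_pd(p, 1);
        __m128d s  = _mm_add_pd(lo, hi);
        y[0] = _mm_cvtsd_f64(_mm_add_sd(s, _mm_unpackhi_pd(s, s)));

        /* Row 1: isolated nonzeros stay scalar, still with constant indices. */
        y[1] = val[4] * x[0] + val[5] * x[7];

        /* ...remaining rows would be emitted the same way... */
    }

Emitting such straight-line fragments for every row of a large matrix is precisely where the code-size and instruction-generation challenges mentioned above come from.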

Information

Published In

PACT '22: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques
October 2022
569 pages
ISBN:9781450398688
DOI:10.1145/3559009
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

In-Cooperation

  • IFIP WG 10.3
  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data-specific compilation
  2. sparse data structure
  3. vectorization

Qualifiers

  • Research-article

Conference

PACT '22

Acceptance Rates

Overall Acceptance Rate: 121 of 471 submissions (26%)
