Research Article | Public Access
DOI: 10.1145/3559009.3569668

Custom High-Performance Vector Code Generation for Data-Specific Sparse Computations

Published: 27 January 2023

Abstract

Sparse computations, such as sparse matrix-dense vector multiplication (SpMV), are notoriously hard to optimize due to their irregularity and memory-boundedness. Proposed solutions range from hardware support, such as gather-scatter instructions, to software approaches, such as generalized and dedicated sparse formats used together with specialized executor programs for different hardware targets. These computations are often performed on read-only sparse structures: while the data values themselves may change, the sparsity structure does not. Indeed, sparse formats such as CSR typically make it expensive to insert or remove nonzero elements in the representation, and the common use case is to leave the sparsity structure untouched across possibly repeated computations on the same structure.
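
For concreteness, the generic executor implied by such formats is the familiar CSR SpMV kernel, sketched below in textbook form (an illustrative sketch, not code from the paper). The col_idx indirection array is what causes the irregular, gather-like accesses into the dense vector x:

    /*
     * Generic CSR SpMV: y = A * x.
     * row_ptr has n_rows + 1 entries; col_idx and val have one entry per
     * nonzero. The indirect load x[col_idx[k]] is the gather-like access
     * that makes this loop hard to vectorize efficiently.
     */
    #include <stddef.h>

    void spmv_csr(size_t n_rows, const size_t *row_ptr, const size_t *col_idx,
                  const double *val, const double *x, double *y)
    {
        for (size_t i = 0; i < n_rows; ++i) {
            double acc = 0.0;
            for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                acc += val[k] * x[col_idx[k]];   /* indirection into x */
            y[i] = acc;
        }
    }
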
In this work, we exploit the possibility of generating a specialized executor program dedicated to the particular sparsity structure of an input matrix. This creates opportunities to remove indirection arrays and synthesize regular, vectorizable code for such computations, but it also introduces challenges in code size, instruction generation, and efficient SIMD vectorization. We present novel techniques and extensive experimental results for efficiently generating SIMD vector code for data-specific sparse computations, and we study the limits of our techniques, in terms of applicability and performance, compared to state-of-practice high-performance libraries such as Intel MKL.
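
To make the idea concrete, the following hedged sketch shows what a generated, data-specific executor could look like for a small hypothetical matrix whose sparsity structure is fixed at code-generation time (made-up row contents, AVX intrinsics chosen only for illustration; this is not the paper's actual generator). The CSR indirection arrays disappear, a contiguous run of nonzeros becomes a plain vector load, and isolated nonzeros stay scalar:

    /*
     * Hypothetical data-specific executor for one fixed matrix:
     * row 0 has nonzeros at columns 2..5 (a contiguous run of 4),
     * row 1 has isolated nonzeros at columns 0 and 7.
     * Column indices are compile-time constants, so CSR's row_ptr/col_idx
     * arrays are gone. Illustrative sketch only, not the paper's generator.
     */
    #include <immintrin.h>

    void spmv_specialized(const double *val, const double *x, double *y)
    {
        /* Row 0: 4-wide contiguous run -> vector multiply, then reduce. */
        __m256d v  = _mm256_loadu_pd(&val[0]);
        __m256d xv = _mm256_loadu_pd(&x[2]);
        __m256d p  = _mm256_mul_pd(v, xv);
        __m128d lo = _mm256_castpd256_pd128(p);
        __m128d hi = _mm256_extractf128_pd(p, 1);
        __m128d s  = _mm_add_pd(lo, hi);
        y[0] = _mm_cvtsd_f64(_mm_add_sd(s, _mm_unpackhi_pd(s, s)));

        /* Row 1: isolated nonzeros stay scalar, still with constant indices. */
        y[1] = val[4] * x[0] + val[5] * x[7];

        /* ...remaining rows would be emitted the same way... */
    }

Emitting such straight-line fragments for every row of a large matrix is precisely where the code-size and instruction-generation challenges mentioned above come from.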

Information

Published In

PACT '22: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques
October 2022
569 pages
ISBN:9781450398688
DOI:10.1145/3559009
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

In-Cooperation

  • IFIP WG 10.3
  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data-specific compilation
  2. sparse data structure
  3. vectorization

Qualifiers

  • Research-article

Conference

PACT '22

Acceptance Rates

Overall Acceptance Rate: 121 of 471 submissions (26%)
