DOI: 10.5555/3571885.3571972

AlphaSparse: generating high performance SpMV codes directly from sparse matrices

Published: 18 November 2022

Abstract

Sparse matrix-vector multiplication (SpMV) is an essential computational kernel in many application scenarios. Tens of sparse matrix formats and implementations have been proposed to compress memory storage and speed up SpMV performance. We develop AlphaSparse, a superset of all existing works that goes beyond the scope of human-designed formats and implementations. AlphaSparse automatically creates novel machine-designed formats and SpMV kernel implementations entirely from knowledge of the input sparsity pattern and the hardware architecture. Built on our proposed Operator Graph, which expresses the design path of an SpMV format and kernel, AlphaSparse consists of three main components: a Designer, a Format & Kernel Generator, and a Search Engine. It takes an arbitrary sparse matrix as input and outputs a high-performance machine-designed format and SpMV implementation. In an extensive evaluation on 843 matrices from the SuiteSparse Matrix Collection, AlphaSparse achieves significant performance improvements: 3.2× on average over five state-of-the-art human-designed formats, and 1.5× on average (up to 2.7×) over an up-to-date implementation of the traditional auto-tuning philosophy.

Supplementary Material

MP4 File (SC22_Presentation_Du.mp4)
Presentation at SC '22



Published In

SC '22: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2022
1277 pages
ISBN: 9784665454445

In-Cooperation

  • IEEE CS

Publisher

IEEE Press

Author Tags

  1. GPU
  2. SpMV
  3. auto-tuner
  4. code generator
  5. sparse data structures
  6. sparse matrix-vector multiplication

Qualifiers

  • Research-article

Conference

SC '22

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

