Sparse MTTKRP Acceleration for Tensor Decomposition on GPU

Authors:

Sasindu Wijeratne,

Rajgopal Kannan,

Viktor Prasanna

CF '24: Proceedings of the 21st ACM International Conference on Computing Frontiers

Pages 88 - 96

https://doi.org/10.1145/3649153.3649187

Published: 02 July 2024

Abstract

Sparse Matricized Tensor Times Khatri-Rao Product (spMTTKRP) is the bottleneck kernel of sparse tensor decomposition. In this work, we propose a GPU-based algorithm design that addresses the key challenges in accelerating spMTTKRP computation: (1) eliminating global atomic operations across GPU thread blocks, (2) avoiding the communication of intermediate values between GPU thread blocks and GPU global memory, and (3) ensuring a balanced distribution of workloads across GPU thread blocks. Our approach also supports dynamic tensor remapping, which enables these optimizations in all modes of the input tensor. Across widely used datasets, our approach achieves geometric mean speedups of 1.5×, 2.0×, and 21.7× in total execution time compared with the state-of-the-art GPU implementations. Our work is the only GPU implementation that supports tensors with more than 4 modes, since the state-of-the-art works have implementation constraints for tensors with a large number of modes.
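
For readers unfamiliar with the kernel, the following is a minimal sequential sketch of spMTTKRP on a tensor stored in coordinate (COO) form. It is not the GPU algorithm proposed in the paper: the function name `sp_mttkrp`, the list-of-lists data layout, and the toy tensor are all assumptions chosen for illustration. For each nonzero, the sketch multiplies the value by the matching rows of every factor matrix except the output mode and accumulates the result into one output row. That accumulation step is where concurrent updates to the same row would arise in a parallel version.

```python
def sp_mttkrp(coords, values, factors, mode):
    """Sequential COO spMTTKRP along `mode` (illustrative sketch only).

    coords  : list of index tuples, one per nonzero
    values  : list of nonzero values, same order as coords
    factors : one dense factor matrix (list of rows) per tensor mode
    mode    : the mode whose output matrix is computed
    """
    rank = len(factors[0][0])
    out = [[0.0] * rank for _ in range(len(factors[mode]))]
    for idx, val in zip(coords, values):
        # Start from the nonzero value and scale by each other mode's factor row.
        row = [val] * rank
        for m, factor in enumerate(factors):
            if m == mode:
                continue
            factor_row = factor[idx[m]]
            for r in range(rank):
                row[r] *= factor_row[r]
        # Accumulate into the output row selected by the mode coordinate.
        target = out[idx[mode]]
        for r in range(rank):
            target[r] += row[r]
    return out


# Hypothetical 2 x 2 x 2 tensor with three nonzeros and rank-2 factors.
coords = [(0, 0, 0), (0, 1, 1), (1, 0, 1)]
values = [1.0, 2.0, 3.0]
factors = [
    [[0, 0], [0, 0]],  # mode-0 factor: only its shape is used when mode=0
    [[1, 2], [3, 4]],  # mode-1 factor
    [[5, 6], [7, 8]],  # mode-2 factor
]
result = sp_mttkrp(coords, values, factors, mode=0)
print(result)  # [[47.0, 76.0], [21.0, 48.0]]
```

In this toy input the first two nonzeros share output row 0, so a parallel version that assigns them to different workers would need some form of synchronized update or a data layout that keeps them together.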

Index Terms

Sparse MTTKRP Acceleration for Tensor Decomposition on GPU
1. Computing methodologies
  1. Concurrent computing methodologies
    1. Concurrent algorithms
  2. Parallel computing methodologies
    1. Parallel algorithms
      1. Massively parallel algorithms
      2. Shared memory algorithms

Published In

CF '24: Proceedings of the 21st ACM International Conference on Computing Frontiers

May 2024

345 pages

ISBN:9798400705977

DOI:10.1145/3649153

Copyright © 2024 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

NSF
DARPA

Conference

CF '24

Sponsor:

SIGMICRO

CF '24: 21st ACM International Conference on Computing Frontiers

May 7 - 9, 2024

Ischia, Italy

Acceptance Rates

CF '24 Paper Acceptance Rate 33 of 105 submissions, 31%;

Overall Acceptance Rate 273 of 785 submissions, 35%
