Research Article
DOI: 10.1145/3629526.3653835

Accelerating ML Workloads using GPU Tensor Cores: The Good, the Bad, and the Ugly

Published: 07 May 2024

Abstract

Machine Learning (ML) workloads generally contain a significant amount of matrix computation; hence, hardware accelerators for ML increasingly include dedicated matrix units. With the popularity of GPUs as hardware accelerators for ML, specialized matrix accelerators are embedded into GPUs (e.g., Tensor Cores on NVIDIA GPUs) to significantly improve the performance and energy efficiency of ML workloads. NVIDIA Tensor Cores and other matrix accelerators are designed to support General Matrix-Matrix Multiplication (GEMM) for many data types. While previous research has demonstrated impressive performance gains with Tensor Cores, it has focused primarily on Convolutional Neural Networks (CNNs).
This paper explores Tensor Cores' performance on various workloads, including Graph Convolutional Networks (GCNs), on NVIDIA H100 and A100 GPUs. In our experiments, CNNs achieve 1.91x (TF32) and 2.42x (FP16) end-to-end speedups from Tensor Cores, whereas GCNs struggle to surpass a 1.03x (FP16) boost, and some implementations even slow down despite software transformations. Additionally, we explore the potential of Tensor Cores in non-GEMM-like kernels, providing insights into how software techniques can map diverse computation patterns onto Tensor Cores. Our investigation spans several kernels and end-to-end applications, aiming to characterize the nuanced performance impact of Tensor Cores. Furthermore, we are among the first to present a third-party evaluation of H100 GPU performance relative to the prior A100 GPU.



Published In

ICPE '24: Proceedings of the 15th ACM/SPEC International Conference on Performance Engineering
May 2024, 310 pages
ISBN: 9798400704444
DOI: 10.1145/3629526

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. machine learning
  2. matrix accelerators
  3. measurement
  4. performance evaluation
  5. workload characterization


Funding Sources

  • NVIDIA Applied Research Accelerator Program
  • Semiconductor Research Corporation
  • National Science Foundation

Conference

ICPE '24

Acceptance Rates

Overall Acceptance Rate: 252 of 851 submissions (30%)
