DOI: 10.1145/3613424.3614299
Research Article

Improving Data Reuse in NPU On-chip Memory with Interleaved Gradient Order for DNN Training

Published: 08 December 2023

Abstract

When training machine learning models on neural processing units (NPUs), the most time-consuming part is the backward pass, which incurs significant overheads due to off-chip memory accesses. To mitigate the long latency and limited bandwidth of such off-chip DRAM accesses, NPUs rely on software-managed on-chip scratchpad memory (SPM). Since the backward pass must be optimized to make the SPM effective, this study identifies a new data reuse pattern specific to backward computation: in each layer, the backward pass consists of independent input-gradient and weight-gradient computations that share the same output gradient. Conventional sequential processing does not exploit this inter-operation data reuse opportunity within the SPM. To exploit the new reuse opportunity in the backward pass, this study proposes a data flow transformation scheme called interleaved gradient order, consisting of three techniques that enhance the utilization of the NPU scratchpad memory. The first technique interleaves the input- and weight-gradient computations into a single fused operation to reduce redundant output-gradient accesses. The second technique adjusts the tile access order for the interleaved gradient computations to maximize data locality; since the best order is not fixed for all tensors, a selection algorithm chooses the most suitable order based on the tensor dimensions. The final technique further improves data reuse by applying the best partitioning and mapping scheme for the two gradient computations on single-core and multi-core NPUs. A simulation-based evaluation with single-core edge and server NPUs shows that the combined techniques improve performance by 29.3% and 14.5% for edge and server NPUs, respectively. Furthermore, with a quad-core server NPU, the proposed techniques reduce execution time by 23.7%.
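
The first technique can be made concrete with a short sketch. The following Python/NumPy code is a minimal illustration under assumed parameters, not the authors' NPU implementation: the fully connected layer shapes, the tile size, and the fetch counter standing in for SPM fills are all illustrative. It contrasts the conventional sequential backward pass, which streams the output gradient dY from memory once for the input gradient and again for the weight gradient, with an interleaved order that fetches each dY tile once and reuses it for both gradient computations.

import numpy as np

def backward_sequential(X, W, dY, tile=64):
    """Conventional order: compute all of dX, then all of dW; each dY tile is fetched twice."""
    dY_fetches = 0
    dX = np.zeros_like(X)
    dW = np.zeros_like(W)
    for r in range(0, dY.shape[0], tile):
        dY_tile = dY[r:r + tile]            # fetch dY tile for the input gradient
        dY_fetches += 1
        dX[r:r + tile] = dY_tile @ W.T      # dX = dY @ W^T
    for r in range(0, dY.shape[0], tile):
        dY_tile = dY[r:r + tile]            # fetch the same dY tile again for the weight gradient
        dY_fetches += 1
        dW += X[r:r + tile].T @ dY_tile     # dW = X^T @ dY, accumulated over tiles
    return dX, dW, dY_fetches

def backward_interleaved(X, W, dY, tile=64):
    """Interleaved gradient order: each dY tile is fetched once and reused for dX and dW."""
    dY_fetches = 0
    dX = np.zeros_like(X)
    dW = np.zeros_like(W)
    for r in range(0, dY.shape[0], tile):
        dY_tile = dY[r:r + tile]            # single fetch, shared by both gradient computations
        dY_fetches += 1
        dX[r:r + tile] = dY_tile @ W.T
        dW += X[r:r + tile].T @ dY_tile
    return dX, dW, dY_fetches

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((256, 128))     # layer input
    W = rng.standard_normal((128, 512))     # layer weights
    dY = rng.standard_normal((256, 512))    # output gradient shared by both computations
    seq = backward_sequential(X, W, dY)
    ilv = backward_interleaved(X, W, dY)
    assert np.allclose(seq[0], ilv[0]) and np.allclose(seq[1], ilv[1])
    print(f"dY tile fetches: sequential = {seq[2]}, interleaved = {ilv[2]}")

Both orders produce identical gradients; only the number of output-gradient fetches differs (8 versus 4 in this toy configuration). That redundancy is what the interleaved order removes, and on an NPU it translates into fewer off-chip DRAM accesses and better reuse of data already resident in the SPM.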



      Information & Contributors

      Information

      Published In

      MICRO '23: Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture
October 2023
1528 pages
ISBN: 9798400703294
DOI: 10.1145/3613424


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 08 December 2023


      Author Tags

      1. DNN training
      2. accelerators
      3. on-chip memory
      4. scheduling

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      MICRO '23

      Acceptance Rates

      Overall Acceptance Rate 484 of 2,242 submissions, 22%
