
EF-Train: Enable Efficient On-device CNN Training on FPGA through Data Reshaping for Online Adaptation or Personalization

Published: 06 June 2022

Abstract

Conventionally, DNN models are trained once in the cloud and then deployed on edge devices such as cars, robots, or unmanned aerial vehicles (UAVs) for real-time inference. In many cases, however, the models must adapt to new environments, domains, or users. To realize such domain adaptation or personalization, the models need to be continuously trained on the device. In this work, we design EF-Train, an efficient DNN training accelerator with a unified, channel-level-parallelism-based convolution kernel that achieves end-to-end training on resource-limited, low-power edge-level FPGAs. On-device training is difficult to implement efficiently on such FPGAs because forward propagation, backward propagation, and weight update exhibit different memory access patterns. We therefore develop a data reshaping approach with intra-tile continuous memory allocation and weight reuse. An analytical model automatically schedules computation and memory resources to achieve high energy efficiency on edge FPGAs. Experimental results show that our design achieves a throughput of 46.99 GFLOPS and an energy efficiency of 6.09 GFLOPS/W.
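To make the data-reshaping idea concrete, the sketch below is our own minimal illustration (not the authors' FPGA implementation, whose tile sizes and layout are not given here): it reorders a channels-by-rows-by-columns feature map into per-tile buffers so that each tile occupies one contiguous block of memory, the kind of layout that lets a DMA engine fetch a tile in a single burst regardless of whether forward propagation, backward propagation, or weight update is consuming it. The function name and tile parameters are illustrative assumptions.

```python
def reshape_to_tiles(fmap, C, H, W, Tc, Th, Tw):
    """Flatten fmap[C][H][W] into a list of contiguous tiles.

    Each tile covers Tc channels x Th rows x Tw columns and is stored
    as one flat list, modeling intra-tile continuous memory allocation:
    all data a compute tile needs sits in one contiguous block.
    Hypothetical helper for illustration only.
    """
    tiles = []
    for c0 in range(0, C, Tc):
        for h0 in range(0, H, Th):
            for w0 in range(0, W, Tw):
                tile = []
                for c in range(c0, min(c0 + Tc, C)):
                    for h in range(h0, min(h0 + Th, H)):
                        for w in range(w0, min(w0 + Tw, W)):
                            tile.append(fmap[c][h][w])
                tiles.append(tile)
    return tiles

# Tiny example: 4 channels of a 4x4 map, tiled as 2x2x2.
C, H, W = 4, 4, 4
fmap = [[[c * 100 + h * 10 + w for w in range(W)]
         for h in range(H)] for c in range(C)]
tiles = reshape_to_tiles(fmap, C, H, W, Tc=2, Th=2, Tw=2)
print(len(tiles), len(tiles[0]))  # 8 tiles of 8 elements each
```

In a row-major C-H-W layout, the eight elements of one 2x2x2 tile are scattered across four rows and two channel planes; after reshaping they are adjacent, so one burst read replaces many strided accesses.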


Cited By

  • (2025) Highly Parallel CNN Accelerator for RepVGG-Like Network Training on FPGAs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 44, 2 (Feb. 2025), 554–558. DOI: 10.1109/TCAD.2024.3445264
  • (2024) FiberFlex: Real-time FPGA-based Intelligent and Distributed Fiber Sensor System for Pedestrian Recognition. ACM Transactions on Reconfigurable Technology and Systems 17, 4 (Aug. 2024), 1–30. DOI: 10.1145/3690389
  • (2024) SSR: Spatial Sequential Hybrid Architecture for Latency Throughput Tradeoff in Transformer Acceleration. In Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 55–66. DOI: 10.1145/3626202.3637569



    Published In

    ACM Transactions on Design Automation of Electronic Systems, Volume 27, Issue 5 (September 2022), 274 pages
    ISSN: 1084-4309
    EISSN: 1557-7309
    DOI: 10.1145/3540253

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Publication History

    Published: 06 June 2022
    Online AM: 24 February 2022
    Accepted: 01 December 2021
    Revised: 01 December 2021
    Received: 01 June 2021
    Published in TODAES Volume 27, Issue 5


    Author Tags

    1. On-device training
    2. edge FPGAs
    3. data reshaping

    Qualifiers

    • Research-article
    • Refereed
