
EF-Train: Enable Efficient On-device CNN Training on FPGA through Data Reshaping for Online Adaptation or Personalization

Published: 06 June 2022

Abstract

Conventionally, DNN models are trained once in the cloud and then deployed on edge devices such as cars, robots, or unmanned aerial vehicles (UAVs) for real-time inference. In many cases, however, the models must adapt to new environments, domains, or users. To realize such domain adaptation or personalization, the models need to be continuously trained on the device. In this work, we design EF-Train, an efficient DNN training accelerator with a unified, channel-level-parallelism-based convolution kernel that achieves end-to-end training on resource-limited, low-power edge-level FPGAs. On-device training is difficult to implement efficiently on such FPGAs because forward propagation, backward propagation, and weight update exhibit different memory access patterns. We therefore develop a data reshaping approach with intra-tile continuous memory allocation and weight reuse. An analytical model automatically schedules computation and memory resources to achieve high energy efficiency on edge FPGAs. Experimental results show that our design achieves a throughput of 46.99 GFLOPS and an energy efficiency of 6.09 GFLOPS/W.
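To make the data-reshaping idea concrete, the sketch below is our own minimal illustration (not the authors' FPGA implementation, whose tile sizes and layout are not given here): it reorders a channels-by-rows-by-columns feature map into per-tile buffers so that each tile occupies one contiguous block of memory, the kind of layout that lets a DMA engine fetch a tile in a single burst regardless of whether forward propagation, backward propagation, or weight update is consuming it. The function name and tile parameters are illustrative assumptions.

```python
def reshape_to_tiles(fmap, C, H, W, Tc, Th, Tw):
    """Flatten fmap[C][H][W] into a list of contiguous tiles.

    Each tile covers Tc channels x Th rows x Tw columns and is stored
    as one flat list, modeling intra-tile continuous memory allocation:
    all data a compute tile needs sits in one contiguous block.
    Hypothetical helper for illustration only.
    """
    tiles = []
    for c0 in range(0, C, Tc):
        for h0 in range(0, H, Th):
            for w0 in range(0, W, Tw):
                tile = []
                for c in range(c0, min(c0 + Tc, C)):
                    for h in range(h0, min(h0 + Th, H)):
                        for w in range(w0, min(w0 + Tw, W)):
                            tile.append(fmap[c][h][w])
                tiles.append(tile)
    return tiles

# Tiny example: 4 channels of a 4x4 map, tiled as 2x2x2.
C, H, W = 4, 4, 4
fmap = [[[c * 100 + h * 10 + w for w in range(W)]
         for h in range(H)] for c in range(C)]
tiles = reshape_to_tiles(fmap, C, H, W, Tc=2, Th=2, Tw=2)
print(len(tiles), len(tiles[0]))  # 8 tiles of 8 elements each
```

In a row-major C-H-W layout, the eight elements of one 2x2x2 tile are scattered across four rows and two channel planes; after reshaping they are adjacent, so one burst read replaces many strided accesses.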


Cited By

  • (2025) Highly Parallel CNN Accelerator for RepVGG-Like Network Training on FPGAs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 44, 2 (Feb. 2025), 554–558. DOI: 10.1109/TCAD.2024.3445264
  • (2024) FiberFlex: Real-time FPGA-based Intelligent and Distributed Fiber Sensor System for Pedestrian Recognition. ACM Transactions on Reconfigurable Technology and Systems 17, 4 (Aug. 2024), 1–30. DOI: 10.1145/3690389
  • (2024) SSR: Spatial Sequential Hybrid Architecture for Latency Throughput Tradeoff in Transformer Acceleration. In Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 55–66. DOI: 10.1145/3626202.3637569



    Published In

    ACM Transactions on Design Automation of Electronic Systems, Volume 27, Issue 5 (September 2022), 274 pages
    ISSN: 1084-4309
    EISSN: 1557-7309
    DOI: 10.1145/3540253

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Publication History

    Published: 06 June 2022
    Online AM: 24 February 2022
    Accepted: 01 December 2021
    Revised: 01 December 2021
    Received: 01 June 2021
    Published in TODAES Volume 27, Issue 5


    Author Tags

    1. On-device training
    2. edge FPGAs
    3. data reshaping

    Qualifiers

    • Research-article
    • Refereed
