DOI: 10.1145/2847263.2847276

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks

Published: 21 February 2016

Abstract

Convolutional Neural Networks (CNNs) have gained popularity in many computer vision applications, such as image classification, face detection, and video analysis, because of the high accuracy they achieve in training and classification. Because their many convolution and fully-connected layers are both compute- and memory-intensive, it is difficult to perform real-time classification with low power consumption on today's computing systems. FPGAs have been widely explored as hardware accelerators for CNNs because of their reconfigurability and energy efficiency, as well as their fast turnaround time, especially with high-level synthesis methodologies. Previous FPGA-based CNN accelerators, however, were typically generic designs agnostic to the CNN configuration, so the reconfigurable capabilities of FPGAs were not fully leveraged to maximize overall system throughput. In this work, we present a systematic design space exploration methodology to maximize the throughput of an OpenCL-based FPGA accelerator for a given CNN model, considering FPGA resource constraints such as on-chip memory, registers, computational resources, and external memory bandwidth. The proposed methodology is demonstrated by optimizing two representative large-scale CNNs, AlexNet and VGG, on two Altera Stratix-V FPGA platforms, the DE5-Net and P395-D8 boards, which have different hardware resources. We achieve a peak performance of 136.5 GOPS for the convolution operations, and 117.8 GOPS for the entire VGG network performing ImageNet classification on the P395-D8 board.
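
To make the abstract's optimization problem concrete, the sketch below shows the general shape of such a design space sweep: enumerate candidate parallelism factors, prune configurations that exceed the board's resource budgets, and keep the one with the highest estimated throughput. This is a minimal illustration, not the paper's actual model; the compute-unit/SIMD parameterization, the resource budgets, and the cost formulas are all hypothetical placeholders.

```python
# Hypothetical sketch of a throughput-oriented design-space sweep for an
# FPGA CNN accelerator. Budgets and cost formulas are illustrative
# placeholders, NOT the model used in the paper.
from itertools import product

# Illustrative per-board budgets (placeholder numbers).
DSP_BUDGET = 1963          # DSP blocks available
BRAM_BUDGET_KB = 6270      # on-chip memory in KB
BW_BUDGET_GBPS = 25.6      # external memory bandwidth

def resources(cu, simd):
    """Rough resource estimate for `cu` compute units, each `simd` lanes wide."""
    dsp = cu * simd                    # one multiplier per SIMD lane (placeholder)
    bram_kb = cu * 64 + simd * 16      # buffer cost grows with both factors (placeholder)
    return dsp, bram_kb

def throughput_gops(cu, simd, freq_mhz=200.0):
    """Peak GOPS if compute-bound; one MAC per lane per cycle, MAC = 2 ops."""
    return 2.0 * cu * simd * freq_mhz / 1000.0

best = None
for cu, simd in product([1, 2, 4, 8, 16], [1, 2, 4, 8, 16]):
    dsp, bram_kb = resources(cu, simd)
    if dsp > DSP_BUDGET or bram_kb > BRAM_BUDGET_KB:
        continue                       # violates an on-chip resource budget
    gops = throughput_gops(cu, simd)
    bw_needed = gops * 0.1             # placeholder bytes-per-op ratio (GB/s)
    if bw_needed > BW_BUDGET_GBPS:
        continue                       # external memory becomes the bottleneck
    if best is None or gops > best[0]:
        best = (gops, cu, simd)

print(f"best config: {best[1]} CUs x {best[2]}-wide SIMD -> {best[0]:.1f} GOPS")
```

In a real flow, the placeholder formulas would be replaced by per-layer estimates calibrated against synthesis reports, but the enumerate, prune, and rank structure of the sweep stays the same.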


    Published In

    FPGA '16: Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
    February 2016
    298 pages
    ISBN:9781450338561
    DOI:10.1145/2847263


    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

1. convolutional neural networks
2. FPGA
3. OpenCL
4. optimization

    Qualifiers

    • Research-article

    Conference

    FPGA'16

    Acceptance Rates

FPGA '16 Paper Acceptance Rate: 20 of 111 submissions, 18%
Overall Acceptance Rate: 125 of 627 submissions, 20%

