research-article · Public Access
DOI: 10.1145/3020078.3021727

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System

Published: 22 February 2017

Abstract

We present a novel mechanism to accelerate state-of-the-art Convolutional Neural Networks (CNNs) on a CPU-FPGA platform with coherent shared memory. First, we exploit the Fast Fourier Transform (FFT) and Overlap-and-Add (OaA) to reduce the computational requirements of the convolutional layers, and we map the frequency-domain algorithms onto a highly parallel OaA-based 2D convolver design on the FPGA. Then, we propose a novel data layout in shared memory for efficient data communication between the CPU and the FPGA. To reduce memory access latency and sustain the peak performance of the FPGA, our design employs double buffering. To hide the inter-layer data remapping latency, we exploit concurrent processing on the CPU and the FPGA. Our approach applies to any kernel size smaller than the chosen FFT size with appropriate zero padding, enabling acceleration of a wide range of CNN models. We exploit the data parallelism of the OaA-based 2D convolver and task parallelism to scale overall system performance. Using OaA, the number of floating-point operations is reduced by 39.14%-54.10% for the state-of-the-art CNNs. We implement VGG16, AlexNet, and GoogLeNet on the Intel QuickAssist QPI FPGA Platform; these designs sustain 123.48 GFLOPs/sec, 83.00 GFLOPs/sec, and 96.60 GFLOPs/sec, respectively. Compared with the state-of-the-art AlexNet implementation, our design achieves a 1.35x improvement in GFLOPs/sec using 3.33x fewer multipliers and 1.1x less memory. Compared with the state-of-the-art VGG16 implementation, our design delivers 0.66x the GFLOPs/sec using 3.48x fewer multipliers, without impacting classification accuracy. For GoogLeNet, our design achieves a 5.56x performance improvement over 16 threads running on a 10-core Intel Xeon processor at 2.8 GHz.
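As background for the Overlap-and-Add technique the abstract builds on, the sketch below shows how OaA computes a full 2D linear convolution from small per-tile FFTs. This is a minimal NumPy illustration of the textbook algorithm, not the paper's FPGA convolver; the function name, tile size, and array shapes are illustrative assumptions.

```python
import numpy as np

def oaa_conv2d(image, kernel, tile=8):
    """Full 2D linear convolution via Overlap-and-Add (OaA).

    The image is split into tile x tile blocks; each block is convolved
    with the kernel in the frequency domain using an FFT of size
    (tile + kernel_size - 1), and the overlapping partial outputs are
    summed into the final result.
    """
    kh, kw = kernel.shape
    H, W = image.shape
    fh, fw = tile + kh - 1, tile + kw - 1   # per-block FFT size
    out = np.zeros((H + kh - 1, W + kw - 1))
    # The kernel spectrum is computed once and reused for every block.
    Kf = np.fft.rfft2(kernel, s=(fh, fw))
    for i in range(0, H, tile):
        for j in range(0, W, tile):
            block = image[i:i + tile, j:j + tile]
            bh, bw = block.shape  # edge blocks may be smaller
            # Pointwise product in the frequency domain equals linear
            # convolution here, because the FFT size covers
            # block + kernel - 1 samples in each dimension.
            y = np.fft.irfft2(np.fft.rfft2(block, s=(fh, fw)) * Kf,
                              s=(fh, fw))
            # Overlap-and-add the partial result into the output.
            out[i:i + bh + kh - 1, j:j + bw + kw - 1] += \
                y[:bh + kh - 1, :bw + kw - 1]
    return out
```

The savings the paper quantifies come from this structure: direct convolution costs on the order of k^2 multiplications per output point, while the per-tile FFT approach amortizes the transform cost across the tile, and any kernel with k less than the chosen FFT size can reuse the same hardware after zero padding.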


      Published In

      FPGA '17: Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
      February 2017
      312 pages
      ISBN:9781450343541
      DOI:10.1145/3020078

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. CPU
      2. FPGA
      3. concurrent processing
      4. convolutional neural networks
      5. discrete fourier transform
      6. double buffering
      7. overlap-and-add
      8. shared memory

      Acceptance Rates

      FPGA '17 Paper Acceptance Rate 25 of 101 submissions, 25%;
      Overall Acceptance Rate 125 of 627 submissions, 20%
