research-article

FlexCL: A Model of Performance and Power for OpenCL Workloads on FPGAs

Authors:

Wei Zhang

IEEE Transactions on Computers, Volume 67, Issue 12

Pages 1750 - 1764

https://doi.org/10.1109/TC.2018.2840686

Published: 01 December 2018

Abstract

Hardware acceleration is a promising approach for energy- and thermally constrained systems. The programmable nature of FPGAs allows them to deliver high-performance, energy-efficient solutions. Unfortunately, the traditional RTL-based synthesis flow for FPGAs has prevented their wide adoption. In response, the recent adoption of the OpenCL programming model has made it possible to program FPGAs in a software-like manner. To harness the power of FPGAs through OpenCL, it is advantageous to have an analytical model that supports performance analysis and design space exploration and provides insight into performance bottlenecks. To this end, this paper presents FlexCL, an analytical performance and power model for OpenCL workloads on FPGAs. FlexCL uses static analysis to analyze OpenCL kernels. For performance estimation, it first develops systematic computation models for processing elements, compute units, and kernels by modeling operation scheduling, work-item and work-group scheduling, and resource constraints. It then models different global memory access patterns. Finally, FlexCL estimates overall performance by tightly coupling the memory and computation models according to the communication mode. FlexCL can also be used to guide performance and power trade-off analysis. Experiments show that the average performance and power estimation errors of FlexCL are 9.5 and 12.6 percent, respectively, for the Rodinia suite. The OpenCL model on FPGAs also exposes a rich optimization design space. With FlexCL, this design space can be explored with respect to both performance and power within seconds instead of hours or days.
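
As a rough illustration of the coupling idea described in the abstract, the following Python sketch combines a computation estimate and a global-memory estimate according to a communication mode, then filters a small design space for Pareto-optimal performance/power points. It is not FlexCL's actual model: the `Design` knobs, the overlap rule (`max` when transfers overlap computation, sum otherwise), the linear power formula, and every numeric constant are assumptions invented for this example.

```python
from dataclasses import dataclass
from itertools import product
from math import ceil


@dataclass(frozen=True)
class Design:
    """One hypothetical design point (all three knobs are illustrative)."""
    num_cu: int    # number of compute units
    num_pe: int    # processing elements per compute unit
    overlap: bool  # True if memory transfers overlap computation


def estimate_cycles(d, work_items=4096, cycles_per_item=20,
                    bytes_per_item=16, bytes_per_cycle=64):
    """Toy latency estimate; the constants are assumed, not measured."""
    # Computation: work-items are spread evenly over all PEs.
    compute = ceil(work_items / (d.num_cu * d.num_pe)) * cycles_per_item
    # Global memory: total traffic divided by an assumed bandwidth.
    memory = ceil(work_items * bytes_per_item / bytes_per_cycle)
    # Coupling: overlapped phases hide the shorter one, otherwise they add.
    return max(compute, memory) if d.overlap else compute + memory


def estimate_power(d, static_w=2.0, watts_per_pe=0.05):
    """Toy power estimate: static power plus a per-PE dynamic term."""
    return static_w + watts_per_pe * d.num_cu * d.num_pe


def pareto_front(points):
    """Keep points that no other point beats on both cycles and power."""
    front = []
    for d, c, p in points:
        dominated = any(
            c2 <= c and p2 <= p and (c2 < c or p2 < p)
            for _, c2, p2 in points
        )
        if not dominated:
            front.append((d, c, p))
    return front


def explore():
    """Enumerate a small design space and return its Pareto front."""
    designs = [Design(cu, pe, ov)
               for cu, pe, ov in product((1, 2, 4), (1, 2, 4, 8), (False, True))]
    points = [(d, estimate_cycles(d), estimate_power(d)) for d in designs]
    return sorted(pareto_front(points), key=lambda t: t[1])


if __name__ == "__main__":
    for design, cycles, power in explore():
        print(design, cycles, round(power, 2))
```

Because each estimate is a closed-form expression rather than a synthesis run, enumerating every design point is cheap, which is the property that makes analytical models attractive for design space exploration.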

Published In

IEEE Transactions on Computers, Volume 67, Issue 12 (Dec. 2018), 186 pages

ISSN: 0018-9340

0018-9340 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Publisher

IEEE Computer Society, United States
