research-article

FlexCL: A Model of Performance and Power for OpenCL Workloads on FPGAs

Published: 01 December 2018

Abstract

Hardware acceleration is a promising direction for energy- and thermally-constrained systems. The programmable nature of FPGAs allows them to deliver high-performance, energy-efficient solutions. Unfortunately, the traditional RTL-based synthesis flow has hindered their wide adoption. In response, the recent adoption of the OpenCL programming model has made it possible to program FPGAs in a software-like manner. To harness the power of FPGAs through OpenCL, it is advantageous to have an analytical model that supports performance analysis and design space exploration and provides insight into performance bottlenecks. To this end, this paper presents FlexCL, an analytical performance and power model for OpenCL workloads on FPGAs. FlexCL uses static analysis to analyze OpenCL kernels. For performance estimation, it first develops systematic computation models for processing elements, compute units, and kernels by modeling operation scheduling, work-item and work-group scheduling, and resource constraints. Then, it models different global memory access patterns. Finally, FlexCL estimates overall performance by tightly coupling the memory and computation models according to the communication mode. FlexCL can also be used to guide performance and power trade-off analysis. Experiments demonstrate that the average performance and power estimation errors of FlexCL are 9.5 percent and 12.6 percent, respectively, for the Rodinia suite. The OpenCL model on FPGAs also exposes a rich optimization design space. With FlexCL, this design space can be explored with respect to both performance and power within seconds instead of hours or days.

Published In

IEEE Transactions on Computers, Volume 67, Issue 12
Dec. 2018
186 pages

Publisher

IEEE Computer Society

United States
