Research Article

Elastic-DF: Scaling Performance of DNN Inference in FPGA Clouds through Automatic Partitioning

Published: 06 December 2021

Abstract

Customized compute acceleration in the datacenter is key to the wider roll-out of applications based on deep neural network (DNN) inference. In this article, we investigate how to maximize the performance and scalability of field-programmable gate array (FPGA)-based pipeline dataflow DNN inference accelerators (DFAs) automatically on computing infrastructures consisting of multi-die, network-connected FPGAs. We present Elastic-DF, a novel resource partitioning tool and associated FPGA runtime infrastructure that integrates with the DNN compiler FINN. Elastic-DF allocates FPGA resources to DNN layers and layers to individual FPGA dies to maximize the total performance of the multi-FPGA system. In the resulting Elastic-DF mapping, the accelerator may be instantiated multiple times, and each instance may be segmented across multiple FPGAs transparently, whereby the segments communicate peer-to-peer through 100 Gbps Ethernet FPGA infrastructure, without host involvement. When applied to ResNet-50, Elastic-DF provides a 44% latency decrease on Alveo U280. For MobileNetV1 on Alveo U200 and U280, Elastic-DF enables a 78% throughput increase, eliminating the performance difference between these cards and the larger Alveo U250. Elastic-DF also increases operating frequency in all our experiments, on average by over 20%. Elastic-DF therefore increases performance portability between different sizes of FPGA and increases the critical throughput per cost metric of datacenter inference.
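The core task the abstract describes, allocating consecutive DNN layers to FPGA dies so that no single die becomes the pipeline bottleneck, can be illustrated in miniature. The sketch below is not Elastic-DF's actual formulation (the paper's partitioner models multiple resource types and network links); it is a simplified, hypothetical reduction to the classic linear-partition problem, where `costs` stands in for abstract per-layer resource demands and `num_dies` for the available FPGA dies. The function name and interface are invented for illustration.

```python
def min_bottleneck_partition(costs, num_dies):
    """Split a pipeline of integer per-layer resource costs into at most
    `num_dies` contiguous segments, minimizing the total cost of the most
    heavily loaded segment (the bottleneck die).

    Returns (bottleneck_cost, cut_indices), where each cut index marks the
    first layer of a new segment.
    """
    def segments_needed(cap):
        # Greedy feasibility check: how many segments are required if no
        # segment may exceed a total cost of `cap`?
        count, load = 1, 0
        for c in costs:
            if load + c > cap:
                count, load = count + 1, c
            else:
                load += c
        return count

    # Binary-search the smallest feasible bottleneck cost.
    lo, hi = max(costs), sum(costs)
    while lo < hi:
        mid = (lo + hi) // 2
        if segments_needed(mid) <= num_dies:
            hi = mid
        else:
            lo = mid + 1

    # Recover the segment boundaries that achieve this bottleneck.
    cuts, load = [], 0
    for i, c in enumerate(costs):
        if load + c > lo:
            cuts.append(i)
            load = c
        else:
            load += c
    return lo, cuts


# Five layers, two dies: the best contiguous split is [4, 3, 2] | [6, 5],
# so the bottleneck die carries a load of 11.
print(min_bottleneck_partition([4, 3, 2, 6, 5], 2))
```

In a pipelined dataflow accelerator, throughput is set by the slowest segment, which is why minimizing the maximum per-die load (rather than total load) is the natural objective for this kind of segmentation.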



    Published In

    cover image ACM Transactions on Reconfigurable Technology and Systems
    ACM Transactions on Reconfigurable Technology and Systems  Volume 15, Issue 2
    June 2022
    310 pages
    ISSN:1936-7406
    EISSN:1936-7414
    DOI:10.1145/3501287
    • Editor:
    • Deming Chen
    Issue’s Table of Contents

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Received: 01 January 2021
Revised: 01 April 2021
Accepted: 01 June 2021
Published: 06 December 2021, in TRETS Volume 15, Issue 2


Author Tags

1. Deep neural networks
2. Partitioning
3. Distributed inference

Qualifiers

• Research-article
• Refereed

