Research Article

Elastic-DF: Scaling Performance of DNN Inference in FPGA Clouds through Automatic Partitioning

Published: 06 December 2021

Abstract

Customized compute acceleration in the datacenter is key to the wider roll-out of applications based on deep neural network (DNN) inference. In this article, we investigate how to maximize the performance and scalability of field-programmable gate array (FPGA)-based pipeline dataflow DNN inference accelerators (DFAs) automatically on computing infrastructures consisting of multi-die, network-connected FPGAs. We present Elastic-DF, a novel resource partitioning tool and associated FPGA runtime infrastructure that integrates with the DNN compiler FINN. Elastic-DF allocates FPGA resources to DNN layers and layers to individual FPGA dies to maximize the total performance of the multi-FPGA system. In the resulting Elastic-DF mapping, the accelerator may be instantiated multiple times, and each instance may be segmented across multiple FPGAs transparently, whereby the segments communicate peer-to-peer through 100 Gbps Ethernet FPGA infrastructure, without host involvement. When applied to ResNet-50, Elastic-DF provides a 44% latency decrease on Alveo U280. For MobileNetV1 on Alveo U200 and U280, Elastic-DF enables a 78% throughput increase, eliminating the performance difference between these cards and the larger Alveo U250. Elastic-DF also increases operating frequency in all our experiments, on average by over 20%. Elastic-DF therefore increases performance portability between different sizes of FPGA and increases the critical throughput per cost metric of datacenter inference.
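The core task the abstract describes, allocating consecutive DNN layers to FPGA dies so that no single die becomes the pipeline bottleneck, can be illustrated in miniature. The sketch below is not Elastic-DF's actual formulation (the paper's partitioner models multiple resource types and network links); it is a simplified, hypothetical reduction to the classic linear-partition problem, where `costs` stands in for abstract per-layer resource demands and `num_dies` for the available FPGA dies. The function name and interface are invented for illustration.

```python
def min_bottleneck_partition(costs, num_dies):
    """Split a pipeline of integer per-layer resource costs into at most
    `num_dies` contiguous segments, minimizing the total cost of the most
    heavily loaded segment (the bottleneck die).

    Returns (bottleneck_cost, cut_indices), where each cut index marks the
    first layer of a new segment.
    """
    def segments_needed(cap):
        # Greedy feasibility check: how many segments are required if no
        # segment may exceed a total cost of `cap`?
        count, load = 1, 0
        for c in costs:
            if load + c > cap:
                count, load = count + 1, c
            else:
                load += c
        return count

    # Binary-search the smallest feasible bottleneck cost.
    lo, hi = max(costs), sum(costs)
    while lo < hi:
        mid = (lo + hi) // 2
        if segments_needed(mid) <= num_dies:
            hi = mid
        else:
            lo = mid + 1

    # Recover the segment boundaries that achieve this bottleneck.
    cuts, load = [], 0
    for i, c in enumerate(costs):
        if load + c > lo:
            cuts.append(i)
            load = c
        else:
            load += c
    return lo, cuts


# Five layers, two dies: the best contiguous split is [4, 3, 2] | [6, 5],
# so the bottleneck die carries a load of 11.
print(min_bottleneck_partition([4, 3, 2, 6, 5], 2))
```

In a pipelined dataflow accelerator, throughput is set by the slowest segment, which is why minimizing the maximum per-die load (rather than total load) is the natural objective for this kind of segmentation.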



    Published In

    cover image ACM Transactions on Reconfigurable Technology and Systems
    ACM Transactions on Reconfigurable Technology and Systems  Volume 15, Issue 2
    June 2022
    310 pages
    ISSN:1936-7406
    EISSN:1936-7414
    DOI:10.1145/3501287
    • Editor:
    • Deming Chen
    Issue’s Table of Contents

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Received: 01 January 2021
Revised: 01 April 2021
Accepted: 01 June 2021
Published: 06 December 2021, in TRETS Volume 15, Issue 2


Author Tags

1. Deep neural networks
2. Partitioning
3. Distributed inference

Qualifiers

• Research-article
• Refereed

