We first describe the hardware testbed used for our experiments and discuss the ease of use of AIgean in its current state through a case study drawn from our own experiments. We then present more quantitative results by addressing the physical limits of the communication links. Finally, we present the current performance results of our first applications, starting with a small network that illustrates the latency benefits of using network-connected FPGAs, and then ResNet-50 as a test of whether we can implement a very large network.
5.4 Autoencoder
Here we describe our first small multi-FPGA network implemented with AIgean. We consider an example network with applications in high-energy physics. Specifically, our network is an autoencoder designed to detect anomalous events, potentially from new physics. An autoencoder is an unsupervised learning technique that leverages a neural network in which a bottleneck in the shape of the network forces a compressed representation of the original input. Details about the model and use cases can be found in Appendix A.2.
This network is an interesting size for our studies: it can be implemented on a single FPGA, but only with a high degree of resource reuse, which necessarily increases the inference latency. When splitting the network across multiple FPGAs, we can adjust its throughput and latency by changing the reuse factor and compiling the network across the devices. The multi-FPGA version has a higher throughput but incurs some latency from the transfer of intermediate results between FPGAs.
The resources for the autoencoder network are shown in Table 2, along with the resources available on the FPGAs we used. To test this autoencoder, we considered two separate implementations of the network: one on an AWS F1 instance (VU9P FPGA) using SDAccel, and a second using AIgean on three Sidewinder (ZU19EG FPGA) boards. Notably, the single-FPGA implementation would not fit on a single Sidewinder board and would have to be spread over multiple FPGAs for the chosen reuse factor. The single-FPGA implementation also requires more than one super logic region and, as a consequence, has difficulty meeting timing when compiled on the F1 instance with SDAccel.
Table 3 highlights the results from implementing the autoencoder on various devices, as well as on a single FPGA using SDAccel and on three FPGAs using AIgean.
Our 1-FPGA autoencoder is clocked at 125 MHz at a low reuse factor when using SDAccel. Limitations in our version of SDAccel, as well as the resources required on the FPGA, prevented us from using a higher clock speed. For the 3-FPGA version, we used AIgean and were able to achieve 200 MHz on two of the FPGAs and 190 MHz on the third. We did not attempt to improve the third board, so we take 190 MHz as the limiting clock. To make a fair comparison to the 1-FPGA implementation, we scale the AIgean latency by the ratio of clock speeds, giving \(0.08 \times 190/125 = 0.12\) ms, which is still \(0.24/0.12 = 2\) times better than the latency obtained with SDAccel. This shows that there is a significant architectural advantage to using multiple FPGAs, which is not unexpected given the additional available resources. The performance increase with three FPGAs can be attributed to (a) the use of networking to communicate directly with the FPGA, yielding low latency, and (b) lower resource demands per FPGA, since only one-third of the model is implemented on each FPGA.
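As a cross-check of the scaling above, the arithmetic can be written out explicitly (the numbers are those quoted in Table 3 and the text; the variable names are ours):

    # Clock-normalized comparison of the 1-FPGA (SDAccel) and 3-FPGA (AIgean)
    # autoencoder latencies.
    sdaccel_latency_ms = 0.24   # 1-FPGA latency at 125 MHz
    aigean_latency_ms = 0.08    # 3-FPGA latency, slowest clock 190 MHz
    sdaccel_clock_mhz = 125
    aigean_clock_mhz = 190

    # Rescale the AIgean latency as if it ran at the SDAccel clock.
    scaled_ms = aigean_latency_ms * aigean_clock_mhz / sdaccel_clock_mhz
    print(f"scaled latency = {scaled_ms:.2f} ms, "
          f"speedup = {sdaccel_latency_ms / scaled_ms:.1f}x")
    # -> scaled latency = 0.12 ms, speedup = 2.0x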
The implementations of this model on both a single FPGA and the full three FPGAs have an initiation interval of 552 clocks and require roughly the same resources (the reuse factor is the same). In other words, the three FPGAs are capable of processing a new image every 2.76 \(\mu s\) (362 kHz). Such a throughput approaches the demands of real-time anomaly detection at the LHC. Although the single-FPGA implementation with SDAccel has a potential throughput that is half that of the 3-FPGA implementation, achieving it would require efficiently buffering the inputs and outputs by sending larger batches of calls on and off the FPGA through the DDR and PCIe transfers. As a consequence, the individual (batch-1) latency would be significantly degraded for the final throughput to approach half that of the 3-FPGA implementation.
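The quoted throughput follows directly from the initiation interval and the clock frequency; a minimal check (our own variable names):

    clock_hz = 200e6            # AIgean clock frequency
    initiation_interval = 552   # clocks between successive images

    time_per_image_s = initiation_interval / clock_hz
    print(f"{time_per_image_s * 1e6:.2f} us per image, "
          f"{1.0 / time_per_image_s / 1e3:.0f} kHz")
    # -> 2.76 us per image, 362 kHz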
5.5 ResNet-50
To test AIgean on a much larger network, we have developed a multi-FPGA implementation of ResNet-50 [50]. The flexibility provided by AIgean allows us to target a high-throughput implementation in which we unroll the multiplications in each CNN layer at a rate corresponding to the number of pixels processed by that layer. This allows the design of ResNet-50 to be balanced across the different CNN layers so that they have a uniform throughput.
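The balancing can be illustrated with a simplified cost model (the layer shapes are the standard ResNet-50 values; the layer names and the unroll bookkeeping are our own simplification, not the exact hls4ml/AIgean implementation):

    # For uniform throughput, each layer should complete its work within the
    # same per-image clock budget, so layers with more multiplications need
    # more parallel multipliers (a smaller reuse factor).
    layers = [
        # (name, output pixels, multiplications per output pixel)
        ("conv1 7x7, 3->64",   112 * 112, 7 * 7 * 3 * 64),
        ("res2 3x3, 64->64",   56 * 56,   3 * 3 * 64 * 64),
        ("res5 3x3, 512->512", 7 * 7,     3 * 3 * 512 * 512),
    ]
    budget_clocks = 300_000  # per-image budget: 1.5 ms at 200 MHz

    for name, pixels, mults_per_pixel in layers:
        total = pixels * mults_per_pixel
        unroll = max(1, round(total / budget_clocks))
        print(f"{name}: {total / 1e6:.0f}M mults -> ~{unroll} parallel multipliers")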
Most of ResNet-50’s architecture can be broken down into many sub-blocks, each consisting of a Split, two to three convolutions followed by a ReLU, and an addition operator, as shown in Figure 6. The dashed boxes represent the IP-block granularity that we have used within our implementations.
We have two implementations of ResNet-50: the first requires 12 Sidewinder boards using int-8 precision (using roughly 80% to 90% of the resources on each FPGA); the second is more DSP efficient and requires 10 Sidewinder boards, also using int-8 precision. We have one additional FPGA available to use as a 100G data generator that can feed inputs to the FPGAs at line rate. The 12-FPGA configuration was tested in a piece-wise fashion: we tested the traffic generator with the first 10 of the 12 FPGAs, and then the traffic generator with the remaining 2 FPGAs. We have verified that both the full 10-FPGA configuration and the piece-wise 12-FPGA configuration run at 660 images per second.
Table 4 summarizes the throughput and latency results of our full 12-FPGA implementation of ResNet-50. When the source data comes from the CPU, we observe a maximum throughput of only 400 images per second with a latency of 7 ms, due to the bandwidth limitation between the CPU and the FPGA (5 ms of latency between the CPU and the FPGA). To demonstrate the full performance achievable with the FPGAs, we use the FPGA data generator and observe a throughput of 660 images per second with a latency of about 1.9 ms. The latency is determined through a simulation of the full ResNet-50 network in which each layer is run separately in parallel. The network delay between each FPGA is estimated from Table 1 using the QSFP link. For 10 hops, the total network delay would be 0.0017 ms, which is insignificant compared to the computation latency. The next row gives the values for Microsoft’s Brainwave [51]. For the latency of Brainwave, we quote the end-to-end latency determined by sending an image to a Brainwave server and receiving the result on a CPU within the same computing cluster. The final row shows the performance of an Nvidia V100 GPU using the mixed-precision implementation of ResNet-50 applied at batch 1. The latency and throughput quoted are obtained using the Triton inference server with a client on the same machine. As a consequence, the latency numbers include the PCIe transfer time in addition to the network inference. Equivalent numbers quoted by Nvidia yield a batch-2 latency of 1 ms with a throughput of 2,000 images per second for the same model [52]; batch-1 latency is not quoted.
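The claim that the inter-FPGA network delay is negligible can be checked directly from the numbers above (the per-hop figure is the value implied by the 0.0017-ms total over 10 hops, taken from Table 1):

    per_hop_delay_ms = 0.00017   # approximate QSFP hop latency from Table 1
    hops = 10                    # FPGA-to-FPGA hops in the chain
    compute_latency_ms = 1.9     # inference latency with the FPGA data generator

    network_delay_ms = per_hop_delay_ms * hops
    print(f"network delay = {network_delay_ms:.4f} ms "
          f"({100 * network_delay_ms / compute_latency_ms:.2f}% of the total)")
    # -> network delay = 0.0017 ms (0.09% of the total)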
Table 5 summarizes the resources used for our 12-FPGA implementation. Note that this was partitioned with our greedy partitioning scheme, which uses a heuristic of 80% utilization before allocating the next FPGA. The 10-FPGA configuration is very similar in terms of resources, but with half the DSPs. Another noteworthy detail is that a number of layers early in the network are smaller, and the FPGAs implementing them are DSP limited, whereas the FPGAs implementing the larger layers later in the network are logic limited. The highest resource utilization for each FPGA is shown in bold, representing the limiting factor of that FPGA (with the exception of the last FPGA, which is not fully used). For perspective, the total resources available on an individual FPGA are shown at the bottom of Table 5. This FPGA is approximately equivalent to a single SLR of the VU9P FPGA in the Amazon F1 instance (each VU9P having three SLRs) [53]. For further perspective, we can also compare this to the Xilinx Alveo U250 [54]. Our current utilization is DSP limited, and we could fit our entire ResNet-50 implementation on two Alveo U250 boards, given that the U250 board has 12.2K DSPs.
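A minimal sketch of the greedy partitioning heuristic mentioned above; the per-layer resource estimates and the device capacities are illustrative placeholders (approximate ZU19EG figures), not AIgean's actual interface:

    # Assign consecutive layers to FPGAs, opening a new device whenever any
    # tracked resource would exceed 80% of the device capacity.
    def greedy_partition(layer_costs, capacity, threshold=0.80):
        partitions, current, used = [], [], {k: 0 for k in capacity}
        for name, cost in layer_costs:
            fits = all(used[k] + cost[k] <= threshold * capacity[k] for k in capacity)
            if not fits and current:
                partitions.append(current)
                current, used = [], {k: 0 for k in capacity}
            current.append(name)
            for k in capacity:
                used[k] += cost[k]
        if current:
            partitions.append(current)
        return partitions

    # Hypothetical usage with made-up per-layer estimates:
    capacity = {"dsp": 1968, "lut": 522_720}   # approximate ZU19EG totals
    layers = [("conv1", {"dsp": 400, "lut": 60_000}),
              ("res2a", {"dsp": 900, "lut": 120_000}),
              ("res2b", {"dsp": 900, "lut": 120_000})]
    print(greedy_partition(layers, capacity))  # -> [['conv1', 'res2a'], ['res2b']]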
Last, we would like to contrast this implementation with previous implementations of ResNet-50. The AIgean design flow differs from previous 8-bit implementations of ResNet-50 in that no overlay is used, and each layer is implemented separately. In this scenario, it is possible to continuously stream images through the implementation without having to wait for an image to complete. With an overlay architecture, the images are streamed through each layer to a buffer; subsequent layers are then loaded and the next layer is streamed. As a consequence, a scheme is needed for buffering each input, and some time is needed to switch between layers. With the AIgean design flow, the whole network exists on the FPGA fabric, so images can be continuously pumped through. This leads to a more efficient use of multiplier resources, at the cost of additional resources to route the individual layers together. Since images are continuously pumped through, we achieve batch-1 streaming, and the buffering between layers is limited to just the partial image needed for the matrix multiplications of the CNN applied to nearby pixels.
To understand the efficiency of resource use, we compute the total number of multiplications needed for a perfectly efficient FPGA clocked at 200 MHz. For our implementation of ResNet-50, we find a total of 4B multiplications, which, when divided by the \(3 \times 10^{5}\) clocks available for a 1.5-ms latency at 200 MHz, yields approximately 13,500 multiplications per cycle. Our current implementation uses 15,419 DSPs, which is slightly more, because many of the individual layers are tuned to a latency that is actually below 1.5 ms. The number of DSPs can be reduced in two ways: first, through the sharing of DSPs, which is only partially implemented here, and second, through the use of a faster clock frequency. The sharing of DSPs would lead to roughly a factor of 2 reduction in DSPs, while a faster clock frequency would yield a lower latency for the same number of DSPs. Since each multiplier unit is mapped directly to a specific multiplication within the network, the only way DSP resources can be used inefficiently is when no allowed reuse parameter for a specific layer is near the desired throughput; in that case the individual layer has a significantly lower latency than its neighboring layers.
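Written out, the budget above is (the 4B multiplication count is the figure quoted in the text; everything else follows from it):

    total_mults = 4e9        # multiplications per ResNet-50 inference (quoted above)
    clock_hz = 200e6
    target_latency_s = 1.5e-3

    clocks = clock_hz * target_latency_s          # 3e5 clocks in the latency budget
    print(f"{clocks:.0f} clocks -> ~{total_mults / clocks:.0f} multipliers needed")
    # -> 300000 clocks -> ~13333 multipliers; the text quotes ~13,500, presumably
    #    from the unrounded multiplication count, versus the 15,419 DSPs actually used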
Adjusting the reuse parameter effectively modifies the initiation interval of each layer: a reuse factor of 5,000 corresponds to a layer with an initiation interval of 5,000. To adjust the reuse parameters efficiently with hls4ml, the reuse must split the dense matrix multiply embedded within the layer across DSPs so as to maintain a regular systolic-array architecture. As a consequence, only certain reuse values yield optimal implementations, and these values are determined by the number of input and output features of each layer. Our current implementation is near ideal, since the 1.5-ms target allows for a consistent set of reuse values that are near the 1.5-ms ideal latency point. To achieve a higher throughput, we need to adjust the reuse factor to the desired throughput and re-implement the whole design. Although this procedure requires a lot of computing, it is fully automated through the AIgean design flow.
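A sketch of how the allowed reuse values can be enumerated for a layer; this uses the common hls4ml rule of thumb that clean mappings want the reuse factor to divide the layer's multiplication count evenly (the exact hls4ml constraints are somewhat more involved):

    # A dense matrix multiply with n_in inputs and n_out outputs performs
    # n_in * n_out multiplications.  A reuse factor R spreads that work over
    # roughly n_in * n_out / R DSPs with an initiation interval of R clocks.
    def allowed_reuse_factors(n_in, n_out):
        total = n_in * n_out
        return [r for r in range(1, total + 1) if total % r == 0]

    def pick_reuse(n_in, n_out, target_ii):
        # Largest allowed reuse factor that keeps the layer at least as fast
        # as the target initiation interval.
        return max(r for r in allowed_reuse_factors(n_in, n_out) if r <= target_ii)

    r = pick_reuse(n_in=512, n_out=512, target_ii=5000)
    print(r, 512 * 512 // r)   # -> 4096 64  (reuse factor, implied DSP count)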
When adjusting the reuse factor, we observe a direct correlation with the number of DSPs: halving the reuse factor halves the initiation interval of the matrix multiply within a layer and doubles the number of DSPs. Flip-flops and LUTs do not change as significantly, since they largely exist to store partial images. BlockRAMs are used primarily to store the weights of the neural network on the FPGA; their second use is as buffers between layers. As a consequence, the BlockRAM resources do not change significantly with the reuse factor. In the current implementation, since DSP sharing of the multiplications is only partially used, the resulting resources are more consistent with a ResNet-50 implementation having a latency of roughly half the observed latency (0.75 ms).
Faster implementations of ResNet-50 are possible by adjusting the reuse factor. However, for CNNs, a lower bound is present in the current, pixel-by-pixel implementation of the algorithm. The lower bound results from the fact that each pixel streaming through the algorithm incurs a one-clock latency, and an additional latency of three clocks is needed to prepare the inputs for the matrix multiplication. For layers with many pixels, such as the first layer, the ultimate latency is limited by these operations. Applying this limit to the first layer of ResNet-50, we find that the single-layer latency is bounded below by roughly 0.4 ms. Lower single-inference latencies can still be achieved by splitting the image into sub-images and simultaneously streaming these sub-images into separate, cloned implementations of the chosen layer. Although the use of multiple streams effectively reduces the single-inference latency by a factor of (number of streams)\(^{-1}\), it comes at the cost of increasing the resources by the number of streams.
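For illustration, a rough estimate of this bound for the first ResNet-50 layer (224 x 224 input, 112 x 112 output after the stride-2 convolution); how the one-clock and three-clock costs are assigned is our own reading of the description above, so the result should be treated as approximate:

    clock_hz = 200e6
    input_pixels = 224 * 224     # pixels streamed into the first layer
    output_pixels = 112 * 112    # matrix-multiply invocations of the 7x7 conv

    # One clock per streamed input pixel plus three clocks of input preparation
    # per matrix-multiply invocation (assumed accounting).
    clocks = input_pixels * 1 + output_pixels * 3
    print(f"lower bound ~ {clocks / clock_hz * 1e3:.2f} ms")
    # -> ~0.44 ms, consistent with the ~0.4 ms quoted above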