4.2 Impact of Hardware Optimizations
To evaluate the cumulative benefits introduced by each hardware optimization, we progressively apply them to the Inverse Helmholtz operator used as a case study and extensively discussed in this article. We evaluated the implementations for two polynomial degrees: \(p=7\) and \(p=11\). In particular, given the polynomial degree \(p\), we assume that each tensor contraction is composed of three loop nests, each executing two floating-point operations (one addition and one multiplication) \(p\times p\times p\times p\) times. Similarly, the Hadamard product requires \(p\times p\times p\) multiplications. So, the entire Inverse Helmholtz operator (two contractions and one Hadamard product) requires the following number of floating-point operations:
\[
N^{el}_{op} = 2 \times (3 \times 2 \times p^4) + p^3 = 12\,p^4 + p^3.
\]
So a single element is required to execute \(N^{el}_{op} = \hbox{177,023}\) floating-point operations when \(p=11\) and \(N^{el}_{op} = \hbox{29,155}\) floating-point operations when \(p=7\). The total number of floating-point operations for a CFD simulation with \(N_{eq}\) elements can be obtained as
\[
N_{op} = N_{eq} \times N^{el}_{op}.
\]
We executed all experiments with \(N_{eq}= \hbox{2,000,000}\), i.e., we simulated 2,000,000 elements.
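For reference, these counts can be reproduced with a short C++ snippet (the function and variable names below are ours, introduced only for illustration):

```cpp
#include <cstdint>
#include <cstdio>

// FLOPs per element: two tensor contractions of three loop nests each
// (one add + one mul over p^4 iterations per nest) plus the p^3-multiplication
// Hadamard product, i.e., 12*p^4 + p^3.
static std::uint64_t flops_per_element(std::uint64_t p) {
    return 12 * p * p * p * p + p * p * p;
}

int main() {
    const std::uint64_t n_eq = 2000000;  // number of simulated elements (N_eq)
    for (std::uint64_t p : {7, 11}) {
        const std::uint64_t n_el = flops_per_element(p);  // 29,155 / 177,023
        std::printf("p=%llu: %llu FLOPs per element, %llu FLOPs in total\n",
                    (unsigned long long)p, (unsigned long long)n_el,
                    (unsigned long long)(n_el * n_eq));
    }
    return 0;
}
```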
We executed our CFDlang compiler on the DSL description in Figure 2 to generate the C kernel for hardware optimization. We first performed experiments to evaluate the effects of the optimizations with \(p=11\). In particular, we progressively added the following optimizations:
• Baseline: No optimizations are used. The code serially executes kernels and data transfers, while each CU contains only one kernel and is connected to the HBM with 64-bit AXI channels.
• Host-HBM Double Buffering: We introduce this well-known optimization to hide CPU-FPGA communication latency.
• HBM-FPGA Bus Optimization: We evaluate the effect of widening the bus to 256 bits, with only one kernel unit (and serializing the data) and with multiple lanes feeding parallel kernel units.
• Dataflow Optimization: We create several variants of the compute functions with one, two, three, and seven subkernels. We evaluate the performance vs. resources tradeoff.
• Resource Optimization: We apply on-chip memory sharing (only in the case of dataflow implementations with one block inside the compute part) and fixed-point optimizations (with 64- and 32-bit implementations).
For each of these implementations, we measured total and kernel execution times, maximum and average power consumption, and cost in terms of hardware resources. Figure
15 shows the performance (in terms of GFLOPS) achieved in each experiment when adding the specific optimization on top of the previous ones. In each experiment, the left black and white bar (CU) shows the GFLOPS of the CUs on their own, without considering host-FPGA data transfers, while the right azure bar (System) includes the entire application. Comparing the two bars allows us to evaluate the peak performance of the kernels and the effects of data transfers.
The Baseline case achieves only 2.9 GFLOPS, and the difference between the CU performance and the overall system performance is 9.2%. This is due to the serial nature of the implementation where data are transferred from the host to the HBM, then processed by the CU and sent back to the host before starting a new batch. If more data need to be transferred, then this discrepancy between CU performance and overall system performance will grow larger, as the CU needs to wait for all of the data to be sent before beginning execution.
After the Double Buffering optimization, the CU performance remains similar, with a small degradation due to overhead, while the system performance is now the same as the CU performance. This is an improvement over the Baseline implementation because now the host to HBM data transfers are happening in parallel to and are entirely hidden behind the CU execution.
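As a rough sketch of how such a ping-pong scheme can look on the host side (illustrative only: the kernel name, argument layout, and the use of two in-order OpenCL queues with explicit read/write calls are our assumptions, not the exact code generated by our flow):

```cpp
#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#define CL_HPP_TARGET_OPENCL_VERSION 120
#include <CL/cl2.hpp>
#include <cstddef>
#include <vector>

// Host->HBM double buffering: two in-order queues, each owning one buffer set,
// are used in ping-pong fashion, so the transfer of batch i+1 overlaps with the
// kernel execution of batch i. A kernel with one input and one output buffer is
// assumed for simplicity.
void run_batches(cl::Context& ctx, cl::Device& dev, cl::Kernel& krn,
                 const std::vector<std::vector<double>>& in_batches,
                 std::vector<std::vector<double>>& out_batches, std::size_t bytes) {
    cl::CommandQueue q[2] = {cl::CommandQueue(ctx, dev), cl::CommandQueue(ctx, dev)};
    cl::Buffer in[2]  = {cl::Buffer(ctx, CL_MEM_READ_ONLY,  bytes),
                         cl::Buffer(ctx, CL_MEM_READ_ONLY,  bytes)};
    cl::Buffer out[2] = {cl::Buffer(ctx, CL_MEM_WRITE_ONLY, bytes),
                         cl::Buffer(ctx, CL_MEM_WRITE_ONLY, bytes)};

    for (std::size_t i = 0; i < in_batches.size(); ++i) {
        const int s = i % 2;  // alternate between the two buffer sets / queues
        q[s].enqueueWriteBuffer(in[s], CL_FALSE, 0, bytes, in_batches[i].data());
        krn.setArg(0, in[s]);   // arguments are captured when the task is enqueued
        krn.setArg(1, out[s]);
        q[s].enqueueTask(krn);
        q[s].enqueueReadBuffer(out[s], CL_FALSE, 0, bytes, out_batches[i].data());
    }
    q[0].finish();
    q[1].finish();
}
```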
We then executed two experiments for evaluating Bus Optimization. In the Serial version, we attempt to utilize the full bandwidth of the 256-bit bus by packing four doubles. The CU reads them in parallel, but then it serializes them when it needs to access its own local buffers. While this optimization is supposed to speed up data reads from the HBM, its implementation in the CU leads to a performance degradation of about 3\(\times\). This is mostly due to the complexity of aligning the data (\(p\times p\) and \(p\times p\times p\)) to multiples of four inside the bus. To combat this, but still use the full bus bandwidth, this optimization was replaced with the Parallel implementation where four kernels are instantiated in the CU and the data for each “lane” is stored in separate buffers, one for each kernel. This led to a \(3.92\times\) speedup over the Serial implementation, as the four lanes are now all read in parallel. There is only a \(1.23\times\) speedup over the Double Buffering implementation, where a \(4\times\) speedup would be expected. This is because in both Bus Opt implementations, the HLS tool only instantiated two double-precision multipliers (rather than 11 in all other implementations). So when the innermost loops are unrolled, a resource limitation violation increases the initiation interval (II) to 4, effectively counteracting the expected \(4\times\) speedup. We use the Parallel architecture in the following experiments.
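A minimal HLS-style sketch of the Parallel variant is given below; it is illustrative only (the real CU stores each lane in its own local buffers and runs the full Inverse Helmholtz kernel, whereas kernel_lane here is a trivial stand-in):

```cpp
#include <cstddef>

// Four doubles packed into one 256-bit word on the HBM-facing AXI port.
struct bus256 { double lane[4]; };

// Trivial stand-in for the per-lane Inverse Helmholtz kernel.
static double kernel_lane(double x) { return 2.0 * x; }

// Parallel bus optimization: every 256-bit read delivers one value per lane,
// and the fully unrolled inner loop lets the four lanes proceed in parallel.
extern "C" void cu_parallel(const bus256* in, bus256* out, std::size_t n_words) {
#pragma HLS INTERFACE m_axi port=in  bundle=gmem0
#pragma HLS INTERFACE m_axi port=out bundle=gmem1
    for (std::size_t i = 0; i < n_words; ++i) {
#pragma HLS PIPELINE II=1
        bus256 w = in[i];
        for (int l = 0; l < 4; ++l) {
#pragma HLS UNROLL
            w.lane[l] = kernel_lane(w.lane[l]);  // one independent lane per kernel unit
        }
        out[i] = w;
    }
}
```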
Next, we tested various forms of the Dataflow Optimization. Each implementation of this optimization separated the kernels into read, compute, and write modules, with streams used to pass data between them, allowing a pipelined structure. In the test using one compute subkernel (Dataflow (1 Compute)), the speedup was \(3.68\times\). In these cases, the HLS tool instantiated the full 11 multipliers, removing the II violation. This decision of the HLS tool, along with the overlapping execution of the read, compute, and write modules, allows us to effectively achieve a \(4\times\) speedup, i.e., to fully exploit the four parallel lanes. Since the compute module dominated the execution time, it was further split into two, three, and seven modules. The Inverse Helmholtz operator comprises seven loop nests, each implementing the operations in the seven grey rows on the right side of Figure
10. To split into two modules, the kernel is divided into a first module with the first three loop nests with
S and
u as input and
t as output, and a second module with the last four loop nests with
S,
D, and
t as input and
v as the final output. The split was made so that the first module does not need
D as an input and the second module does not need
u as an input. To split into three modules, we use the division shown on the left side of Figure
10 and in Figure
11. This is the most natural division as it matches the initial DSL representation where the first three loop nests implement the
gemm operator, the fourth loop nest implements the
mmult operator, and the last three loop nests implement the
gemm_inv operator. Another benefit of this division is that the mmult loop nest consumes and produces data in the same order it is sent via the streams, meaning that no extra buffering is needed for this module and each data element can be processed immediately as it is received, leading to minimal latency. To split into seven modules, each loop nest becomes a separate module.
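A simplified sketch of the 3-Compute split described above is shown below; the stage bodies are placeholders for the actual gemm, mmult, and gemm_inv loop nests and only illustrate the read/compute/write streaming structure:

```cpp
#include <hls_stream.h>

constexpr int P = 11;                 // loop bound (illustrative)
constexpr int N = P * P * P;          // elements per tensor (illustrative)
using stream_t = hls::stream<double>;

// Read module: fetch the operands from HBM and forward them on streams.
static void read_in(const double* u, const double* D, stream_t& s_u, stream_t& s_D) {
    for (int i = 0; i < N; ++i) s_u.write(u[i]);
    for (int i = 0; i < N; ++i) s_D.write(D[i]);
}

// Placeholder compute stages modeling only the streaming structure of the split.
static void stage_gemm(stream_t& s_u, stream_t& s_t) {
    for (int i = 0; i < N; ++i) s_t.write(s_u.read());               // stand-in for 3 loop nests
}
static void stage_mmult(stream_t& s_t, stream_t& s_D, stream_t& s_w) {
    for (int i = 0; i < N; ++i) s_w.write(s_t.read() * s_D.read());  // Hadamard product
}
static void stage_gemm_inv(stream_t& s_w, stream_t& s_v) {
    for (int i = 0; i < N; ++i) s_v.write(s_w.read());               // stand-in for 3 loop nests
}

// Write module: drain the result stream back to HBM.
static void write_out(stream_t& s_v, double* v) {
    for (int i = 0; i < N; ++i) v[i] = s_v.read();
}

// Top level: with DATAFLOW the modules run as concurrent processes connected
// by FIFOs, yielding the pipelined structure described in the text.
extern "C" void inv_helmholtz_df(const double* u, const double* D, double* v) {
#pragma HLS DATAFLOW
    stream_t s_u, s_D, s_t, s_w, s_v;
    read_in(u, D, s_u, s_D);
    stage_gemm(s_u, s_t);
    stage_mmult(s_t, s_D, s_w);
    stage_gemm_inv(s_w, s_v);
    write_out(s_v, v);
}
```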
All three of these tests gained speedup over the 1-Compute version by breaking the total execution time of a single module down further. The 2-Compute version is \(1.7\times\) faster than the 1-Compute version. The discrepancy between this result and the ideal speedup of \(2\times\) is due to the extra data buffering that must be done in each module to allow random access. In the 1-Compute case, the input arrays are each buffered one time, which adds an overall latency equal to the total input data size. In the 2-Compute case, the S array is needed by both modules and must be buffered twice. Additionally, the output of the first module, t, is used as input to the second module and must also be buffered. This extra buffering is overlapped while the two modules execute in a pipelined fashion, but it means that the latencies of the two modules are not exactly half of the total latency of one unified module. However, the 3-Compute version was slower than the 2-Compute version. The overall execution time is determined by the module with the longest latency, as it is the limiting factor in the overall latency of the system. Because the loop nest implementing the mmult operator has a minimal latency, moving it to a separate module does not significantly reduce the latency of the largest module. In fact, in each case, the module with the longest latency was the same, but the extra modules and control routing caused the tools to scale the frequency of the 3-Compute case down to 266 MHz, whereas the 2-Compute case executed at 292 MHz. When this is considered, the performance of both tests is approximately the same. The 7-Compute test, however, performed the best because each of the compute modules was much smaller than in the previous tests. In this case, the latencies of these modules were slightly shorter than the latency of the read module, meaning that this is the limit of the performance increase obtainable by dividing the compute portion. The 7-Compute test gained a total speedup of \(4.03\times\) over the Bus Opt Parallel implementation.
To evaluate the efficiency of the allocated resources, we computed the “ideal” GFLOPS value for each of the double-precision floating-point implementations. We identified the total number of double-precision adders and multipliers instantiated in the CU (# Ops) by analyzing the synthesis reports generated by Vitis HLS. The ideal GFLOPS is computed by multiplying this value by the frequency of the CU and represents the performance that would be achieved if all operators were constantly in use concurrently. Table 2 compares these values with the measured GFLOPS of each implementation. In the last column, an “efficiency” is calculated as the ratio between the achieved and ideal GFLOPS.
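In formulas, \(\mathrm{GFLOPS}_{ideal} = \#\mathrm{Ops} \times f_{CU}\) and \(\mathrm{efficiency} = \mathrm{GFLOPS}_{measured} / \mathrm{GFLOPS}_{ideal}\); as an illustrative calculation (assuming the 11 adders and 11 multipliers of the Dataflow cases running at the 292 MHz reported above), the ideal value would be \(22 \times 0.292 \approx 6.4\) GFLOPS.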
This “efficiency” reflects the behavior of the allocation and scheduling of the HLS tool more than the efficiency of our system design surrounding each HLS kernel. However, we can still gain some insight into our design in the cases where the HLS decisions were the same. For instance, in the Baseline and Double Buffering cases, the same kernels are used, and therefore they each have the same # Ops. The efficiency increases because less time is “wasted” waiting on data transfers from the host. Both Bus Opt implementations reduce the # Ops because the HLS tool used a different local memory type with fewer read ports, restricting the unrolling, and therefore only used two adders and two multipliers for each kernel. The efficiency values of these implementations are also much higher because these are the only cases where the multipliers themselves are pipelined. The ideal GFLOPS metric expects each operator to produce a result every clock cycle, so in all other cases where the operators are not pipelined, there are several cycles of latency for each operation reducing the efficiency. Between each of the Dataflow implementations, the efficiency drops slightly as the computation is split into more modules because it is impossible to split the computation into equal latency modules. The module with the longest latency may constantly be computing, but the shorter-length modules must stall.
The efficiency values for all implementations (except Bus Opt) are all near 0.5 because each multiply-accumulate is implemented as eleven parallel multipliers and eleven sequential adders. Even though the additions are sequential, the tool still allocated eleven of them. Because the Bus Opt implementations are restricted to two adders, their efficiencies are higher.
At this point, we want to start replicating the CUs using the remaining area available in the FPGA fabric to maximize parallelism. For this reason, we need to evaluate the hardware cost of each implementation. The numbers of LUT, FF, BRAM, URAM, and DSP used by each case for
\(p=11\) are shown in Table
3. In general, each test from Baseline to Dataflow (7 Compute) showed an increase in resource utilization. Any utilization value over 25% is shown in red. These resources are most likely to cause placement and routing issues when instantiating multiple CUs. We tested a few methods to reduce resource utilization to be able to increase the number of instantiated CUs.
The
Mem Sharing optimization is applied to the
Dataflow 1-Compute implementation where several arrays are used in the compute module (cf. Figure
14(d)). Mnemosyne generated an architecture to internally share arrays based on their liveness intervals. This decreased the BRAM utilization by 14.5% and the URAM utilization by 48.3%, while the LUT and FF utilization increased only minimally and the DSP utilization remained the same. Also, the performance was only slightly reduced (a \(0.98\times\) speedup, i.e., a marginal slowdown). Conversely, this optimization cannot be applied to the
Dataflow 2-Compute,
3-Compute, and
7-Compute implementations because, in these cases, each compute module only uses arrays that cannot be shared, as they are always in use during the module execution. This optimization is indeed beneficial when on-chip memory inside the CU is the limiting factor, and when replicating the CUs brings more improvements than dataflow execution.
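The idea can be illustrated with a hand-written sketch (this is a conceptual example with placeholder arithmetic, not the architecture Mnemosyne actually generates): two logical arrays whose liveness intervals do not overlap are mapped onto one physical buffer.

```cpp
constexpr int N = 11 * 11 * 11;   // illustrative tensor size

// tmp1 is dead after the reduction in phase 2 and tmp2 only becomes live in
// phase 3, so the two logical arrays can share a single physical memory.
extern "C" void compute_shared(const double* u, const double* D, double* v) {
    double shared[N];                  // one physical buffer for two logical arrays
    double* tmp1 = shared;             // live in phases 1-2
    double acc = 0.0;

    for (int i = 0; i < N; ++i) tmp1[i] = u[i] * D[i];    // phase 1 (placeholder math)
    for (int i = 0; i < N; ++i) acc += tmp1[i];           // phase 2: last use of tmp1

    double* tmp2 = shared;             // live in phases 3-4, reuses the same storage
    for (int i = 0; i < N; ++i) tmp2[i] = u[i] + acc;     // phase 3
    for (int i = 0; i < N; ++i) v[i]   = tmp2[i] * D[i];  // phase 4
}
```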
Another method to reduce resources is to change the numerical representation. All of the previous tests used the floating-point format with double precision. In general, fixed-point representations utilize fewer resources than floating-point ones. We tested 64- and 32-bit fixed-point representations by modifying the Dataflow 7-Compute implementation. The 64-bit implementation uses 24 bits for the integer portion and 40 bits for the fractional portion. The 32-bit implementation uses 8 bits for the integer portion and 24 bits for the fractional portion. These values are provided by the user after an analysis of the algorithm. Because the 32-bit data are half the size, we instantiate eight kernels per CU and divide the 256-bit bus into eight lanes. In the Fixed Point 64 test, the LUT utilization decreased by 46.3%, the FF utilization decreased by 53.4%, the RAM utilization remained the same, and the DSP utilization increased by 44.8%. In the Fixed Point 32 test, compared to the Fixed Point 64 test, the LUT and FF utilization remained roughly the same. The DSP utilization was nearly halved. The BRAM usage increased by about four times, while the URAM usage decreased to zero. This is because the data representation is half as long, so the overall size of the data structures is half as big. The arrays representing the tensors are no longer big enough for the synthesis tool to consider URAM an efficient choice for storing them. When considering the size of the physical memories, the total memory space is approximately halved. The Fixed Point 64 test achieved a slight speedup of \(1.19\times\) because the simpler logic allowed a higher frequency: the Dataflow 7-Compute test with double format was scaled to 199 MHz, while the Fixed Point 64 test was scaled to 234 MHz. The Fixed Point 32 test achieved a speedup of \(2.37\times\) over the double format and reaches up to 103 GFLOPS, a speedup of more than 35\(\times\) over the Baseline version. The Fixed Point 64 test exhibited a mean square error of \(9.39\times 10^{-22}\), while the Fixed Point 32 test had a mean square error of \(3.58\times 10^{-12}\). It is up to the application designer to determine what an acceptable error is and decide on an appropriate number format, and our flow can help facilitate a design space exploration of these parameters.
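In Vitis HLS terms, the two formats described above would correspond to type definitions along the following lines (a sketch; the exact types and bit allocations in the generated code are an assumption on our part):

```cpp
#include <ap_fixed.h>

// ap_fixed<W, I>: W total bits, of which I form the integer part (sign included)
// and W - I the fractional part.
typedef ap_fixed<64, 24> data64_t;   // 24 integer bits + 40 fractional bits
typedef ap_fixed<32, 8>  data32_t;   //  8 integer bits + 24 fractional bits

// Stand-in for one multiply-accumulate of the kernel, instantiated with either
// fixed-point type instead of double.
template <typename T>
static T mac(T acc, T a, T b) { return acc + a * b; }
```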
Another method to reduce resource utilization for this kernel is to vary the input parameter
\(p\). We tested the
Dataflow 7-Compute implementation using 64-bit double, 64-bit fixed point, and 32-bit fixed point with
\(p=7\) and
\(p=11\). The results are summarized in Figure
16 and Table
4.
Compared to their \(p=11\) counterparts, the \(p=7\) implementations performed slightly worse. This is because the actual hardware implementation does not scale in the same way as the conceptual number of floating-point operations per kernel (used to compute the GFLOPS). However, the resource reduction between \(p=11\) and \(p=7\) is enough to allow for more replication of the CUs. For instance, the Fixed Point 32 implementation uses 66.4% of the available BRAM for \(p=11\), while it only uses 21.7% for \(p=7\), allowing \(4\times\) replication.
To further facilitate instantiating multiple CUs, we reduced the depths of the stream FIFOs from the naive full size to values small enough to save space while still preventing deadlock. This led to a small performance reduction due to stalls but significantly reduced the total number of BRAMs. Also, because the DSP utilization was exceptionally high in some cases, we used pragmas to guide the HLS tool to use LUTs instead of DSPs to implement fixed-point multipliers. We used this pragma in one of the seven compute modules to shift some of the resource load off of DSPs and onto LUTs.
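Both tweaks map onto standard Vitis HLS pragmas; the following sketch (with illustrative names and an arbitrary FIFO depth) shows where they would be placed:

```cpp
#include <ap_fixed.h>
#include <hls_stream.h>

typedef ap_fixed<32, 8> data32_t;    // 32-bit fixed-point format from above

// Producer half of one dataflow edge (stand-in for an actual compute module).
static void producer(const data32_t* in, hls::stream<data32_t>& s, int n) {
    for (int i = 0; i < n; ++i) s.write(in[i]);
}

// Consumer half: the multiplication is bound to LUT fabric instead of DSPs to
// rebalance the resource usage of this module.
static void consumer(hls::stream<data32_t>& s, const data32_t* D, data32_t* out, int n) {
    for (int i = 0; i < n; ++i) {
        data32_t r;
#pragma HLS BIND_OP variable=r op=mul impl=fabric
        r = s.read() * D[i];
        out[i] = r;
    }
}

extern "C" void top(const data32_t* in, const data32_t* D, data32_t* out, int n) {
#pragma HLS DATAFLOW
    hls::stream<data32_t> s_mid;
    // Shrink the connecting FIFO from its naive full size to a small depth that
    // still avoids deadlock (the value here is arbitrary).
#pragma HLS STREAM variable=s_mid depth=64
    producer(in, s_mid, n);
    consumer(s_mid, D, out, n);
}
```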
We were able to instantiate two parallel CUs for the cases of
Double with
\(p=11\),
Fixed Point 64 with
\(p=11\), and
Fixed Point 64 with
\(p=7\), three CUs for the cases of
Double with
\(p=7\) and
Fixed Point 32 with
\(p=11\), and four CUs for the case of
Fixed Point 32 with
\(p=7\). The performance results for these implementations are shown in Figure
17, and the area results are shown in Table
5. All of these implementations were built targeting 225 MHz, as most of their 1-CU counterparts could not achieve even this frequency.
In most cases, replicating the CUs actually led to a slowdown. This is because the extra logic and routing caused the maximum frequency to be reduced thereby slowing down everything in the system. However, most cases did show speedup in terms of the CU execution time. In particular, the Fixed Point 32 implementations achieved up to 172 GFLOPS for the kernel but around 87 GFLOPS for the system. This huge discrepancy is because even though several CUs are now executing in parallel, all of the data must still be sent from the host to the HBM in series. Host data transfers are now the dominating factor by far, so it is not recommended to replicate CUs until the host data transfer time can be reduced. Otherwise, the overall system will have a slowdown from the extra logic.
From the resource utilization results, it can be seen that both 64-bit data types are constrained by the resources used for computation, namely LUTs and DSPs. The 32-bit fixed-point implementation is also somewhat constrained by DSPs. In any case, this application is composed almost entirely of floating- or fixed-point multiplications, and performance-optimized designs will quickly use most of the available DSPs. The 32-bit cases are also constrained by the on-chip memories; the 3-CU implementation of Fixed Point 32 with \(p=11\) even uses 100% of the URAM. Nevertheless, both Fixed Point 32 implementations could be replicated more than their Fixed Point 64 counterparts, due to the data width reduction. The \(p=7\) tests were also, in general, able to be replicated more than their \(p=11\) counterparts due to the effect \(p\) has on the amount of computation and the array sizes. Fixed Point 64 could only be replicated twice for both values of \(p\), as the reduction in DSPs between \(p=11\) and \(p=7\) was not enough to allow for a third CU.
Figure
18 shows the power consumption of the different implementations and a comparison of the energy efficiency (GFLOPS/W for floating-point operations and GOPS/W for fixed-point operations). The bars report the average power consumption measured with the XRT infrastructure. We also include the results of the multiple-CU implementations to show the effects of replication on both power consumption (W bars) and energy efficiency (GFLOPS/W and GOPS/W bars).
As expected, the fixed-point implementations are more efficient than the floating-point ones. Also, reducing the bitwidth from 64 to 32 bits allows us to achieve maximum efficiency. This is because these implementations are much faster and use fewer hardware resources. The \(p=7\) implementations have lower average power consumption than their \(p=11\) counterparts due to their smaller resource utilization. However, in most cases, the efficiency of the \(p=7\) cases is lower due to their longer overall execution time. The multiple-CU implementations are generally less efficient than their single-CU counterparts, both because of the increased work occurring in parallel, yielding a higher average power, and because of longer execution times from frequency scaling.