4.3 Benchmarks
We create Verilog designs for several diverse applications to use as microbenchmarks (Table 4). These include DL kernels (general matrix-vector multiplication (GEMV), general matrix-matrix multiplication (GEMM), 2D convolution (Conv2D), reduction, elementwise multiplication (Elt Mul), and Rectified Linear Unit (ReLU) activation), signal processing (FIR filter, i.e., 1D convolution), and bitwise applications (database search and RAID array data recovery). We manually map the applications to CoMeFa RAMs and instantiate the CoMeFa RAM blocks in Verilog RTL. During functional verification, a simulation model of CoMeFa RAM is used. We create different scenarios (compute bound, DRAM bandwidth bound, and on-chip memory bound) in these applications. Additionally, we evaluate the impact of adding CoMeFa RAMs on the performance of real-world DNNs from three common types: fully connected networks (Multi-Layer Perceptron (MLP)), RNNs (LSTM and GRU), and Convolutional Neural Networks (CNNs) (Tiny Darknet and ResNet).
GEMV and GEMM. GEMV and GEMM are fundamental operations in DL applications. They are used in MLPs, LSTMs, and many other DNNs. We consider a GEMV workload where a weight matrix of size 2048 × 512 is multiplied with an input vector of size 512 × 1, and a GEMM workload where a weight matrix of size 1536 × 512 is multiplied with an input matrix of size 512 × 32. These are sizes from actual layers in the DeepBench benchmarks [39]. Eight-bit integer precision with 27-bit accumulation is used. On the baseline FPGA, compute units are implemented using efficient chaining of DSPs. On the proposed FPGA, compute units based on CoMeFa RAMs are additionally deployed, because many RAM blocks are available after mapping the baseline design onto the proposed FPGA. The efficient OOOR-based dot product algorithm, described in Section 3.13, is used. Partial sums are read out from the CoMeFa blocks and accumulated using a pipelined bit-serial tree [34]. No online data transpose is required: the weight matrix is transposed offline and pinned into CoMeFa RAM blocks, and the input, being outside the RAM, is streamed without transposition. Since both DSP-based and CoMeFa-based compute units are used, a reduction in data movement is not expected.
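To make the dataflow concrete, the following is a minimal functional sketch (in Python, not the paper's Verilog) of an OOOR-style bit-serial dot product: the weight column is assumed to be pre-transposed and resident in the RAM, while each bit of the streamed input decides whether the weight is added at the corresponding shift. Unsigned operands are assumed for brevity; all names are illustrative.

```python
INT_BITS = 8   # int8 operands, as in the GEMV/GEMM benchmarks

def ooor_dot_product(weights, inputs):
    """Emulate a bit-serial dot product with one operand outside the RAM."""
    acc = 0  # 27-bit accumulator in the benchmarks; unbounded int here
    for w, x in zip(weights, inputs):
        for b in range(INT_BITS):      # one streamed input bit per step
            if (x >> b) & 1:
                acc += w << b          # shift-and-add multiplication
    return acc

# Example: one output element of the 2048 x 512 GEMV (a 512-long dot product).
assert ooor_dot_product([3] * 512, [5] * 512) == 3 * 5 * 512
```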
Convolution. The convolution operation forms the backbone of CNNs. We consider a convolution layer with the following parameters: Input: Height = 72, Width = 72, Channels = 128; Filters: Height = 2, Width = 2, Number = 128; Output: Height = 71, Width = 71, Channels = 128. On the baseline FPGA, dot product units are designed using DSP slices to perform multiplications and additions along the channel dimension, and the results from the four filter locations are then added. The filters are stored in BRAMs, and the inputs are streamed. A compute unit on the baseline FPGA is made of 64 DSPs and 8 BRAMs. On the proposed FPGA, CoMeFa RAMs are additionally deployed. Filters are pre-transposed and stored in the CoMeFa RAMs. A compute unit formed by CoMeFa RAMs contains 128 CoMeFa RAMs, along with instruction generation logic. The columns of a CoMeFa RAM store different filters (vectorization across the output channel dimension), whereas the RAMs in a unit vectorize across the input channel dimension. OOOR operations are used to compute dot products. The input data is divided between the compute units formed by DSPs and those formed by CoMeFa RAMs.
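A small sketch of the filter placement implied by this mapping, assuming 160 columns per CoMeFa RAM (the width cited later in this section); the helper and its names are illustrative only.

```python
NUM_RAMS = 128        # RAMs per compute unit, one per input channel
COLS_PER_RAM = 160    # assumed CoMeFa RAM width (columns)

def filter_location(in_ch, out_ch):
    """Return the (ram_index, column_index) holding the filter weights for
    input channel in_ch and output filter out_ch within one compute unit."""
    assert in_ch < NUM_RAMS and out_ch < COLS_PER_RAM
    return in_ch, out_ch

# The 2 x 2 filter weights connecting input channel 5 to output channel 17
# live in RAM 5, column 17 (stored bit-serially down the column).
print(filter_location(5, 17))  # -> (5, 17)
```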
FIR Filter. FIR filters are a common DSP application. We consider an FIR filter with 128 taps. Input operands are streamed onto the FPGA through the DRAM interface. The baseline FPGA uses an efficient implementation of the FIR filter based on systolic DSP chaining [4]. The proposed FPGA uses CoMeFa RAMs for computation along with DSP chains, with LBs used for control logic. Operands are transposed on the fly and loaded into multiple CoMeFa RAMs in parallel. While some CoMeFa RAMs are computing, others are loaded in a pipelined manner to improve parallelism. When a CoMeFa RAM finishes computing, its results are unloaded and sent to DRAM, and the process repeats until all inputs are processed. We refer to this as the LCU (Load-Compute-Unload) pipeline. In this application, the CoMeFa RAM-to-CoMeFa RAM chaining feature (Section 3.10) is used to share inputs between neighboring blocks.
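A toy cycle model of the LCU schedule, only to illustrate why the overlap helps; the per-stage costs are placeholders rather than measured values.

```python
LOAD, COMPUTE, UNLOAD = 100, 400, 100   # hypothetical per-tile cycle costs

def lcu_cycles(num_tiles, overlapped=True):
    """Total cycles to process num_tiles tiles of streamed FIR inputs."""
    if not overlapped:
        return num_tiles * (LOAD + COMPUTE + UNLOAD)
    # With the LCU pipeline, loading/unloading of one tile hides behind the
    # compute of another, so throughput is set by the slowest stage.
    bottleneck = max(LOAD, COMPUTE, UNLOAD)
    return (LOAD + COMPUTE + UNLOAD) + (num_tiles - 1) * bottleneck

print(lcu_cycles(64, overlapped=False))  # 38400
print(lcu_cycles(64, overlapped=True))   # 25800
```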
Elementwise Multiplication. Elementwise multiplications (Elt Mul) are commonly used in DL, for example, in normalization layers and Winograd-based convolution layers. We consider an application involving elementwise multiplication of two arrays of 100K elements each. Floating-point data with HFP8 precision [47] is used; this showcases that CoMeFa RAMs can be adapted to any custom precision. The operands are read from DRAM, and the results are written to DRAM. This application is DRAM bandwidth bound because of its low arithmetic intensity. We observed that the number of LBs used was significantly higher (25×) than in the baseline FPGA, because many swizzle logic instances are required to saturate the DRAM bandwidth available on the chip. However, if the swizzle logic is hardened into a DRAM controller, as discussed in Section 3.11, this overhead is entirely removed.
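A back-of-the-envelope estimate shows why, assuming one byte per HFP8 element (two loads and one store per multiply):

\[
\text{Arithmetic intensity} \approx \frac{1\ \text{multiply}}{3\ \text{bytes}} \approx 0.33\ \text{ops/byte},
\]

so the 100K-element workload moves roughly 300 KB of DRAM traffic to perform only 100K multiplies, leaving performance limited by DRAM bandwidth rather than compute.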
Bitwise Operations. Bitwise operations (AND, OR, XOR, XNOR, etc.) are commonly used in databases, encryption, DNA sequence alignment, and so forth. They are also used in binary neural networks. CoMeFa RAMs are very efficient at these massively parallel operations because of their mux-based, fully configurable PEs. The operands are assumed to be available in BRAMs in the required layout. The speedup seen in these applications is attributed to the effective increase in on-chip memory bandwidth: 160 bits can be operated upon in one cycle in a CoMeFa RAM, compared to only 40 bits from a BRAM in the baseline FPGA. We consider two applications in this category:
(1) Database search: In this application, records matching a key are searched for. If a record matches the key, it is replaced with special marker data (such as the constant 0). Each operand is bitwise XORed with the key, a bitwise OR reduction is performed on the result, and a bitwise AND is then used to zero out the operands that match the key (see the sketch after this list). BRAMs are used to store operands. Each row of a BRAM holds two 16-bit elements. On the proposed FPGA, elements are stored in 256 CoMeFa RAMs. Seven data elements are stored in each column, and temporary results consume 16 rows in a CoMeFa RAM. The key is kept outside the RAM.
(2) RAID data recovery: In RAID (Redundant Array of Independent Disks) arrays, parity protection is used. If a drive in an array fails, the remaining data on the other drives is combined with the parity data (using XOR) to reconstruct the missing data. These numerous parallel XOR operations with the parity data can be accelerated using an FPGA. Instead of storing operands in a transposed format (bits of one operand in multiple rows), we use an untransposed data layout in which the bits of one operand occupy one row and the bits of the second operand occupy another row. This works for logical operations like bitwise XOR, where there is no dependency or communication between consecutive bits, and it avoids the overhead of transposing data. Performing an XOR operation between operands stored in two rows takes one cycle. A total of 256 RAMs is used.
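Minimal functional sketches of the two bitwise kernels follow (plain Python, per element; the in-RAM versions operate on 160 columns or entire rows at once). Record widths and values are illustrative.

```python
def search_and_blank(records, key, width=16):
    """Database search: XOR with the key, OR-reduce to detect a match,
    then AND with a mask to zero out matching records."""
    out = []
    for r in records:
        diff = r ^ key                       # bitwise XOR with the key
        match = (diff == 0)                  # OR-reduction of diff is 0 only on a match
        mask = 0 if match else (1 << width) - 1
        out.append(r & mask)                 # matching records become the marker 0
    return out

def raid_recover(surviving_rows, parity_row):
    """RAID recovery: XOR the surviving drives' rows (untransposed layout,
    one operand per RAM row) with the parity row to rebuild the lost row."""
    missing = parity_row
    for row in surviving_rows:
        missing ^= row
    return missing

print(search_and_blank([0x1234, 0xBEEF, 0x1234], key=0x1234))  # [0, 48879, 0]
print(raid_recover([0b1010, 0b0110], parity_row=0b1001))       # 0b0101 -> 5
```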
ReLU. ReLU is the most common activation function used in DNNs. Activations usually follow a GEMM, GEMV, or convolution operation. The operation zeroes out any negative input, while positive inputs stay unchanged. We assume the input data is available in a RAM (e.g., computed by a prior kernel). The precision is 16-bit. In CoMeFa RAMs, the inverted most significant bit (sign bit) of each input is copied into the mask latches in the PEs. The value 0 is then written to the rows containing the input elements. In columns where the operation is masked (because the sign bit was 0), the values stay unchanged; in the remaining columns, the values are zeroed out. In the baseline FPGA, values are read from the RAM, their most significant bit is inspected, and the output is generated using simple multiplexing logic and written back into the RAM.
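A functional emulation of this masked-write trick on one column's 16-bit elements (Python, purely illustrative):

```python
WIDTH = 16  # element precision used in the benchmark

def comefa_style_relu(column_values):
    """Emulate the in-RAM ReLU: the inverted sign bit acts as a per-column
    write mask, so writing 0 only takes effect for negative elements."""
    out = []
    for v in column_values:              # raw 16-bit two's-complement patterns
        sign = (v >> (WIDTH - 1)) & 1    # MSB (sign bit) of the stored element
        mask = 1 - sign                  # mask latch holds the inverted sign bit
        out.append(v if mask else 0)     # masked columns keep their value
    return out

# Negative elements (sign bit 1) are zeroed; positive ones are untouched.
print(comefa_style_relu([0x0005, 0xFFFB, 0x7FFF]))  # [5, 0, 32767]
```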
Reduction. Reduction (or accumulation) is heavily used in DL and DSP applications. We design this application to create an on-chip memory bandwidth bound scenario. Data is available in transposed format (e.g., computed in the RAM by a prior kernel). The precision is varied from 4-bit to 20-bit (accumulator size = 32-bit). In the baseline, operands stored in BRAMs are read and successively accumulated using a pipelined adder tree (in LBs). On the proposed FPGA, CoMeFa RAMs store the operands. The reduction algorithm from Eckert et al. [14] is used to reduce the elements to 40 partial sums (one partial sum in each multiplexed column of the RAM). These intermediate results from multiple CoMeFa RAMs are then read out and accumulated using a popcount-based adder [53] to obtain the final result. Significantly fewer LBs (approximately 2× to 3.5× fewer) are required on the proposed FPGA.
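The final accumulation step can be sketched as follows: the 40 partial sums leave the RAM bit-serially (one bit per column per cycle), and a popcount of each bit slice is added at the corresponding weight. This is an emulation of the idea, assuming unsigned partial sums; it is not the RTL from [53].

```python
LANES = 40       # one partial sum per multiplexed column of the RAM
SUM_BITS = 32    # accumulator size used in this benchmark

def popcount_accumulate(partial_sums):
    """Combine LANES bit-serially streamed partial sums into one result."""
    assert len(partial_sums) == LANES
    total = 0
    for b in range(SUM_BITS):
        # One cycle: take bit b of every lane, popcount, add at weight 2^b.
        ones = sum((p >> b) & 1 for p in partial_sums)
        total += ones << b
    return total

print(popcount_accumulate(list(range(40))) == sum(range(40)))  # True
```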
DNNs. To evaluate full neural networks, we create a Microsoft Brainwave-like accelerator [17] based on the work of Boutros et al. [10]. This accelerator consists of five pipeline stages: the Matrix Unit (MU) for matrix-vector multiplication operations, the selector unit for skipping the MU when necessary, two multi-function units (MFUs) for vector elementwise operations (e.g., activation, addition, multiplication), and the loader (LD), which interfaces with the DRAM to load and unload data. Register files (MRF and VRF) store the data locally. Similar to CCB [53], we create two versions of this accelerator: one for the baseline FPGA and another for the proposed FPGA. For the baseline FPGA, the MU consists of Dot Product Engines (DPEs) that contain DSP slice cascade chains. Each DPE generates one result. For the proposed FPGA, the MU additionally contains DPEs that are mapped to CoMeFa RAMs (we call these CoMeFa-DPEs, or C-DPEs). The CoMeFa RAMs in C-DPEs receive instructions from an instruction generation FSM (duplicated to reduce fanout). A popcount-based bit-serial reduction tree [53] is used to combine the results from the various CoMeFa RAMs. Each C-DPE generates 40 results. Figure 11 shows the architecture of the accelerator for the proposed FPGA.
We write an analytical model to explore the distribution of data and BRAMs between DPEs and C-DPEs. There are two main knobs in our analytical model: f_data, which decides the fraction of the workload (in terms of rows of the matrix processed by the MU) assigned to DPEs versus C-DPEs, and f_arch, which decides the fraction of BRAMs allocated to DPEs versus C-DPEs. Additionally, the analytical model varies the number of DSPs per DPE and the number of BRAMs per C-DPE over pre-specified ranges. The analytical model iterates over each layer of each neural network and calculates the cycles consumed for that layer. We then post-process the results using Pandas to find the best knob (parameter) settings for each neural network. This results in a different architecture for each neural network: instead of a one-size-fits-all overlay, there is a customized overlay per network. We write an RTL generator to produce the Verilog design for the accelerator with the best hardware parameters identified by the analytical model. Through simulation, we verify both the Verilog design and the analytical model's results.
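A heavily simplified sketch of such a sweep is shown below. The cost model inside layer_cycles is a placeholder (not the paper's model); only the two knobs, the per-layer iteration, and the Pandas post-processing follow the description above.

```python
import itertools
import pandas as pd

TOTAL_DSPS, TOTAL_BRAMS = 1500, 2000     # hypothetical device resources

def layer_cycles(rows, cols, f_data, f_arch):
    """Placeholder cost model: DPEs take f_data of the MU's matrix rows and
    get f_arch of the BRAMs; C-DPEs take the rest. Both sides run in parallel."""
    dsp_cycles = (rows * f_data) * cols / max(TOTAL_DSPS, 1)
    cdpe_cycles = (rows * (1 - f_data)) * cols / max(TOTAL_BRAMS * (1 - f_arch), 1)
    return max(dsp_cycles, cdpe_cycles)

layers = [(1024, 1024)] * 5              # e.g., the five-layer mlp benchmark
records = []
for f_data, f_arch in itertools.product([i / 10 for i in range(11)], repeat=2):
    total = sum(layer_cycles(r, c, f_data, f_arch) for r, c in layers)
    records.append({"f_data": f_data, "f_arch": f_arch, "cycles": total})

df = pd.DataFrame(records)
best = df.loc[df["cycles"].idxmin()]     # best knob settings for this network
print(best)
```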
The Brainwave-like accelerator does not directly support convolutions. So, for CNNs, convolution is expressed as matrix multiplication using the im2col operation. We assume that the im2col operation is performed in hardware. Although this can be optimized by designing an accelerator specifically for convolution, our goal here is to showcase the gains from in-memory computation rather than designing the most efficient accelerator.
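For reference, a plain-Python sketch of the im2col lowering assumed here (stride 1, no padding); the paper assumes an equivalent hardware implementation.

```python
def im2col(inp, kh, kw):
    """inp is a [C][H][W] nested list. Returns a (C*kh*kw) x (out_h*out_w)
    matrix in which each column is one flattened convolution window, so the
    convolution becomes a GEMM that the MU can execute."""
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = []
    for y in range(out_h):
        for x in range(out_w):
            cols.append([inp[c][y + dy][x + dx]
                         for c in range(C) for dy in range(kh) for dx in range(kw)])
    return [list(row) for row in zip(*cols)]   # transpose: windows become columns

# Example: 1 channel, 3x3 input, 2x2 kernel -> a 4 x 4 matrix of windows.
print(im2col([[[1, 2, 3], [4, 5, 6], [7, 8, 9]]], kh=2, kw=2))
```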
We consider five DNN benchmarks for this part of the evaluation, drawn from three common DNN types: fully connected networks (Multi-Layer Perceptron (MLP)), RNNs (LSTM and GRU), and CNNs (Tiny Darknet and ResNet). The mlp network is a five-layer MLP with 1,024 neurons in each hidden layer and 4M parameters. The gru network has hidden size = 512, embedding size = 512, and timesteps = 50; it has 1.5M parameters. The tdarknet network is Tiny Darknet, a small image classification network for edge devices, with 650K parameters. The lstm network is an LSTM with hidden size = 1,024, embedding size = 1,024, and timesteps = 50; it has 8.4M parameters. The resnet benchmark is the ResNet-50 variant of ResNet, with 24M parameters.
We consider two precisions (int8 and int4) and two batch sizes (1 and 8). We also evaluate the speedup using the two dot product algorithms mentioned in Section 3.13. The FPGA used in our evaluation (Intel Arria 10) is a mid-sized FPGA with 47 megabits of on-chip memory capacity. Some of the DNNs used for evaluation have weights that do not fit on the FPGA: for int8, lstm and resnet do not fit; for int4, only resnet does not fit. For these cases, we also account for the overhead of loading the weights onto the FPGA from DRAM.
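Ignoring activations and other buffers, a quick weight-storage check is consistent with this:

\[
\textit{lstm}: 8.4\,\mathrm{M} \times 8\,\mathrm{b} \approx 67\,\mathrm{Mb}, \qquad \textit{resnet}: 24\,\mathrm{M} \times 8\,\mathrm{b} = 192\,\mathrm{Mb} \quad (\text{both} > 47\,\mathrm{Mb}),
\]
\[
\textit{lstm (int4)}: \approx 34\,\mathrm{Mb} < 47\,\mathrm{Mb}, \qquad \textit{resnet (int4)}: 96\,\mathrm{Mb} > 47\,\mathrm{Mb}.
\]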
4.4 Implementation Details
Area. Table 5 shows the area breakdown of both CoMeFa RAM architectures. For CoMeFa-D, the area overhead is 1,546.78 µm². This represents a 25.4% increase in BRAM tile area compared to the baseline. The overhead is mainly attributed to the 160 PEs and the additional 120 sense amplifiers and write drivers. With BRAMs occupying 15% of the die area in our baseline FPGA, this overhead corresponds to only a 3.8% increase in FPGA chip area. The overhead for CoMeFa-A is 493.5 µm². Compared to the baseline, this represents an 8.1% increase in BRAM tile area and only a 1.2% increase in FPGA chip area. This overhead is mainly attributed to the 40 PEs.
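The chip-level figures follow directly from the block-level overheads and the 15% BRAM share of the die:

\[
0.254 \times 0.15 \approx 3.8\% \ \text{(CoMeFa-D)}, \qquad 0.081 \times 0.15 \approx 1.2\% \ \text{(CoMeFa-A)}.
\]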
Frequency. We use COFFE to obtain the frequency overhead of a CoMeFa RAM operating in Hybrid mode, compared to a BRAM (735 MHz). For CoMeFa-D, the cycle time increases by 1.25× (588 MHz). This is mainly attributed to performing the read, compute (PE circuitry delay), and write in the same cycle. For CoMeFa-A, the cycle time increases by 2.5× (294 MHz), because four reads and two writes are done successively, as described in Section 3.7. The lower frequency of a CoMeFa RAM is not a concern because realistic FPGA designs are typically constrained by soft logic and routing delays and therefore do not reach the peak frequency of individual BRAMs (735 MHz in this case). In Memory mode, the delay overhead is negligible; there is only one additional mux in the write path, and the read path remains unchanged.
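The reported frequencies follow from the baseline BRAM frequency and these cycle-time factors:

\[
f_{\text{CoMeFa-D}} = \frac{735\ \mathrm{MHz}}{1.25} = 588\ \mathrm{MHz}, \qquad f_{\text{CoMeFa-A}} = \frac{735\ \mathrm{MHz}}{2.5} = 294\ \mathrm{MHz}.
\]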
Routing. The interface of a CoMeFa RAM block to the programmable routing is unchanged compared to that of a BRAM, except for the addition of two pins used for direct connections between neighboring blocks. These pins do not impact the programmable interconnect directly, but they do increase the pin density.
CCB. The implementation of CCB [53] is based on a BRAM with a 128 × 128 geometry. The area overhead for the CCB block evaluated in the work of Wang et al. [53] does not include the area of the additional sense amplifiers and write drivers. In our re-implementation of CCB, the total area overhead comes out to 872.64 µm², which is a 16.8% increase at the block level and a 2.5% increase at the chip level for the Arria 10–like FPGA used in this study. For the CCB evaluated by Wang et al. [53], the cycle time increases to 1.6× that of the baseline BRAM (469 MHz). Table 6 shows the differences between CCB and CoMeFa.