Improved JPEG Lossless Compression for Compression of Intermediate Layers in Neural Networks Based on Compute-In-Memory

Hua, Junyong; Xu, Hang; Du, Yuan; Du, Li

doi:10.3390/electronics13193872


Intelligent Sensing and Communication Lab, School of Electronic Science and Engineering, Nanjing University, Nanjing 210023, China

* Authors to whom correspondence should be addressed.

Electronics 2024, 13(19), 3872; https://doi.org/10.3390/electronics13193872

Submission received: 14 August 2024 / Revised: 25 September 2024 / Accepted: 27 September 2024 / Published: 30 September 2024

(This article belongs to the Special Issue New Insights into Memory/Storage Circuit, Architecture, and System)


Abstract:

With the development of Convolutional Neural Networks (CNNs), there is a growing requirement for their deployment on edge devices. At the same time, Compute-In-Memory (CIM) technology has gained significant attention in edge CNN applications due to its ability to minimize data movement between memory and computing units. However, deploying complex deep neural network models on edge devices with restricted hardware resources is still hindered by a lack of adequate storage for intermediate layer data. In this article, we propose an optimized JPEG Lossless Compression (JPEG-LS) algorithm that implements serial context parameter updating alongside parallel encoding. This method is designed for the global prediction and efficient compression of intermediate data layers in neural networks employing CIM techniques. The results indicate average compression ratios of 6.44× for VGG16, 3.62× for ResNet34, 1.67× for MobileNetV2, and 2.31× for InceptionV3. Moreover, the implementation achieves a data throughput of 32 bits per cycle at 600 MHz in a TSMC 28 nm process, with a hardware cost of 122 K Gate Count.

Keywords:

convolutional neural network; compute-in-memory; JPEG-LS; data compression

1. Introduction

In recent years, Convolutional Neural Networks (CNNs) have become powerful tools for tackling complex problems in various fields such as computer vision [1] and speech recognition [2]. However, the high performance of CNNs comes at the cost of substantial demands on computational resources and memory bandwidth, particularly in scenarios involving large-scale models and high-resolution inputs. Compute-In-Memory (CIM) achieves significant enhancement in computational efficiency and energy utilization by tightly integrating computation and storage. This dramatically reduces the movement of weight data between memory and the processor, making it particularly well-suited for data-intensive applications such as CNNs [3,4,5]. However, the storage and transfer of intermediate layer data remain an issue in the CIM architecture. Therefore, compressing this intermediate layer data can effectively enhance the performance of CNNs operating under a CIM architecture.

Some algorithms have been proposed to compress intermediate layer data, aiming to reduce the storage overhead [6,7,8]. However, the application of these compression algorithms faces several challenges. Firstly, the compression algorithm needs to approach the information entropy of the data as closely as possible to achieve optimal compression efficiency [9]. Secondly, complex compression algorithms may not be suitable for hardware implementation due to the additional computational overhead they introduce. Lastly, lossy compression may result in information loss, which can subsequently affect the performance of the final model.

This paper adopts the hardware-friendly JPEG Lossless Compression (JPEG-LS) algorithm, incorporates inter-channel prediction to enhance sparsity, and describes the hardware implementation and prototype verification. The experimental results demonstrate that this approach significantly reduces the storage demand for intermediate layer data, achieving average compression ratios of 6.44× for VGG16, 3.62× for ResNet34, 1.67× for MobileNetV2, and 2.31× for InceptionV3. During the hardware design phase, a scheme with four-way parallel encoding and serial context parameter updates was adopted, realizing a data throughput of 32 bits per cycle at 600 MHz in a TSMC 28 nm process. The hardware cost amounted to 122 K Gate Count.

2. Related Work

2.1. Data Compression Theory

In 1948, Claude Shannon introduced the concept of “information entropy”, which serves as a quantitative measure of the amount of information [9]. Information entropy refers to the average information content once the redundancy in the original information is eliminated. Let us assume that a source emits symbols $X_i$, where i is an integer ranging from 1 to n. The information content represented by any individual source symbol $X_i$ is given by Equation (1):

$$I(X_i) = \log_a \frac{1}{P(X_i)} = -\log_a P(X_i) \tag{1}$$

Thus, the average information content per symbol emitted by a source with n possible source symbols is given by Equation (2):

$$H(X) = \sum_{i=1}^{n} P(X_i)\, I(X_i) = -\sum_{i=1}^{n} P(X_i) \log_a P(X_i) \tag{2}$$

Here, $P(X_i)$ is the probability of occurrence of the source symbol $X_i$, and H(X) represents the information entropy of the source. When the logarithm base is a = 2, the unit of information quantity is the bit, and bits are used as the unit of information quantity in this paper. Information entropy not only quantitatively measures the size of the information but also provides the theoretical optimum for information encoding: the theoretical lower limit for the average code length in coding is the information entropy. In other words, information entropy defines the ultimate limit for lossless data compression.
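As a small illustration of Equation (2), the sketch below estimates the entropy of a sequence of symbols from their empirical frequencies and derives the best lossless compression ratio that an ideal symbol-by-symbol coder could reach. The function names `shannon_entropy` and `ideal_compression_ratio` and the default of 8 bits per symbol are illustrative assumptions, not part of the original work.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Empirical entropy H(X) in bits per symbol (Equation (2) with a = 2)."""
    counts = Counter(symbols)
    total = len(symbols)
    # P(X_i) is estimated as the relative frequency of each distinct symbol.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def ideal_compression_ratio(symbols, bits_per_symbol=8):
    """Upper bound on the compression ratio for a memoryless model of the data."""
    h = shannon_entropy(symbols)
    # A constant sequence has zero entropy, so the bound is unbounded.
    return float("inf") if h == 0 else bits_per_symbol / h


if __name__ == "__main__":
    data = [0, 0, 1, 1]
    print(shannon_entropy(data))          # 1 bit per symbol
    print(ideal_compression_ratio(data))  # 8 bits stored vs. 1 bit needed
```

The bound applies to a model that treats symbols as independent; a predictive coder such as JPEG-LS can do better than this figure because it exploits correlation between neighboring values.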

2.2. Encoding Introduction

2.2.1. Predictive Coding

Pixels in images exhibit a high degree of correlation with their neighboring pixels. Predictive coding leverages this spatial and temporal correlation among image pixels for encoding purposes [10,11,12]. Initially, the prediction value for the current pixel to be encoded is estimated using its surrounding pixels. Subsequently, the difference between the current pixel and its predicted value is computed. Finally, this prediction residual is encoded. Predictive coding is easy to implement in hardware and offers good compression performance. For instance, JPEG-LS employs predictive coding to encode the difference between the pixel being encoded and its predicted value, thereby enhancing encoding efficiency [13].

2.2.2. Statistical Coding

Statistical coding is a variable-length coding method that assigns shorter codewords to source symbols with higher probabilities and longer codewords to those with lower probabilities [14,15,16]. This strategy aims to minimize the average codeword length, thereby achieving compression. For example, JPEG utilizes Huffman coding to encode the differences between pixels and their predicted values [17]. Similarly, JPEG-LS employs run-length encoding to code sequences of pixels with equal grayscale values in the horizontal direction of an image, enhancing encoding efficiency [13].

2.2.3. Transform Coding

Transform coding involves transforming an image signal, typically described in the spatial domain, into another vector space where it can be represented differently [18,19,20]. This transformation serves to reduce the redundancy among image pixels. Following the transformation, the coefficients in the new vector space are coded according to their statistical characteristics and human visual perception, achieving compression. An example of transform coding is seen in JPEG2000, which codes image data that have been processed through Discrete Wavelet Transform, improving encoding efficiency [19].

3. Framework

The overall system structure is illustrated in Figure 1, encompassing an Instruction Queue, a Top-Level Control Unit, and the Datapath. Firstly, data from Off-Chip DRAM is loaded into the Instruction Queue via the Command Direct Memory Access (CDMA) for command handling, under the control of the Top-Level Control Unit. The Weight Direct Memory Access (WDMA) interfaces with the Processing Units at a bandwidth of 128 bits per cycle. Meanwhile, input data, after being decompressed by the JPEG-LS decoder, are connected to the Processing Units at a bandwidth of 32 bits per cycle.

Each Processing Unit comprises a 32 × 128 CIM Macro, a Scale Add Logic, a Scratch Pad Memory, an Activation Module, and a Pooling Module. These units execute CNN operations under the Control Unit. Upon completion of processing, the data undergo JPEG-LS Encoding before being stored back into the Direct Memory Access (DMA) buffer.

3.1. JPEG-LS Encode

In Figure 2, the improved JPEG-LS encoding process is divided into the following six steps:

Step 1: Input Buffering—This step receives the raw data and their adjacent data, buffering them into an input queue in preparation for encoding.

Step 2: Gradient Calculation and Mode Selection—Gradients are calculated, and based on these gradients, an encoding mode is selected.

Step 3: If all gradients are zero, counting is performed by the run length counter. Otherwise, the gradients are quantized and merged, obtaining the Context Address parameter Q.

Step 4: Prediction Value Calculation, Correction, and Prediction Error Calculation—The prediction value for the current data point is calculated, corrected if necessary, and the prediction error is determined.

Step 5: Encoding and Context Parameter Update—Encoding occurs alongside the update of context parameters. Golomb encoding and run length encoding operate in parallel, while the parameter update stage is handled sequentially to avoid read–write conflicts.

Step 6: Output Buffering—Encoded data streams are merged and output after the encoding process is completed.

3.1.1. Gradient Value Calculation

As shown in Figure 2, during the Gradients phase, the inputs are a, b, c, and d, which correspond to the four adjacent positions of the current pixel to be encoded, with reconstructed values Ra, Rb, Rc, and Rd, respectively. The local gradient values D1, D2, and D3 represent the activity levels of the neighboring pixels of the current pixel to be encoded, such as smoothness or boundary characteristics. These are used to estimate the statistical behavior of the prediction error for the current pixel to be encoded. Their calculation formulas are as follows:

$$\begin{aligned} D_1 &= R_d - R_b \\ D_2 &= R_d - R_c \\ D_3 &= R_c - R_a \end{aligned} \tag{3}$$

After the calculation of the local gradient values D1, D2, and D3, the selection of the encoding mode takes place. If the absolute values of all the local gradient values are equal to zero, then the current pixel to be encoded enters the run-length encoding mode. Otherwise, the current pixel enters the normal encoding mode.
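The gradient computation and mode selection can be sketched as follows. The code follows Equation (3) exactly as printed in this article; the function names and the string labels `"run"` and `"normal"` are illustrative assumptions.

```python
def local_gradients(ra, rb, rc, rd):
    """Local gradients D1, D2, D3 as given in Equation (3) of this article."""
    d1 = rd - rb
    d2 = rd - rc
    d3 = rc - ra
    return d1, d2, d3


def select_mode(d1, d2, d3):
    """Run-length mode when all gradients are zero, normal mode otherwise."""
    if d1 == 0 and d2 == 0 and d3 == 0:
        return "run"
    return "normal"


if __name__ == "__main__":
    # A flat neighborhood selects run-length mode.
    print(select_mode(*local_gradients(5, 5, 5, 5)))
    # Any non-zero gradient selects normal mode.
    print(select_mode(*local_gradients(1, 2, 3, 7)))
```

Flat neighborhoods are common in sparse post-ReLU feature maps, which is why the run-length branch matters for this kind of data.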

3.1.2. Quantization of Gradient Values

To prevent the creation of a vast number of contexts, which could lead to increased complexity and reduced efficiency, further processing of the local gradient values is required. Similar characteristic local gradient values are, therefore, quantized and combined. Given that pixels in the horizontal direction have the same influence on the quantization of local gradient values as those in the vertical direction, the effects of D1, D2, and D3 in the quantization process of local gradient values are considered equivalent. Moreover, D1, D2, and D3 are all quantized into a small number of approximately equal regions. The quantization process for local gradient values in this algorithm is as follows:

$$q_i = \begin{cases} -4, & D_i \le -15 \\ -3, & D_i \le -7 \\ -2, & D_i \le -1 \\ -1, & D_i \le 0 \\ 0, & D_i \le 1 \\ 1, & D_i \le 7 \\ 2, & D_i \le 15 \\ 3, & \text{otherwise} \end{cases} \tag{4}$$
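A direct transcription of Equation (4) is shown below as a minimal sketch. The conditions are evaluated from top to bottom, so the first threshold that the gradient does not exceed determines the quantized value. The name `quantize_gradient` and the table layout are illustrative assumptions.

```python
# (upper bound, quantized value) pairs, checked in order as in Equation (4).
THRESHOLDS = [(-15, -4), (-7, -3), (-1, -2), (0, -1), (1, 0), (7, 1), (15, 2)]


def quantize_gradient(d):
    """Map a local gradient D_i to one of eight regions in the range -4..3."""
    for upper, q in THRESHOLDS:
        if d <= upper:
            return q
    return 3  # the "otherwise" branch: D_i > 15


if __name__ == "__main__":
    for d in (-20, -7, 0, 1, 8, 16):
        print(d, quantize_gradient(d))
```

Eight output levels are what allow each quantized gradient to be stored in 3 bits.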

3.1.3. Gradient Value Combination

After quantization, each gradient value is reduced from its original 9 bits to a 3-bit two’s complement representation, which significantly decreases the cost of hardware implementation. With only 3 bits for each of the three gradients, all combinations fit in 3 × 3 = 9 bits. The three quantized values are combined by shift operations to obtain the final feature representation:

$$Q = \big(((q_1 \ll 3) + q_2) \ll 3\big) + q_3 \tag{5}$$

Compared to the traditional implementation of JPEG-LS [21,22,23,24,25], the step for feature combination in this paper can be accomplished merely through the use of shifters, full adders, and some logic gates, making it highly amenable to hardware implementation.
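The shift-based combination can be sketched in software as below. The sketch assumes that each quantized gradient is first reduced to its 3-bit two's complement field (by masking with `0b111`), so that the shift-and-add is equivalent to concatenating three bit fields and the context address lies between 0 and 511. The masking step and the function names are assumptions made for illustration; the article specifies only the shift-and-add structure.

```python
def to_3bit(q):
    """3-bit two's complement field of a quantized gradient in -4..3."""
    return q & 0b111


def context_address(q1, q2, q3):
    """Combine three 3-bit fields into a 9-bit context address (Equation (5))."""
    return (((to_3bit(q1) << 3) + to_3bit(q2)) << 3) + to_3bit(q3)


if __name__ == "__main__":
    print(context_address(0, 0, 0))     # lowest address
    print(context_address(-1, -1, -1))  # all bits set
```

Because each field occupies disjoint bits, the additions never carry, which is why shifters and simple adders or OR gates suffice in hardware.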

3.1.4. Prediction Value Calculation

Firstly, an initial prediction needs to be made based on the adjacent values of the current data to be encoded:

$$P_x = \begin{cases} \min(R_a, R_b), & R_c \ge \max(R_a, R_b) \\ \max(R_a, R_b), & R_c \le \min(R_a, R_b) \\ R_a + R_b - R_c, & \text{otherwise} \end{cases} \tag{6}$$

After the initial prediction is completed, the prediction value $P_x$ needs to be adjusted using C[Q] and the SIGN function:

$$P_x = \begin{cases} P_x + C[Q], & \mathrm{SIGN} = 1 \\ P_x - C[Q], & \text{otherwise} \end{cases} \tag{7}$$

Once the adjustment of the prediction value $P_x$ is completed, the adjusted prediction value is clamped to the valid range from 0 to MAXVAL:

$$P_x = \begin{cases} 0, & P_x \le 0 \\ \mathrm{MAXVAL}, & P_x \ge \mathrm{MAXVAL} \\ P_x, & \text{otherwise} \end{cases} \tag{8}$$

Here, C[Q] corresponds to the correction value for the prediction error associated with the context parameter Q, and MAXVAL represents the maximum pixel value encountered in the source image being scanned. The calculation for MAXVAL is as follows:

$$\mathrm{MAXVAL} = 2^{p} - 1 \tag{9}$$

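In this context, p refers to the bit depth. Although the data in this article are 8 bits wide, MAXVAL is set to 127, which corresponds to p = 7 in Equation (9) and equals the largest non-negative value of a signed 8-bit number.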
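The three prediction steps (initial prediction, context correction, and clamping) can be sketched as follows, using MAXVAL = 127 as stated in the text. The function names `med_predict` and `corrected_prediction` are illustrative, and `c_q` and `sign` stand for C[Q] and SIGN, whose derivation is outside this sketch.

```python
MAXVAL = 127  # value stated in this article


def med_predict(ra, rb, rc):
    """Initial prediction from the neighbors Ra, Rb, Rc (Equation (6))."""
    if rc >= max(ra, rb):
        return min(ra, rb)
    if rc <= min(ra, rb):
        return max(ra, rb)
    return ra + rb - rc


def corrected_prediction(ra, rb, rc, c_q, sign, maxval=MAXVAL):
    """Apply the context correction (Equation (7)) and clamp (Equation (8))."""
    px = med_predict(ra, rb, rc)
    px = px + c_q if sign == 1 else px - c_q
    return min(max(px, 0), maxval)


if __name__ == "__main__":
    print(med_predict(10, 20, 15))
    print(corrected_prediction(10, 20, 15, c_q=2, sign=1))
```

The first two branches of the predictor pick one neighbor when an edge is likely, while the third branch fits a plane through the three neighbors in smooth regions.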

3.1.5. Prediction Error Calculation

Upon completing the adjustment of the prediction value, the prediction error Errval is calculated. The formula for calculating the prediction error is as follows:

$$\mathrm{Errval} = I_x - P_x \tag{10}$$

After the prediction error calculation is completed, a sign correction is applied to the prediction error:

$$\mathrm{Errval} = \begin{cases} -\mathrm{Errval}, & \mathrm{SIGN} = -1 \\ \mathrm{Errval}, & \text{otherwise} \end{cases} \tag{11}$$

Once the sign correction of the prediction error is completed, a modulo operation is performed on the prediction error:

$$\mathrm{Errval} = \begin{cases} \mathrm{Errval} + \mathrm{RANGE}, & \mathrm{Errval} < (-\mathrm{RANGE}/2) \\ \mathrm{Errval} - \mathrm{RANGE}, & \mathrm{Errval} \ge (1 + \mathrm{RANGE}/2) \\ \mathrm{Errval}, & \text{otherwise} \end{cases} \tag{12}$$

RANGE refers to the range of the prediction error, and its calculation is illustrated as follows:

$$\mathrm{RANGE} = \mathrm{MAXVAL} + 1 \tag{13}$$
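Equations (10) to (13) can be chained into one small routine, sketched below under the assumption that MAXVAL = 127 (so RANGE = 128) and that RANGE/2 is evaluated as an integer. The thresholds are taken literally from Equation (12) as printed here; the function name `prediction_error` is illustrative.

```python
def prediction_error(ix, px, sign, maxval=127):
    """Prediction error with sign correction and modulo reduction."""
    rng = maxval + 1              # Equation (13)
    err = ix - px                 # Equation (10)
    if sign == -1:                # Equation (11)
        err = -err
    if err < -(rng // 2):         # Equation (12), first case
        err += rng
    elif err >= 1 + rng // 2:     # Equation (12), second case
        err -= rng
    return err


if __name__ == "__main__":
    print(prediction_error(100, 90, 1))   # small error is left unchanged
    print(prediction_error(127, 0, 1))    # large error wraps around
```

The modulo step keeps the error within roughly half of RANGE on either side of zero, so large errors are folded onto small magnitudes that Golomb coding encodes cheaply; the decoder can undo the folding because the reconstructed value must lie between 0 and MAXVAL.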

3.1.6. Golomb Coding

Firstly, the Golomb coding parameter k is determined by comparing (N[Q] ≪ k) and A[Q], finding the smallest value of k for which the condition (N[Q] ≪ k) < A[Q] does not hold. Here, A[Q] is the sum of the absolute values of the prediction errors associated with the context parameter Q, and N[Q] is the count of occurrences of the context parameter Q. Since k is adaptively adjusted based on the variables A[Q] and N[Q], Golomb coding can utilize the shortest codewords to encode the mapped prediction errors, resulting in the shortest average codeword length.

Finally, the mapped prediction error Errval is encoded using Golomb coding. This involves a right shift operation on Errval by k bits to obtain the quotient q, which is then unary-coded. The remainder part is encoded using binary coding.
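The parameter selection and the code construction described above can be sketched as follows. Several details are assumptions made for illustration: the signed-to-non-negative mapping in `map_error` is the common interleaving of positive and negative values (the article does not spell out its mapping), the unary part is written as q zeros terminated by a one, and the code-length limiting used in practical JPEG-LS encoders is omitted.

```python
def golomb_k(n_q, a_q):
    """Smallest k for which (N[Q] << k) < A[Q] no longer holds."""
    k = 0
    while (n_q << k) < a_q:
        k += 1
    return k


def map_error(err):
    """Assumed mapping of a signed error to a non-negative integer."""
    return 2 * err if err >= 0 else -2 * err - 1


def golomb_rice_encode(value, k):
    """Unary-coded quotient followed by a k-bit binary remainder."""
    quotient = value >> k
    unary = "0" * quotient + "1"
    remainder = format(value & ((1 << k) - 1), f"0{k}b") if k > 0 else ""
    return unary + remainder


if __name__ == "__main__":
    k = golomb_k(n_q=1, a_q=4)
    print(k, golomb_rice_encode(map_error(-5), k))
```

Because k tracks the average error magnitude A[Q]/N[Q] of the context, contexts with small errors get short codes (small k), while noisy contexts spend more bits on the remainder and fewer on the unary part.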

3.1.7. Runlen Coding

In the run-length encoding mode, the process begins by comparing the input pixel to be encoded, $I_x$, with a reference pixel $R_a$ to tally the run length. This comparison determines whether the current pixel is identical to the reference pixel, and if so, the Runlen counter is incremented, indicating the continuation of a run of identical pixels.
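The run counter can be sketched as a simple loop. The sketch assumes that the samples of one row are available as a Python list and that a run ends at the first sample that differs from the reference value or at the end of the row; the function name `count_run` is illustrative.

```python
def count_run(row, start, ra):
    """Number of consecutive samples equal to Ra, beginning at index start."""
    run = 0
    i = start
    while i < len(row) and row[i] == ra:
        run += 1
        i += 1
    return run


if __name__ == "__main__":
    # Three zeros match the reference value before the run is broken by 5.
    print(count_run([0, 0, 0, 5, 0], start=0, ra=0))
```

Long runs of zeros are typical of sparse activation maps, so a single run length can replace many individually coded samples.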

3.1.8. Context Update

Because the context parameters can be updated as soon as the Golomb coding parameter k has been determined, encoding and context updating can proceed in parallel to enhance efficiency. The context parameters that undergo updating are A[Q], B[Q], C[Q], and N[Q], with A[Q] and B[Q] being updated based on the prediction error Errval of the current pixel to be encoded. To bolster adaptability to local image variations and the general non-stationarity of image statistics, A[Q] and N[Q] are halved when N[Q] reaches a predetermined threshold. In this paper, this threshold is set to 64. During each iteration, B[Q] governs whether C[Q] is incremented or decremented by 1, and B[Q] is in turn adjusted to reflect the change in C[Q]. Additionally, clamping is applied to the variables to constrain their possible value ranges, which improves computational and statistical efficiency.

During the period of context parameter updates, context conflicts occur when consecutive pixels are processed using the same context [26,27,28]. As shown in Figure 2, in this work, a serial scheme is adopted for updating contexts belonging to different branches, thus avoiding context conflicts without incurring any additional hardware overhead.
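A software sketch of one context update is given below. It follows the update rules of the standard LOCO-I/JPEG-LS algorithm [13] with the reset threshold of 64 used in this paper, which is an assumption about details the text does not spell out: in particular, the halving of B[Q] and the limits of -128 and 127 on C[Q] come from the standard rather than from this article. The dictionary layout and the name `update_context` are illustrative.

```python
RESET = 64               # halving threshold used in this paper
MIN_C, MAX_C = -128, 127  # clamping limits for C[Q], taken from the standard


def update_context(ctx, errval):
    """Update A, B, C, N of one context in place and return it."""
    ctx["B"] += errval
    ctx["A"] += abs(errval)
    if ctx["N"] == RESET:
        # Halve the accumulated statistics so that recent samples dominate.
        ctx["A"] >>= 1
        ctx["B"] = ctx["B"] >> 1 if ctx["B"] >= 0 else -((1 - ctx["B"]) >> 1)
        ctx["N"] >>= 1
    ctx["N"] += 1
    if ctx["B"] <= -ctx["N"]:
        # Errors are biased negative: lower the correction value C[Q].
        ctx["B"] += ctx["N"]
        if ctx["C"] > MIN_C:
            ctx["C"] -= 1
        if ctx["B"] <= -ctx["N"]:
            ctx["B"] = -ctx["N"] + 1
    elif ctx["B"] > 0:
        # Errors are biased positive: raise the correction value C[Q].
        ctx["B"] -= ctx["N"]
        if ctx["C"] < MAX_C:
            ctx["C"] += 1
        if ctx["B"] > 0:
            ctx["B"] = 0
    return ctx


if __name__ == "__main__":
    print(update_context({"A": 4, "B": 0, "C": 0, "N": 1}, 3))
```

Each update reads and writes the same four entries of one context, which is why two consecutive samples that share a context cannot be updated independently and why the updates are serialized in this design.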

3.2. JPEG-LS Decode

The decoding process in JPEG-LS is largely analogous to the encoding process. Initially, gradient calculations are performed, followed by identifying the coding mode. Based on the determined coding mode, either Golomb decoding or Runlen decoding is executed.

If Golomb decoding is applied, the context prediction must also be taken into account. This involves adding the context-predicted value to the decoded Golomb value to retrieve the original prediction error. Subsequently, this prediction error is utilized to reconstruct the original data point by applying the inverse operation of the prediction step conducted during encoding.

4. Results

4.1. Hardware Evaluation

On the Xilinx Zynq UltraScale+ MPSoC ZCU102 platform, the design synthesis and implementation were carried out using Vivado 2019.2. We wrote a behavioral model of the CIM in Verilog and then placed it in the testbench of an FPGA for testing. Table 1 provides information regarding the system overhead:

The hardware consumption amounted to 8728 LUTs and 1208 FFs with a data throughput of 32 bit/cycle at 100 MHz.

Table 2 shows the hardware information for the TSMC 28 nm process. After adding the compression module, the area increased by 8.96%, and the energy consumption decreased by approximately 6.55%. The overall hardware overhead of the system was 1.46 M Gate Count, where the hardware overhead of the CIM was obtained from [29]. The energy consumption without the compression module was 6.41 pJ/MAC. The energy consumption with the compression module was 5.99 pJ/MAC. The energy consumption of 8-bit read/write operations and 8-bit SRAM CIM MAC operations were obtained from [29,30].
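The relative changes quoted above follow directly from the figures in Table 2, as the short check below shows. The helper name `percent_change` is illustrative.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Figures from Table 2 (TSMC 28 nm).
area_change = percent_change(1.34, 1.46)    # gate count in millions
energy_change = percent_change(6.41, 5.99)  # energy in pJ/MAC

if __name__ == "__main__":
    print(f"area:   {area_change:+.2f}%")
    print(f"energy: {energy_change:+.2f}%")
```

The area grows because the compression module adds logic, while the energy per MAC falls even though the figure with compression includes the compression module.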

4.2. Compression Ratio

In the experiments, we utilized the ImageNet ILSVRC2012 dataset as the input images for validation. The CNN models employed were pre-trained models from PyTorch 2.1.0/Torchvision 0.16.0: VGG16 [31], ResNet34 [32], MobileNetV2 [33], and InceptionV3 [34]. During the forward propagation in the VGG16, ResNet34, and InceptionV3 models, all feature maps were extracted from the output of the ReLU layers. For the MobileNetV2 model, all feature maps were obtained from the output of the CONV layers. Both the weights and the feature maps were quantized to an 8-bit Dynamic Fixed Point.

Figure 3, Figure 4 and Figure 5 display the compression ratios for each layer of VGG16, ResNet34, MobileNetV2, and InceptionV3. As the depth of the layers increases, the sparsity of the data grows, leading to an overall upward trend in the compression ratio.

4.3. Comparison

Table 3 presents the hardware overhead and compression ratios of our proposed method, along with a comparison with three other compression schemes. Our proposed method achieved a data throughput of 32 bits per cycle at 600 MHz in a TSMC 28 nm process, with a hardware cost of 122 K Gate Count. The overall hardware overhead of the system was 1.46 M Gate Count, and the energy efficiency was 1.24 TOPS/W. The system’s gate count and energy efficiency are estimated values [29].

The experimental results demonstrate that the method significantly reduces the storage requirements for intermediate layer data. Specifically, the average compression ratios were 6.44× for VGG16 with a 61% accuracy, 3.62× for ResNet34 with a 64% accuracy, 1.67× for MobileNetV2 with a 70.5% accuracy, and 2.31× for InceptionV3 with a 70.7% accuracy.

TCASI’22 [8] uses Discrete Cosine Transform (DCT) combined with quantization to remove high-frequency information for compressing intermediate layers; TCASII’23 [35] uses a simple previous-value prediction method, which is essentially a one-dimensional prediction; ISCAS’22 [36] employs a block-based approach combined with least squares fitting for parameter estimation; our approach utilizes two-dimensional prediction and takes into account the correlation between channels in the intermediate layers.

Overall, TCASI’22 achieved an average compression ratio of 1.41× with a compression hardware overhead of 13%; TCASII’23 achieved an average compression ratio of 2.04× with a hardware overhead of 2.09 K Gate Count; ISCAS’22 achieved an average compression ratio of 1.33× with a compression hardware overhead of 10.44%; our approach achieved an average compression ratio of 3.51× with a compression hardware overhead of 8.96%.
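The 3.51× figure for our approach is the arithmetic mean of the four per-model compression ratios reported above, which the short check below reproduces. It also converts each ratio into the fraction of the original storage that remains after compression. The variable and function names are illustrative.

```python
# Average compression ratios reported in this paper.
ratios = {"VGG16": 6.44, "ResNet34": 3.62, "MobileNetV2": 1.67, "InceptionV3": 2.31}

average_ratio = sum(ratios.values()) / len(ratios)


def storage_fraction(ratio):
    """Fraction of the uncompressed size that is still needed."""
    return 1.0 / ratio


if __name__ == "__main__":
    print(f"average ratio: {average_ratio:.2f}x")
    for name, r in ratios.items():
        print(f"{name}: {storage_fraction(r) * 100:.1f}% of original size")
```

An arithmetic mean of ratios weights each network equally, regardless of how much intermediate data it produces.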

5. Conclusions

This paper explores an enhanced JPEG-LS algorithm designed for CIM-based CNN hardware architectures, focusing on improving the compression efficiency of intermediate layer data in neural networks. Through experiments conducted on four distinct CNN architectures (VGG16, ResNet34, MobileNetV2, and InceptionV3), we observed an increase in data sparsity with greater network depth, leading to an overall improvement in compression ratios. The experimental outcomes confirm that our method significantly reduces the storage demands for intermediate layer data, achieving average compression ratios of 6.44× for VGG16, 3.62× for ResNet34, 1.67× for MobileNetV2, and 2.31× for InceptionV3.

Furthermore, we evaluated the practical implementation of our proposed method on the Xilinx Zynq UltraScale+ MPSoC ZCU102 platform and in a TSMC 28 nm process, ensuring that the algorithm is not only theoretically efficient but also highly viable for real-world deployment. Leveraging the prediction error coding and run-length coding modes of JPEG-LS, coupled with an adaptive context update mechanism, our approach adjusts encoding parameters to accommodate the characteristics of varying neural network layers.

In conclusion, the improved JPEG-LS algorithm based on CIM offers a robust solution for the efficient compression of intermediate layers in neural networks. It not only effectively decreases storage requirements but also demonstrates commendable adaptability and scalability under hardware resource constraints. This research provides critical technological support for accelerating the deployment of deep learning models, particularly in edge computing and mobile device scenarios where resource optimization is paramount.

Author Contributions

J.H. was responsible for the hardware design of the algorithm, literature search, writing, and data analysis. H.X. was in charge of literature research and software implementation. L.D. and Y.D. planned and supervised the whole project. All authors contributed to discussing the results and the manuscript revisions. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

We thank Heng Zhang and Yichuan Bai for their technical guidance on the project.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
2. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
3. Si, X.; Chen, J.J.; Tu, Y.N.; Huang, W.H.; Wang, J.H.; Chiu, Y.C.; Wei, W.C.; Wu, S.Y.; Sun, X.; Liu, R.; et al. Twin-8T SRAM Computation-in-Memory Unit-Macro for Multibit CNN-Based AI Edge Processors. IEEE J. Solid-State Circuits 2020, 55, 189–202.
4. Yue, J.; Yuan, Z.; Feng, X.; He, Y.; Zhang, Z.; Si, X.; Liu, R.; Chang, M.F.; Li, X.; Yang, H.; et al. 14.3 A 65nm Computing-in-Memory-Based CNN Processor with 2.9-to-35.8TOPS/W System Energy Efficiency Using Dynamic-Sparsity Performance-Scaling Architecture and Energy-Efficient Inter/Intra-Macro Data Reuse. In Proceedings of the 2020 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 16–20 February 2020; pp. 234–236.
5. Xue, C.X.; Chen, W.H.; Liu, J.S.; Li, J.F.; Lin, W.Y.; Lin, W.E.; Wang, J.H.; Wei, W.C.; Huang, T.Y.; Chang, T.W.; et al. Embedded 1-Mb ReRAM-Based Computing-in-Memory Macro With Multibit Input and Weight for CNN-Based AI Edge Processors. IEEE J. Solid-State Circuits 2020, 55, 203–215.
6. Han, S.; Mao, H.; Dally, W.J. DeepCompression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. Fiber 2015, 56, 3–7.
7. Xie, C.; Shao, Z.; Zhao, N.; Du, Y.; Du, L. An Efficient CNN Inference Accelerator Based on Intra- and Inter-Channel Feature Map Compression. IEEE Trans. Circuits Syst. I Regul. Pap. 2023, 70, 3625–3638.
8. Shao, Z.; Chen, X.; Du, L.; Chen, L.; Du, Y.; Zhuang, W.; Wei, H.; Xie, C.; Wang, Z. Memory-Efficient CNN Accelerator Based on Interlayer Feature Map Compression. IEEE Trans. Circuits Syst. I Regul. Pap. 2022, 69, 668–681.
9. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423.
10. Matsuo, Y. Predictive Coding Using Local Decoded Images with Different Degrees of Blurring. In Proceedings of the 2023 IEEE 13th International Conference on Consumer Electronics-Berlin (ICCE-Berlin), Berlin, Germany, 3–5 September 2023; pp. 25–28.
11. Barowsky, M.; Mariona, A.; Calmon, F.P. Predictive Coding for Lossless Dataset Compression. In Proceedings of the ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 1545–1549.
12. Zhang, J.; Zhao, D.; Jiang, F. Spatially directional predictive coding for block-based compressive sensing of natural images. In Proceedings of the 2013 IEEE International Conference on Image Processing, Melbourne, VIC, Australia, 15–18 September 2013; pp. 1021–1025.
13. Weinberger, M.J.; Seroussi, G.; Sapiro, G. The LOCO-I lossless image compression algorithm: Principles and standardization into JPEG-LS. IEEE Trans. Image Process. 2000, 9, 1309–1324.
14. Marpe, D.; Schwarz, H.; Wiegand, T. Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard. IEEE Trans. Circuits Syst. Video Technol. 2003, 13, 620–636.
15. Rissanen, J.; Langdon, G.G. Arithmetic Coding. IBM J. Res. Dev. 1979, 23, 149–162.
16. Jas, A.; Ghosh-Dastidar, J.; Ng, M.-E.; Touba, N.A. An efficient test vector compression scheme using selective Huffman coding. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2003, 22, 797–806.
17. Wallace, G.K. The JPEG still picture compression standard. IEEE Trans. Consum. Electron. 1992, 38, xviii–xxxiv.
18. Chang, C.-L.; Girod, B. Direction-Adaptive Discrete Wavelet Transform for Image Compression. IEEE Trans. Image Process. 2007, 16, 1289–1302.
19. Skodras, A.; Christopoulos, C.; Ebrahimi, T. The JPEG 2000 still image compression standard. IEEE Signal Process. Mag. 2001, 18, 36–58.
20. Cintra, R.J.; Bayer, F.M. A DCT Approximation for Image Compression. IEEE Signal Process. Lett. 2011, 18, 579–582.
21. Klimesh, M.; Stanton, V.; Watola, D. Hardware implementation of a lossless image compression algorithm using a field programmable gate array. Mars 2001, 4, 5–72.
22. Kim, B.S.; Baek, S.; Kim, D.S.; Chung, D.J. A high performance fully pipeline JPEG-LS encoder for lossless compression. IEICE Electron. Express 2013, 10, 20130348.
23. Nazar, F.; Murugan, S. Implementation of JPEG-LS compression algorithm for real time applications. In Proceedings of the 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT), Chennai, India, 3–5 March 2016; pp. 2772–2774.
24. Jallouli, S.; Zouari, S.; Masmoudi, N.; Masmoudi, A. An Adaptive Block-Based Histogram Packing for Improving the Compression Performance of JPEG-LS for Images with Sparse and Locally Sparse Histograms. In International Conference on Image and Signal Processing; Springer: Cham, Switzerland, 2018; pp. 63–71.
25. Daryanavard, H.; Abbasi, O.; Talebi, R. FPGA implementation of JPEG-LS compression algorithm for real time applications. In Proceedings of the 2011 19th Iranian Conference on Electrical Engineering, Tehran, Iran, 17–19 May 2011; pp. 1–4.
26. Ferretti, M.; Boffadossi, M. A parallel pipelined implementation of LOCO-I for JPEG-LS. In Proceedings of the 17th International Conference on Pattern Recognition, Cambridge, UK, 26 August 2004; pp. 769–772.
27. Merlino, P.; Abramo, A. A fully pipelined architecture for the LOCO-I compression algorithm. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2009, 17, 967–971.
28. Chen, L.; Yan, L.; Sang, H.; Zhang, T. High-Throughput Architecture for Both Lossless and Near-lossless Compression Modes of LOCO-I Algorithm. IEEE Trans. Circuits Syst. Video Technol. 2019, 29, 3754–3764.
29. Si, X.; Tu, Y.N.; Huang, W.H.; Su, J.W.; Lu, P.J.; Wang, J.H.; Liu, T.W.; Wu, S.Y.; Liu, R.; Chou, Y.C. 15.5 A 28nm 64Kb 6T SRAM Computing-in-Memory Macro with 8b MAC Operation for AI Edge Chips. In Proceedings of the 2020 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 16–20 February 2020; pp. 246–248.
30. Sun, C.; Chen, C.H.; Kurian, G.; Wei, L.; Miller, J.; Agarwal, A.; Peh, L.S.; Stojanovic, V. DSENT - A Tool Connecting Emerging Photonics with Electronics for Opto-Electronic Networks-on-Chip Modeling. In Proceedings of the 2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip, Lyngby, Denmark, 9–11 May 2012; pp. 201–210.
31. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556.
32. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
33. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520.
34. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
35. Yan, B.-K.; Ruan, S.-J. Area Efficient Compression for Floating-Point Feature Maps in Convolutional Neural Network Accelerators. IEEE Trans. Circuits Syst. II Express Briefs 2023, 70, 746–750.
36. Xie, C.; Shao, Z.; Xu, H.; Chen, X.; Du, L.; Du, Y.; Wang, Z. Deep Neural Network Interlayer Feature Map Compression Based on Least-Squares Fitting. In Proceedings of the 2022 IEEE International Symposium on Circuits and Systems (ISCAS), Austin, TX, USA, 27 May–1 June 2022; pp. 3398–3402.

Figure 1. Overall system architecture.

Figure 2. JPEG-LS encode.

Figure 3. Per-layer compression ratio for VGG16 and ResNet34.

Figure 4. Per-layer compression ratio for MobileNetV2.

Figure 5. Per-layer compression ratio for InceptionV3.

Table 1. Hardware evaluation on FPGA.

Resource	Used	Available	Utilization (%)
LUT	8728	274,080	3.18
LUTRAM	156	144,000	0.11
FF	1208	548,160	0.22
BRAM	4.5	912	0.49

Table 2. Hardware evaluation on TSMC 28 nm.

	Without Compression	With Compression
Gate Count (M)	1.34	1.46
Energy Consumption (pJ/MAC)	6.41	5.99
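
The relative cost and benefit implied by Table 2 follow from simple arithmetic. The Python sketch below is illustrative only: the helper name `relative_change` is hypothetical, and the input values are copied from the table rather than measured.

```python
def relative_change(before: float, after: float) -> float:
    """Return the signed relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Values copied from Table 2 (TSMC 28 nm).
gate_count_m = {"without": 1.34, "with": 1.46}        # million gates
energy_pj_per_mac = {"without": 6.41, "with": 5.99}   # pJ per MAC

gate_overhead_pct = relative_change(gate_count_m["without"], gate_count_m["with"])
energy_change_pct = relative_change(
    energy_pj_per_mac["without"], energy_pj_per_mac["with"]
)

print(f"Gate count change: {gate_overhead_pct:+.2f}%")
print(f"Energy change:     {energy_change_pct:+.2f}%")
```

With the tabulated values, this gives a gate count increase of about 9% and an energy reduction of about 6.6% per MAC.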

Table 3. Comparison with other intermediate layer compression methods.

Metric	Network	This Work	TCASI’22 [8]	TCASII’23 [35]	ISCAS’22 [36]
Technology (nm)		28	28	130	28
Clock Rate (MHz)		600	700	N/A	800
Energy Efficiency (TOPS/W)		1.24	2.16	N/A	1.14
Compression Throughput (bits/cycle)		32	N/A	8	N/A
Compression Gate Count (K)		122	N/A	2.09	136
Compression Ratio	VGG16 (61%)	6.44	N/A	2.54	1.37
Compression Ratio	ResNet34 (64%)	3.62	N/A	1.93	N/A
Compression Ratio	MobileNetV2 (70.5%)	1.67	1.41	1.66	N/A
Compression Ratio	InceptionV3 (70.7%)	2.31	N/A	N/A	1.28

“N/A” indicates that the field is not applicable.
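
To make the compression ratios in Table 3 more tangible, the sketch below converts each ratio into the fraction of intermediate-layer storage saved. It assumes the ratio is defined as original size divided by compressed size; the table does not restate the definition, so treat this as an assumption. The helper `storage_saved` is hypothetical.

```python
def storage_saved(compression_ratio: float) -> float:
    """Fraction of storage saved, assuming ratio = original size / compressed size."""
    if compression_ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return 1.0 - 1.0 / compression_ratio


# "This Work" column of Table 3.
this_work = {
    "VGG16": 6.44,
    "ResNet34": 3.62,
    "MobileNetV2": 1.67,
    "InceptionV3": 2.31,
}

for model, ratio in this_work.items():
    print(f"{model:12s} {ratio:.2f}x -> {storage_saved(ratio):.1%} of storage saved")
```

Under that assumption, the 6.44× ratio for VGG16 corresponds to roughly 84% less storage, while the 1.67× ratio for MobileNetV2 corresponds to roughly 40%.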


© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Hua, J.; Xu, H.; Du, Y.; Du, L. Improved JPEG Lossless Compression for Compression of Intermediate Layers in Neural Networks Based on Compute-In-Memory. Electronics 2024, 13, 3872. https://doi.org/10.3390/electronics13193872

