Article

Towards an Efficient Remote Sensing Image Compression Network with Visual State Space Model

1 School of Microelectronics, Xi’an Jiaotong University, Xi’an 710049, China
2 School of Electronics and Information, Northwestern Polytechnical University, Xi’an 710072, China
* Author to whom correspondence should be addressed.
Submission received: 27 November 2024 / Revised: 24 January 2025 / Accepted: 24 January 2025 / Published: 26 January 2025

Abstract

In the past few years, deep learning has achieved remarkable advances in the area of image compression. Remote sensing image compression networks focus on enhancing the similarity between the input and reconstructed images, effectively reducing the storage and bandwidth requirements for high-resolution remote sensing images. As the network’s effective receptive field (ERF) expands, it can capture more feature information across the remote sensing image, thereby reducing spatial redundancy and improving compression efficiency. However, most learned image compression (LIC) techniques are CNN-based or transformer-based and often fail to balance a global ERF against computational complexity. To alleviate this issue, we propose a learned remote sensing image compression network with a visual state space model, named VMIC, to achieve a better trade-off between computational complexity and performance. Specifically, instead of stacking small convolution kernels or heavy self-attention mechanisms, we employ a 2D-bidirectional selective scan mechanism. Every element within the feature map aggregates information from multiple spatial positions, establishing a global effective receptive field with linear computational complexity. We extend this to an omni-selective scan for the global-spatial correlations within our Channel and Global Context Entropy Model (CGCM), enabling the integration of spatial and channel priors to minimize redundancy across slices. Experimental results demonstrate that the proposed method achieves a superior trade-off between rate-distortion performance and complexity. Furthermore, compared with traditional codecs and learned image compression algorithms, our model achieves BD-rate reductions of −4.48% and −9.80% over the state-of-the-art VTM on the AID and NWPU VHR-10 datasets, respectively, as well as −6.73% and −7.93% on the panchromatic and multispectral images of the WorldView-3 remote sensing dataset.

1. Introduction

Remote sensing images are digital representations of information observed by remote sensing optical cameras [1,2]. In recent years, advancements in high-definition satellite imaging technologies have led to substantial improvements in the spectral and spatial resolution of remote sensing images, posing significant challenges for their storage and transmission. Lossy image compression offers an effective solution. Traditional methods, including Joint Photographic Experts Group (JPEG) [3], JPEG2000 [4], High Efficiency Video Coding (HEVC) [5], and Versatile Video Coding (VVC) [6], perform nonlinear transformations through manually designed processing [7,8]. Recently, deep learning-based remote sensing processing methods have achieved significant advances, including multi-feature constraints [9], object extraction [10,11], and detection [12] from remote sensing images. These advances have also been extended to the field of image compression, where several learned image compression (LIC) methods [13,14] have surpassed VTM (the VVC-Intra codec) in peak signal-to-noise ratio (PSNR), demonstrating their potential for future applications in remote sensing image compression.
In earlier studies, JPEG [3] and JPEG2000 [4] were commonly used for lossy compression of remote sensing images. Traditional image codecs typically involve three key components: feature transformation, quantization, and encoding. During the transformation stage, the input image is divided into macroblocks, followed by application of the discrete cosine transform (DCT). With advancements in traditional codecs, several methods based on BPG (the HEVC-intra codec) [5] have also been proposed [15,16]. However, these manually designed transform modules often result in blurring and blocking artifacts. Like these traditional methods, deep learning-based image compression frameworks also rely on a transformation module, a quantization operation, and an entropy module as their core components. In these frameworks, each component is implemented as a learnable network and jointly optimized in an end-to-end manner. This design effectively mitigates the block artifacts typically associated with manually designed modules.
The primary goal of LIC methods is to achieve high compression rates while minimizing distortion between the reconstructed and input images. To meet this challenge, various neural network architectures have been developed using rate-distortion (R-D) optimization objectives. These methods leverage Convolutional Neural Networks (CNNs) and transformers to efficiently capture image features and latent representations. Initially, many methods relied on CNN architectures; Ballé et al. [17] first introduced convolutional blocks to enhance feature capture. With the advent of the vision transformer, Swin-transformer-based image compression models were explored in [13,18]. Transformers effectively address the limitations of CNNs in capturing global features. Liu et al. [14] therefore combined CNNs for extracting local features with transformers for capturing global dependencies.
The effective receptive field (ERF) was first introduced in [19]. It refers to the region of the input space that can influence a specific output unit. As the ERF of a network expands, more feature information can be extracted from a wider range. A global ERF therefore enables the model to better capture long-range dependencies and complex structural patterns when processing images. Transformer-based image compression methods outperform CNN-based methods due to their broader receptive fields. However, balancing a global receptive field with efficient computation is challenging. Recently, Mamba models based on the state space sequence paradigm have been proposed as an effective solution that balances global feature extraction and computational complexity. Leveraging Mamba’s advantages, we aim to extract rich contextual information from remote sensing images. Building on this, we propose an efficient remote sensing image compression network with the visual Mamba architecture, named VMIC, to address these issues. We visualize the ERF of different schemes on the Aerial Image Dataset (AID) in Figure 1; they are based on a CNN, a transformer, a CNN–transformer hybrid, and Mamba, respectively. The proposed model is the only one that offers a global ERF. Although transformer-based methods theoretically have the potential for global coverage, they involve quadratic computational complexity. In contrast, our model achieves global perception through a 2D bidirectional scanning mechanism while maintaining linear complexity.
The specific contributions of this paper include the following.
  • Based on the variational auto-encoder (VAE), we propose a remote sensing image compression network with the visual state space model, replacing traditional CNN and transformer methods to achieve a balance between computational complexity and performance.
  • To extract global-spatial features effectively while maintaining linear computational complexity, we introduce the Cross-Selective Scan Block (CSSB) as the fundamental transformation block. The CSSB employs a 2D-bidirectional selective scan strategy to replace the self-attention mechanism.
  • To address the challenge of estimating a more accurate entropy model in remote sensing image compression networks, we propose an Omni-Selective Scan mechanism for the channel and global context model (CGCM) in our network, which performs bidirectional scanning along three different directions to model the data flow, enabling global-spatial context interaction between different slices.
We conducted extensive experiments on remote sensing image compression, and the results show that the Mamba-based approach outperforms other state-of-the-art LIC methods and the latest VTM in terms of rate-distortion (R-D) performance. Specifically, it achieves Bjontegaard-delta-rate (BD-rate) reductions of −4.48% and −9.80% over VTM on the Aerial Image Dataset (AID) and the Northwestern Polytechnical University Very High Resolution 10-Class Dataset (NWPU VHR-10), respectively, as well as −6.73% and −7.93% on the panchromatic and multispectral images of the WorldView-3 dataset.
This paper is organized as follows. First, we review the related work in Section 2. Section 3 offers a detailed explanation of the methodology used in the proposed framework. In addition, the experimental setup and comparative analysis with other methods are discussed in Section 4, along with ablation studies. Finally, Section 5 presents the conclusion.

2. Related Work

2.1. Learning-Based Image Compression

Learning-based image compression (LIC) methods employ a variational autoencoder (VAE) [17], which obtains a latent feature $y$ from the input image $x$ and then reconstructs the image as $\hat{x}$. The fundamental process can be written as follows:
$$\hat{y} = \lfloor g_a(x) \rceil, \quad \hat{x} = g_s(\hat{y}), \quad \hat{z} = \lfloor h_a(y) \rceil, \quad \Theta = h_s(\hat{z}),$$
In Equation (1), the analysis transform $g_a$ and the synthesis transform $g_s$ are the main transforms, with $\lfloor \cdot \rceil$ standing for quantization. Similarly, $h_a$ and $h_s$ are the hyperprior transforms, which are used to extract side information through the hyperprior network. The hyperprior $\hat{z}$ is then employed to estimate the parameters $\Theta = (\mu, \sigma^2)$ of the entropy model.
The optimization goal of the loss function is to minimize the bit rate while reducing distortion, formulated as follows:
$$\mathcal{L} = R(\hat{y}) + R(\hat{z}) + \lambda D(x, \hat{x}), \quad R(\hat{y}) = \mathbb{E}\left[-\log_2 p_{\hat{y}|\hat{z}}(\hat{y}|\hat{z})\right], \quad R(\hat{z}) = \mathbb{E}\left[-\log_2 p_{\hat{z}}(\hat{z})\right], \quad p_{\hat{y}|\hat{z}} = \left[\mathcal{N}(\mu, \sigma^2) * \mathcal{U}(-0.5, 0.5)\right](\hat{y}),$$
where $R(\hat{y})$ and $R(\hat{z})$ represent the bit rates of the latent features $\hat{y}$ and $\hat{z}$, respectively, and are calculated from the probability distributions $p_{\hat{y}|\hat{z}}$ and $p_{\hat{z}}$. The Lagrange parameter $\lambda$ controls the balance between bit rate and distortion. $D$ denotes the distortion between the original image $x$ and its reconstruction $\hat{x}$, measured via the mean squared error (MSE). $\mathcal{N}$ refers to the Gaussian conditional entropy model. To overcome the non-differentiability of the quantization operation during training, noise sampled from the uniform distribution $\mathcal{U}(-0.5, 0.5)$ is added to the latents.
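As a concrete illustration of this objective, the following is a minimal PyTorch sketch of the rate-distortion loss, assuming the entropy models return per-element likelihoods for the quantized latents; the tensor names (`likelihoods_y`, `likelihoods_z`) are illustrative rather than taken from any specific implementation.

```python
import math
import torch
import torch.nn.functional as F

def rd_loss(x, x_hat, likelihoods_y, likelihoods_z, lam):
    """R(y_hat) + R(z_hat) + lambda * D(x, x_hat), with rates in bits per pixel."""
    num_pixels = x.size(0) * x.size(2) * x.size(3)
    # Bit rates: expected negative log2-likelihood of the quantized latents.
    bpp_y = torch.log(likelihoods_y).sum() / (-math.log(2) * num_pixels)
    bpp_z = torch.log(likelihoods_z).sum() / (-math.log(2) * num_pixels)
    # Distortion: mean squared error between the input and its reconstruction.
    distortion = F.mse_loss(x_hat, x)
    return bpp_y + bpp_z + lam * distortion
```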
Recently, learning-based image compression methods have made significant progress. Ballé et al. [17] first introduced a CNN-based image compression framework. In [20], a hyperprior network is introduced to extract additional side information from the latent representation $y$, aiming to enhance rate-distortion (R-D) performance. Later, to further extract feature information in the main transformation and entropy modules, several studies explored various types of CNNs; refs. [21,22] propose residual blocks and attention modules in both the encoder and decoder backbones. However, CNN-based models focus only on local features, making it difficult to capture global information. The emergence of transformers has effectively addressed this challenge. As a result, more powerful transformation backbones (e.g., TIC [18], S2LIC [13]) have been introduced to replace stacked residual blocks. Owing to their powerful sequence modeling capabilities and global self-attention mechanisms, transformer-based methods have demonstrated superior performance in image compression. However, self-attention must compute the weights between each position and all others, so the computational complexity grows quadratically with the input sequence length. Thus, these models face the challenge of balancing computational complexity with rate-distortion performance.

2.2. Learned Remote Sensing Image Compression

With advancements in imaging technology, the growing number of high-definition remote sensing images has presented significant challenges for storage and transmission. Compared to other natural images, remote sensing images possess richer texture features and content information. More efficient remote sensing image compression methods not only facilitate storage but also enhance image reconstruction, retaining more detailed features and semantic information and thereby preserving accuracy in downstream machine tasks such as remote sensing detection and instance segmentation. In the past, traditional compression models such as JPEG [3] and BPG [5] were widely used for remote sensing images. However, these traditional codecs rely on manually designed transformation modules, and the resulting block artifacts significantly limit the compression quality of remote sensing images at lower bit rates.
Inspired by VAE-based methods [13,17,18,20,21], some studies have begun to apply these models to remote sensing. Li et al. [23] employed a convolutional neural network to process remote sensing images and performed efficient image transformation through a small-scale wavelet transform. Chong et al. [24] utilized a high-order Markov random field as an attention network to capture latent semantic information in high-resolution remote sensing images. Ref. [25] further investigated the information redundancy between local and non-local features, leveraging this prior information to improve the accuracy of the probability distribution in entropy modeling. In recent work, the multi-level domain similarity between input and reconstructed images was first exploited in [1], which employed a multi-level domain similarity enhancement-guided network (MDSNet) to extract global and channel features. Ref. [26] integrated a latent feature selection module into the image compression framework to reconstruct regions of interest (ROI), enabling adaptive bit allocation for various downstream tasks.
In addition to various VAE-based models, some research also focuses on generative adversarial network (GAN) models. For instance, Pan et al. [27] fused content and texture information within images, proposing a GAN-based coupled remote sensing image compression model that achieves higher image reconstruction quality at a lower bit rate. Han et al. [28] employed an edge-guided adversarial network to enhance the restoration of edge structures and texture details in the reconstructed images. Furthermore, unlike VAE-based and GAN-based approaches, ref. [29] utilized a combination of a diffusion model and image priors to improve the quality of the reconstructed images. The first stage employs a VAE-based architecture to obtain the latent features from the input image, while a conditional diffusion model and semantic guidance are utilized to produce superior reconstructed remote sensing images in the subsequent stage.

2.3. State Space Models

Motivated by classical control system theory, the state space model [30] has recently gained popularity in computer vision and natural language processing. The structured state spaces for sequences model (S4) [31,32] leverages the specialized HiPPO structure to effectively capture long-range dependencies, emerging as a strong contender among sequence modeling frameworks. Based on S4, a new selective state space model named Mamba [33] was proposed. Mamba has a selective mechanism and an efficient hardware-aware design that scales linearly with input length, making it superior to transformers in natural language processing.

In addition, some pioneering studies have applied the Mamba architecture to visual tasks [34,35,36] and extended it to image super-resolution [37,38] and remote sensing segmentation [39,40,41]. Zhao et al. [42] were the first to apply the Mamba model to dense prediction in remote sensing images; they performed multi-directional global-spatial modeling to capture large spatial features, achieving better accuracy than transformer-based models. In [43], the authors applied visual Mamba to remote sensing change detection, modeling different temporal and spatial contexts to obtain precise change information. For remote sensing image classification, the RSMamba architecture based on the state space model demonstrates excellent performance on multiple remote sensing datasets [44]. In this paper, we apply visual Mamba to a remote sensing image compression framework, providing a more efficient model for processing remote sensing imagery.

3. Materials and Methods

3.1. Preliminaries

The state space model (SSM) is rooted in control system theory and draws inspiration from continuous-time linear time-invariant systems. In an SSM, the input signal $v(t) \in \mathbb{R}$ is transformed into the output response $u(t) \in \mathbb{R}$. The dynamics of this system are given by the following equations:
$$h'(t) = \mathbf{A} h(t) + \mathbf{B} v(t), \quad u(t) = \mathbf{C} h(t) + \mathbf{D} v(t),$$
where $\mathbf{A} \in \mathbb{R}^{N \times N}$, $\mathbf{B} \in \mathbb{R}^{N \times 1}$, $\mathbf{C} \in \mathbb{R}^{1 \times N}$, and $\mathbf{D} \in \mathbb{R}$ are the weight parameters for a state of size $N$, and $h(t) \in \mathbb{R}^{N}$ denotes the hidden latent state at time $t$.
In the context of deep learning, the state space model requires discretization for effective application. Specifically, the continuous differential Equation (3) is converted into its discrete counterpart via the zero-order hold discretization method. Consequently, the discretized formula can be expressed as follows:
$$h_{t+1} = \bar{\mathbf{A}}_t h_t + \bar{\mathbf{B}}_t v_{t+1},$$
where the discretized parameter $\bar{\mathbf{A}}_t = e^{\mathbf{A} \Delta t}$, $\bar{\mathbf{B}}_t = \mathbf{B} \Delta t$ is its first-order Taylor approximation, and $\Delta$ is the time-scale parameter.
Given that the state space model is executed iteratively and the matrices $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ are linear time-invariant, Equation (4) can be further unrolled into a convolutional form that enables efficient parallel computation of the recurrence. Given an input sequence of length $k$ and the structured convolution kernel $\bar{\mathbf{K}} \in \mathbb{R}^{k}$, the convolution operation is as follows:
$$\bar{\mathbf{K}} = \left( \mathbf{C}\bar{\mathbf{B}},\ \mathbf{C}\bar{\mathbf{A}}\bar{\mathbf{B}},\ \ldots,\ \mathbf{C}\bar{\mathbf{A}}^{k-1}\bar{\mathbf{B}} \right), \quad y = \bar{\mathbf{K}} * x,$$
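To make the discretization and convolution-form computation concrete, the following is a small NumPy sketch under the assumption of a diagonal state matrix $\mathbf{A}$ (so the matrix exponential reduces to an element-wise exponential); all variable names are illustrative.

```python
import numpy as np

def discretize(A, B, dt):
    """Zero-order hold: A_bar = exp(A*dt) (exact here because A is diagonal),
    B_bar ~ B*dt as a first-order Taylor approximation."""
    A_bar = np.diag(np.exp(np.diag(A) * dt))
    B_bar = B * dt
    return A_bar, B_bar

def ssm_conv_kernel(A_bar, B_bar, C, k):
    """Unroll the recurrence into the kernel (C B_bar, C A_bar B_bar, ..., C A_bar^{k-1} B_bar)."""
    kernel, A_pow = [], np.eye(A_bar.shape[0])
    for _ in range(k):
        kernel.append((C @ A_pow @ B_bar).item())
        A_pow = A_bar @ A_pow
    return np.array(kernel)

# Toy usage: build the kernel and convolve it with an input sequence.
N, k = 4, 16
rng = np.random.default_rng(0)
A = -np.diag(rng.uniform(0.1, 1.0, N))      # stable diagonal state matrix
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
A_bar, B_bar = discretize(A, B, dt=0.1)
K = ssm_conv_kernel(A_bar, B_bar, C, k)
x = rng.standard_normal(k)
y = np.convolve(x, K)[:k]                   # causal output of the SSM
```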
Recently, in the advanced visual state space model [34], the parameters B , C , and the time parameter Δ were improved to better incorporate input data dependence, enabling more effective spatial feature modeling. Additionally, the bidirectional selective scanning algorithm leverages parallel processing capabilities, significantly improving overall training efficiency. Therefore, transforming the state space model into convolutional operations theoretically provides support for further exploration of the visual state space model in image compression.

3.2. Overall Framework of the Proposed VMIC

First, we present the overall network architecture of the proposed VMIC in Figure 2. Our framework includes three main subnetworks: the main transformation modules $g_a$ and $g_s$, the hyperprior modules $h_a$ and $h_s$, and an entropy module for the latent representation and hyperprior. In the main transformation modules, we replace the stacked convolutional residual blocks and transformer blocks with the cross-selective scan block (CSSB), which is introduced in detail in the next section. The downsampling block in the analysis transforms ($g_a$ and $h_a$) comprises two convolutional layers: the first layer uses a stride of 2 for downsampling, followed by a Leaky ReLU activation function, and Generalized Divisive Normalization (GDN) is applied after the second convolution. In the synthesis transforms ($g_s$ and $h_s$), upsampling is performed with inverse GDN (IGDN). A sketch of this downsampling block is given below.
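The following is a hedged PyTorch sketch of the downsampling block described above, assuming the GDN layer provided by CompressAI (which the experiments in Section 4 also rely on); the module name and channel arguments are illustrative.

```python
import torch.nn as nn
from compressai.layers import GDN  # assumed available, as the experiments use CompressAI

class DownsampleBlock(nn.Module):
    """Stride-2 convolution + LeakyReLU, followed by a convolution + GDN."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.act = nn.LeakyReLU(inplace=True)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1)
        self.gdn = GDN(out_ch)

    def forward(self, x):
        x = self.act(self.conv1(x))   # spatial downsampling by 2
        return self.gdn(self.conv2(x))
```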
Given an input image $X \in \mathbb{R}^{H \times W \times 3}$ (where $H$, $W$, and 3 denote the height, width, and number of channels), we start with a 5 × 5 downsampling convolution to extract the latent representation $y \in \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times 320}$ while reducing the image size. The data are then processed by a main transformation module consisting of three stages. Each stage includes a CSSB and a 3 × 3 downsampling (or upsampling) convolution aimed at extracting global-spatial information. For the hyperprior part, we use three convolutional blocks with a kernel size of 3 × 3 and two residual downsampling blocks in $h_a$. In $h_s$, since the hyperprior information is used to predict the mean and variance for entropy coding, the number of output channels is set to $2M$. In the hyperprior module, we adopt the same factorized entropy model as in [20] to encode and decode $z \in \mathbb{R}^{\frac{H}{32} \times \frac{W}{32} \times 128}$.
Finally, in the entropy module, we propose an omni-selective scan module based on the 2D selective scan, which additionally performs channel scanning between different slices. The quantized $\hat{y}$ is subsequently encoded into a bitstream through the channel and global context entropy model.

3.3. Cross-Selective Scan Block (CSSB)

Previous works on image compression have primarily relied on CNN and transformer blocks. While the self-attention mechanism in transformers can capture global dependencies, its computational complexity increases quadratically. To address this challenge, we introduce the Cross-Selective Scan Block (CSSB) for remote sensing image compression. As shown in Figure 3, we first employ patch embedding (PE) to adjust the depth feature dimension, mapping the original features $F \in \mathbb{R}^{H \times W \times 3}$ to an $F \in \mathbb{R}^{HW \times C}$ representation. After flattening and applying a linear projection, the input features are fed into the Visual State Space Layer (VSSL) to capture spatial correlations.

In the VSSL, as in visual Mamba [34], the input $F_i \in \mathbb{R}^{HW \times C}$ is split into two branches after layer normalization. The main branch $F_1 \in \mathbb{R}^{HW \times \frac{C}{2}}$ first undergoes a depth-wise convolution (DWConv), followed by selective scanning with a 2D-bidirectional selective scan block (SS2D). In SS2D, bidirectional scanning in the horizontal and vertical directions is expanded into a single sequence along four different traversal paths for parallel processing, allowing every pixel to effectively fuse information from the other pixels. Finally, the output feature map is cross-merged with the other branch $F_2 \in \mathbb{R}^{HW \times \frac{C}{2}}$, and the output $F_o \in \mathbb{R}^{HW \times C}$ is formed by adding the input via a residual connection. The main processing can be expressed as follows:
$$F_1' = \mathrm{LN}\left(\mathrm{SS2D}\left(\sigma\left(\mathrm{DWConv}\left(\mathrm{Linear}(F_1)\right)\right)\right)\right), \quad F_2' = \sigma\left(\mathrm{Linear}(F_2)\right), \quad F_o = \mathrm{Linear}\left(F_1' \odot F_2'\right) \oplus F_i,$$
where Linear denotes a linear projection, LN indicates layer normalization, and $\sigma$ represents the SiLU activation function. The symbols $\odot$ and $\oplus$ denote element-wise multiplication and addition, respectively.

Finally, the patch unembedding (PU) layer restores the features to their original dimensions of $H \times W \times C$. To enhance feature aggregation, a skip connection is also employed.
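The following is a structural PyTorch sketch of the two-branch gating described above, with a generic `ss2d` callable standing in for the 2D-bidirectional selective scan (not implemented here); all module and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class VSSLSketch(nn.Module):
    """Two-branch gated block: (Linear -> DWConv -> SS2D -> LN) modulated by a SiLU gate."""
    def __init__(self, dim: int, ss2d: nn.Module):
        super().__init__()
        half = dim // 2
        self.norm_in = nn.LayerNorm(dim)
        self.proj1 = nn.Linear(dim, half)      # main branch F1
        self.proj2 = nn.Linear(dim, half)      # gating branch F2
        self.dwconv = nn.Conv1d(half, half, kernel_size=3, padding=1, groups=half)
        self.ss2d = ss2d                       # placeholder for the selective scan
        self.norm_out = nn.LayerNorm(half)
        self.proj_out = nn.Linear(half, dim)
        self.act = nn.SiLU()

    def forward(self, f):                      # f: (B, HW, C) token sequence
        x = self.norm_in(f)
        f1 = self.proj1(x).transpose(1, 2)     # (B, C/2, HW) for the depth-wise conv
        f1 = self.act(self.dwconv(f1)).transpose(1, 2)
        f1 = self.norm_out(self.ss2d(f1))      # selective scan, then layer norm
        f2 = self.act(self.proj2(x))           # SiLU gate
        return self.proj_out(f1 * f2) + f      # gated fusion plus residual input
```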

3.4. Channel and Global Context Model (CGCM)

To design a more accurate entropy model, extracting redundant information from different channels is essential, as it significantly impacts image compression performance. A channel-wise entropy model was proposed in [45], which exploits the varying feature redundancies across the divided slices of the latent representation $\hat{y}$. To accelerate decoding, He et al. [46] introduced a checkerboard context entropy model that divides the latent into two parts (anchor and non-anchor); the decoded anchor provides context information for decoding the non-anchor part in parallel. S2LIC [13] integrates multi-dimensional spatial and channel context information. However, these methods lack the ability to effectively extract information from the global-spatial dimensions of the slices. To address this issue, we propose the Channel and Global Context Model (CGCM) within an improved checkerboard entropy model; its framework is shown in Figure 2. Specifically, we divide the latent feature $y$ into an anchor part $y_a$ and a non-anchor part $y_{na}$. For the anchor part, the hyperpriors $\Phi_h$ are obtained from the hyperprior decoding module $h_s$. The quantized latent $\hat{y}$ is evenly split into slices $\hat{y}_i$ ($i \in [0, 10]$). The channel-wise module $g_{ch}$ is then used to extract the channel context $\Phi_{ch}$ from each slice $\hat{y}_i$, as illustrated in Figure 4. For the non-anchor part, the decoded $\hat{y}_{a_i}$ provides additional information; we use the global-spatial module $g_{sc}$ to extract the global-spatial context $\Phi_{sc}$ from $\hat{y}_{a_i}$ and $\hat{y}_{<i}$. After concatenating $\Phi_{sc}$, $\Phi_{ch}$, and $\Phi_h$, the $i$-th non-anchor slice $\hat{y}_{na_i}$ is encoded by predicting the entropy parameters $(\mu, \sigma)$. The decoded $\hat{y}_{a_i}$ and $\hat{y}_{na_i}$ provide additional information to better predict $\hat{y}_i$.
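As a small illustration of the anchor/non-anchor split used by the checkerboard context (a sketch only; which parity is taken as the anchor is a convention, and the function name is illustrative):

```python
import torch

def checkerboard_masks(h: int, w: int):
    """Split an h x w latent plane into complementary anchor and non-anchor positions."""
    idx = torch.arange(h).view(-1, 1) + torch.arange(w).view(1, -1)
    anchor = (idx % 2 == 0)        # decoded first, without spatial context
    non_anchor = ~anchor           # decoded second, using the anchor as context
    return anchor, non_anchor

anchor, non_anchor = checkerboard_masks(4, 4)
# anchor and non_anchor are complementary boolean masks over the latent plane.
```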
We visualize some redundant features of different slices in Figure 5. Different slices exhibit varying degrees of redundancy in their features: the main features are concentrated in the first four slices, but sparse features are still present in the other slices. Therefore, inspired by [47], we propose the Omni-Selective Scan Module (OSSM) to handle spatial information between slices, enhancing the multidimensional modeling capability, and we apply it within an improved checkerboard context entropy model. The 2D selective scan in the CSSB performs scans horizontally and vertically to integrate planar information but does not fully integrate channel information. To improve spatial correlation modeling between slices, we incorporate a bidirectional selective scan mechanism across the channels. The decoded anchor $\hat{y}_{a_i}$ and $\hat{y}_{<i}$ are divided into two information flows. After layer normalization, one flow ($Y_1$) is scanned horizontally and vertically using a VSSL to capture planar 2D information. Simultaneously, the other flow ($Y_2$) undergoes average pooling and is then scanned bidirectionally along the channel direction via a VSSL. The resulting flow, multiplied by a skip connection, is added to the processed $Y_1$ flow to form the global-spatial context $\Phi_{sc}$, effectively integrating information from the 2D planes and the channel dimension without additional computational cost.
Figure 6 shows the detailed architecture of the proposed OSSM for the global-spatial context, and its specific expression is as follows:
$$Y_1' = \mathrm{LN}\left(\mathrm{VSSL}(Y_1)\right), \quad Y_2' = \mathrm{AP}\left(\mathrm{VSSL}(Y_2)\right) \otimes Y_2, \quad \Phi_{sc} = \mathrm{Conv}\left(Y_1' \oplus Y_2'\right),$$
where LN is layer normalization, VSSL is the visual state space layer, and AP is average pooling. The symbols $\oplus$ and $\otimes$ represent the element-wise addition and matrix multiplication operations, respectively.
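The following is a hedged PyTorch sketch of the OSSM data flow described above, with `vssl_plane` and `vssl_channel` standing in for the plane-wise and channel-wise VSSL scans (not implemented here); the names and the 3 × 3 fusion convolution are illustrative assumptions.

```python
import torch
import torch.nn as nn

class OSSMSketch(nn.Module):
    """Fuse a plane-scanned flow (Y1) with a channel-scanned, pooled gating flow (Y2)."""
    def __init__(self, dim: int, vssl_plane: nn.Module, vssl_channel: nn.Module):
        super().__init__()
        self.vssl_plane = vssl_plane          # horizontal/vertical selective scan
        self.vssl_channel = vssl_channel      # bidirectional scan along the channel axis
        self.norm = nn.LayerNorm(dim)
        self.pool = nn.AdaptiveAvgPool2d(1)   # average pooling over the spatial plane
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)

    def forward(self, y):                     # y: (B, C, H, W) decoded anchor context
        # Flow 1: plane-wise scan, then layer normalization over the channel dimension.
        y1 = self.vssl_plane(y).permute(0, 2, 3, 1)
        y1 = self.norm(y1).permute(0, 3, 1, 2)
        # Flow 2: channel-wise scan, pooled to per-channel statistics, skip-multiplied.
        gate = self.pool(self.vssl_channel(y))        # (B, C, 1, 1)
        y2 = gate * y
        # Fuse the two flows into the global-spatial context.
        return self.conv(y1 + y2)
```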

4. Experiment and Results

4.1. Setup of Experiments

4.1.1. Training Details

In our paper, we first select 200k natural images from the ImageNet [48] and COCO [49] datasets to pre-train our image compression model. Subsequently, we fine-tune the model on two remote sensing datasets: the Aerial Image Dataset (AID) [50] and the Northwestern Polytechnical University Very High Resolution 10-Class Dataset (NWPU VHR-10) [51]. The AID dataset is a large-scale collection of images sourced from Google Earth, organized into 30 distinct scene categories; each image is 600 × 600 pixels in size with rich details, making it suitable for remote sensing image compression models. The NWPU VHR-10 remote sensing dataset includes 800 high-resolution optical images with spatial resolutions varying from 0.5 to 2 m; 650 images contain at least one target, while 150 images have no targets. The training setup uses a batch size of 8, an initial learning rate of $10^{-4}$, and the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. We show partial samples of the AID [50] and NWPU VHR-10 [51] remote sensing datasets in Figure 7 and Figure 8.

To train the model for different bit rates, we select various λ values for the peak signal-to-noise ratio (PSNR) and multi-scale structural similarity (MS-SSIM) [52] objectives. Specifically, we chose λ values of {0.0025, 0.0035, 0.0067, 0.013, 0.025, 0.048} for PSNR optimization and {4.58, 8.73, 16.64, 31.73, 60.50} for MS-SSIM optimization. In the training stage, we randomly cropped input images from ImageNet and COCO into 256 × 256 × 3 patches for the initial pre-training steps; the remote sensing images were then cropped into 448 × 448 × 3 patches for the remaining retraining phases. All experiments were conducted on the open-source platform CompressAI [53], using an Intel(R) Xeon(R) Gold 6226R CPU @ 2.90 GHz (Intel Corporation, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA) for 500 epochs of training.
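As a compact summary of these settings, the following sketch shows how the optimizer and λ schedules above could be set up; the placeholder model stands in for the VMIC network, which is not defined here.

```python
import torch
import torch.nn as nn

# Lambda values from the paper: one trained model per rate point.
LAMBDAS_MSE = [0.0025, 0.0035, 0.0067, 0.013, 0.025, 0.048]     # PSNR-optimized
LAMBDAS_MSSSIM = [4.58, 8.73, 16.64, 31.73, 60.50]               # MS-SSIM-optimized

BATCH_SIZE = 8
CROP_PRETRAIN = (256, 256)    # ImageNet/COCO pre-training patches
CROP_FINETUNE = (448, 448)    # AID / NWPU VHR-10 fine-tuning patches

model = nn.Conv2d(3, 3, kernel_size=3, padding=1)  # placeholder for the VMIC network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
```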

4.1.2. Evaluation Metrics

After training, we conducted evaluation tests on the proposed model using remote sensing datasets. We selected the AID [50] and NWPU VHR-10 [51] datasets and randomly divided the remote sensing images from the two datasets, allocating 10% of them as the test set for evaluating the various image compression methods. These test sets cover different image resolutions: the AID test set features images with a fixed resolution of 600 × 600, while the NWPU VHR-10 test set includes images with resolutions varying from 512 to 1200 pixels. Furthermore, we evaluated 20 panchromatic and multispectral images with a resolution of 512 × 512 from the WorldView-3 [54] dataset.

For the evaluation of the different models, we chose the Peak Signal-to-Noise Ratio (PSNR) and the Multiscale Structural Similarity Index (MS-SSIM) [52] as the two metrics for measuring rate-distortion performance. PSNR quantifies the similarity between the original and reconstructed images via the mean squared error (MSE). In contrast, MS-SSIM focuses on human visual perception, reflecting the subjective quality of the image. To align with the PSNR metric, the original MS-SSIM values are transformed to $-10 \log_{10}(1 - \text{MS-SSIM})$.
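For reference, this conversion can be expressed as a one-line helper (a sketch; the example value is illustrative):

```python
import math

def msssim_to_db(msssim: float) -> float:
    """Map an MS-SSIM score in (0, 1) to the dB scale used in the R-D plots."""
    return -10.0 * math.log10(1.0 - msssim)

print(round(msssim_to_db(0.98), 2))  # an MS-SSIM of 0.98 corresponds to ~16.99 dB
```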

4.2. Performance

In this section, we evaluate our models with quantitative results on rate-distortion (R-D) performance and a complexity analysis using an Intel(R) Xeon(R) Gold 6226R CPU @ 2.90 GHz (Intel Corporation, Santa Clara, CA, USA) and an NVIDIA GTX 2080Ti GPU (NVIDIA Corporation, Santa Clara, CA, USA). We compare the proposed VMIC model with other learned image compression methods, including Ballé [20], EntroFormer [55], TIC’22 [18], ELIC [46], LWLIC’24 [56], and S2LIC’24 [13], as well as traditional methods such as JPEG [3], BPG [5] (the HEVC-intra codec), and the state-of-the-art (SOTA) VTM-17.1 (the VVC-Intra codec) [6]. For the complexity analysis, we consider four metrics: BD-rate, inference time (encoding and decoding time), multiply-accumulate operations (MACs), and floating-point operations (FLOPs). BD-rate [57], short for Bjontegaard-delta rate, is a widely used metric for assessing the rate-distortion (R-D) performance of image codecs. It quantifies the bitrate savings of the evaluated codec relative to the anchor at the same PSNR quality, with a smaller value indicating better performance. MACs and FLOPs measure the required computing power and model computational complexity.
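For context, the following is a hedged NumPy sketch of the standard BD-rate computation (a cubic fit of log-bitrate over the overlapping PSNR range, with the average difference converted to a percentage); it is not the exact script used for Table 1, and the R-D points in the usage example are made up.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bitrate difference (%) of the test codec vs. the anchor at equal PSNR."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    # Fit log-rate as a cubic polynomial of PSNR for each codec.
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)
    # Integrate both fits over the common PSNR interval.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0   # negative => bitrate savings

# Toy usage with made-up R-D points (bpp, PSNR in dB):
anchor = ([0.2, 0.4, 0.6, 0.9], [30.0, 32.5, 34.0, 35.5])
test = ([0.18, 0.36, 0.55, 0.85], [30.1, 32.6, 34.1, 35.6])
print(round(bd_rate(anchor[0], anchor[1], test[0], test[1]), 2))
```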

4.2.1. Rate-Distortion Performance

Figure 9 presents the PSNR and MS-SSIM curves for images from the AID dataset. Compared to S2LIC’24 [13], our method improves PSNR by 0.08–0.14 dB, and it achieves a gain of more than 0.4 dB over VTM-17.1; the gain is more significant at higher bit rates. ELIC [46] shows performance similar to EntroFormer [55] when the bit rate is below 0.5 bpp. Among the other image compression methods, Ballé shows the poorest performance. LWLIC’24 [56] outperforms TIC’22 [18] but still falls short of the SOTA VTM-17.1 and our model. When optimizing for the MS-SSIM metric, the results of ELIC and EntroFormer are not included because the corresponding weight parameters are unavailable. Under this evaluation criterion, LWLIC’24 performs better than S2LIC’24 when the bit rate is above 0.3 bpp. At the same bit rate, our model outperforms LWLIC’24 by about 0.14 dB and still surpasses the other learned compression methods such as Ballé [21], TIC’22 [18], and S2LIC’24 [13]. In contrast, traditional image compression methods exhibit the lowest performance in the MS-SSIM evaluation.
We also present the R-D curves of different models on the NWPU VHR-10 and WorldView-3 datasets in Figure 10 and Figure 11. Given the application characteristics of panchromatic and multispectral images, we focused more on the PSNR values of the reconstructed images than on MS-SSIM; consequently, we did not perform an MS-SSIM evaluation on the WorldView-3 dataset. The performance on the NWPU dataset is similar to that on AID. However, for panchromatic images, TIC’22 shows a significant drop in performance at higher bit rates. Our method is about 0.07 dB lower than ELIC at the same bit rate but still outperforms S2LIC’24. This demonstrates that our proposed model is also suitable for a variety of image types and resolutions.

4.2.2. Complexity Analysis

We chose the SOTA VTM-17.1 as the anchor and calculated the BD-rate [57] values from the bpp–PSNR curves. As shown in Table 1, traditional codecs were not included in this comparison because they cannot run on the GPU.
Ballé’s model [20] was the first to implement the LIC method using traditional convolutional blocks. It achieves the fastest encoding and decoding speeds across the four datasets, with relatively low MACs and FLOPs; however, its compression performance is the poorest, making it less suitable for remote sensing image compression. TIC’22 [18] uses a serial autoregressive context entropy model; although this approach improves R-D performance, it significantly reduces encoder–decoder efficiency on the GPU, resulting in longer encoding and decoding times. In terms of inference speed, EntroFormer [55] and ELIC [46] have longer encoding times but faster decoding times. The lightweight design of LWLIC’24 [56] outperforms our model in inference speed. However, our model has lower computational complexity (MACs and FLOPs) and a lower BD-rate than the other state-of-the-art models. As a result, it achieves a better balance between complexity and compression performance.

4.3. Qualitative Results

We first present the visualization of the latent representation features for the center and bridge images in Figure 12. The original image generates compact latent features after the transformation module. The CSSB module captures local features and spatial structures more effectively, optimizing the latent representation $y$ to improve image compression performance. In the CGCM, the anchor $\hat{y}_a$ and non-anchor $\hat{y}_{na}$ features of the checkerboard context are visualized, with the anchor providing features for decoding the non-anchor.
Additionally, Figure 13, Figure 14 and Figure 15 display examples from the AID, NWPU VHR-10, and WorldView-3 datasets for qualitative analysis. To ensure a clearer comparison, we select reconstructed images at lower bit rates. Compared with JPEG [3], BPG [5], VTM [6], Ballé [20], TIC’22 [18], and S2LIC’24 [13], our proposed method captures finer details at a lower bit rate. In JPEG compression, the bit rate of the BaseballField image is approximately 0.391 bpp, higher than that of the other models; nevertheless, the reconstructed image exhibits the lowest PSNR and noticeable blurring. In contrast, with our VMIC model, the BaseballField image achieves a bit rate of around 0.364 bpp while preserving more detailed features, such as the trees and seating. For the second image, School, our method reconstructs the image at a bit rate of approximately 0.253 bpp, achieving a PSNR of 25.69 dB and an MS-SSIM of 13.17 dB, showcasing complex texture details more effectively. Similar performance is also observed for the panchromatic images.

4.4. Ablation Studies

Effects of different scans in the OSSM: In visual Mamba, the 2D-bidirectional selective scan in the horizontal and vertical directions effectively integrates planar information. However, channel information remains crucial across different slices. To address this, we designed an omni-directional selective scan module in the global-spatial context model. We conduct ablation studies on various scanning methods, including unidirectional, bidirectional, and omnidirectional selective scans across channels, with two λ values, λ = 0.013 and λ = 0.045. The results in Table 2 demonstrate that incorporating the channel direction and enabling scanning in six directions covers a broader pixel range and enhances compression performance.
Effects of different designs of the global-spatial context: Due to the redundancy of spatial features between different slices, we designed a global-spatial context module within the CGCM entropy model to extract spatial correlations across slices. To evaluate the components of this module, we conduct ablation experiments. As illustrated in Figure 16, using only the VSSL block results in a loss of local information. Conversely, integrating convolutional layers significantly enhances the module’s capacity to capture local details and improves compression performance. However, replacing the convolutional layers with multi-layer perceptrons (MLPs) increases structural complexity without a corresponding improvement in performance. Given that the OSSM already has robust global modeling capability, adding MLP layers is unnecessary.

5. Conclusions

In this paper, we propose an efficient remote sensing image compression network based on the visual Mamba architecture, which achieves a global effective receptive field through a 2D-bidirectional selective scan mechanism. It achieves a better trade-off between computational complexity and performance than CNN-based and transformer-based models. To better capture the channel redundancy of the latent representation, we designed an omni-selective scan mechanism for the global-spatial context and applied it to the channel-wise and global-spatial context entropy model. Experimental results show that the proposed method has lower computational complexity and outperforms the best traditional codecs and other recent learning-based image compression methods in the PSNR and MS-SSIM metrics. In the future, we will further explore Mamba’s potential in remote sensing image compression and consider further reducing the model complexity.

Author Contributions

Conceptualization, Y.W.; methodology, Y.W.; software, Y.W. and S.W.; investigation, H.C. and Z.C.; resources, Q.C.; writing—original draft preparation, Y.W.; writing—review and editing, Y.W., H.F. and F.L.; visualization, Y.W. and S.W.; supervision, F.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the National Natural Science Foundation of China (No.61474093) and by the Aeronautical Science Foundation of China (ASFC-20184370012).

Data Availability Statement

In this paper, we used the Aerial Image Dataset (AID), the Northwestern Polytechnical University Very High Resolution 10-Class Dataset (NWPU VHR-10), COCO, and ImageNet datasets for training. After training, the AID and NWPU VHR-10 remote sensing datasets were used for evaluation. The datasets can be accessed at the following links: AID Dataset: https://github.com/Hua-YS/AID-Multilabel-Dataset (accessed on 27 October 2024); NWPU VHR-10 Dataset: https://github.com/Gaoshuaikun/NWPU-VHR-10 (accessed on 27 October 2024); ImageNet Dataset: https://github.com/topics/imagenet-dataset (accessed on 10 October 2024); COCO Dataset: https://github.com/cocodataset/cocoapi (accessed on 10 October 2024). The code is available at https://github.com/wyq2021/VMIC.git (accessed on 21 January 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Shi, C.; Shi, K.; Zhu, F.; Zeng, Z.; Wang, L. A multi-level domain similarity enhancement-guided network for remote sensing image compression. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5645819. [Google Scholar]
  2. Guo, T.; Luo, F.; Zhang, L.; Tan, X.; Liu, J.; Zhou, X. Target detection in hyperspectral imagery via sparse and dense hybrid representation. IEEE Geosci. Remote Sens. Lett. 2019, 17, 716–720. [Google Scholar] [CrossRef]
  3. Wallace, G.K. The JPEG still picture compression standard. IEEE Trans. Consum. Electron. 1992, 38, xviii–xxxiv. [Google Scholar] [CrossRef]
  4. Taubman, D.S.; Marcellin, M.W.; Rabbani, M. JPEG2000: Image compression fundamentals, standards and practice. J. Electron. Imaging 2002, 11, 286–287. [Google Scholar] [CrossRef]
  5. Sullivan, G.J.; Ohm, J.R.; Han, W.J.; Wiegand, T. Overview of the High Efficiency Video Coding (HEVC) Standard. IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 1649–1668. [Google Scholar] [CrossRef]
  6. Bross, B.; Wang, Y.K.; Ye, Y.; Liu, S.; Chen, J.; Sullivan, G.J.; Ohm, J.R. Overview of the Versatile Video Coding (VVC) Standard and its Applications. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 3736–3764. [Google Scholar] [CrossRef]
  7. Ahmed, N.; Natarajan, T.; Rao, K.R. Discrete cosine transform. IEEE Trans. Comput. 1974, 100, 90–93. [Google Scholar] [CrossRef]
  8. Marpe, D.; Schwarz, H.; Wiegand, T. Context-based adaptive binary arithmetic coding in the H. 264/AVC video compression standard. IEEE Trans. Circuits Syst. Video Technol. 2003, 13, 620–636. [Google Scholar] [CrossRef]
  9. Zhu, J.; Zhang, J.; Chen, H.; Xie, Y.; Gu, H.; Lian, H. A cross-view intelligent person search method based on multi-feature constraints. Int. J. Digit. Earth 2024, 17, 2346259. [Google Scholar] [CrossRef]
  10. Xie, Y.; Zhan, N.; Zhu, J.; Xu, B.; Chen, H.; Mao, W.; Luo, X.; Hu, Y. Landslide extraction from aerial imagery considering context association characteristics. Int. J. Appl. Earth Obs. Geoinf. 2024, 131, 103950. [Google Scholar] [CrossRef]
  11. Xu, W.; Feng, Z.; Wan, Q.; Xie, Y.; Feng, D.; Zhu, J.; Liu, Y. Building Height Extraction From High-Resolution Single-View Remote Sensing Images Using Shadow and Side Information. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 6514–6528. [Google Scholar] [CrossRef]
  12. Xie, Y.; Liu, S.; Chen, H.; Cao, S.; Zhang, H.; Feng, D.; Wan, Q.; Zhu, J.; Zhu, Q. Localization, balance and affinity: A stronger multifaceted collaborative salient object detector in remote sensing images. IEEE Trans. Geosci. Remote. Sens. 2024, 63, 4700117. [Google Scholar] [CrossRef]
  13. Wang, Y.; Liang, F.; Liang, J.; Fu, H. S2LIC: Learned Image Compression with the SwinV2 Block, Adaptive Channel-wise and Global-inter Attention Context. arXiv 2024, arXiv:2403.14471. [Google Scholar]
  14. Liu, J.; Sun, H.; Katto, J. Learned image compression with mixed transformer-cnn architectures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14388–14397. [Google Scholar]
  15. Makarichev, V.; Vasilyeva, I.; Lukin, V.; Vozel, B.; Shelestov, A.; Kussul, N. Discrete atomic transform-based lossy compression of three-channel remote sensing images with quality control. Remote Sens. 2021, 14, 125. [Google Scholar] [CrossRef]
  16. Li, J.; Fu, Y.; Li, G.; Liu, Z. Remote sensing image compression in visible/near-infrared range using heterogeneous compressive sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 4932–4938. [Google Scholar] [CrossRef]
  17. Ballé, J.; Laparra, V.; Simoncelli, E. End-to-end optimized image compression. arXiv 2016, arXiv:1611.01704. [Google Scholar]
  18. Lu, M.; Guo, P.; Shi, H.; Cao, C.; Ma, Z. Transformer-based Image Compression. In Proceedings of the 2022 Data Compression Conference (DCC), Snowbird, UT, USA, 22–25 March 2022; p. 469. [Google Scholar] [CrossRef]
  19. Luo, W.; Li, Y.; Urtasun, R.; Zemel, R. Understanding the Effective Receptive Field in Deep Convolutional Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Volume 29. [Google Scholar]
  20. Ballé, J.; Minnen, D.; Singh, S.; Hwang, S.; Johnston, N. Variational image compression with a scale hyperprior. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  21. Cheng, Z.; Sun, H.; Takeuchi, M.; Katto, J. Learned Image Compression with Discretized Gaussian Mixture Likelihoods and Attention Modules. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  22. Fu, H.; Liang, F.; Lin, J.; Li, B.; Akbari, M.; Liang, J.; Zhang, G.; Liu, D.; Tu, C.; Han, J. Learned Image Compression With Gaussian-Laplacian-Logistic Mixture Model and Concatenated Residual Modules. IEEE Trans. Image Process. 2023, 32, 2063–2076. [Google Scholar] [CrossRef]
  23. Li, J.; Liu, Z. Efficient compression algorithm using learning networks for remote sensing images. Appl. Soft Comput. 2021, 100, 106987. [Google Scholar] [CrossRef]
  24. Chong, Y.; Zhai, L.; Pan, S. High-order Markov random field as attention network for high-resolution remote-sensing image compression. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5401714. [Google Scholar] [CrossRef]
  25. Fu, C.; Du, B. Remote sensing image compression based on the multiple prior information. Remote Sens. 2023, 15, 2211. [Google Scholar] [CrossRef]
  26. Xiang, S.; Liang, Q.; Tang, P. Task-Oriented Compression Framework for Remote Sensing Satellite Data Transmission. IEEE Trans. Ind. Inform. 2024, 20, 3487–3496. [Google Scholar] [CrossRef]
  27. Pan, T.; Zhang, L.; Qu, L.; Liu, Y. A Coupled Compression Generation Network for Remote-Sensing Images at Extremely Low Bitrates. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5608514. [Google Scholar] [CrossRef]
  28. Han, P.; Zhao, B.; Li, X. Edge-Guided Remote-Sensing Image Compression. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5524515. [Google Scholar] [CrossRef]
  29. Ye, Y.; Wang, C.; Sun, W.; Chen, Z. Map-Assisted Remote-Sensing Image Compression at Extremely Low Bitrates. arXiv 2024, arXiv:2409.01935. [Google Scholar]
  30. Gu, A.; Johnson, I.; Goel, K.; Saab, K.; Dao, T.; Rudra, A.; Ré, C. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Adv. Neural Inf. Process. Syst. 2021, 34, 572–585. [Google Scholar]
  31. Gu, A.; Goel, K.; Ré, C. Efficiently Modeling Long Sequences with Structured State Spaces. In Proceedings of the The International Conference on Learning Representations (ICLR), Online, 25–29 April 2022. [Google Scholar]
  32. Fu, D.Y.; Dao, T.; Saab, K.K.; Thomas, A.W.; Rudra, A.; Ré, C. Hungry Hungry Hippos: Towards Language Modeling with State Space Models. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  33. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar]
  34. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Liu, Y. VMamba: Visual State Space Model. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 10–15 November 2024. [Google Scholar]
  35. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model. In Proceedings of the Forty-First International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024. [Google Scholar]
  36. Hu, V.T.; Baumann, S.A.; Gui, M.; Grebenkova, O.; Ma, P.; Fischer, J.; Ommer, B. ZigMa: A DiT-style Zigzag Mamba Diffusion Model. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024. [Google Scholar]
  37. Guo, H.; Li, J.; Dai, T.; Ouyang, Z.; Ren, X.; Xia, S.T. Mambair: A simple baseline for image restoration with state-space model. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024. [Google Scholar]
  38. Lu, Y.; Wang, S.; Wang, Z.; Xia, P.; Zhou, T. LFMamba: Light Field Image Super-Resolution with State Space Model. arXiv 2024, arXiv:2406.12463. [Google Scholar]
  39. Cao, Y.; Liu, C.; Wu, Z.; Yao, W.; Xiong, L.; Chen, J.; Huang, Z. Remote Sensing Image Segmentation Using Vision Mamba and Multi-Scale Multi-Frequency Feature Fusion. arXiv 2024, arXiv:2410.05624. [Google Scholar]
  40. Ma, X.; Zhang, X.; Pun, M.O. RS3Mamba: Visual State Space Model for Remote Sensing Image Semantic Segmentation. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6011405. [Google Scholar] [CrossRef]
  41. Zhi, R.; Fan, X.; Shi, J. MambaFormerSR: A Lightweight Model for Remote-Sensing Image Super-Resolution. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6015705. [Google Scholar] [CrossRef]
  42. Zhao, S.; Chen, H.; Zhang, X.; Xiao, P.; Bai, L.; Ouyang, W. Rs-mamba for large remote sensing image dense prediction. arXiv 2024, arXiv:2404.02668. [Google Scholar] [CrossRef]
  43. Chen, H.; Song, J.; Han, C.; Xia, J.; Yokoya, N. ChangeMamba: Remote Sensing Change Detection with Spatiotemporal State Space Model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4409720. [Google Scholar] [CrossRef]
  44. Chen, K.; Chen, B.; Liu, C.; Li, W.; Zou, Z.; Shi, Z. RSMamba: Remote Sensing Image Classification With State Space Model. IEEE Geosci. Remote Sens. Lett. 2024, 21, 8002605. [Google Scholar] [CrossRef]
  45. Minnen, D.; Singh, S. Channel-wise Autoregressive Entropy Models for Learned Image Compression. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020. [Google Scholar]
  46. He, D.; Yang, Z.; Peng, W.; Ma, R.; Qin, H.; Wang, Y. Elic: Efficient learned image compression with unevenly grouped space-channel contextual adaptive coding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5718–5727. [Google Scholar]
  47. Shi, Y.; Xia, B.; Jin, X.; Wang, X.; Zhao, T.; Xia, X.; Xiao, X.; Yang, W. VmambaIR: Visual State Space Model for Image Restoration. arXiv 2024, arXiv:2403.11423. [Google Scholar] [CrossRef]
  48. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009. [Google Scholar] [CrossRef]
  49. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision—ECCV 2014, Zurich, Switzerland, 6–12 September 2014; Springer International Publishing: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  50. Xia, G.S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981. [Google Scholar] [CrossRef]
  51. Cheng, G.; Zhou, P.; Han, J. Learning Rotation-Invariant Convolutional Neural Networks for Object Detection in VHR Optical Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2016, 54, 7405–7415. [Google Scholar] [CrossRef]
  52. Wang, Z.; Simoncelli, E.P.; Bovik, A.C. Multiscale structural similarity for image quality assessment. In Proceedings of the Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, Pacific Grove, CA, USA, 9–12 November 2003; Volume 2, pp. 1398–1402. [Google Scholar]
  53. Bégaint, J.; Racapé, F.; Feltman, S.; Pushparaja, A. Compressai: A pytorch library and evaluation platform for end-to-end compression research. arXiv 2020, arXiv:2011.03029. [Google Scholar]
  54. Wu, X.; Huang, T.Z.; Deng, L.J.; Zhang, T.J. Dynamic Cross Feature Fusion for Remote Sensing Pansharpening. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
  55. Qian, Y.; Lin, M.; Sun, X.; Tan, Z.; Jin, R. Entroformer: A Transformer-based Entropy Model for Learned Image Compression. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
  56. He, Z.; Huang, M.; Luo, L.; Yang, X.; Zhu, C. Towards real-time practical image compression with lightweight attention. Expert Syst. Appl. 2024, 252, 124142. [Google Scholar] [CrossRef]
  57. Bjontegaard, G. Calculation of average PSNR differences between RD-curves. VCEG-M33 2001. Available online: https://api.semanticscholar.org/CorpusID:61598325 (accessed on 21 January 2025).
Figure 1. Comparison of the effective receptive field (ERF) [19] on the AID remote sensing dataset, including Ballé (CNN-based) [20], TIC [18] and S2LIC [13] (transformer-based), TCM [14] (hybrid CNN–transformer), and our method (Mamba-based). A yellower area indicates a larger effective receptive field.
Figure 2. The network architecture of the Remote Sensing Image Compression with Visual State Space Model (VMIC). Downsampling and Upsampling denote convolution operations with a stride of 2. Quantization is represented by Q, while AE and AD refer to the arithmetic encoder and decoder, respectively. N and M represent the numbers of channels, which are 128 and 320.
Figure 3. The framework of the cross-selective scan block (CSSB). PE and PU are patch embedding and patch unembedding, respectively. LN indicates layer normalization. σ is the SiLU activation function.
Figure 4. The channel-wise module $g_{ch}$ used to extract the channel context $\Phi_{ch}$ from each slice $\hat{y}_i$ in the Channel and Global Context entropy model (CGCM).
Figure 5. Feature visualization of the airport image across different slices of the NWPU VHR-10 dataset.
Figure 6. The proposed omni-selective scan module (OSSM) for the global-spatial context $g_{sc}$ in the Channel and Global Context entropy model (CGCM).
Figure 7. Examples from a subset of categories in the AID dataset, including (a) square, (b) mountain, (c) viaduct, (d) stadium, (e) storage tanks, (f) river, (g) forest, (h) dense residential, (i) bareland, (j) commercial.
Figure 8. Examples from all categories in the NWPU VHR-10 dataset, including (a) airplane, (b) vehicle, (c) bridge, (d) ground_track_field, (e) tennis_court, (f) basketball_court, (g) baseball_diamond, (h) harbor, (i) storage_tank, (j) ship.
Figure 8. Examples from all categories in the NWPU VHR-10 dataset, including (a) airplane, (b) vehicle, (c) bridge, (d) ground_track_field, (e) tennis_court, (f) basketball_court, (g) baseball_diamond, (h) harbor, (i) storage_tank, (j) ship.
Figure 9. Rate-distortion performance comparison on AID test images, evaluated using PSNR and MS-SSIM metrics.
Figure 10. Rate-distortion performance comparison on NWPU VHR-10 test images, evaluated using PSNR and MS-SSIM metrics.
Figure 11. Rate-distortion performance comparison on WorldView-3 test images: panchromatic images (left) and multispectral images (right).
Figure 12. Feature visualization of the center and bridge images from the AID dataset, including the compact latent representation y after the transformation module and the anchor ŷ_a and non-anchor ŷ_na latent features in the checkerboard context entropy model.
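Figure 12 refers to the checkerboard context model: anchor positions ŷ_a are coded first without spatial context, and the interleaved non-anchor positions ŷ_na are then coded with the decoded anchors as additional context. A minimal sketch of the two masks is given below; the two-pass parameter prediction itself is omitted, and the function name is illustrative.

```python
import torch

def checkerboard_masks(H, W, device="cpu"):
    """Boolean anchor / non-anchor masks for a checkerboard context model."""
    ii, jj = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W, device=device), indexing="ij")
    anchor = (ii + jj) % 2 == 0        # coded in the first pass
    return anchor, ~anchor             # non-anchors decoded in the second pass

anchor, non_anchor = checkerboard_masks(4, 4)
y = torch.randn(1, 320, 4, 4)
y_anchor = y * anchor                  # y_hat_a: visible during the first pass
y_non_anchor = y * non_anchor          # y_hat_na: coded with spatial context
print(anchor.int())
```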
Figure 13. Reconstructed visual comparison of the BaseballField image from the NWPU VHR-10 dataset across various models. The evaluation metrics are presented as bpp ↓, PSNR ↑, MS-SSIM ↑.
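Figures 13 to 15 quote bpp, PSNR, and MS-SSIM under each crop. Bits per pixel is the coded size divided by the pixel count, PSNR is computed on the 8-bit reconstruction, and MS-SSIM is commonly reported in dB as −10·log10(1 − MS-SSIM). The sketch below covers bpp and PSNR only (a multi-scale SSIM implementation is omitted); the image and byte count are placeholders.

```python
import numpy as np

def bpp(num_bytes: int, height: int, width: int) -> float:
    """Bits per pixel of a compressed bitstream."""
    return 8.0 * num_bytes / (height * width)

def psnr(original: np.ndarray, reconstructed: np.ndarray, peak: float = 255.0) -> float:
    """PSNR in dB between two 8-bit images of the same shape."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

ref = np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8)  # placeholder image
rec = np.clip(ref + np.random.randint(-2, 3, ref.shape), 0, 255).astype(np.uint8)
print(f"{bpp(24_000, 512, 512):.3f} bpp, {psnr(ref, rec):.2f} dB")
```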
Figure 14. Reconstructed visual comparison of the school image from the AID dataset across various models. The evaluation metrics are presented as bpp ↓, PSNR ↑, MS-SSIM ↑.
Figure 15. Reconstructed visual comparison of a panchromatic image from the WorldView-3 dataset across various models. The evaluation metrics are presented as bpp ↓, PSNR ↑, MS-SSIM ↑.
Figure 16. Rate-distortion performance comparison for the ablation study on different designs of the global-spatial context.
Table 1. Complexity analysis of different models on the NWPU VHR-10 [51] and AID [50] datasets. We consider the following four metrics: BD-Rate (%), inference time (s), MACs (G), and FLOPs (G). For the BD-rate calculation, the comparison anchor is VTM-17.1 (0.00%). Smaller values indicate better performance. “−” denotes that the result is unavailable.
| Datasets | Methods | BD-Rate (%) | Encode Time (s) | Decode Time (s) | MACs (G) | FLOPs (G) |
|---|---|---|---|---|---|---|
|  | VTM-17.1 [6] | Anchor | − | − | − | − |
| AID | JPEG [3] | 113.41 | − | − | − | − |
|  | BPG [5] | 15.31 | − | − | − | − |
|  | Ballé [20] | 18.14 | 0.07 | 0.06 | 50.85 | 101.75 |
|  | EntroFormer [55] | −0.33 | 2.23 | 0.14 | 146.69 | 343.25 |
|  | TIC’22 [18] | 7.68 | 4.73 | 10.18 | 167.84 | 336.8 |
|  | ELIC [46] | −2.14 | 0.28 | 0.14 | 154.62 | 309.76 |
|  | LWLIC’24 [56] | 2.67 | 0.01 | 0.05 | 170.33 | 341.73 |
|  | S2LIC’24 [13] | −2.79 | 0.30 | 0.39 | 246.44 | 494.31 |
|  | VMIC (Ours) | −4.48 | 0.27 | 0.32 | 141.14 | 314.06 |
| NWPU VHR-10 | JPEG [3] | 196.14 | − | − | − | − |
|  | BPG [5] | 14.47 | − | − | − | − |
|  | Ballé [20] | 17.89 | 0.07 | 0.09 | 101.69 | 203.49 |
|  | EntroFormer [55] | −4.84 | 5.46 | 0.43 | 296.65 | 594.31 |
|  | TIC’22 [18] | 0.47 | 5.13 | 12.55 | 335.65 | 673.56 |
|  | ELIC [46] | −7.15 | 0.35 | 0.16 | 309.24 | 619.52 |
|  | LWLIC’24 [56] | 0.67 | 0.13 | 0.06 | 340.67 | 683.46 |
|  | S2LIC’24 [13] | −8.21 | 0.44 | 0.46 | 510.74 | >10³ |
|  | VMIC (Ours) | −9.80 | 0.32 | 0.34 | 282.29 | 628.11 |
| WorldView-3 Panchromatic | JPEG [3] | 198.06 | − | − | − | − |
|  | BPG [5] | 22.94 | − | − | − | − |
|  | Ballé [20] | 22.04 | 0.08 | 0.04 | 38.93 | 77.9 |
|  | EntroFormer [55] | 0.39 | 3.28 | 0.22 | 113.58 | 259.71 |
|  | TIC’22 [18] | 4.51 | 3.45 | 8.06 | 135.18 | 271.35 |
|  | ELIC [46] | −7.01 | 0.34 | 0.14 | 118.38 | 247.16 |
|  | LWLIC’24 [56] | 1.65 | 0.13 | 0.05 | 137.25 | 288.22 |
|  | S2LIC’24 [13] | −5.72 | 0.36 | 0.39 | 187.08 | 375.2 |
|  | VMIC (Ours) | −6.73 | 0.25 | 0.29 | 108.06 | 240.45 |
| WorldView-3 Multispectral | JPEG [3] | 365.57 | − | − | − | − |
|  | BPG [5] | 34.11 | − | − | − | − |
|  | Ballé [20] | 25.42 | 0.07 | 0.04 | 38.93 | 77.9 |
|  | EntroFormer [55] | 1.01 | 1.15 | 0.16 | 113.58 | 259.71 |
|  | TIC’22 [18] | 5.29 | 3.08 | 7.36 | 135.18 | 271.35 |
|  | ELIC [46] | −7.38 | 0.33 | 0.14 | 118.38 | 247.16 |
|  | LWLIC’24 [56] | 3.27 | 0.14 | 0.06 | 137.25 | 288.22 |
|  | S2LIC’24 [13] | −2.93 | 0.31 | 0.35 | 187.08 | 375.2 |
|  | VMIC (Ours) | −7.93 | 0.24 | 0.27 | 108.06 | 240.45 |
The bold numbers represent the best performance achieved among these methods.
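For reference, the BD-Rate column reports the Bjøntegaard delta rate: the average bitrate difference, in percent, between two rate-distortion curves at matched quality, obtained by fitting a cubic polynomial to log-rate as a function of PSNR and integrating over the overlapping PSNR range. A sketch under the usual four-point assumption is shown below; the (bpp, PSNR) values are illustrative placeholders, not measurements from the paper.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta rate (%): negative means the test codec saves bits."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    # fit log-rate as a cubic polynomial of PSNR
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    # integrate both fits over the shared PSNR interval
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1) * 100

# illustrative placeholder rate-distortion points, not data from Table 1
anchor = ([0.25, 0.45, 0.80, 1.30], [32.1, 34.3, 36.8, 39.0])
test   = ([0.23, 0.42, 0.76, 1.24], [32.2, 34.5, 37.0, 39.2])
print(f"BD-rate: {bd_rate(*anchor, *test):.2f}%")
```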
Table 2. Ablation study on the scanning methods of the OSSM: Uni-Scan (unidirectional selective scan), Bi-Scan (bidirectional selective scan), and Omni-Scan (omnidirectional selective scan).
| Method | λ | bpp ↓ | PSNR (dB) ↑ | MS-SSIM (dB) ↑ |
|---|---|---|---|---|
| Uni-scan | 0.013 | 0.454 | 34.08 | 16.62 |
| Bi-scan | 0.013 | 0.451 | 34.14 | 16.65 |
| Omni-scan | 0.013 | 0.443 | 34.16 | 16.67 |
| Uni-scan | 0.045 | 0.874 | 37.56 | 19.98 |
| Bi-scan | 0.045 | 0.869 | 37.64 | 20.07 |
| Omni-scan | 0.045 | 0.861 | 37.69 | 20.11 |