Article

KANDiff: Kolmogorov–Arnold Network and Diffusion Model-Based Network for Hyperspectral and Multispectral Image Fusion

by Wei Li 1, Lu Li 1,*, Man Peng 2 and Ran Tao 3
1 School of Automation, Beijing Information Science and Technology University, Beijing 100192, China
2 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100045, China
3 School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
* Author to whom correspondence should be addressed.
Submission received: 5 November 2024 / Revised: 26 December 2024 / Accepted: 27 December 2024 / Published: 3 January 2025
(This article belongs to the Section Remote Sensing Image Processing)

Abstract: In recent years, the fusion of hyperspectral and multispectral images in remote sensing image processing has continued to face challenges, primarily due to their complexity and multimodal characteristics. Diffusion models, known for their stable training process and exceptional image generation capabilities, have shown good application potential in this field. However, when dealing with multimodal data, such models may fail to fully capture the intricate relationships between the modalities, which can result in incomplete information integration and a small amount of residual noise in the generated images. To address these problems, we propose a new model, KANDiff, for hyperspectral and multispectral image fusion. To handle the differences in modal information between multispectral and hyperspectral images, KANDiff incorporates Kolmogorov–Arnold Networks (KAN) to guide the inputs. KAN helps the model understand the complex relationships between the modalities by replacing the fixed activation function in the traditional MLP with a learnable activation function. Furthermore, the image generated by the diffusion model may exhibit a small amount of residual noise. Convolutional Neural Networks (CNNs) effectively extract local features through their convolutional layers and achieve noise suppression via layer-by-layer feature representation. Therefore, the MergeCNN module is further introduced to enhance the fusion effect, resulting in smoother and more accurate outcomes. Experimental results on the public CAVE and Harvard datasets indicate that KANDiff improves over current high-performance methods across several metrics, particularly the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM), demonstrating superior performance. Additionally, we have created an image fusion dataset of the lunar surface, and KANDiff exhibits robust performance on this dataset as well. This work introduces an effective solution to the challenge posed by missing high-resolution hyperspectral image (HRHS) data, which is essential for tasks such as landing site selection and resource exploration in deep space exploration.

1. Introduction

With the growing demand for high-quality images, image features acquired by a single sensor struggle to meet the requirements of complex application scenarios due to the mutual constraints between spatial and spectral resolutions [1,2]. Multi-source image fusion technology has therefore emerged to relax this constraint. It generates high-resolution multispectral images (HRMS) or HRHS images [3,4] by integrating the redundant and complementary information of panchromatic, multispectral, and hyperspectral images [5]. This process enhances image quality, spectral accuracy, and spatial resolution, and reduces blurriness and noise. The technology has been widely used in medical imaging [6], remote sensing [7,8], and computer vision [9,10,11,12,13]. With the growth in demand for high-resolution hyperspectral images in environmental monitoring [14], agricultural production [15], and medical imaging in recent years, hyperspectral image fusion has encountered new challenges and opportunities. Hyperspectral imaging systems work in narrower wavebands and need to achieve a high signal-to-noise ratio; as a result, sensors must collect as many photons in space as possible, which leads to lower spatial resolution in hyperspectral images. In contrast, multispectral images possess high spatial resolution. Therefore, the fusion of low spatial resolution hyperspectral (LRHS) images and HRMS images has important practical applications [16,17,18,19,20].
Hyperspectral and multispectral image fusion is closely related to pansharpening and can be regarded as a specific application of it, i.e., the same techniques employed in a different setting. Pansharpening methods are mainly classified into four categories: (1) component substitution (CS) methods [21,22,23,24,25,26,27]; (2) multiresolution analysis (MRA) methods [28,29,30,31,32]; (3) variational optimization (VO) techniques [33,34,35,36,37,38,39]; and (4) deep learning (DL) methods [40,41,42,43,44,45]. Component substitution methods enhance details by replacing components of a multispectral image with the panchromatic image. However, this is only effective when the panchromatic image is highly correlated with the multispectral image; otherwise, it may result in spectral distortion. MRA methods inject spatial information from panchromatic images into LRHS images, which recovers more spectral information but may also reduce the spatial resolution [46,47]. VO methods incorporate regularization terms to account for a priori information; however, improperly designed regularization terms may diminish the efficacy of pansharpening.
Traditional methods face challenges in generating high-fidelity HRMS images due to technical limitations. In contrast, deep learning-based methods provide a more efficient solution for hyperspectral and multispectral image fusion by directly learning the data distribution to recover spatial and spectral details. Deep learning models can learn mappings from pairs of LRHS and HRMS images to HRHS images in an end-to-end manner, enabled by the superior ability of neural networks in feature learning and non-linear adaptation. These methods mainly include convolutional neural network (CNN)-based methods [40,48,49], autoencoder (AE) methods [50,51], generative adversarial network (GAN) methods [52,53,54], flow-based models [55,56], and transformer-based methods [57].
Early deep learning methods for fusing hyperspectral and multispectral images are mainly based on CNNs. The PNN proposed by Masi et al. [58] is an early representative of multispectral image fusion using deep learning. The Spatial–Spectral Reconstruction Network (SSRNet) [42] introduces a loss function specifically designed for spatial and spectral reconstruction. Wei et al. propose the deep residual network DRPNN [59], which introduces improvements for panchromatic image sharpening. DiCNN [60] introduces skip connections as a detail injection model and improves the fusion effect through a dual CNN framework. MSDCNN [61] adopts a dual-branch network structure and extracts features through convolutional filters of different scales. PanNet, proposed by Yang et al. [62], introduces upsampled multispectral images to transfer spectral information and improve the reconstruction effect. FusionNet [63] implements an end-to-end residual network to optimize high-frequency detail learning. In addition, the cascade network developed by Yang et al. [64] fuses features of different scales through two upsampling stages, while MHFNet [65] uses a convolutional unfolding algorithm to improve interpretability. Although CNN-based methods have been widely used, challenges remain, such as high computational complexity, boundary effects, and the spatial locality that limits the capture of global dependencies.
Fusion methods based on AEs [50,51] extract features with an encoder and then reconstruct the image with a decoder. The multi-scale deep residual network of Wang et al. [66] adopts a U-shaped structure to optimize the use of scale information; however, the U-Net structure may lose boundary information during feature extraction. Pan-GAN [52], proposed by Ma et al., effectively retains the spectral and spatial information in the fused image by confronting the generator with a spectral discriminator and a spatial discriminator. Although GANs have shown potential in remote sensing image fusion, they still face challenges such as training complexity and mode collapse. Transformers have also been gradually applied to image fusion. Ma et al. [67] first proposed a transformer-based fusion framework, highlighting its key role in modeling long-distance dependencies. Frameworks combining a CNN and a transformer use a self-attention mechanism to capture long-distance dependencies while extracting local features through the CNN to achieve feature complementarity [67,68]. However, many methods still rely on CNNs for initial feature extraction and fail to fully utilize the global modeling capabilities of transformers.
In recent years, the diffusion model has gained attention. It performs well in generative tasks and is widely used in various applications, including conditional and unconditional generation [69,70,71,72,73,74,75], text-to-image generation [76], audio generation, image super-resolution [77], and other advanced image processing tasks [78]. Diffusion models have been widely recognized for their stable training process and excellent image generation capabilities, which are particularly important for the task of fusing hyperspectral and multispectral images. Diffusion models are capable of generating images with more detail and higher fidelity than GANs and flow-based models [70], offering a stable training process and a minimal risk of mode collapse. Therefore, diffusion models can be effectively used to describe a priori information of hyperspectral images and to recover high-resolution hyperspectral images, guided by low-resolution hyperspectral and multispectral images.
Current diffusion model-based methods have achieved good accuracy, but a significant issue still needs to be addressed: when dealing with multimodal data, diffusion models struggle to fully capture the complex relationships between modalities, leading to incomplete information integration and a small amount of noise remaining in the generated images. To overcome this problem, an efficient deep learning-based hyperspectral image fusion method is proposed, which fully exploits and integrates the advantageous features of multispectral and hyperspectral images through diffusion models to effectively improve the fusion performance. A lunar surface dataset for hyperspectral and multispectral image fusion tasks is produced to verify the effectiveness of the proposed method. Experiments on this lunar surface dataset also demonstrate the potential of the model in deep space exploration missions, where remote sensing image fusion provides high-precision image support for key tasks such as landing site selection and resource exploration. With the high-quality fused images produced by KANDiff, a spacecraft can obtain more detailed information on topography and resource distribution, providing a reliable basis for scientific decision-making. As future deep space exploration missions advance, remote sensing image fusion technology is expected to provide more accurate and reliable image analysis support for exploration missions.
The contributions of this work are as follows:
(1) We propose a hyperspectral and multispectral image fusion model called KANDiff, which incorporates KAN and MergeCNN modules at key stages to optimize the fusion result. The KAN module enhances the expression of salient features and dynamically adjusts the learning strategy, thereby preserving image details during the denoising phase of the diffusion model and significantly improving the signal-to-noise ratio of the fused results.
(2) The design of MergeCNN further optimizes the results generated by the diffusion model, improving the clarity and stability of the generated images through deep fusion; it especially excels in detail and edge recovery.
(3) We constructed a hyperspectral and multispectral fusion lunar surface dataset and conducted experiments on this dataset and two public datasets to verify the effectiveness of KANDiff. The experimental results indicate that the proposed method performs well in terms of the fidelity of both spatial and spectral information.
The structure of the rest of the paper is as follows: Section 2 introduces the related research work, Section 3 describes the proposed KANDiff strategy and its network architecture in detail, Section 4 describes the experimental setup and introduces the constructed fusion dataset of the lunar hyperspectral and multispectral datasets, and Section 5 validates the effectiveness of the proposed method through extensive experiments and discusses the experimental results in detail. Finally, Section 6 provides a conclusion.

2. Related Works

2.1. Diffusion Model

Diffusion models are well suited to remote sensing image fusion because they can generate high-quality fused images that accurately capture the complex nonlinear relationships between multiple sources of data. Through a step-by-step denoising and generation process, they possess a strong denoising capability that suppresses residual noise during fusion and improves the clarity and accuracy of the images. In addition, the diffusion model has a stable training process, which avoids the training instability common to traditional generative models such as GANs and ensures consistent performance across different remote sensing datasets and application scenarios. Its flexible framework also allows it to be combined with other neural network modules to further enhance the integration of multispectral and hyperspectral data, thus meeting the high requirements of remote sensing image processing in terms of spatial resolution, spectral information, and data quality. As a result, the diffusion model is an ideal tool for addressing the challenges of remote sensing image fusion in key applications such as environmental monitoring, resource exploration, and deep space exploration.
The basic principles of the diffusion model are briefly reviewed below. The diffusion model generates new samples by simulating a random noise process in the data [78]; its core idea is to gradually add noise to the data and then denoise during generation, thus recovering the original data or generating new samples. The diffusion process is divided into two stages: forward noise addition and iterative denoising. In the forward process, the input image is gradually corrupted until it becomes pure Gaussian noise, and the model learns to remove the added noise and restore the original input. In the backward process, the input is pure Gaussian noise, and the trained model removes a small amount of noise at each step; after a sufficiently large number of time steps, the pure noise input is restored to the generated image. $q(x_t \mid x_{t-1})$, $p_\theta(x_{t-1} \mid x_t, c)$, and $c$ denote the forward noise addition process, the backward denoising process, and the condition, respectively.
The forward noise addition process, shown in Figure 1a, gradually converts real data into noise. Denoting the real data in the dataset as $x_0$, a series of noisy data $x_t$ is formed by adding noise at each time step $t$. The process can be expressed by the following equation:
$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$
where $\epsilon \sim \mathcal{N}(0, I)$ is noise sampled from the standard normal distribution and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ is determined by the noise schedule; it takes values in $[0, 1]$ and reflects the intensity of the noise. Each $\alpha_t$ is defined as $\alpha_t = 1 - \beta_t$, where $\beta_t$ is the variance of the noise added at time step $t$ and usually follows a linear or cosine decay, $\beta_t = \mathrm{schedule}(t)$.
The inverse denoising process, shown in Figure 1b, starts from the noise sample $x_T$ and removes the noise step by step to recover the original data.
During training, the target is for the network to accurately predict the noise at a given time step $t$. The loss function is usually a mean square error (MSE) loss of the form:
$L(\theta) = \mathbb{E}_{t, x_0, \epsilon}\left[ \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2 \right]$
and the loss function is optimized by comparing the noise predicted by the model with the true noise.
At the beginning of the generation phase, an initial noise sample $x_T$ is sampled from a standard normal distribution:
$x_T \sim \mathcal{N}(0, I)$
Subsequently, the inverse denoising process is carried out iteratively to gradually generate high-quality samples until $x_0$ is reached. The iterative formula for the process is the following:
$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)$
where $\epsilon_\theta(x_t, t)$ is a neural network that predicts the noise at time step $t$.
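To make the forward and reverse updates above concrete, the following PyTorch sketch implements the noising formula and one fixed-variance reverse step for a generic noise-prediction network; the linear schedule values, the (B, C, H, W) tensor layout, and the `eps_model` interface are illustrative assumptions rather than the exact configuration used in KANDiff.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear schedule for beta_t (illustrative values)
alphas = 1.0 - betas                             # alpha_t = 1 - beta_t
alpha_bar = torch.cumprod(alphas, dim=0)         # cumulative product of alpha_t

def q_sample(x0, t, eps):
    """Forward noising: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps

@torch.no_grad()
def p_sample(eps_model, x_t, t):
    """One reverse denoising step using the fixed-variance update above."""
    beta, alpha, ab = betas[t], alphas[t], alpha_bar[t]
    t_batch = torch.full((x_t.size(0),), t, device=x_t.device)
    eps_hat = eps_model(x_t, t_batch)            # predicted noise for (x_t, t)
    mean = (x_t - (1.0 - alpha) / (1.0 - ab).sqrt() * eps_hat) / alpha.sqrt()
    if t == 0:
        return mean
    return mean + beta.sqrt() * torch.randn_like(x_t)
```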

2.2. KAN (Kolmogorov–Arnold Networks)

KAN [79] is an emerging neural network architecture whose design concept is derived from the Kolmogorov–Arnold representation theorem and aims to improve the interpretability and training efficiency of the network. The main innovation of KAN is to replace the fixed activation function in the traditional multilayer perceptron (MLP) with learnable activation functions, which allows the weights of each connection to be dynamically adjusted in a parameterized way, thus capturing the complex relationships in the input data more efficiently.
In the KAN model, the input data are represented as a vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]$. The output of each hidden layer node is computed by an activation function of the form:
$h_i(\mathbf{x}) = g_i(W_i \cdot \mathbf{x})$
where $g_i$ is the activation function of the $i$th node and $W_i$ is the weight associated with that node, which is itself composed of activation functions. For example, the weights can be represented as follows:
$w_{ij}(x) = \sum_{k=0}^{m} a_k \cdot B_k(x)$
where $B_k(x)$ is the basis function and $a_k$ is a parameter to be learned. This representation allows the KAN to adaptively adjust the weights during the learning process to better fit the features of the input data.
Ultimately, the output of the KAN is obtained by a weighted sum of the outputs of all hidden nodes:
$y = \sum_{i=1}^{M} \beta_i h_i(\mathbf{x})$
where $\beta_i$ is the output weight of the $i$th node and $M$ is the number of nodes in the hidden layer. The loss function usually takes the form of the mean square error:
$L = \frac{1}{N} \sum_{j=1}^{N} (y_j - \hat{y}_j)^2$
Through optimization algorithms such as gradient descent, the model is able to iteratively update the parameters of the activation function and the weights of the output layers, thus minimizing the loss function. KAN shows strong potential in the fields of data fitting, scientific computing, and partial differential equation solving, especially when dealing with nonlinear relationships.
It can effectively capture complex patterns, and at the same time, due to its interpretability, it is easy to understand and analyze the decision-making process of the model. These features make KAN a flexible and efficient alternative in the field of deep learning.
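As a concrete illustration of a learnable edge function of the form above, the following sketch implements a simplified KAN-style layer in PyTorch. It uses a Gaussian bump basis on a fixed grid rather than the B-splines used by the pykan library employed in this paper, and the dimensions and initialization are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class KANLayerSketch(nn.Module):
    """Simplified KAN-style layer: every input-output edge carries a learnable
    univariate function w_ij(x) = sum_k a_k * B_k(x).  Here B_k is a set of
    Gaussian bumps on a fixed grid; pykan uses B-splines instead."""

    def __init__(self, in_dim, out_dim, num_basis=8, grid_range=(-1.0, 1.0)):
        super().__init__()
        centers = torch.linspace(grid_range[0], grid_range[1], num_basis)
        self.register_buffer("centers", centers)           # basis centers for B_k
        self.width = (grid_range[1] - grid_range[0]) / num_basis
        # one learnable coefficient vector a_k per (output, input) edge
        self.coeff = nn.Parameter(0.1 * torch.randn(out_dim, in_dim, num_basis))

    def forward(self, x):                                   # x: (batch, in_dim)
        # evaluate the basis functions B_k(x_j) for every input coordinate
        basis = torch.exp(-((x.unsqueeze(-1) - self.centers) / self.width) ** 2)
        # sum over inputs j and basis index k -> (batch, out_dim)
        return torch.einsum("bjk,ijk->bi", basis, self.coeff)
```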

2.3. Wavelet Transform

In remote sensing image fusion, wavelet transform is a widely used technique that can effectively integrate and analyze the spatial and spectral information in multi-source remote sensing data (i.e., hyperspectral and multispectral images) [80,81]. In our research, DB1 wavelet decomposition (i.e., Haar wavelet) is used, which is one of the most fundamental wavelet bases with the advantages of high computational efficiency, simplicity of implementation, and excellent performance in preserving spatial details in the merged images. The basic principle of DB1 wavelet decomposition is to decompose the input image into different frequency components in order to analyze the information of the image at different scales [82]. Specifically, DB1 wavelet decomposition decomposes the image into a low-frequency part and high-frequency parts in three directions, namely horizontal (LH), vertical (HL) and diagonal (HH). The low frequency (LL) part retains the main energy of the image, contains the global information of the image, mainly reflects the smooth part of the image, and is able to capture the overall spectral features of the image. The high-frequency portion, on the other hand, reflects the rapidly changing detail information in the image, including local features such as edges and textures. Horizontal High Frequency (LH) captures the edge information that changes along the horizontal direction in the image, Vertical High Frequency (HL) captures the detail in the vertical direction, and Diagonal High Frequency (HH) extracts the changes in the image in the diagonal direction.
Mathematically, the DB1 wavelet transform is implemented by filtering and downsampling the image. Assuming the original image is $I(x, y)$, the low- and high-frequency components are obtained by filtering in the horizontal and vertical directions, respectively. Specifically, the low-frequency component ($LL$) can be expressed as follows:
$LL = \sum_{x,y} I(x, y) \cdot \phi(x) \cdot \phi(y)$
where $\phi(x)$ is a wavelet basis function used to filter the smooth part of the image. Similarly, the high-frequency components can be represented as follows:
$LH = \sum_{x,y} I(x, y) \cdot \phi(x) \cdot \psi(y)$
$HL = \sum_{x,y} I(x, y) \cdot \psi(x) \cdot \phi(y)$
$HH = \sum_{x,y} I(x, y) \cdot \psi(x) \cdot \psi(y)$
where $\psi(x)$ denotes the high-frequency filter of the wavelet, which is used to capture the detailed features of the image.
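The decomposition above can be reproduced with PyWavelets. The sketch below shows a single-band DB1 (Haar) decomposition and one simple fusion rule (keep the LL sub-band of the upsampled HSI band and inject the detail sub-bands of the MSI band); both the library choice and the fusion rule are assumptions of this sketch, since the paper states only that a DB1 decomposition is used for the preliminary fusion.

```python
import numpy as np
import pywt

def haar_fuse(band_hsi_up, band_msi):
    """Single-band DB1 (Haar) wavelet fusion sketch: low-frequency content from
    the upsampled HSI band, high-frequency detail from the MSI band."""
    LL_h, (LH_h, HL_h, HH_h) = pywt.dwt2(band_hsi_up, "db1")
    LL_m, (LH_m, HL_m, HH_m) = pywt.dwt2(band_msi, "db1")
    return pywt.idwt2((LL_h, (LH_m, HL_m, HH_m)), "db1")

# example on random data with matching spatial size
hsi_band = np.random.rand(304, 304)
msi_band = np.random.rand(304, 304)
print(haar_fuse(hsi_band, msi_band).shape)   # (304, 304)
```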

2.4. Motivation

In the fusion of hyperspectral and multispectral images, although traditional diffusion models offer good denoising performance, the large differences between the two modalities make it difficult for a single diffusion model to fully exploit their complementarity. Additionally, a small amount of noise often remains in the images generated by the diffusion model. To improve the fusion effect, we propose improving the diffusion model architecture by combining KAN and CNN. On the one hand, KAN assists the model in better integrating multimodal information and enhances the flexibility of feature extraction through learnable activation functions. It can adaptively adjust the weights of different features, allowing the model to focus more on the key information and thereby improving the fusion result. On the other hand, using a CNN for further fusion after the diffusion model's denoising alleviates the problem of residual noise, resulting in a smoother and more accurate output. This combination makes the model more effective when dealing with noisy, distorted, and diverse types of images.

3. Main Method

3.1. Overall Model Architecture

As shown in Figure 2, the model proposed in this study is divided into three main parts: a KAN-based guidance module, a diffusion-based fusion module, and a MergeCNN module. To reduce the computational complexity of the model, we compute the difference between the hyperspectral image (HSI_GT) and its upsampled version (HSI_UP) and feed the result, RES, into the guidance module. The module first applies a convolution to RES to extract its data features and then passes the result to the KAN layer for analysis. The convolution layer has a kernel size of 64 and a stride of 1; it extracts the data features and changes the data dimensions so that they can be better fed into the KAN layer. Next, the data are restored to their original dimensions through a convTranspose layer whose parameters are the same as those of the convolutional layer, and the output of the KAN-based guidance module is then input into the diffusion model for fusion generation. Compared with the multi-layer perceptron (MLP), KAN has a more flexible connection scheme, which can identify more complex feature relationships and strengthen important features.
The wavelet transform is used to perform a preliminary fusion of the multispectral image (MSI) and HSI_UP, and the fusion result is used as the condition of the diffusion model to guide the image to evolve along the direction indicated by the wavelet transform. In practice, the wavelet fusion result is concatenated with the output of the KAN module and then input into the diffusion model. The overall structure of the diffusion model is similar to the U-Net architecture, consisting of an encoder, middle layers, and a decoder, where the encoder and decoder have the same number of layers. The specific structure is shown in Figure 2b. To further extract and enhance important features, a self-attention mechanism is added to each layer. In addition, to prevent vanishing gradients, residual connections are used between the encoder, middle layers, and decoder layers, and the output of the previous layer is added to its input to form the input of the current layer.
After the diffusion model processing, the result is added to HSI_UP to obtain HSI_D. Subsequently, the MSI and HSI_D are concatenated and processed by MergeCNN, which consists of three convolutional layers with ReLU as the activation function after each layer. To prevent vanishing gradients, the final fusion result is added to HSI_D to obtain the high-resolution hyperspectral image (HRHS).
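The data flow just described can be summarized in a few lines. The module objects (`kan_guidance`, `diffusion`, `merge_cnn`, `wavelet_fuse`) are hypothetical callables standing in for the blocks of Figure 2, and the sketch shows the training-time path in which RES is computed from the ground truth.

```python
def kandiff_forward(hsi_gt, hsi_up, msi, kan_guidance, diffusion, merge_cnn, wavelet_fuse):
    """Illustrative KANDiff data flow (module names are hypothetical placeholders).

    hsi_gt: ground-truth HSI (available during training), hsi_up: upsampled LRHS,
    msi:    high-spatial-resolution multispectral image.
    """
    res = hsi_gt - hsi_up                      # differential input RES
    guided = kan_guidance(res)                 # KAN-based guidance module (Figure 2a)
    condition = wavelet_fuse(msi, hsi_up)      # preliminary DB1 wavelet fusion as the condition
    res_hat = diffusion(guided, condition)     # diffusion-based fusion module (Figure 2b)
    hsi_d = hsi_up + res_hat                   # add the generated residual back to HSI_UP
    hrhs = merge_cnn(hsi_d, msi)               # MergeCNN concatenates, refines, and adds back HSI_D
    return hrhs
```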

3.2. Modules

3.2.1. KAN-Based Guidance Module

The KAN-based guidance module we designed is shown in Figure 2a and mainly consists of a convolutional layer, a KAN layer, and a convTranspose layer. Unlike the fully connected structure of the traditional multi-layer perceptron (MLP), KAN adopts a more flexible connection scheme and has learnable activation functions, making it more adaptable and flexible in data processing. Through the KAN layer, the input image data are converted into a two-dimensional form to extract key information. Subsequently, the extracted data are restored to the original image dimensions through the convTranspose layer so that the dimensions of the generated image remain consistent with those of the input image, facilitating the subsequent fusion.

3.2.2. Diffusion Module

Our diffusion module is shown in Figure 2b. The main architecture is based on U-Net and consists of three parts: encoder, middle layers, and decoder. Both the decoder and encoder adopt a three-layer structure, while the middle layers adopt a two-layer structure. The detailed structure of each layer is shown in Figure 3, which includes the self-attention layer, GroupNorm, SiLU activation function, Dropout, and Conv2d layer. The self-attention layer enables the model to adaptively highlight key information during the learning process so as to more effectively capture the subtle features in the data and improve the quality and detail performance of the generated data. GroupNorm standardizes the data, which helps the model converge and improves performance. In each layer, the SiLU activation function is used. Compared with the traditional sigmoid function, SiLU adds a linear part, which can better fit the linear relationship of the data, thereby accelerating the learning speed and improving stability. To prevent overfitting, a Dropout layer is added after SiLU to randomly discard some neuron outputs to avoid the network’s excessive dependence on specific neurons, thereby enhancing the generalization ability of the model. Then, the data are adjusted to the required dimension through the Conv2d layer. In the blocks used by the encoder, decoder, and intermediate layers, the residual connections are used to set the input of each layer to the sum of the output and input of the previous layer. This design not only effectively prevents gradient vanishing and improves the stability of the model but also enables more efficient information transmission, thereby helping the network learn image features more deeply during training. The introduction of residual connections further enhances the learning ability of the model, making it better at capturing and expressing image features.
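The layer structure of Figure 3 can be sketched as follows; the channel count, group number, head count, and dropout rate are illustrative assumptions, and the self-attention is applied over flattened spatial positions.

```python
import torch.nn as nn

class DiffBlockSketch(nn.Module):
    """Sketch of one encoder/decoder block from Figure 3: self-attention,
    GroupNorm, SiLU, Dropout, and Conv2d, wrapped in residual connections.
    All hyperparameters here are illustrative assumptions."""

    def __init__(self, channels, num_groups=8, num_heads=4, p_drop=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.body = nn.Sequential(
            nn.GroupNorm(num_groups, channels),
            nn.SiLU(),
            nn.Dropout(p_drop),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, C) for self-attention
        attn_out, _ = self.attn(tokens, tokens, tokens)
        x = x + attn_out.transpose(1, 2).reshape(b, c, h, w)
        return x + self.body(x)                 # residual connection
```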

3.2.3. MergeCNN

As shown in Figure 2c, the MergeCNN structure consists of a concat module and a three-layer convolutional neural network. The convolution kernel size of each layer is 3, the stride is 1, the padding is 1, and the activation function is ReLU. We observed that the image generated by the diffusion model may retain a small amount of residual noise, and because the generation is based on HSI_UP, some MSI information may be lost. To this end, multi-layer convolution and iterative optimization are used to better restore the details and edge information of the image. Compared with using the diffusion model alone, this structure effectively enhances the clarity and detail of the image. Specifically, after preliminary processing, the image generated by the diffusion model is concatenated with the MSI to form the input of MergeCNN. Through the three convolutional layers of MergeCNN, the residual noise is gradually filtered out, and the information of the MSI and HSI_UP is reintegrated to further improve the quality of the fused image.
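A minimal sketch of this module is shown below; it follows the stated configuration (three convolutions, kernel 3, stride 1, padding 1, ReLU, residual connection to HSI_D), while the hidden channel width is an assumption.

```python
import torch
import torch.nn as nn

class MergeCNNSketch(nn.Module):
    """Three-layer fusion CNN as described above; the hidden width is illustrative."""

    def __init__(self, hsi_bands, msi_bands, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(hsi_bands + msi_bands, hidden, 3, stride=1, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, stride=1, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, hsi_bands, 3, stride=1, padding=1),
            nn.ReLU(),
        )

    def forward(self, hsi_d, msi):
        x = torch.cat([hsi_d, msi], dim=1)      # concatenate along the band dimension
        return hsi_d + self.net(x)              # residual connection back to HSI_D
```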

3.3. Training Process

The training process of KANDiff covers two parts: the training of the diffusion module and that of the MergeCNN module. We use a boost training strategy to train these two modules independently. In each training epoch, the diffusion module is trained first, and its internal parameters are optimized by backpropagation to produce a preliminary fusion result. This result is then used as input to the MergeCNN module to initiate its training. After training and backpropagation, the parameters of the MergeCNN module are tuned to further fuse the MSI and HSI_D.
The diffusion model and the MergeCNN module are trained sequentially, but their backpropagation is performed independently. This means that when the parameters of the diffusion model are optimized, the parameters of the MergeCNN module are not adjusted at the same time, so the backpropagation of the two modules does not directly affect each other. The advantage of this training scheme is that using two independent learners allows one module to be trained without interfering with the other, avoiding the need to freeze parameters.
The training of the diffusion module mainly includes two processes: noise addition and denoising. In the noise addition process, the model gradually adds noise to the RES guided by the KAN module until it becomes random noise obeying a Gaussian distribution. We have made an important improvement in noise handling by using a cosine noise schedule instead of the traditional linear one. The smoothness of the cosine schedule significantly reduces the possibility of abrupt changes during generation, maintains image details more effectively during denoising, reduces blurring, and enhances the clarity and authenticity of the generated results. Compared with the linear schedule, the cosine schedule provides a smoother transition along the time dimension, helping to improve the model's responsiveness to subtle changes.
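A cosine noise schedule of the kind described here can be written as follows; the offset and clipping values follow the widely used improved-DDPM formulation and are assumptions, since the exact constants used in KANDiff are not stated.

```python
import math
import torch

def cosine_beta_schedule(T, s=0.008):
    """Cosine schedule: the cumulative alpha follows a squared-cosine curve,
    giving a smoother noise ramp than a linear schedule."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    abar = torch.cos(((steps / T) + s) / (1 + s) * math.pi / 2) ** 2
    abar = abar / abar[0]                          # normalize so the first value is 1
    betas = 1.0 - (abar[1:] / abar[:-1])           # beta_t from the ratio of consecutive abar
    return betas.clamp(max=0.999).float()

betas = cosine_beta_schedule(1000)                 # T = 1000 as in Section 4.4
```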
In addition to the noise schedule improvement, a learnable variance is introduced, which enables the model to dynamically adjust noise levels during the generation process. By learning the changes in noise, the model can more flexibly adapt to the needs of different image generation tasks, thereby improving the diversity and accuracy of the generated results. The extended form of the denoising process is as follows:
$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) + \sigma_t \cdot \eta$
where $\sigma_t$ is the standard deviation of the noise learned by the model and $\eta$ is noise sampled from the standard normal distribution. Unlike the traditional fixed variance, the learned variance allows the model to adjust the intensity of the noise according to the state of the current generation phase. This means that the model can adaptively choose the most appropriate noise level at different generation steps, thus improving the accuracy and quality of the generation.
In the denoising process, the noisy image is progressively denoised by the KAN module and the diffusion module until it is restored to RES. Compared to the fixed denoising steps used in DDPM, we introduce an iterative denoising process in which multiple iterations are performed at each time step. This strategy allows the model to achieve more accurate denoising at each step, improving the overall quality of the generated samples. By refining the denoising process over multiple iterations, the KAN guidance can more fully capture the complex relationships between modalities and more efficiently integrate the information when dealing with multimodal data.
In summary, the noise addition process degrades the data distribution into Gaussian random noise, whereas the denoising process learns how to remove the random noise introduced by the noise addition process. The diffusion model is trained by maximizing its variational lower bound (VLB) using the simple supervised loss function in Equation (2).
For the training of the MergeCNN module, we use the mean squared error (MSE) as the loss function:
$L_{\mathrm{MergeCNN}} = \mathbb{E}\left[ \left( \mathrm{MergeCNN}(\mathrm{HSI\_D}, \mathrm{MSI}) - \mathrm{GT} \right)^2 \right]$
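The two-stage training described above can be sketched as follows. The module and helper names (`diffusion`, `merge_cnn`, `diffusion_loss`, `sample_res`, `wavelet_fuse`, `loader`) are hypothetical placeholders, while the AdamW optimizers and $10^{-4}$ learning rate follow the settings reported in Section 4.4.

```python
import torch
import torch.nn.functional as F

def boost_train_epoch(diffusion, merge_cnn, loader,
                      diffusion_loss, sample_res, wavelet_fuse,
                      opt_diff, opt_merge):
    """One epoch of the two-stage (boost) training; each optimizer updates only
    its own module, so the two backward passes never interfere."""
    for hsi_gt, hsi_up, msi in loader:                  # aligned (GT, upsampled HSI, MSI) triples
        # stage 1: optimize the KAN-guided diffusion module on the residual RES
        loss_diff = diffusion_loss(diffusion, hsi_gt - hsi_up,
                                   condition=wavelet_fuse(msi, hsi_up))
        opt_diff.zero_grad()
        loss_diff.backward()
        opt_diff.step()

        # stage 2: optimize MergeCNN on the detached preliminary result HSI_D
        with torch.no_grad():
            hsi_d = hsi_up + sample_res(diffusion, msi, hsi_up)
        loss_merge = F.mse_loss(merge_cnn(hsi_d, msi), hsi_gt)   # MSE loss above
        opt_merge.zero_grad()
        loss_merge.backward()
        opt_merge.step()
```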

3.4. Inference Process

The inference process is shown in Figure 2. We aim to learn, starting from random noise, an RES that applies to a specific HSI_UP; the result of the preliminary wavelet-transform fusion of the multispectral image (MSI) and HSI_UP is used as the condition, and the trained model denoises the random noise to obtain the RES, which is then summed with HSI_UP to obtain HSI_D. Afterward, HSI_D and the MSI are concatenated and fed into the trained MergeCNN, and finally a hyperspectral image with high spatial resolution is obtained.

4. Experiments

In this section, a series of experiments is conducted to validate the effectiveness and superiority of the proposed KANDiff method. We first describe the experimental details, including the datasets used, the evaluation metrics, and the training process. Then, the hyperspectral and multispectral fusion dataset of the lunar surface that we produced is presented, and quantitative and qualitative analyses are performed on two public datasets, CAVE and Harvard, and on our Moon dataset to evaluate the proposed framework.

4.1. CAVE and Harvard

The first dataset is the CAVE dataset, acquired by an Apogee Alta U260 camera (manufactured by Apogee, USA). It contains 32 hyperspectral images (HSI) and 31 corresponding multispectral images (MSI) with red, green, and blue (RGB) channels at a resolution of 512 × 512 pixels. The dataset covers 31 spectral channels with wavelengths ranging from 400 nm to 700 nm at 10 nm intervals, covering the visible spectrum. Following previous research methodology, we selected 20 of the images as the training set and used the remaining 11 images as the test set (see Figure 4). This data division helps to evaluate the performance of the algorithm in hyperspectral image processing.
The Harvard dataset consists of 77 hyperspectral images covering different scenes indoors and outdoors. Each image has a resolution of 1392 × 1040 pixels, contains 31 spectral channels, and the spectral range extends from 420 nm to 720 nm. In our experiments, the upper-left region (512 × 512 pixels) of 20 of these images was cropped, and 10 of them were used for training, while the remaining images were used for testing (see Figure 5).

4.2. Moon

In this paper, we constructed a lunar remote sensing image dataset containing hyperspectral image (HSI) and multispectral image (MSI) data, which can be used for lunar remote sensing image analysis and processing tasks. The data information and dataset production process are described in detail below (see Figure 6).

4.2.1. Data Source

The hyperspectral images in this dataset are obtained from the lunar image data provided by NASA's Planetary Data System (PDS). From the PDS website, we obtain high spectral resolution data from the Moon Mineralogy Mapper (M3) carried on the Chandrayaan-1 lunar orbiter, an imaging spectrometer that records spectral information of the light reflected from the lunar surface, covering 86 continuous spectral channels in the visible to near-infrared bands. The data cover about 99% of the lunar surface globally, with a spatial resolution ranging from 140 to 280 m, and span the period from October 2008 to May 2009. We use these rich hyperspectral data for lunar multispectral and hyperspectral image fusion (MHIF) analysis studies.
In addition, we use the Multiband Imager (MI) data from the Japan Aerospace Exploration Agency (JAXA) "Kaguya" lunar orbiter, which provides data in nine bands from the visible to the near-infrared at a resolution of 20 m/pixel, covering 99% of the lunar surface. The data are divided by longitude range, and each file consists of image data (.IMG) and corresponding metadata (.lbl). The data are calibrated and aligned to NASA's Planetary Data System (PDS) standards and can be used to support studies of various aspects of the lunar surface, including material composition and topographic features.

4.2.2. Data Preprocessing

Firstly, the GDAL library is used to read the reflectance (RFL) and location (LOC) files of the HSI data in order to obtain basic information such as the number of bands and the spatial resolution; the latitude, longitude, and height information is read from the LOC file and used for subsequent cropping and processing. Based on the coordinates of the region of interest (ROI), the HSI data of the corresponding region are cropped, the first three invalid bands are eliminated, and only the valid bands are retained. Then, based on the geographic range of the HSI data, we compile the list of MSI image files covering the same location and download these images from the JAXA open database, using the requests library to handle the download task. After the download is completed, the files are stored in a folder associated with the HSI data file name for easy management.
In addition, the GDAL library is used to read the nine bands of data from each MSI image file and stitch them into a large NumPy array based on the latitude and longitude range of the image, forming the complete MSI data. Since the spatial resolutions of the MSI and HSI data differ, we use bilinear interpolation to resample the MSI data to the same spatial resolution as the HSI data, ensuring that the two types of data are spatially aligned.
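The band reading and resampling step can be sketched as follows; the file path and target shape are placeholders, the use of SciPy for the bilinear interpolation is an assumption, and the tile stitching over longitude ranges described above is omitted.

```python
import numpy as np
from osgeo import gdal
from scipy.ndimage import zoom

def load_and_align_msi(msi_path, hsi_shape):
    """Read all nine MI bands with GDAL and resample them onto the HSI grid
    with bilinear interpolation (order=1)."""
    ds = gdal.Open(msi_path)
    bands = np.stack([ds.GetRasterBand(i + 1).ReadAsArray()
                      for i in range(ds.RasterCount)], axis=0)   # (9, H, W)
    zoom_y = hsi_shape[0] / bands.shape[1]
    zoom_x = hsi_shape[1] / bands.shape[2]
    return zoom(bands, (1.0, zoom_y, zoom_x), order=1)           # bilinear resampling
```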

4.2.3. Data Preservation and Visualization

We save the processed HSI and MSI data as .npy files, with the MSI file names associated with the corresponding HSI file names, and we output the latitude and longitude of the four corners of the HSI data for reference in subsequent research. In addition, to check data quality, the HSI and MSI data are visualized to facilitate the detection of outliers or noise, which provides a basis for subsequent processing; if problems are found, further data cleaning and preprocessing can be performed. This dataset is produced in preparation for the subsequent lunar spectral image fusion study.

4.2.4. Dataset Construction

Ground Truth Production: According to the Wald protocol, we first degrade the original hyperspectral image (HSI) by interpolation to reduce its spatial resolution and generate the degraded hyperspectral images. The degraded hyperspectral data are then interpolated and up-sampled to HSI_UP (304 × 304 pixels) with the same resolution as the MSI, ensuring that the two kinds of data are spatially aligned. We use the original hyperspectral image (HSI) as the Ground Truth (GT) for evaluating and comparing model performance.
Data Normalization: Firstly, the HSI, MSI, and up-sampled HSI_UP images are clipped by removing the highest 2% and lowest 2% of gray values. This step helps to remove outliers and noise from the images and improves data quality. The clipped images are then linearly stretched so that the data values are normalized to between 0 and 1. This process enhances image contrast, optimizes the visual appearance of the images, and prepares them for subsequent model training (a minimal sketch of this step is given at the end of this subsection).
Dataset Division: The processed hyperspectral images, multispectral images, and the corresponding Ground Truth are divided into a training set, a validation set, and a test set; the training set contains 170 samples, the validation set contains 58 samples, and the test set contains 58 samples.
Data Release: The training set, validation set, and test set are saved as HDF5 format files named train.h5, valid.h5, and test.h5, respectively, for subsequent data reading and processing. Such systematic data preprocessing and partitioning ensures that the model can be trained and evaluated with standardized inputs, and it facilitates data reuse and experimental reproducibility.
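The clipping and linear stretching described in the Data Normalization step above can be sketched in a few lines of NumPy; per-band application and the exact percentile handling are assumptions of this sketch.

```python
import numpy as np

def clip_and_stretch(band, low_pct=2.0, high_pct=98.0):
    """Remove the lowest/highest 2% of gray values and linearly stretch to [0, 1]."""
    lo, hi = np.percentile(band, [low_pct, high_pct])
    clipped = np.clip(band, lo, hi)
    return (clipped - lo) / (hi - lo + 1e-12)
```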

4.3. Performance Metrics

We use PSNR, ERGAS, SAM, CC, and SSIM to quantitatively evaluate the fusion results of the compared methods [44].
SAM is a metric reflecting the degree of spectral loss in the fusion results, and the closer the value of SAM is to 0, the lower the spectral distortion of the fused hyperspectral image. Thus, we have the following:
$\mathrm{SAM}(X_i, \hat{X}_i) = \cos^{-1}\left( \frac{\sum_{i=1}^{n} X_i \hat{X}_i}{\sqrt{\sum_{i=1}^{n} X_i^2} \cdot \sqrt{\sum_{i=1}^{n} \hat{X}_i^2}} \right)$
where $X_i$ is the reflectance value of the HSI_GT spectrum in the $i$th spectral band and $\hat{X}_i$ is the reflectance value of the HRHS spectrum in the $i$th spectral band.
ERGAS is a metric that characterizes the overall error between the fused image and the reference image, and a smaller ERGAS value means a better fusion effect. The ERGAS is formulated as follows:
$\mathrm{ERGAS}(X_i, \hat{X}_i) = 100 \times \frac{1}{N} \sqrt{ \frac{1}{M} \sum_{i=1}^{M} \left( \frac{\mathrm{RMSE}(X_i, \hat{X}_i)}{\mu_{\hat{X}_i}} \right)^2 }$
where $N$ is the ratio between the pixel sizes of $X_i$ and $\hat{X}_i$, $M$ is the number of spectral bands in the image, RMSE is the root mean square error, and $\mu_{\hat{X}_i}$ is the mean value of $\hat{X}_i$.
PSNR measures the spatial quality of each band in the reconstructed image; a higher value indicates that the reconstructed image is closer to the original, i.e., better image quality. The PSNR is defined as follows:
$\mathrm{PSNR}(X_i, \hat{X}_i) = 10 \cdot \log_{10} \frac{\max(X_i)^2}{\mathrm{MSE}(X_i, \hat{X}_i)}$
where MSE is the mean square error.
CC is a metric that describes the spatial distortion of the fused image. The closer its value is to 1, the better the spatial quality of the fused image is. Thus, we have the following:
$\mathrm{CC}(X_i, \hat{X}_i) = \frac{\sum_{i=1}^{n} (X_i - \mu_{X_i})(\hat{X}_i - \mu_{\hat{X}_i})}{\sqrt{\sum_{i=1}^{n} (X_i - \mu_{X_i})^2 \sum_{i=1}^{n} (\hat{X}_i - \mu_{\hat{X}_i})^2}}$
SSIM is commonly used to measure the structural similarity of the fused image; its value is between 0 and 1, and the closer it is to 1, the higher the similarity between the two images is. Thus, we have the following:
$\mathrm{SSIM}(X_i, \hat{X}_i) = \frac{(2 \mu_{X_i} \mu_{\hat{X}_i} + C_1)(2 \sigma_{X_i \hat{X}_i} + C_2)}{(\mu_{X_i}^2 + \mu_{\hat{X}_i}^2 + C_1)(\sigma_{X_i}^2 + \sigma_{\hat{X}_i}^2 + C_2)}$
where $\mu_{X_i}$ and $\sigma_{X_i}^2$ are the mean and variance of $X_i$, $\sigma_{X_i \hat{X}_i}$ is the covariance between $X_i$ and $\hat{X}_i$, and $C_1$ and $C_2$ are two fixed constants.
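For reference, the following NumPy sketch computes the band-averaged PSNR and the mean spectral angle for a (bands, H, W) image pair. It mirrors the definitions above, but the averaging conventions (per-band maxima, mean over pixels) are assumptions that may differ in detail from the evaluation code used for the reported results.

```python
import numpy as np

def mean_psnr(gt, pred):
    """Band-averaged PSNR for images in [0, 1]; gt, pred: (bands, H, W)."""
    mse = np.mean((gt - pred) ** 2, axis=(1, 2))
    peak = np.max(gt, axis=(1, 2))
    return float(np.mean(10.0 * np.log10(peak ** 2 / (mse + 1e-12))))

def mean_sam(gt, pred):
    """Mean spectral angle (radians) over all pixels; gt, pred: (bands, H, W)."""
    num = np.sum(gt * pred, axis=0)
    den = np.linalg.norm(gt, axis=0) * np.linalg.norm(pred, axis=0) + 1e-12
    return float(np.mean(np.arccos(np.clip(num / den, -1.0, 1.0))))
```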

4.4. Implementation Details

Our proposed method is implemented in Python 3.9.19 and PyTorch 2.2.2 on a Linux operating system with two NVIDIA GeForce RTX 3090 GPUs. The KAN part is based on the pykan 0.2.6 implementation, with the number of grid intervals set to 3 and the order of the piecewise polynomial set to 2. For the diffusion model, the optimizer is AdamW with a learning rate of $10^{-4}$, and three layers are used for both the encoder and the decoder. The initial number of channels in the encoder layer is set to 32 on the CAVE and Harvard datasets and to 91 on the Moon dataset. After each encoder/decoder layer, the channel count is multiplied or divided using the factors 1, 2, and 4, corresponding to the three encoder/decoder layers. In the diffusion process, we set the number of diffusion steps T to 1000.
In the MergeCNN part, a three-layer convolutional neural network is used, with ReLU as the activation function. Here again, the AdamW optimizer with a learning rate of $10^{-4}$ is used, and the output of the diffusion model part serves as the input to this part under the boost learning scheme.

4.5. Benchmark

In order to evaluate the performance of our method, in this section, we compare the proposed method with 10 commonly used methods, including the optimized Brovey transform with haze correction method (BT-H) [83], as well as some competitive deep learning methods such as PNN [58], DiCNN [60], MSDCNN [61], FusionNet [63], PSRT [57], DRPNN [59], PanNet [62], MOG [84], and DDIF [70]. For a fair comparison, all deep learning methods are trained using the same input pairs. In addition, the choice of relevant hyperparameters is consistent with the original paper.

5. Results and Discussion

In this section, we compare the performance of our method with the most classical and some advanced models to demonstrate the advantages of KANDiff. Finally, we perform ablation experiments to validate the effectiveness of KAN and MergeCNN.

5.1. Results on CAVE Dataset

We conduct experiments to evaluate the performance of KANDiff on the CAVE dataset. Figure 4 shows the 11 test images synthesized in RGB colors.
For qualitative evaluation (see Figure 7), we show the fusion results as well as some error plots to aid visual inspection. Compared with the benchmark methods, the colors, detail shapes, and edges in our results are closer to the GT, and our method performs better in detail recovery and visual quality. In the error plots, the darker the blue and the lighter the red, the closer the result is to the GT. By comparing the results in Figure 7 with the corresponding close-ups in the rectangular boxes, the superior performance of the proposed method is evident: the colors of the objects, the shapes of the details, and the edges are all closer to the GT. The residual plots also show that the gap between our results and the GT is small.
As can be seen in Table 1, KANDiff outperforms the other methods in three of the five metrics, namely ERGAS, PSNR, and SSIM. Specifically, compared to the second-best method, PSRT, it achieves improvements of 2.83% in ERGAS and 0.75% in PSNR, and compared to the third-best method, MOG-DCN, it achieves improvements of 2.78%, 5.05%, 26.57%, and 3.70% in SAM, ERGAS, PSNR, and SSIM, respectively.
In addition, we analyze the spectral accuracy of the fusion process using spectral vectors. We choose pixel (38, 73) of the sixth image in the test set, and Figure 8 depicts the spectral vectors of its 31 bands. For convenience, we zoom in on the spectral vectors of 4 bands (bands 12 to 16); see the rectangular boxes in Figure 8. As can be seen in both figures, the spectral vector of the proposed method (dashed line) is the closest to the GT (black line).

5.2. Results on Harvard Dataset

We conducted experiments on the Harvard dataset to evaluate the performance of the model. As shown in Figure 9, an image from the test set is selected for the presentation of the fusion results. The images generated by our method are closer to the GT in terms of color, detail shape, and edges. Additionally, the error plot corresponding to our method has darker blue areas and fewer red regions, further indicating that our results are closer to the GT. The numerical evaluations are reported in Table 2, where our model outperforms the other models in terms of PSNR and SSIM; specifically, it achieves a 29.43% improvement in PSNR compared to the MOG-DCN model.
By comparing the results in Figure 9 with the corresponding close-up images within the rectangular boxes, it is evident that the proposed method performs the best. The colors of the objects, the shapes of the details, and the edges in the figure are all closer to the ground truth (GT). Additionally, the residual plot indicates that the discrepancy between our results and the ground truth is minimal.

5.3. Results on Moon Dataset

We experimentally evaluate the performance of KANDiff on the Moon dataset. As can be seen in Figure 10, the proposed method excels in image detail reconstruction, exhibiting the darkest blue in the error plot and indicating closer proximity to the GT. As shown in Table 3, the KANDiff model performs well on several key metrics, significantly outperforming the other methods in terms of PSNR, SSIM, and CC. Specifically, our model achieves a PSNR of 28.3864, an improvement of 0.17% over the next-best PanNet (28.3385). On SSIM, our model achieves 0.8891, a 6.93% improvement over DiCNN (0.8315) and a 2.4% improvement over the next-best DDIF, indicating a strong advantage in image detail recovery and structural similarity. In addition, the CC is 0.9666, a 0.59% improvement over DDIF (0.9609), further illustrating the excellent performance of our model in preserving the consistency of spectral and spatial information. Overall, KANDiff has clear advantages on key metrics such as PSNR and SSIM, showing significant improvements in image reconstruction quality and perceptual quality.
When comparing the main results in Figure 10 with the enlarged sections highlighted by the rectangular boxes, the superior performance of our proposed method becomes clear. The colors of the objects, the intricacy of their shapes, and the sharpness of the edges in our images more closely match the ground truth (GT). Additionally, the residual plot demonstrates that the difference between our results and the ground truth is minimal.

5.4. Ablation Study

This section presents ablation studies to evaluate the contributions of the KAN module and the MergeCNN module in KANDiff; for brevity, and without compromising the generality of the model, the experiments use the CAVE dataset.
As shown in Table 4, we first evaluate the algorithm with both the KAN and MergeCNN modules removed (i.e., DDIF) and use it as a baseline. To assess the KAN module, we remove the MergeCNN module from KANDiff. Compared to DDIF, introducing only the KAN module improves all metrics: there are improvements of 0.06, 0.13, 0.21, 0.0019, and 0.0002 in SAM, ERGAS, PSNR, CC, and SSIM, respectively, which indicates the effectiveness of using KAN as a guidance module. In addition, to test the effectiveness of the MergeCNN module, we remove the KAN module from KANDiff and incorporate only the MergeCNN module. The results show that with only the MergeCNN module, the algorithm matches DDIF on SSIM and improves by 0.04, 0.04, 0.11, and 0.001 on SAM, ERGAS, PSNR, and CC, respectively, verifying the effectiveness of the MergeCNN module. The full KANDiff results on the CAVE dataset are also shown in Table 4; compared to DDIF, KANDiff improves the metrics more than incorporating either the KAN module or the MergeCNN module alone, so the two modules jointly play a positive role in KANDiff.

6. Conclusions

We propose a model for the MHIF task called KANDiff. The model introduces two key components: the KAN and MergeCNN. The learnable activation functions of the KAN enhance the representation of salient features. In the denoising stage of the diffusion model, the KAN serves as a guidance network, helping the model understand the complex relationships between modalities and significantly improving the signal-to-noise ratio of the fusion results. MergeCNN excels in recovering detail and edge information; it addresses the residual noise of the diffusion model through deep feature fusion, significantly enhancing the model's stability and further optimizing the preliminary results generated by the diffusion model. We conduct experiments on our Moon dataset as well as the public CAVE and Harvard datasets. The quantitative and qualitative results show that KANDiff outperforms the variant without the KAN and MergeCNN modules in all metrics. In addition, compared with the transformer-based method, KANDiff improves PSNR by 26.5% and SSIM by 3.7%, indicating that KANDiff preserves the spatial and spectral information of the image well. At present, the application of score-based models is relatively new; in future work, a score-based model could be used to fuse hyperspectral and multispectral images to further improve the fusion quality.

Author Contributions

W.L. completed the experiment, made a dataset of the lunar surface, analyzed the results, and participated in the compilation of the article; L.L. designed the overall experiment, proposed the optimization plan and the lunar dataset production plan, and participated in the compilation of the article; M.P. provided the data source and the method to download the data; and R.T. proposed the optimization plan of the experiment, participated in the conception of the experimental plan, and participated in the compilation of the article. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 62471049), the Open Fund of the State Key Laboratory of Remote Sensing Science (Grant No. OFSLRSS202317), the National Key Research and Development Program of China (Grant No. 2022YFF0503100), and the Young Backbone Teacher Support Plan of Beijing Information Science and Technology University (Grant No. YBT202413).

Data Availability Statement

The data provided in this study are publicly available.

Acknowledgments

The authors would like to express their gratitude to the editors and the anonymous reviewers for their insightful comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Xu, H.; Ma, J.; Jiang, J.; Guo, X.; Ling, H. U2Fusion: A unified unsupervised image fusion network. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 502–518.
2. Zhang, H.; Xu, H.; Tian, X.; Jiang, J.; Ma, J. Image fusion meets deep learning: A survey and perspective. Inf. Fusion 2021, 76, 323–336.
3. Vivone, G. Multispectral and hyperspectral image fusion in remote sensing: A survey. Inf. Fusion 2023, 89, 405–417.
4. Wang, W.; Fu, X.; Zeng, W.; Sun, L.; Zhan, R.; Huang, Y.; Ding, X. Enhanced deep blind hyperspectral image fusion. IEEE Trans. Neural Netw. Learn. Syst. 2021, 34, 1513–1523.
5. Arienzo, A.; Vivone, G.; Garzelli, A.; Alparone, L.; Chanussot, J. Full-resolution quality assessment of pansharpening: Theoretical and hands-on approaches. IEEE Geosci. Remote Sens. Mag. 2022, 10, 168–201.
6. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
7. Bioucas-Dias, J.M.; Plaza, A.; Camps-Valls, G.; Scheunders, P.; Nasrabadi, N.; Chanussot, J. Hyperspectral remote sensing data analysis and future challenges. IEEE Geosci. Remote Sens. Mag. 2013, 1, 6–36.
8. Zhu, K.; Sun, Z.; Zhao, F.; Yang, T.; Tian, Z.; Lai, J.; Long, B.; Li, S. Remotely sensed canopy resistance model for analyzing the stomatal behavior of environmentally-stressed winter wheat. ISPRS J. Photogramm. Remote Sens. 2020, 168, 197–207.
9. Wang, S.; Yue, J.; Liu, J.; Tian, Q.; Wang, M. Large-scale few-shot learning via multi-modal knowledge discovery. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part X; Springer: Berlin/Heidelberg, Germany, 2020; pp. 718–734.
10. Cheng, X.; Zhang, M.; Lin, S.; Zhou, K.; Zhao, S.; Wang, H. Two-stream isolation forest based on deep features for hyperspectral anomaly detection. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5504205.
11. Cheng, X.; Huo, Y.; Lin, S.; Dong, Y.; Zhao, S.; Zhang, M.; Wang, H. Deep feature aggregation network for hyperspectral anomaly detection. IEEE Trans. Instrum. Meas. 2024, 73, 5033016.
12. Li, J.; Zheng, K.; Gao, L.; Ni, L.; Huang, M.; Chanussot, J. Model-informed multi-stage unsupervised network for hyperspectral image super-resolution. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5516117.
13. Huo, Y.; Cheng, X.; Lin, S.; Zhang, M.; Wang, H. Memory-augmented autoencoder with adaptive reconstruction and sample attribution mining for hyperspectral anomaly detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5518118.
14. West, H.; Quinn, N.; Horswell, M. Remote sensing for drought monitoring & impact assessment: Progress, past challenges and future opportunities. Remote Sens. Environ. 2019, 232, 111291.
15. Khan, A.; Vibhute, A.D.; Mali, S.; Patil, C.H. A systematic review on hyperspectral imaging technology with a machine and deep learning methodology for agricultural applications. Ecol. Inform. 2022, 69, 101678.
16. Hossain, M.D.; Chen, D. Segmentation for Object-Based Image Analysis (OBIA): A review of algorithms and challenges from remote sensing perspective. ISPRS J. Photogramm. Remote Sens. 2019, 150, 115–134.
17. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
18. Carrino, T.A.; Crósta, A.P.; Toledo, C.L.B.; Silva, A.M. Hyperspectral remote sensing applied to mineral exploration in southern Peru: A multiple data integration approach in the Chapi Chiara gold prospect. Int. J. Appl. Earth Obs. Geoinf. 2018, 64, 287–300.
19. Bishop, C.A.; Liu, J.G.; Mason, P.J. Hyperspectral remote sensing for mineral exploration in Pulang, Yunnan Province, China. Int. J. Remote Sens. 2011, 32, 2409–2426.
20. Wu, C.; Du, B.; Cui, X.; Zhang, L. A post-classification change detection method based on iterative slow feature analysis and Bayesian soft fusion. Remote Sens. Environ. 2017, 199, 241–255.
21. Chavez, P.; Sides, S.C.; Anderson, J.A. Comparison of three different methods to merge multiresolution and multispectral data: Landsat TM and SPOT panchromatic. Photogramm. Eng. Remote Sens. 1991, 57, 295–303.
22. Shahdoosti, H.R.; Ghassemian, H. Combining the spectral PCA and spatial PCA fusion methods by an optimal filter. Inf. Fusion 2016, 27, 150–160.
23. Shah, V.P.; Younan, N.H.; King, R.L. An efficient pan-sharpening method via a combined adaptive PCA approach and contourlets. IEEE Trans. Geosci. Remote Sens. 2008, 46, 1323–1335.
  24. Tu, T.M.; Su, S.C.; Shyu, H.C.; Huang, P.S. A new look at IHS-like image fusion methods. Inf. Fusion 2001, 2, 177–186. [Google Scholar] [CrossRef]
  25. Laben, C.A.; Brower, B.V. Process for Enhancing the Spatial Resolution of Multispectral Imagery Using Pan-Sharpening. US Patent 6,011,875, 4 January 2000. [Google Scholar]
  26. Aiazzi, B.; Baronti, S.; Selva, M. Improving component substitution pansharpening through multivariate regression of MS + Pan data. IEEE Trans. Geosci. Remote Sens. 2007, 45, 3230–3239. [Google Scholar] [CrossRef]
  27. Qu, J.; Li, Y.; Dong, W. Hyperspectral pansharpening with guided filter. IEEE Geosci. Remote Sens. Lett. 2017, 14, 2152–2156. [Google Scholar] [CrossRef]
  28. Nunez, J.; Otazu, X.; Fors, O.; Prades, A.; Pala, V.; Arbiol, R. Multiresolution-based image fusion with additive wavelet decomposition. IEEE Trans. Geosci. Remote Sens. 1999, 37, 1204–1211. [Google Scholar] [CrossRef]
  29. Aiazzi, B.; Alparone, L.; Baronti, S.; Garzelli, A.; Selva, M. MTF-tailored multiscale fusion of high-resolution MS and Pan imagery. Photogramm. Eng. Remote Sens. 2006, 72, 591–596. [Google Scholar] [CrossRef]
  30. Vivone, G.; Restaino, R.; Chanussot, J. Full scale regression-based injection coefficients for panchromatic sharpening. IEEE Trans. Image Process. 2018, 27, 3418–3431. [Google Scholar] [CrossRef] [PubMed]
  31. Otazu, X.; González-Audícana, M.; Fors, O.; Núñez, J. Introduction of sensor spectral response into image fusion methods. Application to wavelet-based methods. IEEE Trans. Geosci. Remote Sens. 2005, 43, 2376–2385. [Google Scholar] [CrossRef]
  32. Vivone, G.; Restaino, R.; Chanussot, J. A regression-based high-pass modulation pansharpening approach. IEEE Trans. Geosci. Remote Sens. 2017, 56, 984–996. [Google Scholar] [CrossRef]
  33. Duran, J.; Buades, A.; Coll, B.; Sbert, C. A nonlocal variational model for pansharpening image fusion. SIAM J. Imaging Sci. 2014, 7, 761–796. [Google Scholar] [CrossRef]
  34. Fang, F.; Li, F.; Shen, C.; Zhang, G. A variational approach for pan-sharpening. IEEE Trans. Image Process. 2013, 22, 2822–2834. [Google Scholar] [CrossRef] [PubMed]
  35. Moeller, M.; Wittman, T.; Bertozzi, A.L. A variational approach to hyperspectral image fusion. In Proceedings of the Algorithms and Technologies for Multispectral, Hyperspectral, and Ultraspectral Imagery XV, SPIE, Orlando, FL, USA, 13–17 April 2009; Volume 7334, pp. 502–511. [Google Scholar]
  36. Fu, X.; Lin, Z.; Huang, Y.; Ding, X. A variational pan-sharpening with local gradient constraints. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10265–10274. [Google Scholar]
  37. He, X.; Condat, L.; Bioucas-Dias, J.M.; Chanussot, J.; Xia, J. A new pansharpening method based on spatial and spectral sparsity priors. IEEE Trans. Image Process. 2014, 23, 4160–4174. [Google Scholar] [CrossRef] [PubMed]
  38. Deng, L.J.; Feng, M.; Tai, X.C. The fusion of panchromatic and multispectral remote sensing images via tensor-based sparse modeling and hyper-Laplacian prior. Inf. Fusion 2019, 52, 76–89. [Google Scholar] [CrossRef]
  39. Wu, Z.C.; Huang, T.Z.; Deng, L.J.; Hu, J.F.; Vivone, G. VO+ Net: An adaptive approach using variational optimization and deep learning for panchromatic sharpening. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5401016. [Google Scholar] [CrossRef]
  40. Liu, X.; Liu, Q.; Wang, Y. Remote sensing image fusion based on two-stream fusion network. Inf. Fusion 2020, 55, 1–15. [Google Scholar] [CrossRef]
  41. Guo, Q.; Li, S.; Li, A. An efficient dual spatial–spectral fusion network. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5412913. [Google Scholar] [CrossRef]
  42. Zhang, X.; Huang, W.; Wang, Q.; Li, X. SSR-NET: Spatial–spectral reconstruction network for hyperspectral and multispectral image fusion. IEEE Trans. Geosci. Remote Sens. 2020, 59, 5953–5965. [Google Scholar] [CrossRef]
  43. Wang, X.; Hu, Q.; Jiang, J.; Ma, J. A group-based embedding learning and integration network for hyperspectral image super-resolution. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5541416. [Google Scholar] [CrossRef]
  44. Xu, Q.; Li, Y.; Nie, J.; Liu, Q.; Guo, M. UPanGAN: Unsupervised pansharpening based on the spectral and spatial loss constrained generative adversarial network. Inf. Fusion 2023, 91, 31–46. [Google Scholar] [CrossRef]
  45. Su, X.; Li, J.; Hua, Z. Transformer-based regression network for pansharpening remote sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5407423. [Google Scholar] [CrossRef]
  46. Khan, M.M.; Chanussot, J.; Condat, L.; Montanvert, A. Indusion: Fusion of multispectral and panchromatic images using the induction scaling technique. IEEE Geosci. Remote Sens. Lett. 2008, 5, 98–102. [Google Scholar] [CrossRef]
  47. Vivone, G.; Alparone, L.; Chanussot, J.; Dalla Mura, M.; Garzelli, A.; Licciardi, G.A.; Restaino, R.; Wald, L. A critical comparison among pansharpening algorithms. IEEE Trans. Geosci. Remote Sens. 2014, 53, 2565–2586. [Google Scholar] [CrossRef]
  48. Zhou, M.; Huang, J.; Fu, X.; Zhao, F.; Hong, D. Effective pan-sharpening by multiscale invertible neural network and heterogeneous task distilling. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5411614. [Google Scholar] [CrossRef]
  49. Peng, S.; Deng, L.J.; Hu, J.F.; Zhuo, Y.W. Source-Adaptive Discriminative Kernels based Network for Remote Sensing Pansharpening. In Proceedings of the IJCAI, Vienna, Austria, 23–29 July 2022; pp. 1283–1289. [Google Scholar]
  50. Kingma, D.P. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  51. Li, H.; Wu, X.J. DenseFuse: A fusion approach to infrared and visible images. IEEE Trans. Image Process. 2018, 28, 2614–2623. [Google Scholar] [CrossRef] [PubMed]
  52. Ma, J.; Yu, W.; Chen, C.; Liang, P.; Guo, X.; Jiang, J. Pan-GAN: An unsupervised pan-sharpening method for remote sensing image fusion. Inf. Fusion 2020, 62, 110–120. [Google Scholar] [CrossRef]
  53. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, NSW, Australia, 6–11 August 2017; pp. 214–223. [Google Scholar]
  54. Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 13–19 June 2020; pp. 8110–8119. [Google Scholar]
  55. Kingma, D.P.; Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems; MIT Press: Montreal, QC, Canada, 2018; Volume 31. [Google Scholar]
  56. Dinh, L.; Sohl-Dickstein, J.; Bengio, S. Density estimation using real nvp. arXiv 2016, arXiv:1605.08803. [Google Scholar]
  57. Deng, S.Q.; Deng, L.J.; Wu, X.; Ran, R.; Hong, D.; Vivone, G. PSRT: Pyramid shuffle-and-reshuffle transformer for multispectral and hyperspectral image fusion. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5503715. [Google Scholar] [CrossRef]
  58. Masi, G.; Cozzolino, D.; Verdoliva, L.; Scarpa, G. Pansharpening by convolutional neural networks. Remote Sens. 2016, 8, 594. [Google Scholar] [CrossRef]
  59. Wei, Y.; Yuan, Q.; Shen, H.; Zhang, L. Boosting the accuracy of multispectral image pansharpening by learning a deep residual network. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1795–1799. [Google Scholar] [CrossRef]
  60. He, L.; Rao, Y.; Li, J.; Chanussot, J.; Plaza, A.; Zhu, J.; Li, B. Pansharpening via detail injection based convolutional neural networks. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 1188–1204. [Google Scholar] [CrossRef]
  61. Yuan, Q.; Wei, Y.; Meng, X.; Shen, H.; Zhang, L. A multiscale and multidepth convolutional neural network for remote sensing imagery pan-sharpening. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 978–989. [Google Scholar] [CrossRef]
  62. Yang, J.; Fu, X.; Hu, Y.; Huang, Y.; Ding, X.; Paisley, J. PanNet: A deep network architecture for pan-sharpening. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5449–5457. [Google Scholar]
  63. Deng, L.J.; Vivone, G.; Jin, C.; Chanussot, J. Detail injection-based deep convolutional neural networks for pansharpening. IEEE Trans. Geosci. Remote Sens. 2020, 59, 6995–7010. [Google Scholar] [CrossRef]
  64. Yang, Y.; Tu, W.; Huang, S.; Lu, H. PCDRN: Progressive cascade deep residual network for pansharpening. Remote Sens. 2020, 12, 676. [Google Scholar] [CrossRef]
  65. Xie, Q.; Zhou, M.; Zhao, Q.; Xu, Z.; Meng, D. MHF-Net: An interpretable deep network for multispectral and hyperspectral image fusion. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1457–1473. [Google Scholar] [CrossRef]
  66. Wang, W.; Zhou, Z.; Liu, H.; Xie, G. MSDRN: Pansharpening of multispectral images via multi-scale deep residual network. Remote Sens. 2021, 13, 1200. [Google Scholar] [CrossRef]
  67. Ma, J.; Tang, L.; Fan, F.; Huang, J.; Mei, X.; Ma, Y. SwinFusion: Cross-domain long-range learning for general image fusion via swin transformer. IEEE/CAA J. Autom. Sin. 2022, 9, 1200–1217. [Google Scholar] [CrossRef]
  68. Wang, Z.; Chen, Y.; Shao, W.; Li, H.; Zhang, L. SwinFuse: A residual swin transformer fusion network for infrared and visible images. IEEE Trans. Instrum. Meas. 2022, 71, 5016412. [Google Scholar] [CrossRef]
  69. Meng, Q.; Shi, W.; Li, S.; Zhang, L. Pandiff: A novel pansharpening method based on denoising diffusion probabilistic model. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5611317. [Google Scholar] [CrossRef]
  70. Cao, Z.; Cao, S.; Deng, L.J.; Wu, X.; Hou, J.; Vivone, G. Diffusion model with disentangled modulations for sharpening multispectral and hyperspectral images. Inf. Fusion 2024, 104, 102158. [Google Scholar] [CrossRef]
  71. Li, X.; Thickstun, J.; Gulrajani, I.; Liang, P.S.; Hashimoto, T.B. Diffusion-lm improves controllable text generation. Adv. Neural Inf. Process. Syst. 2022, 35, 4328–4343. [Google Scholar]
  72. Karras, T.; Aittala, M.; Aila, T.; Laine, S. Elucidating the design space of diffusion-based generative models. Adv. Neural Inf. Process. Syst. 2022, 35, 26565–26577. [Google Scholar]
  73. Saharia, C.; Chan, W.; Chang, H.; Lee, C.; Ho, J.; Salimans, T.; Fleet, D.; Norouzi, M. Palette: Image-to-image diffusion models. In Proceedings of the ACM SIGGRAPH 2022 Conference Proceedings, Vancouver, BC, Canada, 7–11 August 2022; pp. 1–10. [Google Scholar]
  74. Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-based generative modeling through stochastic differential equations. arXiv 2020, arXiv:2011.13456. [Google Scholar]
  75. Nichol, A.Q.; Dhariwal, P. Improved denoising diffusion probabilistic models. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 18–24 July 2021; pp. 8162–8171. [Google Scholar]
  76. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  77. Chen, N.; Zhang, Y.; Zen, H.; Weiss, R.J.; Norouzi, M.; Chan, W. Wavegrad: Estimating gradients for waveform generation. arXiv 2020, arXiv:2009.00713. [Google Scholar]
  78. Saharia, C.; Ho, J.; Chan, W.; Salimans, T.; Fleet, D.J.; Norouzi, M. Image super-resolution via iterative refinement. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 4713–4726. [Google Scholar] [CrossRef] [PubMed]
  79. Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Hou, T.Y.; Tegmark, M. Kan: Kolmogorov-arnold networks. arXiv 2024, arXiv:2404.19756. [Google Scholar]
  80. Phung, H.; Dao, Q.; Tran, A. Wavelet diffusion models are fast and scalable image generators. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10199–10208. [Google Scholar]
  81. Zhang, Y.; Hong, G. An IHS and wavelet integrated approach to improve pan-sharpening visual quality of natural colour IKONOS and QuickBird images. Inf. Fusion 2005, 6, 225–234. [Google Scholar] [CrossRef]
  82. Daubechies, I. Ten lectures on wavelets. In Society for Industrial and Applied Mathematics; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 1992. [Google Scholar]
  83. Lolli, S.; Alparone, L.; Garzelli, A.; Vivone, G. Haze correction for contrast-based multispectral pansharpening. IEEE Geosci. Remote Sens. Lett. 2017, 14, 2255–2259. [Google Scholar] [CrossRef]
  84. Dong, W.; Zhou, C.; Wu, F.; Wu, J.; Shi, G.; Li, X. Model-guided deep hyperspectral image super-resolution. IEEE Trans. Image Process. 2021, 30, 5754–5768. [Google Scholar] [CrossRef] [PubMed]
Figure 1. The forward diffusion process (a) and the backward (denoising) process (b) of the diffusion model.
Figure 2. KANDiff flowchart: The difference between HSI_GT and HSI_UP (RES) is used as the input of the diffusion model. MSI and HSI_UP are first fused through wavelet transform as the condition of the diffusion model. The KAN module is used as a guide. The output of the diffusion model is added to HSI_UP to obtain HSI_D. Finally, HSI_D and MSI are concatenated and input to the MergeCNN module. (a) The KAN-based guidance module. (b) The diffusion module. (c) The MergeCNN module.
Figure 3. The details of the diffusion module: (a) depicts the structure of the encoder, middle layers, and decoder, and (b) depicts the structure of the block. ‘cond’ denotes the condition of the diffusion model, ‘res’ is the difference between HSI_GT and HSI_UP, and ‘SiLU’ is an activation function.
Figure 4. Test images from the CAVE dataset. (a) Balloons. (b) Flowers. (c) Chart and stuffed toy. (d) Clay. (e) Fake and real beers. (f) Jelly beans. (g) Fake and real lemon slices. (h) Fake and real tomatoes. (i) Feathers. (j) Hairs. (k) compact disc (CD).
Figure 5. Test images from the Harvard dataset. (a) Tree. (b) Door. (c) Window. (d) Backpack. (e) Bikes. (f) Wall. (g) Sofa1. (h) Fence. (i) Sofa2. (j) Parcels.
Figure 6. Test images from the Moon dataset.
Figure 7. The first and third rows of the four-row figure show the GT of “Feathers” in the CAVE dataset and the true-color representations of the fusion results for KANDiff and the 10 compared methods, respectively. The second and fourth rows show the residuals between the fusion results of the different methods and the GT. (a) GT. (b) KANDiff. (c) PSRT. (d) DRPNN. (e) MOG-DCN. (f) DICNN. (g) DDIF. (h) FusionNet. (i) MSDCNN. (j) PNN. (k) PanNet. (l) BDSD.
Figure 8. Spectral vectors of the GT and the benchmark.
Figure 9. The first and third rows of the four-row figure show the GT of a test image from the Harvard dataset and the true-color representations of the fusion results for the 10 comparison methods and KANDiff, respectively. The second and fourth rows show the residuals between the fusion results of the various methods and the GT. (a) GT. (b) KANDiff. (c) PSRT. (d) DRPNN. (e) MOG-DCN. (f) DICNN. (g) DDIF. (h) FusionNet. (i) MSDCNN. (j) PNN. (k) PanNet. (l) BDSD.
Figure 10. The first and third rows of the four-row figure show the GT of a test image from the Moon dataset and the true-color representations of the fusion results for the 10 comparison methods and KANDiff, respectively. The second and fourth rows show the residuals between the fusion results of the various methods and the GT. (a) GT. (b) KANDiff. (c) PSRT. (d) DRPNN. (e) MOG-DCN. (f) DICNN. (g) DDIF. (h) FusionNet. (i) MSDCNN. (j) PNN. (k) PanNet. (l) BDSD.
Table 1. Quantitative results on the CAVE dataset comparing some classical and some advanced approaches to image fusion. The best results are bolded, and the second-best results are underlined.
Results of the Test Set (CAVE)

Method      SAM       ERGAS     PSNR      CC        SSIM
BDSD        20.1990   13.6788   31.0707   0.5656    0.6751
PNN          3.4741    3.1986   38.7736   0.9880    0.9330
MSDCNN       2.9897    2.8949   39.9050   0.9898    0.9455
FusionNet    3.1401    2.7181   40.2041   0.9894    0.9464
PanNet       3.1709    2.8659   40.1297   0.9896    0.9472
DICNN        2.8914    2.5000   41.0076   0.9894    0.9539
DRPNN        2.8488    2.3889   41.5133   0.9913    0.9554
MOG-DCN      2.7686    2.4370   41.1254   0.9908    0.9580
PSRT         2.6348    2.3855   42.0750   0.9910    0.9604
DDIF         2.8598    2.4864   52.8545   0.9623    0.9957
KANDiff      2.6937    2.3199   53.2564   0.9645    0.9959
Table 2. Quantitative results on the Harvard dataset comparing some of the classical and some advanced approaches to image fusion. The best results are bolded, and the second-best results are underlined.
Results of the Test Set (Harvard)

Method      SAM       ERGAS     PSNR      CC        SSIM
BDSD        17.3106   11.5576   30.9710   0.4185    0.7214
PNN          2.7208    2.2261   36.3448   0.9948    0.9464
MSDCNN       2.5962    2.1066   36.8498   0.9952    0.9502
FusionNet    2.6584    2.0977   36.7939   0.9951    0.9492
PanNet       2.5381    2.0481   37.0526   0.9953    0.9507
DICNN        2.5222    2.0137   31.1957   0.9952    0.9506
DRPNN        2.4793    1.9424   37.3477   0.9954    0.9519
MOG-DCN      2.3422    1.8763   37.6848   0.9958    0.9561
PSRT         2.5794    2.0794   36.9512   0.9949    0.9466
DDIF         2.7213    2.1712   48.6701   0.8448    0.9847
KANDiff      2.6806    2.1566   48.7768   0.8462    0.9851
Table 3. Quantitative results on the Moon dataset comparing some classical and some advanced approaches to image fusion. The best results are bolded, and the second-best results are underlined.
Results of the Test Set (Moon)

Method      SAM       ERGAS     PSNR      CC        SSIM
BDSD        12.3555    8.0568   17.7414   0.6575    0.3267
PNN          3.6148    2.6051   28.1159   0.9584    0.8198
MSDCNN       3.2555    2.5778   28.2307   0.9590    0.8245
FusionNet    3.4679    2.5610   28.2724   0.9595    0.8227
PanNet       3.4316    2.5432   28.3385   0.9600    0.8239
DICNN        3.2897    2.6634   28.1659   0.9563    0.8315
DRPNN        3.3304    2.8583   27.4946   0.9498    0.8092
MOG-DCN      3.2815    2.7940   27.6940   0.9514    0.8122
PSRT         3.3215    2.6767   28.1634   0.9556    0.8315
DDIF         3.7597    2.8887   27.5838   0.9609    0.8683
KANDiff      3.4788    2.6815   28.3864   0.9666    0.8891
Table 4. Quantitative results on the CAVE dataset comparing the experimental results of DDIF and DDIF combined with different modules. The best results are bolded, and the second-best results are underlined.
Results of the Test Set (CAVE)

Method            SAM       ERGAS     PSNR      CC        SSIM
DDIF              2.8598    2.4864    52.8545   0.9623    0.9957
KAN + DDIF        2.7969    2.3583    53.0663   0.9642    0.9959
DDIF + MergeCNN   2.8185    2.4411    52.9659   0.9634    0.9957
KANDiff           2.6937    2.3199    53.2564   0.9645    0.9959
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

