Article

A Multi-Level Cross-Attention Image Registration Method for Visible and Infrared Small Unmanned Aerial Vehicle Targets via Image Style Transfer

Radar Monitoring Technology Laboratory, School of Information Science and Technology, North China University of Technology, Beijing 100144, China
* Author to whom correspondence should be addressed.
Submission received: 21 June 2024 / Revised: 31 July 2024 / Accepted: 2 August 2024 / Published: 7 August 2024
(This article belongs to the Special Issue Deep Learning and Computer Vision in Remote Sensing-III)

Abstract

Small UAV target detection and tracking based on cross-modality image fusion have gained widespread attention. Due to the limited feature information available from small UAVs in images, where they occupy a minimal number of pixels, the precision required for detection and tracking algorithms is particularly high in complex backgrounds. Image fusion techniques can enrich the detailed information for small UAVs, showing significant advantages under extreme lighting conditions. Image registration is a fundamental step preceding image fusion. It is essential to achieve accurate image alignment before proceeding with image fusion to prevent severe ghosting and artifacts. This paper specifically focused on the alignment of small UAV targets within infrared and visible light imagery. To address this issue, this paper proposed a cross-modality image registration network based on deep learning, which includes a structure preservation and style transformation network (SPSTN) and a multi-level cross-attention residual registration network (MCARN). Firstly, the SPSTN is employed for modality transformation, transferring the cross-modality task into a single-modality task to reduce the information discrepancy between modalities. Then, the MCARN is utilized for single-modality image registration, capable of deeply extracting and fusing features from pseudo infrared and visible images to achieve efficient registration. To validate the effectiveness of the proposed method, comprehensive experimental evaluations were conducted on the Anti-UAV dataset. The extensive evaluation results validate the superiority and universality of the cross-modality image registration framework proposed in this paper, which plays a crucial role in subsequent image fusion tasks for more effective target detection.

1. Introduction

In recent years, the rapid development of small UAV technology has made them important tools to improve efficiency and solve problems, with immense potential for applications in various fields. However, as their application scope continues to expand, incidents of unauthorized UAV flights are becoming increasingly frequent, posing significant safety risks to individuals and society. Consequently, the pursuit of advanced target detection technologies for small UAVs has become an imperative and pressing endeavor.
Given the safety risks posed by small UAVs, the application of target detection technology has become particularly crucial. However, traditional target detection methods are limited by many factors such as image quality and ambient lighting, resulting in poor detection accuracy and stability under adverse environmental conditions. More importantly, because small UAVs have smaller sizes, the accuracy requirements for target detection algorithms are significantly higher. Figure 1 shows two different types of UAVs and small UAVs flying in different scenarios.
In recent years, image fusion technology has gradually become an effective means to address such issues. Image fusion utilizes multi-source image information, such as visible, infrared, and radar images, to produce a fused image with enriched information. When using visible images, infrared images, or radar images individually for target detection, each modality has its own limitations. Visible images are affected by lighting conditions and may not provide clear target information in low light or adverse weather. Infrared images can be affected by changes in the surface temperature of objects, making it difficult to identify key targets when their temperature is similar to the surrounding environment. Radar images can be affected by terrain and noise, making it difficult to accurately detect targets in complex terrains or under strong interference. Therefore, it is necessary to combine information from multiple sensors to improve the accuracy and robustness of small UAV detection. By comprehensively utilizing cross-modality information, image fusion technology can effectively detect small UAVs under various lighting conditions, enhancing the reliability and stability of detection.
However, due to the differences in the imaging mechanisms of different sensors, the images obtained from these sensors are often misaligned spatially. Directly using misaligned images for fusion may lead to severe ghosting, resulting in poor quality of the fused image, as presented in Figure 2. Figure 2a shows the two main modalities studied in this paper: visible images and infrared images. Figure 2b compares the misalignment caused by imaging differences from different sensors. Therefore, it is necessary to perform image registration before image fusion.
Image registration refers to the process of spatially aligning images obtained from different sensors to ensure that their scales and coordinate information are the same. It geometrically aligns two images, referred to as the reference and sensed images [1]. During the registration process, it is necessary to consider factors such as errors, distortions, and positional shifts of different sensors to ensure that the fused image can accurately reflect the target’s position and features. The current mainstream image registration methods are mainly divided into feature-based registration and area-based registration [2]. Image registration directly impacts the quality of image fusion, making the selection of the appropriate registration method crucial. The choice between feature-based and area-based methods is often contingent upon the specific characteristics of the images and the requirements of the application. Feature-based methods rely on unique points or edges within the images to establish correspondences, while area-based methods consider the overall intensity distribution for alignment.
Feature-based registration methods involve extracting significant features from the image, such as prominent area features, line features, and point features, to obtain the key feature points of the image. These key feature points reflect the parts of the image with high discriminability. Then, corresponding descriptors are generated to describe these feature points, which are used to express the local features of the key points in the image. These features capture key information in the image, such as texture, corners, and edges. An effective descriptor should provide a unique and stable representation for the feature points in the image, maintaining consistency even when the image undergoes transformations such as rotation, scaling, and brightness changes. Therefore, descriptors enable the matching of visual features in different images. By comparing the descriptors of feature points in different images, it can be determined whether they correspond to the same physical location. Traditional feature matching algorithms such as SIFT [3], RIFT [4], and OS-SIFT [5] have been proven to be very effective in single-modality image matching, but they cannot be directly applied to cross-modality matching tasks.
To address this issue, Cui et al. [6] adopted a soft feature detection method to identify key feature points in thermal infrared and visible images, effectively extracting key feature points from two different modalities of images that are conducive to matching. Deng et al. [7] proposed a novel coupled feature detection and description method, which re-coupled the independent loss functions of detection and description through a mutual weighting strategy. In addition, a super detector was proposed, which has a large receptive field and a learnable non-maximum suppression layer, to enhance the detector’s ability to capture global information and the saliency of key points. Feature-based methods do not directly process image intensity values but use more advanced feature information.
Area-based methods do not involve a feature detection step but rather obtain the deformation field between the reference and shifted images through image similarity measurement algorithms. The deformation field describes the transformation relationship of each point in the image from its original position to its target position. By establishing an accurate deformation field, the pixel points in one image can be transformed to the corresponding positions in another image, thus achieving precise alignment between images. The calculation of image similarity is closely related to the distribution of pixel data on the image and the appearance of the image. Designing a similarity calculation function to optimize cross-modality image registration tasks has proven to be quite challenging [8]. Therefore, an effective and widely used approach is to first transform cross-modality images into single-modality images before proceeding with subsequent registration tasks.
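As a concrete illustration of how a deformation field is applied, the following PyTorch sketch warps an image with a dense per-pixel displacement field using grid_sample; the function name and tensor layout are assumptions made for illustration, not the implementation of any particular method cited here.

```python
import torch
import torch.nn.functional as F

def warp_with_deformation_field(image, flow):
    """Warp an image with a dense deformation field (illustrative sketch).

    image: (B, C, H, W) tensor, e.g. the sensed image.
    flow:  (B, 2, H, W) tensor of per-pixel displacements (dx, dy) in pixels,
           i.e. the deformation field predicted by a registration network.
    """
    b, _, h, w = image.shape
    # Build the identity sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0).to(image.device)  # (1, 2, H, W)
    # Displace the grid by the deformation field.
    new_grid = grid + flow
    # Normalize coordinates to [-1, 1], as required by grid_sample.
    new_grid[:, 0] = 2.0 * new_grid[:, 0] / (w - 1) - 1.0
    new_grid[:, 1] = 2.0 * new_grid[:, 1] / (h - 1) - 1.0
    # grid_sample expects a (B, H, W, 2) grid with (x, y) ordering.
    return F.grid_sample(image, new_grid.permute(0, 2, 3, 1),
                         mode="bilinear", align_corners=True)
```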
CycleGAN [9] is a variant of generative adversarial networks that allows unsupervised image-to-image translation. Luo et al. [10] proposed an MPTN module to generate pseudo infrared images, which uses the CycleGAN framework but replaces the original discriminator with a U-Net. Additionally, an SCN module was designed to correct structural information bias in the generated pseudo infrared images. Xu et al. [11] combined image registration and fusion using a coarse-to-fine approach. The process involves mapping cross-modality images into a shared space, performing global corrections with affine transformations at multiple scales, and refining the alignment with a local non-rigid deformation field. Fused image feedback enhances accuracy, resulting in high-quality aligned and fused images. Wang et al. [12] proposed a registration network incorporating a coarse-to-fine approach. They utilized the cycle-consistent perceptual style transfer network (CPSTN) to generate pseudo-infrared images. In the C2F-DFE module, a deformation field is predicted, which is used for resampling to reconstruct the registered infrared image.
However, many popular algorithms focus on global image registration and often neglect local details, which is disadvantageous for registering small UAV targets. Therefore, image registration for small UAV targets represents a task of significant academic importance and remains an unresolved challenge. In this paper, small UAVs exhibit two prominent characteristics. Physically, the UAVs are relatively small in size and lightweight. In terms of imagery, the small UAV target occupies a limited portion of the image frame, which restricts the availability of feature information related to the UAV.
In response to the unique characteristics of small UAVs, this paper proposes an innovative image registration framework specifically calibrated for small UAV targets. In the SPSTN module, a novel structural consistency loss is introduced to preserve the detailed information of the UAVs and the structural integrity of the building backgrounds in the generated pseudo-images, thereby enhancing the quality of the pseudo-images. In the MCARN module, residual connections are used to extract image features, effectively preserving the details of small UAV targets. Additionally, the integration of cross-attention mechanisms enables the model to focus on the most relevant feature regions between different parts of the image pairs, capturing the corresponding features in both modalities. A coarse-to-fine registration approach is adopted to achieve precise image registration. This framework demonstrates exceptional proficiency in capturing the nuanced features of small targets, thereby enabling precise and sophisticated image registration.
The main contributions of our work are summarized as follows:
  • Different from many current methods that focus on global image registration, this paper primarily investigated the registration of small UAV targets in localized regions. It analyzed the challenges associated with small UAVs in the registration task and proposed a robust cross-modality image registration framework specifically designed for small UAV targets.
  • An innovative registration model was introduced, which transforms cross-modality images into single-modality ones through SPSTN and performs registration of single-modality images with MCARN. The model effectively maintains the structure of small UAVs after modal transformation and improves the extraction capability of important details for small UAVs.
  • To validate the performance of the proposed method, the network was compared with several popular image registration networks. Experimental results, based on common evaluation metrics, demonstrated that our method outperforms other state-of-the-art methods.
The arrangement of the rest of the paper is as follows: Section 2 introduces some popular deep learning cross-modality image registration algorithms. Section 3 presents the proposed cross-modality image registration network for small UAV targets. Section 4 evaluates the experimental results on the UAV dataset, including comparative experiments and ablation studies. Section 5 discusses the findings and limitations. Finally, the conclusion is presented in Section 6.

2. Related Work

2.1. Cross-Modality Image Transformation

In recent years, deep learning-based methods have achieved significant advancements in the field of cross-modality image registration. These methods leverage deep neural networks to automatically learn effective feature representations, thereby better handling the complex relationships between modalities. For instance, convolutional neural networks (CNNs) have been extensively studied for feature extraction [13]. Currently, widely used deformation field-based image registration methods typically perform image alignment at the pixel and feature levels, which poses significant challenges in choosing appropriate similarity metrics for cross-modality tasks [14].
The application of generative adversarial networks in cross-modality transformation has demonstrated enormous potential [15]. Using generative adversarial networks to transform cross-modality images into single-modality ones to improve the accuracy of similarity metrics has become a widely accepted choice.
The generative adversarial network is used to generate pseudo infrared images that are visually more consistent with the target modality (e.g., infrared images) [16]. These pseudo images are then used for single-modality image registration. The core idea of this method is to learn the mapping between the two modalities, generating pseudo images with similar characteristics to the target modality, thus simplifying the subsequent registration process [17]. A typical workflow for generating pseudo infrared images is illustrated in Figure 3.
In summary, the method of generating pseudo images to transform the cross-modality registration task into a single-modality registration task provides an effective solution to the issue of appearance and modality differences in cross-modality image registration. The successful application of this method demonstrates the great potential of deep learning in the field of cross-modality image registration and offers new insights and directions for further research [18]. Future research can focus on further optimizing the performance of the generative networks and exploring more types of inter-modality mappings to enhance registration outcomes and expand the scope of applications.

2.2. Cross-Modality Image Registration

Cross-modality image registration becomes complex due to the appearance variations induced by different modalities. This issue arises in multi-sensor and medical images [19]. Typically, image registration methods can be categorized into traditional methods and deep learning-based methods. Traditional methods can be broadly classified into three categories: transformation-based methods, measure-based methods, and optimization methods [20].
Transformation-based methods involve manual analysis of common characteristics in cross-modality images and extraction of descriptors to enhance consistency. These methods require the meticulous design of transformation models to balance accuracy and optimization. Measure-based methods, including correlation-based and information theory-based metrics, attempt to quantify similarities between modalities. Correlation-based metrics assume a linear relationship between the intensity distributions of cross-modal images, which limits the applicability of correlation-based measurement methods [21]. Information theory-based metrics, such as mutual information, face challenges in identifying the global maximum across the entire search space [22]. Optimization methods focus on designing appropriate objective functions and structures, but they often require extensive parameter tuning and the consideration of various factors when handling complex cross-modal data [23,24].
Traditional cross-modality image registration methods primarily rely on handcrafted features and similarity metrics, such as mutual information, gradient information [25], and phase correlation [26]. While these methods can alleviate modality differences to some extent, they are often less effective when dealing with highly nonlinear and complex modality transformations [1]. Moreover, traditional methods usually require precise initial alignment and significant computational resources, adding difficulty in practical applications.
To address the limitations of traditional methods, deep learning techniques have been increasingly adopted in cross-modality image registration. These techniques leverage the power of neural networks to automate feature extraction and transformation model design, thereby improving the registration process. For instance, Wei et al. [27] proposed a gradient-guided multispectral image registration model that can effectively address the registration challenges caused by changes in modality information. Arar et al. [28] proposed a multi-modality image registration method based on an image-to-image translation network. These aforementioned approaches are all grounded in deep learning. Evidently, deep learning-based image registration methods have gained widespread acceptance and hold significant research importance.
Despite the considerable research on cross-modality image registration, existing methods typically focus on registering entire scenes within images and do not pay adequate attention to small targets within the images. Additionally, existing research on small UAV target registration is relatively scarce. Therefore, this paper primarily investigated the task of small UAV target image registration. The main challenges in small UAV target image registration are as follows:
  • Low Spatial Coverage: The diminutive size of small UAV targets in imagery results in minimal pixel coverage, which restricts the availability of discernible feature information. This scarcity complicates the feature extraction process, often leading to insufficient clarity in details and a consequent difficulty in identifying robust features.
  • High Variability in Viewing Geometry: The dynamic nature of UAV flight introduces substantial variability in viewing angles and elevations, which manifests as significant disparities in the perspective and geometric transformations of UAV targets within the image frame.
  • Intense Environmental Interference: The occurrence of occlusions, shadows, and reflections within complex and mutable environmental contexts exerts a more pronounced influence on the registration of small UAV targets compared to other image registration scenarios, potentially leading to greater registration challenges.
Given these challenges, effective registration of small UAV targets requires advanced methods that address the limitations inherent in low spatial coverage, high variability in viewing geometry, and intense environmental interference. Addressing these issues involves developing robust feature extraction techniques capable of handling limited pixel information, designing algorithms that can accommodate dynamic changes in viewing perspectives, and incorporating adaptive strategies to mitigate the impact of environmental factors. By focusing on these aspects, the proposed methods aim to enhance the accuracy and reliability of small-target image registration, ultimately contributing to more effective monitoring and analysis of UAVs in various operational contexts.

3. The Cross-Modality Image Registration Network

The cross-modality registration paradigm (as shown in Figure 4) for small UAV targets in various backgrounds is proposed. SPSTN is an improvement based on CPSTN [12], which innovatively introduces structural consistency loss to more effectively preserve the structural features of both the UAV and the building background. MCARN is designed as a multi-level registration network with a coarse-to-fine approach, integrating a cross-attention mechanism.
Specifically, SPSTN takes visible images as input and generates pseudo-infrared images consistent with the infrared modality. Then, the pseudo-infrared images and infrared images are input into MCARN, which predicts a spatial deformation field. This spatial deformation field is then applied to the infrared images to produce the registered infrared images.
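A minimal sketch of this two-stage workflow is given below; `spstn`, `mcarn`, and `resampler` are hypothetical stand-ins for the trained networks and warping module described in Sections 3.1 and 3.2, and the function signature is our own assumption.

```python
import torch

def registration_pipeline(spstn, mcarn, resampler, vis, ir):
    """One forward pass of the proposed two-stage workflow (sketch).

    vis, ir: (B, C, H, W) visible and infrared image tensors.
    """
    pseudo_ir = spstn(vis)                              # modality transformation
    deformation_field = mcarn(pseudo_ir, ir)            # predict spatial deformation field
    ir_registered = resampler(ir, deformation_field)    # warp the infrared image
    return pseudo_ir, deformation_field, ir_registered
```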

3.1. Structure Preservation and Style Transformation Network

The SPSTN is designed for modality transformation and introduces a structural consistency loss to better maintain the structure of small UAVs. Specifically, the original infrared images $I_{ir}$ and visible images $I_{vis}$ are input into the SPSTN for training. The pseudo infrared images $\tilde{I}_{ir}$ can be obtained, which are then used as inputs for the subsequent module. The SPSTN can be expressed mathematically as:
$$\tilde{I}_{ir} = T_{\theta}(I_{vis})$$
where $T$ denotes the generator function of the SPSTN and $\theta$ denotes its parameters.
In the SPSTN, resnet_9blocks, which can more effectively preserve deep network feature information, was chosen as the generator. Resnet_9blocks consists of 9 resnet blocks. The framework of a resnet block is shown in Figure 5.
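For reference, a residual block of the kind used in resnet_9blocks can be sketched in PyTorch as follows; the reflection padding and instance normalization follow the common CycleGAN-style design and are assumptions rather than details confirmed by the paper.

```python
import torch.nn as nn

class ResnetBlock(nn.Module):
    """One residual block of a resnet_9blocks-style generator (sketch)."""

    def __init__(self, channels: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReflectionPad2d(1),
            nn.Conv2d(channels, channels, kernel_size=3),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.ReflectionPad2d(1),
            nn.Conv2d(channels, channels, kernel_size=3),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        # Identity skip connection: output = x + F(x), which helps preserve
        # deep feature information through the generator.
        return x + self.block(x)
```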
Loss Function. The loss function of SPSTN includes the structural consistency loss $L_{str}$, the style loss $L_{sty}$, and the edge loss $L_{edge}$. The structural consistency loss $L_{str}$, intended to ensure that pseudo images maintain similarity with the original images in both structural features and pixel values, can be expressed as:
$$L_{str} = \lambda_{A}\,\lambda_{id}\,\|I_{ir} - G_{A}(I_{vis})\|_{1} + \lambda_{B}\,\lambda_{id}\,\|I_{vis} - G_{B}(I_{ir})\|_{1}$$
where $G_{A}$ denotes the generator that transforms visible images into pseudo infrared images (modality B to A), $G_{B}$ denotes the generator that transforms infrared images into pseudo visible images, $\lambda_{A}$ denotes the weight for the cycle loss from B to A, $\lambda_{B}$ denotes the weight for the cycle loss from A to B, and $\lambda_{id}$ is the weight of the identity loss. The structural consistency loss ensures that pseudo images are more realistic at the pixel level, effectively preserving the structure of small UAVs. In this paper, $\lambda_{A}$ and $\lambda_{B}$ are set to 0.2, and $\lambda_{id}$ is set to 0.1.
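A minimal PyTorch sketch of this structural consistency loss, using the weights stated above, might look as follows; the function and argument names are our own.

```python
import torch.nn.functional as F

def structural_consistency_loss(ir, vis, g_a, g_b,
                                lambda_a=0.2, lambda_b=0.2, lambda_id=0.1):
    """Structural consistency loss as written in the formula above (sketch).

    g_a: generator mapping visible -> pseudo infrared (B to A).
    g_b: generator mapping infrared -> pseudo visible (A to B).
    The weight values follow those stated in the paper.
    """
    term_a = lambda_a * lambda_id * F.l1_loss(g_a(vis), ir)
    term_b = lambda_b * lambda_id * F.l1_loss(g_b(ir), vis)
    return term_a + term_b
```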
The style loss $L_{sty}$ controls the cycle consistency of the SPSTN at the feature level, using the Gram matrix $\beta$ to calculate the style differences between the generated image and the original image. The Gram matrix captures the correlations between different features within an image, helping to reduce checkerboard artifacts in the generated images. The style loss $L_{sty}$ can be formulated as follows:
$$L_{sty}^{\psi_{j}} = \omega_{j}\,\|\beta^{\psi_{j}}(I_{vis}) - \beta^{\psi_{j}}(G_{B}(G_{A}(I_{vis})))\|_{2} + \omega_{j}\,\|\beta^{\psi_{j}}(I_{ir}) - \beta^{\psi_{j}}(G_{A}(G_{B}(I_{ir})))\|_{2}$$
where $\beta$ represents the Gram matrix, $\psi_{j}$ represents the $j$-th layer of the VGG-19 network with $j \in \{2, 7, 12, 21, 30\}$, and $\omega \in \{\tfrac{1}{16}, \tfrac{1}{16}, \tfrac{1}{8}, 1, 1\}$ are the weights corresponding to each layer.
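The Gram-matrix computation and the per-layer style comparison can be sketched as below for one cycle direction (the full loss sums both directions); the VGG-19 feature extraction itself is assumed to be done outside this function, and the Gram normalization is a common convention rather than a detail stated in the paper.

```python
import torch

def gram_matrix(features):
    """Gram matrix of a feature map, capturing channel-wise correlations."""
    b, c, h, w = features.shape
    f = features.view(b, c, h * w)
    # Normalizing by the number of elements is a common convention (assumption).
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def style_loss_one_direction(vgg_feats_real, vgg_feats_cycled, layer_weights):
    """Weighted Gram-matrix distance over selected VGG-19 layers (sketch).

    vgg_feats_real / vgg_feats_cycled: lists of feature maps from the original
    image and its cycle-reconstructed counterpart at the chosen layers.
    layer_weights: per-layer weights, e.g. [1/16, 1/16, 1/8, 1, 1].
    """
    loss = 0.0
    for w, fr, fc in zip(layer_weights, vgg_feats_real, vgg_feats_cycled):
        loss = loss + w * torch.norm(gram_matrix(fr) - gram_matrix(fc), p=2)
    return loss
```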
The edge loss $L_{edge}$ focuses on the edge information of images by computing the Charbonnier distance between the gradients of the two image pairs. This design enhances the edge details in the generated images, thereby improving the visual quality and structural consistency of the images. The edge loss $L_{edge}$ can be formulated as follows:
$$L_{edge} = \|\nabla \tilde{I}_{ir} - \nabla I_{ir}\|_{char} + \|\nabla \tilde{I}_{vis} - \nabla I_{vis}\|_{char}$$
where $\nabla$ represents the Laplacian operator; $\tilde{I}_{ir}$ and $I_{ir}$ represent the generated pseudo infrared image and the infrared image, respectively; $\tilde{I}_{vis}$ and $I_{vis}$ represent the pseudo visible image and the original visible image, respectively; and $\|\cdot\|_{char}$ denotes the Charbonnier loss.
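A possible implementation of the edge loss is sketched below; the 3x3 Laplacian kernel and the Charbonnier epsilon are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def charbonnier(x, eps=1e-3):
    """Charbonnier penalty, a smooth approximation of the L1 norm (eps assumed)."""
    return torch.sqrt(x * x + eps * eps).mean()

def laplacian(img):
    """Apply a 3x3 Laplacian kernel channel-wise to extract edge responses."""
    k = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]],
                     device=img.device).view(1, 1, 3, 3)
    k = k.repeat(img.shape[1], 1, 1, 1)
    return F.conv2d(img, k, padding=1, groups=img.shape[1])

def edge_loss(pseudo_ir, ir, pseudo_vis, vis):
    """Charbonnier distance between edge maps of the image pairs (sketch)."""
    return charbonnier(laplacian(pseudo_ir) - laplacian(ir)) + \
           charbonnier(laplacian(pseudo_vis) - laplacian(vis))
```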
The total loss is defined as:
$$L_{spstn} = L_{sty} + \lambda_{1} L_{edge} + \lambda_{2} L_{str}$$
where $\lambda_{1}$ and $\lambda_{2}$ are the weights, both set to 8.

3.2. Cross-Attention Residual Registration Network

The MCARN is a coarse-to-fine registration network that focuses on small UAV targets. Infrared images may provide more detailed information in certain areas, while visible images may offer richer texture information in others. The MCARN fully exploits the complementary information between different modality images through a cross-attention mechanism, improving the accuracy of image registration. In addition, small-scale feature maps are utilized to allow the MCARN to learn features at different scales, which is especially important for the registration of small UAV targets. The entire registration network is divided into four modules: the cross-modality feature extraction network (CFEN), the coarse registration network (CRN), the fine registration network (FRN), and the re-sampler.
The CFEN consists of three convolutional blocks and a cross-attention layer, with the pseudo infrared image generated by SPSTN and the infrared image as inputs. Firstly, the convolutional blocks extract the feature information. Subsequently, the feature maps are flattened into one-dimensional vectors, which serve as inputs to the cross-attention layer. Through the cross-attention mechanism, the infrared and visible feature maps focus on each other's key information, and the CFEN effectively extracts the detailed information of a small UAV, so the model can align the image features more accurately. In the CRN, the network extracts features of the inputs and predicts a coarse matching cost $c_1$, which measures the similarity between the inputs. Then, to enable the network to better register small UAV targets, in the FRN, the output feature map of the last layer of the CRN, resized to half of its original size, is concatenated with $c_1$ along the channel dimension, and convolutional layers with residual connections predict the fine matching cost $c_2$. Finally, the coarse matching cost $c_1$ and the fine matching cost $c_2$ are summed to obtain the spatial transform $\varphi$ for registration, which is used in the re-sampler module to obtain the registered infrared images. The registration process can be expressed as:
$$c_{1} = \phi_{co}(\tilde{I}_{ir}, I_{ir})$$
$$c_{2} = \phi_{fi}(\mathrm{Concat}(\tilde{I}_{ir}, I_{ir}, c_{1}))$$
$$\varphi = c_{1} + c_{2}$$
$$I_{reg} = \varphi \times I_{ir}$$
where $\phi_{co}$ denotes the CRN, $\phi_{fi}$ denotes the FRN, $\tilde{I}_{ir}$ denotes the pseudo infrared image, $I_{reg}$ denotes the registered image, $\varphi$ denotes the spatial transform, $c_{1}$ denotes the coarse matching cost, and $c_{2}$ denotes the fine matching cost.
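The coarse-to-fine combination of matching costs can be summarized in code as follows; `crn`, `frn`, and `resampler` are placeholders for the sub-networks and warping module described above, not the paper's actual module definitions.

```python
import torch

def coarse_to_fine_registration(crn, frn, resampler, pseudo_ir, ir):
    """Coarse-to-fine deformation estimation mirroring the formulas above (sketch)."""
    c1 = crn(pseudo_ir, ir)                          # coarse matching cost
    c2 = frn(torch.cat((pseudo_ir, ir, c1), dim=1))  # fine matching cost from concatenated inputs
    phi = c1 + c2                                    # combined spatial transform
    ir_registered = resampler(ir, phi)               # warp the infrared image
    return ir_registered, phi
```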
Cross-attention mechanism. The small number of pixels occupied by a small UAV target in the image and its unclear features greatly increase the difficulty of feature extraction. As shown in Figure 6, a cross-attention mechanism is introduced in the registration network. By allowing features in one image to attend to all features in another image, cross-attention integrates the global context, which is crucial for accurate feature matching. The mechanism provides robustness against variations in viewpoint, illumination, and occlusion by dynamically adjusting the attention weights based on the global information from both images. Cross-attention excels in scenarios with low-texture or repetitive patterns where traditional feature detectors may fail, and the global receptive field helps to disambiguate features by considering their relationships with other features across the images. These advantages can be efficiently applied to image registration tasks, especially for small targets. In the cross-attention mechanism, given the feature maps $f_i$ and $f_j$, they are first flattened into one-dimensional vectors. Then, these vectors are linearly projected into Query, Key, and Value matrices. Next, the dot product of the Query and Key matrices is computed, and the softmax function is applied to the result to generate attention weights. Finally, these attention weights are multiplied by the Value matrix to obtain the weighted feature representation. The cross-attention layer is denoted as:
$$Q = W_{Q} f_{i}$$
$$K = W_{K} f_{j}$$
$$V = W_{V} f_{j}$$
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(Q K^{T})\, V$$
where $Q$ denotes the query vector, $K$ denotes the key vector, and $V$ denotes the value vector. The query vector seeks information from other vectors, the key vector determines the most relevant value vectors in relation to the query, and the value vector holds the actual information to be retrieved.
To reduce the computational complexity, a linear transformer is employed [29]. By replacing the original dot-product attention mechanism, it reduces the computational complexity from $O(N^{2})$ to $O(N)$. The kernel function of the linear transformer is defined as:
$$\mathrm{sim}(Q, K) = \phi(Q)\,\phi(K)^{T}$$
$$\phi(\cdot) = \mathrm{elu}(\cdot) + 1$$
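A sketch of cross-attention with this linear (elu + 1) kernel is given below; the tensor shapes, projection modules, and small stabilizing constant are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def linear_cross_attention(f_i, f_j, w_q, w_k, w_v):
    """Cross-attention with the linear elu+1 kernel (sketch).

    f_i, f_j: flattened feature maps of the two images, shape (B, N, C).
    w_q, w_k, w_v: linear projection layers, e.g. nn.Linear(C, C).
    Queries come from one image, keys/values from the other, so each image
    attends to the other image's features.
    """
    q = F.elu(w_q(f_i)) + 1.0          # phi(Q)
    k = F.elu(w_k(f_j)) + 1.0          # phi(K)
    v = w_v(f_j)
    # Associativity lets us compute phi(K)^T V first, giving O(N) complexity.
    kv = torch.einsum("bnc,bnd->bcd", k, v)
    z = 1.0 / (torch.einsum("bnc,bc->bn", q, k.sum(dim=1)) + 1e-6)
    return torch.einsum("bnc,bcd,bn->bnd", q, kv, z)
```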
Loss Function. The MCARN employs a bidirectional constraint loss function, ensuring that the image to be registered is aligned with the distorted pseudo infrared image at the feature level. The similarity loss $L_{sim}$ is defined as:
$$L_{sim} = \|\psi_{j}(I_{ir}^{reg}) - \psi_{j}(\tilde{I}_{ir})\|_{1} + \lambda_{re}\,\|\psi_{j}(\gamma(\varphi \times \tilde{I}_{ir})) - \psi_{j}(I_{ir})\|_{1}$$
where $\lambda_{re}$ denotes a weight, which is set to 0.2. In addition, similar to CGRP, to obtain a smooth distorted image, a smoothness loss is defined as:
$$L_{smooth} = \|\nabla \varphi\|_{1}$$
where $\nabla$ represents the Laplacian operator. Thus, the total registration loss is defined as:
$$L_{reg} = L_{sim} + \lambda_{sm} L_{smooth}$$
where $\lambda_{sm}$ is set to 10.
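Putting the two terms together, the registration loss could be sketched as follows; the smoothness term is approximated here with finite differences of the deformation field, which is an assumption about the exact operator used, and all names are our own.

```python
import torch.nn.functional as F

def registration_loss(vgg, ir_registered, pseudo_ir, pseudo_ir_warped, ir,
                      phi, lambda_re=0.2, lambda_sm=10.0):
    """Bidirectional similarity loss plus smoothness term (sketch).

    vgg: a perceptual feature extractor (e.g. a fixed VGG-19 layer).
    pseudo_ir_warped: the pseudo infrared image warped by phi.
    phi: the predicted spatial deformation field, shape (B, 2, H, W).
    """
    l_sim = F.l1_loss(vgg(ir_registered), vgg(pseudo_ir)) + \
            lambda_re * F.l1_loss(vgg(pseudo_ir_warped), vgg(ir))
    # Smoothness of the deformation field via spatial finite differences (assumed form).
    dx = phi[:, :, :, 1:] - phi[:, :, :, :-1]
    dy = phi[:, :, 1:, :] - phi[:, :, :-1, :]
    l_smooth = dx.abs().mean() + dy.abs().mean()
    return l_sim + lambda_sm * l_smooth
```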

4. Experiments and Results

4.1. Dataset Description

The cross-modality registration network was trained using the open-source Anti-UAV dataset. The Anti-UAV dataset includes six distinct UAV models, two different modalities (infrared and visible), and a variety of background environments. The dataset is stored in MP4 video format with a frame rate of 25 frames per second. To facilitate network training, the videos were segmented into sequences, each containing 1000 frames. To minimize the impact of textual elements on the registration task, text was removed from the images, preserving the integrity of the registration task as much as possible. As shown in Figure 7, to reduce interference from irrelevant factors in the image registration process, the images were cropped to a size of 320 × 156 pixels. Ultimately, 2374 image pairs were carefully selected from diverse scenarios to form the final dataset, with 1902 pairs for training and 472 pairs for testing.
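For readers reproducing the preprocessing, a sketch of splitting the Anti-UAV videos into cropped frames with OpenCV is shown below; the center-crop strategy, frame step, and file naming are assumptions, since the paper does not specify them.

```python
import cv2

def extract_and_crop(video_path, out_dir, crop_w=320, crop_h=156, step=1):
    """Split a video into frames and crop each to crop_w x crop_h (sketch)."""
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            h, w = frame.shape[:2]
            # Center crop (an assumption; any fixed region would work similarly).
            x0 = max((w - crop_w) // 2, 0)
            y0 = max((h - crop_h) // 2, 0)
            patch = frame[y0:y0 + crop_h, x0:x0 + crop_w]
            cv2.imwrite(f"{out_dir}/frame_{idx:05d}.png", patch)
        idx += 1
    cap.release()
```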

4.2. Implementation Details and Metrics

Our model was implemented in PyTorch, with all experiments conducted on a GeForce RTX 3060Ti GPU. For the SPSTN, we employed the Adam optimizer for training over 200 epochs. During each epoch, a batch of 8 image patches was randomly selected. The learning rate was set to 0.0002 for the first 100 epochs and then adjusted to 0.0001 for the subsequent 100 epochs. For the MCARN, the training setup consisted of 600 epochs with a batch size of 16, and the learning rate was set to 0.0002.
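The optimizer setup for the SPSTN described above can be sketched as follows; the Adam betas are an assumed default, and the two-stage learning rate (0.0002 for the first 100 epochs, 0.0001 afterwards) is realized here with a single halving at epoch 100.

```python
import torch

def make_optimizer_and_scheduler(model):
    """Adam optimizer with the learning-rate schedule described above (sketch)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4,
                                 betas=(0.5, 0.999))  # betas are an assumption
    # Halve the learning rate once, at epoch 100 (2e-4 -> 1e-4).
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[100], gamma=0.5)
    return optimizer, scheduler
```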
To quantitatively evaluate the performance of the proposed model on the Anti-UAV dataset, this paper used metrics including the structural similarity index measure (SSIM), mean squared error (MSE), normalized cross correlation (NCC), mutual information (MI), and normalized mutual information (NMI). These metrics effectively and comprehensively assess the visual quality and information content of the images, providing a thorough evaluation of the model’s effectiveness.
SSIM is a metric designed to assess the perceptual quality between two images. It is a commonly used image quality evaluation metric that focuses on quantifying the similarity between images. The SSIM formula is given by:
$$SSIM(X, Y) = \frac{(2\mu_{x}\mu_{y} + C_{1})(2\sigma_{xy} + C_{2})}{(\mu_{x}^{2} + \mu_{y}^{2} + C_{1})(\sigma_{x}^{2} + \sigma_{y}^{2} + C_{2})}$$
where $\mu_{x}$ and $\mu_{y}$ denote the mean values of images X and Y, respectively; $\sigma_{x}^{2}$ and $\sigma_{y}^{2}$ denote the variances of images X and Y; $\sigma_{xy}$ denotes the covariance between the two images; and $C_{1}$ and $C_{2}$ are stabilizing constants used to prevent the denominator from becoming zero, typically related to the dynamic range of the images.
MSE is a commonly used metric to measure the difference between the predicted values and the actual observed values, used to assess the model’s fit on a given dataset. MSE is calculated by taking the average of the squared differences between the predicted values and the actual observed values. The formula is expressed as follows:
$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_{i} - \hat{y}_{i})^{2}$$
where $n$ is the number of pixels, $y_{i}$ is the actual pixel value, and $\hat{y}_{i}$ is the pixel value predicted by the model.
NCC is a metric used to measure the similarity between two images. It is a variation of the cross-correlation that normalizes the result to account for variations in lighting and contrast between the images. NCC is particularly useful in image processing tasks such as template matching and image registration because it provides a robust measure of similarity that is less sensitive to changes in illumination. The formula is expressed as follows:
$$NCC(f, g) = \frac{\sum_{x,y}\left(f(x,y) - \bar{f}\right)\left(g(x,y) - \bar{g}\right)}{\sqrt{\sum_{x,y}\left(f(x,y) - \bar{f}\right)^{2}\sum_{x,y}\left(g(x,y) - \bar{g}\right)^{2}}}$$
where $f(x,y)$ and $g(x,y)$ are the pixel values of the two images at position $(x, y)$, and $\bar{f}$ and $\bar{g}$ are the mean pixel values of the two images, respectively. The NCC values range from −1 to 1, where 1 indicates perfect similarity, −1 indicates perfect dissimilarity, and 0 indicates no correlation.
MI is a metric used to quantify the amount of information obtained about one image through another image. It measures the dependency or shared information between two images and is widely used in image processing tasks such as image registration, where the goal is to align two images. A higher MI value indicates a greater amount of shared information and, thus, a higher degree of similarity between the two images. The formula for MI is given by:
$$I(X; Y) = \sum_{x \in X}\sum_{y \in Y} p(x, y)\log\left(\frac{p(x, y)}{p(x)\,p(y)}\right)$$
where $X$ and $Y$ are the two images being compared, $p(x, y)$ is the joint probability distribution of the pixel values in images X and Y, and $p(x)$ and $p(y)$ are the marginal probability distributions of the pixel values in images X and Y, respectively.
NMI is a metric used to measure the similarity between two images by comparing the information content shared between them. It is derived from mutual information (MI) but is normalized to account for the differences in image sizes and scales, making it a robust metric for various image processing tasks such as image registration and segmentation. The formula for NMI is given by:
$$NMI(X, Y) = \frac{2\,I(X; Y)}{H(X) + H(Y)}$$
where $I(X; Y)$ is the mutual information between images X and Y, and $H(X)$ and $H(Y)$ are the entropies of images X and Y, respectively.
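The similarity metrics above can be computed directly from image arrays; the NumPy sketch below implements NCC, MI, and NMI with a joint-histogram estimate, where the bin count is an assumption.

```python
import numpy as np

def ncc(f, g):
    """Normalized cross correlation between two images."""
    f0, g0 = f - f.mean(), g - g.mean()
    return (f0 * g0).sum() / np.sqrt((f0 ** 2).sum() * (g0 ** 2).sum())

def mutual_information(x, y, bins=64):
    """Mutual information estimated from a joint histogram (bin count assumed)."""
    joint, _, _ = np.histogram2d(x.ravel(), y.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return (pxy[nz] * np.log(pxy[nz] / (px[:, None] * py[None, :])[nz])).sum()

def nmi(x, y, bins=64):
    """Normalized mutual information: 2*I(X;Y) / (H(X) + H(Y))."""
    def entropy(p):
        p = p[p > 0]
        return -(p * np.log(p)).sum()
    joint, _, _ = np.histogram2d(x.ravel(), y.ravel(), bins=bins)
    pxy = joint / joint.sum()
    return 2.0 * mutual_information(x, y, bins) / (
        entropy(pxy.sum(axis=1)) + entropy(pxy.sum(axis=0)))
```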

4.3. Performance Analysis

4.3.1. Experiments of Modality Transformation Network

This paper compares SPSTN with CPSTN from CGRP and CycleGAN, adopting SSIM, MI, and MSE as evaluation metrics. The comparison results of these three algorithms are shown in Figure 8. It is noteworthy that CPSTN is currently one of the most outstanding modality transformation models and CycleGAN is a classic modality transformation model.
In all four scenes, the pseudo-infrared images generated by SPSTN exhibit the best visual representation of UAV structures. The pseudo-infrared images generated by CPSTN in Scene 1 and Scene 3, and those generated by CycleGAN in Scene 3 and Scene 4, fail to effectively preserve the structural information of the UAVs, resulting in significant distortions. Additionally, the architectural backgrounds in the pseudo-infrared images generated by SPSTN are the clearest. In contrast, the architectural structures in the pseudo-infrared images generated by CPSTN and CycleGAN show varying degrees of blurriness and smearing. Therefore, SPSTN consistently demonstrates superior image transformation capabilities. It effectively preserves the style and structure of the images, ensuring clear differentiation between the background and UAV structures, while maintaining the intricate details of both.
The comparison results of the three metrics are shown in Table 1. The top-performing results are highlighted in bold font for clarity. The metrics are calculated based on the infrared images and the pseudo infrared images generated by the different methods. Compared with the next best-performing method, SPSTN improved SSIM, MI, and MSE by 0.0029, 0.0346, and 0.2541, respectively. In brief, SPSTN achieves the best performance across all three metrics, demonstrating its versatility and stability.

4.3.2. Experiments of Cross-Modality Registration Network

The visual registration effects of weighted fusion of unregistered and registered image pairs, shown in Figure 9, are compared in this paper. In the fusion images of unregistered pairs, small UAVs exhibit severe ghosting and blurring. In contrast, in the fusion images of registered pairs, the artifacts are significantly reduced, demonstrating the network's effectiveness in registering small UAV targets.
The comparison results of MCARN with NEMAR [28], GCMR, UMF-CMGR, and VoxelMorph [14] are shown in Table 2, with NCC, SSIM, NMI, and MSE adopted as evaluation metrics. Notably, all registration algorithms are based on deformation fields to ensure a fair comparison. In addition, the metrics in the first row are calculated from the pseudo infrared images and infrared images obtained after the SPSTN. The results show that our method surpasses the other state-of-the-art methods across all common evaluation metrics.
MCARN achieved the best performance across all metrics. VoxelMorph exhibited deterioration in both the NCC and MSE metrics, NEMAR in both the SSIM and MSE metrics, and GCMR in SSIM. UMF-CMGR and the proposed method showed improvements across all metrics, but the improvements were larger with the proposed method. Compared to UMF-CMGR, the proposed method improved NCC, SSIM, NMI, and MSE by 0.0177, 0.0042, 0.0073, and 0.9159, respectively.

4.4. Ablation Study and Analysis

To evaluate the effectiveness of the proposed model, ablation studies were conducted on the Anti-UAV dataset. The purpose of these studies was to ascertain the impact of each constituent element on the overall performance of the network ensemble.

4.4.1. Ablation Experiments of the SPSTN Module

To maintain the structural details of small UAV targets and buildings, the SPSTN module innovatively introduces structural consistency loss.
As shown in Figure 10, the pseudo-infrared image generated by the full SPSTN is compared with the pseudo-infrared image generated after removing the structural consistency loss (the STN variant).
The pseudo-infrared image generated by SPSTN effectively preserves the structural information of the small UAV target, and the structures of buildings are also well maintained. However, after removing the structural consistency loss, the generated pseudo infrared image shows blurriness and deformation in the building structures.
As shown in Table 3, we quantitatively evaluated the quality of the pseudo-infrared images under the different designs using SSIM, MI, and MSE. In Table 3, STN represents the modality transformation network without the structural consistency loss. Compared to STN, SPSTN improved the metrics by 0.0161, 0.0244, and 0.4078, respectively. The ablation experiments indicate that the structural consistency loss effectively enhances the quality of the pseudo images, with those generated by SPSTN comprehensively surpassing those generated without it.

4.4.2. Ablation Experiments of the MCARN Module

MCARN innovatively introduces the cross-attention mechanism into the registration task, enhancing feature extraction by simultaneously focusing on key information in both infrared and optical images. By leveraging the correlations between the cross-modality images, it improves the registration accuracy.
To validate the effectiveness of the cross-attention mechanism, the ablation study compared the registration performance of the complete MCARN with that of the MCARN without the cross-attention mechanism (MRN). The visualization results are shown in Figure 11.
To clearly illustrate the comparison result, dashed reference lines were added to the images. It is evident that without the cross-attention mechanism, the small UAVs were not accurately registered. Conversely, the complete MCARN achieved more precise registration of small UAVs in detail. Table 4 presents the registration metrics for both the complete MCARN and the variant without the cross-attention mechanism. All metrics showed a decline when the cross-attention mechanism was removed. Therefore, the cross-attention mechanism significantly enhances the registration performance, allowing for more precise alignment of small UAVs. This mechanism is particularly effective at capturing small targets and detailed information, even in complex backgrounds. Additionally, it increases the network’s adaptability to various environments and conditions, thereby enhancing overall registration performance. This ability to finely align small UAVs demonstrates the significant advantage of incorporating cross-attention in the registration process.
Quantitative evaluations, including NCC, SSIM, NMI, and MSE, were adopted to compare the advantages of cross-attention mechanisms. The comparison results are shown in Table 4. The experimental results indicate that the cross-attention mechanism benefits cross-modality image registration tasks.

5. Discussion

In this paper, we proposed a framework combining modality transformation and cross-modality registration to address the challenge of recognizing and tracking multi-modal unmanned aerial vehicle (UAV) images. Through an in-depth examination and experimentation with the SPSTN and MCARN models, we achieved significant results, while also identifying several areas for further discussion.
The SPSTN model demonstrated excellent performance in the modality transformation task. Our experimental results, compared with CPSTN and CycleGAN, showed that SPSTN outperformed the other models in metrics such as Structural Similarity Index (SSIM), Mutual Information (MI), and Mean Squared Error (MSE). This indicates that our SPSTN model better preserves key features during the transformation process, especially in handling the details of UAVs and architectural backgrounds. However, despite the superior performance of SPSTN, there are still instances of detail loss and blurriness in high-complexity scenes. This suggests that future research could benefit from incorporating more advanced feature extraction and enhancement techniques to further improve the transformation quality.
The MCARN model also exhibited strong performance in the cross-modality registration task. When compared to existing algorithms like NEMAR, GCMR, UMF-CMGR, and VoxelMorph, MCARN excelled across all evaluation metrics. This demonstrates MCARN’s effectiveness in aligning images of different modalities, thereby enhancing registration accuracy. However, despite MCARN’s excellent performance in static image registration, challenges remain in real-world applications, particularly in handling dynamic video sequences. Future research could explore integrating temporal information to optimize the MCARN model for dynamic scenarios.
While the Anti-UAV dataset used in this study encompasses various UAV models and background environments, the diversity and complexity of UAVs and environments in real-world applications far exceed our experimental settings. To enhance the model’s generalizability, future work could incorporate more diverse datasets, including a wider range of UAV types and complex backgrounds. Moreover, optimizing the model’s robustness and real-time performance will be crucial to meet the demands of practical applications.
Even though this study made progress in multi-modal UAV image recognition and tracking, several issues warrant further exploration. These include better integrating the modality transformation and registration processes to improve overall system efficiency and accuracy, addressing various noises and interferences in real-world applications, and designing more efficient training and inference algorithms. Addressing these challenges will provide new insights and directions for the future development of multi-modal image processing and UAV applications.
In summary, while this study has proposed and validated new methods, it has also highlighted new directions and challenges for future research. We believe that with continuous technological advancements and further in-depth studies, the field of multi-modal UAV image processing will witness significant progress.

6. Conclusions

This paper proposed a novel image registration algorithm for cross-modality small UAVs that effectively performs image registration for infrared and visible images of small UAV targets, completing crucial preparatory tasks for image fusion and target detection. The complete registration network comprises two components: the SPSTN module and the MCARN module. The SPSTN module introduces a structural loss function to effectively maintain and differentiate the structures of small UAVs and backgrounds. The MCARN module is a multi-level registration network that integrates a cross-attention mechanism. It is a coarse-to-fine registration network that predicts matching costs and resamples the images to produce registered images. This paper validated the effectiveness and versatility of the proposed method by comparing it with various modality transformation algorithms and image registration algorithms. Ablation experiments illustrated that the cross-attention mechanism significantly improves registration accuracy.

Author Contributions

Conceptualization, W.J. and Y.W.; Data curation, W.J. and H.P.; Investigation, W.J. and Y.L. (Yang Li); Methodology, W.J.; Resources, Y.L. (Yun Lin) and F.B.; Supervision, Y.W.; Writing—original draft, W.J.; Writing—review and editing, W.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Acknowledgments

This work was supported in part by the Natural Science Foundation of China (Key Program) under Grant 62131001 and (General Program) under Grant 62371005. It is also supported by the Beijing Natural Science Foundation under Grant 4234082 and the Yuxiu Innovation Project of NCUT (Project No. 2024NCUTYXCX119).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zitová, B.; Flusser, J. Image registration methods: A survey. Image Vis. Comput. 2003, 21, 977–1000.
  2. Li, N.; Li, Y.; Jiao, J. Multimodal remote sensing image registration based on adaptive multi-scale PIIFD. Multimed. Tools Appl. 2024, 1–13.
  3. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
  4. Li, J.; Hu, Q.; Ai, M. RIFT: Multi-modal image matching based on radiation-invariant feature transform. IEEE Trans. Image Process. 2020, 29, 3296–3310.
  5. Xiang, Y.; Wang, F.; You, H. OS-SIFT: A Robust SIFT-like algorithm for high-resolution optical-to-SAR Image registration in suburban areas. IEEE Trans. Geosci. Remote Sens. 2018, 56, 3078–3090.
  6. Cui, S.; Ma, A.; Wan, Y.; Zhong, Y.; Luo, B.; Xu, M. Cross-Modality Image Matching Network with Modality-Invariant Feature Representation for Airborne-Ground Thermal Infrared and Visible Datasets. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–14.
  7. Deng, Y.; Ma, J. ReDFeat: Recoupling Detection and Description for Cross-Modal Feature Learning. IEEE Trans. Image Process. 2023, 32, 591–602.
  8. Tang, H.; Yuan, C.; Li, Z.; Tang, J. Learning attention-guided pyramidal features for few-shot fine-grained recognition. Pattern Recognit. 2022, 130, 108792.
  9. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232.
  10. Luo, Y.; Cha, H.; Zuo, L.; Cheng, P.; Zhao, Q. General cross-modality registration framework for visible and infrared UAV target image registration. Sci. Rep. 2023, 13, 12941.
  11. Xu, H.; Yuan, J.; Ma, J. MURF: Mutually Reinforcing Multi-modal Image Registration and Fusion. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12148–12166.
  12. Wang, D.; Liu, J.; Fan, X.; Liu, R. Unsupervised Misaligned Infrared and Visible Image Fusion via Cross-Modality Image Generation and Registration. arXiv 2022, arXiv:2205.11876.
  13. Haskins, G.; Kruger, U.; Yan, P. Deep learning in medical image registration: A survey. Mach. Vis. Appl. 2020, 31, 8.
  14. Balakrishnan, G.; Zhao, A.; Sabuncu, M.R.; Guttag, J.; Dalca, A.V. An unsupervised learning model for deformable medical image registration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 9252–9260.
  15. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680.
  16. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134.
  17. Wolterink, J.M.; Dinkla, A.M.; Savenije, M.H.; Seevinck, P.R.; van den Berg, C.A.; Išgum, I. Deep MR to CT synthesis using unpaired data. In Simulation and Synthesis in Medical Imaging: Second International Workshop, SASHIMI 2017, Held in Conjunction with MICCAI 2017, Québec City, QC, Canada, 10 September 2017; Proceedings 2; Springer International Publishing: New York, NY, USA, 2017; pp. 14–23.
  18. Hu, Y.; Modat, M.; Gibson, E.; Li, W.; Ghavami, N.; Bonmati, E.; Wang, G.; Bandula, S.; Moore, C.M.; Emberton, M.; et al. Weakly-supervised convolutional neural networks for cross-modal image registration. Med. Image Anal. 2018, 49, 1–13.
  19. Studholme, C.; Hill, D.; Hawkes, D. An overlap invariant entropy measure of 3D medical image alignment. Pattern Recognit. 1999, 32, 71–86.
  20. Maes, F.; Collignon, A.; Vandermeulen, D.; Marchal, G.; Suetens, P. Cross-modality image registration by maximization of mutual information. IEEE Trans. Med. Imaging 1997, 16, 187–198.
  21. Mattes, D.; Haynor, D.R.; Vesselle, H.; Lewellen, T.K.; Eubank, W. PET-CT image registration in the chest using free-form deformations. IEEE Trans. Med. Imaging 2003, 22, 120–128.
  22. Wells, W.M.; Viola, P.; Atsumi, H.; Nakajima, S.; Kikinis, R. Multi-modal volume registration by maximization of mutual information. Med. Image Anal. 1996, 1, 35–51.
  23. Myronenko, A.; Song, X. Intensity-based image registration by minimizing residual complexity. IEEE Trans. Med. Imaging 2010, 29, 1882–1891.
  24. Rueckert, D.; Sonoda, L.I.; Hayes, C.; Hill, D.L.G.; Leach, M.O.; Hawkes, D.J. Nonrigid registration using free-form deformations: Application to breast MR images. IEEE Trans. Med. Imaging 1999, 18, 712–721.
  25. Pluim, J.P.W.; Maintz, J.B.A.; Viergever, M.A. Mutual-information-based registration of medical images: A survey. IEEE Trans. Med. Imaging 2003, 22, 986–1004.
  26. Reddy, B.S.; Chatterji, B.N. An FFT-based technique for translation, rotation, and scale-invariant image registration. IEEE Trans. Image Process. 1996, 5, 1266–1271.
  27. Wei, Z.; Jung, C.; Su, C. RegiNet: Gradient guided multispectral image registration using convolutional neural networks. Neurocomputing 2020, 415, 193–200.
  28. Arar, M.; Ginger, Y.; Danon, D.; Bermano, A.H.; Cohen-Or, D. Unsupervised multi-modal image registration via geometry preserving image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 13410–13419.
  29. Katharopoulos, A.; Vyas, A.; Pappas, N.; Fleuret, F. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Proceedings of the International Conference on Machine Learning, Virtual Event, 13–18 July 2020.
Figure 1. Overview of small UAVs flying in various backgrounds.
Figure 2. Comparison of cross-modality images and registration and unregistration images.
Figure 3. The workflow for generating pseudo infrared images by CycleGAN.
Figure 4. The workflow of the cross-modality image registration network for small UAV targets.
Figure 5. The algorithm framework of a resnet block.
Figure 6. Demonstration of the cross-attention mechanism.
Figure 7. Anti-UAV datasets.
Figure 8. Visualization results of the comparison in various scenarios.
Figure 9. Visualization results of the comparison between registered and unregistered image pairs.
Figure 10. Visualization results of the comparison between SPSTN and STN.
Figure 11. Visualization results of the comparison between MCARN and MRN.
Table 1. Comparison of metrics for different image generation models.

Method   | SSIM↑  | MI↑    | MSE↓
CycleGAN | 0.5688 | 1.3319 | 94.0021
CPSTN    | 0.5718 | 1.3573 | 93.8813
SPSTN    | 0.5747 | 1.3919 | 93.6272
Table 2. Comparison of cross-modality registration algorithms.

Method           | NCC↑   | SSIM↑  | NMI↑   | MSE↓
Misaligned Input | 0.7033 | 0.5747 | 0.1538 | 93.6272
NEMAR            | 0.7183 | 0.4808 | 0.1599 | 93.5440
GCMR             | 0.7175 | 0.5144 | 0.1927 | 92.9808
UMF-CMGR         | 0.7079 | 0.5932 | 0.1894 | 92.7404
VoxelMorph       | 0.6656 | 0.4691 | 0.1769 | 96.6120
Ours             | 0.7256 | 0.5974 | 0.1967 | 91.8245
Table 3. Quantitative evaluation of the anti-UAV test.

Method | SSIM↑  | MI↑    | MSE↓
STN    | 0.5750 | 1.3779 | 95.1424
SPSTN  | 0.5911 | 1.4023 | 93.5502
Table 4. Quantitative evaluation of the efficiency of the cross-attention mechanism in our method.

Method           | NCC↑   | SSIM↑  | NMI↑   | MSE↓
Misaligned Input | 0.7033 | 0.5747 | 0.1538 | 93.6272
MRN              | 0.7128 | 0.5932 | 0.1755 | 92.9808
MCARN            | 0.7256 | 0.5974 | 0.1967 | 91.8245