1. Introduction
Remote sensing (RS) is a technology that automatically collects information about the Earth's surface from a distance using sensors mounted on satellites, aircraft, or drones. It is widely used in environmental monitoring, urban planning, agricultural management, military reconnaissance, disaster response, and geological exploration. The technology offers several advantages, including broad coverage, strong real-time data acquisition capabilities, and high cost-effectiveness.
In research on RS image target detection, detection methods for rotating targets are critical [1,2,3]. Traditional detection methods often rely on horizontal anchor boxes, as in the SSD [4], YOLO [5,6,7], and Faster-RCNN [8] detection networks. When dealing with rotating targets, this causes misalignment between the target and the proposal region, leading to a decrease in detection accuracy. To address this issue, researchers have proposed various methods involving rotating anchor boxes to improve the performance of RS detection networks. One approach is to use rotating anchors instead of traditional horizontal anchors to match the orientation of the target. For example, Rotated RPN [9] improves detection accuracy by pre-setting anchors at different angles to cover the various directions in which targets may appear. Furthermore, certain techniques mitigate the impact of horizontal proposal regions by directly regressing the coordinates of the detection box's four vertices; an example is the Gliding Vertex [10] method, which can accommodate targets of arbitrary orientation. Although rotating anchors and vertex regression have improved the detection performance for rotating targets, they still face issues with computational complexity. To enhance detection efficiency, several studies investigate reducing computational complexity without compromising, or even while improving, detection accuracy. This includes developing new loss functions to train better networks, as well as designing more efficient network structures to handle rotating targets.
In the realm of RS detection, many researchers focus on enhancing the network's ability to capture the directional characteristics of targets during the feature extraction phase. To transcend the constraints of conventional convolutional neural networks, which struggle to recognize orientation features when dealing with rotating targets, approaches involving rotating convolution kernels and rotating anchor boxes have been introduced. These techniques adjust the orientation or shape of the convolution kernels to match the arbitrary rotation angles of targets and thereby enhance detection accuracy. For instance, the R2CNN [11] method identifies rotating targets by detecting the first two corners in a clockwise sequence along with the rectangle's height. CenterNet [12], on the other hand, locates the center of the target through keypoint detection and then expands outward from this center to determine the target's bounding box. These approaches not only refine the convolution operation and the regression of anchor boxes for more comprehensive target recognition but also drive the evolution of network architectures, including the adoption of direction-sensitive convolution kernels and rotating pooling layers, which significantly boost the model's adaptability to rotational changes and its recognition capabilities. However, current technologies still face challenges in the design of feature extractors. In particular, an appropriate balance must be struck between the number of rotation angles and the resulting increase in computational cost. In addition, the loss function can fluctuate sharply when regressing rotating anchor boxes with very high aspect ratios.
In conclusion, existing methods for rotating convolutions and bounding box regression have not adequately tackled the problem of aligning the feature and sample levels of detection networks with the angular spatial traits of targets. More precisely, at the feature level, it is difficult for the angles of rotating convolution kernels to align with the angles of targets so as to accommodate their angular variations. As depicted in Figure 1a, approaches that pre-set fixed rotation angles have difficulty in balancing computational cost against the number of pre-set angles: an excessive number of pre-set angles escalates the computational burden, while too few can result in inaccurate extraction of directional features. In contrast, our proposed method adopts a data-driven strategy to dynamically acquire the angles, generating a coarse representation of the target's direction within the image. This angle then guides the rotation of the convolution kernels to precisely capture the target's directional features.
At the sample level, anchor boxes must be aligned with the orientation of the target while avoiding several problems that arise during training: inconsistent directionality in angle regression, the cyclical nature of angle changes (from 360 degrees back to 0 degrees), and the substantial variation involved in regressing high-aspect-ratio (slender) anchor boxes from their pre-set positions to the target positions. These problems can give rise to drastic changes in the loss function and cause unstable training.
Figure 1b highlights the challenges of current methods that regress anchor boxes using angles. When angle regression is employed for anchor boxes, two scenarios may occur: First, a minor clockwise rotation of the red box could align it with the target box (green box), as indicated by the rotation of the yellow box. Second, a significant counterclockwise rotation of the red box could align it with the target box (green box), as shown by the rotation of the blue box. Encountering the second scenario can lead to a problem where the large rotation angle causes drastic changes in the loss function. Our proposed method, however, abandons the use of angles for the direction regression of anchor boxes, choosing instead to directly use two offset values to regress the center points of each side of the original anchor box. This approach avoids the difficulties in regression due to inconsistent rotation directions and the challenges related to rotating high-aspect-ratio anchor boxes.
Through the analysis, it is clear that no matter how we design methods for rotating convolutions and anchor box regression, we cannot avoid the challenge of aligning feature and sample information with the spatial orientation of targets in RS detection tasks. Compared to existing rotating detection networks, it is particularly important to develop a dynamic, target-angle-sensitive feature extractor and an anchor box regression method that is not sensitive to angles. To address this, we introduce the Dynamic Direction Learning R-CNN (DDL R-CNN), which is composed of two main parts: the dynamic direction learning (DDL) module and the Boundary Center Region Offset Proposal Network (BC-ROPN). First, the DDL module is integrated into the network’s backbone to extract directional features of targets from initial feature maps and convert them into angles. Then, the rotating convolutional kernels adjust their rotation angles based on the output of this module, enabling them to accurately capture the target’s directional information. Following this, BC-ROPN uses the boundary center features of the oriented bounding rectangle, along with the rotation angles and angle weights obtained from the DDL module, to generate boundary center offsets. These offsets direct the regression of the rotating anchor boxes within the confines of the oriented bounding rectangle. This approach results in a smoother convergence of the loss function and yields more nuanced and precise outcomes, effectively circumventing the issue of abrupt fluctuations in the loss function values. The novel contributions we present are outlined below.
- (1) We carried out a systematic analysis of the difficulty that existing RS target detection networks have in aligning feature-level and sample-level information with the spatial orientation of targets.
- (2) We introduced a rotating feature extractor that uses early extraction of target directional features to pre-set the rotation direction of dynamic directional convolution kernels. By assigning an angle weight to each pre-set angle based on these features, we enhance the effectiveness of the convolution kernels during rotation. Compared with fixed pre-set rotation methods, this reduces redundant extraction of directional features, conserves resources, and improves the efficacy of directional feature extraction.
- (3) We proposed an anchor box convergence approach that uses the early extracted directional feature information to generate boundary center offsets for oriented bounding rectangle anchor boxes. This offset regression method mitigates the drastic changes in the loss function caused by angle regression and ensures that anchor boxes converge more tightly and accurately.
3. Method
In this section, we first introduce the design and functionality of the routing function (Section 3.1). Figure 2b illustrates its overall structure, in which the input image features are processed by the function to generate a set of two-dimensional data: one dimension pertains to the angles of targets within the image, and the other corresponds to the angle weights associated with those angles. We then discuss the structure and implementation details of the DDL module (Section 3.2). Figure 2a shows the overall architecture of this module, which uses the routing function to generate this set of two-dimensional data from the input feature maps and combines it with a set of original convolution kernel parameters to ultimately produce a new set of feature maps. Finally, Section 3.3 explains the overall network structure of the proposed DDL R-CNN and details the implementation of BC-ROPN. Figure 3 presents the overall architecture of the network, which is a two-stage detector using an FPN for feature extraction and processing of images. Before anchor box regression, it incorporates the anchor box boundary center offsets extracted from the feature maps generated by the DDL module. This enables anchor box regression based on the boundary center points of the target's outer bounding rectangle and ultimately outputs the detection results.
3.1. Routing Function
The conventional convolutional approach, which employs consistent parameters and sliding window directions for feature extraction in different RS images, struggles to effectively extract the spatial orientation features of targets that are rotated. The prevalent method to tackle this challenge involves rotating a convolutional kernel through a set of predefined angles to extract the multi-directional spatial features of the targets. However, this approach, which is dependent on pre-set angles, restricts the model’s capacity to adapt to targets at any arbitrary angle. This limitation can lead to less accurate and incomplete extraction of spatial orientation features, ultimately diminishing the model’s ability to generalize across different scenarios.
To tackle this challenge, we introduce the routing function, which carries out detailed feature extraction on the input feature maps. By integrating spatial information and smoothly processing angle data, it generates a set of predicted rotation angles that offer a preliminary estimate of the target's spatial orientation, enhancing the robustness and multi-directional capability of spatial feature extraction. Like the process for generating the rotation angles, creating the combination weights also involves detailed feature extraction and spatial information integration. However, these spatial features must be mapped non-linearly to produce suitable combination weights that fall within the range of 0 to 1. These weights are subsequently employed to combine the output feature maps derived from convolutional kernels at diverse rotation angles. The weights indicate the contribution of each kernel to the target's overall spatial feature representation, enabling the model to adaptively emphasize features in certain directions while suppressing others that might cause interference. This approach allows the model to synthesize information from different directions, resulting in a comprehensive feature representation that captures the spatial characteristics of the target more effectively. The general architecture of our proposed routing function is shown in Figure 2b.
Specifically, the input feature map is first subjected to a lightweight depthwise convolution, followed by a linear mapping [2] and ReLU activation. This process begins with the refined extraction of features:
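Equation (1) is not written out explicitly here; a plausible reconstruction from the definitions given immediately below (with $X$ denoting the input feature map and $\widetilde{X}$ the activated output, our own symbols) is:

$$\widetilde{X} = \mathrm{ReLU}\big(\mathrm{LN}(X * K + b)\big) \qquad (1)$$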
For Equation (1), the asterisk (*) denotes the depthwise convolution operation, K is the depthwise convolution kernel, LN stands for layer normalization, b is the bias term of the convolution layer, and ReLU is the activation function. The convolution effectively captures local spatial characteristics from the input feature map; layer normalization improves the convergence rate and stability of the network during training; and the ReLU activation enhances the model's ability to capture non-linearities, allowing the network to detect more complex patterns. The formula yields the activated features, which are then pooled into a feature vector using average pooling. This averaged feature vector is passed on to two different branches.
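A plausible form of Equation (2), matching the pooling step just described (with $v$ denoting the pooled feature vector and $\widetilde{X}$ the activated feature map, our own symbols), is:

$$v = \mathrm{AvgPool}\big(\widetilde{X}\big) \qquad (2)$$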
For Equation (2), the input is the activated feature map produced by the previous step. The average pooling operation is a key step in the feature extraction process: it compresses the data, integrates global information, and provides an input of the appropriate size for subsequent network layers. The first branch is dedicated to predicting rotation angles and comprises a linear layer coupled with a softsign activation function. To avoid biased angle estimates, the bias term of this linear layer is intentionally set to 0.
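A plausible form of Equation (3), matching the description of the angle prediction branch (with $W_{\theta}$ the branch's weight matrix and $\theta$ the predicted rotation angles, our own symbols), is:

$$\theta = \mathrm{softsign}\big(W_{\theta}\, v\big) \qquad (3)$$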
For Equation (3), the linear layer's weight matrix maps the pooled feature vector v to the predicted rotation angles. The softsign activation provides a non-linear transformation, allowing the branch to capture complex patterns and features. This design enables the angle prediction branch to respond coarsely to directional changes in the input feature vector, enhancing the convolutional kernel's sensitivity to target orientation changes, thereby improving the accuracy of spatial directional feature extraction by the rotating convolutional kernel and its generalization to multi-directional targets. The second branch, the combination weight prediction branch, is responsible for predicting the combination weights. It employs a linear layer with a bias term followed by a sigmoid activation function.
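A plausible form of Equation (4), matching the description of the combination weight prediction branch (with $W_{\lambda}$ and $b_{\lambda}$ the branch's weight matrix and bias, and $\lambda$ the predicted combination weights, our own symbols), is:

$$\lambda = \mathrm{sigmoid}\big(W_{\lambda}\, v + b_{\lambda}\big) \qquad (4)$$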
For Equation (4), the linear layer's weight matrix and bias act on the pooled feature vector v to produce the predicted combination weights. The sigmoid activation compresses the output into the interval (0, 1), which is suitable for representing weights. The combination weight prediction branch thus learns how to dynamically weight the feature responses from different directions based on the input features. The routing function's parameters are initialized from a truncated normal distribution with a mean of 0 and a standard deviation of 0.2, which ensures that the module generates small values at the beginning of training.
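To make the two-branch structure concrete, the following is a minimal PyTorch sketch of a routing function of this kind. The class name, channel count, kernel size, and the dimensionality of the angle/weight outputs (one per kernel) are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn


class RoutingFunction(nn.Module):
    """Minimal sketch of the routing function in Section 3.1 (hypothetical naming)."""

    def __init__(self, channels: int, n_kernels: int, kernel_size: int = 3):
        super().__init__()
        # Lightweight depthwise convolution for refined local feature extraction.
        self.dwconv = nn.Conv2d(channels, channels, kernel_size,
                                padding=kernel_size // 2, groups=channels)
        self.norm = nn.LayerNorm(channels)      # layer normalization over channels
        self.act = nn.ReLU(inplace=True)
        self.pool = nn.AdaptiveAvgPool2d(1)     # global average pooling -> vector v
        # Angle branch: linear layer without bias + softsign (cf. Equation (3)).
        self.fc_theta = nn.Linear(channels, n_kernels, bias=False)
        # Combination-weight branch: linear layer with bias + sigmoid (cf. Equation (4)).
        self.fc_lambda = nn.Linear(channels, n_kernels, bias=True)
        # Truncated normal initialization with mean 0 and std 0.2, as stated above.
        nn.init.trunc_normal_(self.fc_theta.weight, std=0.2)
        nn.init.trunc_normal_(self.fc_lambda.weight, std=0.2)

    def forward(self, x: torch.Tensor):
        f = self.dwconv(x)
        f = self.norm(f.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        f = self.act(f)
        v = self.pool(f).flatten(1)                         # pooled feature vector
        theta = nn.functional.softsign(self.fc_theta(v))    # rotation angles in (-1, 1)
        lam = torch.sigmoid(self.fc_lambda(v))              # combination weights in (0, 1)
        return theta, lam
```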
3.2. Dynamic Directional Learning (DDL) Module
Previous detectors typically rotate convolutional kernels by fixed angles based on prior knowledge, which makes it difficult to adjust dynamically to targets with diverse orientations. To enable the detector to automatically perceive arbitrary changes in target direction, a DDL module is proposed. This module has n kernels, each with the same shape. It uses the rotation angles and combination weights generated by the routing function to guide the rotation of the convolutional kernels to different angles, thereby extracting more refined spatial features of the target. Finally, the spatial features extracted by these differently rotated convolutional kernels are fused through weighting. Referencing Figure 2, the detailed operations are delineated below:
Firstly, the set of predicted rotation angles obtained from the routing function is used to rotate each of the n kernels individually.
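A plausible form of Equation (5), following the description below (with $K_i$ the i-th original kernel, $\theta_i$ its predicted rotation angle, and $\widetilde{K}_i$ the rotated kernel, our own symbols), is:

$$\widetilde{K}_i = \mathrm{Rotate}\big(K_i, \theta_i\big), \quad i = 1, \dots, n \qquad (5)$$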
For Equation (5), each original kernel is rotated by its corresponding predicted rotation angle, producing the rotated kernel. The Rotate() function rotates a kernel by the predicted angle; it defines the counterclockwise direction as positive and uses bilinear interpolation to fill in the sampling locations left empty by the rotation.
The output is generally computed by convolving each rotated kernel with the input feature map and then summing the resulting feature maps element by element with their combination weights:
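A plausible form of Equation (6), following the description below (with $\lambda_i$ the combination weights, $x$ the input feature map, and $y$ the combined output, our own symbols), is:

$$y = \sum_{i=1}^{n} \lambda_i \big(\widetilde{K}_i * x\big) \qquad (6)$$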
In Equation (6), the combination weights predicted by the routing function scale each kernel's contribution, * denotes the convolution operation, and y is the combined output feature map. However, this approach requires an individual convolution operation for each kernel, followed by an addition step, potentially reducing computational efficiency. Consequently, we reformulate Equation (6) as follows:
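A plausible form of the reformulated Equation (7), exploiting the linearity of convolution in the kernel (same symbols as above), is:

$$y = \Big(\sum_{i=1}^{n} \lambda_i \widetilde{K}_i\Big) * x \qquad (7)$$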
This implies that convolving the input features with each kernel separately and summing the weighted outputs (Equation (6)) is equivalent to first multiplying the kernels by their respective combination weights and summing them (as shown in Figure 2a), and then performing a single convolution (Equation (7)). In Equation (7), the convolution is computed only once, whereas Equation (6) requires one convolution per kernel. This strategy significantly reduces computational cost while preserving the model's feature expression capability.
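As a quick sanity check of this equivalence, the hypothetical PyTorch snippet below compares the two computation orders on random tensors (the shapes and the number of kernels are arbitrary); because convolution is linear in the kernel, both orders produce the same output up to floating-point error.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 8, 32, 32)           # input feature map
kernels = torch.randn(4, 16, 8, 3, 3)   # n = 4 "rotated" kernels, each (out, in, k, k)
lam = torch.rand(4)                     # combination weights from the routing function

# Equation (6): one convolution per kernel, then a weighted element-wise sum.
y_separate = sum(lam[i] * F.conv2d(x, kernels[i], padding=1) for i in range(4))

# Equation (7): combine the weighted kernels first, then a single convolution.
combined = (lam.view(4, 1, 1, 1, 1) * kernels).sum(dim=0)
y_combined = F.conv2d(x, combined, padding=1)

print(torch.allclose(y_separate, y_combined, atol=1e-4))   # True
```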
3.3. Dynamic Direction Learning R-CNN (DDL R-CNN)
The goal of DDL R-CNN is to leverage a lightweight, fully convolutional network to merge features from different levels and produce a set of sparse, oriented proposal regions. The original mechanism for rotating objects had ambiguous choices for angle regression, leading to substantial losses when large angles were selected. Additionally, with high-aspect-ratio targets, even small predicted angle changes could cause dramatic shifts in the loss function, resulting in unstable training. To address these challenges, the innovative rotation mechanism leverages the existing oriented proposal regions and the composite feature maps from the DDL module to enable a compact and rapid regression of rotated bounding boxes. This strategy not only significantly reduces the volume of parameters but also ensures that the loss function values remain within a reasonable range.
The specific network structure is shown in Figure 3. The model takes the five levels of features produced by the FPN and attaches an identical head to the feature map at each level. This head consists of a convolutional layer followed by two parallel convolutional layers. Across all feature levels, we place three horizontal anchors at each spatial location, with aspect ratios of 1:2, 1:1, and 2:1. The anchors on the five feature levels are assigned level-specific pixel areas. The position of each anchor a is specified by a four-dimensional vector giving the anchor's center coordinates, width, and height. One of the two parallel convolutional layers serves as the regression branch: it integrates the directional feature maps produced by the DDL module, merges these features, and outputs offsets for the proposal regions relative to the anchors. For every location on the feature maps, we produce A proposal regions, where A equals the number of anchors per location (set to 3 in this paper), so the regression branch outputs 6A values. The directional proposal regions are derived by decoding these regression results. The decoding procedure is detailed below:
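The exact form of Equation (8) depends on the paper's decoding convention; a plausible sketch, assuming the common midpoint-offset style of decoding (with $(x_a, y_a, w_a, h_a)$ the anchor parameters and $(\delta_x, \delta_y, \delta_w, \delta_h, \delta_\alpha, \delta_\beta)$ the six regression outputs per anchor; all symbols are our own), is:

$$x = x_a + \delta_x w_a,\quad y = y_a + \delta_y h_a,\quad w = w_a e^{\delta_w},\quad h = h_a e^{\delta_h},\quad \Delta\alpha = \delta_\alpha \cdot w,\quad \Delta\beta = \delta_\beta \cdot h \qquad (8)$$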
The center coordinates of the predicted candidate region, together with the width w and height h of its outer bounding box, make up four of the decoded parameters; the remaining two are the center-point offsets of the top (bottom) edge and the right (left) edge of the outer bounding box.
Finally, based on the characteristics of these parameters, we designed a new oriented proposal region representation scheme, known as the Boundary Center Offset Representation, which constitutes the BC-ROPN part shown in Figure 3. The process is illustrated in the right (tail) diagram of Figure 1b. The oriented bounding box O serves as the outer rectangle, the midpoint of its bounding box is indicated by the yellow dot, and the vertices of the oriented bounding box O are represented by the black points. Specifically, we use the six parameters calculated through Equation (8) to represent the oriented bounding box. With these six parameters, we can assign four corrected (offset) vertex coordinates to each proposal region: the first vertex is offset relative to the midpoint of the top edge and, by symmetry, the opposite vertex is offset relative to the midpoint of the bottom edge; likewise, the remaining pair of vertices are offset relative to the midpoints of the right and left edges, respectively. Therefore, the coordinates of the four vertices are expressed as follows:
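The vertex equations follow from the offsets just described; assuming the outer rectangle parameters $(x, y, w, h)$ and midpoint offsets $\Delta\alpha$, $\Delta\beta$ from the sketch of Equation (8) above (our own notation), the four vertices take the form:

$$v_1 = \big(x + \Delta\alpha,\; y - \tfrac{h}{2}\big),\quad v_2 = \big(x + \tfrac{w}{2},\; y + \Delta\beta\big),\quad v_3 = \big(x - \Delta\alpha,\; y + \tfrac{h}{2}\big),\quad v_4 = \big(x - \tfrac{w}{2},\; y - \Delta\beta\big)$$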
This representation enables the regression of each directional proposal region by predicting the parameters of the external bounding box together with the parameters from which the midpoint offsets are derived.
4. Experiment
In this section, we conduct an empirical assessment of the proposed network integrated with a DDL module across various detection frameworks. We begin in Section 4.1 with a thorough description of the datasets utilized in our experiments. Section 4.2 details the experimental setup and parameter configurations. The evaluation metrics to be employed in the experiments are explained in Section 4.3. Section 4.4 includes ablation studies and visualization of results that further substantiate the efficacy of our proposed method. Finally, Section 4.5 presents the key findings of our approach alongside comparisons with other methods for rotating target detection on two widely used datasets.
4.1. Dataset
For a thorough evaluation of the capabilities of our proposed method in detecting oriented targets, it was subjected to testing on two highly regarded standard datasets: UCAS-AOD [22] and HRSC2016 [23].
The UCAS-AOD dataset, issued by the University of Chinese Academy of Sciences, is dedicated to the detection of aircraft and vehicles in RS imagery. It consists of 1000 images of aircraft and 510 images of vehicles, with a total of 7482 aircraft and 7114 vehicles annotated. The images are derived from Google Earth satellite imagery, ensuring the authenticity and diversity of the scenes. All images have a uniform size and are stored in .png format, with the entire dataset occupying approximately 3.48 GB. The dataset mainly focuses on medium- and small-sized targets in RS images, such as aircraft and vehicles. The targets in these images are typically multi-directional and densely packed, which presents challenges for detection algorithms and also makes it an ideal dataset for assessing the performance of oriented target detection algorithms. In our experiments, since UCAS-AOD emphasizes improving the detection accuracy of multi-directional targets, we preprocessed the dataset's annotation files into a rotated bounding box format. The dataset was divided as follows: 60% training set, 20% validation set, and 20% test set.
The HRSC2016 dataset is a collection focused on ship detection in high-resolution RS imagery. It comprises 1061 images of varying sizes, with resolutions ranging from 300 × 300 to 1500 × 900 pixels, and encompasses ship instances of various sizes and orientations. The detection task is challenging: ships may appear slender and densely packed side by side, and there are numerous ship-like distractors such as waves, rectangular warehouses, and slender piers. With its image and instance counts, the dataset offers ample training and testing material, making it a vital resource for evaluating and enhancing the performance of ship detection algorithms. In this experiment, the dataset was divided into 60% as the training set, 20% as the validation set, and 20% as the test set.
4.2. Parameter Setting and Tuning Methods
We trained our model on an RTX 2080Ti with a batch size of 4. The experimental results were obtained with the mmdetection [24] toolkit. Both ResNet50 [25] and ResNet101 [25] served as our backbone networks, pre-trained on ImageNet [26]. Data augmentation included horizontal and vertical flipping. We used the SGD optimizer, setting the learning rate to 0.01, momentum to 0.9, and weight decay to 0.0001. Training was conducted for 36 epochs with R-CNN [27], with the learning rate gradually warmed up to its peak of 0.01 during the initial 1–4 epochs and then gradually reduced until the end of training. The Non-Maximum Suppression (NMS) threshold was set to 0.8. For the UCAS-AOD dataset, we resized the original images to a fixed size. For the HRSC2016 dataset, we resized the original images while approximately preserving their aspect ratio, with the shorter side resized to 800 pixels and the longer side restricted to no more than 1333 pixels.
In this experiment, to optimize the performance of our proposed oriented object detection method and ensure the reproducibility of the results, we employed a variety of comprehensive tuning strategies. Initially, we utilized the mmsplict tool from the mmdetection package to perform data augmentation, which included techniques such as random horizontal flipping, vertical flipping, random rotation, scaling, color jittering, and random cropping. These methods aimed to increase the diversity of the training data, enhance the model’s adaptability to different scenarios and object orientations, and reduce overfitting to the training set. In terms of hyperparameter tuning, we meticulously documented the basis and range of initial hyperparameter choices and employed methods like grid search and random search for optimization, ensuring the identification of the optimal hyperparameter configuration to boost model performance. Additionally, we analyzed the specific impact of each hyperparameter on model performance and ultimately determined the best parameter settings. To further mitigate the risk of overfitting, we introduced Dropout layers to enhance the model’s robustness, implemented L2 weight regularization to limit model complexity, used early stopping to monitor validation set performance and prevent overfitting, and applied batch normalization to reduce internal covariate shift, accelerate the training process, and improve the model’s generalization capabilities. These strategies ensured the model’s accuracy and stability when dealing with complex backgrounds and multi-directional targets. Through these integrated tuning methods, the experiment achieved high detection accuracy while maintaining consistency across different settings, laying a solid foundation for future research and applications.
4.3. Experimental Evaluation Index
The mean average precision (mAP) value was used in this experiment to assess the accuracy of the model. We chose mAP as the core evaluation metric for this object detection experiment because it takes into account both the classification and localization performance of the model. Specifically, for each category, we first calculated its recall (R) and precision (P):
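In their standard form, consistent with the definitions that follow, precision and recall are computed as:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$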
True negative, false negative, true positive, and false positive are denoted as TN, FN, TP, and FP, respectively. Precision (P) is the proportion of correctly identified targets among all detected targets. Recall (R) is the proportion of ground-truth instances that the model is able to identify. While higher values of both precision (P) and recall (R) are desirable, there is an inherent trade-off between them; enhancing precision may reduce recall, and vice versa. Thus, achieving a balance between precision and recall is essential. By examining the relationship between precision and recall at various classification thresholds and plotting their corresponding points, a precision–recall (PR) curve can be generated to assess the model's classification performance. This curve represents the equilibrium between the classifier's precision in identifying positive instances and its comprehensiveness in capturing positive instances. The area under the curve (the AP value) indicates the model's discriminative power for a given class, with a higher AP value signifying superior performance. The AP value is calculated by integrating the area under the curve, specifically using the following equation:
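In its standard form, the AP value is the area under the PR curve:

$$AP = \int_{0}^{1} P(R)\, dR$$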
The mAP is calculated by constructing precision–recall (PR) curves for each category, determining the area under these curves for different categories, and then averaging all the average precision (AP) values to assess the performance of a multi-class classification model. The specific definition is as follows:
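Its standard form, consistent with the description above, is:

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$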
where N is the number of classes in the dataset, and AP_i is the AP value for class i.
Additionally, this experiment takes into account the number of floating-point operations (FLOPs) and the number of model parameters (Params) as metrics for evaluation. FLOPs are employed to gauge the computational complexity of the model, specifically referring to the total count of FLOPs executed during the forward pass of the network. A lower FLOP count indicates higher computational efficiency for the model. Meanwhile, Params is used to measure the size of the storage space the network requires.
4.4. Ablation Experiment
To further comprehend the contributions of each component within our proposed method for detecting oriented targets, we carried out ablation experiments on the UCAS-AOD dataset. The aim of these experiments was to assess the effects of the DDL module and BC-ROPN on overall performance.
4.4.1. Evaluation of Different Approaches
Table 1 presents the results from the UCAS-AOD dataset. The baseline model achieved an mAP of 89.62% without any additional modules. This is because the conventional R-CNN [27] convolutional structure only performs fixed multi-angle rotations on the convolutional kernels, which struggle to effectively model targets with diverse orientations across different categories. When the DDL module was introduced alone, the performance increased to 89.96%. This indicates that extracting rough directional features from the targets in the images and generating various rotation angles and weights for the convolutional kernels provide more accurate spatial directional information for subsequent feature extraction. Furthermore, when BC-ROPN was integrated into the model alone, the mAP reached 90.14%. This suggests that the rotating anchor box mechanism used by the network allows for clearer and more efficient rotation directions, avoiding the blurring of extracted directional features, which could lead to excessively high loss function values. Lastly, when both the DDL module and BC-ROPN were used in conjunction, the model's mAP reached 90.82%, which was the best performance among all experimental configurations. This result confirms the synergistic effect of the two modules, and their combined action helps enhance the model's ability to extract spatial directional features of rotating targets.
Additionally, Table 1 reports the computational cost (FLOPs) and the number of parameters (Params). The baseline model has 41.14 M parameters and 211.43 G FLOPs. When the DDL module is introduced, the Params rise to 65.87 M and the FLOPs increase to 211.85 G compared with the baseline model. This indicates that the DDL module, which extracts spatial directional features in advance and generates rotation angles and angle weights based on these features, does increase the model's Params to some extent, but it does not significantly raise the computational complexity. When BC-ROPN is introduced alone, the Params decrease to 40.74 M and the FLOPs drop to 211.41 G. This is because BC-ROPN employs a lightweight fully convolutional network to integrate different features and uses a new rotation box regression mechanism instead of the original angle regression, resulting in a reduction in Params and a slight decrease in FLOPs compared with the baseline model. After introducing both the DDL module and BC-ROPN, the Params and FLOPs increase to 74.38 M and 211.97 G, respectively. However, this variation is within an acceptable range, again indicating the synergistic effect of the DDL module and BC-ROPN, where the spatial directional features extracted in advance by the DDL module are applied to feature fusion in BC-ROPN and ultimately used for weighted fine-tuning of the rotation boxes. This approach not only improves detection accuracy but also maintains high computational efficiency.
On the HRSC2016 dataset, we observed similar trends. The combined use of different modules achieved better performance compared to using a single module alone. The integration of DDL and BC-ROPN enhanced the model’s capacity to extract spatial directional features, resulting in improved accuracy in localization regression and classification, all while maintaining high computational efficiency and rapid image detection speeds. The experimental results indicate that no conflicts exist among the proposed modules, and peak performance of the model is achieved when all recommended methods are utilized in conjunction.
4.4.2. Ablation Experiments on the Number of Kernels Inside the Module
To gain a deeper understanding of how the number of kernels within the module affects overall performance, we conducted ablation studies on the DDL module using the UCAS-AOD dataset. Based on the baseline model (R-CNN), this experiment varied the number of kernels embedded in the DDL module to assess changes in the Params, FLOPs, and mAP. The model was evaluated with a single-kernel configuration using ResNet-50 (R50) [25] and ResNet-101 (R101) [25] as the backbone networks. Subsequently, we inserted the DDL module into the R50 backbone network and gradually increased the number of kernels to observe changes in performance.
Table 2 presents the specific experimental results under different kernel configurations. It can be observed that when the number of kernels does not exceed 6 and R50 is used as the backbone network, both the Params and FLOPs increase as kernels are added. For instance, as the number of kernels increases from 1 to 6, the Params and FLOPs of the DDL-R50 model rise from 41.18 M to 96.52 M and from 211.45 G to 212.08 G, respectively. More importantly, the mAP increases from 76.97% to 77.38%, showing a consistent upward trend. This indicates that more kernels give the model a stronger feature representation capability and, consequently, higher detection accuracy.
However, when the number of kernels increases beyond 6, up to 10, both the model's Params and FLOPs increase further, from 96.52 M to 144.53 M and from 212.06 G to 212.68 G, respectively, while the mAP hardly improves. This indicates that using more than 6 kernels not only fails to enhance the model's feature representation capacity but also wastes significant computational resources on redundant feature expressions, reducing the model's operational efficiency.
Overall, the experimental results show that increasing the number of kernels in the new method to 6 significantly improves performance while maintaining computational efficiency. That is, the model applying the DDL module achieves a good balance between Params and computational complexity when the number of kernels is set to 6.
4.4.3. Experimental Evaluation of BC-ROPN
To evaluate the recall performance of BC-ROPN in object detection tasks, we carried out experiments on the UCAS-AOD validation set using R-CNN+BC-ROPN as the experimental model and ResNet-50 [25] as the backbone network. With the intersection over union (IoU) threshold for matching with the ground-truth box set to 0.5, we tallied the recall rates of the model when 300, 1000, and 2000 proposal boxes were used as inputs. The detailed experimental results can be found in Table 3. The results show that BC-ROPN achieved a recall rate of 92.80% with 2000 proposal boxes. When the number of proposal boxes was reduced to 1000, the recall rate declined only slightly, by 0.6%. However, when the number of proposal boxes was further reduced to 300, the recall rate dropped significantly. Balancing inference speed and detection accuracy, we selected 1000 proposal boxes as the standard input for BC-ROPN in our experiments, attaining a reasonable equilibrium between efficiency and performance.
4.5. Comparison with Other Advanced Methods
In this experimental setup, we benchmarked our proposed DDL R-CNN against 8 target detection networks on the UCAS-AOD dataset and 14 methodologies on the HRSC2016 dataset. Table 4 reports the detailed comparison results for UCAS-AOD. On the UCAS-AOD dataset, our method achieved precision rates of 90.73% for vehicles and 90.91% for aircraft, with an overall mAP of 90.82%. Compared to advanced methods such as RetinaNet, RoI Trans, RIDet-Q, and CFC-Net, our method outperformed them in average precision by 4.32%, 1.87%, 1.59%, and 1.33%, respectively. The experimental results are shown in Figure 4, demonstrating our method's effective detection of both aircraft and cars. The three images on the left half show detection boxes closely fitting the edges of aircraft of various sizes and orientations, including densely packed or interlaced areas where overlapping regions are distinguished by the detection boxes. The three images on the right half show similar results, with densely distributed and multi-directional cars being represented by compact detection boxes. This outcome validates the efficacy of our approach in tackling complex detection tasks that encompass multi-directional, densely packed, or small targets; shape-like distractors; and occlusions due to light or shadows.
The comparative visualization results on the UCAS-AOD dataset are depicted in Figure 5, illustrating the detection outcomes of RetinaNet, R2CNN, and O-RCNN in contrast with the method proposed in this study. These methods exhibit significant differences in terms of anchor box size convergence and angular accuracy. Specifically, RetinaNet misidentifies targets whose color is similar to the background, as observable in the first and third images of the first row, and it also misses small targets, as shown in the second and fourth images of the first row, indicating its limitations in detecting small-sized targets against complex backgrounds. The detection results of R2CNN reveal the network's performance constraints under specific conditions. When the target color is similar to the background color, R2CNN's detection is also suboptimal: as seen in the first and second images of the first row, the network fails to accurately distinguish targets that are close in color to the background, demonstrating that R2CNN's recognition ability is compromised in low-color-contrast situations. Moreover, R2CNN also performs poorly on small targets, as illustrated in the fourth image of the first row, aligning with the challenges noted in the literature regarding small-target detection: small targets occupy fewer pixels in the image and have fewer distinct features, increasing the difficulty of detection. Although O-RCNN's detection results are similar to those of our method, that network still faces recognition challenges when the target color is similar to the background color, as indicated by the red dashed boxes in the first and third images of the first row: when targets are close in color to the background, O-RCNN's detection performance declines significantly. This challenge highlights that even advanced object detection algorithms struggle to avoid misjudgments and omissions in complex background conditions.
Furthermore, as shown in Table 5, our method also demonstrated excellent performance on the HRSC2016 dataset. Employing R-50-FPN as the backbone network, our method achieves an mAP of 97.81%, the highest value among all the methods compared. Compared with methods such as Gliding Vertex, PIoU, and R3Det, our method outperformed them in mAP by 9.61%, 8.61%, and 8.55%, respectively. The specific experimental results are shown in Figure 6, where the detection boxes tightly mark the direction and edges of the ships and show strong discrimination for similar objects, such as elongated piers, warehouses, water waves, and containers. Similar to the situation on the UCAS-AOD dataset, the results on this dataset indicate that existing methods with rotating convolutional kernels, such as Rotated RPN, use multiple predefined and fixed rotation angles for feature extraction. In port scenarios, the orientations of ships are arbitrarily distributed from 0° to 360°, and these predefined rotation angles are difficult to align effectively with the actual ships, leading to inaccurate directional feature information during feature extraction. Moreover, our method refines the rotation by incorporating the extracted directional features on the target's bounding rectangle, resulting in more compact rotation boxes than direct angle-based rotation methods such as PIoU, and thus enabling the extraction of more accurate features.
Further comparative visualization results on the HRSC2016 dataset are shown in Figure 7, where the detection outcomes of RetinaNet, R2CNN, and O-RCNN are contrasted with the method proposed in this study. Significant differences are observed in the aspects of small-target recognition, anchor box convergence, and the accuracy of rotation direction. Specifically, the method proposed in this study excels in the recognition of small targets. For instance, as demonstrated by the red dashed box in the first image of the first row, RetinaNet failed to detect a small target that was successfully identified by our method. RetinaNet also shows insufficient performance in identifying targets that are similar in color to the background, as revealed by the red dashed box in the third image of the first row. Moreover, the misjudgment phenomenon of RetinaNet in the recognition of similar target objects, as shown by the red dashed box in the last image of the first row, is avoided by our method. Compared with R2CNN, although the overall detection performance is similar, specific cases reveal its limitations, such as the misidentification of a pier as a ship in the second image of the first row, and the failure to accurately identify ships, as shown by the red dashed boxes in the third and fourth images of the first row, indicating R2CNN's deficiency in anchor box convergence. Compared with O-RCNN, the advantage of our method in small-target recognition is evident, as shown in the first image of the first row, where O-RCNN failed to effectively recognize small-sized targets and also exhibited misjudgment in similar-target recognition, as demonstrated in the third image of the first row. Overall, our method outperforms these three networks in terms of overall anchor box convergence and the accuracy of rotation direction, demonstrating the superiority of our method in object detection tasks.
Table 6 provides a complexity analysis of existing methods trained on the UCAS-AOD dataset, including FLOPs, Params, and mAP. Compared with two-stage algorithms such as RoI Transformer, our algorithm achieves a precision increase of 1.87%. Moreover, compared with advanced single-stage algorithms such as R3Det-KLD, our algorithm reduces FLOPs by 124.03 G and improves mAP by 6.9%. Compared with single-stage detection methods, DDL R-CNN demonstrates significant advantages in accuracy and feature fusion, effectively enhancing the recognition and localization of targets in complex scenes through two-stage refined processing. At the same time, compared with other two-stage detection methods, DDL R-CNN maintains high accuracy while optimizing parameters and computational complexity, achieving faster inference speeds and making it more advantageous in applications requiring real-time performance. Furthermore, DDL R-CNN effectively mitigates overfitting through data augmentation and regularization techniques, enhancing the model's generalization and robustness. Finally, these results further validate the accuracy of our method in extracting multi-directional features in high-resolution RS image detection tasks with multiple interferences.
5. Conclusions
This paper introduces a novel two-stage rotating detection network called DDL R-CNN for detecting multi-directional image targets. The method consists of a dynamic direction learning module and a boundary center region offset generation network, which address the angle matching issues at the feature level and the training instability issues at the sample level present in existing methods, and it has been validated through extensive experimentation. The experimental results showed an mAP of 90.82% on the UCAS-AOD dataset and 97.81% on the HRSC2016 dataset. Compared with existing advanced two-stage rotational detection networks, the proposed method significantly improves the detection accuracy of multi-directional targets, and it maintains a certain level of competitiveness in efficiency compared with single-stage rotational detection networks.
Future research is expected to be carried out in the following directions. Cross-domain applications could see DDL R-CNN applied to the medical image analysis field, such as in tumor detection or organ segmentation, where the adaptive nature of the network could significantly enhance detection accuracy due to the varied shapes and orientations of targets. Video object detection is another area where DDL R-CNN could be explored, particularly in dynamically changing scenes for detecting and tracking targets, with the integration of temporal information allowing the model to better handle target motion and changes. Additionally, small-object detection in specific scenarios like surveillance videos or drone-captured images remains a challenge, and research could focus on optimizing DDL R-CNN to improve recognition capabilities for these small targets. Lastly, multimodal data fusion, combining DDL R-CNN with other data types such as LiDAR and infrared images, could be investigated to enhance the model’s understanding of complex scenes. These directions not only pave the way for more accurate and robust detection systems but also expand the applicability of DDL R-CNN across various domains.