1. Introduction
Remote sensing (RS) is a technology that automatically collects information about the Earth's surface from a distance using sensors mounted on satellites, aircraft, or drones. It is widely used in environmental monitoring, urban planning, agricultural management, military reconnaissance, disaster response, and geological exploration. The technology offers several advantages, including broad coverage, strong real-time data acquisition capabilities, and high cost-effectiveness.
In research on RS image target detection, detection methods for rotating targets are critical [1,2,3]. Traditional detection methods often rely on horizontal anchor boxes, as in the SSD [4], YOLO [5,6,7], and Faster-RCNN [8] detection networks. When dealing with rotating targets, this causes misalignment between the target and the proposal region, leading to a decrease in detection accuracy. To address this issue, researchers have proposed various methods involving rotating anchor boxes to improve the performance of RS detection networks. One approach is to use rotating anchors instead of traditional horizontal anchors to match the orientation of the target. For example, Rotated RPN [9] improves detection accuracy by pre-setting anchors at different angles to cover the various directions in which targets may appear. Furthermore, certain techniques mitigate the impact of horizontal proposal regions by directly regressing the coordinates of the detection box's four vertices; an example is the Gliding Vertex [10] method, which can accommodate targets of arbitrary orientation. Although rotating anchors and vertex regression have improved the detection performance for rotating targets, they still face issues with computational complexity. To enhance detection efficiency, several studies investigate reducing computational complexity without compromising, or even while improving, detection accuracy. This includes developing new loss functions to train better networks, as well as designing more efficient network structures to handle rotating targets.
In the realm of RS detection, many researchers focus on enhancing the network's ability to capture the directional characteristics of targets during the feature extraction phase. To transcend the constraints of conventional convolutional neural networks, which struggle to recognize orientation features when dealing with rotating targets, approaches involving rotating convolution kernels and rotating anchor boxes have been introduced. These techniques adjust the orientation or shape of the convolution kernels to match the arbitrary rotation angles of targets and thereby enhance detection accuracy. For instance, the R2CNN [11] method identifies rotating targets by detecting the first two corners in a clockwise sequence along with the rectangle's height. CenterNet [12], on the other hand, locates the center of the target through keypoint detection and then expands outward from this center to determine the target's bounding box. These approaches not only refine the convolution operation and the regression of anchor boxes for more comprehensive target recognition but also drive the evolution of network architectures, including the adoption of direction-sensitive convolution kernels and rotating pooling layers, which significantly boost the model's adaptability to rotational changes and its recognition capabilities. However, current technologies still face challenges in the design of feature extractors. In particular, an appropriate balance must be struck between the number of rotation angles and the resulting increase in computational cost. In addition, the loss function can fluctuate sharply when regressing rotating anchor boxes with very high aspect ratios.
In conclusion, existing methods for rotating convolutions and bounding box regression have not adequately tackled the problem of aligning the feature and sample levels of detection networks with the angular spatial traits of targets. More precisely, at the feature level, it is difficult for the angles of rotating convolution kernels to align with the angles of targets so as to accommodate their angular variations. As depicted in Figure 1a, approaches that pre-set fixed rotation angles have difficulty in balancing computational cost against the number of pre-set angles: an excessive number of pre-set angles escalates the computational burden, while too few can result in inaccurate extraction of directional features. In contrast, our proposed method adopts a data-driven strategy to dynamically acquire the angles, generating a coarse representation of the target's direction within the image. This angle then guides the rotation of the convolution kernels to precisely capture the target's directional features.
At the sample level, anchor boxes must be aligned with the orientation of the target while avoiding several problems that arise during training: inconsistent directionality in angle regression, the cyclical nature of angle changes (from 360 degrees back to 0 degrees), and the substantial variation involved in regressing high-aspect-ratio (slender) anchor boxes from their pre-set positions to the target positions. These problems can give rise to drastic changes in the loss function and cause unstable training.
Figure 1b highlights the challenges of current methods that regress anchor boxes using angles. When angle regression is employed for anchor boxes, two scenarios may occur: First, a minor clockwise rotation of the red box could align it with the target box (green box), as indicated by the rotation of the yellow box. Second, a significant counterclockwise rotation of the red box could align it with the target box (green box), as shown by the rotation of the blue box. Encountering the second scenario can lead to a problem where the large rotation angle causes drastic changes in the loss function. Our proposed method, however, abandons the use of angles for the direction regression of anchor boxes, choosing instead to directly use two offset values to regress the center points of each side of the original anchor box. This approach avoids the difficulties in regression due to inconsistent rotation directions and the challenges related to rotating high-aspect-ratio anchor boxes.
Through the analysis, it is clear that no matter how we design methods for rotating convolutions and anchor box regression, we cannot avoid the challenge of aligning feature and sample information with the spatial orientation of targets in RS detection tasks. Compared to existing rotating detection networks, it is particularly important to develop a dynamic, target-angle-sensitive feature extractor and an anchor box regression method that is not sensitive to angles. To address this, we introduce the Dynamic Direction Learning R-CNN (DDL R-CNN), which is composed of two main parts: the dynamic direction learning (DDL) module and the Boundary Center Region Offset Proposal Network (BC-ROPN). First, the DDL module is integrated into the network’s backbone to extract directional features of targets from initial feature maps and convert them into angles. Then, the rotating convolutional kernels adjust their rotation angles based on the output of this module, enabling them to accurately capture the target’s directional information. Following this, BC-ROPN uses the boundary center features of the oriented bounding rectangle, along with the rotation angles and angle weights obtained from the DDL module, to generate boundary center offsets. These offsets direct the regression of the rotating anchor boxes within the confines of the oriented bounding rectangle. This approach results in a smoother convergence of the loss function and yields more nuanced and precise outcomes, effectively circumventing the issue of abrupt fluctuations in the loss function values. The novel contributions we present are outlined below.
- (1) We carried out a systematic analysis of the difficulty that existing RS target detection networks have in aligning feature-level and sample-level information with the spatial orientation of targets.
- (2) We introduced a rotating feature extractor that uses early extraction of target directional features to pre-set the rotation direction of dynamic directional convolution kernels. By assigning an angle weight to each pre-set angle based on these features, we enhance the effectiveness of the convolution kernels during rotation. Compared with fixed pre-set rotation methods, this reduces redundant extraction of directional features, conserves resources, and improves the efficacy of directional feature extraction.
- (3) We proposed an anchor box convergence approach that uses the early extracted directional feature information to generate boundary center offsets for oriented bounding rectangle anchor boxes. This offset regression method mitigates the drastic changes in the loss function caused by angle regression and ensures that anchor boxes converge more tightly and accurately.
3. Method
In this section, we first introduce the design and functionality of the routing function (Section 3.1). Figure 2b illustrates its overall structure, in which the input image features are processed by the function to generate a set of two-dimensional data: one dimension pertains to the angles of targets within the image, and the other corresponds to the angle weights associated with those angles. We then discuss the structure and implementation details of the DDL module (Section 3.2). Figure 2a shows the overall architecture of this module, which uses the routing function to generate this set of two-dimensional data from the input feature maps and combines it with a set of original convolution kernel parameters to ultimately produce a new set of feature maps. Finally, Section 3.3 explains the overall network structure of the proposed DDL R-CNN and details the implementation of BC-ROPN. Figure 3 presents the overall architecture of the network, which is a two-stage detector using an FPN for feature extraction and processing of images. Before anchor box regression, it incorporates the anchor box boundary center offsets extracted from the feature maps generated by the DDL module. This enables anchor box regression based on the boundary center points of the target's outer bounding rectangle and ultimately outputs the detection results.
3.1. Routing Function
The conventional convolutional approach, which employs consistent parameters and sliding window directions for feature extraction in different RS images, struggles to effectively extract the spatial orientation features of targets that are rotated. The prevalent method to tackle this challenge involves rotating a convolutional kernel through a set of predefined angles to extract the multi-directional spatial features of the targets. However, this approach, which is dependent on pre-set angles, restricts the model’s capacity to adapt to targets at any arbitrary angle. This limitation can lead to less accurate and incomplete extraction of spatial orientation features, ultimately diminishing the model’s ability to generalize across different scenarios.
To tackle this challenge, we introduce the routing function, which carries out detailed feature extraction on the input feature maps. By integrating spatial information and smoothly processing angle data, it generates a set of predicted rotation angles that offer a preliminary estimate of the target's spatial orientation, enhancing the robustness and multi-directional capability of spatial feature extraction. Like the process for generating the rotation angles, creating the combination weights also involves detailed feature extraction and spatial information integration. However, these spatial features must be mapped non-linearly to produce suitable combination weights that fall within the range of 0 to 1. These weights are subsequently employed to combine the output feature maps derived from convolutional kernels at diverse rotation angles. The weights indicate the contribution of each kernel to the target's overall spatial feature representation, enabling the model to adaptively emphasize features in certain directions while suppressing others that might cause interference. This approach allows the model to synthesize information from different directions, resulting in a comprehensive feature representation that captures the spatial characteristics of the target more effectively. The general architecture of our proposed routing function is shown in Figure 2b.
Specifically, the input feature map is first subjected to a lightweight depthwise convolution, followed by a linear mapping [2] and ReLU activation. This process begins with the refined extraction of features:
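Equation (1) is not written out explicitly here; a plausible reconstruction from the definitions given immediately below (with $X$ denoting the input feature map and $\widetilde{X}$ the activated output, our own symbols) is:

$$\widetilde{X} = \mathrm{ReLU}\big(\mathrm{LN}(X * K + b)\big) \qquad (1)$$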
For Equation (1), the asterisk (*) denotes the depthwise convolution operation, K is the depthwise convolution kernel, LN stands for layer normalization, b is the bias term of the convolution layer, and ReLU is the activation function. The convolution effectively captures local spatial characteristics from the input feature map; layer normalization improves the convergence rate and stability of the network during training; and the ReLU activation enhances the model's ability to capture non-linearities, allowing the network to detect more complex patterns. The formula yields the activated features, which are then pooled into a feature vector using average pooling. This averaged feature vector is passed on to two different branches.
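A plausible form of Equation (2), matching the pooling step just described (with $v$ denoting the pooled feature vector and $\widetilde{X}$ the activated feature map, our own symbols), is:

$$v = \mathrm{AvgPool}\big(\widetilde{X}\big) \qquad (2)$$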
For Equation (2), the input is the activated feature map produced by the previous step. The average pooling operation is a key step in the feature extraction process: it compresses the data, integrates global information, and provides an input of the appropriate size for subsequent network layers. The first branch is dedicated to predicting rotation angles and comprises a linear layer coupled with a softsign activation function. To avoid biased angle estimates, the bias term of this linear layer is intentionally set to 0.
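A plausible form of Equation (3), matching the description of the angle prediction branch (with $W_{\theta}$ the branch's weight matrix and $\theta$ the predicted rotation angles, our own symbols), is:

$$\theta = \mathrm{softsign}\big(W_{\theta}\, v\big) \qquad (3)$$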
For Equation (3), the linear layer's weight matrix maps the pooled feature vector v to the predicted rotation angles. The softsign activation provides a non-linear transformation, allowing the branch to capture complex patterns and features. This design enables the angle prediction branch to respond coarsely to directional changes in the input feature vector, enhancing the convolutional kernel's sensitivity to target orientation changes, thereby improving the accuracy of spatial directional feature extraction by the rotating convolutional kernel and its generalization to multi-directional targets. The second branch, the combination weight prediction branch, is responsible for predicting the combination weights. It employs a linear layer with a bias term followed by a sigmoid activation function.
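A plausible form of Equation (4), matching the description of the combination weight prediction branch (with $W_{\lambda}$ and $b_{\lambda}$ the branch's weight matrix and bias, and $\lambda$ the predicted combination weights, our own symbols), is:

$$\lambda = \mathrm{sigmoid}\big(W_{\lambda}\, v + b_{\lambda}\big) \qquad (4)$$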
For Equation (4), the linear layer's weight matrix and bias act on the pooled feature vector v to produce the predicted combination weights. The sigmoid activation compresses the output into the interval (0, 1), which is suitable for representing weights. The combination weight prediction branch thus learns how to dynamically weight the feature responses from different directions based on the input features. The routing function's parameters are initialized from a truncated normal distribution with a mean of 0 and a standard deviation of 0.2, which ensures that the module generates small values at the beginning of training.
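To make the two-branch structure concrete, the following is a minimal PyTorch sketch of a routing function of this kind. The class name, channel count, kernel size, and the dimensionality of the angle/weight outputs (one per kernel) are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn


class RoutingFunction(nn.Module):
    """Minimal sketch of the routing function in Section 3.1 (hypothetical naming)."""

    def __init__(self, channels: int, n_kernels: int, kernel_size: int = 3):
        super().__init__()
        # Lightweight depthwise convolution for refined local feature extraction.
        self.dwconv = nn.Conv2d(channels, channels, kernel_size,
                                padding=kernel_size // 2, groups=channels)
        self.norm = nn.LayerNorm(channels)      # layer normalization over channels
        self.act = nn.ReLU(inplace=True)
        self.pool = nn.AdaptiveAvgPool2d(1)     # global average pooling -> vector v
        # Angle branch: linear layer without bias + softsign (cf. Equation (3)).
        self.fc_theta = nn.Linear(channels, n_kernels, bias=False)
        # Combination-weight branch: linear layer with bias + sigmoid (cf. Equation (4)).
        self.fc_lambda = nn.Linear(channels, n_kernels, bias=True)
        # Truncated normal initialization with mean 0 and std 0.2, as stated above.
        nn.init.trunc_normal_(self.fc_theta.weight, std=0.2)
        nn.init.trunc_normal_(self.fc_lambda.weight, std=0.2)

    def forward(self, x: torch.Tensor):
        f = self.dwconv(x)
        f = self.norm(f.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        f = self.act(f)
        v = self.pool(f).flatten(1)                         # pooled feature vector
        theta = nn.functional.softsign(self.fc_theta(v))    # rotation angles in (-1, 1)
        lam = torch.sigmoid(self.fc_lambda(v))              # combination weights in (0, 1)
        return theta, lam
```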
3.2. Dynamic Directional Learning (DDL) Module
Previous detectors typically rotate convolutional kernels by fixed angles based on prior knowledge, which makes it difficult to adjust dynamically to targets with diverse orientations. To enable the detector to automatically perceive arbitrary changes in target direction, a DDL module is proposed. This module has n kernels, each with the same shape. It uses the rotation angles and combination weights generated by the routing function to guide the rotation of the convolutional kernels to different angles, thereby extracting more refined spatial features of the target. Finally, the spatial features extracted by these differently rotated convolutional kernels are fused through weighting. Referencing Figure 2, the detailed operations are delineated below:
Firstly, the set of predicted rotation angles obtained from the routing function is used to rotate each of the n kernels individually.
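A plausible form of Equation (5), following the description below (with $K_i$ the i-th original kernel, $\theta_i$ its predicted rotation angle, and $\widetilde{K}_i$ the rotated kernel, our own symbols), is:

$$\widetilde{K}_i = \mathrm{Rotate}\big(K_i, \theta_i\big), \quad i = 1, \dots, n \qquad (5)$$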
For Equation (5), each original kernel is rotated by its corresponding predicted rotation angle, producing the rotated kernel. The Rotate() function rotates a kernel by the predicted angle; it defines the counterclockwise direction as positive and uses bilinear interpolation to fill in the sampling locations left empty by the rotation.
The output is generally computed by convolving each rotated kernel with the input feature map and then summing the resulting feature maps element by element with their combination weights:
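A plausible form of Equation (6), following the description below (with $\lambda_i$ the combination weights, $x$ the input feature map, and $y$ the combined output, our own symbols), is:

$$y = \sum_{i=1}^{n} \lambda_i \big(\widetilde{K}_i * x\big) \qquad (6)$$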
In Equation (6), the combination weights predicted by the routing function scale each kernel's contribution, * denotes the convolution operation, and y is the combined output feature map. However, this approach requires an individual convolution operation for each kernel, followed by an addition step, potentially reducing computational efficiency. Consequently, we reformulate Equation (6) as follows:
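A plausible form of the reformulated Equation (7), exploiting the linearity of convolution in the kernel (same symbols as above), is:

$$y = \Big(\sum_{i=1}^{n} \lambda_i \widetilde{K}_i\Big) * x \qquad (7)$$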
This implies that convolving the input features with each kernel separately and summing the weighted outputs (Equation (6)) is equivalent to first multiplying the kernels by their respective combination weights and summing them (as shown in Figure 2a), and then performing a single convolution (Equation (7)). In Equation (7), the convolution is computed only once, whereas Equation (6) requires one convolution per kernel. This strategy significantly reduces computational cost while preserving the model's feature expression capability.
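As a quick sanity check of this equivalence, the hypothetical PyTorch snippet below compares the two computation orders on random tensors (the shapes and the number of kernels are arbitrary); because convolution is linear in the kernel, both orders produce the same output up to floating-point error.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 8, 32, 32)           # input feature map
kernels = torch.randn(4, 16, 8, 3, 3)   # n = 4 "rotated" kernels, each (out, in, k, k)
lam = torch.rand(4)                     # combination weights from the routing function

# Equation (6): one convolution per kernel, then a weighted element-wise sum.
y_separate = sum(lam[i] * F.conv2d(x, kernels[i], padding=1) for i in range(4))

# Equation (7): combine the weighted kernels first, then a single convolution.
combined = (lam.view(4, 1, 1, 1, 1) * kernels).sum(dim=0)
y_combined = F.conv2d(x, combined, padding=1)

print(torch.allclose(y_separate, y_combined, atol=1e-4))   # True
```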
3.3. Dynamic Direction Learning R-CNN (DDL R-CNN)
The goal of DDL R-CNN is to leverage a lightweight, fully convolutional network to merge features from different levels and produce a set of sparse, oriented proposal regions. The original mechanism for rotating objects had ambiguous choices for angle regression, leading to substantial losses when large angles were selected. Additionally, with high-aspect-ratio targets, even small predicted angle changes could cause dramatic shifts in the loss function, resulting in unstable training. To address these challenges, the innovative rotation mechanism leverages the existing oriented proposal regions and the composite feature maps from the DDL module to enable a compact and rapid regression of rotated bounding boxes. This strategy not only significantly reduces the volume of parameters but also ensures that the loss function values remain within a reasonable range.
The specific network structure is shown in Figure 3. The model takes the five levels of features produced by the FPN and attaches an identical head to the feature map at each level. This head consists of a convolutional layer followed by two parallel convolutional layers. Across all feature levels, we place three horizontal anchors at each spatial location, with aspect ratios of 1:2, 1:1, and 2:1. The anchors on the five feature levels are assigned level-specific pixel areas. The position of each anchor a is specified by a four-dimensional vector giving the anchor's center coordinates, width, and height. One of the two parallel convolutional layers serves as the regression branch: it integrates the directional feature maps produced by the DDL module, merges these features, and outputs offsets for the proposal regions relative to the anchors. For every location on the feature maps, we produce A proposal regions, where A equals the number of anchors per location (set to 3 in this paper), so the regression branch outputs 6A values. The directional proposal regions are derived by decoding these regression results. The decoding procedure is detailed below:
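The exact form of Equation (8) depends on the paper's decoding convention; a plausible sketch, assuming the common midpoint-offset style of decoding (with $(x_a, y_a, w_a, h_a)$ the anchor parameters and $(\delta_x, \delta_y, \delta_w, \delta_h, \delta_\alpha, \delta_\beta)$ the six regression outputs per anchor; all symbols are our own), is:

$$x = x_a + \delta_x w_a,\quad y = y_a + \delta_y h_a,\quad w = w_a e^{\delta_w},\quad h = h_a e^{\delta_h},\quad \Delta\alpha = \delta_\alpha \cdot w,\quad \Delta\beta = \delta_\beta \cdot h \qquad (8)$$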
The center coordinates of the predicted candidate region, together with the width w and height h of its outer bounding box, make up four of the decoded parameters; the remaining two are the center-point offsets of the top (bottom) edge and the right (left) edge of the outer bounding box.
Finally, based on the characteristics of these parameters, we designed a new oriented proposal region representation scheme, known as the Boundary Center Offset Representation, which constitutes the BC-ROPN part shown in Figure 3. The process is illustrated in the right (tail) diagram of Figure 1b. The oriented bounding box O serves as the outer rectangle, the midpoint of its bounding box is indicated by the yellow dot, and the vertices of the oriented bounding box O are represented by the black points. Specifically, we use the six parameters calculated through Equation (8) to represent the oriented bounding box. With these six parameters, we can assign four corrected (offset) vertex coordinates to each proposal region: the first vertex is offset relative to the midpoint of the top edge and, by symmetry, the opposite vertex is offset relative to the midpoint of the bottom edge; likewise, the remaining pair of vertices are offset relative to the midpoints of the right and left edges, respectively. Therefore, the coordinates of the four vertices are expressed as follows:
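The vertex equations follow from the offsets just described; assuming the outer rectangle parameters $(x, y, w, h)$ and midpoint offsets $\Delta\alpha$, $\Delta\beta$ from the sketch of Equation (8) above (our own notation), the four vertices take the form:

$$v_1 = \big(x + \Delta\alpha,\; y - \tfrac{h}{2}\big),\quad v_2 = \big(x + \tfrac{w}{2},\; y + \Delta\beta\big),\quad v_3 = \big(x - \Delta\alpha,\; y + \tfrac{h}{2}\big),\quad v_4 = \big(x - \tfrac{w}{2},\; y - \Delta\beta\big)$$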
This representation enables the regression of each directional proposal region by predicting the parameters of the external bounding box together with the parameters from which the midpoint offsets are derived.
4. Experiment
In this section, we conduct an empirical assessment of the proposed network integrated with a DDL module across various detection frameworks. We begin in Section 4.1 with a thorough description of the datasets utilized in our experiments. Section 4.2 details the experimental setup and parameter configurations. The evaluation metrics to be employed in the experiments are explained in Section 4.3. Section 4.4 includes ablation studies and visualization of results that further substantiate the efficacy of our proposed method. Finally, Section 4.5 presents the key findings of our approach alongside comparisons with other methods for rotating target detection on two widely used datasets.
4.1. Dataset
For a thorough evaluation of the capabilities of our proposed method in detecting oriented targets, it was subjected to testing on two highly regarded standard datasets: UCAS-AOD [22] and HRSC2016 [23].
The UCAS-AOD dataset, issued by the University of Chinese Academy of Sciences, is dedicated to the detection of aircraft and vehicles in RS imagery. It consists of 1000 images of aircraft and 510 images of vehicles, with a total of 7482 aircraft and 7114 vehicles annotated. The images are derived from Google Earth satellite imagery, ensuring the authenticity and diversity of the scenes. All images have a uniform size and are stored in .png format, with the entire dataset occupying approximately 3.48 GB. The dataset mainly focuses on medium- and small-sized targets in RS images, such as aircraft and vehicles. The targets in these images are typically multi-directional and densely packed, which presents challenges for detection algorithms and also makes it an ideal dataset for assessing the performance of oriented target detection algorithms. In our experiments, since UCAS-AOD emphasizes improving the detection accuracy of multi-directional targets, we preprocessed the dataset's annotation files into a rotated bounding box format. The dataset was divided as follows: 60% training set, 20% validation set, and 20% test set.
The HRSC2016 dataset is a collection focused on ship detection in high-resolution RS imagery. It comprises 1061 images of varying sizes, with resolutions ranging from 300 × 300 to 1500 × 900 pixels, and encompasses ship instances of various sizes and orientations. The detection task is challenging: ships may appear slender and densely packed side by side, and there are numerous ship-like distractors such as waves, rectangular warehouses, and slender piers. With its image and instance counts, the dataset offers ample training and testing material, making it a vital resource for evaluating and enhancing the performance of ship detection algorithms. In this experiment, the dataset was divided into 60% as the training set, 20% as the validation set, and 20% as the test set.
4.2. Parameter Setting and Tuning Methods
We trained our model on an RTX 2080Ti with a batch size of 4. The experimental results were obtained with the mmdetection [24] toolkit. Both ResNet50 [25] and ResNet101 [25] served as our backbone networks, pre-trained on ImageNet [26]. Data augmentation included horizontal and vertical flipping. We used the SGD optimizer, setting the learning rate to 0.01, momentum to 0.9, and weight decay to 0.0001. Training was conducted for 36 epochs with R-CNN [27], with the learning rate gradually warmed up to its peak of 0.01 during the initial 1–4 epochs and then gradually reduced until the end of training. The Non-Maximum Suppression (NMS) threshold was set to 0.8. For the UCAS-AOD dataset, we resized the original images to a fixed size. For the HRSC2016 dataset, we resized the original images while approximately preserving their aspect ratio, with the shorter side resized to 800 pixels and the longer side restricted to no more than 1333 pixels.
In this experiment, to optimize the performance of our proposed oriented object detection method and ensure the reproducibility of the results, we employed a variety of comprehensive tuning strategies. Initially, we utilized the mmsplict tool from the mmdetection package to perform data augmentation, which included techniques such as random horizontal flipping, vertical flipping, random rotation, scaling, color jittering, and random cropping. These methods aimed to increase the diversity of the training data, enhance the model’s adaptability to different scenarios and object orientations, and reduce overfitting to the training set. In terms of hyperparameter tuning, we meticulously documented the basis and range of initial hyperparameter choices and employed methods like grid search and random search for optimization, ensuring the identification of the optimal hyperparameter configuration to boost model performance. Additionally, we analyzed the specific impact of each hyperparameter on model performance and ultimately determined the best parameter settings. To further mitigate the risk of overfitting, we introduced Dropout layers to enhance the model’s robustness, implemented L2 weight regularization to limit model complexity, used early stopping to monitor validation set performance and prevent overfitting, and applied batch normalization to reduce internal covariate shift, accelerate the training process, and improve the model’s generalization capabilities. These strategies ensured the model’s accuracy and stability when dealing with complex backgrounds and multi-directional targets. Through these integrated tuning methods, the experiment achieved high detection accuracy while maintaining consistency across different settings, laying a solid foundation for future research and applications.
4.3. Experimental Evaluation Index
The mean average precision (mAP) value was used in this experiment to assess the accuracy of the model. We chose mAP as the core evaluation metric for this object detection experiment because it takes into account both the classification and localization performance of the model. Specifically, for each category, we first calculated its recall (R) and precision (P):
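In their standard form, consistent with the definitions that follow, precision and recall are computed as:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$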
True negative, false negative, true positive, and false positive are denoted as TN, FN, TP, and FP, respectively. Precision (P) is the proportion of correctly identified targets among all detected targets. Recall (R) is the proportion of ground-truth instances that the model is able to identify. While higher values of both precision (P) and recall (R) are desirable, there is an inherent trade-off between them; enhancing precision may reduce recall, and vice versa. Thus, achieving a balance between precision and recall is essential. By examining the relationship between precision and recall at various classification thresholds and plotting their corresponding points, a precision–recall (PR) curve can be generated to assess the model's classification performance. This curve represents the equilibrium between the classifier's precision in identifying positive instances and its comprehensiveness in capturing positive instances. The area under the curve (the AP value) indicates the model's discriminative power for a given class, with a higher AP value signifying superior performance. The AP value is calculated by integrating the area under the curve, specifically using the following equation:
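In its standard form, the AP value is the area under the PR curve:

$$AP = \int_{0}^{1} P(R)\, dR$$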
The mAP is calculated by constructing precision–recall (PR) curves for each category, determining the area under these curves for different categories, and then averaging all the average precision (AP) values to assess the performance of a multi-class classification model. The specific definition is as follows:
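Its standard form, consistent with the description above, is:

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$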
where N is the number of classes in the dataset, and AP_i is the AP value for class i.
Additionally, this experiment takes into account the number of floating-point operations (FLOPs) and the number of model parameters (Params) as metrics for evaluation. FLOPs are employed to gauge the computational complexity of the model, specifically referring to the total count of FLOPs executed during the forward pass of the network. A lower FLOP count indicates higher computational efficiency for the model. Meanwhile, Params is used to measure the size of the storage space the network requires.
4.4. Ablation Experiment
To further comprehend the contributions of each component within our proposed method for detecting oriented targets, we carried out ablation experiments on the UCAS-AOD dataset. The aim of these experiments was to assess the effects of the DDL module and BC-ROPN on overall performance.
4.4.1. Evaluation of Different Approaches
Table 1 presents the results from the UCAS-AOD dataset. The baseline model achieved an mAP of 89.62% without any additional modules. This is because the conventional R-CNN [27] convolutional structure only performs fixed multi-angle rotations on the convolutional kernels, which struggle to effectively model targets with diverse orientations across different categories. When the DDL module was introduced alone, the performance increased to 89.96%. This indicates that extracting rough directional features from the targets in the images and generating various rotation angles and weights for the convolutional kernels provide more accurate spatial directional information for subsequent feature extraction. Furthermore, when BC-ROPN was integrated into the model alone, the mAP reached 90.14%. This suggests that the rotating anchor box mechanism used by the network allows for clearer and more efficient rotation directions, avoiding the blurring of extracted directional features, which could lead to excessively high loss function values. Lastly, when both the DDL module and BC-ROPN were used in conjunction, the model's mAP reached 90.82%, which was the best performance among all experimental configurations. This result confirms the synergistic effect of the two modules, and their combined action helps enhance the model's ability to extract spatial directional features of rotating targets.
Additionally, Table 1 reports the computational cost (FLOPs) and the number of parameters (Params). The baseline model has 41.14 M parameters and 211.43 G FLOPs. When the DDL module is introduced, the Params rise to 65.87 M and the FLOPs increase to 211.85 G compared with the baseline model. This indicates that the DDL module, which extracts spatial directional features in advance and generates rotation angles and angle weights based on these features, does increase the model's Params to some extent, but it does not significantly raise the computational complexity. When BC-ROPN is introduced alone, the Params decrease to 40.74 M and the FLOPs drop to 211.41 G. This is because BC-ROPN employs a lightweight fully convolutional network to integrate different features and uses a new rotation box regression mechanism instead of the original angle regression, resulting in a reduction in Params and a slight decrease in FLOPs compared with the baseline model. After introducing both the DDL module and BC-ROPN, the Params and FLOPs increase to 74.38 M and 211.97 G, respectively. However, this variation is within an acceptable range, again indicating the synergistic effect of the DDL module and BC-ROPN, where the spatial directional features extracted in advance by the DDL module are applied to feature fusion in BC-ROPN and ultimately used for weighted fine-tuning of the rotation boxes. This approach not only improves detection accuracy but also maintains high computational efficiency.
On the HRSC2016 dataset, we observed similar trends. The combined use of different modules achieved better performance compared to using a single module alone. The integration of DDL and BC-ROPN enhanced the model’s capacity to extract spatial directional features, resulting in improved accuracy in localization regression and classification, all while maintaining high computational efficiency and rapid image detection speeds. The experimental results indicate that no conflicts exist among the proposed modules, and peak performance of the model is achieved when all recommended methods are utilized in conjunction.
4.4.2. Ablation Experiments on the Number of Kernels Inside the Module
To gain a deeper understanding of how the number of kernels within the module affects overall performance, we conducted ablation studies on the DDL module using the UCAS-AOD dataset. Based on the baseline model (R-CNN), this experiment varied the number of kernels embedded in the DDL module to assess changes in the Params, FLOPs, and mAP. The model was evaluated with a single-kernel configuration using ResNet-50 (R50) [25] and ResNet-101 (R101) [25] as the backbone networks. Subsequently, we inserted the DDL module into the R50 backbone network and gradually increased the number of kernels to observe changes in performance.
Table 2 presents the specific experimental results under different kernel configurations. It can be observed that when the number of kernels does not exceed 6 and R50 is used as the backbone network, both the Params and FLOPs increase as kernels are added. For instance, as the number of kernels increases from 1 to 6, the Params and FLOPs of the DDL-R50 model rise from 41.18 M to 96.52 M and from 211.45 G to 212.08 G, respectively. More importantly, the mAP increases from 76.97% to 77.38%, showing a consistent upward trend. This indicates that more kernels give the model a stronger feature representation capability and, consequently, higher detection accuracy.
However, when the number of kernels increases beyond 6, up to 10, both the model's Params and FLOPs increase further, from 96.52 M to 144.53 M and from 212.06 G to 212.68 G, respectively, while the mAP hardly improves. This indicates that using more than 6 kernels not only fails to enhance the model's feature representation capacity but also wastes significant computational resources on redundant feature expressions, reducing the model's operational efficiency.
Overall, the experimental results show that increasing the number of kernels in the new method to 6 significantly improves performance while maintaining computational efficiency. That is, the model applying the DDL module achieves a good balance between Params and computational complexity when the number of kernels is set to 6.
4.4.3. Experimental Evaluation of BC-ROPN
To evaluate the recall performance of BC-ROPN in object detection tasks, we carried out experiments on the UCAS-AOD validation set using R-CNN+BC-ROPN as the experimental model and ResNet-50 [25] as the backbone network. With the intersection over union (IoU) threshold for matching with the ground-truth box set to 0.5, we tallied the recall rates of the model when 300, 1000, and 2000 proposal boxes were used as inputs. The detailed experimental results can be found in Table 3. The results show that BC-ROPN achieved a recall rate of 92.80% with 2000 proposal boxes. When the number of proposal boxes was reduced to 1000, the recall rate declined only slightly, by 0.6%. However, when the number of proposal boxes was further reduced to 300, the recall rate dropped significantly. Balancing inference speed and detection accuracy, we selected 1000 proposal boxes as the standard input for BC-ROPN in our experiments, attaining a reasonable equilibrium between efficiency and performance.
4.5. Comparison with Other Advanced Methods
In this experimental setup, we benchmarked our proposed DDL R-CNN against 8 target detection networks on the UCAS-AOD dataset and 14 methodologies on the HRSC2016 dataset. Table 4 reports the detailed comparison results for UCAS-AOD. On the UCAS-AOD dataset, our method achieved precision rates of 90.73% for vehicles and 90.91% for aircraft, with an overall mAP of 90.82%. Compared to advanced methods such as RetinaNet, RoI Trans, RIDet-Q, and CFC-Net, our method outperformed them in average precision by 4.32%, 1.87%, 1.59%, and 1.33%, respectively. The experimental results are shown in Figure 4, demonstrating our method's effective detection of both aircraft and cars. The three images on the left half show detection boxes closely fitting the edges of aircraft of various sizes and orientations, including densely packed or interlaced areas where overlapping regions are distinguished by the detection boxes. The three images on the right half show similar results, with densely distributed and multi-directional cars being represented by compact detection boxes. This outcome validates the efficacy of our approach in tackling complex detection tasks that encompass multi-directional, densely packed, or small targets; shape-like distractors; and occlusions due to light or shadows.
The comparative visualization results on the UCAS-AOD dataset are depicted in Figure 5, illustrating the detection outcomes of RetinaNet, R2CNN, and O-RCNN in contrast with the method proposed in this study. These methods exhibit significant differences in terms of anchor box size convergence and angular accuracy. Specifically, RetinaNet misidentifies targets whose color is similar to the background, as observable in the first and third images of the first row, and it also misses small targets, as shown in the second and fourth images of the first row, indicating its limitations in detecting small-sized targets against complex backgrounds. The detection results of R2CNN reveal the network's performance constraints under specific conditions. When the target color is similar to the background color, R2CNN's detection is also suboptimal: as seen in the first and second images of the first row, the network fails to accurately distinguish targets that are close in color to the background, demonstrating that R2CNN's recognition ability is compromised in low-color-contrast situations. Moreover, R2CNN also performs poorly on small targets, as illustrated in the fourth image of the first row, aligning with the challenges noted in the literature regarding small-target detection: small targets occupy fewer pixels in the image and have fewer distinct features, increasing the difficulty of detection. Although O-RCNN's detection results are similar to those of our method, that network still faces recognition challenges when the target color is similar to the background color, as indicated by the red dashed boxes in the first and third images of the first row: when targets are close in color to the background, O-RCNN's detection performance declines significantly. This challenge highlights that even advanced object detection algorithms struggle to avoid misjudgments and omissions in complex background conditions.
Furthermore, as shown in Table 5, our method also demonstrated excellent performance on the HRSC2016 dataset. Employing R-50-FPN as the backbone network, our method achieves an mAP of 97.81%, the highest value among all the methods compared. Compared with methods such as Gliding Vertex, PIoU, and R3Det, our method outperformed them in mAP by 9.61%, 8.61%, and 8.55%, respectively. The specific experimental results are shown in Figure 6, where the detection boxes tightly mark the direction and edges of the ships and show strong discrimination for similar objects, such as elongated piers, warehouses, water waves, and containers. Similar to the situation on the UCAS-AOD dataset, the results on this dataset indicate that existing methods with rotating convolutional kernels, such as Rotated RPN, use multiple predefined and fixed rotation angles for feature extraction. In port scenarios, the orientations of ships are arbitrarily distributed from 0° to 360°, and these predefined rotation angles are difficult to align effectively with the actual ships, leading to inaccurate directional feature information during feature extraction. Moreover, our method refines the rotation by incorporating the extracted directional features on the target's bounding rectangle, resulting in more compact rotation boxes than direct angle-based rotation methods such as PIoU, and thus enabling the extraction of more accurate features.
Further comparative visualization results on the HRSC2016 dataset are shown in Figure 7, where the detection outcomes of RetinaNet, R2CNN, and O-RCNN are contrasted with the method proposed in this study. Significant differences are observed in the aspects of small-target recognition, anchor box convergence, and the accuracy of rotation direction. Specifically, the method proposed in this study excels in the recognition of small targets. For instance, as demonstrated by the red dashed box in the first image of the first row, RetinaNet failed to detect a small target that was successfully identified by our method. RetinaNet also shows insufficient performance in identifying targets that are similar in color to the background, as revealed by the red dashed box in the third image of the first row. Moreover, the misjudgment phenomenon of RetinaNet in the recognition of similar target objects, as shown by the red dashed box in the last image of the first row, is avoided by our method. Compared with R2CNN, although the overall detection performance is similar, specific cases reveal its limitations, such as the misidentification of a pier as a ship in the second image of the first row, and the failure to accurately identify ships, as shown by the red dashed boxes in the third and fourth images of the first row, indicating R2CNN's deficiency in anchor box convergence. Compared with O-RCNN, the advantage of our method in small-target recognition is evident, as shown in the first image of the first row, where O-RCNN failed to effectively recognize small-sized targets and also exhibited misjudgment in similar-target recognition, as demonstrated in the third image of the first row. Overall, our method outperforms these three networks in terms of overall anchor box convergence and the accuracy of rotation direction, demonstrating the superiority of our method in object detection tasks.
Table 6 provides a complexity analysis of existing methods trained on the UCAS-AOD dataset, including FLOPs, Params, and mAP. Compared with two-stage algorithms such as RoI Transformer, our algorithm achieves a precision increase of 1.87%. Moreover, compared with advanced single-stage algorithms such as R3Det-KLD, our algorithm reduces FLOPs by 124.03 G and improves mAP by 6.9%. Compared with single-stage detection methods, DDL R-CNN demonstrates significant advantages in accuracy and feature fusion, effectively enhancing the recognition and localization of targets in complex scenes through two-stage refined processing. At the same time, compared with other two-stage detection methods, DDL R-CNN maintains high accuracy while optimizing parameters and computational complexity, achieving faster inference speeds and making it more advantageous in applications requiring real-time performance. Furthermore, DDL R-CNN effectively mitigates overfitting through data augmentation and regularization techniques, enhancing the model's generalization and robustness. Finally, these results further validate the accuracy of our method in extracting multi-directional features in high-resolution RS image detection tasks with multiple interferences.
5. Conclusions
This paper introduces a novel two-stage rotating detection network called DDL R-CNN for detecting multi-directional image targets. The method consists of a dynamic direction learning module and a boundary center region offset generation network, which address the angle matching issues at the feature level and the training instability issues at the sample level present in existing methods, and it has been validated through extensive experimentation. The experimental results showed an mAP of 90.82% on the UCAS-AOD dataset and 97.81% on the HRSC2016 dataset. Compared with existing advanced two-stage rotational detection networks, the proposed method significantly improves the detection accuracy of multi-directional targets, and it maintains a certain level of competitiveness in efficiency compared with single-stage rotational detection networks.
Future research is expected to be carried out in the following directions. Cross-domain applications could see DDL R-CNN applied to the medical image analysis field, such as in tumor detection or organ segmentation, where the adaptive nature of the network could significantly enhance detection accuracy due to the varied shapes and orientations of targets. Video object detection is another area where DDL R-CNN could be explored, particularly in dynamically changing scenes for detecting and tracking targets, with the integration of temporal information allowing the model to better handle target motion and changes. Additionally, small-object detection in specific scenarios like surveillance videos or drone-captured images remains a challenge, and research could focus on optimizing DDL R-CNN to improve recognition capabilities for these small targets. Lastly, multimodal data fusion, combining DDL R-CNN with other data types such as LiDAR and infrared images, could be investigated to enhance the model’s understanding of complex scenes. These directions not only pave the way for more accurate and robust detection systems but also expand the applicability of DDL R-CNN across various domains.