Object Detection Based on Global-Local Saliency Constraint in Aerial Images
Abstract
1. Introduction
- Intricate background noise. Complex background information reduces the accuracy of the region proposals, so the extracted features contain more noise.
- Drastic scale change. Because the flight trajectories and sensors of different aircraft are not completely consistent, almost every aerial image product has its own resolution and imaging characteristics. Moreover, scale varies greatly even within a single category: warships and fishing vessels are both classified as ships, yet the number of pixels they occupy differs considerably.
- Lack of consideration of the semantic information between the scene and objects. In remote sensing images, there are scene-object semantic relationships between, for example, aircraft and airports, cars and roads or parking lots, ships and water, and bridges and water. R-CNN-based algorithms divide the image into regions and then extract features, discarding scene information that could serve as a constraint.
- Arbitrary target orientation and dense arrangement. Because the image acquisition platform may fly over the target from any angle, the target may have an arbitrary orientation under the overhead view. In addition, scenarios such as ships berthed in sequence at a wharf or vehicles densely arranged on crowded urban roads greatly increase the difficulty of target detection, and missed detections can easily occur.
- Lack of aerial image datasets available for neural networks. Compared with natural image datasets, relatively few training datasets of aerial images are available for neural networks.
- We propose the use of a saliency pyramid, which makes the target pixels more distinct from the background.
- We propose a global attention module, called GA-Net, to constrain the semantic information of the target in the global context; a fast fusion strategy combines the global information with the object features.
- During the inference stage, we propose the use of an angle-sensitive IoU algorithm to obtain oriented bounding boxes that are as accurate as possible.
2. Related Work
2.1. Object Detection of Aerial Images
2.2. Saliency Detection
2.3. Attention Mechanism
3. Proposed Framework
3.1. Saliency Pyramid
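Only the outline of this subsection survives here. As a rough illustration of the first contribution above, the following is a minimal sketch, assuming the frequency-tuned saliency model of Achanta et al. (cited in the references) computed over an image pyramid and fused by simple averaging; the function names and the fusion weights are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of a saliency pyramid (assumed design): frequency-tuned
# saliency computed at several pyramid levels and fused by averaging.
import cv2
import numpy as np

def frequency_tuned_saliency(img_bgr):
    """Per-pixel saliency as the distance to the mean Lab color (Achanta et al.)."""
    lab = cv2.cvtColor(cv2.GaussianBlur(img_bgr, (5, 5), 0),
                       cv2.COLOR_BGR2LAB).astype(np.float32)
    mean = lab.reshape(-1, 3).mean(axis=0)
    sal = np.linalg.norm(lab - mean, axis=2)
    return sal / (sal.max() + 1e-8)             # normalize to [0, 1]

def saliency_pyramid(img_bgr, levels=3):
    """Fuse saliency maps computed at each pyramid level back at full size."""
    h, w = img_bgr.shape[:2]
    maps, cur = [], img_bgr
    for _ in range(levels):
        sal = frequency_tuned_saliency(cur)
        maps.append(cv2.resize(sal, (w, h)))     # upsample to original size
        cur = cv2.pyrDown(cur)                   # move to the next (coarser) level
    return np.mean(maps, axis=0)

# Usage: weight the image by its fused saliency so that target pixels become
# more distinct from the background before detection.
# img = cv2.imread("aerial.png")
# enhanced = (img * saliency_pyramid(img)[..., None]).astype(np.uint8)
```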
3.2. Global Attention Network
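Again, only a hedged sketch: one plausible reading of a global attention module is in the spirit of squeeze-and-excitation (Hu et al., cited in the references), where a globally pooled context vector re-weights the channels of the local feature map, injecting scene-level semantics into the object features. The class name, reduction ratio, and pooling choice are assumptions; GA-Net's actual architecture and fusion strategy are described in the paper itself.

```python
# Sketch of a global attention block: global average pooling summarizes the
# scene, and the resulting channel weights re-scale the local feature map.
import torch
import torch.nn as nn

class GlobalAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # squeeze
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # excite
            nn.Sigmoid(),
        )

    def forward(self, x):                    # x: (N, C, H, W) feature map
        n, c, _, _ = x.shape
        context = x.mean(dim=(2, 3))         # global context vector, shape (N, C)
        weights = self.fc(context).view(n, c, 1, 1)
        return x * weights                   # fuse global info with local features
```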
3.3. Angle-Sensitive Non-Maximum Suppression (NMS)
- Step 1: Sort all the bounding boxes in descending order of score; the sorted boxes form set B.
- Step 2: Move the first (highest-scoring) bounding box b from set B into the output set D, and calculate the IoU between b and each remaining box in B in order. If the IoU is greater than the threshold (usually 0.5), the box is removed from set B; otherwise, it is kept.
- Step 3: Select the next remaining bounding box, put it into set D, and continue from Step 2 until set B is empty (see the sketch after these steps).
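A minimal sketch of these three steps for oriented bounding boxes, assuming boxes encoded as (cx, cy, w, h, angle) and plain polygon IoU computed with shapely. The helper names are illustrative, and the paper's angle-sensitive IoU presumably replaces the plain overlap term with an orientation-aware one (see Section 4.4.3):

```python
import numpy as np
from shapely.geometry import Polygon

def obb_polygon(box):
    """Shapely polygon for an oriented box (cx, cy, w, h, angle in radians)."""
    cx, cy, w, h, a = box
    c, s = np.cos(a), np.sin(a)
    dx = np.array([-w, w, w, -w]) / 2
    dy = np.array([-h, -h, h, h]) / 2
    return Polygon(np.stack([cx + c * dx - s * dy,
                             cy + s * dx + c * dy], axis=1))

def rotated_iou(box_a, box_b):
    """Plain polygon IoU between two oriented boxes."""
    pa, pb = obb_polygon(box_a), obb_polygon(box_b)
    inter = pa.intersection(pb).area
    return inter / (pa.area + pb.area - inter + 1e-8)

def rotated_nms(boxes, scores, thresh=0.5):
    order = list(np.argsort(scores)[::-1])   # Step 1: descending scores (set B)
    keep = []                                # output set D
    while order:
        i = order.pop(0)                     # Step 2: best remaining box into D
        keep.append(i)
        order = [j for j in order            # drop boxes overlapping it too much
                 if rotated_iou(boxes[i], boxes[j]) <= thresh]
    return keep                              # Step 3: repeat until B is empty
```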
4. Experiments and Results
4.1. Dataset
4.2. Evaluation Metrics
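This subsection survives only as a heading; for reference, DOTA detection results are conventionally scored as mean average precision (mAP) over the 15 categories, counting a detection as correct when its IoU with a ground-truth box exceeds 0.5. A minimal VOC-style all-point interpolated AP computation, with an illustrative helper name:

```python
import numpy as np

def average_precision(recall, precision):
    """All-point interpolated AP from recall/precision sorted by confidence."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(p) - 2, -1, -1):      # enforce monotonically decreasing precision
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]       # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP is the mean of the per-category APs, e.g., over the 15 classes reported
# in the comparison table of Section 4.5.
```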
4.3. Implementation Details
4.4. Ablation Studies
4.4.1. Saliency Pyramid
4.4.2. Global Attention Network
4.4.3. Angle-Sensitive IoU
4.5. Comparison with the State-of-the-Art
5. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
- Zhang, X.; Zhu, K.; Chen, G.; Tan, X.; Zhang, L.; Dai, F.; Liao, P.; Gong, Y. Geospatial Object Detection on High Resolution Remote Sensing Imagery Based on Double Multi-Scale Feature Pyramid Network. Remote Sens. 2019, 11, 755. [Google Scholar] [CrossRef] [Green Version]
- Chen, Z.; Zhang, T.; Ouyang, C. End-to-End Airplane Detection Using Transfer Learning in Remote Sensing Images. Remote Sens. 2018, 10, 139. [Google Scholar] [CrossRef] [Green Version]
- Ma, W.; Guo, Q.; Wu, Y.; Zhao, W.; Zhang, X.; Jiao, L. A Novel Multi-Model Decision Fusion Network for Object Detection in Remote Sensing Images. Remote Sens. 2019, 11, 737. [Google Scholar] [CrossRef] [Green Version]
- Liu, Z.; Zhao, D.; Shi, Z.; Jiang, Z. Unsupervised Saliency Model with Color Markov Chain for Oil Tank Detection. Remote Sens. 2019, 11, 1089. [Google Scholar] [CrossRef] [Green Version]
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–8 December 2012; pp. 1097–1105. [Google Scholar]
- Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26–30 June 2016; pp. 770–778. [Google Scholar]
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
- Xia, G.S.; Bai, X.; Zhang, L.P.; Belongie, S.; Pelillo, M. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3974–3983. [Google Scholar]
- Liu, Z.; Wang, H.; Weng, L.; Yang, Y. Ship Rotated Bounding Box Space for Ship Extraction From High-Resolution Optical Satellite Images With Complex Backgrounds. IEEE Geosci. Remote Sens. Lett. 2017, 13, 1074–1078. [Google Scholar] [CrossRef]
- Yang, F.; Xu, Q.; Li, B. Ship Detection From Optical Satellite Images Based on Saliency Segmentation and Structure-LBP Feature. IEEE Geosci. Remote Sens. Lett. 2017, 14, 602–606. [Google Scholar] [CrossRef]
- Cheng, G.; Han, J. A Survey on Object Detection in Optical Remote Sensing Images. ISPRS J. Photogramm. Remote Sens. 2016, 117, 11–28. [Google Scholar] [CrossRef] [Green Version]
- Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 32, 1627–1645. [Google Scholar] [CrossRef] [Green Version]
- Zhang, S.; Wu, R.; Xu, K.; Wang, J.; Sun, W. R-CNN-Based Ship Detection from High Resolution Remote Sensing Imagery. Remote Sens. 2019, 11, 631. [Google Scholar] [CrossRef] [Green Version]
- Sirmacek, B.; Unsalan, C. Urban-area and building detection using SIFT keypoints and graph theory. IEEE Trans. Geosci. Remote Sens. 2009, 47, 1156–1167. [Google Scholar] [CrossRef]
- Lyasheva, S.A.; Medvedev, M.V.; Shleimovich, M.P. Terrain object recognition in unmanned aerial vehicle control system. Russ. Aeronaut. 2014, 57, 303–306. [Google Scholar] [CrossRef]
- Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef] [Green Version]
- Sutskever, I.; Hinton, G.E.; Krizhevsky, A. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 60, 1097–1105. [Google Scholar]
- Garcia-Garcia, A.; Orts-Escolano, S.; Oprea, S.; Villena-Martinez, V.; Garcia-Rodriguez, J. A review on deep learning techniques applied to semantic segmentation. arXiv 2017, arXiv:1704.06857. [Google Scholar]
- Lecun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
- Luc, P.; Neverova, N.; Couprie, C.; Verbeek, J.; Lecun, Y. Predicting Deeper into the Future of Semantic Segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 648–657. [Google Scholar]
- Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680. [Google Scholar]
- Ristani, E.; Tomasi, C. Features for multi-target multi-camera tracking and re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6036–6046. [Google Scholar]
- Zhao, Z.Q.; Zheng, P.; Xu, S.T.; Wu, X. Object detection with deep learning: A review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 1–12. [Google Scholar] [CrossRef] [Green Version]
- Law, H.; Deng, J. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 734–750. [Google Scholar]
- Law, H.; Teng, Y.; Russakovsky, O.; Deng, J. CornerNet-Lite: Efficient Keypoint Based Object Detection. arXiv 2019, arXiv:1904.08900. [Google Scholar]
- Kong, T.; Sun, F.; Liu, H.; Jiang, Y.; Shi, J. FoveaBox: Beyond Anchor-based Object Detector. arXiv 2019, arXiv:1904.03797. [Google Scholar]
- Zhu, C.; He, Y.; Savvides, M. Feature selective anchor-free module for single-shot object detection. arXiv 2019, arXiv:1903.00621. [Google Scholar]
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
- Van de Sande, K.E.; Uijlings, J.R.; Gevers, T.; Smeulders, A.W. Segmentation as selective search for object recognition. In Proceedings of the IEEE International Conference on Computer Vision, Barcelona, Spain, 6–11 November 2011; p. 7. [Google Scholar]
- Uijlings, J.R.; Van De Sande, K.E.; Gevers, T.; Smeulders, A.W. Selective search for object recognition. Int. J. Comput. Vis. 2013, 104, 154–171. [Google Scholar] [CrossRef] [Green Version]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
- Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162. [Google Scholar]
- Sermanet, P.; Eigen, D.; Zhang, X.; Mathieu, M.; Fergus, R.; Lecun, Y. Overfeat: Integrated recognition, localization and detection using convolutional networks. In Proceedings of the International Conference on Learning Representations, Banff, AB, Canada, 14–26 April 2014. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26–30 June 2016; pp. 779–788. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
- Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot Multibox Detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37. [Google Scholar]
- Xu, J.; Sun, X.; Zhang, D.; Fu, K. Automatic detection of inshore ships in high-resolution remote sensing images using robust invariant generalized Hough transform. IEEE Geosci. Remote Sens. Lett. 2014, 11, 2070–2074. [Google Scholar]
- Cheng, G.; Han, J.; Zhou, P.; Guo, L. Multi-class geospatial object detection and geographic image classification based on collection of part detectors. ISPRS J. Photogramm. Remote Sens. 2014, 98, 119–132. [Google Scholar] [CrossRef]
- Tayara, H.; Chong, K. Object detection in very high-resolution aerial images using one-stage densely connected feature pyramid network. Sensors 2018, 18, 3341. [Google Scholar] [CrossRef] [Green Version]
- Van Etten, A. You only look twice: Rapid multi-scale object detection in satellite imagery. arXiv 2018, arXiv:1805.09512. [Google Scholar]
- Xu, Z.; Xu, X.; Wang, L.; Yang, R.; Pu, F. Deformable convnet with aspect ratio constrained nms for object detection in remote sensing imagery. Remote Sens. 2017, 9, 1312. [Google Scholar] [CrossRef] [Green Version]
- Ren, Y.; Zhu, C.; Xiao, S. Deformable Faster R-CNN with aggregating multi-layer features for partially occluded object detection in optical remote sensing images. Remote Sens. 2018, 10, 1470. [Google Scholar] [CrossRef] [Green Version]
- Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
- Xiao, Z.; Gong, Y.; Long, Y.; Li, D.; Wang, X.; Liu, H. Airport detection based on a multiscale fusion feature for optical remote sensing images. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1469–1473. [Google Scholar] [CrossRef]
- Deng, Z.; Sun, H.; Zhou, S.; Zhao, J.; Zou, H. Toward fast and accurate vehicle detection in aerial images using coupled region-based convolutional neural networks. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 3652–3664. [Google Scholar] [CrossRef]
- Ren, Y.; Zhu, C.; Xiao, S. Small object detection in optical remote sensing images via modified faster R-CNN. Appl. Sci. 2018, 8, 813. [Google Scholar] [CrossRef] [Green Version]
- Jiang, Y.; Zhu, X.; Wang, X.; Yang, S.; Li, W.; Wang, H.; Fu, P.; Luo, Z. R2CNN: Rotational Region CNN for Orientation Robust Scene Text Detection. arXiv 2017, arXiv:1706.09579. [Google Scholar]
- Yang, X.; Yang, J.; Yan, J.; Zhang, Y.; Zhang, T.; Guo, Z.; Xian, S.; Fu, K. SCRDet: Towards More Robust Detection for Small, Cluttered and Rotated Objects. arXiv 2018, arXiv:1811.07126. [Google Scholar]
- Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 4278–4284. [Google Scholar]
- Corbetta, M.; Shulman, G.L. Control of goal-directed and stimulus-driven attention in the brain. Nat. Rev. Neurosci. 2002, 3, 201. [Google Scholar] [CrossRef]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
- Li, K.; Cheng, G.; Bu, S.; You, X. Rotation-insensitive and context-augmented object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2017, 56, 2337–2348. [Google Scholar] [CrossRef]
- Li, X.; Wang, S. Object detection using convolutional neural networks in a coarse-to-fine manner. IEEE Geosci. Remote Sens. Lett. 2014, 14, 2037–2041. [Google Scholar] [CrossRef]
- Azimi, S.M.; Vig, E.; Bahmanyar, R.; Körner, M.; Reinartz, P. Towards multi-class object detection in unconstrained remote sensing imagery. In Proceedings of the IEEE Asian Conference on Computer Vision, Perth, Australia, 4–6 December 2018; pp. 150–165. [Google Scholar]
- Liu, L.; Pan, Z.; Lei, B. Learning a rotation invariant detector with rotatable bounding box. arXiv 2017, arXiv:1711.09405. [Google Scholar]
- Zhang, Z.; Guo, W.; Zhu, S.; Yu, W. Toward arbitrary-oriented ship detection with rotated region proposal and discrimination networks. IEEE Geosci. Remote Sens. Lett. 2018, 15, 1745–1749. [Google Scholar] [CrossRef]
- Liu, Z.; Hu, J.; Weng, L.; Yang, Y. Rotated region based CNN for ship detection. In Proceedings of the IEEE International Conference on Image Processing, Beijing, China, 17–20 September 2017; pp. 900–904. [Google Scholar]
- Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning roi transformer for detecting oriented objects in aerial images. arXiv 2018, arXiv:1812.00155. [Google Scholar]
- Xie, Y.; Lu, H.; Yang, M.H. Bayesian saliency via low and mid level cues. IEEE Trans. Image Process. 2012, 22, 1689–1698. [Google Scholar] [PubMed]
- Qi, W.; Cheng, M.M.; Borji, A.; Lu, H.; Bai, L.F. SaliencyRank: Two-stage manifold ranking for salient object detection. Comput. Vis. Media 2015, 1, 309–320. [Google Scholar] [CrossRef] [Green Version]
- Zhai, Y.; Shah, M. Visual attention detection in video sequences using spatiotemporal cues. In Proceedings of the 14th ACM international conference on Multimedia, Santa Barbara, CA, USA, 23–27 October 2006; pp. 815–824. [Google Scholar]
- Achanta, R.; Hemami, S.; Estrada, F.; Süsstrunk, S. Frequency-tuned salient region detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 1597–1604. [Google Scholar]
- Cheng, M.M.; Mitra, N.J.; Huang, X.; Torr, P.H.; Hu, S.M. Global contrast based salient region detection. IEEE Trans. Pattern Anal. Mach Intell. 2014, 37, 569–582. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Hou, Q.; Cheng, M.M.; Hu, X.; Borji, A.; Tu, Z.; Torr, P.H. Deeply supervised salient object detection with short connections. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3203–3212. [Google Scholar]
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 8–12 June 2015; pp. 3431–3440. [Google Scholar]
- Sun, P.; Chen, G.; Luke, G.; Shang, Y. Salience Biased Loss for Object Detection in Aerial Images. arXiv 2018, arXiv:1810.08103. [Google Scholar]
- Rensink, R.A. The dynamic representation of scenes. Vis. Cognit. 2000, 7, 17–42. [Google Scholar] [CrossRef]
- Larochelle, H.; Hinton, G.E. Learning to combine foveal glimpses with a third-order Boltzmann machine. In Proceedings of the International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–11 December 2010; pp. 1243–1251. [Google Scholar]
- Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Liu, W.; Chua, T.S. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5659–5667. [Google Scholar]
- Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; Tang, X. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3156–3164. [Google Scholar]
- Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803. [Google Scholar]
- Du, Y.; Yuan, C.; Li, B.; Zhao, L.; Li, Y.; Hu, W. Interaction-aware spatio-temporal pyramid attention networks for action classification. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 373–389. [Google Scholar]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
- Woo, S.; Park, J.; Lee, J.Y.; So Kweon, I. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
- Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3146–3154. [Google Scholar]
- Felzenszwalb, P.F.; Huttenlocher, D.P. Efficient graph-based image segmentation. Int. J. Comput. Vis. 2004, 59, 167–181. [Google Scholar] [CrossRef]
- Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI Transformer for Oriented Object Detection in Aerial Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 2844–2853. [Google Scholar]
- Ma, J.Q.; Shao, W.Y.; Ye, H.; Wang, L.; Wang, H.; Zheng, Y.B.; Xue, X.Y. Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans. Multimedia 2018, 20, 3111–3122. [Google Scholar] [CrossRef] [Green Version]
- Yang, X.; Sun, H.; Sun, X.; Yan, M.; Guo, Z.; Fu, K. Position detection and direction prediction for arbitrary-oriented ships via multitask rotation region convolutional neural network. IEEE Access 2018, 6, 50839–50849. [Google Scholar] [CrossRef]
- Yang, X.; Sun, H.; Fu, K.; Yang, J.; Sun, X.; Yan, M.; Guo, Z. Automatic Ship Detection in Remote Sensing Images from Google Earth of Complex Scenes Based on Multiscale Rotation Dense Feature Pyramid Networks. Remote Sens. 2018, 10, 132. [Google Scholar] [CrossRef] [Green Version]
- Yang, F.; Li, W.; Hu, H.; Li, W.; Wang, P. Multi-Scale Feature Integrated Attention-Based Rotation Network for Object Detection in VHR Aerial Images. Sensors 2020, 20, 1686. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Table: Comparison with the state of the art on the DOTA dataset (per-category AP, %). Category abbreviations: BD = baseball diamond, GTF = ground track field, SV = small vehicle, LV = large vehicle, TC = tennis court, BC = basketball court, ST = storage tank, SBF = soccer-ball field, RA = roundabout, SP = swimming pool, HC = helicopter.

Method | Plane | BD | Bridge | GTF | SV | LV | Ship | TC | BC | ST | SBF | RA | Harbor | SP | HC | mAP |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FR-O [36,85] | 79.42 | 77.13 | 17.7 | 64.05 | 35.3 | 38.02 | 37.16 | 89.41 | 69.64 | 59.28 | 50.3 | 52.91 | 47.89 | 47.4 | 46.3 | 54.13 |
RRPN [85,86] | 80.94 | 65.75 | 35.34 | 67.44 | 59.92 | 50.91 | 55.81 | 90.67 | 66.92 | 72.39 | 55.06 | 52.23 | 55.14 | 53.35 | 48.22 | 61.01 |
R2CNN [54,85] | 88.52 | 71.2 | 31.66 | 59.3 | 51.85 | 56.19 | 57.25 | 90.81 | 72.84 | 67.38 | 56.69 | 52.84 | 53.08 | 51.94 | 53.58 | 60.67 |
Yang et al. [87] | 81.52 | 71.41 | 36.53 | 67.44 | 61.16 | 50.91 | 56.60 | 90.67 | 68.09 | 72.39 | 55.06 | 55.60 | 62.44 | 53.35 | 51.47 | 62.29 |
R-DFPN [85,88] | 80.92 | 65.82 | 33.77 | 58.94 | 55.77 | 50.94 | 54.78 | 90.33 | 66.34 | 68.66 | 48.73 | 51.76 | 55.1 | 51.32 | 35.88 | 57.94 |
RoITransformer [85] | 88.64 | 78.52 | 43.44 | 75.92 | 68.81 | 73.68 | 83.59 | 90.74 | 77.27 | 81.46 | 58.39 | 53.54 | 62.83 | 58.93 | 47.67 | 69.56 |
Azimi et al. [61] | 81.36 | 74.30 | 47.70 | 70.32 | 64.89 | 67.82 | 69.98 | 90.76 | 79.06 | 78.20 | 53.64 | 62.90 | 67.02 | 64.17 | 50.23 | 68.16 |
MFIAR-Net [89] | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 72.51 |
SCRDet [55] | 89.98 | 80.65 | 52.09 | 68.36 | 68.36 | 60.32 | 72.41 | 90.85 | 87.94 | 86.86 | 65.02 | 66.68 | 66.25 | 68.24 | 65.21 | 72.61 |
baseline | 87.84 | 76.44 | 45.21 | 67.73 | 64.32 | 59.71 | 68.16 | 88.61 | 78.72 | 79.13 | 52.34 | 58.79 | 64.51 | 67.18 | 44.54 | 66.88 |
ours(SP) | 88.02 | 76.61 | 46.13 | 69.53 | 67.65 | 68.25 | 71.04 | 90.88 | 78.84 | 81.92 | 55.15 | 59.50 | 64.70 | 68.20 | 49.07 | 69.03 |
ours(GA) | 88.28 | 77.12 | 47.87 | 70.63 | 66.33 | 66.37 | 70.19 | 90.86 | 79.64 | 79.47 | 57.35 | 60.88 | 66.17 | 67.33 | 48.56 | 69.14 |
ours(GA+SP) | 89.17 | 77.40 | 51.25 | 71.03 | 73.32 | 72.10 | 84.76 | 90.87 | 80.43 | 85.39 | 58.31 | 62.27 | 67.58 | 70.69 | 60.41 | 72.99 |
ours(GLS-Net) | 88.65 | 77.40 | 51.20 | 71.03 | 73.30 | 72.16 | 84.68 | 90.87 | 80.43 | 85.38 | 58.33 | 62.27 | 67.58 | 70.69 | 60.42 | 72.96 |
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).