1. Introduction
With the growth of global industry, the production of plastic garbage has increased enormously. Such garbage threatens the ecological environment and has caused increasingly serious pollution problems, especially in water bodies [1,2]. Garbage discharged into water is generally sparse and dispersed, which makes mechanical cleaning difficult and leaves manual cleaning as the default. Cleaning up underwater garbage also takes longer and costs more than cleaning up garbage on land. Worse, underwater garbage can persist for years to decades, harming water quality [3]. It may entangle aquatic animals or be accidentally ingested by them [4], causing death and disturbing the ecological balance. In addition, ship propellers can become wound and jammed by discarded nets, endangering the voyage. Consequently, it is necessary to clean up underwater garbage efficiently.
With the development of robotics, artificial intelligence, and autonomous driving in recent years [5,6], it has become possible to deploy intelligent robots for underwater garbage cleaning. To improve the environmental perception of such a robot, the primary technical requirement is to locate and recognize underwater garbage accurately and efficiently. For this purpose, image segmentation is preferable to classical deep-learning-based target detection [7], because it computes accurate, refined target edges [8]. With the obtained shape and edge information, the underwater garbage cleaning robot can execute more reasonable and precise operations. Common underwater garbage, such as plastic bags, ropes, fishing nets, and bricks, has the following characteristics:
Similar underwater garbage varies in size and scale;
Some underwater garbage has an unfixed shape;
Targets occupy only a small area of an image; hence, the imbalance between target and background pixels is prominent.
Underwater images are also affected by many limiting conditions. As light propagates through water, it is absorbed and scattered by the medium, and turbid water and other suspended matter further degrade the images. The resulting visual degradation, which hampers image recognition tasks, mainly comprises color attenuation and shift, image turbidity, and low brightness. These are the problems that must be solved for underwater image segmentation tasks. To detect underwater targets effectively in different situations, a highly robust system is necessary, while a larger dataset can also mitigate the effects of the complex underwater environment.
Image segmentation tasks can be divided into two categories according to their output: non-semantic segmentation and semantic segmentation. Non-semantic segmentation outputs the edge regions or contour lines of the segmented target without category information. The active contour method starts from a predefined closed contour and computes the actual contour of the target through an energy function; Ge et al. proposed a pre-fitted energy-driven active contour model with adaptive edge indicator functions to accelerate segmentation and reduce the number of iterations [9]. The level set approach can solve the intensity inhomogeneity problem in real-world images: an adaptive data-driven term optimizes the algorithm's parameters to segment targets of different sizes and features, and an additive bias reduces illumination interference, but this method cannot handle multicolor images [10]. Semantic segmentation outputs both the segmented contour and the category of each segmented pixel, so it suits multi-category segmentation tasks for specific targets; most such methods apply neural networks, whose effectiveness depends on the network design and the training dataset.
In this situation, semantic segmentation is an appropriate way to improve underwater garbage detection, helping the robot extract more target edge and shape information and improving the detection accuracy for targets of various scales. To accomplish this goal, the fundamental architecture is built on the U-Net network, which is widely used for biomedical image segmentation tasks [11]. The primary features of the U-Net architecture are its symmetric channels on both sides and the skip connections that merge feature maps at each scale of the network. Its multilayer convolutional structure at different scales lets the entire network preserve features at different scales: small targets are captured by the high-level layers while large targets are captured by the low-level layers, so the network achieves impressive pixel-level segmentation of multiscale targets [12]. Since the underwater garbage targets in this project require a network that can capture features at both large and small scales, U-Net is an appropriate candidate.
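This multiscale behaviour can be illustrated by tracing feature-map shapes through a symmetric encoder-decoder of the U-Net type. The sketch below is illustrative only: the base channel width of 64 with doubling per level is the common U-Net/VGG-style convention, not necessarily the exact configuration used in this paper.

```python
def unet_shapes(size=512, base_channels=64, depth=4):
    """Trace (channels, height, width) through a symmetric U-Net-style
    encoder-decoder with `depth` downsampling steps."""
    encoder = []
    c, s = base_channels, size
    for _ in range(depth):
        encoder.append((c, s, s))   # feature map saved for the skip connection
        c, s = c * 2, s // 2        # downsample: half resolution, double channels
    bottleneck = (c, s, s)
    # The decoder mirrors the encoder: each upsampling step restores the
    # resolution and, after fusing the skip connection, the channel width
    # of the corresponding encoder level.
    decoder = [shape for shape in reversed(encoder)]
    return encoder, bottleneck, decoder

enc, mid, dec = unet_shapes()
# For a 512 x 512 input, the bottleneck sits at 32 x 32 with 1024 channels,
# while the shallow 512 x 512 maps preserve fine detail for small targets.
```

Because each decoder level receives the matching encoder map through a skip connection, detail lost during downsampling can be recovered at full resolution.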
In this paper, an improved U-Net structure is proposed for underwater garbage image semantic segmentation. First, the focal loss function and a data augmentation strategy are introduced to address the target–background imbalance problem. Meanwhile, the U-Net backbone network is rebuilt with reference to the VGG16 structure to solve the network capacity problem in the multitarget segmentation task [13]. The primary contributions of this paper are as follows:
The network structure of U-Net is improved specifically for underwater garbage targets with a stronger capacity to conduct multiclass segmentation tasks.
The underwater garbage semantic segmentation dataset is established to train and evaluate the proposed network, providing a solid experimental platform.
To solve the target–background imbalance problem, a dedicated data augmentation strategy is tightly combined with the focal loss function [14]. Experimental results demonstrate improvements in various evaluation indexes from applying this strategy.
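The augmentation operations behind this strategy are not spelled out here; one plausible component, sketched below under that assumption, is a target-centred random crop, which raises the fraction of target pixels in each training sample and thereby softens the target–background imbalance.

```python
import random

def target_centered_crop(mask, crop=4):
    """Crop a square window centred on a randomly chosen target pixel,
    so the crop contains a higher proportion of target (non-zero) labels.
    Illustrative sketch only; real pipelines crop image and mask together."""
    h, w = len(mask), len(mask[0])
    targets = [(r, c) for r in range(h) for c in range(w) if mask[r][c] != 0]
    if not targets:
        return mask                      # background-only sample: leave as-is
    r, c = random.choice(targets)
    half = crop // 2
    r0 = min(max(r - half, 0), h - crop) # clamp the window inside the mask
    c0 = min(max(c - half, 0), w - crop)
    return [row[c0:c0 + crop] for row in mask[r0:r0 + crop]]

# Toy 8 x 8 mask with a small 2 x 2 target labelled class 3 (e.g., a brick).
mask = [[0] * 8 for _ in range(8)]
mask[5][5] = mask[5][6] = mask[6][5] = mask[6][6] = 3
crop = target_centered_crop(mask)
ratio = sum(v != 0 for row in crop for v in row) / 16
```

In the toy mask, the target covers 4/64 of the full image but 4/16 of the crop, a fourfold increase in target pixel density.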
The remainder of this article is organized as follows. Section 2 introduces related works in computer vision and the U-Net architecture. In Section 3, the underwater garbage dataset is described and the redesign of the U-Net architecture is presented. Next, the experimental results based on the dataset are detailed in Section 4. Finally, Section 5 discusses the experimental results, and the conclusion and future work are summarized in Section 6.
4. Results
Experimental results of the modified U-Net method are described quantitatively and qualitatively in this section, based on the underwater garbage dataset, to confirm the network's effectiveness for segmentation tasks. Section 4.1 introduces the basic settings of the experiments, and Section 4.2 presents the results under various conditions.
4.1. Settings
The test dataset used for the experiments was collected in a cistern, about 1.5 m deep, by an underwater collection robot equipped with a binocular camera and a monocular camera [32].
A total of 350 images are used for training, 39 for validation, and 50 for testing. All images are resized to 512 × 512 RGB before being fed into the network. To evaluate the network training performance, the confusion matrix is used to compute precision, recall, F1-score, and intersection over union (IoU).
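The four indexes follow directly from per-class confusion-matrix counts (true positives, false positives, false negatives over pixels); the counts in the example below are hypothetical, not taken from the paper's tables.

```python
def segmentation_metrics(tp, fp, fn):
    """Per-class precision, recall, F1-score and IoU from
    pixel-level confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)        # intersection over union
    return precision, recall, f1, iou

# Hypothetical pixel counts for one class.
p, r, f1, iou = segmentation_metrics(tp=900, fp=100, fn=50)
```

Note that IoU is always the strictest of the four: it penalizes false positives and false negatives simultaneously, which is why it tends to be the lowest reported index.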
The model is trained for a total of 30 epochs using a two-stage training mode: a freeze training stage for the first 10 epochs and an unfreeze training stage for the remaining 20 epochs, each with its own learning rate. Pretrained weights are loaded at network initialization to improve training effectiveness, since the feature extraction is similar; this is especially helpful with small datasets, where initial weights are too random when training from scratch, making transfer learning the wiser choice.
4.2. Experimental Results
The network takes 1 min 20 s per training epoch on GPU (NVIDIA RTX3060), and the complete training takes 40 min. The network has almost converged after 20 epochs, and all rates stabilize before epoch 30, indicating that training is finished. The focal loss function is applied to reduce the imbalance between target and background pixels; in this training stage, the class-weight hyperparameters are set to 1.5:1.2:1.5 for plastic bags, ropes (or nets), and bricks. The evolution of precision, recall, and IoU during training and the stabilization of the loss over 30 epochs are shown in Figure 3.
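For a single pixel, the multiclass focal loss takes the form -α_t (1 - p_t)^γ log(p_t). The sketch below interprets the 1.5:1.2:1.5 ratios above as the per-class α weights, which is an assumption, as is the common default γ = 2 (the paper's γ is not stated here).

```python
import math

def focal_loss(probs, target, alpha, gamma=2.0):
    """Multiclass focal loss for one pixel:
    -alpha_t * (1 - p_t)**gamma * log(p_t), where p_t is the softmax
    probability of the true class. gamma=2 is a common default, assumed here."""
    p_t = probs[target]
    return -alpha[target] * (1 - p_t) ** gamma * math.log(p_t)

# Per-class weights for plastic bags, ropes/nets, bricks, as in the text.
alpha = [1.5, 1.2, 1.5]

# An easy pixel (confidently correct) versus a hard one (low confidence).
easy = focal_loss(probs=[0.05, 0.90, 0.05], target=1, alpha=alpha)
hard = focal_loss(probs=[0.60, 0.30, 0.10], target=1, alpha=alpha)
```

The (1 - p_t)^γ factor shrinks the loss of already well-classified pixels (typically abundant background) by orders of magnitude, so training gradients are dominated by hard, rare target pixels.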
As Figure 3 shows, the network achieves more than 87% precision, more than 95% recall, and more than 85% IoU for each category, as detailed in Table 2. The results indicate that ropes and nets are harder to segment accurately than the other categories, receiving the lowest precision and IoU; their ambiguous and complex boundaries may cause this.
The quantitative comparative experiments are conducted under the following conditions: (1) training with the original U-Net architecture; (2) replacing the focal loss with the cross-entropy loss function; (3) training with compressed images and with full-size images; (4) training with the SGD optimizer.
To verify the performance of the improved network, comparative experiments between the original U-Net and the modified one were conducted. Because the original U-Net only accepts single-channel grayscale images and produces binary outputs, in this test the original network was adapted to three-channel input and multi-category output while retaining the original backbone; its loss function is the cross-entropy loss. The experimental results are shown in Table 3. Compared with the original U-Net, the modified architecture yields a 10–20% increase in each index on the test dataset, indicating that the improvement is significant for the segmentation task.
Table 4 shows the results using cross-entropy as the loss function. Precision, recall, and IoU all drop by varying amounts compared with the proposed method, which indicates the effect of focal loss in mitigating the target–background imbalance and in making the network more sensitive to specific targets through its adjustable hyperparameters.
Table 5 and Table 6 show the results with compressed (256 × 256) and full-size (1280 × 480) input images, respectively. Compressed input increases the network's speed to 11 FPS but lowers precision, recall, and IoU, while full-size input decreases the speed to 4.3 FPS without improving the segmentation results.
Table 7 shows the results of applying the stochastic gradient descent (SGD) optimizer, with separate learning rates for the freeze and unfreeze training steps. The results are similar to training with the Adam optimizer, although SGD needs a larger learning rate to achieve effective gradient descent.
Some results on the test dataset, including input images, real masks, and segmentation outputs, are shown in Figure 4 for qualitative analysis.
Comparing the output images with the real masks shows that the segmentation performance across categories and conditions is satisfactory. Plastic bags have clear boundaries, whereas the complex boundaries of nets result in inaccurate segmentation. Small targets, such as a brick far from the camera, are harder to detect. Additionally, Figure 5 shows the segmentation results from the binocular camera and Figure 6 those from the monocular camera. The difference in segmentation results between binocular and monocular cameras needs further consideration: as shown in Figure 7, image stitching causes segmentation errors in binocular inputs, which we attribute to insufficient resolution. Therefore, choosing an input resolution appropriate to each robot, so that images stay within a resolution range that balances efficiency and accuracy, is an appropriate way to solve this problem.
5. Discussion
Computer vision has been applied to increasingly complex tasks thanks to advances in artificial intelligence, and has achieved better performance than humans in particular tasks. One common research interest is self-navigating autonomous robots using vision information. This paper proposes a modified U-Net architecture to accomplish efficient underwater garbage semantic segmentation. Compared with other semantic segmentation models, such as DeeplabV3+ [33], PSPNet [34], and SegFormer [35], the U-Net-based structure has a simple symmetric encoder–decoder architecture, making it easier to converge when training on small datasets. Meanwhile, this architecture can use our pretrained weights through transfer learning to accomplish the underwater garbage semantic segmentation task more efficiently.
Various conditions were tested and the results compared quantitatively to evaluate and obtain the best performance of the network. We tested the original U-Net in the same environment, and the experimental results on the test dataset show that the modified network is more precise in the underwater garbage segmentation task. The initial goal of this paper was to segment underwater targets in artificial water bodies, such as cisterns, open-air pools, and landscape lakes, and the effectiveness of the method was verified through experiments. During the experiments, we examined the improved method under various conditions to verify the effect of the focal loss function on small-target accuracy and the effect of multiscale input images on the network's outputs, and thereby arrived at an image segmentation method for underwater garbage that achieves good segmentation results.
The results demonstrate that the network performs acceptably on these tasks. With this method, an underwater cleaning robot can locate and clean garbage, successfully collecting even garbage without a fixed shape.
There are some limitations to this study. First, the lack of datasets is a major limitation for this study and for underwater vision research in general. Although the proposed method performs well on the segmentation of underwater garbage in a cistern, it still cannot accurately detect targets in complex natural water environments, which is the main limitation at the current stage. Fortunately, increasing numbers of underwater datasets have been released in recent years, and studies on augmenting existing datasets to simulate underwater images are growing rapidly.
There is still room for further research on general applicability to various kinds of garbage and natural working conditions. Future work may focus on dataset construction by collecting images under real-world working conditions and by augmenting other garbage datasets into underwater garbage datasets. Second, the segmentation network is less efficient than target detection networks: it runs at only 10 FPS on GPU, far from the 30 FPS required for real-time detection. We will test other novel semantic segmentation methods on underwater targets and pursue higher segmentation speed with the help of edge computing.
6. Conclusions and Future Work
In this study, we proposed a modified U-Net structure for the multiclass segmentation task of underwater garbage images. The focal loss function and a data augmentation strategy were introduced to solve the target–background imbalance problem. First, we used the monocular and binocular cameras of the underwater collection robot to construct the underwater garbage dataset and modified the U-Net architecture to accept three-channel input images and produce multi-category output; the necessity of these improvements was verified in comparison tests with the original U-Net. Second, we investigated the effects of the loss function, optimizer, and input scale on network performance and determined the final hyperparameters. Experimental results indicate that the network meets the requirements of the underwater garbage segmentation task.
Future work will concentrate on two parts. First, we will construct a larger dataset and train the model on it; other state-of-the-art semantic and instance segmentation methods, such as DeeplabV3+, PSPNet, and SegFormer, will be considered to improve performance on the underwater garbage segmentation task. Second, we will improve the network architecture and workflow to increase efficiency, then deploy it on underwater robots for practical testing in complex field scenarios.