Deep Learning-Based Frameworks for Semantic Segmentation of Road Scenes
1. Introduction
2. Datasets
2.1. Cambridge-Driving Labeled Video Database (CamVid)
2.2. KITTI
2.3. Microsoft Common Objects in Context (COCO)
2.4. Pascal Visual Object Classes (VOC)
2.5. Cityscapes
2.6. SYNTHetic Collection of Imagery and Annotations (SYNTHIA)
2.7. Grand Theft Auto 5 (GTA5)
2.8. Mapillary Vistas
2.9. ADE20K
2.10. SemanticKITTI
2.11. nuScenes
2.12. ApolloScape
3. Data Augmentation
3.1. Data Augmentation Techniques Based on Image Manipulation
3.2. Data Augmentation Techniques Based on Deep Learning
- Auxiliary Classifier GAN (ACGAN) [62]: In ACGAN, the discriminator both distinguishes real from generated data and predicts the class label of each sample, so a cross-entropy classification term is added to the loss function. In this way, the generator learns to produce class-representative, more realistic samples. ACGAN methods are also used to improve the training process of GANs.
- Data Augmentation GAN (DAGAN) [63]: DAGAN learns to generate synthetic images from a lower-dimensional representation of real images. Figure 4 shows its architecture. The generator consists of an encoder that takes a real image from a given class as input and projects it onto a lower-dimensional manifold (the bottleneck). A random vector is then transformed and concatenated with the bottleneck vector, and the result is passed to the decoder to generate an augmented image. The discriminator is trained to distinguish real images of the class from fake images produced by the generator. This training drives the network to generate, from existing images, new images that appear to belong to the same class, whatever that class is, yet look different enough to count as distinct samples.
- Balancing GAN (BAGAN) [64]: BAGAN is an augmentation tool that restores balance in imbalanced datasets. The challenge is that the few available minority-class images may not be enough to train a GAN. BAGAN addresses this by including all available images of both minority and majority classes during training. Figure 5 shows the three steps of the BAGAN training approach: autoencoder training, GAN initialization, and adversarial training. In this way, the model learns useful features from the majority classes and then uses them to generate images for the minority classes. To drive the generation process toward a target class, class conditioning is applied in the latent space. An autoencoder is used to initialize the generator, which helps it learn accurate class conditioning in the latent space.
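The two-part discriminator objective described for ACGAN above can be illustrated with a minimal sketch. It assumes a per-sample loss made of a real/fake term plus a class term; the function names and the example probabilities are hypothetical and are not taken from [62].

```python
import math

EPS = 1e-12  # guards against log(0)


def source_loss(p_real, is_real):
    """Binary cross-entropy on the real-versus-generated decision."""
    p = min(max(p_real, EPS), 1.0 - EPS)
    return -math.log(p) if is_real else -math.log(1.0 - p)


def class_loss(class_probs, label):
    """Cross-entropy on the auxiliary class prediction."""
    return -math.log(max(class_probs[label], EPS))


def acgan_discriminator_loss(p_real, class_probs, is_real, label):
    """Sum of the source term and the class term for one sample."""
    return source_loss(p_real, is_real) + class_loss(class_probs, label)


# Hypothetical discriminator outputs for one real image whose label is class 2.
loss_confident = acgan_discriminator_loss(0.9, [0.05, 0.05, 0.9], True, 2)
loss_unsure = acgan_discriminator_loss(0.5, [0.4, 0.3, 0.3], True, 2)
print(loss_confident, loss_unsure)
```

In this sketch, a discriminator that is confident and correct on both the source and the class of a sample receives a lower loss than one that is unsure about either.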
3.3. Comparing Data Augmentation Techniques
4. Domain Adaptation
- Self-Ensembling Attention Networks [71] is the first self-ensembling model introduced to domain adaptation. It aims to improve the learning of domain-invariant features. An attention mechanism is added to the model to generate attention-aware features, which are used to calculate the consistency loss in the target domain. The self-ensembling model has two major components: a student network, which serves as the base network, and a teacher network, which serves as the ensemble network. The student network learns from the teacher network through the consistency loss. As a result, the student network becomes more accurate and the teacher network's predictions get closer to the correct labels in the target domain, so domain-invariant features can be learned.
- Semantic-Edge Domain Adaptation [72] is the first attempt to use low-level edge information, which is easy to adapt across domains, to guide the transfer of high-level semantic information. The model uses an edge stream to process edge information and produce high-quality semantic boundaries over the target domain. An edge consistency loss aligns the target semantic predictions with the generated semantic boundaries, and two entropy reweighting methods further enhance the adaptation performance of the model.
- Self-Ensembling GAN (SE-GAN) [73] adopts a self-ensembling model as the generator of the adversarial network to improve adversarial training performance. SE-GAN has three major components: a student network, a teacher network, and a discriminator. Together, the student and teacher networks form a self-ensembling model that generates domain-invariant features, while the discriminator determines whether a segmentation map comes from the source domain or the target domain. In addition, the teacher network produces pseudo labels for the target-domain images, which are used for self-training of the student network on the target domain. Because SE-GAN combines two promising methods, self-ensembling and adversarial training, it gains the advantages of both and addresses their respective weaknesses.
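The student-teacher interplay described above can be sketched as follows. The sketch assumes, as in common mean-teacher style self-ensembling, that the teacher weights are an exponential moving average of the student weights and that the consistency loss is a mean squared difference between predictions. The text above does not state either detail, and the function names and values are hypothetical.

```python
def ema_update(teacher, student, decay=0.99):
    """Move each teacher weight a small step toward the matching student weight."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def consistency_loss(student_probs, teacher_probs):
    """Mean squared difference between student and teacher predictions."""
    diffs = [(s - t) ** 2 for s, t in zip(student_probs, teacher_probs)]
    return sum(diffs) / len(diffs)


# Toy weights: the teacher drifts toward the student after one update (assumed rule).
teacher_w = ema_update([0.0, 1.0], [1.0, 0.0], decay=0.9)

# Toy per-pixel class probabilities on a target-domain image.
loss = consistency_loss([0.7, 0.3], [0.6, 0.4])
print(teacher_w, loss)
```

Under these assumptions, minimizing the consistency loss pulls the student toward the teacher on unlabeled target-domain images, while the averaging keeps the teacher a smoothed ensemble of past student states.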
5. Frameworks
5.1. ParseNet
5.2. Pyramid Scene Parsing Network (PSPNet)
5.3. Bilateral Segmentation Network (BiSeNet)
5.4. DeepLabv3+
5.5. Mask R-CNN
5.6. Dual Attention Network (DANet)
5.7. Hybrid Task Cascade (HTC)
5.8. FastFCN
5.9. Gated Shape CNN (GSCNN)
5.10. ShelfNet
5.11. 3D-MiniNet
5.12. BlendMask
5.13. High-Resolution Network (HRNet)
5.14. Squeeze-and-Attention Network (SANet)
6. Discussion
6.1. Evaluation Metrics
6.1.1. Execution Time
6.1.2. Accuracy
6.1.3. Memory Footprint
6.2. Results
6.2.1. Semantic Segmentation
6.2.2. 3D Semantic Segmentation
6.2.3. Real-Time Semantic Segmentation
6.3. Summary
7. Future Directions
- Memory: Segmentation networks need significant amounts of memory for execution. Some devices have limited memory, so networks must be simplified to fit in them. Simplification usually reduces network complexity, which decreases accuracy. One of the most promising research directions is making a network more lightweight while preserving the accuracy of the original architecture [96,97,98].
- 3D Datasets: There is an urgent need for large-scale 3D datasets, because evaluating new segmentation techniques depends on 3D data. Even though there are some promising works, a need remains for more varied and better data. It is important to create 3D datasets from real data because most of the existing 3D datasets are synthetic [99,100,101].
- Real-Time Segmentation: Most segmentation implementations run far below the common camera frame rate, which is at least 25 fps. Currently, many of the existing frameworks take between 100 ms and 500 ms to process a low-resolution image. For that reason, there is a need for new works that focus on real-time segmentation while finding a trade-off between runtime and accuracy [102,103,104].
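The gap described in the Real-Time Segmentation item can be made concrete with a short calculation. The sketch below only converts between frame rate and per-frame latency; the helper names are illustrative.

```python
def frame_budget_ms(target_fps):
    """Time available per frame at a given frame rate."""
    return 1000.0 / target_fps


def fps_from_latency(latency_ms):
    """Frame rate achieved when each frame takes latency_ms to process."""
    return 1000.0 / latency_ms


budget = frame_budget_ms(25)  # per-frame budget at 25 fps
for latency in (100, 500):
    achieved = fps_from_latency(latency)
    slowdown = latency / budget
    print(f"{latency} ms per frame: {achieved} fps, {slowdown}x over budget")
```

At 25 fps the budget is 40 ms per frame, so a framework that needs 100 ms to 500 ms per image reaches only 10 fps down to 2 fps, which is 2.5 to 12.5 times slower than the camera.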
8. Conclusions
Author Contributions
Conflicts of Interest
References
- Oberweger, M.; Wohlhart, P.; Lepetit, V. Hands Deep in Deep Learning for Hand Pose Estimation. arXiv 2015, arXiv:1502.06807. [Google Scholar]
- Wan, J.; Wang, D.; Hoi, S.C.H.; Wu, P.; Zhu, J.; Zhang, Y.; Li, J. Deep Learning for Content-Based Image Retrieval: A Comprehensive Study. In Proceedings of the 22nd ACM International Conference on Multimedia, New York, NY, USA, 3–7 November 2014; pp. 157–166. [Google Scholar]
- Ess, A.; Müller, T.; Grabner, H.; van Gool, L. Segmentation-Based Urban Traffic Scene Understanding. In Proceedings of the 2009 British Machine Vision Conference, London, UK, 7–10 September 2009; p. 2. [Google Scholar]
- Geiger, A.; Lenz, P.; Urtasun, R. Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
- Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 3213–3223. [Google Scholar]
- Garcia-Garcia, A.; Orts-Escolano, S.; Oprea, S.; Villena-Martinez, V.; Garcia-Rodriguez, J. A Review on Deep Learning Techniques Applied to Semantic Segmentation. arXiv 2017, arXiv:1704.06857. [Google Scholar]
- Kirillov, A.; He, K.; Girshick, R.; Rother, C.; Dollar, P. Panoptic Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9396–9405. [Google Scholar]
- Cheng, B.; Collins, M.D.; Zhu, Y.; Liu, T.; Huang, T.S.; Adam, H.; Chen, L.C. Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12475–12485. [Google Scholar]
- Broggi, A.; Berte, S. Vision-Based Road Detection in Automotive Systems: A Real-Time Expectation-Driven Approach. J. Artif. Intell. Res. 1995, 3, 325–348. [Google Scholar] [CrossRef] [Green Version]
- Jyothi, S.; Padmavati, S.; Visvavidyalayam, M. A Survey on Threshold Based Segmentation Technique in Image Processing. Int. J. Innov. Res. 2014, 3, 234–239. [Google Scholar]
- Nath, S.S.; Mishra, G.; Kar, J.; Chakraborty, S.; Dey, N. A Survey of Image Classification Methods and Techniques. In Proceedings of the 2014 International Conference on Control, Instrumentation, Communication and Computational Technologies, Kanyakumari District, India, 10–11 July 2014; pp. 554–557. [Google Scholar]
- Gulhane, A.; Paikrao, P.L.; Chaudhari, D.S. A Review of Image Data Clustering Techniques. Int. J. Soft Comput. Eng. 2012, 2, 212–215. [Google Scholar]
- Olaode, A.; Naghdy, G.; Todd, C. Unsupervised Classification of Images: A Review. Int. J. Image Process. 2014, 8, 325–342. [Google Scholar]
- Peng, B.; Zhang, L.; Zhang, D. A survey of graph theoretical approaches to image segmentation. Pattern Recognit. 2013, 46, 1020–1038. [Google Scholar] [CrossRef] [Green Version]
- Prieto, A.; Prieto, B.; Ortigosa, E.M.; Ros, E.; Pelayo, F.; Ortega, J.; Rojas, I. Neural networks: An overview of early research, current frameworks and new challenges. Neurocomputing 2016, 214, 242–268. [Google Scholar] [CrossRef]
- Ning, F.; Delhomme, D.; LeCun, Y.; Piano, F.; Bottou, L.; Barbano, P.E. Toward automatic phenotyping of developing embryos from videos. IEEE Trans. Image Process. 2005, 14, 1360–1371. [Google Scholar] [CrossRef] [Green Version]
- Ciresan, D.; Giusti, A.; Gambardella, L.; Schmidhuber, J. Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images. Adv. Neural Inf. Process. Syst. 2012, 25, 2843–2851. [Google Scholar]
- Farabet, C.; Couprie, C.; Najman, L.; LeCun, Y. Learning Hierarchical Features for Scene Labeling. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 1915–1929. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Hariharan, B.; Arbeláez, P.; Girshick, R.; Malik, J. Simultaneous Detection and Segmentation. In Proceedings of the 13th European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 297–312. [Google Scholar]
- Gupta, S.; Girshick, R.; Arbeláez, P.; Malik, J. Learning Rich Features from RGB-D Images for Object Detection and Segmentation. In Proceedings of the 13th European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 345–360. [Google Scholar]
- Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the 28th IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
- Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. In Proceedings of the 28th IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Jégou, S.; Drozdzal, M.; Vazquez, D.; Romero, A.; Bengio, Y. The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 11–19. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar] [CrossRef]
- Choi, S.; Kim, J.T.; Choo, J. Cars Can’t Fly up in the Sky: Improving Urban-Scene Segmentation via Height-Driven Attention Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9373–9383. [Google Scholar]
- Brostow, G.J.; Fauqueur, J.; Cipolla, R. Semantic Object Classes in Video: A High-Definition Ground Truth Database. Pattern Recognit. Lett. 2009, 30, 88–97. [Google Scholar] [CrossRef]
- Sturgess, P.; Alahari, K.; Ladicky, L.; Torr, P.H.S. Combining Appearance and Structure from Motion Features for Road Scene Understanding. In Proceedings of the 2009 British Machine Vision Conference, London, UK, 7–10 September 2009. [Google Scholar]
- Alvarez, J.M.; Gevers, T.; Lecun, Y.; Lopez, A.M. Road Scene Segmentation from a Single Image. In Proceedings of the 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; pp. 376–389. [Google Scholar]
- Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. In Proceedings of the 13th European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
- Everingham, M.; Eslami, S.M.A.; van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes Challenge: A Retrospective. Int. J. Comput. Vis. 2015, 111, 98–136. [Google Scholar] [CrossRef]
- Ros, G.; Ramos, S.; Granados, M.; Bakhtiary, A.; Vazquez, D.; Lopez, A.M. Vision-Based Offline-Online Perception Paradigm for Autonomous Driving. In Proceedings of the 2015 IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 5–9 January 2015; pp. 231–238. [Google Scholar]
- Zhang, R.; Candra, S.A.; Vetter, K.; Zakhor, A. Sensor Fusion for Semantic Segmentation of Urban Scenes. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation, Seattle, WA, USA, 26–30 May 2015; pp. 1850–1857. [Google Scholar]
- Ros, G.; Sellart, L.; Materzynska, J.; Vazquez, D.; Lopez, A.M. The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3234–3243. [Google Scholar]
- Richter, S.R.; Vineet, V.; Roth, S.; Koltun, V. Playing for Data: Ground Truth from Computer Games. In Proceedings of the 2016 European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 102–118. [Google Scholar]
- Neuhold, G.; Ollmann, T.; Rotabuì, S.; Kontschieder, P. The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes. In Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4990–4999. [Google Scholar]
- Zhou, B.; Zhao, H.; Puig, X.; Fidler, S.; Barriuso, A.; Torralba, A. Scene Parsing through ADE20K Dataset. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 22–25 July 2017; pp. 633–641. [Google Scholar]
- Behley, J.; Garbade, M.; Milioto, A.; Quenzel, J.; Behnke, S.; Stachniss, C.; Gall, J. SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 9297–9307. [Google Scholar]
- Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. NuScenes: A Multimodal Dataset for Autonomous Driving. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11621–11631. [Google Scholar]
- Huang, X.; Wang, P.; Cheng, X.; Zhou, D.; Geng, Q.; Yang, R. The ApolloScape Open Dataset for Autonomous Driving and Its Application. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2702–2719. [Google Scholar] [CrossRef] [Green Version]
- Brostow, G.J.; Shotton, J.; Fauqueur, J.; Cipolla, R. Segmentation and Recognition Using Structure from Motion Point Clouds. In Proceedings of the 10th European Conference on Computer Vision, Marseille, France, 12–18 October 2008; pp. 44–57. [Google Scholar]
- Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision Meets Robotics: The KITTI Dataset. Int. J. Rob. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef] [Green Version]
- Ros, G.; Alvarez, J.M. Unsupervised Image Transformation for Outdoor Semantic Labelling. In Proceedings of the 2015 IEEE Intelligent Vehicles Symposium, Seoul, Korea, 28 June–1 July 2015; pp. 537–542. [Google Scholar]
- Wong, S.C.; Gatt, A.; Stamatescu, V.; McDonnell, M.D. Understanding Data Augmentation for Classification: When to Warp? In Proceedings of the 2016 International Conference on Digital Image Computing: Techniques and Applications, Gold Coast, Australia, 30 November–2 December 2016; pp. 1–6. [Google Scholar]
- Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. In Proceedings of the 28th Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680. [Google Scholar]
- Bagherinezhad, H.; Horton, M.; Rastegari, M.; Farhadi, A. Label Refinery: Improving ImageNet Classification through Label Progression. arXiv 2018, arXiv:1805.02641. [Google Scholar]
- Taylor, L.; Nitschke, G. Improving Deep Learning Using Generic Data Augmentation. In Proceedings of the 2018 IEEE Symposium Series on Computational Intelligence, Bangalore, India, 18–21 November 2018; pp. 1542–1547. [Google Scholar]
- Fei-Fei, L.; Fergus, R.; Perona, P. Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories. In Proceedings of the 2004 Conference on Computer Vision and Pattern Recognition Workshop, Washington, DC, USA, 27 June–2 July 2004; p. 178. [Google Scholar]
- Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; Yang, Y. Random Erasing Data Augmentation. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 13001–13008. [Google Scholar]
- Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
- Kang, G.; Dong, X.; Zheng, L.; Yang, Y. PatchShuffle Regularization. arXiv 2017, arXiv:1707.07103. [Google Scholar]
- Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images. Master’s Thesis, Computer Science Department, University of Toronto, Toronto, ON, Canada, 2009. [Google Scholar]
- DeVries, T.; Taylor, G.W. Dataset Augmentation in Feature Space. arXiv 2017, arXiv:1702.05538. [Google Scholar]
- LeCun, Y. The MNIST Database of Handwritten Digits. Available online: (accessed on 8 April 2022).
- Gatys, L.A.; Ecker, A.S.; Bethge, M. A Neural Algorithm of Artistic Style. arXiv 2015, arXiv:1508.06576. [Google Scholar] [CrossRef]
- Johnson, J.; Alahi, A.; Fei-Fei, L. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In Proceedings of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 694–711. [Google Scholar]
- Radford, A.; Metz, L.; Chintala, S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv 2015, arXiv:1511.06434. [Google Scholar]
- dos Tanaka, F.H.K.S.; Aranha, C. Data Augmentation Using GANs. arXiv 2019, arXiv:1904.09135. [Google Scholar]
- Odena, A.; Olah, C.; Shlens, J. Conditional Image Synthesis with Auxiliary Classifier GANs. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 2642–2651. [Google Scholar]
- Antoniou, A.; Storkey, A.; Edwards, H. Data Augmentation Generative Adversarial Networks. arXiv 2017, arXiv:1711.04340. [Google Scholar]
- Mariani, G.; Scheidegger, F.; Istrate, R.; Bekas, C.; Malossi, C. BAGAN: Data Augmentation with Balancing GAN. arXiv 2018, arXiv:1803.09655. [Google Scholar]
- Yi, X.; Walia, E.; Babyn, P. Generative Adversarial Network in Medical Imaging: A Review. Med. Image Anal. 2019, 58, 101552. [Google Scholar] [CrossRef] [Green Version]
- Shijie, J.; Ping, W.; Peiyi, J.; Siping, H. Research on Data Augmentation for Image Classification Based on Convolution Neural Networks. In Proceedings of the 2017 Chinese Automation Congress, Jinan, China, 20–22 October 2017; pp. 4165–4170. [Google Scholar]
- Tzeng, E.; Hoffman, J.; Zhang, N.; Saenko, K.; Darrell, T. Deep Domain Confusion: Maximizing for Domain Invariance. arXiv 2014, arXiv:1412.3474. [Google Scholar]
- Tzeng, E.; Hoffman, J.; Saenko, K.; Darrell, T. Adversarial Discriminative Domain Adaptation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 22–25 July 2017; pp. 7167–7176. [Google Scholar]
- Long, M.; Cao, Y.; Wang, J.; Jordan, M.I. Learning Transferable Features with Deep Adaptation Networks. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 97–105. [Google Scholar]
- Hoffman, J.; Wang, D.; Yu, F.; Darrell, T. FCNs in the Wild: Pixel-Level Adversarial and Constraint-Based Adaptation. arXiv 2016, arXiv:1612.02649. [Google Scholar]
- Xu, Y.; Du, B.; Zhang, L.; Zhang, Q.; Wang, G.; Zhang, L. Self-Ensembling Attention Networks: Addressing Domain Shift for Semantic Segmentation. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 5581–5588. [Google Scholar]
- Chen, H.; Wu, C.; Xu, Y.; Du, B. Unsupervised Domain Adaptation for Semantic Segmentation via Low-Level Edge Information Transfer. arXiv 2021, arXiv:2109.08912. [Google Scholar]
- Xu, Y.; He, F.; Du, B.; Zhang, L.; Tao, D. Self-Ensembling GAN for Cross-Domain Semantic Segmentation. arXiv 2021, arXiv:2112.07999. [Google Scholar]
- Liu, W.; Rabinovich, A.; Berg, A.C. ParseNet: Looking Wider to See Better. arXiv 2015, arXiv:1506.04579. [Google Scholar]
- Mottaghi, R.; Chen, X.; Liu, X.; Cho, N.-G.; Lee, S.-W.; Fidler, S.; Urtasun, R.; Yuille, A. The Role of Context for Object Detection and Semantic Segmentation in the Wild. In Proceedings of the 27th IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014; pp. 891–898. [Google Scholar]
- Liu, C.; Yuen, J.; Torralba, A. SIFT Flow: Dense Correspondence across Scenes and Its Applications. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 978–994. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 22–25 July 2017; pp. 2881–2890. [Google Scholar]
- Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. BiSeNet: Bilateral Segmentation Network for Real-Time Semantic Segmentation. In Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 325–341. [Google Scholar]
- Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
- Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
- Chen, K.; Pang, J.; Wang, J.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Shi, J.; Ouyang, W.; et al. Hybrid Task Cascade for Instance Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4974–4983. [Google Scholar]
- Wu, H.; Zhang, J.; Huang, K.; Liang, K.; Yu, Y. FastFCN: Rethinking Dilated Convolution in the Backbone for Semantic Segmentation. arXiv 2019, arXiv:1903.11816. [Google Scholar]
- Takikawa, T.; Acuna, D.; Jampani, V.; Fidler, S. Gated-SCNN: Gated Shape CNNs for Semantic Segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–3 November 2019; pp. 5229–5238. [Google Scholar]
- Zhuang, J.; Yang, J.; Gu, L.; Dvornek, N. ShelfNet for Fast Semantic Segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Korea, 27–28 October 2019; pp. 847–856. [Google Scholar]
- Alonso, I.; Riazuelo, L.; Montesano, L.; Murillo, A.C. 3D-MiniNet: Learning a 2D Representation from Point Clouds for Fast and Efficient 3D LIDAR Semantic Segmentation. IEEE Robot. Autom. Lett. 2020, 5, 5432–5439. [Google Scholar] [CrossRef]
- Chen, H.; Sun, K.; Tian, Z.; Shen, C.; Huang, Y.; Yan, Y. BlendMask: Top-Down Meets Bottom-Up for Instance Segmentation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8573–8581. [Google Scholar]
- Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364. [Google Scholar] [CrossRef] [Green Version]
- Zhong, Z.; Lin, Z.Q.; Bidart, R.; Hu, X.; Daya, I.B.; Li, Z.; Zheng, W.-S.; Li, J.; Wong, A. Squeeze-and-Attention Networks for Semantic Segmentation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13065–13074. [Google Scholar]
- Caesar, H.; Uijlings, J.; Ferrari, V. COCO-Stuff: Thing and Stuff Classes in Context. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1209–1218. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 1137–1149. [Google Scholar] [CrossRef] [Green Version]
- Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–3 November 2019; pp. 9627–9636. [Google Scholar]
- Strigl, D.; Kofler, K.; Podlipnig, S. Performance and Scalability of GPU-Based Convolutional Neural Networks. In Proceedings of the 18th Euromicro Conference on Parallel, Distributed and Network-Based Processing, Pisa, Italy, 17–19 February 2010; pp. 317–324. [Google Scholar]
- Kim, W.; Seok, J. Indoor Semantic Segmentation for Robot Navigating on Mobile. In Proceedings of the 10th International Conference on Ubiquitous and Future Networks, Prague, Czech Republic, 3–6 July 2018; pp. 22–25. [Google Scholar]
- Asadi, K.; Chen, P.; Han, K.; Wu, T.; Lobaton, E. LNSNet: Lightweight Navigable Space Segmentation for Autonomous Robots on Construction Sites. Data 2019, 4, 40. [Google Scholar] [CrossRef] [Green Version]
- Han, S.; Mao, H.; Dally, W.J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv 2015, arXiv:1510.00149. [Google Scholar]
- Molchanov, P.; Tyree, S.; Karras, T.; Aila, T.; Kautz, J. Pruning Convolutional Neural Networks for Resource Efficient Inference. arXiv 2016, arXiv:1611.06440. [Google Scholar]
- Anwar, S.; Hwang, K.; Sung, W. Structured Pruning of Deep Convolutional Neural Networks. ACM J. Emerg. Technol. Comput. Syst. 2017, 13, 1–18. [Google Scholar] [CrossRef] [Green Version]
- Tremblay, J.; To, T.; Birchfield, S. Falling Things: A Synthetic Dataset for 3D Object Detection and Pose Estimation. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2038–2041. [Google Scholar]
- Jalal, M.; Spjut, J.; Boudaoud, B.; Betke, M. SIDOD: A Synthetic Image Dataset for 3D Object Pose Recognition with Distractors. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019; pp. 475–477. [Google Scholar]
- Tan, W.; Qin, N.; Ma, L.; Li, Y.; Du, J.; Cai, G.; Yang, K.; Li, J. Toronto-3D: A Large-Scale Mobile LiDAR Dataset for Semantic Segmentation of Urban Roadways. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 202–203. [Google Scholar]
- Chen, T.; Dai, B.; Wang, R.; Liu, D. Gaussian-Process-Based Real-Time Ground Segmentation for Autonomous Land Vehicles. J. Intell. Robot. Syst. 2014, 76, 563–582. [Google Scholar] [CrossRef]
- Sun, L.; Yang, K.; Hu, X.; Hu, W.; Wang, K. Real-Time Fusion Network for RGB-D Semantic Segmentation Incorporating Unexpected Obstacle Detection for Road-Driving Images. IEEE Robot. Autom. Lett. 2020, 5, 5558–5565. [Google Scholar] [CrossRef]
- Yu, C.; Gao, C.; Wang, J.; Yu, G.; Shen, C.; Sang, N. BiSeNet V2: Bilateral Network with Guided Aggregation for Real-Time Semantic Segmentation. Int. J. Comput. Vis. 2021, 129, 3051–3068. [Google Scholar] [CrossRef]
Dataset Name | Purpose | Number of Classes | Resolution | Real/Synthetic | Training | Validation | Test |
CamVid [30] | Urban (Driving) | 32 | 960 × 720 | Real | 701 | N/A | N/A |
CamVid-Sturgess [31] | Urban (Driving) | 11 | 960 × 720 | Real | 367 | 100 | 233 |
KITTI-Layout [32] | Urban/Driving | 3 | Variable | Real | 323 | N/A | N/A |
Microsoft COCO [33] | Generic | >80 | Variable | Real | 82,783 | 40,504 | 81,434 |
PASCAL VOC 2012 [34] | Generic | 21 | Variable | Real | 1464 | 1449 | Private |
KITTI-Ros [35] | Urban/Driving | 11 | Variable | Real | 170 | N/A | 46 |
KITTI-Zhang [36] | Urban/Driving | 10 | 1226 × 370 | Real | 140 | N/A | 112 |
Cityscapes [5] (fine) | Urban | 30 (8) | 2048 × 1024 | Real | 2975 | 500 | 1525 |
Cityscapes [5] (coarse) | Urban | 30 (8) | 2048 × 1024 | Real | 22,973 | 500 | N/A |
SYNTHIA [37] | Urban/Driving | 13 | 960 × 720 | Synthetic | 13,407 | N/A | N/A |
GTA5 [38] | Driving | 19 | 1914 × 1052 | Synthetic | N/A | N/A | N/A |
Mapillary Vistas [39] | Urban | 66 | High | Real | 18,000 | 2000 | 5000 |
ADE20K [40] | Urban/Indoor | 150 | High | Real | 27,574 | N/A | 2000 |
SemanticKITTI [41] | Driving | 28 | High | Real | 23,201 | N/A | 20,351 |
nuScenes [42] | Driving | 23 | High | Real | 700 | 150 | 150 |
ApolloScape [43] | Driving | 36 | 3384 × 2710 | Real | N/A | N/A | N/A |
Augmentation | Top-1 Accuracy | Top-5 Accuracy |
Baseline | 48.13 ± 0.42% | 64.50 ± 0.65% |
Flipping | 49.73 ± 1.13% | 67.36 ± 1.38% |
Rotating | 50.80 ± 0.63% | 69.41 ± 0.48% |
Cropping | 61.95 ± 1.01% | 79.10 ± 0.80% |
Color Jittering | 49.57 ± 0.53% | 67.18 ± 0.42% |
Edge Enhancement | 49.29 ± 1.16% | 66.49 ± 0.84% |
Fancy PCA | 49.41 ± 0.84% | 67.54 ± 1.01% |
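The rows above correspond to basic image-manipulation augmentations. For semantic segmentation, geometric transforms have to be applied identically to the image and its label mask, whereas photometric transforms touch only the image. The following is a minimal NumPy sketch of that pairing, not the implementation behind the reported accuracies; the function names, the 90-degree rotation used as a stand-in for arbitrary rotation, and the `max_shift` parameter are all assumptions made for illustration.

```python
import numpy as np

def flip_pair(image, mask):
    """Horizontal flip applied identically to image (H, W, C) and mask (H, W)."""
    return image[:, ::-1].copy(), mask[:, ::-1].copy()

def rotate90_pair(image, mask, k=1):
    """Rotate both arrays by k * 90 degrees (simple stand-in for arbitrary rotation)."""
    return (np.rot90(image, k, axes=(0, 1)).copy(),
            np.rot90(mask, k, axes=(0, 1)).copy())

def random_crop_pair(image, mask, crop_h, crop_w, rng):
    """Cut the same random window out of image and mask."""
    h, w = mask.shape
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return (image[top:top + crop_h, left:left + crop_w].copy(),
            mask[top:top + crop_h, left:left + crop_w].copy())

def color_jitter(image, rng, max_shift=0.1):
    """Per-channel brightness shift on a float image in [0, 1]; the mask is unchanged."""
    shift = rng.uniform(-max_shift, max_shift, size=(1, 1, image.shape[2]))
    return np.clip(image + shift, 0.0, 1.0)

# Usage on a synthetic image and label mask (hypothetical sizes and class count).
rng = np.random.default_rng(0)
image = rng.random((8, 10, 3))
mask = rng.integers(0, 11, size=(8, 10))
flipped_image, flipped_mask = flip_pair(image, mask)
cropped_image, cropped_mask = random_crop_pair(image, mask, 4, 5, rng)
jittered_image = color_jitter(image, rng)
```

Keeping the geometric helpers in image/mask pairs makes it hard to transform one without the other, which would silently corrupt the labels.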
Model | Baseline Top-1 Error (%) | Baseline Top-5 Error (%) | Random Erasing Top-1 Error (%) | Random Erasing Top-5 Error (%) |
ResNet-34 | 25.22 | 8.01 | 24.89 | 7.71 |
ResNet-50 | 23.39 | 6.89 | 22.75 | 6.69 |
ResNet-101 | 20.98 | 5.73 | 20.43 | 5.30 |
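Random erasing occludes one randomly placed rectangle of the input with noise, so the network cannot rely on any single image region. The sketch below illustrates the idea in NumPy under stated assumptions: the function name, the `area_range` and `min_aspect` defaults, and the retry limit are illustrative choices, not settings confirmed by the table above.

```python
import numpy as np

def random_erase(image, rng, area_range=(0.02, 0.4), min_aspect=0.3, max_tries=100):
    """Return a copy of a float image (H, W, C) with one random rectangle filled with noise."""
    h, w = image.shape[:2]
    out = image.copy()
    for _ in range(max_tries):
        # Sample the rectangle's area (as a fraction of the image) and aspect ratio.
        area = rng.uniform(*area_range) * h * w
        aspect = rng.uniform(min_aspect, 1.0 / min_aspect)
        erase_h = int(round(np.sqrt(area * aspect)))
        erase_w = int(round(np.sqrt(area / aspect)))
        if 0 < erase_h < h and 0 < erase_w < w:
            top = rng.integers(0, h - erase_h + 1)
            left = rng.integers(0, w - erase_w + 1)
            noise = rng.random((erase_h, erase_w) + image.shape[2:])
            out[top:top + erase_h, left:left + erase_w] = noise
            return out
    return out  # no rectangle fitted within max_tries; the copy is returned unchanged

# Usage on a blank synthetic image.
rng = np.random.default_rng(0)
erased = random_erase(np.zeros((32, 32, 3)), rng)
```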
Model | MNIST | CIFAR-10 |
Baseline | 1.093 ± 0.057 | 30.65 ± 0.27 |
Baseline + input space affine transformation | 1.477 ± 0.068 | - |
Baseline + input space extrapolation | 1.010 ± 0.065 | - |
Baseline + feature space extrapolation | 0.950 ± 0.036 | 29.24 ± 0.27 |
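Feature space extrapolation creates a new sample by pushing a feature vector away from a same-class neighbour: c' = (c_i − c_k)·λ + c_i. The sketch below is a simplified illustration that uses only the single nearest same-class neighbour and assumes the feature vectors have already been produced by some encoder; the function name and the default λ = 0.5 are assumptions for this example.

```python
import numpy as np

def extrapolate_features(features, labels, lam=0.5):
    """Extrapolate each feature vector away from its nearest same-class neighbour."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    new_features, new_labels = [], []
    for i, (c_i, y_i) in enumerate(zip(features, labels)):
        same = np.flatnonzero(labels == y_i)
        same = same[same != i]              # exclude the sample itself
        if same.size == 0:
            continue                        # no same-class neighbour available
        dists = np.linalg.norm(features[same] - c_i, axis=1)
        c_k = features[same[np.argmin(dists)]]
        new_features.append((c_i - c_k) * lam + c_i)
        new_labels.append(y_i)
    return np.array(new_features), np.array(new_labels)

# Usage with four 2-D feature vectors from two classes.
feats = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [10.0, 12.0]]
aug_feats, aug_labels = extrapolate_features(feats, [0, 0, 1, 1])
# aug_feats is [[-0.5, 0.0], [1.5, 0.0], [10.0, 9.0], [10.0, 13.0]]
```

The synthetic vectors keep the label of the sample they were extrapolated from, so they can be appended directly to the training set of a classifier that operates on features.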
SYNTHIA → Cityscapes | | | |
Method | Backbone | MIoU (NoAdapt) | MIoU (Adapt) |
Self-Ensembling Attention Networks [71] | VGG-16 | 17.4 | 37.5 |
Semantic-Edge Domain Adaptation [72] | ResNet-101 | 37.8 | 55.9 |
SE-GAN [73] | ResNet-101 | 33.3 | 48.9 |
GTA5 → Cityscapes | | | |
Method | Backbone | MIoU (NoAdapt) | MIoU (Adapt) |
Self-Ensembling Attention Networks [71] | VGG-16 | 21.2 | 35.7 |
Semantic-Edge Domain Adaptation [72] | ResNet-101 | 36.5 | 52.8 |
SE-GAN [73] | ResNet-101 | 37.2 | 50.1 |
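The benefit of each adaptation method is easier to compare as the difference between its adapted and non-adapted MIoU. The short sketch below computes that gain from the values in the two tables above; the dictionary layout and the function name are assumptions made for illustration.

```python
# (NoAdapt MIoU, Adapt MIoU) copied from the two tables above.
results = {
    "SYNTHIA -> Cityscapes": {
        "Self-Ensembling Attention Networks [71]": (17.4, 37.5),
        "Semantic-Edge Domain Adaptation [72]": (37.8, 55.9),
        "SE-GAN [73]": (33.3, 48.9),
    },
    "GTA5 -> Cityscapes": {
        "Self-Ensembling Attention Networks [71]": (21.2, 35.7),
        "Semantic-Edge Domain Adaptation [72]": (36.5, 52.8),
        "SE-GAN [73]": (37.2, 50.1),
    },
}

def adaptation_gains(table):
    """Return the MIoU gain (Adapt minus NoAdapt) per method, rounded to one decimal."""
    return {name: round(adapt - no_adapt, 1) for name, (no_adapt, adapt) in table.items()}

for task, table in results.items():
    for name, gain in adaptation_gains(table).items():
        print(f"{task}: {name}: +{gain} MIoU")
```

Because the methods use different backbones and start from different non-adapted baselines, the gains are indicative rather than a like-for-like ranking.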
Framework | Tasks | Evaluation Results |
ParseNet [74] | Semantic Segmentation | Ranked #46 on semantic segmentation on PASCAL VOC 2012 test |
PSPNet [77] | Semantic Segmentation; Real-Time Semantic Segmentation; Video Semantic Segmentation; Lesion Segmentation; Scene Parsing; Image Classification | Ranked #3 on video semantic segmentation on Cityscapes val |
BiSeNet [78] | Semantic Segmentation; Real-Time Semantic Segmentation | Ranked #4 on semantic segmentation on SkyScapes-Dense |
DeepLabv3+ [79] | Semantic Segmentation; Lesion Segmentation; Image Classification | Ranked #1 on lesion segmentation on ATLAS |
Mask R-CNN [80] | Semantic Segmentation; Instance Segmentation; 3D Instance Segmentation; Human Part Segmentation; Nuclear Segmentation; Panoptic Segmentation; Object Detection; Real-Time Object Detection; Key Point Detection; Multi-Human Parsing; Pose Estimation | Ranked #1 on real-time object detection on COCO minival (mAP metric) |
DANet [81] | Semantic Segmentation; Scene Segmentation | Ranked #8 on semantic segmentation on COCO-Stuff test |
HTC [82] | Semantic Segmentation; Instance Segmentation; Object Detection | Ranked #27 on instance segmentation on COCO test-dev |
FastFCN [83] | Semantic Segmentation | Ranked #29 on semantic segmentation on PASCAL Context |
GSCNN [84] | Semantic Segmentation | Ranked #16 on semantic segmentation on Cityscapes test |
ShelfNet [85] | Semantic Segmentation; Real-Time Semantic Segmentation; Scene Understanding; Autonomous Driving | Ranked #11 on real-time semantic segmentation on Cityscapes test |
3D-MiniNet [86] | Semantic Segmentation; Real-Time Semantic Segmentation; 3D Semantic Segmentation; Real-Time 3D Semantic Segmentation; LiDAR Semantic Segmentation; Autonomous Driving; Autonomous Vehicles | Ranked #1 on real-time 3D semantic segmentation on SemanticKITTI |
BlendMask [87] | Semantic Segmentation; Instance Segmentation; Real-Time Instance Segmentation | Ranked #6 on real-time instance segmentation on COCO |
HRNet [88] | Semantic Segmentation; Instance Segmentation; Object Detection; Pose Estimation; Representation Learning | Ranked #1 on object detection on COCO test-dev (Hardware Burden metric) |
SANet [89] | Semantic Segmentation | Ranked #14 on semantic segmentation on PASCAL VOC 2012 test (using extra training data) |
Metrics | Formula | Evaluation Focus |
Pixel Accuracy (PA) | $PA = \frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k}\sum_{j=0}^{k} p_{ij}}$ | Ratio of correctly classified pixels to the total number of pixels. |
Mean Pixel Accuracy (MPA) | $MPA = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k} p_{ij}}$ | Ratio of correct pixels computed on a per-class basis and then averaged over the total number of classes. It refines PA by giving every class equal weight. |
Mean Intersection over Union (MIoU) | $MIoU = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}$ | Intersection over Union (IoU) is the ratio between the intersection and the union of the ground truth and the predicted segmentation; equivalently, true positives over the sum of true positives, false negatives, and false positives. The IoU is computed on a per-class basis and then averaged. |
Frequency Weighted Intersection over Union (FWIoU) | $FWIoU = \frac{1}{\sum_{i=0}^{k}\sum_{j=0}^{k} p_{ij}}\sum_{i=0}^{k}\frac{\left(\sum_{j=0}^{k} p_{ij}\right) p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}$ | Per-class IoU weighted by how frequently each class appears. |
Here, $p_{ij}$ is the number of pixels of class $i$ predicted as class $j$, and $k+1$ is the number of classes.
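All four metrics described above can be derived from a single confusion matrix whose entry (i, j) counts the pixels of ground-truth class i predicted as class j. The sketch below is one possible NumPy formulation, not a reference implementation; the function names are hypothetical, and the choice to skip classes absent from both the ground truth and the prediction when averaging is an assumption.

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes):
    """p[i, j] = number of pixels of ground-truth class i predicted as class j."""
    gt = np.asarray(gt).ravel()
    pred = np.asarray(pred).ravel()
    index = gt * num_classes + pred
    counts = np.bincount(index, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def segmentation_metrics(p):
    """Compute PA, MPA, MIoU and FWIoU from a confusion matrix p."""
    p = p.astype(float)
    true_pos = np.diag(p)
    gt_total = p.sum(axis=1)        # pixels per ground-truth class (TP + FN)
    pred_total = p.sum(axis=0)      # pixels per predicted class (TP + FP)
    union = gt_total + pred_total - true_pos
    with np.errstate(divide="ignore", invalid="ignore"):
        class_acc = true_pos / gt_total
        iou = true_pos / union
    freq = gt_total / p.sum()
    return {
        "PA": true_pos.sum() / p.sum(),
        "MPA": np.nanmean(class_acc),   # assumption: absent classes (NaN) are skipped
        "MIoU": np.nanmean(iou),
        "FWIoU": np.nansum(freq * iou),
    }

# Usage on a tiny 2 x 3 label map with three classes.
gt = [[0, 0, 1], [1, 2, 2]]
pred = [[0, 1, 1], [1, 2, 0]]
metrics = segmentation_metrics(confusion_matrix(gt, pred, num_classes=3))
# PA = 4/6, MPA = 2/3, MIoU = 0.5, FWIoU = 0.5
```

In this example the per-class IoU values are 1/3, 2/3 and 1/2, and since every class covers the same number of ground-truth pixels, MIoU and FWIoU coincide.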
Dataset | Framework | Backbone | MIoU |
Semantic Segmentation | | | |
CamVid [30] | PSPNet | - | 69.1 |
CamVid [30] | BiSeNet | ResNet-18 | 68.7 |
PASCAL VOC 2012 [34] | DeepLabv3+ | Xception-JFT | 89 |
PASCAL VOC 2012 [34] | SANet * | ResNet-101 | 86.1 |
PASCAL VOC 2012 [34] | PSPNet * | ResNet-101 | 85.4 |
PASCAL VOC 2012 [34] | ShelfNet * | ResNet-101 | 84.2 |
PASCAL VOC 2012 [34] | SANet | ResNet-101 | 83.2 |
PASCAL VOC 2012 [34] | DANet | ResNet-101 | 82.6 |
PASCAL VOC 2012 [34] | ShelfNet | ResNet-101 | 81.1 |
PASCAL VOC 2012 [34] | ParseNet | - | 69.8 |
Cityscapes test [5] | GSCNN | - | 82.8 |
Cityscapes test [5] | DeepLabv3+ (coarse) | - | 82.1 |
Cityscapes test [5] | DANet | ResNet-101 | 81.5 |
Cityscapes test [5] | PSPNet (fine and coarse) | ResNet-101 | 80.2 |
Cityscapes test [5] | ShelfNet | ResNet-34 | 79 |
Cityscapes test [5] | BiSeNet | ResNet-101 | 78.9 |
ADE20K [40] | PSPNet | ResNet-269 | 44.94 |
ADE20K [40] | FastFCN | ResNet-101 | 44.34 |
3D Semantic Segmentation | | | |
SemanticKITTI [41] | 3D-MiniNet | - | 55.8 |
Dataset | 2D/3D | Framework | Backbone | Parameters | MIoU | Inference Time | Frame Rate |
Cityscapes test [5] | 2D | ShelfNet | ResNet-18 | 23.5 M | 74.8 | 16.9 ms | 59.2 fps |
Cityscapes test [5] | 2D | BiSeNet | ResNet-18 | 49.0 M | 74.7 | 15.2 ms | 65.5 fps |
SemanticKITTI [41] | 3D | 3D-MiniNet | - | 3.97 M | 55.8 | - | 28 fps |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Alokasi, H.; Ahmad, M.B. Deep Learning-Based Frameworks for Semantic Segmentation of Road Scenes. Electronics 2022, 11, 1884.