skip to main content
10.1007/978-3-030-58545-7_40guideproceedingsArticle/Chapter ViewAbstractPublication PagesConference Proceedingsacm-pubtype
Article

Naive-Student: Leveraging Semi-Supervised Learning in Video Sequences for Urban Scene Segmentation

Published: 23 August 2020 Publication History

Abstract

Supervised learning in large discriminative models is a mainstay for modern computer vision. Such an approach necessitates investing in large-scale human-annotated datasets for achieving state-of-the-art results. In turn, the efficacy of supervised learning may be limited by the size of the human annotated dataset. This limitation is particularly notable for image segmentation tasks, where the expense of human annotation is especially large, yet large amounts of unlabeled data may exist. In this work, we ask if we may leverage semi-supervised learning in unlabeled video sequences and extra images to improve the performance on urban scene segmentation, simultaneously tackling semantic, instance, and panoptic segmentation. The goal of this work is to avoid the construction of sophisticated, learned architectures specific to label propagation (e.g., patch matching and optical flow). Instead, we simply predict pseudo-labels for the unlabeled data and train subsequent models with both human-annotated and pseudo-labeled data. The procedure is iterated for several times. As a result, our Naive-Student model, trained with such simple yet effective iterative semi-supervised learning, attains state-of-the-art results at all three Cityscapes benchmarks, reaching the performance of 67.8% PQ, 42.6% AP, and 85.2% mIOU on the test set. We view this work as a notable step towards building a simple procedure to harness unlabeled video sequences and extra images to surpass state-of-the-art performance on core computer vision tasks.

References

[1]
Abadi, M., et al.: Tensorflow: a system for large-scale machine learning. In: Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (2016)
[2]
Abu-El-Haija, S., et al.: YouTube-8M: a large-scale video classification benchmark. arXiv:1609.08675 (2016)
[3]
Arazo, E., Ortego, D., Albert, P., O’Connor, N.E., McGuinness, K.: Pseudo-labeling and confirmation bias in deep semi-supervised learning. arXiv:1908.02983 (2019)
[4]
Badrinarayanan, V., Galasso, F., Cipolla, R.: Label propagation in video sequences. In: CVPR (2010)
[5]
Bell S, Upchurch P, Snavely N, and Bala K OpenSurfaces: a richly annotated catalog of surface appearance ACM Trans. Graph. 2013 32 1-17
[6]
Budvytis, I., Sauer, P., Roddick, T., Breen, K., Cipolla, R.: Large scale labelled video data augmentation for semantic segmentation in driving scenarios. In: ICCV Workshop (2017)
[7]
Caba Heilbron, F., Escorcia, V., Ghanem, B., Carlos Niebles, J.: ActivityNet: a large-scale video benchmark for human activity understanding. In: CVPR (2015)
[8]
Castrejon, L., Kundu, K., Urtasun, R., Fidler, S.: Annotating object instances with a polygon-RNN. In: CVPR (2017)
[9]
Chen, L.C., et al.: Searching for efficient multi-scale architectures for dense image prediction. In: NeurIPS (2018)
[10]
Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected CRFs. In: ICLR (2015)
[11]
Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. In: IEEE TPAMI (2017)
[12]
Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587 (2017)
[13]
Chen L-C, Zhu Y, Papandreou G, Schroff F, and Adam H Ferrari V, Hebert M, Sminchisescu C, and Weiss Y Encoder-decoder with atrous separable convolution for semantic image segmentation Computer Vision – ECCV 2018 2018 Cham Springer 833-851
[14]
Cheng, B., et al.: Panoptic-DeepLab. In: ICCV COCO + Mapillary Joint Recognition Challenge Workshop (2019)
[15]
Cheng, B., et al.: Panoptic-DeepLab: a simple, strong, and fast baseline for bottom-up panoptic segmentation. In: CVPR (2020)
[16]
Chollet, F.: Xception: deep learning with depthwise separable convolutions. In: CVPR (2017)
[17]
Cordts, M., et al.: The cityscapes dataset for semantic urban scene understanding. In: CVPR (2016)
[18]
Cubuk, E.D., Zoph, B., Shlens, J., Le, Q.V.: Randaugment: practical data augmentation with no separate search. arXiv:1909.13719 (2019)
[19]
Dai, J., He, K., Sun, J.: Boxsup: exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In: ICCV (2015)
[20]
Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. In: ICCV (2015)
[21]
Everingham M, Van Gool L, Williams CK, Winn J, and Zisserman A The pascal visual object classes (VOC) challenge IJCV 2010 88 2 303-338
[22]
Forsyth, D.A., Ponce, J.: Computer Vision: A Modern Approach. Prentice Hall Professional Technical Reference (2002)
[23]
Gadde, R., Jampani, V., Gehler, P.V.: Semantic video CNNs through representation warping. In: ICCV (2017)
[24]
Geiger A, Lenz P, Stiller C, and Urtasun R Vision meets robotics: the KITTI dataset Int. J. Robot. Res. 2013 32 1231-1237
[25]
Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image rotations. In: CVPR (2018)
[26]
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR (2014)
[27]
Hariharan, B., Arbelaez, P., Bourdev, L., Maji, S., Malik, J.: Semantic contours from inverse detectors. In: ICCV (2011)
[28]
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
[29]
Hénaff, O.J., Razavi, A., Doersch, C., Eslami, S., Oord, A.v.d.: Data-efficient image recognition with contrastive predictive coding. arXiv:1905.09272 (2019)
[30]
Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
[31]
Hong, S., Noh, H., Han, B.: Decoupled deep neural network for semi-supervised semantic segmentation. In: NeurIPS (2015)
[32]
Huang G, Sun Yu, Liu Z, Sedra D, and Weinberger KQ Leibe B, Matas J, Sebe N, and Welling M Deep networks with stochastic depth Computer Vision – ECCV 2016 2016 Cham Springer 646-661
[33]
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: ICML (2015)
[34]
Iscen, A., Tolias, G., Avrithis, Y., Chum, O.: Label propagation for deep semi-supervised learning. In: CVPR (2019)
[35]
Khoreva, A., Benenson, R., Hosang, J., Hein, M., Schiele, B.: Simple does it: weakly supervised instance and semantic segmentation. In: CVPR (2017)
[36]
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)
[37]
Kirillov, A., Girshick, R., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR (2019)
[38]
Kirillov, A., He, K., Girshick, R., Rother, C., Dollár, P.: Panoptic segmentation. In: CVPR (2019)
[39]
Kornblith, S., Shlens, J., Le, Q.V.: Do better imagenet models transfer better? In: CVPR (2019)
[40]
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NeurIPS (2012)
[41]
Lake, B.M., Ullman, T.D., Tenenbaum, J.B., Gershman, S.J.: Building machines that learn and think like people. Behav. Brain Sci. (2017)
[42]
Lee, D.H.: Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In: ICML Workshop (2013)
[43]
Li, J., Raventos, A., Bhargava, A., Tagawa, T., Gaidon, A.: Learning to fuse things and stuff. arXiv:1812.01192 (2018)
[44]
Li LJ and Fei-Fei L Optimol: automatic online picture collection via incremental model learning IJCV 2010 88 147-168
[45]
Li Q, Arnab A, and Torr PHS Ferrari V, Hebert M, Sminchisescu C, and Weiss Y Weakly- and semi-supervised panoptic segmentation Computer Vision – ECCV 2018 2018 Cham Springer 106-124
[46]
Li, Q., Qi, X., Torr, P.H.: Unifying training and inference for panoptic segmentation. arXiv:2001.04982 (2020)
[47]
Li, Y., Qi, H., Dai, J., Ji, X., Wei, Y.: Fully convolutional instance-aware semantic segmentation. In: CVPR (2017)
[48]
Liang, J., Homayounfar, N., Ma, W.C., Xiong, Y., Hu, R., Urtasun, R.: Polytransform: deep polygon transformer for instance segmentation. arXiv:1912.02801 (2019)
[49]
Lin T-Y et al. Fleet D, Pajdla T, Schiele B, Tuytelaars T, et al. Microsoft COCO: common objects in context Computer Vision – ECCV 2014 2014 Cham Springer 740-755
[50]
Liu, C., et al.: Auto-DeepLab: hierarchical neural architecture search for semantic image segmentation. In: CVPR (2019)
[51]
Liu, S., Qi, L., Qin, H., Shi, J., Jia, J.: Path aggregation network for instance segmentation. In: CVPR (2018)
[52]
Liu, W., Rabinovich, A., Berg, A.C.: Parsenet: looking wider to see better. arXiv:1506.04579 (2015)
[53]
Luc, P., Neverova, N., Couprie, C., Verbeek, J., LeCun, Y.: Predicting deeper into the future of semantic segmentation. In: ICCV (2017)
[54]
Mustikovela SK, Yang MY, and Rother C Hua G and Jégou H Can ground truth label propagation from video help semantic segmentation? Computer Vision – ECCV 2016 Workshops 2016 Cham Springer 804-820
[55]
Neuhold, G., Ollmann, T., Bulò, S.R., Kontschieder, P.: The mapillary vistas dataset for semantic understanding of street scenes. In: ICCV (2017)
[56]
Nilsson, D., Sminchisescu, C.: Semantic video segmentation by gated recurrent flow propagation. In: CVPR (2018)
[57]
Papandreou, G., Chen, L.C., Murphy, K.P., Yuille, A.L.: Weakly-and semi-supervised learning of a deep convolutional network for semantic image segmentation. In: ICCV (2015)
[58]
Papandreou G, Zhu T, Chen L-C, Gidaris S, Tompson J, and Murphy K Ferrari V, Hebert M, Sminchisescu C, and Weiss Y PersonLab: person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model Computer Vision – ECCV 2018 2018 Cham Springer 282-299
[59]
Pathak, D., Krahenbuhl, P., Darrell, T.: Constrained convolutional neural networks for weakly supervised segmentation. In: ICCV (2015)
[60]
Pinheiro, P.O., Collobert, R., Dollár, P.: Learning to segment object candidates. In: NeurIPS (2015)
[61]
Porzi, L., Bulò, S.R., Colovic, A., Kontschieder, P.: Seamless scene segmentation. In: CVPR (2019)
[62]
Porzi, L., Hofinger, M., Ruiz, I., Serrat, J., Bulo, S.R., Kontschieder, P.: Learning multi-object tracking and segmentation from automatic annotations. In: CVPR (2020)
[63]
Qi, H., et al.: Deformable convolutional networks - COCO detection and segmentation challenge 2017 entry. In: ICCV COCO Challenge Workshop (2017)
[64]
Radosavovic, I., Dollár, P., Girshick, R., Gkioxari, G., He, K.: Data distillation: towards omni-supervised learning. In: CVPR (2018)
[65]
Real, E., Shlens, J., Mazzocchi, S., Pan, X., Vanhoucke, V.: YouTube-BoundingBoxes: a large high-precision human-annotated data set for object detection in video. In: CVPR (2017)
[66]
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NeurIPS (2015)
[67]
Riloff, E., Wiebe, J.: Learning extraction patterns for subjective expressions. In: EMNLP (2003)
[68]
Rosenberg, C., Hebert, M., Schneiderman, H.: Semi-supervised self-training of object detection models. WACV/MOTION (2005)
[69]
Russakovsky O et al. ImageNet large scale visual recognition challenge IJCV 2015 115 211-252
[70]
Russell BC, Torralba A, Murphy KP, and Freeman WT LabelMe: a database and web-based tool for image annotation IJCV 2008 77 157-173
[71]
Scudder H Probability of error of some adaptive pattern-recognition machines IEEE Trans. Inf. Theor. 1965 11 363-371
[72]
Shi W, Gong Y, Ding C, Ma Z, Tao X, and Zheng N Ferrari V, Hebert M, Sminchisescu C, and Weiss Y Transductive semi-supervised deep learning using min-max features Computer Vision – ECCV 2018 2018 Cham Springer 311-327
[73]
Souly, N., Spampinato, C., Shah, M.: Semi supervised semantic segmentation using generative adversarial network. In: ICCV (2017)
[74]
Sun, C., Shrivastava, A., Singh, S., Gupta, A.: Revisiting unreasonable effectiveness of data in deep learning era. In: ICCV (2017)
[75]
Sun, P., et al.: Scalability in perception for autonomous driving: Waymo open dataset. arXiv:1912.04838 (2019)
[76]
Tang, Y., Wang, J., Gao, B., Dellandréa, E., Gaizauskas, R., Chen, L.: Large scale semi-supervised object detection using visual and semantic knowledge transfer. In: CVPR (2016)
[77]
Voigtlaender, P., et al.: Mots: multi-object tracking and segmentation. In: CVPR (2019)
[78]
Wang, H., Zhu, Y., Green, B., Adam, H., Yuille, A., Chen, L.C.: Axial-DeepLab: stand-alone axial-attention for panoptic segmentation. arXiv:2003.07853 (2020)
[79]
Wang, P., et al.: Understanding convolution for semantic segmentation. arXiv:1702.08502 (2017)
[80]
Wei, Y., et al.: STC: a simple to complex framework for weakly-supervised semantic segmentation. In: IEEE TPAMI (2016)
[81]
Wei, Y., Xiao, H., Shi, H., Jie, Z., Feng, J., Huang, T.S.: Revisiting dilated convolution: a simple approach for weakly-and semi-supervised semantic segmentation. In: CVPR (2018)
[82]
Wu, J., Yildirim, I., Lim, J.J., Freeman, B., Tenenbaum, J.: Galileo: perceiving physical object properties by integrating a physics engine with deep learning. In: NeurIPS (2015)
[83]
Wu Z, Shen C, and Van Den Hengel A Wider or deeper: revisiting the ResNet model for visual recognition Pattern Recogn. 2019 90 119-133
[84]
Xie, Q., Hovy, E., Luong, M.T., Le, Q.V.: Self-training with noisy student improves imagenet classification. arXiv:1911.04252 (2019)
[85]
Xiong, Y., Liao, R., Zhao, H., Hu, R., Bai, M., Yumer, E., Urtasun, R.: UPSNet: a unified panoptic segmentation network. In: CVPR (2019)
[86]
Yalniz, I.Z., J’egou, H., Chen, K., Paluri, M., Mahajan, D.: Billion-scale semi-supervised learning for image classification. arXiv:1905.00546 (2019)
[87]
Yang, T.J., et al.: DeeperLab: single-shot image parser. arXiv:1902.05093 (2019)
[88]
Yarowsky, D.: Unsupervised word sense disambiguation rivaling supervised methods. In: ACL (1995)
[89]
Yuan, Y., Chen, X., Wang, J.: Object-contextual representations for semantic segmentation. arXiv:1909.11065 (2019)
[90]
Zagoruyko, S., Komodakis, N.: Wide residual networks. In: BMVC (2016)
[91]
Zhai, X., Oliver, A., Kolesnikov, A., Beyer, L.: S4l: self-supervised semi-supervised learning. In: ICCV (2019)
[92]
Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR (2017)
[93]
Zheng, Z., Zheng, L., Yang, Y.: Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. In: ICCV (2017)
[94]
Zhu, X., Xiong, Y., Dai, J., Yuan, L., Wei, Y.: Deep feature flow for video recognition. In: CVPR (2017)
[95]
Zhu, Y., et al.: Improving semantic segmentation via video propagation and label relaxation. In: CVPR (2019)
[96]
Zhu, Y., et al.: Improving semantic segmentation via self-training. arXiv:2004.14960 (2020)

Cited By

View all

Recommendations

Comments

Information & Contributors

Information

Published In

cover image Guide Proceedings
Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX
Aug 2020
860 pages
ISBN:978-3-030-58544-0
DOI:10.1007/978-3-030-58545-7

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 23 August 2020

Author Tags

  1. Semi-supervised learning
  2. Pseudo label
  3. Semantic segmentation
  4. Instance segmentation
  5. Panoptic segmentation

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 13 Jan 2025

Other Metrics

Citations

Cited By

View all

View Options

View options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media