Abstract
Learning discriminative representations with deep neural networks often relies on massive labeled data, which is expensive and difficult to obtain in many real-world scenarios. As an alternative, self-supervised learning, which uses the input itself as supervision, has become popular for its strong performance on visual representation learning. This paper introduces a contrastive self-supervised framework for learning generalizable representations from synthetic data, which can be obtained easily and with complete controllability. Specifically, we propose to optimize a contrastive learning task and a physical property prediction task simultaneously. Given a synthetic scene, the first task aims to maximize agreement between a pair of synthetic images generated by our proposed view sampling module, while the second task aims to predict three physical property maps: depth maps, instance contour maps, and surface normal maps. In addition, a feature-level domain adaptation technique with adversarial training is applied to reduce the domain gap between real and synthetic data. Experiments demonstrate that our proposed method achieves state-of-the-art performance on several visual recognition datasets.
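To make the joint objective described in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes an InfoNCE-style contrastive loss (the abstract does not name the exact loss), treats the three property-map losses and the adversarial term as precomputed scalars, and uses hypothetical names and weights (`info_nce`, `total_loss`, `w_property`, `w_adversarial`, `temperature`) chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(views_a, views_b, temperature=0.5):
    """Mean InfoNCE loss (assumed form): views_a[i] and views_b[i] are the
    positive pair; every other views_b[j] is a negative for views_a[i]."""
    losses = []
    for i, anchor in enumerate(views_a):
        logits = [cosine(anchor, other) / temperature for other in views_b]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)

def total_loss(contrastive, property_losses, adversarial,
               w_property=1.0, w_adversarial=1.0):
    """Weighted sum of the three objectives; the weights are hypothetical."""
    return (contrastive
            + w_property * sum(property_losses)
            + w_adversarial * adversarial)

# Toy 2-D embeddings standing in for encoder outputs of two sampled views.
aligned_a = [[1.0, 0.0], [0.0, 1.0]]
aligned_b = [[0.9, 0.1], [0.1, 0.9]]   # matching views agree
shuffled_b = [[0.1, 0.9], [0.9, 0.1]]  # matching views disagree

loss_aligned = info_nce(aligned_a, aligned_b)
loss_shuffled = info_nce(aligned_a, shuffled_b)
```

In this toy setup the loss is lower when the two views of the same scene map to similar embeddings, which is the "maximize agreement" behavior the abstract describes. How the actual framework weights the depth, contour, surface normal, and adversarial terms is not stated in the abstract.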
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Nos. 61822204 and 61521002).
Additional information
Recommended by Associate Editor Jangmyung Lee
Colored figures are available in the online version at https://link.springer.com/journal/11633
Dong-Yu She received the B.Eng. and M.Eng. degrees in computer science and technology from Nankai University, China in 2016 and 2019, respectively. She is a Ph.D. candidate in the Department of Computer Science and Technology, Tsinghua University, China.
Her research interests include deep learning and computer vision. E-mail: [email protected]
ORCID iD: 0000-0002-1434-562X
Kun Xu received the B.Eng. and Ph.D. degrees in computer science and technology from Tsinghua University, China in 2005 and 2009, respectively. He is an associate professor in the Department of Computer Science and Technology, Tsinghua University, China.
His research interests include realistic rendering and image/video editing.
E-mail: [email protected] (Corresponding author)
ORCID iD: 0000-0002-2671-4170
Rights and permissions
Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
She, DY., Xu, K. Contrastive Self-supervised Representation Learning Using Synthetic Data. Int. J. Autom. Comput. 18, 556–567 (2021). https://doi.org/10.1007/s11633-021-1297-9
DOI: https://doi.org/10.1007/s11633-021-1297-9