
ImageNet Large Scale Visual Recognition Challenge

Published: 01 December 2015

Abstract

The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the 5 years of the challenge, and propose future directions and improvements.
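The headline classification metric used to rank ILSVRC entries, and to compare them with human performance, is the "top-5" error: an image counts as correctly classified if the ground-truth label appears among a model's five highest-scoring guesses. A minimal sketch of that computation (the labels and scores below are illustrative, not drawn from the dataset):

```python
def top_k_error(predictions, ground_truth, k=5):
    """predictions: one score dict (label -> score) per image.
    ground_truth: the true label for each image."""
    errors = 0
    for scores, truth in zip(predictions, ground_truth):
        # keep the k labels with the highest scores
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        if truth not in top_k:
            errors += 1
    return errors / len(ground_truth)

preds = [
    {"dog": 0.6, "cat": 0.2, "fox": 0.1, "car": 0.05, "cup": 0.03, "pen": 0.02},
    {"car": 0.5, "bus": 0.3, "van": 0.1, "cat": 0.05, "dog": 0.03, "cow": 0.02},
]
truths = ["fox", "tree"]  # "fox" is in the first top-5; "tree" is absent from the second
print(top_k_error(preds, truths))  # 0.5: one of the two images is misclassified
```

Scoring by top-5 rather than top-1 tolerates the inherent ambiguity of images containing several plausible objects, which is one reason the metric became the standard yardstick for the challenge.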



Published In

International Journal of Computer Vision  Volume 115, Issue 3
December 2015
152 pages

Publisher

Kluwer Academic Publishers

United States


Author Tags

  1. Benchmark
  2. Dataset
  3. Large-scale
  4. Object detection
  5. Object recognition

Qualifiers

  • Article
