DOI: 10.1145/2578726.2578739
tutorial

Towards Efficient Learning of Optimal Spatial Bag-of-Words Representations

Published: 01 April 2014

Abstract

Spatial Pyramid Matching (SPM) assumes that the spatial Bag-of-Words (BoW) representation is independent of the data. However, evidence has shown that this assumption usually leads to a suboptimal representation. In this paper, we propose a novel method called Jensen-Shannon (JS) Tiling that learns the BoW representation directly from data at the BoW level. JS Tiling is especially suited to large-scale datasets: it is orders of magnitude faster than existing methods while achieving comparable or even better classification precision. Experimental results on four benchmarks, including two TRECVID 2012 datasets, show that JS Tiling outperforms SPM and state-of-the-art methods. A runtime comparison shows that selecting BoW representations with JS Tiling is more than 1,000 times faster than selecting them by running classifiers. In addition, JS Tiling was an important component of the CMU team's final submission to the TRECVID 2012 Multimedia Event Detection task.
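The abstract contrasts SPM's fixed, data-independent spatial grid with a spatial BoW representation selected from data using the Jensen-Shannon divergence. The sketch below illustrates that idea under stated assumptions; it is not the authors' JS Tiling algorithm. It assumes that a "tiling" is a partition of the cells of a small base grid (2 x 2 here), enumerates those partitions, builds one BoW histogram per region, and scores each tiling by the JS divergence between the mean histograms of a positive and a negative class. The function names (`spatial_bow`, `tiling_score`, `best_tiling`), the base grid, and the scoring rule are all hypothetical.

```python
# Illustrative sketch only: NOT the authors' JS Tiling implementation.
import math
from typing import Dict, Iterator, List, Sequence, Tuple

# A local feature: (x, y, visual_word), with x and y normalised to [0, 1).
Feature = Tuple[float, float, int]
# Hypothetical tiling: a partition of the cells of a grid x grid base grid.
Tiling = List[List[int]]


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence in bits (result lies in [0, 1])."""
    sp, sq = sum(p), sum(q)
    if sp <= 0 or sq <= 0:
        raise ValueError("histograms must have positive mass")
    p = [v / sp for v in p]
    q = [v / sq for v in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a: List[float], b: List[float]) -> float:
        # Bins with a_i == 0 contribute nothing; b_i > 0 whenever a_i > 0 here.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def set_partitions(items: List[int]) -> Iterator[Tiling]:
    """Yield every partition of `items` into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Add `first` to each existing block in turn...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ...or give it a block of its own.
        yield [[first]] + partition


def spatial_bow(features: List[Feature], vocab_size: int,
                tiling: Tiling, grid: int = 2) -> List[float]:
    """Concatenate one BoW histogram per region of `tiling`.

    Cell index = row * grid + col on the grid x grid base grid.
    """
    region_of: Dict[int, int] = {
        cell: r for r, region in enumerate(tiling) for cell in region
    }
    hist = [0.0] * (vocab_size * len(tiling))
    for x, y, word in features:
        col = min(int(x * grid), grid - 1)
        row = min(int(y * grid), grid - 1)
        hist[region_of[row * grid + col] * vocab_size + word] += 1
    return hist


def tiling_score(tiling: Tiling, positives: List[List[Feature]],
                 negatives: List[List[Feature]], vocab_size: int,
                 grid: int = 2) -> float:
    """Hypothetical criterion: JS divergence of the class-mean histograms."""
    def mean_hist(images: List[List[Feature]]) -> List[float]:
        hists = [spatial_bow(f, vocab_size, tiling, grid) for f in images]
        return [sum(col) / len(hists) for col in zip(*hists)]

    return js_divergence(mean_hist(positives), mean_hist(negatives))


def best_tiling(positives: List[List[Feature]],
                negatives: List[List[Feature]], vocab_size: int,
                n_regions: int, grid: int = 2) -> Tiling:
    """Pick the highest-scoring tiling among those with `n_regions` regions.

    Merging bins can never increase JS divergence, so tilings are compared
    only at a fixed region count; otherwise the finest tiling would always
    score at least as high as any coarser one.
    """
    cells = list(range(grid * grid))
    candidates = [t for t in set_partitions(cells) if len(t) == n_regions]
    return max(candidates, key=lambda t: tiling_score(
        t, positives, negatives, vocab_size, grid))
```

For example, if every positive image has its features in the top half of the frame and every negative image has them in the bottom half, `best_tiling(..., n_regions=2)` selects the top/bottom split `[[0, 1], [2, 3]]`, whose score is 1.0, whereas SPM's fixed grid would not adapt to that layout. Exhaustive enumeration is only practical for small base grids, because the number of set partitions grows as the Bell numbers.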

Information

Published In

ICMR '14: Proceedings of International Conference on Multimedia Retrieval
April 2014
564 pages
ISBN:9781450327824
DOI:10.1145/2578726
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Bag of Visual Words
  2. Feature Representation
  3. Jensen-Shannon Tiling
  4. Pooling Method
  5. SPM
  6. Spatial Pyramid

Qualifiers

  • Tutorial
  • Research
  • Refereed limited

Conference

ICMR '14
ICMR '14: International Conference on Multimedia Retrieval
April 1 - 4, 2014
Glasgow, United Kingdom

Acceptance Rates

ICMR '14 paper acceptance rate: 21 of 111 submissions (19%)
Overall acceptance rate: 254 of 830 submissions (31%)

Article Metrics

  • Downloads (last 12 months): 3
  • Downloads (last 6 weeks): 0
Reflects downloads up to 25 Dec 2024
