DOI: 10.1145/1460096.1460115
research-article

Large-scale content-based audio retrieval from text queries

Published: 30 October 2008

Abstract

In content-based audio retrieval, the goal is to find sound recordings (audio documents) based on their acoustic features. This content-based approach differs from retrieval approaches that index media files using metadata such as file names and user tags. In this paper, we propose a machine learning approach for retrieving sounds that is novel in that it (1) uses free-form text queries rather than sound-sample-based queries, (2) searches by audio content rather than via textual metadata, and (3) can scale to a very large number of audio documents and a very rich query vocabulary. We handle generic sounds, including a wide variety of sound effects, animal vocalizations, and natural scenes. We test a scalable approach based on a passive-aggressive model for image retrieval (PAMIR) and compare it to two state-of-the-art approaches: Gaussian mixture models (GMM) and support vector machines (SVM).
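The abstract does not spell out the PAMIR update, so the following is only a minimal sketch of a generic passive-aggressive ranking step, not the authors' implementation. It assumes a bilinear score q^T W d between a bag-of-words query vector q and an audio feature vector d, a hinge loss with margin 1 on (query, relevant, irrelevant) triplets, and an aggressiveness cap C. The names `score`, `pa_rank_update`, and `C` are hypothetical.

```python
def score(W, q, d):
    """Bilinear relevance score q^T W d (assumed form, not taken from the abstract)."""
    return sum(q[i] * W[i][j] * d[j]
               for i in range(len(q)) for j in range(len(d)))


def pa_rank_update(W, q, d_pos, d_neg, C=1.0):
    """One passive-aggressive step on a (query, relevant doc, irrelevant doc) triplet.

    Returns the (possibly updated) weight matrix and the hinge loss measured
    before the update. W is left unchanged when the margin is already met.
    """
    # By linearity, score(q, d_pos) - score(q, d_neg) == score(q, d_pos - d_neg).
    diff = [p - n for p, n in zip(d_pos, d_neg)]
    loss = max(0.0, 1.0 - score(W, q, diff))
    if loss == 0.0:
        return W, loss  # passive: the ranking constraint already holds

    # Squared Frobenius norm of the outer product q * diff^T.
    norm_sq = sum(x * x for x in q) * sum(x * x for x in diff)
    if norm_sq == 0.0:
        return W, loss  # degenerate triplet, nothing to update

    tau = min(C, loss / norm_sq)  # aggressive step, capped by C
    new_W = [[W[i][j] + tau * q[i] * diff[j] for j in range(len(diff))]
             for i in range(len(q))]
    return new_W, loss
```

Each step touches only one triplet and costs time proportional to the size of W, which is the kind of per-example cost that lets an online method process large collections in a single pass over sampled triplets.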
We test our approach on two large real-world datasets: a collection of short sound effects, and a noisier and larger collection of user-contributed, user-labeled recordings (25K files, 2000-term vocabulary). We find that all three methods achieved very good retrieval performance. For instance, a positive document is retrieved in the first position of the ranking more than half the time, and on average there are more than 4 positive documents in the first 10 retrieved, for both datasets. PAMIR completed both training and retrieval of all data in less than 6 hours for both datasets, on a single machine. It was one to three orders of magnitude faster than the competing approaches. This approach should therefore scale to much larger datasets in the future.
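The two figures quoted above (how often the top-ranked document is positive, and how many positives appear among the first 10) can be computed from per-query rankings as in the sketch below. The function names and the dictionary layout of `rankings` and `relevant` are assumptions made for illustration, not the paper's evaluation code.

```python
def top_k_hits(ranked_ids, relevant_ids, k):
    """Count relevant documents among the first k entries of one ranking."""
    return sum(1 for doc in ranked_ids[:k] if doc in relevant_ids)


def summarize(rankings, relevant):
    """Average the top-1 hit rate and the top-10 positive count over all queries.

    rankings: dict mapping a query string to a list of document ids, best first.
    relevant: dict mapping the same query strings to sets of positive ids.
    """
    n = len(rankings)
    top1 = sum(top_k_hits(rankings[q], relevant[q], 1) for q in rankings) / n
    top10 = sum(top_k_hits(rankings[q], relevant[q], 10) for q in rankings) / n
    return top1, top10
```

With toy data such as `rankings = {"dog bark": ["a", "b", "c"], "rain": ["x", "y", "z"]}` and `relevant = {"dog bark": {"a", "c"}, "rain": {"y"}}`, `summarize` averages one top-1 hit over two queries and three positives over two queries.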


Published In

MIR '08: Proceedings of the 1st ACM international conference on Multimedia information retrieval
October 2008
506 pages
ISBN:9781605583129
DOI:10.1145/1460096


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. content-based audio retrieval
  2. discriminative learning
  3. large scale
  4. ranking

Qualifiers

  • Research-article

Conference

MM08: ACM Multimedia Conference 2008
October 30 - 31, 2008
Vancouver, British Columbia, Canada
