
Large-scale content-based audio retrieval from text queries

Authors:

Dick Lyon

MIR '08: Proceedings of the 1st ACM international conference on Multimedia information retrieval

Pages 105 - 112

https://doi.org/10.1145/1460096.1460115

Published: 30 October 2008

Abstract

In content-based audio retrieval, the goal is to find sound recordings (audio documents) based on their acoustic features. This content-based approach differs from retrieval approaches that index media files using metadata such as file names and user tags. In this paper, we propose a machine learning approach for retrieving sounds that is novel in that it (1) uses free-form text queries rather than sound-sample-based queries, (2) searches by audio content rather than via textual metadata, and (3) can scale to a very large number of audio documents and a very rich query vocabulary. We handle generic sounds, including a wide variety of sound effects, animal vocalizations, and natural scenes. We test a scalable approach based on a passive-aggressive model for image retrieval (PAMIR), and compare it to two state-of-the-art approaches: Gaussian mixture models (GMM) and support vector machines (SVM).
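To make the passive-aggressive idea concrete, here is a minimal sketch of one ranking update on a (query, relevant sound, irrelevant sound) triplet. The abstract does not give the model's details, so everything here is an assumption for illustration: the bilinear score `q^T W a`, the margin of 1, the aggressiveness cap `C`, the sparse bag-of-words query vector, and the names `score` and `pa_rank_update` are all hypothetical rather than taken from the paper.

```python
def score(W, q, a):
    """Assumed bilinear relevance score q^T W a (query vector q, audio features a)."""
    return sum(q[i] * W[i][j] * a[j]
               for i in range(len(q)) for j in range(len(a)))


def pa_rank_update(W, q, a_pos, a_neg, C=1.0):
    """One passive-aggressive step; updates W in place.

    Returns the hinge loss measured before the update. If the relevant
    sound already outscores the irrelevant one by a margin of 1, the
    step is passive and W is left unchanged.
    """
    loss = max(0.0, 1.0 - score(W, q, a_pos) + score(W, q, a_neg))
    if loss == 0.0:
        return 0.0
    diff = [p - n for p, n in zip(a_pos, a_neg)]
    # Squared Frobenius norm of the rank-one matrix q * diff^T.
    norm_sq = sum(x * x for x in q) * sum(d * d for d in diff)
    if norm_sq == 0.0:
        return loss
    tau = min(C, loss / norm_sq)  # aggressive step size, capped at C
    for i, qi in enumerate(q):
        for j, dj in enumerate(diff):
            W[i][j] += tau * qi * dj
    return loss


# Toy data (made up): 2 query terms, 3 acoustic feature dimensions.
W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
q = [1.0, 0.0]
first_loss = pa_rank_update(W, q, [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
second_loss = pa_rank_update(W, q, [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Each step touches only the rows of `W` for terms present in the query, which is one plausible reason an online method of this kind can stay cheap as the vocabulary grows.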

We test our approach on two large real-world datasets: a collection of short sound effects, and a noisier and larger collection of user-contributed, user-labeled recordings (25K files, 2000-term vocabulary). We find that all three methods achieve very good retrieval performance. For instance, a positive document is retrieved in the first position of the ranking more than half the time, and on average there are more than 4 positive documents in the first 10 retrieved, for both datasets. PAMIR completed both training and retrieval of all data in less than 6 hours for both datasets, on a single machine. It was one to three orders of magnitude faster than the competing approaches. This approach should therefore scale to much larger datasets in the future.
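The two figures quoted above correspond to precision at rank 1 and at rank 10, averaged over queries. The sketch below shows how such numbers could be computed. The function names and the toy relevance lists are hypothetical and are not results from the paper.

```python
def precision_at_k(ranked_relevance, k):
    """Fraction of the top-k retrieved documents that are relevant.

    ranked_relevance is a list of 0/1 flags in ranking order, where 1
    marks a positive (relevant) document.
    """
    return sum(ranked_relevance[:k]) / k


def mean_precision_at_k(rankings, k):
    """Average precision-at-k over a collection of queries."""
    return sum(precision_at_k(r, k) for r in rankings) / len(rankings)


# Made-up rankings for two queries, four retrieved documents each.
toy_rankings = [
    [1, 0, 1, 0],
    [0, 1, 0, 0],
]
p_at_1 = mean_precision_at_k(toy_rankings, 1)
p_at_4 = mean_precision_at_k(toy_rankings, 4)
```

Under this reading, "a positive document in the first position more than half the time" means a mean precision at 1 above 0.5, and "more than 4 positives in the first 10" means a mean precision at 10 above 0.4.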


Published In


MIR '08: Proceedings of the 1st ACM international conference on Multimedia information retrieval

October 2008

506 pages

ISBN:9781605583129

DOI:10.1145/1460096

General Chair: Michael S. Lew, Leiden University, The Netherlands

Program Chairs: Alberto del Bimbo, University of Florence, Italy; Erwin M. Bakker, Leiden University, The Netherlands

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States

Conference

MM08: ACM Multimedia Conference 2008

October 30 - 31, 2008

Vancouver, British Columbia, Canada
