DOI: 10.1145/1027933.1027972
Article

A segment-based audio-visual speech recognizer: data collection, development, and initial experiments

Published: 13 October 2004

Abstract

This paper presents the development and evaluation of a speaker-independent audio-visual speech recognition (AVSR) system that uses a segment-based modeling strategy. To support this research, we have collected a new video corpus, called Audio-Visual TIMIT (AV-TIMIT), which consists of 4 hours of read speech from 223 speakers. This corpus was used to evaluate our AVSR system, which incorporates a novel audio-visual integration scheme using segment-constrained Hidden Markov Models (HMMs). Preliminary experiments have demonstrated improvements in phonetic recognition performance when visual information is incorporated into the speech recognition process.
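The abstract does not detail the paper's segment-constrained integration scheme. For orientation, a minimal sketch of one common AVSR baseline, feature-level (early) fusion, is shown below: visual features extracted at video rate are interpolated up to the audio frame rate and concatenated with the acoustic features. All function names, dimensions, and frame rates here are illustrative assumptions, not the authors' method.

```python
# Hypothetical early-fusion sketch for audio-visual speech features.
# Assumption: audio features (e.g. MFCCs) at 100 frames/s, visual
# (lip-region) features at 30 frames/s; this is NOT the paper's
# segment-constrained HMM scheme, just a common baseline.
import numpy as np

def fuse_features(audio, video, audio_rate=100, video_rate=30):
    """Interpolate visual features to the audio frame rate, then
    concatenate audio and visual vectors frame by frame."""
    # Timestamps (seconds) of each audio and video frame
    a_t = np.arange(audio.shape[0]) / audio_rate
    v_t = np.arange(video.shape[0]) / video_rate
    # Linearly interpolate each visual dimension onto the audio timeline
    # (np.interp holds edge values beyond the video's last frame)
    video_up = np.stack(
        [np.interp(a_t, v_t, video[:, d]) for d in range(video.shape[1])],
        axis=1,
    )
    return np.hstack([audio, video_up])

# Example: 1 second of 14-dim audio features and 10-dim visual features
audio = np.random.randn(100, 14)
video = np.random.randn(30, 10)
fused = fuse_features(audio, video)
print(fused.shape)  # → (100, 24)
```

Early fusion is simple but forces a single frame-synchronous model; decision-level or coupled-HMM fusion (as in several of the works cited by this paper) relaxes that constraint.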


    Published In

    ICMI '04: Proceedings of the 6th international conference on Multimodal interfaces
    October 2004
    368 pages
    ISBN:1581139950
    DOI:10.1145/1027933

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. audio-visual corpora
    2. audio-visual speech recognition


    Acceptance Rates

    Overall Acceptance Rate 453 of 1,080 submissions, 42%
