
Statistical mapping between articulatory movements and acoustic spectrum using a Gaussian mixture model

Published: 01 March 2008


In this paper, we describe a statistical approach to both articulatory-to-acoustic mapping and acoustic-to-articulatory inversion mapping that does not use phonetic information. The joint probability density of an articulatory parameter and an acoustic parameter is modeled with a Gaussian mixture model (GMM) trained on a parallel acoustic-articulatory speech database. We apply the GMM-based mapping under the minimum mean-square error (MMSE) criterion, originally proposed for voice conversion, to both mappings. To improve mapping performance further, we also apply maximum likelihood estimation (MLE) to the GMM-based mapping. In the MLE-based mapping, a target parameter trajectory with appropriate static and dynamic properties is determined by imposing an explicit relationship between static and dynamic features. Experimental results demonstrate that the MLE-based mapping with dynamic features significantly improves mapping performance over the MMSE-based mapping in both the articulatory-to-acoustic mapping and the inversion mapping.
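As a rough illustration of the MMSE-based mapping mentioned in the abstract, the sketch below computes the conditional expectation E[y | x] under a joint GMM over a source feature x and a target feature y. It is a minimal toy under stated assumptions, not the paper's implementation: both features are scalars (real articulatory and acoustic parameters are vectors, where the divisions become matrix inverses), the component parameters are made-up numbers, and the names `gauss_pdf`, `mmse_map`, and `toy_gmm` are hypothetical. The MLE-based mapping with dynamic features is not covered here.

```python
import math


def gauss_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mmse_map(x, components):
    """Return the MMSE estimate E[y | x] under a joint GMM over (x, y).

    Each component is a dict with hypothetical keys:
      w   : mixture weight
      mx  : mean of the source feature x
      my  : mean of the target feature y
      vxx : variance of x
      vxy : cross-covariance between x and y
    The variance of y is not needed for the MMSE point estimate.
    """
    # Posterior probability of each component given only the source feature.
    likes = [c["w"] * gauss_pdf(x, c["mx"], c["vxx"]) for c in components]
    total = sum(likes)
    posts = [like / total for like in likes]

    # Per-component conditional mean: a linear regression of y on x.
    cond_means = [
        c["my"] + c["vxy"] / c["vxx"] * (x - c["mx"]) for c in components
    ]

    # MMSE estimate: posterior-weighted sum of the conditional means.
    return sum(p * m for p, m in zip(posts, cond_means))


# Toy two-component joint model (assumed values, for illustration only).
toy_gmm = [
    {"w": 0.5, "mx": -2.0, "my": 1.0, "vxx": 1.0, "vxy": 0.5},
    {"w": 0.5, "mx": 2.0, "my": -1.0, "vxx": 1.0, "vxy": -0.5},
]

estimate = mmse_map(0.0, toy_gmm)
print(estimate)
```

Because the estimate is formed frame by frame from the current source frame alone, nothing ties consecutive outputs together. This is the limitation that the MLE-based mapping with dynamic features is meant to address.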



Published In

Speech Communication, Volume 50, Issue 3
March 2008
101 pages


Elsevier Science Publishers B. V.


Author Tags

  1. 43.70.+i
  2. Acoustic-to-articulatory inversion mapping
  3. Articulatory-to-acoustic mapping
  4. Dynamic features
  5. GMM
  6. MMSE

