
The Karlsruhe-Verbmobil speech recognition engine

1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing


THE KARLSRUHE-VERBMOBIL SPEECH RECOGNITION ENGINE

Michael Finke, Petra Geutner, Hermann Hild, Thomas Kemp, Klaus Ries, Martin Westphal
Interactive Systems Laboratories
University of Karlsruhe, Germany
Carnegie Mellon University, USA

ABSTRACT

Verbmobil, a German research project, aims at machine translation of spontaneous speech input. The ultimate goal is the development of a portable machine translator that will allow people to negotiate in their native language. Within this project the University of Karlsruhe has developed a speech recognition engine that has been evaluated on a yearly basis during the project and shows very promising speech recognition word accuracy results on large vocabulary spontaneous speech. In this paper we introduce the Janus Speech Recognition Toolkit underlying the speech recognizer. The main new contributions to the acoustic modeling part of our 1996 evaluation system - speaker normalization, channel normalization and polyphonic clustering - are discussed and evaluated. Besides the acoustic models we delineate the different language models used in our evaluation system: word trigram models interpolated with class-based models and a separate spelling language model were applied. As a result of using the toolkit and integrating all these parts into the recognition engine, the word error rate on the German Spontaneous Scheduling Task (GSST) could be decreased from 30% in 1995 to 13.8% in 1996.

1. INTRODUCTION

Verbmobil is a long-term research project aimed at machine translation of spontaneous speech input. The ultimate goal of the Verbmobil project is the development of a portable machine translator capable of assisting business people from different countries in negotiating with each other in their native language. As a first language to start with, a German speech recognition system was to be built.
In order to get a representative mix of different German dialects, data has been collected at four different sites in Germany. The domain of the system is restricted to the scheduling of meetings, but no artificial restrictions are placed upon the speakers and their speaking style. Therefore, spontaneous phenomena like noise, stuttering, restarts and non-grammatical sentences occur. All of these phenomena have to be dealt with in both the speech recognition and the machine translation component of Verbmobil. In this paper we give an overview of the Karlsruhe-Verbmobil Speech Recognition Engine - a large vocabulary spontaneous speech recognizer developed to be used as the speech recognition component in the Verbmobil speech-to-speech translation project.

2. JANUS RECOGNITION TOOLKIT

The Karlsruhe-Verbmobil Speech Recognition Engine is based on the Janus Speech Recognition Toolkit (JRTk) developed at the Interactive Systems Laboratories in Karlsruhe and at Carnegie Mellon University in Pittsburgh. This toolkit implements a new object-oriented approach. A flexible Tcl/Tk script-based environment allows building state-of-the-art multimodal recognizers - this includes speech, handwriting and gesture recognition. Unlike other toolkits, Janus is not a set of libraries and precompiled modules but a programmable shell with transparent, yet very efficient objects. Ranging from mixture-of-Gaussians hidden Markov models and hybrid neural network-HMM recognition approaches to hierarchical mixture of experts models, a large variety of recognition approaches is addressed in our group. For all those approaches, objects are available within the toolkit that serve as building blocks for applications. The underlying data structures of all these objects can be inspected and modified at script level. This makes Janus an easy-to-use testbed for new research ideas. It also offers great flexibility, allowing rapid prototyping.
The Tk component adds a graphical user interface to the recognition toolkit, thereby simplifying setting up and running demos. The toolkit passed its first test with the Janus Switchboard recognizer, which was top ranking in DARPA's spring 96 LVCSR evaluation and currently has a state-of-the-art error rate of 36% [4, 10].

3. ACOUSTIC MODELING

Currently, approximately 32 hours of labelled spontaneous speech training material is available for training the acoustic models of our speech recognition engine. We discuss in greater detail the preprocessing steps and the polyphonic modeling approach, because they can be considered the major new contributions to the 1996 system in terms of word accuracy.

3.1. Preprocessing

From the short-time spectral analysis of the 16 kHz sampled audio recordings we derive a 16 ms wide power spectrum that is calculated every 10 ms. Based on these spectral features, the further preprocessing steps can be summarized as speaker normalization, channel normalization, and speech feature extraction.

Table 1. Word error rate of the VTLN system.

  System                   WER in %
  no VTLN                  21.7
  VTLN, using reference    18.6
  VTLN, using hypothesis   19.0

Figure 1 shows the distribution of the optimal warping factors in the training set for all male and female speakers respectively. Warping factors above 1.0 correspond to frequency compression and those below 1.0 to frequency expansion. Speaker variability may also be dealt with by defining speaker clusters and training different acoustic models for them. One way of clustering is to separate male and female speakers. With gender-dependent modeling we could achieve a 2% relative decrease in WER compared to the speaker-independent non-VTLN system, assuming perfect gender detection. This means that gender-dependent modeling is outperformed by VTLN, as was also observed on SWB [9].
One explanation is that the speaker clustering and subsequent training of independent acoustic models reduced the training data for each recognizer to about half. The VTLN approach, on the other hand, aims at normalizing with respect to the speakers' vocal tract shape. With the vocal tract length normalization reducing the variability between speakers, we can build more compact models and thus make more efficient use of the acoustic parameters.

3.1.1. Speaker Normalization

One major source of interspeaker variability in automatic continuous speech recognition is the variation in vocal tract shape among speakers. Andreou et al. [1] proposed a set of maximum likelihood speaker normalization procedures to explicitly compensate for these variations. Based on the observation that the positions of the spectral formant peaks of an utterance are inversely proportional to the length of the vocal tract, these procedures reduced speaker-dependent variations between formant frequencies through a simple linear warping of the frequency axis. As a consequence, a speaker normalization step was introduced into our preprocessing. In Janus we implemented a maximum likelihood approach similar to [7], where the goal is to determine a frequency warping factor α̂ such that the warped speech signal fits best to the acoustic models. Let O_i^α be the acoustic observation vectors for utterance i warped by α based on a piecewise linear warping function as described in [9]. During both testing and training, the warp scale α is estimated by maximizing the likelihood of the utterance P(O_i^α | W_i), where W_i denotes the corresponding transcription of the utterance (for training this is the presumably correct transcription and for testing the hypothesis derived in a first search pass).
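The warp-factor selection described above can be sketched as a grid search that maximizes the likelihood of the warped utterance. The piecewise linear warping shape below (scale by α up to a cutoff frequency, then interpolate so the Nyquist edge is preserved) and the `score` callback are illustrative assumptions, not the exact parameterization of [9]:

```python
import numpy as np

def warp_frequency(f, alpha, f_max=8000.0, f_cut=6400.0):
    """Piecewise linear frequency warping: scale by alpha below f_cut,
    then interpolate linearly so that f_max always maps onto f_max."""
    f = np.asarray(f, dtype=float)
    lo = alpha * f
    hi = alpha * f_cut + (f - f_cut) * (f_max - alpha * f_cut) / (f_max - f_cut)
    return np.where(f <= f_cut, lo, hi)

def best_warp(score, grid=np.linspace(0.88, 1.12, 12)):
    """Grid search over 12 candidate factors; score(alpha) is assumed to
    return the log-likelihood P(O_i^alpha | W_i) of the warped utterance
    against the current acoustic models and transcription."""
    return max(grid, key=score)
```

In practice `score` would run a forced alignment of the warped features against the transcription (reference in training, first-pass hypothesis in testing), which is why no closed-form solution is available.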
Since it is very difficult to obtain a closed-form solution for the optimal warping factor, we used a grid search over a set of 12 different factors to determine the shape of the warping function. Experiments showed that referring to the likelihood of voiced frames only, instead of computing the likelihood of all speech frames to find the best warping factor, reduced the word error rate by 5% relative. Table 1 shows the performance of our baseline system and the system with vocal tract length normalization. VTLN reduces the error rate by 12%. Using the hypothesis of the first search pass instead of the correct transcription turned out to be almost as good as taking the reference to estimate the warping factor.

Figure 1. Distribution of warping factors.

3.1.2. Channel Normalization

Our channel normalization is a variant of the standard cepstral mean subtraction. The mean of a whole utterance, which is subtracted from each speech vector in the cepstral or log-spectral domain, is a simple estimate of the acoustic channel. Since it contains not only channel distortions but also averaged speech (which should also be removed), the estimate depends on the silence-to-speech relation. Especially with utterances containing longer pauses - as is the case in a spontaneous task like Verbmobil - we get a better and more consistent estimate when considering only speech frames to calculate the mean vector. In our Janus system this is done using a simple energy-based speech detector. Compared to the "take all frames" method, this reduced the word error rate by about 6% relative.

3.1.3. Speech Feature Extraction

For speech feature extraction we used a 30-dimensional mel-scale filterbank and derived 13 cepstral coefficients from it. The channel normalization technique is applied to this mel-cepstral feature stream.
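The speech-frame-only cepstral mean subtraction can be sketched as follows; the relative energy threshold used for the speech detector is an illustrative choice, not the value used in Janus:

```python
import numpy as np

def cepstral_mean_subtraction(cepstra, energies, rel_threshold=0.1):
    """Subtract a channel estimate from every cepstral vector.
    The mean is computed over speech frames only, selected by a simple
    energy-based detector (frames above a fraction of the peak energy);
    the 10% threshold is a hypothetical setting for illustration."""
    energies = np.asarray(energies, dtype=float)
    speech = energies > rel_threshold * energies.max()
    channel = cepstra[speech].mean(axis=0)  # channel estimate from speech frames
    return cepstra - channel
```

Because the mean is taken over speech frames only, the channel estimate no longer depends on how much silence an utterance happens to contain.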
For the normalized coefficients we added the first and second order derivatives and reduced the dimension of the input space from 39 to 32 coefficients using linear discriminant analysis.

3.2. Polyphonic Clustering

Context-dependent acoustic models have been applied in speech recognition research for many years, and have proven to increase the recognition accuracy significantly. The most common approach is to use triphones. Recently, several speech recognition groups have started investigating the use of larger phonetic context windows when building acoustic models. We also make use of a larger context in our recognizer by allowing questions in the allophonic decision tree that refer not only to the immediate neighboring phones but also to phones further away (for Verbmobil we used a context of two instead of the context of one as in the triphone setup). In a two-stage decision-tree-based clustering approach, the codebooks are clustered first and, based on the clustered codebooks, in a second step the distributions are clustered. For Verbmobil we ended up having 2500 codebooks and 10000 distributions. This clustering approach, implementing a flexible parameter tying scheme, gave us significant improvements across all tasks (WSJ, SWB, and the Spontaneous Scheduling Task) and across all languages involved (German, i.e. Verbmobil, Spanish, English) [3].

4. LANGUAGE MODELING

In terms of language model training material, the Verbmobil domain is a fairly small spontaneous speech corpus. As baseline we use a trigram backoff model with absolute discounting and non-linear interpolation. As on the much larger Switchboard corpus, long-range language models like cache models did not result in any WER reduction [10].

4.1.
Function and Content Words
In order to introduce longer-term dependencies than conventional trigrams, some linguistic constraints were introduced into our language models. The notion of function and content words [5] was used in order to predict the next word not only based on the last word pair, but also on the last function/content word pair. An improvement of 0.4% WA absolute was achieved.

4.2. Interpolation and Class-Based Models

The Verbmobil domain contains 300,000 words with a vocabulary of 6000 words, i.e. trigram backoff models are potentially not well trained. To make up for the lack of training data, word-dependent linear interpolation of the baseline language model with models built on different corpora was used. Also, class-based trigram models [6] were applied, where each word is assigned to exactly one class. We achieved a word error reduction of 0.3% absolute by interpolating the baseline with a class-based Verbmobil model and a model built on a large German newspaper corpus (FAZ).

4.3. Domain Adaptation and Phrase Models

Due to changes to the recording scenario within the course of the project, there was a small domain shift in the collected data. It seems that the unigram distribution is most influenced while the conditional class probabilities p(c_i | c_{i-1}, c_{i-2}) remain stable. So the idea is to adapt a language model to a new target corpus as

  p̂(w_i | w_{i-1}, w_{i-2}) = p̂(w_i | c_i) · p(c_i | c_{i-1}, c_{i-2})

where p̂(w | c) is estimated on the target corpus and p(c_i | c_{i-1}, c_{i-2}) on a corpus sufficiently similar to it. The classes were found by an adaptive clustering algorithm, a variant of [6] that minimizes the perplexity of the adapted bigram model. We also achieved a word accuracy improvement of around 0.5% absolute using phrases of words as the base unit of language modeling [8] on an earlier version of the system, without retraining acoustics.
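The adaptation formula above factors the trigram into a word-given-class emission (small target corpus) and a class trigram (larger, similar corpus). A minimal sketch of that factorization, with hypothetical counts and a hypothetical word-to-class map for illustration only:

```python
from collections import Counter

def adapted_trigram_prob(w, hist, word2class, emit_counts,
                         class_tri_counts, class_bi_counts):
    """p̂(w_i | w_{i-2}, w_{i-1}) = p̂(w_i | c_i) * p(c_i | c_{i-2}, c_{i-1}).
    emit_counts: (word, class) counts from the target corpus;
    class_*_counts: class n-gram counts from the larger corpus."""
    c = word2class[w]
    c_hist = tuple(word2class[h] for h in hist)
    # p̂(w | c): relative frequency of w among all words of class c
    class_total = sum(n for (wd, cl), n in emit_counts.items() if cl == c)
    p_w_given_c = emit_counts[(w, c)] / class_total
    # p(c | c_{i-2}, c_{i-1}): maximum-likelihood class trigram estimate
    p_c = class_tri_counts[c_hist + (c,)] / class_bi_counts[c_hist]
    return p_w_given_c * p_c
```

A real system would of course smooth both factors; the maximum-likelihood estimates here only show where each factor's training data comes from.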
Since the Verbmobil evaluation conditions did not allow phrases in the lattices, we did not apply it. In the final evaluation model, the domain adaptation technique was, for conservative reasons, applied only to adapt the newspaper corpus to Verbmobil.

Table 2. Language model experiments.

  Model                                    PP      WA
  Standard Trigram                         42.26   78.8
  Standard Class Trigram                   40.40   78.8
  Class Trigram, p(w|c) adapted            40.33   79.2
  same, but adaptive clustering            40.16   79.5
  + Std. Trigram on small corpus           38.61   79.8
  + Std. Class Trigram on small corpus     38.39   79.7
  + Std. Trigram                           38.29   79.6
  same, but no adaptive clustering         38.75   -
  same, but p(w|c) not adapted             38.80   -

4.4. Integrating Spelling Sequences

A further difficulty in the Verbmobil corpus is the presence of spelling sequences. If a language model is directly computed from the text corpora over the recognition vocabulary V = V_W ∪ V_L of words V_W = {w_1, ..., w_N} and letters V_L = {A, ..., Z} = {L_1, ..., L_M}, transitions both within as well as into or out of the letter sequences will be poorly modeled due to the small number (a few hundred) of available spelling examples. To allow for a more robust recognition, we can assume a letter sequence and its embedding text to be independent and collapse all letter sequences to an equivalence class LS. A new language model LM_W is then computed over the vocabulary V_W ∪ {LS}, resulting in transitions P_W(LS | w_i) and P_W(w_j | LS) instead of the less robust estimates for P(L_j | w_i) and P(w_j | L_i). The letter bigrams P_L(L_j | L_i) within a sequence can be modeled by a separate, independent "sublanguage model" LM_L. The final language model LM recombines LM_W and LM_L; for example, P(L_j | w_i) = P_W(LS | w_i) · P_L(L_j | <s>). Depending on the task, LM_L can be computed from a separate text source or an equal distribution can be assumed. In addition, a duration model can be implemented as illustrated in Figure 2, where a length of one, two, or more letters is explicitly modeled with probabilities 1/4, 3/4 · 2/3 = 1/2, and 3/4 · 1/3 · Σ_{i=0}^∞ (3/4)^i · 1/4 = 1/4, respectively.

Figure 2. Duration modeling for letter sequences.
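The three length probabilities above follow from a three-state chain in which state 1 exits with probability 1/4, state 2 exits with probability 2/3, and state 3 self-loops with probability 3/4 before exiting with probability 1/4. A small sketch checking that arithmetic with exact rationals:

```python
from fractions import Fraction as F

def length_probs(n_terms=200):
    """Length distribution of the letter-sequence duration model of Figure 2.
    The geometric sum for length >= 3 is truncated at n_terms, mirroring
    the infinite sum in the text."""
    p_len1 = F(1, 4)                      # exit state 1 immediately
    p_len2 = F(3, 4) * F(2, 3)            # reach state 2, then exit
    # reach state 3 (3/4 * 1/3), self-loop i times (3/4 each), exit (1/4)
    p_len3_plus = F(3, 4) * F(1, 3) * sum(F(3, 4) ** i * F(1, 4)
                                          for i in range(n_terms))
    return p_len1, p_len2, p_len3_plus
```

The truncated tail converges to exactly 1/4, so the three probabilities sum to one, as a duration distribution must.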
For the final evaluation, we used a length model with five states which discouraged one- and two-letter sequences in order to avoid false letter insertions. As most of the spellings in the training material were four-letter acronyms, the length model was adjusted to encourage sequences of this length. All bigrams in LM_L were considered equally likely. The trigrams found in the training texts were added to LM_L to account for the repetitive occurrence of some of the acronyms. With these measures, the recognition rate measured on the letter sequences improved from 89% to 92% on our development test set, which resulted in a relative overall improvement of 1%.

5. EVALUATION

To assess and evaluate the performance of each of the components of Verbmobil, evaluations are conducted on a yearly basis. The evaluations are run by the University of Braunschweig, and the test data is chosen independently by LMU in Munich. The evaluation rules required one mandatory test under exactly the same conditions from every participant; a suite of other test conditions was optional. In the 1996 mandatory test, the language model was constrained to a bigram with a test set perplexity of about 54, the training material was restricted to the official Verbmobil database, and the vocabulary size was 5300 words. 343 utterances (43 minutes of speech) were chosen as test set. Approximately one half of them originated from female speakers. The speakers came from different locations throughout Germany, thereby providing a representative mixture of German dialects. The trigram test set perplexity was about 40. All types of spontaneous speech effects like noise, restarts, hesitations, etc. were present in the testing material. There were four other sites participating in the evaluation in 1996: Daimler-Benz and the universities of Munich, Erlangen, and Hamburg. The evaluation results are given in Table 3.
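The word error rates reported throughout are the standard edit-distance measure: substitutions, deletions, and insertions from a Levenshtein alignment of hypothesis against reference, divided by the reference length. A minimal sketch of that computation (the actual Verbmobil scoring tool is not described in the paper):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed by Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)
```

Word accuracy (WA), as used in the language model experiments, is simply 1 minus this quantity, expressed in percent.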
Only one optional test (where no limits were imposed on the algorithms and databases used for recognition) was done by more than one institution. Therefore, we only report the results of the mandatory test and the unrestricted optional one.

Table 3. Results of the 1996 Verbmobil speech recognition evaluation.

  Site             Error rate          Error rate
                   (mandatory test)    (optional test)
  Daimler-Benz     21.7%               -
  Univ. Erlangen   24.5%               -
  Univ. Hamburg    20.0%               -
  JANUS            16.1%               13.2%
  Univ. Munich     25.2%               -

6. CONCLUSION

As can be seen in Table 4, steady progress in performance has been achieved in the Verbmobil system during the last three years. In this paper we have described several techniques, including improved acoustic modeling and better language models, which reduced the word error rate to less than half of the 1995 result.

Table 4. Error rates of JANUS over the last three years.

  Year   Error Rate
  1994   54.2%
  1995   30.0%
  1996   13.8%

7. ACKNOWLEDGEMENTS

This research was partly funded by grant 413-400101IV101S3 from the German Ministry of Science and Technology (BMBF) as a part of the Verbmobil project. The JANUS project was supported in part by the Advanced Research Projects Agency and the US Department of Defense. The authors wish to thank all members of the Interactive Systems Labs for useful discussions and active support.

REFERENCES

[1] A. Andreou, T. Kamm, and J. Cohen. Experiments in Vocal Tract Normalization. In Proceedings of the CAIP Workshop: Frontiers in Speech Recognition II, 1994.
[2] Ellen Eide and Herbert Gish. A Parametric Approach to Vocal Tract Length Normalization. In IEEE International Conference on Acoustics, Speech, and Signal Processing, Atlanta, 1996.
[3] Michael Finke and Ivica Rogina. Wide Context Acoustic Modeling in Read vs. Spontaneous Speech. In IEEE International Conference on Acoustics, Speech, and Signal Processing, Munich, Germany, 1997.
[4] Michael Finke, Torsten Zeppenfeld, Martin Maier, Laura Mayfield, Klaus Ries, Puming Zhan, John Lafferty, and Alex Waibel. Switchboard April 1996 Evaluation Report. In Proceedings of the LVCSR Hub 5 Workshop, April 1996.
[5] Petra Geutner. Introducing Linguistic Constraints into Statistical Language Modeling. In Proceedings of the 1996 International Conference on Spoken Language Processing (ICSLP), Philadelphia, Pennsylvania, pages 402-405, October 1996.
[6] Reinhard Kneser and Hermann Ney. Improved Clustering Techniques for Class-Based Statistical Language Modeling. In Eurospeech, Berlin, Germany, 1993.
[7] Li Lee and Richard C. Rose. Speaker Normalization using Efficient Frequency Warping Procedures. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 353-356, Atlanta, 1996.
[8] Klaus Ries, Finn Dag Buø, and Alex Waibel. Class Phrase Models for Language Modelling. In International Conference on Spoken Language Processing, Philadelphia, USA, 1996.
[9] Steven Wegmann, Don McAllaster, Jeremy Orloff, and Barbara Peskin. Speaker Normalization on Conversational Telephone Speech. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 339-341, Atlanta, 1996.
[10] Torsten Zeppenfeld, Michael Finke, Klaus Ries, Martin Westphal, and Alex Waibel. Recognition of Conversational Telephone Speech using the JANUS Speech Engine. In IEEE International Conference on Acoustics, Speech, and Signal Processing, Munich, Germany, 1997.