Festival-si: A Sinhala Text-to-Speech System

Ruvan Weerasinghe, Asanka Wasala, Viraj Welgama and Kumudu Gamage
Language Technology Research Laboratory, University of Colombo School of Computing, Colombo, Sri Lanka
[email protected], [email protected], [email protected], [email protected]

Abstract

This paper brings together the development of the first Text-to-Speech (TTS) system for Sinhala using the Festival framework and practical applications of it. The construction of a diphone database and the implementation of the natural language processing modules are described. The paper also presents the methodology for direct Sinhala Unicode text input, achieved by rewriting letter-to-sound rules in Festival's context-sensitive rule format, and the implementation of a Sinhala syllabification algorithm. A Modified Rhyme Test (MRT) was conducted to evaluate the intelligibility of the synthesized speech and yielded a score of 71.5% for the TTS system described.

1. Introduction

A Text-to-Speech (TTS) synthesizer is a computer-based system capable of converting machine-readable text into speech. The conversion of text to speech involves several important processes, which can be divided into three main stages: text analysis, linguistic analysis and waveform generation [1]. The text analysis stage is responsible for converting non-textual content into text. This stage also involves tokenization and normalization of the text, i.e. the identification of words or chunks of text. Text normalization establishes the correct interpretation of the input text by expanding abbreviations and acronyms. This is done by replacing non-alphabetic characters, numbers, and punctuation with appropriate text strings depending on the context. The linguistic analysis stage involves finding the correct pronunciation of words and assigning prosodic features (e.g. phrasing, intonation, stress) to the phonemic string to be spoken.
The final stage of a TTS system is waveform generation, which involves the production of an acoustic digital signal using a particular synthesis approach such as formant synthesis, articulatory synthesis or waveform concatenation [6]. The text analysis and linguistic analysis stages together are known as the Natural Language Processing (NLP) component, while the waveform generation stage is known as the Digital Signal Processing (DSP) component of a TTS system.

In this paper, we describe the implementation and evaluation of a Sinhala text-to-speech system based on the diphone concatenation approach. The Festival framework [1] was chosen for implementing the Sinhala TTS system. The Festival Speech Synthesis System is an open-source, stable and portable multilingual speech synthesis framework developed at the Centre for Speech Technology Research (CSTR) of the University of Edinburgh. TTS systems have been developed using the Festival framework for many languages, including English, Japanese [1], Welsh [12], [2], Turkish [9], Hindi [5], [8] and Telugu [3], [5], among others. However, no serious Sinhala speech synthesizer had been developed thus far; this is the first known documented work on a Sinhala text-to-speech synthesizer. The system is named "Festival-si" in accordance with common practice.

The rest of this paper is organized as follows: Section 2 gives an overview of the Sinhala phonemic inventory and writing system; Section 3 explains the diphone database construction process; the implementation of the natural language processing modules is explained in Section 4. Section 5 discusses potential applications, while Section 6 presents an evaluation of the current system. The work is summarized, and future research directions and improvements are discussed, in the last section.

2. Sinhala phonemic inventory and writing system

2.1.
The Sinhala phonemic inventory

Sinhala is one of the official languages of Sri Lanka and the mother tongue of the majority (74%) of its population. Spoken Sinhala contains 40 segmental phonemes: 14 vowels and 26 consonants, as classified in Table 1 and Table 2 [4].

Table 1: Spoken Sinhala vowel classification

              Front        Central      Back
              Short  Long  Short  Long  Short  Long
    High      i      i:
    Mid       e      e:    ə      ə:    o      o:
    Low       æ      æ:    a      a:
    (Back high: u, u:)

Table 2: Spoken Sinhala consonant classification¹

                              Lab.  Den.  Alv.  Ret.  Pal.  Vel.  Glo.
    Stops        Unvoiced     p     t           ʈ           k
                 Voiced       b     d           ɖ           g
    Pre-nasalized voiced      ᵐb    ⁿd          ᶯɖ          ᵑg
    Affricates   Unvoiced                             c
                 Voiced                               ɟ
    Nasals                    m           n           ɲ     ŋ
    Trill                                 r
    Lateral                               l
    Fricatives   Unvoiced     f           s           ʃ           h
    Approximants              v                       j

Table 3: Sinhala character set: the vowels with their corresponding vowel modifiers, the consonants, the special symbols, and the inherent vowel remover (hal marker) [the Sinhala glyphs are not reproducible here]

There are four nasalized vowels, occurring in only two or three words in Sinhala: /ã/, /ã:/, /æ̃/ and /æ̃:/ [4]. Spoken Sinhala also has the following diphthongs: /æi/, /iu/, /eu/, /æu/, /ou/, /au/, /ui/, /ei/, /oi/ and /ai/. A separate sign for the vowel /ə/ is not provided by the Sinhala writing system. In terms of distribution, the vowel /ə/ does not occur at the beginning of a syllable except in the conjugational variants of verbs formed from the verbal stem /kərə/ (to do). In contrast, though the letter symbolizing the prenasalized consonant /nȴ/ exists, it is not considered a phoneme in Sinhala.

2.2. The Sinhala writing system

The Sinhala character set has 18 vowels and 42 consonants, as shown in Table 3. Sinhala characters are written left to right in horizontal lines, and words are in general delimited by a space. Vowels have corresponding full-character forms when they appear in the absolute initial position of a word.
In other positions, they appear as 'strokes' (vowel modifiers) used together with consonants. All vowels except /iru:/ can occur in word-initial position. The vowels /ə/ and /ə:/ occur only in loan words of English origin; since there are no special symbols to represent them, another vowel character is frequently used to symbolize them [4]. All consonants occur in word-initial position except the Anusvāraya /ŋ/ and the nasals. Separate symbols represent the retroflex nasal /ɳ/ and the retroflex lateral /ɭ/, but they are pronounced as their respective alveolar counterparts /n/ and /l/. Similarly, the symbol representing the retroflex sibilant /ʂ/ is pronounced as the palatal sibilant /ʃ/. The aspirated counterparts of ten of the consonant letters are pronounced like the corresponding unaspirated letters [4]. When consonants are combined with /r/ or /j/, special conjunct symbols are used: /r/ immediately following a consonant can be marked by a special symbol added to the bottom of the preceding consonant, and /j/ immediately following a consonant can be marked by a special symbol added to the right-hand side of the preceding consonant [4]. The vowels /ilu/ and /ilu:/ do not occur in contemporary Sinhala. Thus, only 40 phonemes are necessary to represent the 60 symbols in the language.

¹ Lab. = Labial, Den. = Dental, Alv. = Alveolar, Ret. = Retroflex, Pal. = Palatal, Vel. = Velar, Glo. = Glottal

3. Diphone database construction

Designing, recording, and labeling a complete diphone database is a laborious and time-consuming task, and the overall quality of the synthesized speech is heavily dependent on the quality of the diphone database. This section describes the methodology adopted in the construction of the Sinhala diphone database. Prior to constructing the database, the answers to the following two questions were investigated [9]:

- What diphone pairs exist in the language?
- What carrier words should be used?

Generally, the number of diphones in a language is roughly the square of the number of phones. The 40 phonemes identified for Sinhala in Section 2.1 therefore suggest that roughly 1,600 diphones should exist. The first phase involved the preparation of matrices mapping all possible combinations of consonants and vowels, i.e. CV, VC, VV, CC, _V, _C, C_ and V_, where '_' denotes a short period of silence. Silence is also treated as a phoneme, usually taken at the beginning and end of utterances to match the silences occurring before, between and after words; silence diphones are therefore an important part of the diphone inventory. In the second phase, redundant diphones were marked for omission from the recording: owing to various phonotactic constraints, not all phone-phone pairs actually occur (for instance, the diphone "mb-ŋg" never occurs in Sinhala). All such non-existent diphones were identified in consultation with a linguist, leaving 1,413 diphones. The third phase involved answering the second question: what carrier words should be used? In other words, a set of words had to be compiled, each containing one target diphone. Following the guidelines given in the Festvox manual [1], it was decided to record nonsense words containing the targeted diphones, embedded in carrier sentences consisting of four other nonsensical context words. Care was taken when coining these nonsense words so that they conform to the phonotactics of the Sinhala language. Each diphone is extracted from the middle syllable of the middle word, minimizing articulatory effects at the start and end of the word. The use of nonsense words also helped the speaker maintain a neutral prosodic context. The output of the third phase was 1,413 sentences. The fourth phase involved recording the sentences.
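The matrix-preparation step of the first phase can be sketched as follows. This is a minimal sketch, not the project's actual script: the phone lists are an abbreviated subset of the Sinhala inventory, and the exclusion set stands in for the list vetted by the linguist.

```python
# Sketch of diphone matrix preparation (illustrative subset of the
# Sinhala phone set; '_' denotes silence).
vowels = ["a", "a:", "i", "i:", "u", "u:", "e", "e:", "o", "o:"]
consonants = ["p", "b", "t", "d", "k", "g", "m", "n", "r", "l", "s", "h"]
silence = ["_"]

def diphone_inventory(vowels, consonants, excluded=frozenset()):
    """Build CV, VC, VV, CC, _V, _C, C_ and V_ diphone pairs,
    dropping pairs ruled out by phonotactic constraints."""
    phones = vowels + consonants + silence
    pairs = set()
    for left in phones:
        for right in phones:
            if left == "_" and right == "_":
                continue  # silence-silence carries no transition
            pairs.add((left, right))
    return sorted(pairs - excluded)

# Hypothetical exclusions standing in for the linguist-vetted list.
excluded = {("m", "h"), ("n", "h")}
inventory = diphone_inventory(vowels, consonants, excluded)
print(len(inventory))  # 23*23 - 1 - 2 = 526 for this subset
```

With the full 40-phone inventory the same grid yields roughly 1,600 candidate diphones, which phonotactic filtering reduced to the 1,413 actually recorded.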
A native professional male speaker chosen for the recording practiced the pronunciation of the sentences for intelligibility. He was advised to maintain a constant pitch, constant volume, and a fairly constant speech rate during the recording. To help maintain all of these aspects, recordings were limited to two 30-minute sessions per day, with about 100 sentences recorded per session. Recording was done in a professional studio in an optimally noise-free environment. The sentences were initially recorded to Digital Audio Tape and later transferred to wave files, re-digitized at 44.1 kHz with 16-bit quantization. The most tedious and painstaking tasks were carried out in the fifth phase, in which the recordings were split into individual files and the diphone boundaries were hand-labeled using the speech analysis tool Praat². A script was then written to transform the Praat TextGrid collection into the diphone index file (EST format) required by Festival [1]. The synthesis method used in this project is Residual Excited Linear Predictive (RELP) coding. As required by this method, pitch marks, Linear Predictive Coding (LPC) parameters and LPC residual values had to be extracted for each diphone in the database. The script make_pm_wave provided by the Edinburgh Speech Tools [1] was used to extract pitch marks from the wave files; the make_lpc command was then invoked to compute LPC coefficients and residuals [1]. After test-synthesizing different diphones, several were identified as problematic. An analysis of the errors revealed that most were due to incorrect pitch marking caused by the use of default parameters when extracting the pitch marks. More accurate parameters, obtained by analyzing samples of the speech with Praat, were set in the scripts as per the guidelines given in [7]. Moreover, it was found that lowering the pitch of the original wave files resulted in more lucid speech.
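The TextGrid-to-index conversion step can be sketched roughly as below. This is an illustrative stand-in, not the project's script: the input is a simplified (label, start, mid, end) listing rather than a real Praat TextGrid, and the header fields approximate rather than reproduce the exact EST index syntax.

```python
# Sketch of turning hand-labeled diphone boundaries into an index
# listing of the kind consumed by Festival's diphone synthesizer.
def build_index(basename, intervals):
    """intervals: list of (diphone_label, start_s, mid_s, end_s) tuples,
    typically extracted from hand-labeled Praat annotations."""
    lines = ["EST_File index", "DataType ascii",
             f"NumEntries {len(intervals)}", "EST_Header_End"]
    for label, start, mid, end in intervals:
        # One entry per diphone: label, source wave file, and the three
        # boundary times (diphone start, phone-phone join, diphone end).
        lines.append(f"{label} {basename} {start:.3f} {mid:.3f} {end:.3f}")
    return "\n".join(lines)

entries = [("k-a", 0.412, 0.468, 0.533), ("a-t", 0.533, 0.581, 0.640)]
print(build_index("sin0001", entries))
```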
A proprietary software tool was used to lower the recorded pitch and to normalize the recordings in terms of power, so that all diphones had approximately equal power. Subsequently, modified make_pm_wave and make_lpc scripts were used to extract the necessary parameters from the wave files. These post-processing steps significantly improved the voice quality. The final step was to group the diphone database to make it accessible to Festival's UniSyn synthesizer module and ready for distribution. Preparing the complete diphone inventory took roughly three months. A full listing of the scripts used for recording and creating the diphone database is available for download from http://www.ucsc.cmb.ac.lk/ltrl/si.

² Available from: http://www.praat.org

4. Natural language processing modules

When building a new voice using Festvox [1], templates of the natural language processing modules required by Festival are automatically generated as Scheme files. These NLP modules must then be customized to the requirements of the language. The language-specific scripts (phone set, lexicon, tokenization) and speaker-specific scripts (duration and intonation) can be externally configured and implemented without recompiling the system [1], [9]. The NLP-related tasks involved in building a new voice are [1], [5]:

- Defining the phone set of the language
- Tokenization and text normalization
- Incorporation of letter-to-sound rules
- Incorporation of syllabification rules
- Assignment of stress patterns to the syllables in a word
- Phrase breaking
- Assignment of durations to phones
- Generation of the f0 contour

4.1. The phone set definition

The phone set identified for Sinhala in Section 2.1 is implemented in the file festvox/ucsc_sin_sdn_phoneset.scm. This module defines the phones and describes their features. The proposed symbol scheme has proved to be a versatile representation for the Sinhala phone set.
Along with the phone symbols, features such as vowel height, place of articulation and voicing are defined. In addition to the default feature set, new features useful for describing Sinhala phones are also defined, e.g. whether a consonant is prenasalized or not. These features will prove extremely useful when implementing prosody.

4.2. Tokenization and text normalization

The default text tokenization methodology implemented in Festival, based on whitespace and punctuation characters, is used to tokenize Sinhala text. Once the text has been tokenized, text normalization is carried out. This step converts digits, numerals, abbreviations, and non-alphabetic characters into word sequences depending on the context. Text normalization is a non-trivial task; therefore, prior to implementation, it was decided to analyze running text obtained from a corpus. Text from the category "News Paper > Feature Articles > Other" of the UCSC Sinhala Corpus BETA was chosen, owing to the heterogeneous nature of these texts and hence the better representation of the language in this section of the corpus³. A script was written to extract sentences containing digits from the text corpus; the issues were identified by thoroughly analyzing these sentences, and strategies to address them were devised. A function was implemented to convert any number (decimal or integer, up to 1 billion) into spoken words. In Sinhala, the conversion of common numbers is arguably more complicated than in English. In certain numerical expressions, the reading of the number depends on a word suffix, e.g. 5 with a suffix read as pahen (out of five), or 1 with a suffix read as paləvæni (1st). Such expressions need to be identified, taking the added suffix into account, in a post-processing module. Another function was implemented to expand abbreviations into full words. Common abbreviations found in the corpus analysis are listed, but our architecture allows easy incorporation of new abbreviations and their corresponding words.
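The shape of such a number-expansion function can be sketched as follows. The romanized word forms are illustrative placeholders only (the real module emits Sinhala script), and the handling of tens is deliberately simplified; real Sinhala numerals have many irregular forms.

```python
# Sketch of number expansion, with romanized Sinhala-like word forms
# used purely as placeholders for the system's actual Sinhala output.
UNITS = {0: "binduva", 1: "eka", 2: "deka", 3: "tuna", 4: "hatara",
         5: "paha", 6: "haya", 7: "hata", 8: "ata", 9: "navaya"}
SCALES = [(1_000_000, "miliyana"), (1_000, "dahasa"), (100, "siyaya")]

def number_to_words(n):
    """Expand a non-negative integer into space-separated words
    (simplified; the tens decomposition below is a placeholder)."""
    if n < 10:
        return UNITS[n]
    for value, name in SCALES:
        if n >= value:
            head = number_to_words(n // value)
            rest = n % value
            words = f"{head} {name}"
            return words if rest == 0 else f"{words} {number_to_words(rest)}"
    # 10..99: placeholder decomposition, not real Sinhala morphology
    return f"{UNITS[n // 10]} dahaya" + ("" if n % 10 == 0
                                         else f" {UNITS[n % 10]}")

print(number_to_words(205))  # "deka siyaya paha"
```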
In some situations, the word order also has to be changed. For example, 50% must be expanded as siːjəjəʈə panəhə ("percent fifty"), and 50m should be expanded as miːʈər panəhə ("meters fifty"), with the unit read before the number. All of the above functions are invoked after analyzing the context, so that accurate expansions are obtained. The tokenization and text normalization modules are implemented in festvox/ucsc_sin_sdn_tokenizer.scm and are capable of normalizing elements such as numbers, currency symbols, ratios, percentages, abbreviations, Roman numerals, time expressions, number ranges, telephone numbers, email addresses, English letters and various other symbols.

³ This category accounts for almost two-thirds of the size of this version of the corpus.

4.3. Letter-to-sound conversion

The letter-to-sound module converts orthographic text into its corresponding phonetic representation. Sinhala, being a phonetic language, has an almost one-to-one mapping between letters and phonemes. We implemented the grapheme-to-phoneme (G2P) conversion architecture proposed by Wasala et al. [10]. In that architecture, the UTF-8 text input is converted into the ASCII-based phonetic representation defined in Festival, with the conversion taking place at the user-interface level. Owing to the considerable delay experienced when synthesizing text this way, it was decided to rewrite the G2P rules in Festival's context-sensitive rule format [1]. The rules were rewritten over the UTF-8 multi-byte representation, following the work done for the Telugu language [3]. This method proved to work well, introducing no noticeable delay. The 8 rules proposed in [10] expanded to 817 rules when rewritten in the context-sensitive format. However, some frequently encountered words were still phonetized incorrectly by these rules; such words, along with their correct pronunciations, were therefore included in Festival's addenda, a part of the lexicon. The letter-to-sound rules and lexicon are implemented in festvox/ucsc_sin_lexi.scm.
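The behaviour of context-sensitive letter-to-sound rules can be illustrated with a tiny interpreter in the spirit of Festival's (LEFT [ target ] RIGHT = output) rule shape. This sketch works on romanized text with three made-up rules; the actual Festival-si rules match UTF-8 byte sequences of Sinhala script and number 817.

```python
# Sketch of context-sensitive letter-to-sound rule application.
# Each rule: (left context, target, right context, output phones).
# The three rules below are illustrative, loosely echoing the schwa
# epenthesis behaviour described in [10]; '@' stands for the schwa.
RULES = [
    ("", "k", "a", ["k"]),      # k before a  -> /k/
    ("", "a", "", ["a"]),       # plain a     -> /a/
    ("", "k", "", ["k", "@"]),  # bare k      -> /k/ plus epenthetic schwa
]

def letter_to_sound(word):
    """Greedily apply the first rule whose target and contexts match."""
    phones, i = [], 0
    while i < len(word):
        for left, target, right, output in RULES:
            if (word.startswith(target, i)
                    and word[:i].endswith(left)
                    and word.startswith(right, i + len(target))):
                phones.extend(output)
                i += len(target)
                break
        else:
            i += 1  # unmatched character: skip (real rule sets are exhaustive)
    return phones

print(letter_to_sound("kak"))  # ['k', 'a', 'k', '@']
```

Ordering matters: the more specific context rule must precede the general fallback, which is also how Festival resolves its rule lists.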
Festival's UTF-8 support is still incomplete; nevertheless, we believe the above architecture to be the best way to handle Unicode text input in Festival among the proposed methods [12], [10].

4.4. Syllabification and stress assignment

Instead of Festival's default syllabification function lex.syllabify.phstress, which is based on the sonority sequencing principle [1], a new function (syllabify 'phones) was implemented to syllabify Sinhala words. This work implements the syllabification algorithm proposed by Weerasinghe et al. [11], which is reported to have 99.95% accuracy [11]. The syllabification module is implemented in festvox/ucsc_sin_sdn_syl.scm.

4.5. Phrase breaking algorithm

The assignment of intonational phrase breaks to the utterances to be spoken is an important task in a text-to-speech system: the presence of phrase breaks in the proper positions of an utterance affects the meaning, naturalness and intelligibility of the speech. There are two methods for predicting phrase breaks in Festival. The first is to define a Classification and Regression Tree (CART). The second, more elaborate method is to implement a probabilistic model using the probability of a break after a word, based on the parts of speech of the neighbouring words and the previous word [1]. However, due to the unavailability of a Part-of-Speech (POS) tag set and a POS tagger for Sinhala, a probabilistic model cannot yet be constructed. We therefore opted for the simple CART-based phrase breaking algorithm described in [1]. The algorithm is based on the assumption that phrase boundaries are more likely between a content word and a function word. A rule predicts a break if the current word is a content word, the next is seemingly a function word, and the current word is more than 5 words from a punctuation symbol. This algorithm, initially developed for English, has proved to produce reasonable results for Sinhala as well.
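The CART-style rule just described can be sketched procedurally as below. The function-word list is a hypothetical stand-in (romanized for readability), not the list used by Festival-si, and the punctuation bookkeeping is simplified.

```python
# Sketch of the content-word/function-word phrase-break rule: predict a
# break after a content word that precedes a function word, provided the
# current word is more than 5 words from the last punctuation symbol.
FUNCTION_WORDS = {"saha", "ho:", "nam", "da"}   # illustrative stand-ins
PUNCTUATION = {".", ",", "?", "!"}

def predict_breaks(tokens):
    """Return one boolean per token: break after this token?"""
    breaks = []
    since_punct = 0
    for i, tok in enumerate(tokens):
        if tok in PUNCTUATION:
            breaks.append(True)   # punctuation always marks a break
            since_punct = 0
            continue
        since_punct += 1
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        is_content = tok not in FUNCTION_WORDS
        breaks.append(is_content and nxt in FUNCTION_WORDS
                      and since_punct > 5)
    return breaks

toks = ["w1", "w2", "w3", "w4", "w5", "w6", "saha", "w7", "."]
print(predict_breaks(toks))  # break predicted only after w6 and "."
```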
The phrasing algorithm is defined in festvox/ucsc_sin_sdn_phrase.scm.

4.6. Prosodic analysis

Prosodic analysis is minimal in the current system and will be extended in the future. The major challenge in building prosody models for Sinhala is the lack of a POS tag set, a POS tagger and a tagged text corpus. An experiment was carried out to adapt the CART trees generated for an English voice's prosody (f0 and duration) modules to Sinhala. The CART trees were carefully modified to represent the Sinhala phone set, and the phone duration values were hand-modified to reflect natural phone durations. These steps resulted in noticeably more natural speech compared with the monotonous speech produced before incorporating them. The adapted modules (cmu_us_kal_dur.scm, cmu_us_kal_int.scm) are incorporated into the Festival-si system.

5. Integration with different platforms

Festival offers a powerful platform for the development and deployment of speech synthesis systems. Since most Linux distributions now come with Festival pre-installed, integrating the Sinhala voice on such platforms is very convenient. Furthermore, following the work done for Festival-te, the Festival Telugu voice [3], the Sinhala voice developed here was made accessible to GNOME Orca⁴ and Gnopernicus⁵, powerful assistive screen-reader applications for people with visual impairments. The next task involved building Festival support natively on Windows. A noteworthy part of this task was a modification to Festival's text reading module. We found that Festival is incapable of reading UTF-8 text files that begin with a byte-order marker (BOM). UTF-8 text files saved by Windows Notepad or OpenOffice Writer always include the byte-order marker; Festival fails to read such files and gives an "Un-pronunciation word" error. Manually removing the BOM from each file proved to be a repetitive process.
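The workaround amounts to stripping the three BOM bytes before Festival sees the file. The actual fix patches Festival's file reader in C++; this minimal stand-alone filter illustrates the same idea.

```python
# Sketch of the BOM workaround: strip a leading UTF-8 byte-order marker
# (EF BB BF) from file contents before handing the text to Festival.
BOM = b"\xef\xbb\xbf"

def strip_bom(data: bytes) -> bytes:
    """Remove a single leading UTF-8 BOM, if present."""
    return data[len(BOM):] if data.startswith(BOM) else data

raw = BOM + "සිංහල".encode("utf-8")   # as saved by Windows Notepad
clean = strip_bom(raw)
print(clean.decode("utf-8"))          # the text without the marker
```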
Therefore, a short patch was written to make Festival capable of reading text files with a UTF-8 byte-order marker. The patch is available for download from http://www.ucsc.cmb.ac.lk/ltrl/si.

Motivated by the work carried out in the Welsh & Irish Speech Processing Resources (WISPR) project [12], steps were taken to integrate Festival, along with the Sinhala voice, into the Microsoft Speech Application Programming Interface (MS-SAPI), which provides the standard speech synthesis and speech recognition interface within Windows applications [13]. As a result of this work, the MS-SAPI-compliant Sinhala voice is accessible via any speech-enabled Windows application. We believe that the visually impaired community will benefit greatly from this exercise, owing to the prevalent use of Windows in that community. The Sinhala voice also proved to work well with Thunder⁶, a freely available screen reader for Windows. This will cater to the vast demand for a screen reader capable of speaking Sinhala text. It is noteworthy that, for the first time, the print-disabled community in Sri Lanka will be able to work on computers in their local language by using the current Sinhala text-to-speech system.

⁴ Available from: http://live.gnome.org/Orca
⁵ Available from: http://www.baum.ro/gnopernicus.html

6. Evaluation

Text-to-speech systems can be compared and evaluated with respect to intelligibility, naturalness, and suitability for the intended application [6]. As the Sinhala TTS system is a general-purpose synthesizer, a decision was made to evaluate it under the intelligibility criterion. A Modified Rhyme Test (MRT) [6], [9] was designed to test the Sinhala TTS system. The test consists of 50 sets of 6 one- or two-syllable words, a total of 300 words. The words are chosen to evaluate phonetic characteristics such as voicing, nasality, sibilation, and consonant gemination. Out of the 50 sets, 20 sets were selected for each listener. Each set of 6 words is played one word at a time, and the listener marks the synthesized word. The overall intelligibility of the system, measured over 20 listeners, is found to be 71.5%. This is the first known documented work on a Sinhala text-to-speech synthesizer, and also the first Sinhala TTS system to have been formally evaluated.

7. Conclusions and future work

In this paper we described the development and evaluation of the first TTS system for the Sinhala language based on the Festival architecture. The design of a diphone database and the natural language processing modules developed have been described. Future work will mainly focus on improving the naturalness of the synthesizer. Work is in progress to improve the prosody modules: a speech corpus containing 2 hours of speech has already been recorded, and the material is currently being segmented and labeled. We are also planning to improve the duration model using the data obtained from the annotated speech corpus. A number of other ongoing projects are aimed at developing a POS tag set, a POS tagger and a tagged corpus for Sinhala. Further work will focus on expanding the pronunciation lexicon. At present, the G2P rules are incapable of providing accurate pronunciations for most compound words. We therefore plan to construct a lexicon of compound words, along with common high-frequency words found in our Sinhala text corpus which are currently incorrectly phonetized. All resources developed under this project are made available at: http://www.ucsc.cmb.ac.lk/ltrl/si.

8. Acknowledgements

This work was made possible through the PAN Localization Project (http://www.PANL10n.net), a grant from the International Development Research Center (IDRC), Ottawa, Canada, administered through the Center for Research in Urdu Language Processing, National University of Computer and Emerging Sciences, Pakistan.
⁶ Available from: http://www.screenreader.net/

9. References

[1] A.W. Black and K.A. Lenzo, Building Synthetic Voices, Language Technologies Institute, Carnegie Mellon University and Cepstral LLC. Retrieved from: http://festvox.org/bsv/, 2003.
[2] R.J. Jones, A. Choy and B. Williams, "Integrating Festival and Windows", InterSpeech 2006 - 9th International Conference on Spoken Language Processing, Pittsburgh, USA, 2006.
[3] C. Kamisetty and S.M. Adapa, Telugu Festival Text-to-Speech System. Retrieved from: http://festivalte.sourceforge.net/wiki/Main_Page, 2006.
[4] W.S. Karunatillake, An Introduction to Spoken Sinhala, 3rd edn., M.D. Gunasena & Co. Ltd., 217 Olcott Mawatha, Colombo 11, 2004.
[5] S.P. Kishore, R. Sangal and M. Srinivas, "Building Hindi and Telugu Voices using Festvox", Proceedings of the International Conference on Natural Language Processing 2002 (ICON-2002), Mumbai, India, 2002.
[6] S. Lemmetty, Review of Speech Synthesis Technology, MSc thesis, Helsinki University of Technology, 1999.
[7] A. Louw, A Short Guide to Pitch-Marking in the Festival Speech Synthesis System and Recommendations for Improvement, Local Language Speech Technology Initiative (LLSTI) Reports. Retrieved from: http://www.llsti.org/documents.htm, (n.d.).
[8] A.G. Ramakishnan, K. Bali, P.P. Talukdar and N.S. Krishna, "Tools for the Development of a Hindi Speech Synthesis System", 5th ISCA Speech Synthesis Workshop, Pittsburgh, 2004, pp. 109-114.
[9] Ö. Salor, B. Pellom and M. Demirekler, "Implementation and Evaluation of a Text-to-Speech Synthesis System for Turkish", Proceedings of Eurospeech-Interspeech 2003, Geneva, Switzerland, 2003, pp. 1573-1576.
[10] A. Wasala, R. Weerasinghe and K. Gamage, "Sinhala Grapheme-to-Phoneme Conversion and Rules for Schwa Epenthesis", Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, Sydney, Australia, 2006, pp. 890-897.
[11] R. Weerasinghe, A. Wasala and K. Gamage, "A Rule Based Syllabification Algorithm for Sinhala", Proceedings of the 2nd International Joint Conference on Natural Language Processing (IJCNLP-05), Jeju Island, Korea, 2005, pp. 438-449.
[12] B. Williams, R.J. Jones and I. Uemlianin, "Tools and Resources for Speech Synthesis Arising from a Welsh TTS Project", Fifth Language Resources and Evaluation Conference (LREC), Genoa, Italy, 2006.
[13] Microsoft Corporation, Microsoft Speech SDK Version 5.1. Retrieved from: http://msdn2.microsoft.com/ens/library/ms990097.aspx, (n.d.).