
Fuzzy Approaches to Speech and Speaker Recognition

2000

Fuzzy Approaches to Speech and Speaker Recognition

A thesis submitted for the degree of Doctor of Philosophy of the University of Canberra

Dat Tat Tran
May 2000

Summary of Thesis

Statistical pattern recognition is the most successful approach to automatic speech and speaker recognition (ASASR). Of all the statistical pattern recognition techniques, the hidden Markov model (HMM) is the most important. The Gaussian mixture model (GMM) and vector quantisation (VQ) are also effective techniques, especially for speaker recognition and, in conjunction with HMMs, for speech recognition. However, the performance of these techniques degrades rapidly in the context of insufficient training data and in the presence of noise or distortion. Fuzzy approaches, with their adjustable parameters, can reduce such degradation.

Fuzzy set theory is one of the most successful approaches in pattern recognition, where, based on the idea of a fuzzy membership function, fuzzy C-means (FCM) clustering and noise clustering (NC) are the most important techniques. To establish fuzzy approaches to ASASR, the following basic problems are solved. First, a time-dependent fuzzy membership function is defined for the HMM. Second, a general distance is proposed to obtain a relationship between modelling and clustering techniques. Third, fuzzy entropy (FE) clustering is proposed to relate fuzzy models to statistical models. Finally, fuzzy membership functions are proposed as discriminant functions in decision making.

The following models are proposed: 1) the FE-HMM, NC-FE-HMM, FE-GMM, NC-FE-GMM, FE-VQ and NC-FE-VQ in the FE approach, 2) the FCM-HMM, NC-FCM-HMM, FCM-GMM and NC-FCM-GMM in the FCM approach, and 3) the hard HMM and GMM as special models of both the FE and FCM approaches. Finally, a fuzzy approach to speaker verification and a further extension using possibility theory are also proposed. The evaluation experiments performed on the TI46, ANDOSL and YOHO corpora show better results for all of the proposed techniques in comparison with the non-fuzzy baseline techniques.

Certificate of Authorship of Thesis

Except as specially indicated in footnotes, quotations and the bibliography, I certify that I am the sole author of the thesis submitted today, entitled Fuzzy Approaches to Speech and Speaker Recognition, in terms of the Statement of Requirements for a Thesis issued by the University Higher Degrees Committee.

Papers containing some of the material of the thesis have been published as Tran [1999, 1998], Tran and Wagner [2000a-h, 1999a-e, 1998], and Tran et al. [2000a,b, 1999a-d, 1998a-d]. For all of the above joint papers, I certify that the contributions of my co-authors, Michael Wagner and Tu Van Le, were solely made in their respective roles of primary and secondary thesis supervisors. For the joint papers Tran et al. [2000a,b, 1999c, 1998c], my co-author Tuan Pham contributed comments on my literature review and discussions on the theoretical development, while Michael Wagner contributed as a thesis supervisor. For the joint papers Tran et al. [1999a,b,d], my co-author Tongtao Zheng contributed discussions on the experimental results, while Michael Wagner contributed as a thesis supervisor. For the joint paper Tran et al. [1998d], my co-author Minh Do contributed part of the programming and some discussions on the theoretical development and the experimental results, while Michael Wagner and Tu Van Le contributed as thesis supervisors.
Acknowledgements

First and foremost, I would like to thank my primary supervisor, Professor Michael Wagner, for his enormous support and encouragement during my research study at the University of Canberra. I am also thankful for the advice and guidance he gave me in spite of his busy schedule, for helping me organise the thesis draft and refine its contents, and for his patience in answering my inquiries. I would like to thank my secondary supervisor, Associate Professor Tu Van Le, for his teaching and for the support he gave me to become a PhD candidate at the University of Canberra. I would also like to thank staff members as well as research students at the University of Canberra for their support and for maintaining the excellent computing facilities which were crucial for carrying out my research.

I am grateful for the University of Canberra Research Scholarship, which enabled me to undertake this research in the period February 1997 to February 2000. I would also like to thank the School of Computing and the Division of Management and Technology, which provided funding for attending several conferences. I would like to express my gratitude to all my lecturers and colleagues at the Department of Theoretical Physics, Faculty of Physics, and Faculty of Mathematics, University of Ho Chi Minh City, Viet Nam.

Many special thanks to my family members. I am indebted to my parents for the sacrifices they have made for me. I wish to thank my brothers-in-law and sisters-in-law as well as my wife, Phuong Dao, my son, Nguyen Tran, and my daughter, Thao Tran, for the support they have given me throughout the years of my thesis research.

Finally, this work is dedicated to the memory of my previous supervisor, Professor Phi Van Duong, a scientist of the Abdus Salam International Centre for Theoretical Physics (ICTP), Trieste, Italy. A special thanks for his teaching, love, advice, guidance, support and encouragement throughout 12 years at the University of Ho Chi Minh City, Viet Nam.

Contents

Summary of Thesis
Acknowledgements
List of Abbreviations

1 Introduction
   1.1 Current Approaches to Speech and Speaker Recognition
      1.1.1 Statistical Pattern Recognition Approach
      1.1.2 Modelling Techniques: HMM, GMM and VQ
   1.2 Fuzzy Set Theory-Based Approach
      1.2.1 The Membership Function
      1.2.2 Clustering Techniques: FCM and NC
   1.3 Problem Statement
   1.4 Contributions of This Thesis
      1.4.1 Fuzzy Entropy Models
      1.4.2 Fuzzy C-Means Models
      1.4.3 Hard Models
      1.4.4 A Fuzzy Approach to Speaker Verification
      1.4.5 Evaluation Experiments and Results
   1.5 Extensions of This Thesis

2 Literature Review
   2.1 Speech Characteristics
      2.1.1 Speech Sounds
      2.1.2 Speech Signals
      2.1.3 Speech Processing
      2.1.4 Summary
   2.2 Speech and Speaker Recognition
      2.2.1 Speech Recognition
      2.2.2 Speaker Recognition
      2.2.3 Summary
   2.3 Statistical Modelling Techniques
      2.3.1 Maximum A Posteriori Rule
      2.3.2 Distribution Estimation Problem
      2.3.3 Maximum Likelihood Estimation
      2.3.4 Hidden Markov Modelling
         Parameters and Types of HMMs
         Three Basic Problems for HMMs
      2.3.5 Gaussian Mixture Modelling
      2.3.6 Vector Quantisation Modelling
      2.3.7 Summary
   2.4 Fuzzy Clustering Techniques
      2.4.1 Fuzzy Sets and the Membership Function
      2.4.2 Maximum Membership Rule
      2.4.3 Membership Estimation Problem
      2.4.4 Pattern Recognition and Cluster Analysis
      2.4.5 Hard C-Means Clustering
      2.4.6 Fuzzy C-Means Clustering
         Fuzzy C-Means Algorithm
         Gustafson-Kessel Algorithm
         Gath-Geva Algorithm
      2.4.7 Noise Clustering
      2.4.8 Summary
   2.5 Fuzzy Approaches in the Literature
      2.5.1 Maximum Membership Rule-Based Approach
      2.5.2 FCM-Based Approach
3 Fuzzy Entropy Models
   3.1 Fuzzy Entropy Clustering
   3.2 Modelling and Clustering Problems
   3.3 Maximum Fuzzy Likelihood Estimation
   3.4 Fuzzy Entropy Hidden Markov Models
      3.4.1 Fuzzy Membership Functions
      3.4.2 Fuzzy Entropy Discrete HMM
      3.4.3 Fuzzy Entropy Continuous HMM
      3.4.4 Noise Clustering Approach
   3.5 Fuzzy Entropy Gaussian Mixture Models
      3.5.1 Fuzzy Entropy GMM
      3.5.2 Noise Clustering Approach
   3.6 Fuzzy Entropy Vector Quantisation
   3.7 A Comparison Between Conventional and Fuzzy Entropy Models
   3.8 Summary and Conclusion

4 Fuzzy C-Means Models
   4.1 Minimum Fuzzy Squared-Error Estimation
   4.2 Fuzzy C-Means Hidden Markov Models
      4.2.1 FCM Discrete HMM
      4.2.2 FCM Continuous HMM
      4.2.3 Noise Clustering Approach
   4.3 Fuzzy C-Means Gaussian Mixture Models
      4.3.1 Fuzzy C-Means GMM
      4.3.2 Noise Clustering Approach
   4.4 Fuzzy C-Means Vector Quantisation
   4.5 Comparison Between FCM and FE Models
   4.6 Summary and Conclusion
5 Hard Models
   5.1 From Fuzzy To Hard Models
   5.2 Hard Hidden Markov Models
      5.2.1 Hard Discrete HMM
      5.2.2 Hard Continuous HMM
   5.3 Hard Gaussian Mixture Models
   5.4 Summary and Conclusion

6 A Fuzzy Approach to Speaker Verification
   6.1 A Speaker Verification System
   6.2 Current Normalisation Methods
   6.3 Proposed Normalisation Methods
   6.4 The Likelihood Transformation
   6.5 Summary and Conclusion

7 Evaluation Experiments and Results
   7.1 Database Description
      7.1.1 The TI46 Database
      7.1.2 The ANDOSL Database
      7.1.3 The YOHO Database
   7.2 Speech Processing
   7.3 Algorithmic Issues
      7.3.1 Initialisation
      7.3.2 Constraints on Parameters During Training
   7.4 Isolated Word Recognition
      7.4.1 E Set Results
      7.4.2 10-Digit and 10-Command Set Results
      7.4.3 46-Word Set Results
   7.5 Speaker Identification
      7.5.1 TI46 Results
      7.5.2 ANDOSL Results
      7.5.3 YOHO Results
   7.6 Speaker Verification
      7.6.1 TI46 Results
      7.6.2 ANDOSL Results
      7.6.3 YOHO Results
   7.7 Summary and Conclusion

8 Extensions of the Thesis
   8.1 Possibility Theory-Based Approach
      8.1.1 Possibility Theory
      8.1.2 Possibility Distributions
      8.1.3 Maximum Possibility Rule
   8.2 Possibilistic C-Means Approach
      8.2.1 Possibilistic C-Means Clustering
      8.2.2 PCM Approach to FE-HMMs
      8.2.3 PCM Approach to FCM-HMMs
      8.2.4 PCM Approach to FE-GMMs
      8.2.5 PCM Approach to FCM-GMMs
      8.2.6 Summary and Conclusion
9 Conclusions and Future Research
   9.1 Conclusions
   9.2 Directions for Future Research

Bibliography

A List of Publications

List of Figures

2.1 The speech signal of the utterance "one" (a) in the long period of time from t = 0.3 sec to t = 0.6 sec and (b) in the short period of time from t = 0.4 sec to t = 0.42 sec
2.2 Block diagram of LPC front-end processor for speech and speaker recognition
2.3 An N-state left-to-right HMM with ∆i = 1
2.4 Relationships between HMM, GMM, and VQ techniques
2.5 A statistical classifier for isolated word recognition and speaker identification
2.6 Clustering techniques and their extended versions
3.1 Generating 3 clusters with different values of n: hard clustering as n → 0, clusters increase their overlap with increasing n > 0, and are identical to a single cluster as n → ∞
3.2 States at each time t = 1, ..., T are regarded as time-dependent fuzzy sets. There are N × T fuzzy states connected by arrows into N^T fuzzy state sequences in the fuzzy HMM.
3.3 The observation sequence O belongs to fuzzy state sequences being in fuzzy state i at time t and fuzzy state j at time t + 1.
3.4 The observation sequence X belongs to fuzzy state j and fuzzy mixture k at time t in the fuzzy continuous HMM.
3.5 From fuzzy entropy models to conventional models
3.6 The membership function u_it with different values of the degree of fuzzy entropy n versus the distance d_it between vector x_t and cluster i
3.7 Fuzzy entropy models for speech and speaker recognition
4.1 The relationship between FE model groups versus the degree of fuzzy entropy n
4.2 The relationship between FCM model groups versus the degree of fuzziness m
4.3 The FCM membership function u_it with different values of the degree of fuzziness m versus the distance d_it between vector x_t and cluster i
4.4 Curves representing the functions used in the FE and FCM memberships, where x = d_it^2, m = 2 and n = 1
4.5 Fuzzy C-means models for speech and speaker recognition
5.1 From hard VQ to fuzzy VQ: an additional fuzzy entropy term for fuzzy entropy VQ, and a weighting exponent m > 1 on each u_it for fuzzy C-means VQ
5.2 From fuzzy VQ to (hard) VQ: n → 0 for FE-VQ or m → 1 for FCM-VQ, or using the minimum distance rule to compute u_it directly
5.3 Mutual relations between fuzzy and hard models
5.4 Possible state sequences in a 3-state Bakis HMM and a 3-state fuzzy Bakis HMM
5.5 A possible single state sequence in a 3-state hard HMM
5.6 A mixture of three Gaussian distributions in the GMM or the fuzzy GMM
5.7 A set of three non-overlapping Gaussian distributions in the hard GMM
5.8 Relationships between hard models
6.1 A typical speaker verification system
6.2 The transformation T where T(P)/P increases and T(P) is non-positive for 0 ≤ P ≤ 1: values of 4 ratios at A, B, C, and D are moved to those at A', B', C', and D'
6.3 Histograms of speaker f7 in the TI46 using 16-mixture GMMs. The EER is 6.67% for Fig. 6.3a and 5.90% for Fig. 6.3b.
7.1 Isolated word recognition error (%) versus the number of states N for the digit-set vocabulary, using left-to-right DHMMs, codebook size K = 16, TI46 database
7.2 Speaker identification error (%) versus the degree of fuzziness m using FCM-VQ speaker models, codebook size K = 16, TI46 corpus
7.3 Isolated word recognition error (%) versus the degree of fuzzy entropy n for the E-set vocabulary, using 6-state left-to-right FE-DHMMs, codebook size of 16, TI46 corpus
7.4 Speaker identification error rate (%) versus the number of mixtures for 16 speakers, using conventional GMMs, FCM-GMMs and NC-FCM-GMMs
7.5 Speaker identification error rate (%) versus the codebook size for 16 speakers, using VQ, FE-VQ and NC-FE-VQ codebooks
7.6 EERs (%) for GMM-based speaker verification performed on 16 speakers, using GMMs, FCM-GMMs and NC-FCM-GMMs
7.7 EERs (%) for VQ-based speaker verification performed on 16 speakers, using VQ, FE-VQ and NC-FE-VQ codebooks
8.1 PCM clustering in clustering techniques
8.2 PCM approach to FE models for speech and speaker recognition
8.3 PCM approach to FCM models for speech and speaker recognition

List of Tables

3.1 An example of memberships for the GMM and the FE-GMM
6.1 The likelihood values for 4 input utterances X1-X4 against the claimed speaker λ0 and 3 impostors λ1-λ3, where X1c, X2c are from the claimed speaker and X3i, X4i are from impostors
6.2 Scores of 4 utterances using L3(X) and L8(X)
6.3 Scores of 4 utterances using L3nc(X) and L8nc(X)
7.1 Isolated word recognition error rates (%) for the E set
7.2 Speaker-dependent recognition error rates (%) for the E set
7.3 Isolated word recognition error rates (%) for the 10-digit set
7.4 Isolated word recognition error rates (%) for the 10-command set
7.5 Isolated word recognition error rates (%) for the 46-word set
7.6 Speaker-dependent recognition error rates (%) for the 46-word set
7.7 Speaker identification error rates (%) for the ANDOSL corpus using conventional GMMs, FE-GMMs and FCM-GMMs
7.8 Speaker identification error rates (%) for the YOHO corpus using conventional GMMs, hard GMMs and VQ codebooks
7.9 EER results (%) for the ANDOSL corpus using GMMs with different background speaker sets. Rows in bold are the current normalisation methods, others are the proposed methods. The index "nc" denotes noise clustering-based methods.
7.10 Equal Error Rate (EER) results (%) for the YOHO corpus. Rows in bold are the current normalisation methods, others are the proposed methods.
7.11 Comparisons of EER results (%) for the YOHO corpus using GMMs, hard GMMs and VQ codebooks. Rows in bold are the current normalisation methods, others are the proposed methods.
List of Abbreviations

ANN    artificial neural network
CHMM   continuous hidden Markov model
DTW    dynamic time warping
DHMM   discrete hidden Markov model
EM     expectation maximisation
FE     fuzzy entropy
FCM    fuzzy C-means
GMM    Gaussian mixture model
HMM    hidden Markov model
LPC    linear predictive coding
MAP    maximum a posteriori
ML     maximum likelihood
MMI    maximum mutual information
NC     noise clustering
pdf    probability density function
PCM    possibilistic C-means
VQ     vector quantisation

Chapter 1

Introduction

Research in automatic speech and speaker recognition by machine has been conducted for more than four decades. Speech recognition is the process of automatically recognising the linguistic content in a spoken utterance. Speaker recognition can be classified into two specific tasks: identification and verification. Speaker identification is the process of determining who is speaking based on information obtained from the speaker's speech. Speaker verification is the process of accepting or rejecting the identity claim of a speaker.

1.1 Current Approaches to Speech and Speaker Recognition

Three current approaches to speech and speaker recognition by machine are the acoustic-phonetic approach, the pattern-recognition approach and the artificial intelligence approach. The acoustic-phonetic approach is based on the theory of acoustic phonetics, which postulates that there exist finite, distinctive phonetic units in spoken language and that the phonetic units are broadly characterised by sets of properties that are manifest in the speech signal or its spectrum over time. The acoustic-phonetic approach is usually based on segmentation of the speech signal and subsequent feature extraction. The main problem with the acoustic-phonetic approach is the variability of the acoustic properties of a phoneme, which depend on many factors including acoustic context, speaker gender, age, emotional state, etc. The pattern-recognition approach generally uses the speech patterns directly, i.e. without explicit feature determination and segmentation. This method has two steps: training of speech patterns, and recognition of patterns via pattern comparison. Finally, the artificial intelligence approach is a hybrid of the acoustic-phonetic approach and the pattern-recognition approach in that it exploits ideas and concepts of both methods, especially the use of an expert system for segmentation and labelling, and the use of neural networks for learning the relationships between phonetic events and all known inputs [Rabiner and Juang 1993].
1.1.1 Statistical Pattern Recognition Approach

In general, the pattern-recognition approach is the method of choice for speech and speaker recognition because of its simplicity of use, proven high performance, and relative robustness and invariance to different speech vocabularies, users, algorithms and decision rules. The most successful approach to speech and speaker recognition is to treat the speech signal as a stochastic pattern and to use a statistical pattern recognition technique. The statistical formulation has its root in classical Bayes decision theory, which links a recognition task to the distribution estimation problem. In order to be practically implementable, the distributions are usually parameterised, and thus the distribution estimation problem becomes a parameter estimation problem, where a reestimation algorithm using a set of parameter estimation equations is established to find the right parametric form of the distributions. The unknown parameters defining the distribution have to be estimated from the training data. To obtain reliable parameter estimates, the training set needs to be of sufficient size in relation to the number of parameters. When the amount of training data is not sufficient, the quality of the distribution parameter estimates cannot be guaranteed. In other words, the minimum Bayes risk generally remains an unachievable lower bound [Huang et al. 1996].

1.1.2 Modelling Techniques: HMM, GMM and VQ

In the statistical pattern recognition approach, hidden Markov modelling of the speech signal is the most important technique. It has been used extensively to model fundamental speech units in speech recognition because the hidden Markov model (HMM) can adequately characterise both the temporally and spectrally varying nature of the speech signal [Rabiner and Juang 1993]. In speech recognition, the left-to-right HMM with only self and forward transitions is the simplest word and subword model. In text-dependent speaker recognition, training comprises spectral and temporal representations of specific utterances and testing must use the same utterances; the ergodic HMM with all possible transitions can be used for broad phonetic categorisation [Matsui and Furui 1992]. In text-independent speaker recognition, training comprises a good representation of all the speaker's speech sounds and the test material can be any utterance; there are no constraints on the training and test text, and the temporal information has not been shown to be useful [Reynolds 1992]. The Gaussian mixture model (GMM) is used in this case. The GMM uses a mixture of Gaussian densities to model the distribution of speaker-specific feature vectors. When little data is available, the vector quantisation (VQ) technique is also effective in characterising speaker-specific features. The VQ model is a codebook and is generated by clustering the training feature vectors of each speaker.
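To make the modelling discussion concrete, the following minimal sketch scores a sequence of feature vectors against diagonal-covariance GMM speaker models and picks the best-scoring speaker. It is an illustration only, not code from the thesis; the function names and the assumption of diagonal covariances are choices made here for brevity.

    import numpy as np

    def gmm_log_likelihood(X, weights, means, variances):
        """Total log-likelihood of frames X (T x d) under a diagonal-covariance GMM."""
        T, d = X.shape
        log_probs = []
        for w, mu, var in zip(weights, means, variances):
            diff = X - mu
            ll = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(var))
                         + np.sum(diff ** 2 / var, axis=1))
            log_probs.append(np.log(w) + ll)
        log_probs = np.vstack(log_probs)              # (M components, T frames)
        frame_ll = np.logaddexp.reduce(log_probs, axis=0)   # log-sum-exp over components
        return frame_ll.sum()

    def identify_speaker(X, speaker_models):
        """Closed-set identification: return the speaker whose GMM scores highest."""
        scores = {name: gmm_log_likelihood(X, *params)
                  for name, params in speaker_models.items()}
        return max(scores, key=scores.get)

Here speaker_models is assumed to map each speaker name to a tuple (weights, means, variances) estimated beforehand; a VQ codebook could be scored in the same loop by replacing the log-likelihood with a minimum quantisation distortion.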
1.2 Fuzzy Set Theory-Based Approach

An alternative successful approach in pattern recognition is the fuzzy set theory-based approach. Fuzzy set theory was introduced by Zadeh [1965] to represent and manipulate data and information that possess nonstatistical uncertainty. Fuzzy set theory is a generalisation of conventional set theory that was introduced as a new way of representing the vagueness or imprecision that is ever present in our daily experience as well as in natural language [Bezdek 1993].

1.2.1 The Membership Function

The membership function is the basic idea in fuzzy set theory. The membership of a point in a fuzzy set represents the degree to which the point belongs to this fuzzy set. The first fuzzy approach to speech and speaker recognition was the use of the memberships as discriminant functions in decision making [Pal and Majumder 1977]. In fuzzy clustering, the memberships are not known in advance and have to be estimated from a training set of observations with known class labels. The membership estimation procedure in fuzzy pattern recognition is called abstraction [Bellman et al. 1966].

1.2.2 Clustering Techniques: FCM and NC

The most successful technique in fuzzy cluster analysis is fuzzy C-means (FCM) clustering; it is widely used in both theory and practical applications of fuzzy clustering techniques for unsupervised classification [Zadeh 1977]. FCM clustering [Dunn 1974, Bezdek 1973] is an extension of hard C-means (HCM) clustering, also known as K-means clustering [Duda and Hart 1973]. A general estimation procedure for the FCM technique has been established and its convergence has been shown [Bezdek and Pal 1992]. However, the FCM technique is sensitive to outliers. The sum of the memberships of a feature vector across classes is always equal to one, both for clean data and for noisy data. It would be more reasonable that, if the feature vector comes from noisy data or outliers, the memberships should be as small as possible for all classes and their sum should be smaller than one. This property is important since all parameter estimates are computed from these memberships. The idea of a noise cluster has been proposed in the noise clustering (NC) technique to deal with noisy data or outliers [Davé 1991]. All of the above-mentioned subjects are reviewed with respect to speech and speaker recognition, statistical modelling and clustering in Chapter 2.
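As a concrete illustration of the FCM and NC updates just described, the sketch below performs one FCM iteration (membership update followed by centroid update) and shows how a fixed noise distance, as in Davé's noise clustering, lets outliers receive small memberships in all real clusters. It is a minimal illustration with invented variable names, not code from the thesis.

    import numpy as np

    def fcm_step(X, centroids, m=2.0, noise_delta=None):
        """One fuzzy C-means update. X: (T, d) vectors, centroids: (C, d).
        If noise_delta is given, a noise cluster at constant distance delta
        (noise clustering) absorbs membership from outlying vectors."""
        # squared distances d_it^2 between every vector and every cluster centre
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)   # (T, C)
        d2 = np.maximum(d2, 1e-12)
        if noise_delta is not None:
            d2 = np.hstack([d2, np.full((len(X), 1), noise_delta ** 2)])
        # FCM membership: u_it = 1 / sum_j (d_it / d_jt)^(2/(m-1))
        ratio = (d2[:, :, None] / d2[:, None, :]) ** (1.0 / (m - 1.0))
        u = 1.0 / ratio.sum(axis=2)
        u_real = u[:, :centroids.shape[0]]
        # centroid update uses the fuzzified memberships u^m
        w = u_real ** m
        new_centroids = (w.T @ X) / w.sum(axis=0)[:, None]
        return u_real, new_centroids

Without the noise cluster, each row of u_real sums to one; with it, vectors far from every centroid give most of their membership to the noise cluster, so their influence on the centroid update is reduced, which is exactly the robustness property discussed above.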
1.3 Problem Statement

In general, the most successful approach in speech and speaker recognition is the statistical pattern recognition approach, where the HMM is the most important technique, with the GMM and VQ also used in speaker recognition. However, the performance of these techniques degrades rapidly in the context of insufficient training data and in the presence of noise or distortion. Fuzzy approaches with their adjustable parameters can hopefully reduce such degradation. In pattern recognition, the fuzzy set theory-based approach is one of the most successful approaches and FCM clustering is the most important technique. Therefore, to obtain fuzzy pattern recognition approaches to statistical methods in speech and speaker recognition, we need to solve the following basic problems. In the training phase, these are: 1) how to determine the fuzzy membership functions for the statistical models, and 2) how to estimate these fuzzy membership functions from a training set of observations? In the recognition phase, the basic problem is: 3) how to use the fuzzy membership functions as discriminant functions for recognition?

For the first problem, we begin with the HMM technique in the statistical pattern recognition approach and, by considering the HMM, we show how the concept of the fuzzy membership function can be used. In hidden Markov modelling, the underlying assumption is that the observation sequence obtained from speech processing can be characterised as being generated by a stochastic process. Each observation is regarded as the output of another stochastic process, the hidden Markov process, which is governed by the output probability distribution. A first-order hidden Markov process consists of a finite state sequence, where the initial state is governed by an initial state distribution and state transitions, which occur at discrete times t, are governed by transition probabilities that depend only on the previous state. The observation sequence is regarded as being produced by different state sequences with corresponding probabilities. This situation can be represented more effectively by using fuzzy set theory, where the states at each time t are regarded as time-dependent fuzzy sets, called fuzzy states. The time-dependent fuzzy membership u_{s_t=i}(o_t) can be defined as the degree to which the observation o_t belongs to the fuzzy state s_t = i at time t. However, the observations are always considered in the sequence O and related to the state sequence S. Fuzzy state sequences are therefore also defined as sequences of fuzzy states in time, and the fuzzy membership function is defined for the observation sequence O in the fuzzy state sequence S, based on the fuzzy membership function of the observation o_t in the fuzzy state s_t [Tran and Wagner 1999a]. For example, to compute the state transition matrix A, we consider fuzzy states at times t and t+1 included in corresponding fuzzy state sequences and define the fuzzy membership function u_{s_t=i, s_{t+1}=j}(O). This membership denotes the degree to which the observation sequence O belongs to fuzzy state sequences that are in fuzzy state s_t = i at time t and fuzzy state s_{t+1} = j at time t+1. In this approach, probability and fuzziness are complementary rather than competitive. For the HMM, probability deals with the stochastic processes for the observation sequence and the state sequences, whereas fuzziness deals with the relationship between these sequences [Tran and Wagner 1999c].

For the second problem, estimating the fuzzy membership functions is based on the selection of an optimisation criterion. The minimum squared-error criterion used in the FCM clustering and NC techniques is very effective in cluster analysis [Bezdek and Pal 1992, Davé 1990], so it can be applied to statistical modelling techniques. However, two sub-problems must be solved before applying it. The first is how to obtain a relationship between the clustering and modelling techniques, since the goal of clustering techniques is to find optimal partitions of the data [Davé and Krishnapuram 1997] whereas the goal of statistical modelling techniques is to find the right parametric form of the distributions [Juang et al. 1996]. For this sub-problem, a general distance for clustering techniques is proposed [Tran and Wagner 1999b]. The distance is defined as a decreasing function of the component probability density, and hence grouping similar feature vectors into a cluster becomes classification of these vectors into a component distribution. Clusters are now represented by component distribution functions, and hence the characteristics of a cluster are not only its shape and location, but also the data density in the cluster and possibly the temporal structure of the data if a Markov process is applied [Tran and Wagner 2000a]. Finding good partitions of the data in clustering techniques thus leads to finding the right parametric form of the distributions in the modelling techniques.
The second sub-problem is the relationship between the fuzzy and conventional models. Fuzzy models using FCM clustering reduce to hard models using HCM clustering as the degree of fuzziness m > 1 tends to 1 [Tran et al. 2000a]. However, the conventional HMM is not a hard model, since there is more than one possible state at each time t; the relationship between the fuzzy HMM and the conventional HMM therefore could not be established in the same way, because the hard HMM had not yet been defined. To solve this problem, we propose an alternative clustering technique called fuzzy entropy (FE) clustering [Tran and Wagner 2000f] and apply this technique to the HMM to obtain the fuzzy entropy HMM (FE-HMM) [Tran and Wagner 2000b]. The degree of fuzzy entropy n > 0 is introduced in the FE-HMM. As n tends to 1, FE-HMMs reduce to conventional HMMs. We also propose the hard HMM, in which only the best state sequence is employed, by using a binary (zero-one) membership function [Tran et al. 2000a]. A hard GMM is also proposed, which employs only the most likely Gaussian distribution among the mixture of Gaussians to represent a feature vector [Tran et al. 2000b].

For the third problem, it can be seen that the roles of the fuzzy membership function in fuzzy set theory and of the a posteriori probability in Bayes decision theory are quite similar. Therefore the currently used maximum a posteriori (MAP) decision rule can be generalised to the maximum fuzzy membership decision rule [Tran et al. 1998b, Tran et al. 1998d]. Depending on which fuzzy technique is applied, we can find a suitable form for the fuzzy membership function. For example, in speaker verification, the fuzzy membership of an input utterance in the claimed speaker's fuzzy set of utterances is used as a similarity score to compare with a given threshold in order to accept or reject this speaker. The fuzzy membership function is determined as the ratio of functions of the claimed speaker's and impostors' likelihood functions [Tran and Wagner 2000c, Tran and Wagner 2000d].

1.4 Contributions of This Thesis

Based on solving the above-mentioned problems, fuzzy approaches to speech and speaker recognition are proposed and evaluated in this thesis as follows.

1.4.1 Fuzzy Entropy Models

This fuzzy approach is presented in Chapter 3. FE models are based on a basic algorithm termed FE clustering. The goal of this approach is not only to propose a new fuzzy approach but also to show that statistical models, such as the HMM and the GMM in the maximum likelihood scheme, can be viewed as fuzzy models, where the probabilities of unobservable data, given observable data, are used as fuzzy membership functions. Fuzzy entropy clustering, the maximum fuzzy likelihood criterion, the fuzzy EM algorithm and the fuzzy membership function, as well as FE-HMMs, FE-GMMs and FE-VQ and their NC versions, are proposed in this chapter [Tran and Wagner 2000a, Tran and Wagner 2000b, Tran and Wagner 2000f, Tran and Wagner 2000g, Tran 1999]. The adjustability of the degree of fuzzy entropy n in FE models is an advantage. When conventional models do not work well because of the insufficient training data problem or the complexity of the speech data, such as the nine English E-set letters, a suitable value of n can be found to obtain better models.
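The short sketch below is an illustration written for this summary rather than code from the thesis: it shows one standard fuzzy-entropy style membership built on the general distance d_it = -log of the component density discussed above, whose limiting behaviour matches the description of the FE models (conventional posterior-style weights at n = 1, hard assignment as n tends to 0). The exact distance and membership definitions used in the thesis are those given in Chapter 3.

    import numpy as np

    def fe_memberships(log_densities, n=1.0):
        """Fuzzy-entropy memberships from per-cluster log densities.
        log_densities: (C, T) array of log[w_i * p_i(x_t)] values.
        Uses the general distance d_it = -log density and
        u_it = exp(-d_it / n) / sum_j exp(-d_jt / n)."""
        d = -log_densities
        scaled = -d / n
        scaled -= scaled.max(axis=0, keepdims=True)   # stabilise the softmax
        u = np.exp(scaled)
        return u / u.sum(axis=0, keepdims=True)

    # n = 1 recovers the usual posterior probabilities of the components
    # (the conventional GMM/HMM weighting); a small n pushes the memberships
    # towards a hard, winner-take-all assignment; a large n flattens them.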
1.4.2 Fuzzy C-Means Models

Chapter 4 presents this fuzzy approach, which is based on FCM clustering in fuzzy pattern recognition. FCM models are estimated by the minimum fuzzy squared-error criterion used in FCM clustering. The fuzzy EM algorithm is reformulated for this criterion. FCM-HMMs, FCM-GMMs and FCM-VQ are presented, as well as their NC versions. A discussion of the role of the fuzzy memberships of FCM models and a comparison between FCM models and FE models are also presented. Similarly to FE models, FCM models have an adjustable parameter called the degree of fuzziness m > 1. Better models can be obtained in the case of the insufficient training data problem and the complexity of the speech data problem by using a suitable value of m [Tran and Wagner 1999a]-[Tran and Wagner 1999e].

1.4.3 Hard Models

As the degrees of fuzzy entropy and fuzziness tend to their minimum values, both fuzzy entropy and fuzzy C-means models tend to the same limit, which is the corresponding hard model. The simplest hard model is the VQ model, which is effective for speaker recognition. Chapter 5 proposes new hard models: hard HMMs and hard GMMs. These models emerge as interesting consequences of investigating fuzzy approaches. The hard HMM employs only the best path for estimating model parameters and for recognition, and the hard GMM employs only the most likely Gaussian distribution among the mixture of Gaussians to represent a feature vector. Hard models can be very useful because they are simple yet efficient [Tran et al. 2000a, Tran et al. 2000b].

1.4.4 A Fuzzy Approach to Speaker Verification

An even more interesting fuzzy approach is proposed in Chapter 6. The speaker verification process is reconsidered from the viewpoint of fuzzy set theory, and hence a likelihood transformation and seven fuzzy normalisation methods are proposed [Tran and Wagner 2000c, Tran and Wagner 2000d]. This fuzzy approach also leads to a noise clustering-based version of all the normalisation methods, which improves speaker verification performance markedly.

1.4.5 Evaluation Experiments and Results

The evaluation of the FE, FCM and hard models is presented in Chapter 7. The proposed normalisation methods for speaker verification are also evaluated in this chapter. The three speech corpora used in the experiments were the TI46, ANDOSL and YOHO corpora. Isolated word recognition experiments were performed on the E set, the 10-digit set, the 10-command set and the 46-word set of the TI46 corpus. Speaker identification and verification experiments were performed on the TI46 (16 speakers), ANDOSL (108 speakers) and YOHO (138 speakers) corpora. The experiments show that the fuzzy models and their noise clustering versions outperform the conventional models in most of the experiments. Hard hidden Markov models also achieved good results.

1.5 Extensions of This Thesis

The fuzzy membership function in fuzzy set theory and the a posteriori probability in Bayes decision theory have very similar meanings; however, it can be shown that the minimum Bayes risk for the recognition process is obtained by the maximum a posteriori probability rule, whereas the maximum membership rule does not lead to such a minimum risk. This problem can be overcome by using a well-developed branch of fuzzy set theory, namely possibility theory. In our view, a possibilistic pattern recognition approach is nearly as developed as the statistical pattern recognition approach. In the last chapter, we present the fundamentals of possibility theory and propose a possibilistic C-means approach to speech and speaker recognition.
Future research into the possibility approach is suggested.

Chapter 2

Literature Review

This chapter provides a background review of statistical modelling techniques in speech and speaker recognition and of clustering techniques in pattern recognition. The essential characteristics of speech signals and speech processing are summarised in Section 2.1. An overview of the various disciplines required for understanding aspects of speech and speaker recognition is presented in Section 2.2. Statistical modelling techniques are reviewed in Section 2.3. In this section, we first attend to the basic issues in statistical pattern recognition techniques, including Bayes decision theory, the distribution estimation problem, maximum likelihood estimation and the expectation-maximisation algorithm. Second, three widely used statistical modelling techniques, hidden Markov modelling, Gaussian mixture modelling and vector quantisation, are described. Fuzzy cluster analysis techniques are reviewed in Section 2.4. Fuzzy set theory, the fuzzy membership function, the role of cluster analysis in pattern recognition and three basic clustering techniques, hard C-means, fuzzy C-means and noise clustering, are reviewed in that section. The last section, Section 2.5, reviews the literature on fuzzy approaches to speech and speaker recognition.

2.1 Speech Characteristics

Speech is the most natural means of communication among human beings, and it therefore plays a key role in the development of a natural interface to enhance human-machine communication. This section briefly presents the nature of speech sounds and the features of speech signals that lead to the methods used to process speech.

2.1.1 Speech Sounds

Speech is produced as a sequence of speech sounds corresponding to the message to be conveyed. The state of the vocal cords as well as the positions, shapes, and sizes of the various articulators change over time in the speech production process [O'Shaughnessy 1987]. There are three states of the vocal cords: silence, unvoiced and voiced. Unvoiced sounds are produced when the glottis is open and the vocal cords are not vibrating, so the resulting speech waveform is aperiodic or random in nature. Voiced sounds are produced by forcing air through the glottis with the tension of the vocal cords adjusted so that they vibrate in a relaxation oscillation, with a resulting speech waveform which is quasi-periodic [Rabiner and Schafer 1978]. Note that the segmentation of the waveform into well-defined regions of silence, unvoiced, and voiced signals is not exact. It is difficult to distinguish a weak, unvoiced sound from silence, or a weak, voiced sound from unvoiced sounds or even silence [Rabiner and Juang 1993]. Phonemes are the smallest distinctive class of individual speech sounds in a language. The number of phonemes varies according to different linguists. Vowel sounds are produced by exciting an essentially fixed vocal tract shape with quasi-periodic pulses of air caused by the vibration of the vocal cords. Vowels have the largest amplitudes among phonemes and range in duration from 50 to 400 ms in normal speech [O'Shaughnessy 1987]. Diphthong sounds are gliding monosyllabic speech sounds that start at or near the articulatory position for one vowel and move to or toward the position for another. They are produced by varying the vocal tract smoothly between vowel configurations appropriate to the diphthong [Rabiner and Juang 1993].
The group of sounds consisting of /w/, /l/, /r/, and /j/ is quite difficult to characterise: /l/ and /r/ are called semivowels because of their vowel-like nature, while /w/ and /j/ are called glides because they are generally characterised by a brief gliding transition of the vocal tract shape between adjacent phonemes. The nasal consonants /m/, /n/, and /η/ are produced with glottal excitation and the vocal tract totally constricted at some point along the oral passageway, while the velum is open and allows air to flow through the nasal tract. The unvoiced fricatives /f/, /θ/, /s/, and /sh/ are produced by air flowing over a constriction in the vocal tract, with the location of the constriction determining the particular fricative sound produced. The voiced fricatives /v/, /th/, /z/, and /zh/ are the counterparts of the unvoiced fricatives /f/, /θ/, /s/, and /sh/, respectively. For voiced fricatives, the vocal cords are vibrating, and thus one excitation source is at the glottis. The unvoiced stop consonants /p/, /t/, and /k/, and the voiced stop consonants /b/, /d/, and /g/, are transient, noncontinuant sounds produced by building up pressure behind a total constriction somewhere in the oral tract and then suddenly releasing the pressure. For the voiced stop consonants, the release of sound energy is accompanied by vibrating vocal cords, while for the unvoiced stop consonants the glottis is open [Harrington and Cassidy 1996, Rabiner and Juang 1993, Flanagan 1972].

2.1.2 Speech Signals

The speech signal produced by the human vocal system is one of the most complex signals known. In addition to the inherent physiological complexity of the human vocal tract, the physical production system differs from one person to another. Even when an utterance is repeated by the same person, the observed speech signal is different each time. Moreover, the speech signal is influenced by the speaking environment, the channel used to transmit the signal, and, when recording it, also by the transducer used to capture the signal [Rabiner et al. 1996]. The speech signal is a slowly time-varying signal in the sense that, when examined over a sufficiently short period of time, between 5 and 100 ms depending on the speech sound, its characteristics are approximately stationary. Over longer periods of time, however, the signal characteristics are non-stationary: they change to reflect the sequence of different speech sounds being spoken [Juang et al. 1996]. An illustration of this characteristic is given in Figure 2.1, which shows the time waveform corresponding to the word "one" as spoken by a female speaker. The non-stationarity is observed in the long period of time from t = 0.3 sec to t = 0.6 sec (300 msec) in Figure 2.1.a. In the short period of time from t = 0.4 sec to t = 0.42 sec (20 msec), Figure 2.1.b shows the stationarity of the speech signal.

[Figure 2.1: The speech signal of the utterance "one" (a) in the long period of time from t = 0.3 sec to t = 0.6 sec and (b) in the short period of time from t = 0.4 sec to t = 0.42 sec.]

The "quasi-stationarity" is the first characteristic of speech that distinguishes it from other random, non-stationary signals. Based on this characterisation of the speech signal, a reasonable speech model should have the following components.
First, short-time measurements at intervals of the order of 10 ms are to be made along the pertinent speech dimensions that best carry the relevant information for linguistic or speaker distinction. Second, because of the existence of the quasi-stationary region, neighbouring short-time measurements of the order of 100 ms need to be considered simultaneously, either as a group of identically and independently distributed observations or as a segment of a non-stationary random process covering two quasi-stationary regions. The last component is a mechanism that describes the sound change behaviour among the sound segments in the utterance. This characteristic takes into account the implicit structure of the utterance, words, syntax, and so on, in a probability distribution sense [Juang et al. 1996].

2.1.3 Speech Processing

The speech signal can be parametrically represented by a number of variables related to short-time energy, fundamental frequency and the sound spectrum. Probably the most important parametric representation of speech is the short-time spectral envelope, for which the two most common choices of spectral analysis are the filterbank and the linear predictive coding spectral analysis models. In the filterbank model, the speech signal is passed through a bank of Q bandpass filters whose coverage spans the frequency range of interest in the signal (e.g., 300-3000 Hz for telephone-quality signals, 50-8000 Hz for broadband signals). The individual filters can overlap in frequency, and the output of a bandpass filter is the short-time spectral representation of the speech signal. The linear predictive coding (LPC) model performs spectral analysis on blocks of speech (speech frames) with an all-pole modelling constraint. Each individual frame is windowed so as to minimise the signal discontinuities at the beginning and end of the frame. Thus the output of the LPC spectral analysis block is a vector of coefficients that specify the spectrum of an all-pole model that best matches the speech signal [Rabiner and Schafer 1978].

The most common parameter set derived from either LPC or filterbank spectra is a vector of cepstral coefficients. The cepstral coefficients c_m(t), m = 1, ..., Q, which are the coefficients of the Fourier transform representation of the log magnitude spectrum of speech frame t, have been shown to be a more robust, reliable feature set for speech recognition than the spectral vectors. The temporal cepstral derivatives (also known as delta cepstra) are often used as additional features to model trajectory information [Campbell 1997]. For each speech frame, the result of the feature analysis is a vector of Q weighted cepstral coefficients and an appended vector of Q cepstral time derivatives, as follows [Rabiner and Juang 1993]:

    x'_t = ( \hat{c}_1(t), \ldots, \hat{c}_Q(t), \Delta\hat{c}_1(t), \ldots, \Delta\hat{c}_Q(t) )    (2.1)

where t is the index of the speech frame, x'_t is the transpose of the vector x_t with 2Q components, and \hat{c}_m(t) = w_m c_m(t), with w_m the weighting function that truncates the computation and de-emphasises c_m around m = 1 and around m = Q:

    w_m = 1 + \frac{Q}{2} \sin\left( \frac{\pi m}{Q} \right),    1 \le m \le Q    (2.2)

If second-order temporal derivatives \Delta^2 \hat{c}_m(t) are computed, these are appended to the vector x_t, giving a vector with 3Q components.
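As a small numerical companion to equations (2.1) and (2.2), the sketch below computes the raised-sine cepstral weights w_m, applies them to the cepstra of each frame, and appends first-order delta cepstra. The delta computation shown is a simple two-frame difference chosen here for illustration; it is an assumption, since this section does not specify the delta formula.

    import numpy as np

    def cepstral_weights(Q):
        """Raised-sine lifter of equation (2.2): w_m = 1 + (Q/2) sin(pi*m/Q)."""
        m = np.arange(1, Q + 1)
        return 1.0 + (Q / 2.0) * np.sin(np.pi * m / Q)

    def feature_vectors(cepstra):
        """Build 2Q-dimensional vectors (weighted cepstra + delta cepstra), eq. (2.1).
        cepstra: (T, Q) array of c_m(t) for T frames.
        The deltas are differences of neighbouring frames (an assumption here)."""
        T, Q = cepstra.shape
        weighted = cepstra * cepstral_weights(Q)      # \hat{c}_m(t) = w_m c_m(t)
        deltas = np.zeros_like(weighted)
        deltas[1:-1] = 0.5 * (weighted[2:] - weighted[:-2])
        return np.hstack([weighted, deltas])          # (T, 2Q)

Appending a second difference of the weighted cepstra in the same way would give the 3Q-component vectors mentioned above.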
A block diagram of the LPC front-end processor is shown in Figure 2.2. The speech signal is preemphasised for spectral flattening and is then blocked into frames. The frames are Hamming windowed, the typical window used for the autocorrelation method of LPC. Then the cepstral coefficients c_m(t), weighted by w_m, and the temporal cepstral derivatives are computed for each frame.

[Figure 2.2: Block diagram of the LPC front-end processor for speech and speaker recognition: digitised speech passes through preemphasis, frame blocking, windowing and parameter computation, followed by parameter weighting to give \hat{c}_m(t) and temporal derivatives \Delta\hat{c}_m(t).]

2.1.4 Summary

The quasi-stationarity is an important characteristic of speech. The speech signal after spectral analysis is converted into a feature vector sequence. The most commonly used short-term measurements at present are cepstral coefficients, which form a robust, reliable feature set of speech for speech and speaker recognition.
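To connect the front end of Figure 2.2 with the feature vectors built above, the following sketch implements its first stages: preemphasis, frame blocking and Hamming windowing. The filter coefficient, frame length and frame shift are typical illustrative values rather than parameters taken from the thesis, and the LPC-to-cepstrum conversion itself is omitted.

    import numpy as np

    def preemphasise(signal, alpha=0.97):
        """First-order preemphasis for spectral flattening: s'[n] = s[n] - alpha * s[n-1]."""
        return np.append(signal[0], signal[1:] - alpha * signal[:-1])

    def frame_and_window(signal, frame_len=240, frame_shift=80):
        """Block the signal into overlapping frames and apply a Hamming window.
        With 8 kHz sampling, 240/80 samples correspond to 30 ms frames every 10 ms."""
        n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
        window = np.hamming(frame_len)
        frames = np.stack([signal[i * frame_shift : i * frame_shift + frame_len]
                           for i in range(n_frames)])
        return frames * window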
2.2 Speech and Speaker Recognition

Recognising the linguistic content in a spoken utterance and identifying the talker of the utterance, through developing algorithms and implementing them on machines, are the goals of speech and speaker recognition [C.-H. Lee et al. 1996]. A brief overview of some of the fundamental aspects of speech and speaker recognition is given in this section.

2.2.1 Speech Recognition

Broadly speaking, there are three approaches to speech recognition by machine, namely the acoustic-phonetic approach, the pattern-recognition approach, and the artificial intelligence approach. The acoustic-phonetic approach is based on the theory of acoustic phonetics, which postulates that there exists a finite set of distinctive phonetic units in spoken language and that the phonetic units are broadly characterised by sets of properties that are manifest in the speech signal or its spectrum over time. The problem with this approach is the fact that the degree to which each phonetic property is realised in the acoustic signal varies greatly between speakers, between phonetic contexts and even between repeated realisations of the phoneme by the same speaker in the same context. This approach generally requires the segmentation of the speech signal into acoustic-phonetic units and the identification of those units through their known properties or features. The pattern-recognition approach is basically one in which the speech patterns are used directly, without explicit feature determination and segmentation. This method has two steps: training of speech patterns, and recognition of patterns via pattern comparison. The artificial intelligence approach is a hybrid of the acoustic-phonetic approach and the pattern-recognition approach in that it exploits ideas and concepts of both methods, especially the use of an expert system for segmentation and labelling, and the use of neural networks for learning the relationships between phonetic events and all known inputs. Currently, the pattern-recognition approach is the method of choice for speech recognition because of its simplicity of use, proven high performance, robustness to different acoustic-phonetic realisations and invariance to different speech vocabularies, users, algorithms and decision rules [Rabiner and Juang 1993].

Depending on the mode of speech that the system is designed to handle, three tasks of speech recognition can be distinguished: isolated-word, connected-word, and continuous speech recognition. Continuous speech recognition allows natural conversational speech (150-250 words/min), with little or no adaptation of speaking style imposed on system users. Isolated-word recognition requires the speaker to pause for at least 100-250 ms after each word. It is unnatural for speakers and slows the processing rate to about 20-100 words/min. Continuous speech recognition is much more difficult than isolated word recognition because of the absence of word boundary information. Connected-word speech recognition represents a compromise between the two extremes: the speaker need not pause, but must pronounce and stress each word clearly [O'Shaughnessy 1987]. Restrictions on the vocabulary size also differentiate speech recognition systems: a small vocabulary is about 100-200 words, a large vocabulary about 1000 words, and a very large vocabulary 5000 words or greater. Another factor affecting speech recognition performance is speaker dependence or independence. Generally, speaker-dependent systems achieve better recognition performance than speaker-independent systems, which identify speech from many talkers, because of the limited variability in the speech signal coming from a single speaker. Speaker-dependent systems demonstrate good performance only for speakers who have previously trained the system [Kewley-Port 1995].

Research in automatic speech and speaker recognition by machine has been conducted for almost four decades, and the earliest attempts to devise systems for automatic speech recognition by machine were made in the 1950s. Several fundamental ideas in speech recognition were published in the 1960s, and speech-recognition research achieved a number of significant milestones in the 1970s. Just as isolated word recognition was a key focus of research in the 1970s, the problem of connected word recognition was a focus of research in the 1980s. Speech research in the 1980s was characterised by a shift in technology from template-based approaches to statistical modelling methods, especially the hidden Markov model approach [Rabiner 1989]. Since then, hidden Markov model techniques have become widely applied in virtually every speech-recognition system.

Speech recognition systems have been developed for a wide variety of applications, both within telecommunications and in the business arena. In telecommunications, a speech recognition system can provide information or access to data or services over the telephone line. It can also provide recognition capability on the desktop or in the office, including voice control of PC and workstation environments. In manufacturing and business, a recognition capability is provided to aid in manufacturing processes. Other applications include the use of speech recognition in toys and games. A significant portion of the research in speech processing in the past few years has gone into studying practical methods for speech recognition around the world. In the United States, major research efforts have been carried out at AT&T (the Next-Generation Text-to-Speech System) [AT&T's web site], IBM (ViaVoice Speech Recognition and the TANGORA System) [IBM's web site, Das and Picheny 1996], BBN (the BYBLOS and SPIN Systems) [BBN's web site], Dragon (the Dragon NaturallySpeaking Products) [DRAGON's web site], CMU (the SPHINX-II Systems) [Huang et al. 1996], Lincoln Laboratory [Paul 1989] and MIT (the Spoken Language Systems) [MIT's web site].
The Hearing Health Care Research Unit Projects [Western Ontario’s web site] and the INRS 86,000-word isolated word recognition system in Canada as well as the Philips rail information system, the CSELT system for Eurorail information services [CSELT’s web site], the University of Duisburg [Duisburg’s web site], the Cambridge University systems [Cambridge’s web site], and the LIMSI voice recognition [LIMSI’s web site] in Europe, are examples of the current activity in speech recognition research. Large vocabulary recognition systems are being developed based on the concept of interpreting telephony and telephone directory assistance in Japan . Syllable-recognisers have been designed to handle large vocabulary Mandarin dictation in China and Taiwan [Rabiner et al. 1996]. 2.2.2 Speaker Recognition Compared to speech recognition, there has been much less research in speaker recognition because fewer applications exist than for speech recognition. Speaker recognition is the process of automatically recognising who is speaking based on information obtained from speech waves. Speaker recognition techniques can be used to verify the identity claimed by people accessing certain protected systems. It enables access control of various services by voice. Voice dialing, banking over a telephone network, database access services, security control for confidential information, remote access of computers and the use for forensic purposes are important applications of speaker recognition technology [Furui 1997, Kunzel 1994]. Variation in signal characteristics from trial to trial is the most important factor affecting speaker recognition performance. Variations arise not only between speakers themselves but also from differences in recording and transmission conditions, noise and from a variety of psychological and physiological function within an individual speaker. Normalisation and adaptation techniques have been applied to compensate for these variations [Matsui and Furui 1993, Rosenberg et al. 1992, Higgins et al. 1991, Varga and Moore 1990, Gish 1990]. Speaker recognition can be classified into two specific tasks: identification and verification. Speaker identification is the process of determining which one of the voices known to the system best matches the input voice sample. When an unknown speaker must be identified as one 2.2 Speech and Speaker Recognition 19 of the set of known speakers, the task is known as closed-set speaker identification. If the input voice sample does not have a close enough match to anyone of the known speakers and the system can produce a“no match” decision [Reynolds 1992], the task is known as open-set speaker identification. Speaker verification is the process of accepting or rejecting the identity claim of a speaker. An identity claim is made by an unknown speaker, and an utterance of this unknown speaker is compared with the model for the speaker whose identity is claimed. If the match is good enough, that is, above a given threshold, the identity claim is accepted. The use of a “cohort speaker” set that is representative of the population close to the claimed speaker has been proposed [Rosenberg et al. 1992]. In all verification paradigms, there are two classes of errors: false rejections and false acceptances. A false rejection occurs when the system incorrectly rejects a true spaker and a false acceptance occurs when the system incorrectly accepts an imposter. 
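To make the two error types concrete, the following short sketch counts them on a set of verification scores at a fixed decision threshold. The score values and the threshold are invented purely for illustration and are not taken from any experiment in this thesis.

import numpy as np

def error_rates(genuine_scores, impostor_scores, threshold):
    """Count verification errors at a fixed decision threshold.

    A trial is accepted when its score is above the threshold, so a
    genuine score below it is a false rejection and an impostor score
    above it is a false acceptance.
    """
    genuine_scores = np.asarray(genuine_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    false_rejection_rate = np.mean(genuine_scores <= threshold)
    false_acceptance_rate = np.mean(impostor_scores > threshold)
    return false_rejection_rate, false_acceptance_rate

# Illustrative scores only: higher means a better match to the claimed speaker.
genuine = [2.1, 1.7, 2.5, 0.9, 1.9]
impostor = [0.3, 1.2, 0.8, 0.5, 1.1]
frr, far = error_rates(genuine, impostor, threshold=1.0)
print(f"false rejection rate = {frr:.2f}, false acceptance rate = {far:.2f}")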
An equal error rate condition is often used to adjust system parameters so that the two types of errors are equally likely [O’Shaughnessy 1987]. Speaker recognition methods can also be divided into text-dependent and textindependent. When the same text is used for both training and testing, the system is said to be text-dependent. For text-independent operation, the text used to test the system is theoretically unconstrained [Furui 1996]. Both text-dependent and independent speaker recognition systems can be defeated by playing back recordings of a registered speaker. To overcome this problem, a small set of pass phrases can be used, one of which is randomly chosen every time the system is used [Higgins et al. 1991]. Another method is text-prompted speaker recognition, which prompts the user for a new pass phrase or every occasion [Matsui and Furui 1993]. An extension of speaker recognition technology is the automatic extraction of the “turns” of each speaker from a dialogue involving two or more speakers [Gish et al. 1991, Siu et al. 1992, Wilcox et al. 1994]. 2.2.3 Summary Statistical pattern recognition is the method of choice for speech and speaker recognition, in which hidden Markov modelling of the speech signal is the most important technique that has helped to advance the state of the art of automatic speech and 2.3 Statistical Modelling Techniques 20 speaker recognition. 2.3 Statistical Modelling Techniques This section begins with a brief review of the classical Bayes decision theory and its application to the formulation of statistical pattern recognition problems. The distribution estimation problem in classifier design is then discussed in the Bayes decision theory framework. The maximum likelihood estimation is reviewed as one of the most widely used parametric unsupervised learning methods. Finally, three widely-used techniques—hidden Markov modelling, Gaussian mixture modelling and vetor quantisation modelling are reviewed in this framework. 2.3.1 Maximum A Posteriori Rule The task of a recogniser (classifier) is to achieve the minimum recognition error rate. A loss function is defined to measure this performance. It is generally non-negative with a value of zero representing correct recognition [Juang et al. 1996]. Let X be a random observation sequence from an information source, consisting of M classes of events. The task of a recogniser is to correctly classify each X into one of the M classes Ci , i = 1, 2, . . . , M . Suppose that when X truly belongs to class Cj but the recogniser classifies X as belonging to class Ci , we incur a loss ℓ(Ci |Cj ). Since the a posteriori probability P (Cj |X) is the probability that the true class is Cj , the expected loss or risk associated with classifying X to class Ci is [Duda and Hart 1973] R(Ci |X) = M ∑ ℓ(Ci |Cj )P (Cj |X) (2.3) j=1 In speech and speaker recognition, the following zero-one loss function is usually chosen ℓ(Ci |Cj ) = { 0 i=j 1 i 6= j i, j = 1, . . . , M (2.4) This loss function assigns no loss to correct classification and a unit loss to any error regardless of the class. 
The conditional loss becomes R(Ci |X) = ∑ j6=i P (Cj |X) = 1 − P (Ci |X) (2.5) 2.3 Statistical Modelling Techniques 21 Therefore in order to achieve the minimum error rate classification, we have to select the decision that Ci is correct, if the a posteriori probability P (Ci |X) is maximum [Allerhand 1987] C(X) = Ci if P (Ci |X) = max P (Cj |X) 1≤j≤M (2.6) The decision rule of (2.6) is called the maximum a posteriori (MAP) decision rule and the minimum error rate achieved by the MAP decision is called Bayes risk [Duda and Hart 1973]. 2.3.2 Distribution Estimation Problem For the implementation of the MAP rule, the required knowledge for an optimal classification decision is thus the set of a posteriori probabilities. However, these probabilities are not known in advance and have to be estimated from a training set of observations with known class labels. The Bayes decision theory thus effectively transforms the classifier design problem into the distribution estimation problem. This is the basis of the statistical approach to pattern recognition [Juang et al. 1996]. The a posteriori probability can be computed by using the Bayes rule P (Cj |X) = P (X|Cj )P (Cj ) P (X) (2.7) It can be seen from (2.7) that decision making based on the a posteriori probability employs both a priori knowledge from the a priori probability P (Cj ) together with present observed data from the conditional probability P (X|Cj ). For the simple case of isolated word recognition, the observations are the word utterances and the class labels are the word identities. The conditional probability P (X|Cj ) is often referred to as the acoustic model and the a priori probability P (Cj ) is known as the language model [C.-H. Lee et al. 1996]. In order to be practically implementable, the acoustic models are usually parameterised, and thus the distribution estimation problem becomes a parameter estimation problem, where a reestimation algorithm and a set of parameter estimation equations are established to find the best parameter set λi for each class Ci , i = 1, . . . , M based on the given optimisation criterion. The final task is to determine the right parametric form of the distributions and to estimate the unknown parameters defining the distribution from the training data. To obtain 2.3 Statistical Modelling Techniques 22 reliable parameter estimates, the training set needs to be of sufficient size in relation to the number of parameters. However, collecting and labelling data are labor intensive and resource demanding processes. When the amount of the training data is not sufficient, the quality of the distribution parameter estimates cannot be guaranteed. In other words, a true MAP decision can rarely be implemented and the minimum Bayes risk generally remains an unachievable lower bound [Juang et al. 1996]. 2.3.3 Maximum Likelihood Estimation As discussed above, the distribution estimation problem P (X|C) in the acoustic modelling approach becomes the parameter estimation problem P (X|λ). If λj denotes the parameter set used to model a particular class Cj , the likelihood function of model λj is defined as the probability P (X|λj ) treated as a function of the model λj . Maximising P (X|λj ) over λj is referred to as the maximum likelihood (ML) estimation problem. It has been shown that if the model is capable of representing the true distribution and enough training data is available, the ML estimate will be the best estimate of the true parameters [Nadas 1983]. 
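As an aside, the MAP rule of (2.6) together with the Bayes rule (2.7) can be illustrated in a few lines of code. The priors and class-conditional likelihoods below are made-up numbers used only to show the computation.

import numpy as np

def map_decision(priors, likelihoods):
    """Return the index of the class maximising P(C_j|X).

    By (2.7), P(C_j|X) is proportional to P(X|C_j)P(C_j); the common
    factor P(X) does not affect the argmax in (2.6).
    """
    posteriors = np.asarray(priors) * np.asarray(likelihoods)
    posteriors = posteriors / posteriors.sum()   # normalise by P(X)
    return int(np.argmax(posteriors)), posteriors

# Illustrative values for a three-class problem.
priors = [0.5, 0.3, 0.2]          # P(C_j), e.g. from a language model or equal priors
likelihoods = [1e-4, 6e-4, 2e-4]  # P(X|C_j) from the acoustic models
best, post = map_decision(priors, likelihoods)
print("decide class", best, "with posteriors", np.round(post, 3))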
However, if the form of the distribution is not known or the amount of training data is insufficient, the resulting parameter set is not guaranteed to produce a Bayes classifier. The expectation-maximisation (EM) algorithm proposed by Dempster, Laird and Rubin [1977] is a general approach to the iterative computation of ML estimates when the observation sequence can be viewed as incomplete data. Each iteration of this algorithm consists of an expectation (E) step followed by a maximisation (M) step. Many of its extensions and variations are popular tools for modal inference in a wide variety of statistical models in the physical, medical and biological sciences [Booth and Hobert 1999, Liu et al. 1998, Freitas 1998, Ambroise et al. 1997, Ghahramani 1995, Fessler and Hero 1994, Liu and Rubbin 1994]. In unsupervised learning, information on the class and state is unavailable, therefore the class and state are unobservable and only the data X are observable. Observable data are called incomplete data because they are missing the unobservable data, and data composed both of observable data and unobservable data are called complete data [Huang et al. 1990]. The purpose of the EM algorithm is to maximise the log-likelihood log P (X|λ) from incomplete data. Suppose a measure space Y of 2.3 Statistical Modelling Techniques 23 unobservable data exists corresponding to a measure space X of observable (incomplete) data. For given X ∈ X, Y ∈ Y, and the parameter model set λ, let P (X|λ) and P (Y |λ) be probability distribution functions defined on X and Y respectively. To maximise the log-likelihood of the observable data X over λ, we obtain L(X, λ) = log P (X|λ) = log P (X, Y |λ) − log P (Y |X, λ) (2.8) For two parameter sets λ and λ, the expectation of the incomplete log-likelihood L(X, λ) over the complete data (X, Y ) conditioned by X and λ is E[L(X, λ)|X, λ] = E[log P (X|λ)|X, λ] = log P (X|λ) = L(X, λ) (2.9) where E[.|X, λ] is the expectation conditioned by X and λ over complete data (X, Y ). Using (2.8), we obtain L(X, λ) = Q(λ, λ) − H(λ, λ) (2.10) where Q(λ, λ) = E[log P (X, Y |λ)|X, λ] and H(λ, λ) = E[log P (Y |X, λ)|X, λ] (2.11) The basis of the EM algorithm lies in the fact that if Q(λ, λ) ≥ Q(λ, λ) then L(X, λ) ≥ L(X, λ) since it follows from Jensen’s inequality that H(λ, λ) ≤ H(λ, λ). This implies that L(X, λ) increases monotonically on any iteration of parameter updates from λ to λ via maximisation of the Q-function. When Y is discrete, the Q-function and the H-function are represented as Q(λ, λ) = ∑ P (Y |X, λ) log P (X, Y |λ) and H(λ, λ) = Y ∈Y ∑ P (Y |X, λ) log P (Y |X, λ) Y ∈Y (2.12) The following EM algorithm permits an easy maximisation of the Q-function instead of maximising L(X, λ) directly. Algorithm 1 (The EM Algorithm) 1. Initialisation: Fix Y and choose an initial estimate λ 2. E-step: Compute Q(λ, λ) based on the given λ 3. M-step: Use a certain optimisation method to determine λ , for which Q(λ, λ) ≥ Q(λ, λ) 2.3 Statistical Modelling Techniques 24 4. Termination: Set λ = λ, repeat from E-step until the change of Q(λ, λ) falls below a preset threshold. 2.3.4 Hidden Markov Modelling The underlying assumption of the HMM is that the speech signal can be well characterised as a parametric random process, and that the parameters of the stochastic process can be estimated in a precise, well-defined manner. The HMM method provides a reliable way of recognizing speech for a wide range of applications [Juang 1998, Ghahramani 1997, Furui 1997, Rabiner et al. 
1996, Das and Picheny 1996]. There are two assumptions in the first-order HMM. The first is the Markov assumption, i.e. a new state is entered at each time t based on the transition probability, which only depends on the previous state. It is used to characterise the sequence of the time frames of a speech pattern. The second is the output-independence assumption, i.e. the output probability depends only on the state at that time regardless of when and how the state is entered [Huang et al. 1990]. A process satisfying the Markov assumption is called a Markov model [Kulkarni 1995]. An observable Markov model is a process where the output is a set of states at each instant of time and each state corresponds to an observable event. The hidden Markov model is a doubly stochastic process with an underlying Markov process which is not directly observable (hidden) but which can be observed through another set of stochastic processes that produce observable events in each of the states [Rabiner and Juang 1993]. Parameters and Types of HMMs Let O = (o1 o2 . . . oT ) be the observation sequence, S = (s1 s2 . . . sT ) the unobservable state sequence, X = (x1 x2 . . . xT ) the continuous vector sequence, V = {v1 , v2 , . . . , vK } the discrete symbol set, and N the number of states. A compact notation λ = {π, A, B} is proposed to indicate the complete parameter set of the HMM [Rabiner and Juang 1993], where • π = {πi }, • A = {aij }, πi = P (s1 = i|λ), 1 ≤ i ≤ N : the initial state distribution; aij = P (st+1 = j|st = i, λ), 1 ≤ i, j ≤ N , and 1 ≤ t ≤ T − 1: the state transition probability distribution, denoting the transition probability 2.3 Statistical Modelling Techniques 25 from state i at time t to state j at time t + 1; and • B = {bj (ot )}, bj (ot ) = P (ot |st = j, λ), 1 ≤ j ≤ N , and 1 ≤ t ≤ T : the observation probability distribution, denoting the probability of generating an observation ot in state j at time t with probability bj (ot ). One way to classify types of HMMs is by the structure of the transition matrix A of the Markov chain [Huang et al. 1990]: • Ergodic or fully connected HMM: every state can be reached from every other state in a finite number of states. The initial state probabilities and the state transition coefficients have the properties [Rabiner and Juang 1986] 0 ≤ πi ≤ 1, N ∑ πi = 1 and 0 ≤ aij ≤ 1, N ∑ aij = 1 (2.13) j=1 i=1 • Left-to-right: as time increases, the state index increases or stays the same. The state sequence must begin in state 1 and end in state N , i.e. πi = 0 if i 6= 1 and πi = 1 if i = 1. The state- transition coefficients satisfy the following fundamental properties aij = 0 j < i, 0 ≤ aij ≤ 1, and N ∑ aij = 1 (2.14) j=1 The additional constraint aij = 0, j > (i + ∆i), where ∆i > 0 is often placed on the state-transition coefficients to make sure that large changes in state indices do not occur [Rabiner 1989]. Such a model is called the Bakis model [Bakis 1976], i.e. a left-to-right model which allows some states to be skipped. An alternative way to classify types of HMMs is based on observations and their representations [Huang et al. 1990] • Discrete HMM (DHMM): the observations ot , 1 ≤ t ≤ T are discrete symbols in V = {v1 , v2 , . . . 
, vK}, which are normally codevector indices of a VQ source-coding technique, and

B = {b_j(k)}, 1 ≤ j ≤ N,  b_j(k) = P(o_t = v_k | s_t = j, λ), 1 ≤ k ≤ K,  \sum_{k=1}^{K} b_j(k) = 1    (2.15)

• Continuous HMM (CHMM): the observations ot ∈ O are vectors xt ∈ X and the parametric representation of the observation probabilities is a mixture of Gaussian distributions

B = {b_j(x_t)}, 1 ≤ j ≤ N, 1 ≤ t ≤ T,  b_j(x_t) = P(x_t | s_t = j, λ) = \sum_{k=1}^{K} w_{jk} N(x_t, \mu_{jk}, \Sigma_{jk}),  \int_X b_j(x_t)\,dx_t = 1    (2.16)

where wjk is the kth mixture weight in state j satisfying \sum_{k=1}^{K} w_{jk} = 1 and N(xt, µjk, Σjk) is the kth Gaussian component density in state j with mean vector µjk and covariance matrix Σjk (see Section 2.3.5 for detail). Other variants have been proposed, such as factorial HMMs [Ghahramani 1997], tied-mixture continuous HMMs [Bellegarda and Nahamoo 1990] and semi-continuous HMMs [Huang and Jack 1989].

Three Basic Problems for HMMs

There are three basic problems to be solved for HMMs. The parameter estimation problem is to train speech and speaker models, the evaluation problem is to compute likelihood functions for recognition and the decoding problem is to determine the best fitting (unobservable) state sequence [Rabiner and Juang 1993, Huang et al. 1990].

The parameter estimation problem: This problem determines the optimal model parameters λ of the HMM according to a given optimisation criterion. A variant of the EM algorithm, known as the Baum-Welch algorithm, yields an iterative procedure to reestimate the model parameters λ using the ML criterion [Baum 1972, Baum and Sell 1968, Baum and Eagon 1967]. In the Baum-Welch algorithm, the unobservable data are the state sequence S and the observable data are the observation sequence O. From (2.12), the Q-function for the HMM is as follows

Q(\bar{\lambda}, \lambda) = \sum_{S} P(S|O, \lambda) \log P(O, S|\bar{\lambda})    (2.17)

Computing P(O, S|λ) [Rabiner and Juang 1993, Huang et al. 1990], we obtain

Q(\bar{\lambda}, \lambda) = \sum_{t=0}^{T-1} \sum_{s_t} \sum_{s_{t+1}} P(s_t, s_{t+1}|O, \lambda) \log [\bar{a}_{s_t s_{t+1}} \bar{b}_{s_{t+1}}(o_{t+1})]    (2.18)

where \bar{\pi}_{s_1} is denoted by \bar{a}_{s_0 s_1} for simplicity. Regrouping (2.18) into three terms for the π, A, B coefficients, and applying Lagrange multipliers, we obtain the HMM parameter estimation equations

• For discrete HMM:

\bar{\pi}_i = \gamma_1(i),  \bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)},  \bar{b}_j(k) = \frac{\sum_{t=1,\, o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}    (2.19)

where

\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i,j),  \xi_t(i,j) = P(s_t = i, s_{t+1} = j|O, \lambda) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}    (2.20)

• For continuous HMM: estimation equations for the π and A distributions are unchanged, but the output distribution B is estimated via Gaussian mixture parameters as represented in (2.16)

\bar{w}_{jk} = \frac{\sum_{t=1}^{T} \eta_t(j,k)}{\sum_{t=1}^{T} \sum_{k=1}^{K} \eta_t(j,k)},  \bar{\mu}_{jk} = \frac{\sum_{t=1}^{T} \eta_t(j,k)\, x_t}{\sum_{t=1}^{T} \eta_t(j,k)},  \bar{\Sigma}_{jk} = \frac{\sum_{t=1}^{T} \eta_t(j,k)(x_t - \bar{\mu}_{jk})(x_t - \bar{\mu}_{jk})'}{\sum_{t=1}^{T} \eta_t(j,k)}    (2.21)

where

\eta_t(j,k) = \frac{\alpha_t(j)\beta_t(j)}{\sum_{j=1}^{N} \alpha_t(j)\beta_t(j)} \times \frac{w_{jk} N(x_t, \mu_{jk}, \Sigma_{jk})}{\sum_{k=1}^{K} w_{jk} N(x_t, \mu_{jk}, \Sigma_{jk})}    (2.22)

Note that for practical implementation, a scaling procedure [Rabiner and Juang 1993] is required to avoid number underflow on computers with ordinary floating-point number representations.

The evaluation problem: How can we efficiently compute P(O|λ), the probability that the observation sequence O was produced by the model λ? For solving this problem, we obtain

P(O|\lambda) = \sum_{\text{all } S} P(O, S|\lambda) = \sum_{s_1, s_2, \ldots, s_T} \pi_{s_1} b_{s_1}(o_1)\, a_{s_1 s_2} b_{s_2}(o_2) \cdots a_{s_{T-1} s_T} b_{s_T}(o_T)    (2.23)

An interpretation of the computation in (2.23) is the following. At time t = 1, we are in state s1 with probability πs1, and generate the symbol o1 with probability bs1(o1). A transition is made from state s1 at time t = 1 to state s2 at time t = 2 with probability as1s2 and we generate a symbol o2 with probability bs2(o2). This process continues in this manner until the last transition at time T from state sT−1 to state sT is made with probability asT−1sT and we generate symbol oT with probability bsT(oT). Figure 2.3 shows an N-state left-to-right HMM with ∆i in (2.14) set to 1.

Figure 2.3: An N-state left-to-right HMM with ∆i = 1

To reduce computations, the forward and the backward variables are used. The forward variable αt(i) is defined as αt(i) = P(o1 o2 . . . ot, st = i|λ), which can be computed iteratively as

\alpha_1(i) = \pi_i b_i(o_1), \; 1 \le i \le N, \quad \alpha_{t+1}(j) = \Big[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\Big] b_j(o_{t+1}), \; 1 \le j \le N, \; 1 \le t \le T-1    (2.24)

and the backward variable βt(i) is defined as βt(i) = P(ot+1 ot+2 . . . oT | st = i, λ), which can be computed iteratively as

\beta_T(i) = 1, \; 1 \le i \le N, \quad \beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \; 1 \le i \le N, \; t = T-1, \ldots, 1    (2.25)

Using these variables, the probability P(O|λ) can be computed from the forward variable, the backward variable, or both the forward and backward variables as follows

P(O|\lambda) = \sum_{i=1}^{N} \alpha_T(i) = \sum_{i=1}^{N} \pi_i b_i(o_1) \beta_1(i) = \sum_{i=1}^{N} \alpha_t(i) \beta_t(i)    (2.26)

The decoding problem: Given the observation sequence O and the model λ, how do we choose a corresponding state sequence S that is optimal in some sense? This problem attempts to uncover the hidden part of the model. There are several possible ways to solve this problem, but the most widely used criterion is to find the single best state sequence, which can be implemented by the Viterbi algorithm. In practice, it is preferable to base recognition on the maximum likelihood state sequence since this generalises easily to the continuous speech case. This likelihood is computed using the same algorithm as the forward algorithm except that the summation is replaced by a maximum operation.

2.3.5 Gaussian Mixture Modelling

Gaussian mixture models (GMMs) are effective models capable of achieving high recognition accuracy for speaker recognition. As discussed above, HMMs can adequately characterise both the temporal and spectral varying nature of the speech signal [Rabiner and Juang 1993]; however, for speaker recognition, the temporal information has been used effectively only in text-dependent mode. In text-independent mode, there are no constraints on the training and test text and this temporal information has not been shown to be useful [Reynolds 1992]. On the other hand, the performance of text-independent speaker identification depends mostly on the total number of mixture components (number of states times number of mixture components assigned to each state) per speaker model [Matsui and Furui 1993, Matsui and Furui 1992, Reynolds 1992]. Therefore, it can be seen that the N-state M-mixture continuous ergodic HMM is roughly equivalent to the NM-mixture GMM in text-independent speaker recognition applications. In this case, the number of states does not play an important role, and hence for simplicity, the 1-state HMM, i.e. the GMM, is currently used for text-independent speaker recognition [Furui 1994].
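As a brief illustration of the forward recursion (2.24) and the likelihood computation (2.26) reviewed above, the following sketch evaluates P(O|λ) for a toy discrete HMM. The parameter values are invented for the example, and the scaling procedure mentioned earlier is omitted for clarity.

import numpy as np

def forward_likelihood(pi, A, B, O):
    """Compute P(O|lambda) for a discrete HMM by the forward recursion.

    pi: (N,) initial state probabilities, A: (N, N) transition matrix,
    B: (N, K) observation probabilities, O: list of symbol indices.
    """
    alpha = pi * B[:, O[0]]                      # alpha_1(i), eq. (2.24)
    for o_t in O[1:]:
        alpha = (alpha @ A) * B[:, o_t]          # induction step of (2.24)
    return alpha.sum()                           # eq. (2.26)

# A toy 2-state, 3-symbol model (values are illustrative only).
pi = np.array([1.0, 0.0])
A = np.array([[0.7, 0.3],
              [0.0, 1.0]])                       # left-to-right structure
B = np.array([[0.6, 0.3, 0.1],
              [0.1, 0.3, 0.6]])
print("P(O|lambda) =", forward_likelihood(pi, A, B, O=[0, 1, 2, 2]))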
Although we can get equations for the GMM from the continuous HMM with the number of states N = 1, for practical applications, they are summarised as follows. The parameter estimation problem: The Q-function for the GMM is of the form [Huang et al. 1990] Q(λ, λ) = T K ∑ ∑ P (i|xt , λ) log P (xt , i|λ) = i=1 t=1 T K ∑ ∑ P (i|xt , λ) log[wi N (xt , µi , Σi )] i=1 t=1 (2.27) where P (i|xt , λ) is the a posteriori probability for the ith mixture, i = 1, . . . , K and satisfies P (i|xt , λ) = P (xt , i|λ) K ∑ wi N (xt , µi , Σi ) = K ∑ P (xt , k|λ) k=1 (2.28) wk N (xt , µk , Σk ) k=1 λ = {w, µ, Σ} denotes a set of model parameters, where w = {wi }, µ = {µi }, Σ = {Σi }, i = 1, . . . , K, wi are mixture weights satisfying ∑K i=1 wi = 1 and N (xt , µi , Σi ) are the d-variate Gaussian component densities with mean vectors µi and covariance matrices Σi N (xt , µi , Σi ) = 1 1 d (2π) 2 |Σi | 2 exp { 1 − (xt − µi )′ Σ−1 i (xt − µi ) 2 } (2.29) where (xt − µi )′ is the transpose of (xt − µi ), Σ−1 is the inverse of Σi , and |Σi | is the i determinant of Σi . Setting derivatives of the Q-function with respect to λ to zero, the following reestimation formulas are found [Huang et al. 1990, Reynolds 1995b] T 1∑ P (i|xt , λ), wi = T t=1 µi = T ∑ P (i|xt , λ)xt t=1 T ∑ t=1 , P (i|xt , λ) Σi = T ∑ P (i|xt , λ)(xt − µi )(xt − µi )′ t=1 T ∑ P (i|xt , λ) t=1 (2.30) The evaluation problem: For a training vector sequence X = (x1 x2 . . . xT ), 2.3 Statistical Modelling Techniques 31 the likelihood of the GMM is log P (X|λ) = T ∑ log P (xt |λ) = t=1 t=1 2.3.6 T ∑ log K ∑ wi N (xt , µi , Σi ) (2.31) i=1 Vector Quantisation Modelling Vector quantisation (VQ) is a data reduction method, which is used to convert a feature vector set into a small set of distinct vectors using a clustering technique. Advantages of this reduction are reduced storage, reduced computation, and efficient representation of speech sounds [Furui 1996, Bellegarda 1996]. The distinct vectors are called codevectors and the set of codevectors that best represents the training vector set is called the codebook. The VQ codebook can be used as a speech or speaker model and a good recognition performance can be obtained in many cases [Rabiner et al. 1983, Soong et al. 1987, Tseng et al. 1987, Tsuboka and Nakahashi 1994, Bellegarda 1996]. Since there is only a finite number of code vectors, the process of choosing the best representation of a given feature vector is equivalent to quantising the vector and leads to a certain level of quantisation error. This error decreases as the size of the codebook increases, however the storage required for a large codebook is nontrivial. The key point of VQ modelling is to derive an optimal codebook which is commonly achieved by using the hard Cmeans (K-means) algorithm reviewed in Section 2.4.5. A variant of this algorithm is the LBG algorithm [Linde et al. 1980], which is widely used in speech and speaker recognition. The difference between the GMM and VQ is the change from a “soft” mapping in the GMM to a “hard” mapping in VQ of feature vectors into clusters [Chou et al. 1989]. In the GMM, we obtain P (xt |λ) = K ∑ wi N (xt , µi , Σi ) (2.32) i=1 It means that vector xt can belong to K clusters (soft mapping) represented by K Gaussian distributions. The degree of belonging of xt to the ith cluster is represented by the probability P (i|xt , λ) and is determined as in (2.28). For the GMM, we obtain 0 ≤ P (i|xt , λ) ≤ 1. 
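Before turning to the hard-mapping case, the reestimation formulas (2.28) and (2.30) can be illustrated by a small sketch that runs a few EM iterations for a GMM. Diagonal covariance matrices and randomly generated training data are simplifying assumptions made only for this example.

import numpy as np

def em_step(X, w, mu, var):
    """One EM iteration for a diagonal-covariance GMM, following (2.28) and (2.30)."""
    T, d = X.shape
    K = w.shape[0]
    # E-step: posterior P(i|x_t, lambda) of each mixture for each frame, eq. (2.28).
    log_dens = np.empty((T, K))
    for i in range(K):
        diff2 = (X - mu[i]) ** 2 / var[i]
        log_dens[:, i] = (np.log(w[i])
                          - 0.5 * np.sum(diff2, axis=1)
                          - 0.5 * np.sum(np.log(2 * np.pi * var[i])))
    post = np.exp(log_dens - log_dens.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    # M-step: reestimate weights, means and variances, eq. (2.30).
    Ni = post.sum(axis=0)
    w_new = Ni / T
    mu_new = (post.T @ X) / Ni[:, None]
    var_new = np.stack([(post[:, i:i+1] * (X - mu_new[i]) ** 2).sum(axis=0) / Ni[i]
                        for i in range(K)])
    return w_new, mu_new, var_new

# Illustrative two-component model trained on random data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(3.0, 0.5, (100, 2))])
w, mu, var = np.full(2, 0.5), np.array([[0.0, 0.0], [1.0, 1.0]]), np.ones((2, 2))
for _ in range(10):
    w, mu, var = em_step(X, w, mu, var)
print("weights", np.round(w, 2), "means", np.round(mu, 2))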
For VQ, vector xt is or is not in the ith cluster, and the probability P(i|xt, λ) is determined as follows [Chou et al. 1989, Duda and Hart 1973]

P(i|x_t, \lambda) = 1 if d_{it} < d_{jt} for all j \ne i, and 0 otherwise    (2.33)

where ties are broken randomly and dit denotes the distance from vector xt to the ith cluster. If a particular distance is defined and (2.33) is substituted into (2.30), variants of VQ are determined as follows

• Conventional VQ: the Euclidean distance d_{it}^2 = (x_t - \mu_i)^2 is used and

\mu_i = \frac{1}{T_i} \sum_{x_t \in \text{cluster } i} x_t    (2.34)

where Ti is the number of vectors in the ith cluster, and \sum_{i=1}^{K} T_i = T.

• Extended VQ: the Mahalanobis distance d_{it}^2 = (x_t - \mu_i)' \Sigma_i^{-1} (x_t - \mu_i) is used and

\mu_i = \frac{1}{T_i} \sum_{x_t \in \text{cluster } i} x_t,  \Sigma_i = \frac{1}{T_i} \sum_{x_t \in \text{cluster } i} (x_t - \mu_i)(x_t - \mu_i)'    (2.35)

• Entropy-Constrained VQ: d_{it}^2 = (x_t - \mu_i)' \Sigma^{-1} (x_t - \mu_i) - 2 \log w_i, assuming that Σi = Σ for all i, Σ fixed [Chou et al. 1989], and

\mu_i = \frac{1}{T_i} \sum_{x_t \in \text{cluster } i} x_t,  w_i = \frac{T_i}{T}    (2.36)

The relationships between the statistical modelling techniques are summarised in Figure 2.4.

Figure 2.4: Relationships between HMM, GMM, and VQ techniques

2.3.7 Summary

The statistical classifier based on the Bayes decision theory has been reviewed in this section. The classifier design problem is to achieve the minimum recognition error rate, which is performed by the MAP decision rule. Since the a posteriori probabilities are not known in advance, the problem becomes a distribution estimation problem. This is solved by determining the a priori probability and the likelihood function. As discussed in Section 2.3.2, the former is derived from the language models in speech recognition. In speaker identification, this is often simplified by assuming an equal a priori probability for all speakers. The latter is derived from the acoustic models, which, in order to be practically implementable, are usually parameterised. Now we need to solve the model parameter estimation problem. Model parameters are determined such that the likelihood function is maximised. This is performed by the EM algorithm. The HMM is the most effective model for solving this problem. In the continuous case, the one-state HMM is identical to the GMM. If a hard mapping of a vector to the Gaussian distribution is applied rather than a soft mapping as in the GMM, we obtain the VQ model and its variants. The block diagram in Figure 2.5 illustrates what we have summarised.

Figure 2.5: A statistical classifier for isolated word recognition and speaker identification

2.4 Fuzzy Clustering Techniques

This section begins with a brief review of fuzzy set theory and the membership function. The membership estimation problem is then discussed.
The role of cluster analysis in pattern recognition as well as hard C-means, fuzzy C-means, noise clustering, and possibilistic C-means clustering techniques are then reviewed. 2.4.1 Fuzzy Sets and the Membership Function Fuzzy set theory was introduced in 1965 by Lotfi Zadeh to represent and manipulate data and information that possess nonstatistical uncertainty. Fuzzy set theory [Zadeh 1965] is a generalisation of conventional set theory that was introduced as a new way to represent the vagueness or imprecision that is ever present in our daily experience as well as in natural language [Bezdek 1993]. Let X be a feature vector space. A set A is called a crisp set if every feature vector x in X either is in A (x ∈ A) or is not in A (x 6∈ A). A set B is called a fuzzy set in X if it is characterized by a membership function uB (x), taking values in the interval [0, 1] and representing the “degree of membership” of x in B. With the ordinary set A, the membership value can take on only two values 0 and 1, with uA (x) = 1 if x ∈ A or uA (x) = 0 if x 6∈ A [Zadeh 1965]. With the fuzzy set B, the membership function uB (x) can take any value between 0 and 1. The membership function is the basic idea in fuzzy set theory. 2.4 Fuzzy Clustering Techniques 2.4.2 35 Maximum Membership Rule A membership function uCi (x) can represent the degree to which an observation x belongs to a class Ci . In order to correctly classify an unknown observation x into one of the classes Ci , i = 1, 2, . . . , M , the following maximum membership decision rule can be used C(x) = Ci if uCi (x) = max uCj (x) 1≤j≤M (2.37) Therefore in order to achieve the best classification, we have to decide on class Ci , if the membership function uCi (x) is maximum [Keller et al. 1985]. 2.4.3 Membership Estimation Problem For the implementation of the maximum membership rule, the required knowledge for an optimal classification decision is that of the membership functions. These functions are not known in advance and have to be estimated from a training set of observations with known class labels. The estimation procedure in fuzzy pattern recognition is called the abstraction and the use of estimates to compute the membership values for unknown observations not contained in the training set is called the generalisation procedure [Bellman et al. 1966]. An estimate of the membership function is referred to as an abstracting function. To generate a “good” abstracting function from the knowledge of its values over a finite set of observations, we need some a priori information about the class of functions to which the abstracting function belongs, such that this information in combination with observations from X is sufficient for estimating. This approach involves choosing a family of abstracting functions and finding a member of this family which fits “best”, in some specified sense, the given observation sequence X. In most practical situations, the a priori information about the membership function of a fuzzy class is insufficient to generate an abstracting function, which is “optimal” in a meaningful sense. 2.4.4 Pattern Recognition and Cluster Analysis Pattern recognition can be characterised as “a field concerned with machine recognition of meaningful regularities in noisy or complex environments” [Duda and Hart 1973]. 2.4 Fuzzy Clustering Techniques 36 A workable definition for pattern recognition is “the search for structure in data” [Bezdek 1981]. 
Three main issues of the search for structure in data are: feature selection, cluster analysis, and classification. Feature selection is the search for structure in data items, or observations xt ∈ X. The feature space X may be compressed by eliminating redundant and unimportant features via selection or transformation. Cluster analysis is the search for structure in data sets, or sequences X ∈ X. Since “optimal” features are not known in advance, we often attempt to discover these by clustering the feature variables. Finally, classification is the search for structure in data spaces X. A pattern classifier designed for X is a device or means whereby X itself is partitioned into “decision regions” [Bezdek 1981]. Clustering is the grouping of similar objects [Hartigan 1975]. Clustering in the given unlabeled data X is to assign to feature vectors labels that identify “natural subgroups” in X [Bezdek 1993]. In other words, clustering known as unsupervised learning in X is a partitioning of X into C subsets or C clusters. The most important requirement is to find a suitable measure of clusters, referred to as a clustering criterion. Objective function methods allow the most precise formulation of the clustering criterion. To construct an objective function, a similarity measure is required. A standard way of expressing similarity is through a set of distances between pairs of feature vectors. Optimising the objective function is performed to find optimal partitions of data. The partitions generated by a clustering method define for all data elements to which cluster they belong. The boundaries of partitions are sharp in the hard clustering method or vague in the fuzzy clustering method. Each feature vector of a fuzzy partition belongs to different clusters with different membership values. Cluster validity is an important issue, which deals with the significance of the structure imposed by a clustering method. It is required in order to determine an optimal partition in the sense that it best explains the unknown structure in X. 2.4.5 Hard C-Means Clustering Let U = [uit ] be a matrix whose elements are memberships of xt in the ith cluster, i = 1, . . . , C, t = 1, . . . , T . Hard C-partition space for X is the set of matrices U such 2.4 Fuzzy Clustering Techniques 37 that [Bezdek 1993] uit ∈ {0, 1} C ∑ ∀i, t, uit = 1 ∀t, 0< T ∑ uit < T ∀i (2.38) t=1 i=1 where uit = ui (xt ) is 1 or 0, according to whether xt is or is not in the ith cluster, ∑C i=1 uit = 1 ∀t means each xt is in exactly one of the C clusters, and 0 < ∑T t=1 uit < T ∀i means that no cluster is empty and no cluster is all of X because of 2 ≤ C < T . The HCM method [Duda and Hart 1973] is based on minimisation of the sum-ofsquared-errors function as follows J(U, λ; X) = C ∑ T ∑ uit d2it (2.39) i=1 t=1 where U = {uit } is a hard C-partition of X, λ is a set of prototypes, in the simplest case, it is the set of cluster centers: λ = {µ}, µ = {µi }, i = 1, . . . , C, and dit is the distance in the A norm (A is any positive definite matrix) from xt to µi , known as a measure of dissimilarity d2it = ||xt − µi ||2A = (xt − µi )′ A(xt − µi ) (2.40) Minimising the hard objective function J(U, λ; X) in (2.39) gives uit = µi = { 1 dit < djt 0 otherwise T ∑ uit xt t=1 /∑ T j = 1, . . . , C, uit j 6= i (2.41) (2.42) t=1 where ties are broken randomly. 
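A minimal sketch of the HCM iteration, alternating the membership update (2.41) and the centre update (2.42) with the Euclidean distance, is given below; the two-dimensional sample data and the choice of C = 2 clusters are illustrative assumptions.

import numpy as np

def hard_c_means(X, C, n_iter=20, seed=0):
    """Alternate the membership update (2.41) and the centre update (2.42)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=C, replace=False)]             # initial prototypes
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # squared distances
        labels = d2.argmin(axis=1)                                # u_it = 1 for the nearest cluster
        for i in range(C):
            if np.any(labels == i):                               # keep clusters non-empty
                mu[i] = X[labels == i].mean(axis=0)               # eq. (2.42)
    return mu, labels

# Illustrative data: two well-separated groups of points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])
centres, labels = hard_c_means(X, C=2)
print("cluster centres:\n", np.round(centres, 2))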
2.4.6 Fuzzy C-Means Clustering The fuzzy C-means (FCM) method is the most widely used approach in both theory and practical applications of fuzzy clustering techniques to unsupervised classification [Zadeh 1977]. It is an extension of the hard C-means method that was first introduced by Dunn [1974]. A weighting exponent m on each fuzzy membership called the degree of fuzziness was introduced in the FCM method [Bezdek 1981] and hence a general estimation procedure for the FCM has been established and its convergence has been shown [Bezdek 1990, Bezdek and Pal 1992]. 2.4 Fuzzy Clustering Techniques 38 Fuzzy C-Means Algorithm Let U = [uit ] be a matrix whose elements are memberships of xt in cluster i, i = 1, . . . , C, t = 1, . . . , T . Fuzzy C-partition space for X is the set of matrices U such that [Bezdek 1993] 0 ≤ uit ≤ 1 ∀i, t, C ∑ uit = 1 ∀t, 0< T ∑ uit < T ∀i (2.43) t=1 i=1 where 0 ≤ uit ≤ 1 ∀i, t means it is possible for each xt to have an arbitrary distribution of membership among the C fuzzy clusters. The FCM method is based on minimisation of the fuzzy squared-error function as follows [Bezdek 1981] Jm (U, λ; X) = T C ∑ ∑ 2 um it dit (2.44) i=1 t=1 where U = {uit } is a fuzzy C-partition of X, m > 1 is a weighting exponent on each fuzzy membership uit and is called the degree of fuzziness, λ and dit are defined as in (2.39) The basic idea of the FCM method is to minimise Jm (U, λ; X) over the variables U and λ on the assumption that matrices U that are part of optimal pairs for Jm (U, λ; X) identify good partitions of the data. Minimising the fuzzy objective function Jm (U, λ; X) in (2.44) gives uit = 1 C ∑ (2.45) (d2it /d2kt )1/(m−1) k=1 µi = T ∑ t=1 um it xt /∑ T um it (2.46) t=1 Gustafson-Kessel Algorithm An interesting modification of the FCM has been proposed by Gustafson and Kessel [1979]. It attempts to recognise the fact that different clusters in the same data set X may have differing geometric shapes. A generalisation to a metric that appears more natural was made through the use of a fuzzy covariance matrix. Replacing the distance in (2.40) by an inner product induced a norm of the form d2it = (xt − µi )′ Mi (xt − µi ) (2.47) 2.4 Fuzzy Clustering Techniques 39 where the Mi , i = 1, . . . , C are symmetric and positive definite and subject to the following constraints |Mi | = ρi , with ρi > 0 and fixed for each i. Define a fuzzy covariance matrix Σi by Σi = T ∑ ′ um it (xt − µi )(xt − µi ) /∑ T um it (2.48) t=1 t=1 then we have Mi−1 = (|Mi ||Σi |)−1/d Σi , i = 1, . . . , C, where |Mi | and |Σi | are the determinants of Mi and Σi , respectively and d is the vector space dimension. The parameter set in this algorithm is λ = {µ, Σ}, where µ = {µi } and Σ = {Σi }, i = 1, . . . , C, are computed by (2.46) and (2.48), respectively. Gath-Geva Algorithm The algorithm proposed by Gath and Geva [1989] is an extension of the GustafsonKessel algorithm that also takes the size and density of the clusters into account. The distance is chosen to be indirectly proportional to the probability P (xt , i|λ) 1 1 = d2it = P (xt , i|λ) wi N (xt , µi , Σi ) (2.49) where the Gaussian distribution N (xt , µi , Σi ) is defined in (2.29). 
The parameter set in this algorithm is λ = {w, µ, Σ}, where µ = {µi }, Σ = {Σi } are computed as the Gustafson-Kessel algorithm, and w = {wi }, wi are mixture weights computed as follows wi = T ∑ t=1 um it /∑ C T ∑ um it (2.50) t=1 i=1 In contrast to the FCM and the Gustafson-Kessel algorithms, the Gath-Geva algorithm is not based on an objectice function, but is a fuzzification of statistical estimators. If we were to apply for the Gath-Geva algorithm the same technique as for the FCM and the Gustafson-Kessel algorithms, i.e. minimising the least-squares function in (2.44), the resulting system of equations could not be solved analytically. In this sense, the Gath-Geva algorithm is a good heuristic on the basis of an analogy with probability theory [Höppner et al. 1999]. 2.4.7 Noise Clustering Both HCM and FCM clustering methods have a common disadvantage in the problem of sensitivity to outliers. As can be seen from (2.38) and (2.43), the memberships are 2.4 Fuzzy Clustering Techniques 40 relative numbers. The sum of the memberships of a feature vector xt across classes is always equal to one both for clean data and for noisy data, i.e. data “contaminated” by erroneous points or “outliers”. It would be more reasonable that, if the feature vector xt comes from noisy data or outliers, the memberships should be as small as possible for all classes and the sum should be smaller than one. This property is important since all parameter estimates are computed based on these memberships. An idea of a noise cluster has been proposed by Davé [1991] to deal with noisy data or outliers for fuzzy clustering methods. The noise is considered to be a separate class and is represented by a prototype— a parameter subset characterising a cluster—that has a constant distance δ from all feature vectors. The membership u•t of a vector xt in the noise cluster is defined to be u•t = 1 − C ∑ uit t = 1, . . . , T (2.51) i=1 Therefore, the membership constraint for the “good” clusters is effectively relaxed to C ∑ uit <1 t = 1, . . . , T (2.52) i=1 This allows noisy data and outliers to have arbitrarily small membership values in good clusters. The objective function in the noise clustering (NC) approach is as follows Jm (U, λ; X) = T C ∑ ∑ 2 um it dit + T ∑ t=1 i=1 t=1 δ 2 ( 1− C ∑ uit i=1 )m (2.53) where U = {uit } is a noise-clustering C-partition of X and m > 1. Since the second term in (2.53) is independent of the parameter set λ and the distance measure, parameters are estimated by minimising the first term—the squared-errors function in FCM clustering (see 2.44), with respect to λ. Therefore (2.46) and (2.50) still apply to this approach for parameter estimation. Minimising the objective function Jm (U, λ; X) in (2.53) with respect to uit gives uit = 1 C ∑ (d2it /d2kt )1/(m−1) (2.54) + (d2it /δ 2 )1/(m−1) k=1 The second term in the denominator of (2.54) becomes quite large for outliers, resulting in small membership values in all the good clusters for outliers. The advantage of 2.4 Fuzzy Clustering Techniques 41 this approach is that it forms a more robust version of the FCM algorithm and can be used instead of the FCM algorithm provided a suitable value for constant distance δ can be found [Davé and Krishnapuram 1997]. 2.4.8 Summary Fuzzy set theory, membership functions and clustering techniques have been reviewed in this Section. Figure 2.6 illustrates the clustering techniques and shows the constraints on memberships for each technique. 
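To contrast the FCM membership (2.45) with the noise-clustering membership (2.54), the short sketch below evaluates both for a single feature vector. The distances, the degree of fuzziness m and the noise distance δ are invented values chosen only to show the effect on outliers.

import numpy as np

def fcm_memberships(d2, m=2.0):
    """FCM memberships of one vector in C clusters from squared distances, eq. (2.45)."""
    ratios = (d2[:, None] / d2[None, :]) ** (1.0 / (m - 1.0))
    return 1.0 / ratios.sum(axis=1)

def nc_memberships(d2, delta, m=2.0):
    """Noise-clustering memberships, eq. (2.54): the extra term lets outliers
    take small memberships in all good clusters."""
    ratios = (d2[:, None] / d2[None, :]) ** (1.0 / (m - 1.0))
    noise_term = (d2 / delta ** 2) ** (1.0 / (m - 1.0))
    return 1.0 / (ratios.sum(axis=1) + noise_term)

# Squared distances of one frame to C = 3 clusters (illustrative values).
d2_typical = np.array([1.0, 4.0, 9.0])
d2_outlier = np.array([25.0, 36.0, 49.0])
for d2 in (d2_typical, d2_outlier):
    print("FCM:", np.round(fcm_memberships(d2), 3),
          " NC:", np.round(nc_memberships(d2, delta=3.0), 3))

For the typical frame the two updates give similar values, while for the distant frame the noise-clustering memberships are small in every good cluster, which is the behaviour described above.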
Since the Gath-Geva technique is not based on an objective function as discussed above, corresponding versions of the Gath-Geva algorithm for the NC and the PCM techniques have not been proposed. Extended versions of the HCM are also extensions of VQ, which have been reviewed in Section 2.3.6.

Figure 2.6: Clustering techniques and their extended versions

The fuzzy membership of a feature vector in a cluster depends not only on where the feature vector is located with respect to the cluster, but also on how far away it is with respect to other clusters. Therefore fuzzy memberships are spread across the classes and depend on the number of clusters present. FCM clustering has been shown to be advantageous over hard clustering. It has become more attractive with the connection to neural networks [Kosko 1992]. Recent advances in fuzzy clustering have shown spectacular ability to detect not only hypervolume clusters, but also clusters which are actually "thin shells" such as curves and surfaces [Davé 1990, Krishnapuram et al. 1992]. However, the FCM membership values are relative numbers and thus cannot distinguish between feature vectors and outliers. It has been shown that the NC approach is quite successful in improving the robustness of a variety of fuzzy clustering algorithms. A robust-statistical foundation for the NC method was established by Davé and Krishnapuram [1997]. Another approach is the possibilistic C-means method, which is presented in Section 8.2.1 as a further extension of fuzzy set theory-based clustering techniques.

2.5 Fuzzy Approaches in the Literature

This section presents some fuzzy approaches to speech and speaker recognition in the literature. The first approach is to apply the maximum membership decision rule of Section 2.4.2. The second is the use of the FCM algorithm instead of the HCM (K-means) algorithm in coding a cepstral vector sequence X for the discrete HMM. The third approach is to apply fuzzy rules in hybrid neuro-fuzzy systems. The last approach is not reviewed in this section since it is out of the scope of this thesis. The works relating to this approach can be found in Kasabov [1998] and Kasabov et al. [1999].

2.5.1 Maximum Membership Rule-Based Approach

An early application of fuzzy set theory to decision making was proposed by Pal and Majumder [1977]. Recognition of vowels and identification of speakers using the first three formants (F1, F2 and F3) were implemented by using the membership function ui(x) associated with an unknown vector x = {x1, . . . , xn} for each model λi, i = 1, . . . , M as follows

u_i(x) = \frac{1}{1 + [d(x, \lambda_i)/E]^F}    (2.55)

where E is an arbitrary positive constant, F is any integer, and d(x, λi) is the weighted distance from vector x to the nearest prototype of model λi. The prototype points chosen are the averages of the coordinate values corresponding to the entire set of samples in a particular class. Experiments were carried out on a set of Telugu (one of the major Indian languages) words containing about 900 commonly used speech units for 10 vowels, uttered by three male informants in the age group of 28-30 years. Overall recognition is about 82%.
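A small sketch of the membership function (2.55) follows. The formant prototypes, the constants E and F, the use of a plain Euclidean distance instead of a weighted one, and the test vector are all assumptions made for illustration.

import numpy as np

def pal_majumder_membership(x, prototypes, E=500.0, F=2):
    """Membership of vector x in each model by eq. (2.55), using the
    Euclidean distance to the nearest prototype of each model."""
    u = []
    for protos in prototypes:                        # one array of prototypes per model
        d = np.min(np.linalg.norm(protos - x, axis=1))
        u.append(1.0 / (1.0 + (d / E) ** F))
    return np.array(u)

# Illustrative vowel prototypes in (F1, F2) space, in Hz.
prototypes = [np.array([[300.0, 2300.0]]),           # model for /i/
              np.array([[700.0, 1200.0]]),           # model for /a/
              np.array([[350.0, 800.0]])]            # model for /u/
x = np.array([680.0, 1300.0])
u = pal_majumder_membership(x, prototypes)
print("memberships:", np.round(u, 3), "-> class", int(np.argmax(u)))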
An alternative application is the use of fuzzy algorithms for assigning phonetic and phonemic labels to speech segments was presented in [De Mori and Laface 1980]. A method consisting of fuzzy restriction for extracting features, fuzzy relations for relating these features with phonetic and phonemic interpretation, and their use for interpretation of a speech pattern in terms of possibility theory has been described. Experimental results showed an overall recognition of about 95% for 400 samples pronounced by the four talkers. 2.5.2 FCM-Based Approach This approach investigates the use of the fuzzy C-means algorithm instead of the K-means (hard C-means) algorithm in coding a spectral vector sequence X for the discrete HMM. This modification of the VQ is called the fuzzy VQ (FVQ). Since the input of the discrete HMM is an observation sequence O = (o1 . . . oT ) consisting of discrete symbols, the spectral continuous vector sequence X = (x1 . . . xT ) needs to be transformed into the discrete symbol sequence O. This is normally performed by a VQ source coding technique, where each vector xt is coded into a discrete symbol vk —the index of the codevector closest to vector xt ot = vk = arg min d(xt , µi ) 1≤i≤K (2.56) and the observation probability distribution B is of the form defined in (2.15). The FVQ uses fuzzy C-partitioning on X, each vector xt belongs to classes with corresponding memberships, thus the FVQ maps vector xt into an observation vector 2.5 Fuzzy Approaches in the Literature 44 ot = (u1t , . . . , uCt ), where uit is the membership of vector xt in class i and is computed by using (2.45). For the observation probability distribution B = {bj (ot )}, authors have proposed different computation methods. Following [Tseng et al. 1987], B is computed as follows bj (ot ) = C ∑ uit bij , 1 ≤ j ≤ N, 1≤t≤T (2.57) i=1 where bij is reestimated by bij = T∑ −1 uit αt (j)βt (j) t=1 T∑ −1 , 1 ≤ i ≤ C, 1≤j≤N (2.58) αt (j)βt (j) t=1 Experiments were conducted to compare three cases: using VQ/HMM, using FVQ/HMM for training only, and using FVQ/HMM for both training and recognition, where the HMMs are 5-state left-right ones. The highest isolated-word recognition rates for the three cases are 72%, 77%, and 77%, respectively, where the degree of fuzziness is m = 1.25, 10 training utterances are used, and the vocabulary is the E-set consisting of 9 English letters {b, c, d, e, g, p, t, v, z}. To obtain more tractable computation, Tsuboka and Nakahashi [1994] have proposed two alternative methods 1. Multiplication-type FVQ: bj (ot ) = C ∏ buijit , 1 ≤ j ≤ N, 1≤t≤T (2.59) i=1 and bij is computed as in (2.58) 2. Addition-type FVQ: bj (ot ) = C ∑ uit bij , 1 ≤ j ≤ N, 1≤t≤T (2.60) i=1 and bij is computed as bij = T∑ −1 t=1 ζij (t), ζij (t) = αt (j)βt (j) T∑ −1 αt (j)βt (j) t=1 × uit bij C ∑ , 1 ≤ i ≤ C, 1≤j≤N uit bij i=1 (2.61) 2.5 Fuzzy Approaches in the Literature 45 It was reported that for practical applications the multiplication type is more suitable than the addition type. In isolated-word recogniton experiments, the number of states of each HMM was set to be 1/5 of the average length in frames of training data. The vocabulary is 100 city names in Japan and the degree of fuzziness is m = 2. The highest recognition rates reached with a codebook size of 256 were 98.5% for the multiplication type, 98.2% for the addition type, and 97.5% for the VQ/HMM. 
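To make the fuzzy coding step concrete, the following sketch maps a single cepstral frame to the fuzzy observation vector ot = (u1t, . . . , uCt) using the FCM membership formula (2.45). The toy codebook and frame values are assumptions made only for this example; the degree of fuzziness m = 1.25 follows the setting reported above.

import numpy as np

def fuzzy_code_frame(x, codebook, m=1.25):
    """Fuzzy VQ coding: memberships of frame x in every codevector, eq. (2.45)."""
    d2 = ((codebook - x) ** 2).sum(axis=1)          # squared Euclidean distances
    ratios = (d2[:, None] / d2[None, :]) ** (1.0 / (m - 1.0))
    return 1.0 / ratios.sum(axis=1)                 # the components sum to one

# Illustrative 4-codevector codebook in a 2-dimensional cepstral space.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x = np.array([0.2, 0.1])
o_t = fuzzy_code_frame(x, codebook)
print("fuzzy observation vector:", np.round(o_t, 3))
# A conventional VQ coder would instead emit only the index of the closest
# codevector, i.e. the argmax of o_t here.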
An extension of this approach has been proposed by Chou and Oh [1996] where a distribution normalisation dependent on the codevectors and a fuzzy contribution based on weighting and smoothing the codevectors by distance have been applied. Chapter 3 Fuzzy Entropy Models This chapter proposes a new fuzzy approach to speech and speaker recognition as well as to cluster analysis in pattern recognition. Models developed in this approach can be called fuzzy entropy models since they are based on a basic algorithm called fuzzy entropy clustering. The goal of this approach is not only to propose a new fuzzy method but also to show that statistical models, such as HMMs in the maximum likelihood scheme, can be viewed as fuzzy models, where probabilities of unobservable data given observable data are used as fuzzy membership functions. An introduction of fuzzy entropy clustering techniques is presented in Section 3.1. Relationships between clustering and modelling problems are shown in Section 3.2. Section 3.3 presents an optimisation criterion proposed as maximum fuzzy likelihood and formulates the fuzzy EM algorithm. Fuzzy entropy models for HMMs, GMMs and VQ are presented in the next Sections 3.4, 3.5 and 3.6, respectively. The noise clustering approach is also considered for these models. Section 3.7 presents a comparison between conventional models and fuzzy entropy models. 3.1 Fuzzy Entropy Clustering Let us consider the following function [Tran and Wagner 2000f] Hn (U, λ; X) = T C ∑ ∑ uit d2it + n T C ∑ ∑ uit log uit (3.1) i=1 t=1 i=1 t=1 where n > 0, λ is the model parameter set, dit is the distance between vector xt and cluster i, and U = [uit ] with uit being the membership of vector xt in cluster i. 46 3.1 Fuzzy Entropy Clustering 47 Assuming that the matrices U satisfy the following conditions C ∑ uit = 1 ∀t, 0< T ∑ uit < T ∀i (3.2) t=1 i=1 which mean that each xt belongs to C clusters, no cluster is empty and no cluster is all of X because of 2 ≤ C < T . We wish to show that minimising the function Hn (U, λ; X) on U yields solutions uit ∈ [0, 1], hence all constraints in (2.43) are satisfied. This means that the matrices U determine the fuzzy C-partition space for X and Hn (U, λ; X) is a fuzzy objective function. The first term on the right-hand side in (3.1) is the sum-of-squared-errors function J1 (U, λ; X) defined in (2.39) for hard C-means clustering. The second term is the negative of the following function E(U ) multiplied by n E(U ) = − T C ∑ ∑ uit log uit (3.3) i=1 t=1 The function E(U ) is maximum if uit = 1/C ∀i, and minimum if uit = 1 or 0. On the other hand, the function J1 (U, λ; X) needs to be minimised to obtain a good partition for X. Therefore, we can see that uit can take values in the interval [0, 1] if the function Hn (U, λ; X) is minimised over U . Indeed, with the assumption in (3.2), the Lagrangian Hn∗ (U, λ; X) is of the form Hn∗ (U, λ; X) = T C ∑ ∑ uit d2it + n T C ∑ ∑ uit log uit + i=1 t=1 i=1 t=1 C ∑ ki ( i=1 T ∑ uit − 1) (3.4) t=1 Hn∗ (U, λ; X) is minimised by setting its gradients with respect to U and the Lagrange multipliers {ki } to zero  2   dit + n(1 + log uit ) + ki = 0 C ∑  uit = 1 ∀t  ∀i, t (3.5) i=1 This is equivalent to 2 uit = Si e−dit /n , Si = e−1 − (ki /n) (3.6) Using the constraint in (3.5), we can compute Si and hence 2 uit = e−dit /n C ∑ k=1 2 e−dkt /n (3.7) 3.1 Fuzzy Entropy Clustering 48 From (3.7), it can be seen that 0 ≤ uit ≤ 1. Therefore, the matrices U determine a fuzzy C-partition space for X. 
In this case, the function E(U ) is called the fuzzy entropy function which has been considered by many authors. Clusters are considered as fuzzy sets and the fuzzy entropy function expresses the uncertainty of determining whether xt belongs to a given cluster or not. Measuring the degree of uncertainty of fuzzy sets themselves was first proposed by De Luca and Termini [1972]. In other words, the function E(U ) expresses the average degree of nonmembership of members in a fuzzy set [Li and Mukaidono 1999]. The function E(U ) was also considered by Hathaway [1986] for mixture distributions in relating the EM algorithm to clustering techniques. For the function Hn (U, λ; X) in (3.1), the function E(U ) is employed to “pull” memberships away from values equal to 0 or 1. Based on the above discussions, this clustering technique is called fuzzy entropy clustering [Tran and Wagner 2000f] to distinguish it from FCM clustering that has been reviewed in Chapter 2. In general, the task of fuzzy entropy clustering is to minimise the fuzzy objective function Hn (U, λ; X) over variables U and λ, namely, finding a pair of (U , λ) such that Hn (U , λ; X) ≤ Hn (U, λ; X). This task is implemented by an iteration of the two steps: 1) Finding U such that Hn (U , λ; X) ≤ Hn (U, λ; X), and 2) Finding λ such that Hn (U , λ; X) ≤ Hn (U , λ; X). U is obtained by using the solution in (3.7), which can be presented in a similar form to FCM clustering: uit = [ C ( ∑ 2 2 edit edjt / j=1 )1/n ]−1 (3.8) Since the function E(U ) in (3.1) is not dependent on dit , determining λ is performed by minimising the first term, that is the function J1 (U, λ; X). Thus the parameter estimation equations are identical to those in HCM clustering. For the Euclidean distance d2it = (xt − µi )2 , we obtain [Tran and Wagner 2000g] µi = T ∑ uit xt t=1 /∑ T uit (3.9) t=1 For the Mahalanobis distance d2it = (xt − µi )′ Σ−1 i (xt − µi ) , we obtain µi = T ∑ t=1 uit xt /∑ T t=1 uit , Σi = T ∑ t=1 uit (xt − µi )(xt − µi )′ /∑ T t=1 uit (3.10) 3.1 Fuzzy Entropy Clustering 49 The function Hn (U, λ; X) also has a physical interpretation. In statistical physics, Hn (U, λ; X) is known as free energy, the first term in Hn (U, λ; X) is the expected energy under U and the second one is the entropy of U [Jaynes 1957]. The expression of uit in (3.8) is of the form of the Boltzmann distribution exp(−ɛ/kB τ ), a special case of the Gibbs distribution, where ɛ is the energy, kB is the Boltzmann constant, and τ is the temperature. Based on this property, we can apply a simulated annealing method [Otten and Ginnenken 1989] to find a global minimum solution for λ (the way that liquids freeze and crystallise in thermodynamics) by decreasing the temperature τ , i.e. decreasing the value of n. The degree of fuzzy entropy n determines the partition of X. As n → ∞, we have uit → (1/C), each feature vector is equally assigned to C clusters, so we have only a single cluster. As n → 0, uit → 0 or 1, and the function Hn (U, λ; X) approaches J1 (U, λ; X), it can be said that FE clustering reduces to HCM clustering [Tran and Wagner 2000f]. Figure 3.1 illustrates the generation of clusters with different values of n. 
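As a small illustration of (3.7) and of the role of the degree of fuzzy entropy n, the sketch below computes the memberships of one feature vector in three clusters for several values of n. The squared distances and the number of clusters are invented for the example.

import numpy as np

def fe_memberships(d2, n):
    """Fuzzy entropy memberships from squared distances, eq. (3.7)/(3.8)."""
    e = np.exp(-(d2 - d2.min()) / n)     # shift by the minimum for numerical safety
    return e / e.sum()

# Squared distances of one feature vector to C = 3 clusters (illustrative).
d2 = np.array([1.0, 2.0, 6.0])
for n in (0.1, 1.0, 10.0, 1000.0):
    print(f"n = {n:7.1f} ->", np.round(fe_memberships(d2, n), 3))
# Small n approaches the hard 0/1 assignment of HCM; very large n approaches
# the uniform value 1/C, matching the limits discussed above.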
On the other hand, for data having well-separated clusters with Figure 3.1: Generating 3 clusters with different values of n: hard clustering as n → 0, clusters increase their overlap with increasing n > 0, and are identical to a single cluster as n → ∞ memberships converging to the values 0 or 1, the fuzzy entropy term approaches 0 due to 1 log 1 = 0 log 0 = 0, and the FE function itself reduces to the HCM function for all n > 0. 3.2 Modelling and Clustering Problems 3.2 50 Modelling and Clustering Problems To apply FE clustering to statistical modelling techniques, we need to determine relationships between the modelling and clustering problems. The first task for solving modelling and clustering problems is to establish an optimisation criterion known as an objective function. For modelling purposes, optimising the objective function is to find the right parametric form of the distributions. For clustering purposes, the optimisation is to find optimal partitions of data. Clustering is a geometric method, where considering data involves considering shapes and locations of clusters. In statistical modelling, data structure can be described by considering data density. Instead of finding clusters, we find high data density areas, and thus the consideration of data structure involves considering data distributions via the use of statistical distribution functions. A mixture of normal distributions, or Gaussians, is effective in the approximate description of a complicated distribution. Moreover, an advantage of statistical modelling is that it can effectively express the temporal structure of data through the use of a Markov process, a problem which is not addressed by clustering. It would be useful if we could take advantages of both methods in a single approach. In order to implement this, we first define a general distance dXY for clustering. It denotes a dissimilarity between observable data X and unobservable data (cluster, state) Y as a decreasing function of the distribution of X on component Y , given a model λ d2XY = − log P (X, Y |λ) (3.11) This distance is used to relate the clustering problem to the statistical modelling problem as well as the minimum distance rule to the maximum likelihood rule. Indeed, since minimising this distance leads to maximising the component distribution P (X, Y |λ), grouping similar data points into a cluster by the minimum distance rule thus becomes grouping these into a component distribution by the maximum likelihood rule. Clusters are now represented by component distribution functions and hence the characteristics of a cluster are not only its shape and location, but also the data density in the cluster and, possibly, the temporal structure of data if the Markov process is also applied [Tran and Wagner 2000a]. 3.3 Maximum Fuzzy Likelihood Estimation 3.3 51 Maximum Fuzzy Likelihood Estimation The distance defined in (3.11) is used to transform the clustering problem to a modelling problem. Using this distance, we can relate the FE function in (3.1) to the likelihood function. For example, we consider the case that feature vectors are assumed to be statistically independent. 
Using Jensen’s inequality [Ghahramani 1995] for the log-likelihood L(λ; X), we can show that L(λ; X) = log P (X|λ) = log T ∏ P (xt |λ) = ≥ ≥ T ∑ log C ∑ P (xt , i|λ) = t=1 i=1 C T ∑∑ t=1 i=1 C T ∑ ∑ T ∑ log t=1 uit log log P (xt |λ) t=1 t=1 = T ∑ P (xt , i|λ) uit uit log P (xt , i|λ) − C ∑ uit i=1 C T ∑ ∑ P (xt , i|λ) uit uit log uit (3.12) t=1 i=1 t=1 i=1 On the other hand, according to (3.11), replacing the distance d2it = − log P (xt , i|λ) (3.13) into the FE function in (3.1) and into the membership in (3.8), we obtain Hn (U, λ; X) = − and uit = T C ∑ ∑ uit log P (xt , i|λ) + n i=1 t=1 {∑ C T C ∑ ∑ [P (xt , i|λ)/P (xt , j|λ)]1/n j=1 uit log uit (3.14) i=1 t=1 }−1 (3.15) From (3.12), (3.14) and (3.15) we can show that and L(λ; X) ≥ −H1 (U, λ; X) (3.16) L(λ; X) = −H1 (U , λ; X) (3.17) where according to (3.15) we obtain U = {uit }, uit = P (i|xt , λ) as n = 1. The equality in (3.17) shows that, if we find λ such that H1 (U , λ; X) ≤ H1 (U , λ; X) then we will have L(λ; X) ≥ L(λ; X). It means that, as n = 1, minimising the FE function in (3.1) using the distance in (3.13) leads to maximising the likelihood function. Therefore, 3.4 Fuzzy Entropy Hidden Markov Models 52 we can define a function Ln (U, λ; X) as follows [Tran and Wagner 2000a] Ln (U, λ; X) = −Hn (U, λ; X) = C T ∑ ∑ uit log P (xt , i|λ) − n t=1 i=1 C T ∑ ∑ uit log uit (3.18) t=1 i=1 From the above consideration, Ln (U, λ; X) can be called the fuzzy likelihood function. Maximising this function (also minimising the FE function) is implemented by the fuzzy EM algorithm, which is different from the standard EM algorithm in the E-step. The fuzzy EM algorithm can be formulated as follows Algorithm 2 (The Fuzzy EM Algorithm) 1. Initialisation: Fix n and choose an initial estimate λ 2. Fuzzy E-step: Compute U and Ln (U , λ; X) 3. M-step: Use a certain optimisation method to determine λ , for which Ln (U , λ; X) is maximised 4. Termination: Set λ = λ and U = U , repeat from the E-step until the change of Ln (U, λ; X) falls below a preset threshold. 3.4 Fuzzy Entropy Hidden Markov Models This section is to apply the proposed fuzzy methods to the parameter estimation problem for the fuzzy entropy HMM (FE-HMM). The fuzzy EM algorithm can be viewed as a generalised Baum-Welch algorithm. FE-HMMs reduce to conventional HMMs as the degree of fuzzy entropy n = 1. 3.4.1 Fuzzy Membership Functions Fuzzy sets in the fuzzy HMM are determined in this section to compute the matrices U for the fuzzy EM algorithm in Section 3.3. In the conventional HMM, each observation ot is in each of N possible states at time t with a corresponding probability. In the fuzzy HMM, each observation ot is regarded as being in N possible states at time t with a corresponding degree of belonging known as the fuzzy membership function. A state at time t is thus considered as a time-dependent fuzzy set or fuzzy state st . 3.4 Fuzzy Entropy Hidden Markov Models 53 Fuzzy states s1 s2 . . . sT are also considered as a fuzzy state sequence S. There are N fuzzy states at each time t = 1, . . . , T , and a total of N T possible fuzzy state sequences in the fuzzy HMM. Figure 3.2 illustrates all fuzzy states as well as fuzzy state sequences in the HMM. Figure 3.2: States at each time t = 1, . . . , T are regarded as time-dependent fuzzy sets. There are N ×T fuzzy states connected by arrows into N T fuzzy state sequences in the fuzzy HMM. On the other hand, the observations are always considered in the sequence O and related to the state sequence S. 
Therefore we define the fuzzy membership function of the sequence O in fuzzy state sequence S based on the fuzzy membership function of the observation ot in the fuzzy state st . For example, the fuzzy membership ust =i (O) denotes the degree of belonging of the observation sequence O to fuzzy state sequences being in fuzzy state st = i at time t, where i = 1, . . . , N . For computing the state transition matrix A, we consider 2N fuzzy states at time t and time t + 1 included in 2N fuzzy state sequences and define the fuzzy membership function ust =i st+1 =j (O). This membership denotes the degree of belonging of the observation sequence O to fuzzy state sequences being in fuzzy state st = i at time t and fuzzy state st+1 = j at time t + 1, where i, j = 1, . . . , N . For simplicity, this membership can be rewritten as 3.4 Fuzzy Entropy Hidden Markov Models 54 uijt (O) or uijt . Figure 3.3 illustrates such fuzzy state sequences in the fuzzy HMM. Figure 3.3: The observation sequence O belongs to fuzzy state sequences being in fuzzy state i at time t and fuzzy state j at time t + 1. Similarly, we can determine fuzzy sets for computing the parameters {w, µ, Σ} in the fuzzy continuous HMM, where the observation sequence O is the vector sequence X. Fuzzy sets are fuzzy states and fuzzy mixtures at time t. The fuzzy membership function ust =j mt =k (X), or ujkt for simplicity, denotes the degree of belonging of the observation sequence X to fuzzy state j and fuzzy mixture k at time t as illustrated in Figure 3.4. 3.4.2 Fuzzy Entropy Discrete HMM From (3.18), the fuzzy likelihood function for the fuzzy entropy discrete HMM (FEDHMM) is proposed as follows [Tran and Wagner 2000a] Ln (U, λ; O) = − T∑ −1 ∑ ∑ t=0 st st+1 ust st+1 d2st st+1 − n T∑ −1 ∑ ∑ ust st+1 log ust st+1 (3.19) t=0 st st+1 where n > 0, ust st+1 = ust st+1 (O) and d2st st+1 = − log P (O, st , st+1 |λ). Note that πs1 is denoted by as0 s1 in (3.19) for simplicity. Assuming that we are in state i at time t 3.4 Fuzzy Entropy Hidden Markov Models 55 and state j at time t + 1, the function Ln (U, λ; O) can be rewritten as follows Ln (U, λ; O) = − N N ∑ T∑ −1 ∑ uijt d2ijt − n N N ∑ T∑ −1 ∑ uijt log uijt (3.20) t=0 i=1 j=1 t=0 i=1 j=1 where [ ] d2ijt = − log P (O, st = i, st+1 = j|λ) = − log αt (i)aij bj (ot+1 )βt+1 (j) (3.21) uijt = uijt (O) is the fuzzy membership function denoting the degree to which the observation sequence O belongs to the fuzzy state sequences being in state i at time t and state j at time t + 1. From the definition of the fuzzy membership (2.43), we obtain 0 ≤ uijt ≤ 1 N N ∑ ∑ ∀i, j, t, uijt = 1 i=1 j=1 The inequalities in 0 < state do not occur. ∑T t=1 ∀t, 0< T ∑ uiit < T ∀i t=1 (3.22) uiit < T mean that the state sequences having only one Fuzzy E-Step: Since maximising the function Ln (U, λ; O) on U is also minimising the corresponding function Hn (U, λ; O), the solution U in (3.8) is used. The distance Figure 3.4: The observation sequence X belongs to fuzzy state j and fuzzy mixture k at time t in the fuzzy continuous HMM. 3.4 Fuzzy Entropy Hidden Markov Models 56 is defined in (3.21). 
We obtain 2 uijt = e−dijt /n N N ∑ ∑ e [P (O, st = i, st+1 = j|λ)]1/n = N N ∑ ∑ −d2klt /n k=1 l=1 (3.23) 1/n [P (O, st = k, st+1 = l|λ)] k=1 l=1 M-step: Note that the second term of the function Ln (U, λ; O) is not dependent on λ, therefore maximising Ln (U , λ; O) over λ is equivalent to maximising the following function L∗n (U , λ; O) = − N N ∑ T∑ −1 ∑ uijt d2ijt (3.24) t=0 i=1 j=1 Replacing the distance (3.21) into (3.24), we can regroup the function in (3.24) into four terms as follows L∗n (U , λ; O) = N N (∑ ∑ i=1 ) K ( N ∑ ∑ T ∑ j=1 + uij0 log πj + j=1 k=1 + i=1 j=1 t=1 s.t. ot =vk −1 N T∑ N ∑ ∑ −1 N ( T∑ N ∑ ∑ N ∑ t=1 ) uijt log aij ) uijt log bj (k) i=1 uijt log [αt (i)βt+1 (j)] (3.25) i=1 j=1 t=1 where the last term including αt (i)βt+1 (j) can be ignored, since the forward-backward variables can be computed from π, A, B by the forward-backward algorithm (see Section 2.3.4). Maximising the function L∗n (U , λ; O) on π, A, B is performed by using Lagrange multipliers and the following constraints N ∑ πj = 1, N ∑ aij = 1, j=1 j=1 K ∑ bj (k) = 1 (3.26) k=1 We obtain the parameter reestimation equations as follows [Tran 1999] πj = N ∑ uij0 , i=1 aij = T∑ −1 uijt t=1 N T∑ −1 ∑ , uijt t=1 j=1 bj (k) = T ∑ t=1 s.t. ot =vk N ∑ uijt i=1 N T ∑ ∑ (3.27) uijt t=1 i=1 In the case of n = 1, the membership function in (3.23) becomes uijt = P (O, st = i, st+1 = j|λ) N N ∑ ∑ k=1 l=1 P (O, st = k, st+1 = l|λ) = P (st = i, st+1 = j|O, λ) = ξt (i, j) (3.28) 3.4 Fuzzy Entropy Hidden Markov Models 57 where ξt (i, j) is defined in (2.20). The parameter reestimation equations in (3.27) are now identical to those obtained by the Baum-Welch algorithm in Section 2.3.4. 3.4.3 Fuzzy Entropy Continuous HMM Similarly, ujkt = ujkt (X) is defined as the fuzzy membership function denoting the degree to which the vector sequence X belongs to fuzzy state st = i and fuzzy Gaussian mixture mt = k at time t, satisfying 0 ≤ ujkt ≤ 1 N ∑ K ∑ ∀j, k, t, ujkt = 1 ∀t, 0< T ∑ ujkt < T ∀j, k t=1 j=1 k=1 (3.29) and the distance djkt is of the form d2jkt = − log P (X, st = j, mt = k|λ) = − log [∑ N αt−1 (i)aij wjk N (xt , µjk , Σjk )βt (j) i=1 ] (3.30) We obtain the fuzzy EM algorithm for the fuzzy entropy continuous HMM (FECHMM) as follows [Tran and Wagner 2000a] Fuzzy E-Step: 2 e−djkt /n ujkt = N ∑ K ∑ = −d2ilt /n e i=1 l=1 [P (X, st = j, mt = k|λ)]1/n N ∑ K ∑ [P (X, st = i, mt = l|λ)] (3.31) 1/n i=1 l=1 M-step: Similar to the continuous HMM, the parameter estimation equations for the π and A distributions are unchanged, but the output distribution B is estimated via Gaussian mixture parameters (w, µ, Σ) as follows wjk = T ∑ ujkt t=1 K T ∑ ∑ , ujkt t=1 k=1 µjk = T ∑ ujkt xt t=1 T ∑ , ujkt t=1 Σjk = T ∑ ujkt (xt − µjk )(xt − µjk )′ t=1 T ∑ ujkt t=1 (3.32) In the case of n = 1, the membership function in (3.31) becomes ujkt = P (X, st = j, mt = k|λ) N N ∑ ∑ i=1 l=1 P (X, st = i, mt = l|λ) = P (st = j, mt = k|X, λ) = ηt (j, k) (3.33) 3.4 Fuzzy Entropy Hidden Markov Models 58 where ηt (j, k) is defined in (2.22). The parameter reestimation equations in (3.32) are now identical to those obtained by the Baum-Welch algorithm in Section 2.3.4. 3.4.4 Noise Clustering Approach The speech signal is influenced by the speaking environment, the transmission channel, and the transducer used to capture the signal. So there exist some bad observations regarded as outliers, which influence speech recognition performance. 
For the fuzzy entropy HMM in the noise clustering approach (NC-FE-HMM), a separate state is used to represent outliers and is termed the garbage state [Tran and Wagner 1999a]. This state has a constant distance δ from all observation sequences. The membership u•t of an observation sequence O at time t in the garbage state is defined to be u•t = 1 − N N ∑ ∑ uijt 1≤t≤T (3.34) i=1 j=1 Therefore, the membership constraint for the “good” states is effectively relaxed to N N ∑ ∑ uijt < 1 1≤t≤T (3.35) i=1 j=1 This allows noisy data and outliers to have arbitrarily small membership values in good states. The fuzzy likelihood function for the FE-DHMM in the NC approach (NC-FE-DHMM) is as follows Ln (U, λ; O) = − N ∑ T∑ −1 ∑ N uijt d2ijt − n − uijt log uijt t=0 i=1 j=1 t=0 i=1 j=1 T∑ −1 N N ∑ T∑ −1 ∑ 2 u•t δ − n T∑ −1 u•t log u•t (3.36) t=0 t=0 Replacing u•t in (3.34) into (3.36) and maximising the fuzzy likelihood function over U , we obtain [Tran and Wagner 2000a] Fuzzy E-Step: uijt = 1 N N ∑ ∑ 2 2 2 2 (edijt /edklt )1/n + (edijt /eδ )1/n k=1 l=1 = N N ∑ ∑ k=1 l=1 [P (O, st = i, st+1 = j|λ)]1/n [P (O, st = k, st+1 = l|λ)] 1/n (3.37) −δ 2 /n +e 3.5 Fuzzy Entropy Gaussian Mixture Models 59 M-Step: The M-step is identical to the M-step of the FE-DHMM in Section 3.4.2. where the distance dijt is computed as in (3.21). The second term in the denominator of (3.37) becomes quite large for outliers, resulting in small membership values in all the good states for outliers. The advantage of this approach is that it forms a more robust version of the fuzzy entropy algorithm and can be used instead of the fuzzy entropy algorithm provided a suitable value for the constant distance δ can be found. Similarly, the FE-CHMM in the NC approach (NC-FE-CHMM) is as follows Fuzzy E-Step: ujkt = 1 N ∑ K ∑ 2 2 2 2 (edjkt /edilt )1/n + (edjkt /eδ )1/n i=1 l=1 = K N ∑ ∑ [P (X, st = j, mt = k|λ)]1/n 1/n [P (X, st = i, mt = l|λ)] +e (3.38) −δ 2 /n i=1 l=1 M-Step: The M-step is identical to the M-step of the FE-CHMM in Section 3.4.3. 3.5 Fuzzy Entropy Gaussian Mixture Models Although we can obtain equations for the fuzzy entropy GMM (FE-GMM) from the FE-CHMM with the number of states set to N = 1, for practical applications, they are summarised as follows. 3.5.1 Fuzzy Entropy GMM For a training vector sequence X = (x1 x2 . . . xT ), the fuzzy likelihood of the FE-GMM is Ln (U, λ; X) = − T K ∑ ∑ i=1 t=1 uit d2it − n T K ∑ ∑ uit log uit (3.39) i=1 t=1 where d2it = − log P (xt , i|λ) = − log [wi N (xt , µi , Σi )] (3.40) 3.5 Fuzzy Entropy Gaussian Mixture Models 60 and uit = ui (xt ) is the fuzzy membership function denoting the degree to which feature vector xt belongs to fuzzy Gaussian mixture i, satisfying 0 ≤ uit ≤ 1 K ∑ ∀i, t, uit = 1 ∀t, 0< T ∑ uit < T ∀i (3.41) t=1 i=1 Fuzzy E-Step: Maximising the fuzzy likelihood function on U gives [P (xt , i|λ)]1/n 2 uit = e−dit /n K ∑ e = −d2kt /n k=1 K ∑ (3.42) 1/n [P (xt , k|λ)] k=1 M-Step: Maximising the fuzzy likelihood function on λ gives wi = T ∑ uit t=1 K T ∑∑ µi = , ukt t=1 k=1 T ∑ uit xt t=1 T ∑ Σi = , T ∑ uit (xt − µi )(xt − µi )′ t=1 uit t=1 Again if n = 1 we obtain uit = uit t=1 P (xt , i|λ) K ∑ (3.43) T ∑ = P (i|xt , λ) (3.44) P (xt , k|λ) k=1 and the FE-GMM reduces to the conventional GMM. 
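A minimal sketch of the FE-GMM reestimation may make the fuzzy EM iteration above concrete. The fragment below uses my own naming and, purely for brevity, a one-dimensional scalar-variance Gaussian instead of the full covariance matrices used in the thesis; it implements the fuzzy E-step (3.42) in the log domain and the M-step (3.43), and with n = 1 it is ordinary EM for a GMM, in line with (3.44).

```python
import numpy as np

def log_gauss(x, mu, var):
    """Log-density of a one-dimensional normal distribution (scalar-variance case)."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var)

def fe_gmm(x, K, n, iters=100, seed=0):
    """Fuzzy entropy GMM: exponent 1/n on the joint likelihood P(x_t, i | lambda)."""
    rng = np.random.default_rng(seed)
    w = np.full(K, 1.0 / K)
    mu = rng.choice(x, K, replace=False)
    var = np.full(K, x.var())
    for _ in range(iters):
        logp = np.log(w)[:, None] + log_gauss(x[None, :], mu[:, None], var[:, None])
        z = (logp - logp.max(axis=0, keepdims=True)) / n      # fuzzy E-step, eq. (3.42)
        U = np.exp(z)
        U /= U.sum(axis=0, keepdims=True)
        w = U.sum(axis=1) / U.sum()                           # M-step, eq. (3.43)
        mu = (U * x).sum(axis=1) / U.sum(axis=1)
        var = (U * (x - mu[:, None]) ** 2).sum(axis=1) / U.sum(axis=1)
    return w, mu, var

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-3.0, 1.0, 300), rng.normal(3.0, 1.0, 300)])

# n = 1 reproduces the usual posteriors P(i | x_t, lambda), i.e. standard EM;
# n < 1 sharpens the memberships, n > 1 smooths them towards 1/K
for n in (0.7, 1.0, 1.5):
    w, mu, var = fe_gmm(x, K=2, n=n)
    print(n, np.round(np.sort(mu), 2))
```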
3.5.2 Noise Clustering Approach The fuzzy likelihood function for the FE-GMM in the NC approach (NC-FE-GMM) is as follows Ln (U, λ; X) = − − K T ∑ ∑ t=1 i=1 T ∑ uit d2it and λ gives ∑K i=1 uit log uit t=1 i=1 2 u•t δ − n t=1 where u•t = 1 − −n K T∑ −1 ∑ T ∑ u•t log u•t (3.45) t=1 uit , 1 ≤ t ≤ T . Maximising the fuzzy likelihood function on U Fuzzy E-Step: [P (xt , i|λ)]1/n 2 uit = e−dit /n K ∑ k=1 e −d2kt /n +e = −δ 2 /n K ∑ 1/n [P (xt , k|λ)] (3.46) −δ 2 /n +e k=1 M-Step: the M-step is identical to the M-step of the FE-GMM in Section 3.5.1. 3.6 Fuzzy Entropy Vector Quantisation 3.6 61 Fuzzy Entropy Vector Quantisation Reestimation algorithms for fuzzy entropy VQ (FE-VQ) are derived from the algorithms for FE-GMMs by using the Euclidean or the Mahalanobis distances. We obtain the following [Tran and Wagner 2000g, Tran and Wagner 2000h]: Fuzzy E-Step: Choose one of the following cases • FE-VQ: uit = 1 ) d2 − d2kt exp it n k=1 K ∑ (3.47) ( • FE-VQ with noise clusters (NC-FE-VQ): uit = 1 K ∑ exp k=1 ( d2it − n ) d2kt d2 − δ 2 + exp it n ( ) (3.48) M-Step: Choose one of the following cases • Using the Euclidean distance: d2it = (xt − µi )2 µi = T ∑ uit xt t=1 T ∑ (3.49) uit t=1 • Using the Mahalanobis distance: d2it = (xt − µi )′ Σ−1 i (xt − µi ) µi = T ∑ uit xt t=1 T ∑ , uit t=1 3.7 Σi = T ∑ uit (xt − µi )(xt − µi )′ t=1 T ∑ (3.50) uit t=1 A Comparison Between Conventional and Fuzzy Entropy Models It may be useful for applications to consider the differences between the conventional models and FE models. In general, the difference is mainly the conventional Estep and the fuzzy E-step in the parameter reestimation procedure. A weighting 3.7 A Comparison Between Conventional and Fuzzy Entropy Models 62 exponent 1/n on each joint probability P (X, Y |λ) between observable data X and unobservable data Y is introduced in the FE models. If n = 1, the FE models reduce to the conventional models, as shown in Figure 3.5 [ ]1/n P (X, Y |λ) uY (X) = ∑ [ ]1/n P (X, Y |λ) n=1 ✲ P (X, Y |λ) = P (Y |X, λ) uY (X) = ∑ P (X, Y |λ) Y Y Fuzzy Entropy Models Conventional Models Figure 3.5: From fuzzy entropy models to conventional models The role of the degree of fuzzy entropy n can be considered in depth via its influence on the parameter reestimation equations. Without loss of generality, let us consider a problem of GMMs. Given a vector xt , let us assume that the cluster i among K clusters has the highest component density P (xt , i|λ), i.e. the shortest distance d2it = − log P (xt , i|λ). Consider the membership uit of the fuzzy entropy GMM with n > 1. From (3.42), it can be rewritten as 2 uit = e−dit /n K ∑ k=1 = −d2kt /n e = [P (xt , i|λ)]1/n K ∑ [P (xt , k|λ)]1/n k=1 [P (xt , i|λ)]1/n [P (xt , i|λ)]1/n + K ∑ k=1 k6=i [P (xt , k|λ)]1/n = 1+ K ∑ k=1 k6=i [ 1 (3.51) ] P (xt , k|λ) 1/n P (xt , i|λ) For all k = 1, . . . , K and k 6= i, we obtain the following equivalent inequalities: ⇔ ⇔ P (xt , k|λ) < P (xt , k|λ) P (xt , k|λ) < 1 P (xt , i|λ) [ ] P (xt , k|λ) P (xt , k|λ) 1/n < P (xt , i|λ) P (xt , i|λ) since n > 1 3.7 A Comparison Between Conventional and Fuzzy Entropy Models ⇔ 1+ K ∑ k=1 k6=i [ 63 1 1 ]1/n < K ∑ P (xt , k|λ) P (xt , k|λ) 1+ P (xt , i|λ) k=1 P (xt , i|λ) k6=i ⇔ uit < P (i|xt , λ) (3.52) The same can be shown easily for the remaining cases. 
In general, we obtain • P (xt , i|λ) ≥ P (xt , k|λ) ∀k 6= i : xt is closest to cluster i uit  > P (i|xt , λ)       0<n<1 = P (i|xt , λ) n=1 < P (i|xt , λ) n>1 (3.53) • P (xt , i|λ) ≤ P (xt , k|λ) ∀k 6= i : xt is furthest from cluster i uit  < P (i|xt , λ)       0<n<1 = P (i|xt , λ) n=1 > P (i|xt , λ) n>1 (3.54) As discussed above, the distance between vector xt and cluster i is a monotonically decreasing function of the joint probability P (xt , i|λ) (see (3.40)), therefore we can have an interpretation for the parameter n. If 0 < n < 1, comparing with the a posteriori probability P (i|xt , λ) in the GMM, the degree of belonging uit of vector xt to cluster i is higher than P (i|xt , λ) if xt is close to cluster i and is lower than P (i|xt , λ) if xt is far from cluster i. The reverse result is obtained for n > 1. Since model parameters λ = {w, µ, Σ} are determined by memberships (see (3.43) in the M-step), we can expect that the use of the parameter n will yield a better parametric form for the GMMs. As discussed in Section 2.3.2, when the amount of the training data is insufficient, the quality of the distribution parameter estimates cannot be guaranteed. In reality, this problem often occurs. Therefore fuzzy entropy models may enhance the above quality by their adjustable parameter n. If we wish to decrease the influence of vectors far from cluster center, we reduce the value of n to less than 1. Inversely, values of n greater than 1 increase the influence of those vectors. In general, there does not exist a best value of n in all cases. For different applications and data, suitable values of n may be different. The membership function with different values of n versus the distance between vector xt and cluster i is shown in Figure 3.6. The limit of n = 0 represents hard models, which will be presented in Chapter 5. 3.8 Summary and Conclusion 64 Figure 3.6: The membership function uit with different values of the degree of fuzzy entropy n versus the distance dit between vector xt and cluster i For example, consider the case of 4 clusters. Given P (xt , i|λ), i = 1, 2, 3, 4, we compute uit with n = 1 for the GMM, and with n = 0.5 and n = 2 for the FE-GMM. Table 3.1 shows these values for comparison. Cluster i i=1 Given P (xt , i|λ) uit = P (i|xt , λ) for GMM i=2 i=3 i=4 0.0016 0.0025 0.0036 0.0049 (n = 1.0) 0.13 0.20 0.28 0.39 uit for fuzzy entropy GMM (n = 0.5) 0.06 0.14 0.28 0.52 uit for fuzzy entropy GMM (n = 2.0) 0.18 0.23 0.27 0.32 Table 3.1: An example of memberships for the GMM and the FE-GMM 3.8 Summary and Conclusion Fuzzy entropy models have been presented in this chapter. Relationships between fuzzy entropy models are summarised in Figure 3.7 below. A parameter is introduced for the degree of fuzzy entropy n > 0. With n → 0, we obtain hard models, which will be presented in Chapter 5. With n → ∞, we obtain maximally fuzzy entropy 3.8 Summary and Conclusion 65 models, equivalent to only a single state or a single cluster. With n = 1, fuzzy entropy models reduce to conventional models in the maximum likelihood scheme. This result shows that the statistical models can be viewed as special cases of fuzzy models. An advantage obtained from this viewpoint is that we can get ideas from fuzzy methods and apply them to statistical models. For example, by letting n = 1 in the noise clustering approach, we obtain new models for HMMs, GMMs and VQ without any fuzzy parameters. Moreover, the adjustibility of the degree of fuzzy entropy n in FE models is also an advantage. 
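The effect of n on the memberships can be reproduced directly from the joint likelihoods. The short check below recomputes the memberships of Table 3.1 for n = 1, 0.5 and 2 from the given values of P(xt, i|λ); the only differences from the table are ordinary rounding (e.g. 0.29 here versus 0.28 in the table for cluster 3 at n = 1).

```python
import numpy as np

p = np.array([0.0016, 0.0025, 0.0036, 0.0049])   # P(x_t, i | lambda) for i = 1..4, from Table 3.1

for n in (1.0, 0.5, 2.0):
    u = p ** (1.0 / n)                            # eq. (3.42) applied to the joint likelihoods
    u /= u.sum()
    print(n, np.round(u, 2))

# n = 1.0 -> [0.13 0.2  0.29 0.39]   the GMM posteriors P(i | x_t, lambda)
# n = 0.5 -> [0.06 0.14 0.28 0.52]   memberships sharpened towards the closest cluster
# n = 2.0 -> [0.18 0.23 0.27 0.32]   memberships smoothed towards 1/C
```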
When conventional models do not work well in some cases due to the insufficiency of the training data or the complexity of the speech data, such as the nine English E-set words, a suitable value of n can be found to obtain better models. Experimental results for these models will be reported in Chapter 7. FE Models ❄ FE Discrete Models ❄ FE DHMM ❄ FE Continuous Models ❄ NC-FE DHMM ❄ NC CHMM ❄ NC-FE CHMM ❄ FE GMM ❄ NC-FE GMM ❄ FE VQ ❄ NC-FE VQ Figure 3.7: Fuzzy entropy models for speech and speaker recognition Chapter 4 Fuzzy C-Means Models This chapter proposes a fuzzy approach based on fuzzy C-means (FCM) clustering to speech and speaker recognition. Models in this approach can be called FCM models and are estimated by the minimum fuzzy squared-error criterion used in FCM clustering. This criterion is different from the maximum likelihood criterion, hence FCM models cannot reduce to statistical models. However, the parameter estimation procedure of FCM models is similar to that of FE models, where distances are defined in the same way. In this chapter, the fuzzy EM algorithm is reformulated for the minimum fuzzy squarederror criterion in Section 4.1. FCM models for HMMs, GMMs and VQ are presented in the next Sections 4.2, 4.3 and 4.4, respectively. The noise clustering approach is also considered for these models. A discussion on the role of fuzzy memberships of FCM models and a comparison between FCM models and FE models are presented in Section 4.5. 4.1 Minimum Fuzzy Squared-Error Estimation The fuzzy squared-error function in (2.44) page 38 is used as an optimisation criterion to estimate FCM models. For convenience, we reintroduce this function as follows Jm (U, λ; X) = T C ∑ ∑ 2 um it dit (4.1) i=1 t=1 where U = {uit } is a fuzzy C-partition of X, m > 1 is a weighting exponent on each fuzzy membership uit and is called the degree of fuzziness, λ and dit are defined 66 4.2 Fuzzy C-Means Hidden Markov Models 67 for particular models. Minimising the function Jm (U, λ; X) over the variables U and λ is implemented by an iteration of the two steps: 1) Finding U such that Jm (U , λ; X) ≤ Jm (U, λ; X), and 2) Finding λ such that Jm (U , λ; X) ≤ Jm (U , λ; X). The fuzzy EM algorithm on page 52 is reformulated for FCM models as follows [Tran and Wagner 1999b] Algorithm 3 (The Fuzzy EM Algorithm) 1. Initialisation: Fix m and choose an initial estimate λ 2. Fuzzy E-step: Compute U and Jm (U , λ; X) 3. M-step: Use a certain minimisation method to determine λ , for which Jm (U , λ; X) is minimised 4. Termination: Set λ = λ and U = U , repeat from the E-step until the change of Jm (U, λ; X) falls below a preset threshold. 4.2 Fuzzy C-Means Hidden Markov Models This section proposes a parameter estimation procedure for the HMM using the minimum fuzzy squared-error estimation. Noise clustering and possibilistic C-means approaches are also considered for both discrete and continuous HMMs. 4.2.1 FCM Discrete HMM From (4.1), the fuzzy squared-error function for the FCM discrete HMM (FCMDHMM) is proposed as follows [Tran and Wagner 1999c] Jm (U, λ; O) = T∑ −1 ∑ ∑ t=0 st st+1 2 um st st+1 dst st+1 (4.2) where m > 1, ust st+1 = ust st+1 (O) and d2st st+1 = − log P (O, st , st+1 |λ). Note that πs1 is denoted by as0 s1 in (4.2) for simplicity. 
Assuming that we are in state i at time t and state j at time t + 1, the function Jm (U, λ; O) can be rewritten as follows Jm (U, λ; O) = N N ∑ T∑ −1 ∑ t=0 i=1 j=1 2 um ijt dijt (4.3) 4.2 Fuzzy C-Means Hidden Markov Models 68 where uijt and dijt are defined in (3.22) and (3.21), respectively. Minimising the function Jm (U, λ; O) over U and λ gives the following fuzzy EM algorithm for the FCM-DHMM Fuzzy E-Step: Minimising the function Jm (U, λ; O) in (4.2) over U gives uijt = [ N ( N ∑ ∑ d2ijt k=1 l=1 / d2klt )1/(m−1) ]−1 (4.4) M-step: Replacing the distance (3.21) into (4.3), we can regroup the function in (4.3) into four terms as follows Jm (U , λ; O) = − N N (∑ ∑ i=1 j=1 − K N ∑ ∑ j=1 k=1 − ) um ij0 log πj − ( i=1 j=1 T ∑ t=1 s.t. ot =vk −1 N T∑ N ∑ ∑ −1 N ( T∑ N ∑ ∑ N ∑ ) um ijt log aij t=1 ) um ijt log bj (k) i=1 um ijt log [αt (i)βt+1 (j)] (4.5) i=1 j=1 t=1 where, similar to the FE-DHMM, the last term including αt (i)βt+1 (j) is ignored, since the forward-backward variables can be computed from π, A, B by the forwardbackward algorithm (see Section 2.3.4). Minimising the function Jm (U , λ; O) on π, A, B is performed by using Lagrange multipliers and the same constraints as in (3.26) [Tran and Wagner 1999c]. We obtain the parameter reestimation equations as follows πj = N ∑ um ij0 , aij = i=1 T∑ −1 um ijt t=1 N T∑ −1 ∑ , bj (k) = um ijt t=1 j=1 4.2.2 T ∑ N ∑ um ijt t=1 i=1 s.t. ot =vk N T ∑ ∑ um ijt t=1 i=1 (4.6) FCM Continuous HMM Similarly, using the fuzzy membership function ujkt and the distance djkt in (3.29) and (3.30), respectively, we obtain the fuzzy EM algorithm for the FCM continuous HMM (FCM-CHMM) as follows [Tran and Wagner 1999a] Fuzzy E-Step: ujkt = [ K ( N ∑ ∑ i=1 l=1 d2jkt / d2ilt )1/(m−1) ]−1 (4.7) 4.2 Fuzzy C-Means Hidden Markov Models 69 M-step: Similar to the FE-CHMM, the parameter estimation equations for the π and A distributions are unchanged, but the output distribution B is estimated via Gaussian mixture parameters (w, µ, Σ) as follows wjk = T ∑ um jkt t=1 K T ∑∑ , µjk = um jkt t=1 k=1 4.2.3 T ∑ um jkt xt t=1 T ∑ , Σjk = T ∑ ′ um jkt (xt − µjk )(xt − µjk ) t=1 T ∑ um jkt t=1 um jkt t=1 (4.8) Noise Clustering Approach The concept of the garbage state in Section 3.4.4, page 58 is applied to the FCMDHMM. The fuzzy objective function for the FCM-DHMM in the NC approach (NCFCM-DHMM) is as follows [Tran and Wagner 1999a] Jm (U, λ; O) = N ∑ N T∑ −1 ∑ 2 um ijt dijt + T∑ −1 t=0 t=0 i=1 j=1 (1 − N ∑ N ∑ uijt )m δ 2 (4.9) i=1 j=1 The fuzzy EM algorithm for the NC-FCM-CHMM is as follows Fuzzy E-Step: uijt = 1 N N ∑ ∑ (d2ijt /d2klt )1/(m−1) (4.10) + (d2ijt /δ 2 )1/(m−1) k=1 l=1 where dijt is defined in (3.21). The second term in the denominator of (4.10) becomes quite large for outliers, resulting in small membership values in all the good states for outliers. M-Step: identical to the M-step of the FCM-DHMM in Section 4.2.1. Similarly, the FCM-CHMM in the NC approach (NC-FCM-CHMM) is as follows Fuzzy E-Step: ujkt = 1 K N ∑ ∑ (d2jkt /d2ilt )1/(m−1) (4.11) + (d2jkt /δ 2 )1/(m−1) i=1 l=1 where djkt is defined in (3.30). M-Step: identical to the M-step of the FCM-CHMM in Section 4.2.2. 4.3 Fuzzy C-Means Gaussian Mixture Models 4.3 70 Fuzzy C-Means Gaussian Mixture Models Similar to FE-GMMs, FCM Gaussian mixture models (FCM-GMMs) are summarised as follows. 4.3.1 Fuzzy C-Means GMM For a training vector sequence X = (x1 x2 . . . xT ), the fuzzy objective function of the fuzzy C-means GMM (FCM-GMM) is [Tran et al. 
1998a] Jm (U, λ; X) = T K ∑ ∑ 2 um it dit (4.12) i=1 t=1 where d2it = − log P (xt , i|λ) = − log [wi N (xt , µi , Σi )] (4.13) and uit = ui (xt ) is the fuzzy membership function denoting the degree to which feature vector xt belongs to Gaussian distribution i, satisfying 0 ≤ uit ≤ 1 K ∑ ∀i, t, uit = 1 ∀t, 0< T ∑ uit < T ∀i (4.14) t=1 i=1 The fuzzy EM algorithm for the FCM-GMM is as follows [Tran and Wagner 1998] Fuzzy E-Step: Minimising the fuzzy objective function over U gives uit = [ K ( ∑ d2it k=1 / d2kt )1/(m−1) ]−1 (4.15) M-Step: Minimising the fuzzy objective function over λ gives wi = T ∑ um it t=1 K T ∑ ∑ t=1 k=1 4.3.2 , um kt µi = T ∑ um it xt t=1 T ∑ , Σi = T ∑ ′ um it (xt − µi )(xt − µi ) t=1 T ∑ um it t=1 (4.16) um it t=1 Noise Clustering Approach The fuzzy objective function for the FCM-GMM in the NC approach (NC-FCMGMM) is as follows [Tran and Wagner 1999e] Jm (U, λ; X) = K T ∑ ∑ t=1 i=1 uit d2it + T ∑ t=1 (1 − K ∑ i=1 uit )m δ 2 (4.17) 4.4 Fuzzy C-Means Vector Quantisation 71 Minimising the fuzzy objective function on U and λ gives Fuzzy E-Step: uit = 1 K ∑ 1/(m−1) (d2it /d2kt ) (4.18) 2 1/(m−1) + (d2it /δ ) k=1 M-Step: identical to the M-step of the FCM-GMM in Section 4.3.1. 4.4 Fuzzy C-Means Vector Quantisation The reestimation algorithms for fuzzy C-means VQ (FCM-VQ) are identical to the FCM algorithms reviewed in Section 2.4.6, page 37. 4.5 Comparison Between FCM and FE Models We have considered three kinds of models: 1) Conventional models in Chapter 2, 2) FE models in Chapter 3, and 3) FCM models in this chapter. The relationship between conventional and FE models has been discussed in Section 3.7 page 61 and can be summarised in Figure 4.1. Conventional models (HMMs, GMMs and VQ) are considered as a special group with a value of the degree of fuzzy entropy n = 1 within the infinite family of FE model groups with values of n in the range (0, ∞). 0 • ✻ FE Models Hard Models 1 • ✻ FE Models n ✲ Conventional Models Figure 4.1: The relationship between FE model groups versus the degree of fuzzy entropy n Such a relationship is not available for conventional models and FCM models. This means that no suitable value of the degree of fuzziness m in (1, ∞) can be set for FCM models to reduce to the conventional models. However, a similar relationship can be established between the group well-known in pattern recognition with m = 2 4.5 Comparison Between FCM and FE Models 1 • ✻ FCM Models Hard Models 2 • ✻ FCM Models 72 m ✲ Typical Models Figure 4.2: The relationship between FCM model groups versus the degree of fuzziness m and other FCM model groups with m > 1 and m 6= 2. Figure 4.2 shows the similar relationship between FCM models to that in Figure 4.1. Therefore we discuss this relationship before comparing FCM and FE models. Without loss of generality, let us consider the following problem for GMMs. Given a vector xt , we assume that xt is closest to the cluster i among K clusters. Consider the membership uit of the FCM-GMM. From (4.15), it can be rewritten as uit = 1 K ∑ k=1 ( ) d2it 1/(m−1) d2kt = 1+ K ∑ k=1 k6=i ( 1 ) d2it 1/(m−1) d2kt (4.19) Writing u∗it for the membership if m = 2, we obtain u∗it /[ =1 1+ ] K ( 2 ) ∑ dit k=1 k6=i d2kt (4.20) 1 d2 < 1. From the above assumption for xt , we obtain 2it < 1 m−1 dkt ∀k = 6 i. Since the function ax decreases for a < 1, we can show that As m > 2, we have uit < u∗it (4.21) It can be easily shown for the remaining cases. 
In general, we obtain • d2it ≤ d2kt ∀k 6= i : xt is closest to cluster i uit • d2it ≥ d2kt  > u∗it    =    < u∗it u∗it 1<m<2 m=2 (4.22) m>2 ∀k 6= i : xt is furthest from cluster i uit  < u∗it       1<m<2 = u∗it m=2 u∗it m>2 > (4.23) 4.5 Comparison Between FCM and FE Models 73 Comparing with the typical FCM model with m = 2, if we wish to decrease the influence of vectors far from the cluster center, we reduce the value of m to less than 2. Inversely, values of m > 2 increase the influence of those vectors. The membership function with different values of m versus the distance between vector xt and cluster i is demonstrated in Figure 4.3, which is quite similar to that demonstrated in Figure 3.3 page 64 for FE models. The limit of m = 1 represents hard models, which will be presented in Chapter 5. Figure 4.3: The FCM membership function uit with different values of the degree of fuzziness m versus the distance dit between vector xt and cluster i It is known that model parameters are estimated by the memberships in the Mstep of reestimation algorithms, therefore selecting a suitable value of the degree of fuzziness m is necessary to obtain optimum model parameter estimates. This can be a solution for the insufficient training data problem. In this case, the quality of the model parameter estimates trained by conventional methods cannot be guaranteed and hence FCM models with the adjustable parameter m can be employed to find better estimates. Although FE models also have the advantage of the adjustable parameter n, the membership functions of FE and FCM models are different because of the employed different optimisation criteria. We compare the expressions of FE 4.5 Comparison Between FCM and FE Models 74 and FCM memberships taken from (3.42) and (4.15) as follows −d2it /n FE: uit = e K ∑ FCM: e −d2kt /n k=1 uit = ( 1 d2it )1/(m−1) ) K ( ∑ 1 1/(m−1) k=1 (4.24) d2kt For simplicity, we consider the typical cases where n = 1 and m = 2. It can be seen that the FE membership employs the function f (x) = e−x whereas the FCM membership employs the function f (x) = 1/x, where x = d2it . Figure 4.4 shows curves representing these functions. Figure 4.4: Curves representing the functions used in the FE and FCM memberships, where x = d2it , m = 2 and n = 1 We can see that the change of the FCM membership is more rapid than the one of the FE membership for short distances (0 < x < 1). Applied to cluster analysis, this means that the FCM memberships of feature vectors close to the cluster center can have very different values even if these vectors are close together, whereas their FE memberships are not very different. On the other hand, in the M-step, FE and FCM model parameters are estimated by uit and um it , respectively. For long distances, i.e. for feature vectors very far from the cluster center, the difference between uit for the FE membership and um it for the FCM membership is not significant. Indeed, although Figure 4.4 shows the FCM membership value is much greater than the FE membership 4.6 Summary and Conclusion 75 value for long distances, for estimating the model parameters in the M-step, the FCM membership value is reduced due to the weighting exponent m (um it < uit as m > 1 and 0 ≥ uit ≥ 1). 4.6 Summary and Conclusion Fuzzy C-means models have been proposed in this chapter. A parameter is introduced as the degree of fuzziness m > 1. With m → 1, we obtain hard models, which will be presented in Chapter 5. 
With m → ∞, we obtain maximally fuzzy models with only a single state or a single cluster. Typical models with m = 2 are well known in pattern recognition. The main differences between FE models and FCM models are the optimisation criteria and the reestimation of the fuzzy membership functions. The advantage of the FE and FCM models is that they have adjustable parameters m and n, which may be useful for finding optimum models for solving the insufficient training data problem. Relationships between the FCM models are summarised in Figure 4.5 below. Experimental results for these models will be reported in Chapter 6. FCM Models ❄ FCM Discrete Models ❄ FCM DHMM ❄ NC-FCM DHMM ❄ FCM Continuous Models ❄ NC-FCM CHMM ❄ NC-FCM CHMM ❄ FCM GMM ❄ NC-FCM GMM ❄ FCM VQ ❄ NC-FCM VQ Figure 4.5: Fuzzy C-means models for speech and speaker recognition Chapter 5 Hard Models As the degrees of fuzzy entropy and fuzziness tend to their minimum values, both the fuzzy entropy and the fuzzy C-means models approach the hard model. For fuzzy and conventional models, the model structures are not fundamentally different, except for the use of fuzzy optimisation criteria and the fuzzy membership. However, a different model structure applies to the hard models because of the binary (zero-one) membership function. For example, the hard HMM employs only the best path for estimating model parameters and for recognition, and the hard GMM employs only the most likely Gaussian distribution among the mixture of Gaussians to represent a feature vector. Although the smoothed (fuzzy) membership is more successful than the binary (hard) membership in describing the model structures, hard models can also be used because they are simple yet effective. The simplest hard model is the VQ model, which is effective for speaker recognition. This chapter proposes new hard models—hard HMMs and hard GMMs. These models emerge as interesting consequences of investigating fuzzy approaches. Sections 5.2 and 5.3 present hard models for HMMs and GMMs, respectively. The last section gives a summary and a conclusion for hard models. 5.1 From Fuzzy To Hard Models Fuzzy and hard models have a mutual relation: fuzzy models are obtained by fuzzifying hard models, or inversely we can derive hard models by defuzzification of fuzzy models. We consider this relation for the simplest models: VQ (hard C-means) and 76 5.1 From Fuzzy To Hard Models 77 fuzzy VQ (FE-VQ and FCM-VQ). As presented in Chapters 3 and 4, the fuzzification of the hard objective function (the sum-of-squared-error function) to the fuzzy objective function is achieved by adding a fuzzy entropy term to the hard objective function for FE-VQ or applying a weighting exponent m > 1 to each uit for FCM-VQ. Figure 5.1 shows this fuzzification. Inversely, the defuzzification of FE-VQ by letting n → 0 or FCM-VQ by letting m → 1 results in the same VQ. To obtain a simpler calculation, a more convenient way is to use the minimum distance rule mentioned in Equation (2.41) on page 37 to implement the defuzzification. Figure 5.2 shows the defuzzification methods. FE J(U, λ; X) = T C ∑ ∑ ✲ Hn (U, λ; X) = T C ∑ ∑ uit d2it + n T C ∑ ∑ uit log uit i=1 t=1 i=1 t=1 uit d2it ✲ i=1 t=1 FCM ✲ Jm (U, λ; X) = T C ∑ ∑ 2 um it dit i=1 t=1 Hard: uit = 0 or 1 Fuzzy: uit ∈ [0, 1] Figure 5.1: From hard VQ to fuzzy VQ: an additional fuzzy entropy term for fuzzy entropy VQ, and a weighting exponent m > 1 on each uit for fuzzy C-means VQ. 
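This fuzzification/defuzzification picture can also be seen numerically. The illustrative fragment below (names and distance values are mine) compares, for a single vector, the FE membership (3.8), the FCM membership (4.15) and the hard membership given by the minimum distance rule, and shows the limits n → 0 and m → 1 approaching the hard assignment.

```python
import numpy as np

d2 = np.array([0.5, 1.0, 2.0])             # squared distances from one vector x_t to C = 3 clusters

def u_fe(d2, n):                            # fuzzy entropy membership, eq. (3.8)
    u = np.exp(-(d2 - d2.min()) / n)
    return u / u.sum()

def u_fcm(d2, m):                           # fuzzy C-means membership, eq. (4.15)
    u = (1.0 / d2) ** (1.0 / (m - 1.0))
    return u / u.sum()

def u_hard(d2):                             # minimum distance rule (defuzzification)
    u = np.zeros_like(d2)
    u[np.argmin(d2)] = 1.0
    return u

print(np.round(u_fe(d2, 1.0), 3), np.round(u_fcm(d2, 2.0), 3), u_hard(d2))
print(np.round(u_fe(d2, 0.01), 3))          # n -> 0: the FE membership approaches the hard one
print(np.round(u_fcm(d2, 1.01), 3))         # m -> 1: the FCM membership approaches the hard one
```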
The mutual relation between fuzzy and hard VQ models has already been reported in the literature. In this thesis, we show that this relation can be applied to more general models, such as the GMM and the HMM. Indeed, the fuzzy HMMs and FE-VQ: n→0 FCM-VQ: m→1 { 1 if dit < dkt ∀k Minimum distance rule: uit = 0 otherwise ✲ VQ Figure 5.2: From fuzzy VQ to (hard) VQ: n → 0 for FE-VQ or m → 1 for FCM-VQ or using the minimum distance rule to compute uit directly. 5.2 Hard Hidden Markov Models 78 the fuzzy GMMs presented in Chapters 3 and 4 are generalised from fuzzy VQ by using the distance-probability relation d2XY = − log P (X, Y ) that relates clustering to modelling. In this chapter, the distance-probability relation is used to derive hard HMMs and hard GMMs from fuzzy HMMs and fuzzy GMMs, respectively. Applying this relation to the minimum distance rule, we obtain a probabilistic rule to determine the hard membership value uY (X) as follows uY (X) = { 1 if P (X, Y ) > P (X, Z) ∀Z 6= Y 0 otherwise (5.1) where X denotes observable data and Y , Z denote unobservable data. This rule can be called the maximum joint probability rule. Figure 5.3 shows the mutual relations between fuzzy and hard models used in this thesis. Hard HMMs Hard GMMs ✛ Minimum distance rule Fuzzy HMMs Fuzzy GMMs ✻2 dXY = − log P (X, Y ) VQ ✛ Fuzzification ✲ Fuzzy VQ Defuzzification Figure 5.3: Mutual relations between fuzzy and hard models 5.2 Hard Hidden Markov Models As discussed in Section 3.4.1, the membership uijt denotes the belonging of an observation sequence O to fuzzy state sequences being in state i at time t and state j at time t + 1. There are N possible fuzzy states at each time t and they are concatenated into fuzzy state sequences. Figure 5.4 illustrates possible fuzzy state sequences starting from state 1 at time t = 1 and ending at state 3 at time t = T in a 3-state left-to-right HMM with ∆i = 1 (the Bakis HMM) and also in the corresponding fuzzy HMM. As discussed above, the hard membership function takes only two values 0 and 1. This means that in hard HMMs, each observation ot belongs only to the most likely state at each time t. In other words, the sequence O is in the most likely single state 5.2 Hard Hidden Markov Models 79 Figure 5.4: Possible state sequences in a 3-state Bakis HMM and a 3-state fuzzy Bakis HMM sequence. This is quite similar to the state sequence in HMMs using the Viterbi algorithm, where if we round maximum probability values to 1 and others to 0, we obtain hard HMMs. Therefore, conventional HMMs using the Viterbi algorithm can be regarded as “pretty” hard HMMs. This remark gives an alternative approach to hard HMMs from conventional HMMs based on the Viterbi algorithm. Figure 5.5 illustrates a possible single state sequence in the hard HMM [Tran et al. 2000a]. Figure 5.5: A possible single state sequence in a 3-state hard HMM 5.2 Hard Hidden Markov Models 5.2.1 80 Hard Discrete HMM This section shows the reestimation algorithm for the hard discrete HMM (H-DHMM), where the maximum joint probability rule in (5.1) is formulated as the “hard” E-step. In previous chapters, we have used the fuzzy membership uijt for FE-DHMMs and FCM-DHMMs. However, for reducing calculation, the hard membership uijt can be computed by the product of the memberships uit and uj(t+1) . Indeed, uijt = 1 means that the observation sequence O is in state i at time t (uit = 1) and state j at time t + 1 (uj(t+1) = 1), thus uijt = uit uj(t+1) = 1. It is quite similar for the remaining cases. 
Therefore we use the following distance for the membership uit d2it = − log P (O, st = i|λ) (5.2) From (2.26) page 29, we can show that P (O, st = i|λ) = αt (i)βt (i) (5.3) where αt (i) and βt (i) are computed in (2.24) and (2.25) page 28. Using the maximum joint probability rule in (5.1), the reestimation algorithm for H-DHMM is as follows Hard E-Step: uit = { 1 if αt (i)βt (i) > αt (k)βt (k) ∀k 6= i 0 otherwise (5.4) Ties are broken randomly. M-Step: π j = uj1 , aij = T∑ −1 uit uj(t+1) t=1 T∑ −1 , uit t=1 5.2.2 bj (k) = T ∑ ujt t=1 s.t. ot =vk T ∑ (5.5) ujt t=1 Hard Continuous HMM The distance in (3.30) page 57 defined for the FE-CHMM and the FCM-CHMM is used for the hard continuous HMM (H-CHMM). Using the maximum joint probability rule in (5.1), the reestimation algorithm for H-CHMM is as follows 5.3 Hard Gaussian Mixture Models 81 Hard E-Step: ujkt = { 1 if P (X, st = j, mt = k|λ) > P (X, st = h, mt = l|λ) ∀(h, l) 6= (j, k) (5.6) 0 otherwise where ties are broken randomly and P (X, st = j, mt = k|λ) is computed in (3.30). M-Step: wjk = T ∑ ujkt t=1 K T ∑ ∑ µjk = , ujkt t=1 k=1 5.3 T ∑ ujkt xt t=1 T ∑ , Σjk = T ∑ ujkt (xt − µjk )(xt − µjk )′ t=1 ujkt t=1 T ∑ ujkt t=1 (5.7) Hard Gaussian Mixture Models In fuzzy Gaussian mixture models, the membership uit denotes the belonging of vector xt to cluster i represented by a Gaussian distribution. Since 0 ≤ uit ≤ 1, vector xt is regarded as belonging to a mixture of Gaussian distributions. In hard Gaussian mixture models (H-GMMs), the membership uit takes only two values 0 and 1. This means that vector xt belongs to only one Gaussian distribution, or in other words, Gaussian distributions are not mixed, they are separated by boundaries as in VQ. Figure 5.6 illustrates a mixture of three Gaussian distributions in fuzzy GMMs and Figure 5.7 illustrates three separate Gaussian distributions in the H-GMM. With this interpretation, H-GMMs should be termed hard C-Gaussians models (from the terminology “hard C-means”) or K-Gaussians models (from the terminology “Kmeans”) [Tran et al. 2000b]. The distance in (3.40) page 59 defined for the FE-GMM is used for the H-GMM. From the maximum joint probability rule in (5.1), the reestimation algorithm for the H-GMM is formulated as follows Hard E-Step: uit = { 1 if P (xt , i|λ) > P (xt , k|λ) ∀i 6= k 0 otherwise (5.8) where ties are broken randomly and P (xt , i|λ) = wi N (xt , µi , Σi ) (5.9) 5.3 Hard Gaussian Mixture Models 82 Figure 5.6: A mixture of three Gaussian distributions in the GMM or the fuzzy GMM. Figure 5.7: A set of three non overlapping Gaussian distributions in the hard GMM. 5.4 Summary and Conclusion 83 wi and N (xt , µi , Σi ) are defined in Section 2.3.5, page 29. M-Step: wi = T ∑ uit t=1 K T ∑∑ , ukt t=1 k=1 µi = T ∑ uit xt t=1 T ∑ , Σi = T ∑ uit (xt − µi )(xt − µi )′ t=1 uit t=1 T ∑ (5.10) uit t=1 It can be seen that the above reeestimation algorithm is the most generalised algorithm of the VQ algorithms mentioned in Section 2.3.6 page 31. Indeed, let Ti be the number of vectors in cluster i, from (5.8) we obtain T ∑ uit = Ti (5.11) t=1 Replacing (5.11) into (5.10) gives an alternative form for the M-step as follows M-Step: wi = Ti , T µi = 1 ∑ xt , Ti xt ∈Ci Σi = 1 ∑ (xt − µi )(xt − µi )′ Ti xt ∈Ci (5.12) Since covariance matrices are also reestimated, this algorithm is more generalised than the ECVQ algorithm reviewed in (2.36), page 32. 5.4 Summary and Conclusion Hard models have been presented in this chapter. 
There are three ways to obtain hard models: 1) using the fuzzy entropy algorithm with n ≈ 0; 2) using the fuzzy C-means algorithm with m ≈ 1; and 3) using the nearest prototype rule. The third way requires the simplest calculations. We have proposed hard HMMs and hard GMMs, where only the best state sequence is employed in hard HMMs and a set of non-overlapping Gaussian distributions is employed in hard GMMs. Conventional HMMs using the Viterbi algorithm can be regarded as "pretty" hard HMMs. Hard GMMs are regarded as the most generalised VQ models, from which VQ, extended VQ and entropy-constrained VQ can be derived. Conventional HMMs using the Viterbi algorithm and VQ models are widely used, which means that hard models play an important role in speech and speaker recognition. Relationships between the hard models are summarised in Figure 5.8 and experimental results for hard models will be reported in Chapter 6.

Figure 5.8: Relationships between hard models — Hard DHMM, λ = {π, A, B}; Hard CHMM, λ = {π, A, B = {w, µ, Σ}}; Hard GMM, λ = {w, µ, Σ}; ECVQ, λ = {µ, w}; Extended VQ, λ = {µ, Σ}; VQ, λ = {µ}

Chapter 6

A Fuzzy Approach to Speaker Verification

Fuzzy approaches have been used in Chapters 3 and 4 to train speech and speaker models. This chapter proposes an alternative fuzzy approach to speaker verification. For an input utterance and a claimed identity, most current methods compute a claimed speaker's score as the ratio of the claimed speaker's and the impostors' likelihood functions, and compare this score with a given threshold to accept or reject the speaker. Considering the speaker verification problem in terms of fuzzy set theory, the claimed speaker's score is viewed as the fuzzy membership function of the input utterance in the claimed speaker's fuzzy set of utterances. Fuzzy entropy and fuzzy C-means membership functions are proposed as fuzzy membership scores, which are ratios of functions of the claimed speaker's and impostors' likelihood functions. A likelihood transformation is therefore considered to relate the current likelihood scores and the fuzzy membership scores. Based on this consideration, further fuzzy scores are proposed and compared with the current methods. Furthermore, the noise clustering method supplies a very effective modification to all methods, which can overcome some of the problems of ratio-type scores and greatly reduce the false acceptance rate. Some basic concepts of speaker verification relevant to this chapter have been reviewed in Section 2.2.2, page 18. A more detailed analysis of a speaker verification system and of the current normalisation methods is provided in this chapter.

6.1 A Speaker Verification System

Let λ0 be the claimed speaker model and λ be a model representing all other possible speakers, i.e. impostors¹. For a given input utterance X and a claimed identity, the choice is between the hypothesis H0: X is from the claimed speaker λ0, and the alternative hypothesis H1: X is from the impostors λ. A claimed speaker's score S(X) is computed to reject or accept the speaker claim. Depending on the meaning of the score, we can distinguish between similarity scores L(X) and dissimilarity scores D(X) between X and λ0. Likelihood scores are included in L(X) and VQ distortion scores are included in D(X). These scores satisfy the following rules

L(X) \begin{cases} > \theta_L & \text{accept} \\ \le \theta_L & \text{reject} \end{cases} \qquad (6.1)

D(X) \begin{cases} < \theta_D & \text{accept} \\ \ge \theta_D & \text{reject} \end{cases} \qquad (6.2)

where θL and θD are the decision thresholds.
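A trivial sketch of the two decision rules (the function name and the example scores and thresholds are mine):

```python
def accept(score, threshold, similarity=True):
    """Decision rules (6.1) and (6.2): a similarity score L(X) must exceed the threshold,
    whereas a dissimilarity score D(X) must fall below it."""
    return score > threshold if similarity else score < threshold

print(accept(2.6, 2.0))                    # likelihood-type score: True (accept)
print(accept(5.1, 4.0, similarity=False))  # distortion-type score: False (reject)
```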
Figure 6.1 presents a typical speaker verification system. Claimed Identity Input Speech ✲ Speech Processing ✗ Speaker ✲ Models ✖ ✲ ✔✗ Threshold ✕✖ ❄ Score Determination ✲ ❄ Hypothesis Testing ✔ ✕ Accept ✲ or Reject Figure 6.1: A Typical Speaker Verification System Assuming that speaker models are given, this chapter solves the problem of finding the effective scores of the claimed speaker such that the equal error rate (EER) mentioned in Section 2.2.2 page 18 is minimised. We define equivalent scores as scores giving the same equal error rate (EER) even though they possibly use different thresholds. For example, Sa (X) = P (X|λ0 ) and Sb (X) = log P (X|λ0 ) are equivalent scores, but use thresholds θ and log θ, respectively . 1 In this context, we use the term imposter for any speaker other than the claimed speaker and without any implication of fraudulent intent or active voice manipulation 6.2 Current Normalisation Methods 6.2 87 Current Normalisation Methods The simplest method of scoring is to use the absolute likelihood score (unnormalised score) of an utterance. In the log domain, that is L0 (X) = log P (X|λ0 ) (6.3) This score is strongly influenced by variations in the test utterance such as the speaker’s vocal characteristics, the linguistic content and the speech quality. It is very difficult to set a common decision threshold to be used over different tests. This drawback is overcome to some extent by using normalisation. According to the Bayes decision rule for minimum risk, a likelihood ratio L1 (X) = P (X|λ0 ) P (X|λ) (6.4) is used. This ratio produces a relative score which is less volatile to non-speaker utterance variations [Reynolds 1995a]. In the log domain, (6.4) is equivalent to the following normalisation technique, proposed by Higgins et al., 1991 L1 (X) = log P (X|λ0 ) − log P (X|λ) (6.5) The term log P (X|λ) in (6.5) is called the normalisation term and requires calculation of all impostors’ likelihood functions. An appoximation of this method is to use only the closest impostor model for calculating the normalisation term [Liu et al. 1996] L2 (X) = log P (X|λ0 ) − max log P (X|λ) λ6=λ0 (6.6) However when the size of the population increases, both of these normalisation methods L1 (X) and L2 (X) are unrealistic since all impostors’ likelihood functions must be calculated for determining the value of the normalisation term. Therefore a subset of the impostor models is used. This subset consists of B “background” speaker models λi , i = 1, . . . , B and is representative of the population close to the claimed speaker, i.e. the “cohort speaker” set [Rosenberg et al. 1992]. Depending on the approximation of P (X|λ) in (6.4) by the likelihood functions of the background model set P (X|λi ), i = 1, . . . , B, we obtain different normalisation methods. An approximation [Reynolds 1995a] has been applied that is the arithmetic mean (average) of 6.2 Current Normalisation Methods 88 the likelihood functions of B background speaker models. The corresponding score for this approximation is L3 (X) = log P (X|λ0 ) − log { } B 1 ∑ P (X|λi ) B i=1 (6.7) If the claimed speaker’s likelihood function is also included in the above arithmetic mean, we obtain the normalisation method based on the a posteriori probability [Matsui and Furui 1993] L4 (X) = log P (X|λ0 ) − log B ∑ P (X|λi ) (6.8) i=0 Note that i = 0 in (6.8) denotes the claimed speaker model and the constant term 1/B is accounted for in the decision threshold. 
If the geometric mean is used instead of the arithmetic mean to approximate P (X|λ), we obtain the normalisation method [Liu et al. 1996] as follows L5 (X) = log P (X|λ0 ) − B 1 ∑ log P (X|λi ) B i=1 (6.9) Normalisation methods can also be applied to the likelihood function of each vector xt , t = 1, . . . , T in X, and such methods are called frame level normalisation methods. Such a method has been proposed as follows [Markov and Nakagawa 1998b] L6 (X) = T [ ∑ log P (xt |λ0 ) − log t=1 B ∑ ] P (xt |λi ) i=1 (6.10) For VQ-based speaker verification systems, the following score is widely used D1 (X) = D(X, λ0 ) − B 1 ∑ D(X, λi ) B i=1 (6.11) where D(X, λi ) = T ∑ (i) dkt (6.12) t=1 (i) where dkt is the VQ distance between vector xt ∈ X and the nearest codevector k in the codebook λi . It can be seen that this score is equivalent to L5 (X) if we replace D(X, λi ) in (6.11) by [− log P (X|λi )], i = 0, 1, . . . , B. 6.3 Proposed Normalisation Methods 6.3 89 Proposed Normalisation Methods Consider the speaker verification problem in fuzzy set theory. To accept or reject the claimed speaker, the task is to make a decision whether the input utterance X is either from the claimed speaker λ0 or from the set of impostors λ, based on comparing the score for X and a decision threshold θ. Thus the space of input utterances can be considered as consisting of two fuzzy sets: C for the claimed speaker and I for impostors. Degrees of belonging of X to these fuzzy sets are denoted by the fuzzy membership functions of X, where the fuzzy membership of X in C can be regarded as a claimed speaker’s score satisfying the rule in (6.1). Making a (hard) decision is thus a defuzzification process where X is completely in C if the fuzzy membership of X in C is sufficiently high, i.e greater than the threshold θ. In theory, there are many ways to define the fuzzy membership function, therefore it can be said that this fuzzy approach proposes more general scores than the current likelihood ratio scores for speaker verification. These are termed fuzzy membership scores, which can denote the belonging of X to the claimed speaker. Their values need not be scaled into the interval [0, 1] because of the above-mentioned equivalence of scores. Based on this discussion, all of the above-mentioned likelihood-based scores can also be viewed as fuzzy membership scores. The next task is to find effective fuzzy membership scores. To do this, we need to know what the shortcoming of likelihood ratio scores is. The main problem comes from the relative nature of a ratio. Indeed, assuming L3 (X) in (6.7) is used, consider the two equal likelihood ratios in the following example L3 (X1 ) = 0.0000007 0.07 = L3 (X2 ) = 0.03 0.0000003 (6.13) where both X1 and X2 are accepted if assuming the given threshold is 2. The first ratio can lead to a correct decision that the input utterance X1 is from the claimed speaker (true acceptance). However it is improbable that X2 is from the claimed speaker or from any of background speakers since both likelihood values in the second ratio are very low. X2 is probably from an impostor and thus a false acceptance can occur on the basis of the likelihood ratio. This is a similar problem to that addressed by Chen et al. [1994]. 
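The likelihood-ratio scores and the weakness just described can be illustrated in a few lines. The sketch below (function names and the background likelihood values are mine and purely illustrative) implements the arithmetic-mean score (6.7) and the geometric-mean score (6.9), and reproduces the two equal ratios of example (6.13).

```python
import numpy as np

def L3(p0, p_bg):
    """Arithmetic-mean background normalisation in the log domain, eq. (6.7)."""
    return np.log(p0) - np.log(np.mean(p_bg))

def L5(p0, p_bg):
    """Geometric-mean background normalisation, eq. (6.9)."""
    return np.log(p0) - np.mean(np.log(p_bg))

p0 = 3.0e-4                                  # illustrative claimed-speaker likelihood
p_bg = [2.2e-5, 4.2e-5, 1.5e-6]              # illustrative background-speaker likelihoods
print(round(float(L3(p0, p_bg)), 3), round(float(L5(p0, p_bg)), 3))

# the two ratios of example (6.13): the ratio values are identical (~2.33 > threshold 2),
# although the second pair of likelihoods is vanishingly small and probably from an impostor
print(0.07 / 0.03, 0.0000007 / 0.0000003)
```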
Two fuzzy membership scores proposed in previous chapters can overcome this problem. The fuzzy entropy (FE) score L_7(X) and the fuzzy C-means (FCM) score L_8(X) are rewritten as follows

L_7(X) = \frac{[P(X|\lambda_0)]^{1/n}}{\sum_{i=0}^{B} [P(X|\lambda_i)]^{1/n}}    (6.14)

L_8(X) = \frac{[-\log P(X|\lambda_0)]^{1/(1-m)}}{\sum_{i=0}^{B} [-\log P(X|\lambda_i)]^{1/(1-m)}}    (6.15)

where m > 1, n > 0, and the impostors' fuzzy set I is approximately represented by B background speakers' fuzzy subsets. Note that in the log domain, for n = 1, the score L_7(X) after taking the logarithm reduces to the score based on the a posteriori probability, L_4(X) in (6.8). Experimental results in the next chapter show low EERs for these effective scores.

To illustrate the effectiveness of these scores, for simplicity, we compare L_3(X) with L_8(X) in a numerical example. Table 6.1 presents the likelihood values for 4 input utterances X_1 − X_4 against the claimed speaker λ_0 and 3 impostors λ_1 − λ_3, where X_1^c, X_2^c are from the claimed speaker and X_3^i, X_4^i are from impostors (these are real values from experiments on the TI46 database).

Utterance   P(X|λ_0)       P(X|λ_1)       P(X|λ_2)       P(X|λ_3)
X_1^c       3.02 × 10^-4   2.23 × 10^-5   4.16 × 10^-5   1.52 × 10^-6
X_2^c       1.11 × 10^-3   1.4 × 10^-4    1.42 × 10^-4   2.03 × 10^-5
X_3^i       7.91 × 10^-6   1.17 × 10^-7   1.24 × 10^-6   7.23 × 10^-8
X_4^i       1.2 × 10^-4    7.16 × 10^-6   6.64 × 10^-8   6.51 × 10^-8

Table 6.1: The likelihood values for 4 input utterances X_1 − X_4 against the claimed speaker λ_0 and 3 impostors λ_1 − λ_3, where X_1^c, X_2^c are from the claimed speaker and X_3^i, X_4^i are from impostors

Given a score L(X), the EER = 0 in the case that all the scores for X_1^c and X_2^c are greater than all those for X_3^i and X_4^i. Table 6.2 shows the scores in (6.7) and (6.15) computed using these likelihood values, where m = 2 is applied to (6.15). It can be seen that with the score L_3(X) we always have EER ≠ 0, since the scores for X_3^i and X_4^i are higher than those for X_1^c and X_2^c.

Score     X_1^c    X_2^c    X_3^i    X_4^i
L_3(X)    2.628    2.399    2.810    3.899
L_8(X)    0.316    0.316    0.302    0.350

Table 6.2: Scores of 4 utterances using L_3(X) and L_8(X)

However, using L_8(X), the EER is reduced since the score for X_3^i is lower than those for X_1^c and X_2^c. A more robust method proposed in this chapter is to use fuzzy membership scores based on the noise clustering (NC) method. Indeed, this fuzzy approach can reduce the false acceptance error by forcing the membership value of the input utterance X to become as small as possible if X is really from an impostor, not from the claimed speaker or background speakers. This fuzzy approach is simple but very effective: a suitable constant value ε > 0 (similar to the constant distance δ in the NC method) is added to the denominators of the ratios, i.e. to the normalisation terms, as follows

Change \sum \ldots in the normalisation term to \sum \ldots + \varepsilon    (6.16)

Note that the NC method can be applied not only to fuzzy membership scores but also to likelihood ratio scores. For example, NC-based versions of L_3(X), L_7(X) and L_8(X) are as follows

L_{3nc}(X) = \log P(X|\lambda_0) - \log \Big[ \frac{1}{B} \sum_{i=1}^{B} P(X|\lambda_i) + \varepsilon_3 \Big]    (6.17)

L_{7nc}(X) = \frac{[P(X|\lambda_0)]^{1/n}}{\sum_{i=0}^{B} [P(X|\lambda_i)]^{1/n} + \varepsilon_7}    (6.18)

L_{8nc}(X) = \frac{[-\log P(X|\lambda_0)]^{1/(1-m)}}{\sum_{i=0}^{B} [-\log P(X|\lambda_i)]^{1/(1-m)} + \varepsilon_8}    (6.19)

where the index "nc" means "noise clustering".
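A minimal sketch of the FE and FCM membership scores and their NC versions follows, assuming the likelihoods are passed as plain floating-point values with the claimed speaker first and that natural logarithms are used; the small check reproduces the L_8 value of utterance X_3^i from Table 6.2 (m = 2).

    import numpy as np

    def score_L7(liks, n=1.0, eps=0.0):
        # FE membership score (6.14); liks[0] is P(X|lambda_0), liks[1:] are the
        # B background likelihoods. eps > 0 gives the NC version L7nc in (6.18).
        liks = np.asarray(liks, dtype=float)
        powered = liks ** (1.0 / n)
        return powered[0] / (powered.sum() + eps)

    def score_L8(liks, m=2.0, eps=0.0):
        # FCM membership score (6.15); eps > 0 gives the NC version L8nc in (6.19).
        liks = np.asarray(liks, dtype=float)
        powered = (-np.log(liks)) ** (1.0 / (1.0 - m))
        return powered[0] / (powered.sum() + eps)

    # Reproducing the L_8 score of utterance X_3^i in Table 6.2 (row X_3^i of Table 6.1):
    liks_X3 = [7.91e-6, 1.17e-7, 1.24e-6, 7.23e-8]
    print(round(score_L8(liks_X3, m=2.0), 3))   # 0.302, as in Table 6.2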
For illustration, applying the NC-based scores to the first example in (6.13) with ε_3 = 0.01 gives

L_{3nc}(X_1^c) = \frac{0.07}{0.03 + 0.01} = 1.75 \;>\; L_{3nc}(X_2^c) = \frac{0.0000007}{0.0000003 + 0.01} \approx 0.000069    (6.20)

For the second example, based on the likelihood values in Table 6.1 with ε_3 = 10^{-4} and ε_8 = 0.5, the NC-based scores are shown in Table 6.3. We can see that with thresholds θ_3 = 1.0 and θ_8 = 0.137, the EER is 0 for both of these NC-based scores.

Score         X_1^c    X_2^c    X_3^i     X_4^i
L_{3nc}(X)    2.251    1.710    -2.547    0.158
L_{8nc}(X)    0.139    0.152    0.109     0.136

Table 6.3: Scores of 4 utterances using L_{3nc}(X) and L_{8nc}(X)

We have illustrated the effectiveness of fuzzy membership scores by way of some numerical examples. To be more rigorous, this approach should be considered in theoretical terms. Let us consider the expressions of likelihood ratio scores and fuzzy membership scores. The former is a ratio of likelihood functions whereas the latter is a ratio of functions of likelihood functions. Indeed, denoting P as the likelihood function, we can see that the FE score employs the function f(P) = P^{1/n} and the FCM score employs f(P) = (−log P)^{1/(m−1)}. In other words, likelihood ratio scores are transformed to FE and FCM scores by using these functions. Such a transformation is considered in the next section.

6.4 The Likelihood Transformation

Consider a transformation T : P → T(P), where P is the likelihood function and T(P) is a certain continuous function of P. For example, T[P(X|λ_0)] = log P(X|λ_0). Applying this transformation to the likelihood ratio score L_1(X) gives an alternative score S(X)

S(X) = \frac{T[P(X|\lambda_0)]}{T[P(X|\lambda)]}    (6.21)

The difference between S(X) and L_1(X) is

S(X) - L_1(X) = \frac{P(X|\lambda_0)}{T[P(X|\lambda)]} \Big[ \frac{T[P(X|\lambda_0)]}{P(X|\lambda_0)} - \frac{T[P(X|\lambda)]}{P(X|\lambda)} \Big]    (6.22)

Assume that in (6.22) T(P)/P is an increasing function, i.e. if P_1/P_2 > 1 or P_1 > P_2 then T(P_1)/P_1 > T(P_2)/P_2, and that T(P) is a negative function for 0 ≤ P ≤ 1. The two following cases can occur:

• P(X|λ_0)/P(X|λ) > 1: the expression on the right hand side of (6.22) is negative, thus

\frac{T[P(X|\lambda_0)]}{T[P(X|\lambda)]} < \frac{P(X|\lambda_0)}{P(X|\lambda)} \;\Leftrightarrow\; S(X) < L_1(X)    (6.23)

• P(X|λ_0)/P(X|λ) ≤ 1: the expression on the right hand side of (6.22) is non-negative, thus

\frac{T[P(X|\lambda_0)]}{T[P(X|\lambda)]} \ge \frac{P(X|\lambda_0)}{P(X|\lambda)} \;\Leftrightarrow\; S(X) \ge L_1(X)    (6.24)

In the first case, the transformation moves likelihood ratios greater than 1 to transformed likelihood ratios less than 1; vice versa, in the second case, ratios less than 1 are moved to transformed ratios greater than 1. Figure 6.2 illustrates this transformation. This transformation is a nonlinear mapping, since the distances between different ratios and their transformations are different. For example, the distances AA', BB', CC' and DD' in Figure 6.2 are different.

Figure 6.2: The transformation T where T(P)/P increases and T(P) is non-positive for 0 ≤ P ≤ 1: the values of 4 ratios at A, B, C, and D are moved to those at A', B', C', and D'
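A minimal sketch of the transformation with T(P) = log P is given below (illustrative only); it recomputes the four ratios used in the numerical example (6.25) just below, showing how ratios above 1 are mapped below 1 and how the point with the smallest likelihood values (B) is reordered relative to A and C.

    import math

    def transformed_ratio(p_claimed, p_impostor):
        # S(X) in (6.21) with T(P) = log P: the ratio of log-likelihoods.
        return math.log(p_claimed) / math.log(p_impostor)

    pairs = {"A": (0.011, 0.009), "B": (0.0009, 0.0008),
             "C": (0.01, 0.009),  "D": (0.044, 0.051)}

    for name, (p0, p1) in pairs.items():
        plain = p0 / p1
        mapped = transformed_ratio(p0, p1)
        print(f"{name}: P ratio = {plain:.3f}  ->  log ratio = {mapped:.3f}")
    # A, B, C (ratios above 1) are mapped below 1 and D (below 1) above 1;
    # B moves ahead of A and C because its likelihood values are much smaller,
    # which is the behaviour exploited in (6.25).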
If T(P) = log P, a numerical example for these ratios is as follows

A: 0.011/0.009 ≃ 1.222   →   A': log 0.011 / log 0.009 ≃ 0.957
B: 0.0009/0.0008 ≃ 1.125   →   B': log 0.0009 / log 0.0008 ≃ 0.9834
C: 0.01/0.009 ≃ 1.111   →   C': log 0.01 / log 0.009 ≃ 0.978
D: 0.044/0.051 ≃ 0.863   →   D': log 0.044 / log 0.051 ≃ 1.049    (6.25)

Comparing the order of the 4 points A, B, C, D with the order of A', B', C', D', and the likelihood values at point B (0.0009 and 0.0008) with the other values, shows that this transformation T can "recognise" ratios of small likelihood values and thus leads to a reduction of the false acceptance error rate (and the EER also), as shown in the previous section's examples.

Figure 6.3 shows histograms for a female speaker labelled f7 in the TI46 corpus using 16-mixture GMMs, where the score L_3(X) = \log \big[ P(X|\lambda_0) \big/ \big\{ \frac{1}{B} \sum_{i=1}^{B} P(X|\lambda_i) \big\} \big] in Figure 6.3a is transformed by T(P) = log P to the score D_3(X) = \log P(X|\lambda_0) \big/ \log \big\{ \frac{1}{B} \sum_{i=1}^{B} P(X|\lambda_i) \big\} in Figure 6.3b.

Figure 6.3: Histograms of speaker f7 in the TI46 using 16-mixture GMMs. The EER is 6.67% for Fig. 6.3a and 5.90% for Fig. 6.3b.

The cases where T(P)/P is a decreasing function or T(P) is a positive function can be treated similarly. In general, there may not exist a function T(P) for the transformation T such that the EER is 0. However, a function T(P) that reduces the EER may exist. For convenience in calculating products of probabilities, the function T(P) should be related to the logarithm function. This is also convenient for applying these methods to VQ-based speaker verification, since the distance in VQ can be defined as the negative logarithm of the corresponding likelihood function. Based on this transformation, and to compare with current likelihood ratio scores, we also propose further fuzzy scores for the decision rule in (6.2) as follows

D_2(X) = \frac{\log P(X|\lambda_0)}{\max_{\lambda \neq \lambda_0} \log P(X|\lambda)}    (6.26)

D_3(X) = \frac{\log P(X|\lambda_0)}{\log \Big\{ \frac{1}{B} \sum_{i=1}^{B} P(X|\lambda_i) \Big\}}    (6.27)

D_4(X) = \frac{\log P(X|\lambda_0)}{\log \Big\{ \frac{1}{B} \sum_{i=0}^{B} P(X|\lambda_i) \Big\}}    (6.28)

D_5(X) = \frac{\log P(X|\lambda_0)}{\frac{1}{B} \sum_{i=1}^{B} \log P(X|\lambda_i)}    (6.29)

D_6(X) = \frac{\arctan[\log P(X|\lambda_0)]}{\frac{1}{B} \sum_{i=1}^{B} \arctan[\log P(X|\lambda_i)]}    (6.30)

where T[P(X|λ)] of the impostors is approximated by the transformed likelihood functions of the background speakers, for D_2(X), ..., D_5(X), in the same manner as was applied to L_2(X), ..., L_5(X). Note that the factor 1/B in D_4(X) is not accounted for in the decision threshold, as it is for L_4(X) in (6.8). The effectiveness of the score in (6.30) using the arctan function will be shown in the next chapter. It should again be noted that NC-based versions of these scores are derived following the method in (6.16). To apply these scores to VQ-based speaker verification systems, likelihood functions should be changed to VQ distortions, as shown for the score D_1(X) in (6.11) and (6.12).

6.5 Summary and Conclusion

A fuzzy approach to speaker verification has been proposed in this chapter. Using fuzzy set theory, fuzzy membership scores are proposed as scores more general than current likelihood scores. The likelihood transformation and 7 new scores have been proposed. This fuzzy approach also leads to a noise clustering-based version of all scores, which improves speaker verification performance markedly. Using the arctan function in computing the score illustrates a theoretical extension of the normalisation methods, where not only the logarithm function but also other functions can be used.
Chapter 7

Evaluation Experiments and Results

This chapter presents experiments performed to evaluate the proposed models for speech and speaker recognition as well as the proposed normalisation methods for speaker verification. The three speech corpora used in the evaluation experiments are the TI46, the ANDOSL and the YOHO corpora. Isolated word recognition experiments were performed on the word sets—E set, 10-digit set, 10-command set and 46-word set—taken from the TI46 corpus using conventional, fuzzy and hard hidden Markov models. Speaker identification and verification experiments were performed on 16 speakers of TI46, 108 speakers of ANDOSL and 138 speakers of YOHO using conventional, fuzzy and hard Gaussian mixture models as well as vector quantisation. The experiments demonstrate that fuzzy models and their noise clustering versions outperform conventional models. Hard hidden Markov models also obtained good results.

7.1 Database Description

7.1.1 The TI46 Database

The TI46 corpus was designed and collected at Texas Instruments (TI). The speech was produced by 16 speakers, 8 females and 8 males, labelled f1-f8 and m1-m8 respectively, and consists of two vocabularies—TI-20 and TI-alphabet. The TI-20 vocabulary contains the ten digits from 0 to 9 and ten command words: enter, erase, go, help, no, rubout, repeat, stop, start, and yes. The TI-alphabet vocabulary contains the names of the 26 letters of the alphabet from a to z. For each vocabulary item, each speaker produced 10 tokens in a single training session and another two tokens in each of 8 testing sessions. The words in the TI-20 vocabulary are highly discriminable, with the majority of confusions occurring between go and no. By comparison, the TI-alphabet is a much more difficult vocabulary since it contains several confusable subsets of letters, such as the E-set {b, c, d, e, g, p, t, v, z} and the A-set {a, j, k}. The TI-20 vocabulary is a good choice because it has been used for other tests and therefore can serve as a standard benchmark of performance [Syrdal et al. 1995]. The corpus was sampled at 12500 samples per second and 12 bits per sample.

7.1.2 The ANDOSL Database

The Australian National Database of Spoken Language (ANDOSL) [Millar et al. 1994] corpus comprises carefully balanced material for Australian speakers, both Australian-born and overseas-born migrants. The aim was to represent as many significant speaker groups within the Australian population as possible. Current holdings are divided into those from native speakers of Australian English (born and fully educated in Australia) and those from non-native speakers of Australian English (first-generation migrants having a non-English native language). The subset used for the speaker recognition experiments in this thesis consists of 108 native speakers, divided into 36 speakers of General Australian English, 36 speakers of Broad Australian English, and 36 speakers of Cultivated Australian English, comprising 6 speakers of each gender in each of three age ranges (18-30, 31-45 and 46+). So there are a total of 18 groups of 6 speakers, labelled "ijk", where i denotes f (female) or m (male), j denotes y (young) or m (medium) or e (elder), and k denotes g (general) or b (broad) or c (cultivated). For example, the group fyg contains 6 female young general Australian English speakers. Each speaker contributed 200 phonetically rich sentences in a single session. The average duration of each sentence is approximately 4 seconds.
The speech was recorded in an anechoic chamber using a B&K 4155 microphone and a DSC-2230 VU-meter used as a preamplifier and was digitised direct to computer disk using a DSP32C analog-to-digital converter mounted in a PC. All waveforms were sampled at 20 kHz and 16 bits per sample. For the processing as telephone speech, 7.2 Speech Processing 99 all waveforms were converted from 20 kHz to 8 kHz bandwidth. Low and high pass cut-offs were set to 300 Hz and 3400 Hz. 7.1.3 The YOHO Database The YOHO corpus was collected by ITT under a US government contract and was designed for speaker verification systems in office environments with limited vocabulary. There are 138 speakers, 108 males and 30 females. The vocabulary consists of 56 two-digit numbers ranging from 21 to 97 pronounced as “twenty-one”, “ninetyseven”, and spoken continuously in sets of three, for example “36-45-89”, in each utterance. There are four enrolment sessions per speaker, numbered 1 through 4, and each session contains 24 utterances. There are also ten verification sessions, numbered 1 through 10, and each session contains 4 utterances. All waveforms are low-pass filtered at 3.8 kHz and sampled at 8 kHz. 7.2 Speech Processing For the TI46 corpus, the data were processed in 20.48 ms frames (256 samples) at a frame rate of 10 ms. Frames were Hamming windowed and preemphasised with m = 0.9. For each frame, 46 mel-spectral bands of a width of 110 mel and 20 melfrequency cepstral coefficients (MFCC) were determined [Wagner 1996] and a feature vector with a dimension of 20 was generated for individual frames. For the ANDOSL and YOHO corpora, speech processing was performed using HTK V2.0 [Woodland 1997], a toolkit for building HMMs. The data were processed in 32 ms frames at a frame rate of 10 ms. Frames were Hamming windowed and preemphasised with m = 0.97. The basic feature set consisted of 12th-order MFCCs and the normalised short-time energy, augmented by the corresponding delta MFCCs to form a final set of feature vector with a dimension of 26 for individual frames. 7.3 Algorithmic Issues 7.3 100 Algorithmic Issues 7.3.1 Initialisation It was shown in the literature that no significant difference in isolated-word recognition and speaker identification was found by using different initialisation methods [Rabiner et al. 1983, Reynolds 1995b]. So HMMs, GMMs, and VQ are respectively initialised as follows: • HMMs and fuzzy HMMs: Widely-used HMMs for training are left-to-right HMMs as defined in (2.14), i.e. state sequences begin in state 1 and end in state N, therefore the discrete HMM parameter set in our experiments was initialised as follows π1 = 1 πi = 0, 2≤i≤N aij = 0 1 ≤ i ≤ N, j < i or j > i + 1 aij = 0.5 1 ≤ i ≤ N, i≤j ≤i+1 bj (k) = 1/K 1≤j≤N 1≤k≤K (7.1) Fuzzy membership functions were initialised with essentially random choices. Gaussian parameters in continuous HMMs were initialised as those in GMMs. • GMMs and fuzzy GMMs: mixture weights, mean vectors, covariance matrices and fuzzy membership functions were initialised with essentially random choices. Covariance matrices are diagonal, i.e. [Σk ]ii = σk2 and [Σk ]ij = 0 if i 6= j, where σk2 , 1 ≤ k ≤ K are variances. • VQ and fuzzy VQ: only fuzzy membership functions in fuzzy VQ models were randomly initialised. VQ models were trained by the LBG algorithm—a widely used version of the K-means algorithm [Linde et al. 1980], where starting from the codebook size of 1, a binary split procedure is performed to double the codebook size in end of several iterations. 
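Referring back to the HMM initialisation bullet above, the left-to-right initialisation in (7.1) can be sketched as follows (a minimal illustration, not the thesis code); the final state's self-loop probability of 1 is an added assumption, since (7.1) leaves the last row of the transition matrix unspecified.

    import numpy as np

    def init_left_to_right_dhmm(num_states=6, codebook_size=16):
        # Initial state probabilities: the model must start in state 1, as in (7.1).
        pi = np.zeros(num_states)
        pi[0] = 1.0
        # Transitions: only self-loops and moves to the next state, each set to 0.5.
        A = np.zeros((num_states, num_states))
        for i in range(num_states - 1):
            A[i, i] = 0.5
            A[i, i + 1] = 0.5
        A[-1, -1] = 1.0   # last state: self-loop only (assumption, see text above)
        # Discrete emission probabilities: uniform over the K codewords of the VQ codebook.
        B = np.full((num_states, codebook_size), 1.0 / codebook_size)
        return pi, A, B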
7.3.2 Constraints on parameters during training • HMMs and fuzzy HMMs: If the B matrix is left completely unconstrained, that is, a finite training sequence may result in bj (k) = 0, then the probability of that 7.3 Algorithmic Issues 101 sequence is equal to 0, and hence a recognition error must occur. This problem was handled by using post-estimation constraints on the bj (k)’s of the form bj (k) ≥ ɛ, where ɛ is a suitably chosen threshold value [Rabiner et al. 1983]. The number of states N = 6 was chosen for discrete HMMs based on an experiment considering the recognition error in different states as shown in Figure 7.3.2 Figure 7.1: Isolated word recognition error (%) versus the number of state N for the digit-set vocabulary, using left-to-right DHMMs, codebook size K = 16, TI46 database • GMMs and fuzzy GMMs: Similarly, a variance limiting constraint was applied to all GMMs using diagonal covariance matrices. This constraint places a min2 imum variance value σmin on elements of all variance vectors in the GMM, that 2 2 is, σi2 = σmin if σi2 ≤ σmin [Reynolds 1995b]. In our experiments, ɛ = 10−5 2 and σmin = 10−2 . The chosen number of mixtures were 16, 32, 64, and 128 to compare VQ models using the LBG algorithm. • VQ and fuzzy VQ: Codebook sizes of 16, 32, 64, and 128 were chosen for all experiments. • Fuzzy parameters: Choosing the appropriate values of the degree of fuzziness m and the degree of fuzzy entropy n was based on considering recognition error for each database and each model. For example, in speaker identification using 7.4 Isolated Word Recognition 102 FCM-VQ models and the TI46 database, an experiment was implemented to consider the identification error via different values of the degree of fuzziness m, where the codebook size was fixed. Results are shown in Figure 7.3.2. Therefore Figure 7.2: Speaker Identification Error (%) versus the degree of fuzziness m using FCM-VQ speaker models, codebook size K = 16, TI46 corpus the value m = 1.2 was chosen for FCM-VQ models using the TI46 database. In general, our experiments showed that the degree of fuzziness m depends on models and weakly on speech databases. Suitable values were m = 1.2 for FCM-VQ, m = 1.05 for FCM-GMMs and m = 1.2 for FCM-HMMs. Similarly, suitable values for the degree of fuzzy entropy n were n = 1 for FEVQ, n = 1.2 for FE-GMMs and n = 2.5 for FE-HMMs. Figure 7.3.2 shows the word recognition error versus the degree of fuzzy entropy n. 7.4 Isolated Word Recognition The first procedure in building an isolated word recognition system is that training a HMM for each word. The speech signals for training HMMs were converted into feature vector sequences as in (2.1) using the LPC analysis mentioned in Section 2.1.3. For training discrete models, these vector sequences were used to generate a VQ codebook by means of the LBG algorithm and were encoded into observation 7.4 Isolated Word Recognition 103 Figure 7.3: Isolated word recognition error (%) versus the degree of fuzzy entropy n for the E-set vocabulary, using 6-state left-to-right FE-DHMMs, codebook size of 16, TI46 corpus sequences by this VQ codebook. Observation sequences were used to train conventional DHMMs, FE-DHMMs, NC-FE-DHMMs, FCM-DHMMs, NC-FCM-DHMMs and H-DHMMs following reestimation algorithms in Section 2.3.4 on page 26, Section 3.4.2 on page 54, Section 4.2.1 on page 67 and Section 5.2.1 on page 80. 
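The parameter constraints of Section 7.3.2 amount to simple flooring operations applied after each reestimation pass. A minimal sketch using the values ε = 10^-5 and σ²_min = 10^-2 quoted above is shown below; renormalising the emission rows after flooring is an added assumption, so that each state's probabilities still sum to one.

    import numpy as np

    def floor_emissions(B, eps=1e-5):
        # Post-estimation constraint b_j(k) >= eps, then renormalise each row.
        B = np.maximum(B, eps)
        return B / B.sum(axis=1, keepdims=True)

    def floor_variances(variances, var_min=1e-2):
        # Variance limiting constraint for diagonal-covariance GMMs/CHMMs:
        # any variance element below var_min is set to var_min.
        return np.maximum(variances, var_min)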
For training continuous models, vector sequences were directly used as observation sequences to train conventional CHMMs, FE-CHMMs, NC-FE-CHMMs, FCM-CHMMs, NCFCM-CHMMs and H-CHMMs following reestimation algorithms in Section 2.3.4 on page 26, Section 3.4.3 on page 57, Section 4.2.2 on page 68 and Section 5.2.2 on page 80. The second procedure is that recognising unknown words. Assuming that we have a vocabulary of M words to be recognised and that a HMM was trained for each word, the speech signal of unknown word is converted into an observation sequence O using the above codebook. The probabilities P (O|λi ), i = 1, . . . , M were calculated using (2.26) and the recognised word is the word whose probability is highest. The three data sets of the TI46 corpus used for training discrete models are the E set, the 10-digit set, and the 10-command set. The whole 46 words were used to train continuous models. 7.4 Isolated Word Recognition 7.4.1 104 E set Results Table 7.4.1 presents the experimental results for the recognition of the E set using 6state left-to-right HMMs in speaker-dependent mode. In the training phase, 10 training tokens (1 training session x 10 repetitions) of each word were used to train conventional DHMMs, FE-DHMMs, NC-FE-DHMMs, FCM-DHMMs, NC-FCM-DHMMs and H-DHMMs using VQ codebook sizes of 16, 32, 64, and 128. In the recognition phase, isolated word recognition was carried out by testing all 160 test tokens (10 utterances x 8 testing sessions x 2 repetitions) against conventional DHMMs, FE-DHMMs, NC-FE-DHMMs, FCM-DHMMs, NC-FCM-DHMMs and H-DHMMs of each of 16 speakers. For continuous models, 736 test tokens (46 utterances x 8 testing sessions x 2 repetitions) were tested against conventional CHMMs, FE-CHMMs and FCM-CHMMs. Codebook Conventional FE NC-FE FCM NC-FCM Hard Size DHMM DHMM DHMM DHMM DHMM DHMM 16 54.54 41.51 41.48 51.97 42.74 44.51 32 39.41 34.38 34.36 37.46 33.46 36.72 64 33.84 29.92 29.88 30.54 27.87 30.89 128 33.98 31.28 31.25 32.27 31.85 33.07 Table 7.1: Isolated word recognition error rates (%) for the E set In general, the errors are very different for the codebook size of 16, where the highest error is for conventional models. This can be interpreted that the information obtained from training data was lost by using small size codebooks and then not sufficient for training conventional DHMMs. As discussed before, with a suitable degree of fuzzy entropy or fuzziness, fuzzy models can reduce the errors due to this problem. The average recognition error reductions by FE-DHMMs, NC-FE-DHMMs, FCM-DHMMs, and NC-FCM-DHMMs in comparison with conventional DHMMs are (54.54 − 41.51) = 13.03%, (54.54 − 41.23) = 13.31%, (54.54 − 51.97)% = 2.57%, and (54.54 − 43.74)% = 10.8%, respectively. As the codebook size increases, the errors for all models are reduced as shown with the codebook sizes of 32 and 64. The error difference between these models also decreases as the codebook size increases 7.4 Isolated Word Recognition 105 since the information obtained from the training data is sufficient for conventional DHMMs. However, the number of HMM parameters is proportional to the codebook size because of the matrix B, therefore with the large codebook size, e.g. 128, the training data is now insufficient for these HMMs and hence the errors for the codebook size of 128 become larger than those for the codebook size of 64. We should pay attention to results of hard DHMMs. This is surprising since hard models are often performed worse than other models. 
The results also show that FE models performed better than FCM models for the E set. However, for their noise clustering versions, NC-FE-DHMMs are only slightly better than FE-DHMMs, whereas NC-FCM-DHMMs achieved significant improvements in comparison with FCM-DHMMs. The lowest recognition error is 27.87%, obtained by NC-FCM-DHMMs using a VQ codebook size of 64. This can be interpreted as follows: for long distances, the exponential function in FE memberships decreases faster than the inverse function in FCM memberships (see Figure 4.5 on page 74), so FCM models are more sensitive to outliers than FE models.

Since the experiments were in speaker-dependent mode, recognition performance should also be evaluated per speaker. Table 7.2 shows these speaker-dependent recognition results using the codebook size of 16. Recognition error rates were significantly reduced by using fuzzy models for the female speakers labelled f1, f3, f7 and the male speakers labelled m3, m6, m7, m8. Hard models were also more effective than conventional models for the speakers labelled f3, f6, f7, m3, m7 and m8, but for the speakers labelled f2, m1 and m5 they were worse than conventional models. In general, the results of the noise clustering-based fuzzy models are not very dependent on the speaker.

7.4.2 10-Digit and 10-Command Set Results

Tables 7.3 and 7.4 present the experimental results for the recognition of the 10-digit set and the 10-command set using 6-state left-to-right models in speaker-dependent mode. These sets do not contain as highly confusable a vocabulary as the E set, therefore the recognition error rates are low for all models. In these two experiments we can see that FCM-DHMMs and NC-FCM-DHMMs, as well as FE-DHMMs and NC-FE-DHMMs, produce very similar results. This is probably due to the fact that the clusters in the 10-digit and 10-command sets are better separated than those in the E set.

7.4.3 46-Word Set Results

The whole vocabulary of 46 words in the TI46 corpus was used to train continuous models. Based on the results for the 10-digit set and the 10-command set, recognition performance on the 46-word set using noise-clustering-based fuzzy models is possibly not much improved in comparison with fuzzy models.
So in this experiment, only the 3-state 2-mixture models and 5-state 2-mixture models including conventional Conventional FE NC-FE FCM NC-FCM Hard Speaker DHMM DHMM DHMM DHMM DHMM DHMM f1 68.06 54.17 54.11 62.76 43.75 65.28 f2 43.75 47.22 47.15 47.25 47.22 44.44 f3 63.19 43.75 43.75 59.88 54.86 51.39 f4 50.00 31.25 31.24 48.35 39.58 34.72 f5 55.56 54.86 54.81 54.83 54.17 53.47 f6 54.17 29.86 29.86 47.97 28.47 33.33 f7 69.44 44.44 44.41 59.23 40.97 43.06 f8 49.31 31.94 31.92 44.25 31.94 40.28 m1 27.78 26.39 26.39 28.89 28.47 36.11 m2 52.08 29.17 29.15 49.86 38.19 31.25 m3 62.50 44.44 44.42 57.68 49.31 49.31 m4 44.29 37.14 37.05 47.67 46.43 35.71 m5 33.33 39.01 38.89 33.21 31.21 39.01 m6 63.57 50.71 50.71 61.59 52.14 48.57 m7 63.38 40.14 40.12 59.45 45.77 43.66 m8 72.22 59.72 59.69 68.64 51.39 62.50 Female 56.69 42.19 42.16 53.07 42.62 45.75 Male 52.39 40.84 40.80 50.87 42.86 43.27 Average 54.54 41.51 41.48 51.97 42.74 44.51 Table 7.2: Speaker-dependent recognition error rates (%) for the E set 7.4 Isolated Word Recognition Codebook Conventional 107 FE NC-FE FCM NC-FCM Hard Size DHMM DHMM DHMM DHMM DHMM DHMM 16 6.21 4.84 4.32 5.83 4.74 5.03 32 2.25 1.81 1.72 2.16 2.06 1.81 64 0.43 0.37 0.34 0.39 0.38 0.43 128 0.39 0.35 0.35 0.38 0.38 0.43 Table 7.3: Isolated word recognition error rates (%) for the 10-digit set Codebook Conventional FE NC-FE FCM NC-FCM Hard Size DHMM DHMM DHMM DHMM DHMM DHMM 16 15.74 6.45 6.32 13.78 13.74 9.07 32 4.36 3.64 3.52 3.88 3.76 3.50 64 2.43 2.27 2.24 2.28 2.28 2.18 128 1.65 1.45 1.43 1.60 1.58 1.73 Table 7.4: Isolated word recognition error rates (%) for the 10-command set 7.5 Speaker Identification 108 Model Conventional CHMM FE-CHMM FCM-CHMM 3 states 2 mixtures 9.59 9.42 8.96 5 states 2 mixtures 7.19 7.15 6.57 Table 7.5: Isolated word recognition error rates (%) for the 46-word set CHMMs, FE-CHMMs, and FCM-CHMMs were trained for each word. Table 7.4.3 presents recognition performance of these models. In this experiment FCM-CHMMs were performed better than FE-CHMMs and conventional CHMMs. Speaker-dependent results for the conventional CHMMs and FCM-CHMMs are presented in Table 7.4.3. In comparison with conventional CHMMs, recognition error rates for speakers labelled f2, f3, f4, m2, m3 and m6 were improved in both cases using 3-state 2-mixture and 5-state 2-mixture models. There was only a bad result for speaker labelled f8. 7.5 Speaker Identification The TI46, the ANDOSL, and the YOHO corpora were used for speaker identification. GMM-based speaker identification was performed by using conventional GMMs, FEGMMs, NC-FE-GMMs, FCM-GMMs and NC-FCM-GMMs. For VQ-based speaker identification, FE-VQ, FCM-VQ and (hard) VQ codebooks were also used. The vector sequences obtained after the LPC analysis of speech signals were used for training speaker models. Reestimation formulas have been presented in previous sections such as (2.30) page 30 for GMMs, (3.42) and (3.43) page 60 for FE-GMMs, (4.15) and (4.16) page 70 for FCM-GMMs. For GMM-based speaker identification, testing was performed by calculating the probabilities P (X|λi ), i = 1, . . . , M for the unknown X with M speaker models using (2.31) and the recognised speaker is the speaker whose probability is highest. For VQ-based speaker identification, testing was performed by calculating the average distortion D(X, λi ), i = 1, . . . , M for the unknown X with M speaker models using (6.12) on page 88 and the recognised speaker is the speaker whose average distortion is lowest. 
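The identification decisions just described can be sketched as follows (illustrative only): for GMM-based identification the speaker with the highest log-likelihood log P(X|λ_i) is chosen, and for VQ-based identification the speaker whose codebook gives the lowest average nearest-codevector distortion is chosen; loglik_fn stands for whatever routine evaluates (2.31) and is an assumed name.

    import numpy as np

    def identify_speaker_gmm(frames, gmm_models, loglik_fn):
        # GMM-based identification: argmax over speakers of log P(X|lambda_i), cf. (2.31).
        scores = {spk: loglik_fn(frames, gmm) for spk, gmm in gmm_models.items()}
        return max(scores, key=scores.get)

    def identify_speaker_vq(frames, codebooks):
        # VQ-based identification: argmin over speakers of the average
        # nearest-codevector distortion, cf. D(X, lambda_i) in (6.12).
        def avg_distortion(cb):
            d2 = ((frames[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2)  # (T, K)
            return d2.min(axis=1).mean()
        distortions = {spk: avg_distortion(cb) for spk, cb in codebooks.items()}
        return min(distortions, key=distortions.get)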
7.5.1 TI46 Results

The vocabulary was the 10-command set. In the training phase, 100 training tokens (10 utterances x 1 training session x 10 repetitions) of each speaker were used to train conventional GMMs, FCM-GMMs and NC-FCM-GMMs of 16, 32, 64, and 128 mixtures for GMM-based speaker identification, and VQ, FE-VQ and NC-FE-VQ codebooks of 16, 32, 64 and 128 codevectors for VQ-based speaker identification. The degrees of fuzzy entropy and fuzziness were chosen as n = 1 and m = 1.06, respectively. Speaker identification was carried out in text-independent mode by testing all 2560 test tokens (16 speakers x 10 utterances x 8 testing sessions x 2 repetitions) against the conventional GMMs, FCM-GMMs and NC-FCM-GMMs of all 16 speakers for GMM-based speaker identification, and against the VQ, FE-VQ and NC-FE-VQ codebooks of all 16 speakers for VQ-based speaker identification.

Figure 7.4: Speaker identification error rate (%) versus the number of mixtures for 16 speakers, using conventional GMMs, FCM-GMMs and NC-FCM-GMMs

The experimental results are plotted in Figure 7.4 for GMM-based speaker identification. The identification error rate was improved by using 64-mixture and 128-mixture FCM-GMMs and NC-FCM-GMMs. The identification error rates of FCM-GMMs and NC-FCM-GMMs were about 2% and 3% lower than the error rate of conventional GMMs using 64 mixtures [Tran and Wagner 1998]. Figure 7.5 shows the experimental results for VQ-based speaker identification. FE-VQ and NC-FE-VQ also obtained similarly good results to FCM-GMMs and NC-FCM-GMMs.

Figure 7.5: Speaker identification error rate (%) versus the codebook size for 16 speakers, using VQ, FE-VQ and NC-FE-VQ codebooks

7.5.2 ANDOSL Results

For the ANDOSL corpus, the subset of 108 native speakers was used. For each speaker, the set of 200 long sentences was divided into two subsets. The training set, consisting of 10 sentences numbered from 001 to 020, was used to train conventional GMMs, FE-GMMs and FCM-GMMs of 32 mixtures. Very low error rates were obtained for all models, thus noise clustering-based fuzzy models were not considered in this experiment. The test set, consisting of 180 sentences numbered from 021 to 200, was used to test the above models. Speaker identification was carried out by testing all 19440 tokens (180 utterances x 108 speakers) against the conventional GMMs, FE-GMMs and FCM-GMMs of all 108 speakers. The degrees of fuzzy entropy and fuzziness were set to n = 1.01 for FE-GMMs and m = 1.026 for FCM-GMMs, respectively. The experimental results are presented in Table 7.7 for the speaker groups described in Section 7.1.2. The results for the ANDOSL corpus are better than those for the TI46 corpus since a larger training data set was used. Female speakers have higher error rates than male speakers. Elder speakers have lower error rates than young and medium speakers. Broad speakers have the lowest error rates in comparison with cultivated and general speakers.

7.5.3 YOHO Results

For the experiments performed on the YOHO corpus, each speaker was modelled by using 48 training tokens from enrolment sessions 1 and 2 only. Using all four enrolment sessions, as done for example by Reynolds [1995a], resulted in error rates that were too low to allow meaningful comparisons between the different normalisation methods for speaker verification. Conventional GMMs, hard GMMs and VQ codebooks were trained in the training phase.
Speaker identification was carried out in textindependent mode by testing all 8280 test tokens (138 speakers x 10 utterances x 4 sessions) against conventional GMMs, hard GMMs and VQ codebooks of all 138 speakers. Table 7.5.3 shows identification error rates for these models. As expected, results show that hard GMMs performed better than VQ but worse than conventional GMMs. 7.6 Speaker Verification As discussed in Chapter 6, speaker verification performance depends on speaker models and on the normalisation method used to verify speakers. So experiments were performed in three cases: 1) Using proposed speaker models and current normalisation methods, 2) Using conventional speaker models and proposed normalisation methods, and 3) Using proposed speaker models and proposed normalisation methods. Three speech corpora—TI46, ANDOSL, and YOHO—were used and all experiments were in text-independent mode. Speaker models for speaker verification are also those for speaker identification. Verification rules in (6.1) and (6.2) on page 86 were used for similarity-based scores Li (X), i = 3, . . . , 6 and dissimilarity-based scores Dj (X), j = 3, . . . , 9 in this chapter. 7.6.1 TI46 Results The speaker verification experiments were performed on the TI46 corpus to evaluate proposed speaker models. Conventional GMMs, FCM-GMMs and NC-FCM-GMMs of 16, 32, 64, 128 mixtures and VQ, FE-VQ, NC-FE-VQ codebooks of 16, 32, 64, 128 codevectors in speaker identification were used. The normalisation method is L3 (X) 7.6 Speaker Verification 112 in (6.8) page 88 for GMM-based speaker verification, and is D1 (X) in (6.11) page 88 for VQ-based speaker verification. Since the TI46 corpus has a small speaker set consisting of 8 female and 8 male speakers, therefore for a verification experiment, each speaker was used as a claimed speaker with the remaining speakers (including 7 samegender background speakers) acting as impostors and rotating through all speakers. So the total number of claimed test utterances and impostor test utterances are 2560 (16 claimed speakers x 10 test utterances x 8 sessions x 2 repetitions) and 38400 ((16 x 15) impostors x 10 test utterances x 8 sessions x 2 repetitions), respectively. Figure 7.6: EERs (%) for GMM-based speaker verification performed on 16 speakers, using GMMs, FCM-GMMs and NC-FCM-GMMs. Experimental results are plotted in Figure 7.6.1 for GMM-based speaker verification. The lowest EER about 3% obtained for 128-mixture NC-FCM-GMMs. Both FCM-GMMs and NC-FCM-GMMs show better results compared to conventional GMMs, where the highest improvement of about 1% is found for 64-mixture NC-FCM-GMMs. Figure 7.6.1 shows EERs for VQ-based speaker verification using VQ, FE-VQ and NC-FE-VQ codebooks. Similarly, with codebook size of 16 and 32, FE-VQ and NC-FE-VQ obtain high improvements compared to VQ codebooks. 7.6.2 ANDOSL Results Experiments were performed on 108 speakers using each speaker as a claimed speaker with 5 closest background speakers or 5 same-subgroup background speakers as in- 7.6 Speaker Verification 113 Figure 7.7: EERs (%) for VQ-based speaker verification performed on 16 speakers, using VQ, FE-VQ and NC-FE-VQ codebooks dicated in Section 7.1.2 and 102 mixed-gender impostors (excluding 5 background speakers) and rotating through all speakers. The total number of claimed test utterances and impostor test utterances are 20520 (108 claimed speakers x 190 test utterances) and 2093040 ((108 x 102) impostors x 190 test utterances), respectively. 
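All EERs reported in this chapter are obtained from two score populations, one for true claimant tests and one for impostor tests. A minimal sketch of how an EER can be approximated from such populations is given below; the exact threshold sweep is an assumption, as the thesis does not prescribe a particular implementation.

    import numpy as np

    def equal_error_rate(client_scores, impostor_scores):
        # Sweep the decision threshold over all observed scores and return the
        # point where the false rejection rate (clients below the threshold)
        # and the false acceptance rate (impostors at or above it) are closest.
        client = np.asarray(client_scores, dtype=float)
        impostor = np.asarray(impostor_scores, dtype=float)
        best = None
        for th in np.unique(np.concatenate([client, impostor])):
            frr = np.mean(client < th)
            far = np.mean(impostor >= th)
            if best is None or abs(far - frr) < best[0]:
                best = (abs(far - frr), (far + frr) / 2.0, th)
        _, eer, threshold = best
        return eer, threshold     # eer is a fraction; multiply by 100 for %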
In general, EERs obtained using the top 5 background speaker set are lower than those obtained using the 5 subgroup background speaker set. For both 16 and 32 mixture GMMs, proposed normalisation methods D3 (X), D4 (X), . . . , D9 (X), especially D8 (X) (FCM membership) and D9 (X) (using the arctan fuction) produce lower EERs than current normalisation methods L3 (X), . . . , L6 (X). More interesting, noise clustering-based normalisation methods D3nc (X), D4nc (X), . . . , D9nc (X) 7.6 Speaker Verification 114 and L3nc (X), . . . , L6nc (X) produce better results compared to current and proposed methods. The reason has been discussed in the last chapter, noise clustering-based normalisation methods reduce false acceptance cases and hence the EER is also reduced. For 16-mixture GMMs, the geometric mean normalisation method L5 (X) using 5 subgroup background speaker set produces the highest EER of 3.52% whereas the proposed method D8nc (X) using top 5 background speaker set produces the lowest EER of 1.66%. Similarly, for 32-mixture GMMs, L5 (X) using 5 subgroup background speaker set produces the highest EER of 3.26 % and D4nc (X) using top 5 background speaker set produces the lowest EER of 1.26%. 7.6.3 YOHO Results Experiments were performed on 138 speakers using each speaker as a claimed speaker with 5 closest background speakers and 132 mixed-gender impostors (excluding 5 background speakers) and rotating through all speakers. The total number of claimed test utterances and impostor test utterances are 5520 (138 claimed speakers x 40 test utterances) and 728640 ((138 x 132) impostors x 40 test utterances), respectively. Results are shown in two tables as follows. Table 7.6.3 shows results for conventional GMMs and VQ codebooks to compare EERs obtained by using current and proposed normalisation methods. Similar to ANDOSL results, for 16, 32 and 64 mixture GMMs, proposed normalisation methods D3 (X), D4 (X), . . . , D9 (X), especially D9 (X) (using the arctan fuction) produced lower EERs than current normalisation methods L3 (X), . . . , L6 (X). Noise clustering-based normalisation methods D3nc (X), D4nc (X), . . . , D9nc (X) and L3nc (X), . . . , L6nc (X) also produced better results compared to current and proposed methods. The current normalisation method L5 (X) produced the highest EER of 4.47% for 16-mixture GMMs and the proposed method D9nc (X) produced the lowest EER of 1.83% for 64-mixture GMMs. The better performances were also obtained for proposed methods using VQ codebooks. The highest EER of 6.80% was obtained using the current method L5 (X) for VQ codebook size of 16 and the lowest EER of 2.74% was obtained using proposed methods D5nc (X) and D9nc (X). Table 7.6.3 shows results for conventional GMMs, hard GMMs and VQ codebooks to compare EERs obtained by using conventional and proposed models as well as cur- 7.7 Summary and Conclusion 115 rent and proposed normalisation methods. Similar to speaker identification, conventional GMMs produced better results than hard GMMs, and hard GMMs produced better results than VQ codebooks. Results again show better results for proposed normalisation methods. For hard GMMs, the current normalisation method L5 (X) produced the highest EER of 5.90% for 16-component hard GMMs and the proposed method L5nc (X) produced the lowest EER of 3.10% for 32-component hard GMMs. 
7.7 Summary and Conclusion Isolated word recognition, speaker identification and speaker verification experiments have been performed on the TI46, ANDOSL and YOHO corpora to evaluate proposed models and proposed normalisation methods. Fuzzy entropy and fuzzy C-means methods as well as their noise clustering-based versions and hard methods have been applied to train discrete and continuous hidden Markov models, Gaussian mixture models and vector quantisation codebooks. In isolated word recognition, experiments have shown significant results for fuzzy entropy and fuzzy C-means hidden Markov models compared to conventional hidden Markov models performed on the highly confusable vocabulary consisting of the nine English E-set letters: b, c, d, e, g, p, t, u, v and z. The lowest recognition error of 27.87% was obtained for noise clustering-based fuzzy C-means 6-state left-to-right discrete hidden Markov models using VQ codebook size of 64. The highest recognition error of 54.54 % was obtained for conventinal 6-state left-to-right discrete hidden Markov models using VQ codebook size of 16. For the 10-digit and 10-command sets, the lowest errors of 0.34% and 1.43% were obtained for noise clustering-based fuzzy entropy 6-state left-to-right discrete hidden Markov models using VQ codebook size of 64 and 128, respectively. Fuzzy C-means continuous hidden Markov models have shown good results in 46-word recognition with the lowest error of 6.57% obtained for 5-state 2-mixture fuzzy C-means continuous hidden Markov models. In speaker identification, experiments have shown good results for fuzzy entropy vector quantisation codebooks and fuzzy C-means Gaussian mixture models. For 16 speakers of the TI46 corpus, noise clustering-based fuzzy C-means Gaussian mixture models were obtained the lowest error of 12% using 128 mixtures. Noise clustering- 7.7 Summary and Conclusion 116 based fuzzy entropy vector quantisation codebooks were also obtained the lowest error of 10.13% using codebook size of 128. For 108 speakers of the ANDOSL corpus, the lowest error of 0.65% was obtained for noise clustering-based fuzzy C-means Gaussian mixture models using 32 mixtures. Results for 138 speakers of the YOHO corpus have evaluated hard Gaussian mixture models as intermediate models in the reduction of Gaussian mixture models to vector quantisation codebooks. In speaker verification, proposed models and proposed normalisation methods have been evaluated. For proposed models using current normalisation methods, the lowest EERs obtained for 16 speakers of the TI46 corpus were about 3% using 128-mixture noise clustering-based fuzzy C-means Gaussian mixture models and about 3.2% using 128-codevector noise clustering-based fuzzy entropy vector quantisation codebooks. For proposed normalisation methods using conventional models, the lowest EER obtained for 108 speakers of the ANDOSL corpus was 1.22% for the noise clustering-based normalisation method using the fuzzy C-means membership function as a similarity score for 32-mixture Gaussian mixture speaker models. The lowest EER for 138 speakers of the YOHO corpus was 1.83% for the noise clusteringbased normalisation method using the arc tan function as a dissimilarity score for 64-mixture Gaussian mixture speaker models. 
7.7 Summary and Conclusion 117 Conventional CHMM FCM-CHMM Speaker 3 states 2 mixtures 5 states 2 mixtures 3 states 2 mixtures 5 states 2 mixtures f1 12.26 7.22 11.31 9.40 f2 8.03 6.53 5.03 5.85 f3 12.24 9.93 10.61 7.62 f4 7.07 7.48 5.99 3.81 f5 15.22 9.92 14.40 10.60 f6 10.19 7.07 8.97 7.47 f7 7.47 5.43 6.52 5.30 f8 6.39 4.49 7.89 4.90 m1 5.77 4.12 4.81 4.40 m2 3.94 2.17 2.72 1.09 m3 12.96 10.09 10.23 8.87 m4 8.90 5.98 9.04 4.45 m5 5.69 3.61 6.24 3.19 m6 13.91 11.85 13.36 11.02 m7 9.74 6.86 10.70 6.31 m8 13.59 12.23 15.49 10.73 Female 9.86 7.26 8.84 6.87 Male 9.32 7.12 9.08 6.27 Average 9.59 7.19 8.96 6.57 Table 7.6: Speaker-dependent recognition error rates (%) for the 46-word set 7.7 Summary and Conclusion Speaker GMM 118 FE-GMM FCM-GMM Groups 16 mixtures 32 mixtures 16 mixtures 32 mixtures 16 mixtures 32 mixtures fyb 0.83 0.54 0.83 0.52 0.50 0.58 fyc 3.95 2.55 3.91 2.51 3.92 2.50 fyg 3.38 2.78 3.35 2.75 3.08 2.17 fmb 2.03 0.98 2.01 0.89 1.75 0.42 fmc 2.25 1.95 2.25 1.89 2.17 1.50 fmg 2.08 1.87 2.05 1.85 1.83 1.50 feb 0.75 0.08 0.71 0.05 0.00 0.00 fec 3.92 1.74 3.90 1.70 3.92 1.58 feg 1.17 0.38 1.13 0.35 0.42 0.33 myb 1.17 0.44 1.06 0.40 0.92 0.25 myc 0.42 0.29 0.37 0.28 0.25 0.17 myg 0.42 0.30 0.38 0.28 0.58 0.25 mmb 0.92 0.17 0.86 0.16 0.50 0.17 mmc 0.42 0.09 0.47 0.09 0.33 0.08 mmg 0.33 0.07 0.30 0.05 0.08 0.00 meb 0.25 0.00 0.21 0.00 0.17 0.00 mec 0.08 0.04 0.06 0.04 0.08 0.00 meg 0.08 0.08 0.08 0.06 0.08 0.17 female 2.26 1.43 2.24 1.39 1.95 1.18 male 0.45 0.16 0.42 0.15 0.33 0.12 young 1.70 1.15 1.65 1.12 1.54 0.99 medium 1.34 0.86 1.32 0.82 1.11 0.61 elder 1.04 0.39 1.02 0.37 0.78 0.35 broad 0.99 0.37 0.95 0.34 0.64 0.24 cultivated 1.84 1.11 1.83 1.09 1.78 0.97 general 1.24 0.91 1.22 0.89 1.01 0.74 Average 1.36 0.80 1.33 0.77 1.14 0.65 Table 7.7: Speaker identification error rates (%) for the ANDOSL corpus using conventional GMMs, FE-GMMs and FCM-GMMs. 7.7 Summary and Conclusion GMM 119 Hard GMM VQ Speaker 16 mixtures 32 mixtures 16 components 32 components 16 vectors 32 vectors Female 15.08 7.27 14.14 6.88 17.11 8.91 Male 11.88 5.09 13.94 9.25 15.90 7.36 Average 13.48 6.18 14.04 8.06 16.5 8.13 Table 7.8: Speaker identification error rates (%) for the YOHO corpus using conventional GMMs, hard GMMs and VQ codebooks. 7.7 Summary and Conclusion Normalisation Methods 120 Top 5 Background Speaker Set 5 Subgroup Background Speaker Set 16-mixture GMM 32-mixture GMM 16-mixture GMM 32-mixture GMM L3 (X) 2.16 1.79 2.68 2.11 D3 (X) 1.84 1.51 2.24 1.61 L3nc (X) 1.81 1.40 2.00 1.41 D3nc (X) 1.79 1.35 1.89 1.33 L4 (X) 2.16 1.79 2.68 2.10 D4 (X) 1.90 1.47 2.15 1.46 L4nc (X) 1.81 1.40 2.00 1.41 D4nc (X) 1.81 1.36 1.88 1.32 L5 (X) 3.16 3.03 3.51 3.26 D5 (X) 2.61 2.20 2.77 2.14 L5nc (X) 2.02 1.97 2.32 2.10 D5nc (X) 1.99 1.71 2.27 1.68 D7 (X) 2.11 1.84 2.63 2.07 D7nc (X) 1.97 1.51 2.12 1.59 D8 (X) 1.91 1.53 2.34 1.69 D8nc (X) 1.66 1.26 1.86 1.32 D9 (X) 2.08 1.66 2.33 1.69 D9nc (X) 1.89 1.54 2.23 1.57 Table 7.9: EER Results (%) for the ANDOSL corpus using GMMs with different background speaker sets. Rows in bold are the current normalisation methods , others are the proposed methods. The index “nc” denotes noise clustering-based methods. 
7.7 Summary and Conclusion Normalisation Methods 121 GMM VQ 16 mixtures 32 mixtures 64 mixtures 16 vectors 32 vectors 64 vectors L3 (X) 4.42 3.12 2.43 5.87 4.60 3.60 D3 (X) 4.12 2.94 1.98 5.41 4.06 3.14 L3nc (X) 4.20 2.89 1.96 4.96 4.25 3.50 D3nc (X) 4.25 2.90 1.96 4.76 3.84 3.09 L4 (X) 4.41 3.13 2.43 5.85 4.58 3.58 D4 (X) 4.20 2.97 1.99 5.21 3.90 3.06 L4nc (X) 4.20 2.89 1.96 4.96 4.23 3.49 D4nc (X) 4.21 2.87 1.97 4.72 3.76 3.04 L5 (X) 4.47 3.30 2.44 6.80 5.06 4.22 D5 (X) 4.10 2.98 2.05 5.32 3.81 3.10 L5nc (X) 3.87 2.74 1.87 4.48 3.36 2.91 D5nc (X) 3.75 2.76 1.88 4.49 3.33 2.74 D7 (X) 4.41 3.20 2.44 6.09 4.75 3.62 D7nc (X) 4.30 3.10 2.27 5.10 4.39 3.54 D8 (X) 4.17 2.97 2.05 5.29 3.99 3.11 D8nc (X) 4.16 2.95 2.04 4.50 3.41 2.97 D9 (X) 3.89 2.84 1.85 4.51 3.31 2.76 D9nc (X) 3.86 2.81 1.83 4.34 3.27 2.74 Table 7.10: Equal Error Rate (EER) Results (%) for the YOHO corpus. Rows in bold are the current normalisation methods , others are the proposed methods. 7.7 Summary and Conclusion GMM Normalisation 122 Hard GMM VQ Methods 16 mixtures 32 mixtures 16 components 32 components 16 vectors 32 vectors L3 (X) 4.42 3.12 5.22 3.99 5.87 4.60 D3 (X) 4.12 2.94 4.60 3.40 5.41 4.06 L3nc (X) 4.20 2.89 5.07 3.88 4.96 4.25 D3nc (X) 4.25 2.90 4.38 3.33 4.76 3.84 L4 (X) 4.41 3.13 5.20 4.00 5.85 4.58 D4 (X) 4.20 2.97 4.66 3.48 5.21 3.90 L4nc (X) 4.20 2.89 4.57 3.89 4.96 4.23 D4nc (X) 4.21 2.87 4.57 3.45 4.72 3.76 L5 (X) 4.47 3.30 5.90 4.42 6.80 5.06 D5 (X) 4.10 2.98 4.81 3.41 5.32 3.81 L5nc (X) 3.87 2.74 4.26 3.10 4.48 3.36 D5nc (X) 3.75 2.76 4.30 3.19 4.49 3.33 D7 (X) 4.41 3.20 5.22 3.97 6.09 4.75 D7nc (X) 4.30 3.10 5.03 3.92 5.10 4.39 D8 (X) 4.17 2.97 4.86 3.62 5.29 3.99 D8nc (X) 4.16 2.95 4.45 3.21 4.50 3.41 D9 (X) 3.89 2.84 4.36 3.18 4.51 3.31 D9nc (X) 3.86 2.81 4.32 3.13 4.34 3.27 Table 7.11: Comparisons of EER Results (%) for the YOHO corpus using GMMs, hard GMMs and VQ codebooks. Rows in bold are the current normalisation methods, others are the proposed methods. Chapter 8 Extensions of the Thesis The fuzzy membership function in fuzzy set theory and the a posteriori probability in the Bayes decision theory have similar roles. However the minimum recognition error rate for a recogniser is obtained by the maximum a posteriori probability rule whereas the maximum membership rule does not lead to such a minimum error rate. This problem can be solved by using a recently developed branch of fuzzy set theory, namely possibility theory. The development of this theory has led to a theory framework similar to that of probability theory. Therefore, a possibilistic pattern recognition approach to speech and speaker recognition may be developed which is powerful as the statistical pattern recognition approach. In this chapter, possibility theory is introduced briefly and a small application of this theory, namely a possibilistic C-means approach to speech and speaker recognition is proposed. It is suggested that future research into the possibilistic pattern recognition approach to speech and speaker recognition may be very promising. 8.1 8.1.1 Possibility Theory-Based Approach Possibility Theory Let us review the basis of the statistical pattern recognition approach in Section 2.3 page 20. The goal of a recogniser is to achieve the minimum recognition error rate. The maximum a posteriori probability (MAP) decision rule is selected to implement this goal. However, these probabilities are not known in advance and have to be estimated from a training set of observations with known class labels. 
The Bayes 123 8.1 Possibility Theory-Based Approach 124 decision theory thus effectively transforms the recogniser design problem into the distribution estimation problem and the MAP rule is transformed to the maximum likelihood rule. The distributions are usually parameterised in order to be practically implementable. The main task is to determine the right parametric form of the distributions from the training data. For the fuzzy pattern recognition approach in Section 2.4 page 34, the maximum membership rule is selected to solve the recogniser design problem. This rule is transformed to the minimum distance rule in fuzzy cluster analysis. However, it has not been shown that the implementation of these rules leads to the goal of the recogniser, i.e. the minimum recognition error rate as in the statistical pattern recognition approach. The fuzzy approaches proposed in this thesis have shown a solution for this problem. By defining the general distance as a decreasing function of the component distribution, the minimum distance rule becomes the maximum likelihood rule. This means that the minimum recognition error rate can be achieved in the proposed approaches. An alternative approach that is also based on fuzzy set theory is the use of possibility theory introduced by Zadeh [1978], where fuzzy variables are associated with possibility distributions in a similar way to that in which random variables are associated with probability distributions. A possibility distribution is a representation of knowledge and information. It can be said that probability theory is used to deal with randomness and possibility is used to deal with vagueness and ambiguity [Tanaka and Guo 1999]. Possibility theory is more concerned with the modelling of partial belief which is due to incomplete data rather than that which is due to the presence of random phenomena [Dubois and Prade 1988]. In possibilistic data analysis, the total error possibility is defined and the maximum possibility rule is formulated. The possibility distributions are also parameterised and the right parametric form can be determined from the training data. In a new application of the possibilistic approach to operations research [Tanaka and Guo 1999], an exponential possibility distribution that is similar to a Gaussian distribution has been proposed. Similarly, applying the possibilistic approach to speech and speaker recognition would be worth investigating. 8.1 Possibility Theory-Based Approach 8.1.2 125 Possibility Distributions A possibility distribution on a one-dimensional space is a fuzzy membership function of a point x in a fuzzy set A and is denoted as ΠA (x). For a set of n numbers h1 , . . . , hn , let the h-level sets Ahi = {x|ΠA (x) ≥ hi } be conventional sets (intervals) such that if h1 ≤ h2 ≤ . . . ≤ hn then Ah1 ⊇ Ah2 ⊇ . . . ⊇ Ahn . The distribution ΠA (x) should satisfy the following conditions • There exists an x such that ΠA (x) = 1 (normality), • h-level sets of fuzzy numbers are convex (convexity), • ΠA (x) is piecewise continuous (continuity). The possibility function is a unimodal function. The possibility distribution on the d-dimensional space is similarly defined. Let A be a fuzzy vector defined as A = {(x1 , . . . , xd )|x1 ∈ A1 , . . . , xd ∈ Ad } (8.1) where A1 , . . . , Ad are fuzzy sets. Denoting x = (x1 , . . . , xd )′ , the possibility distribution of A can be defined by ΠA (x) = ΠA1 (x1 ) ∧ . . . 
∧ ΠAd (xd ) (8.2) For example, an exponential possibility distribution on the d-dimensional space can be described as ΠA (x) = exp { − (x − m)′ S−1 A (x − m)} (8.3) where m is a centre vector and SA is a symmetric positive definite matrix. The parametric representation of the exponential possibility distribution is λ = (m, SA ). 8.1.3 Maximum Possibility Rule Consider the simplest case where two classes A and B are characterised by two possibility distributions ΠA (x) and ΠB (x), respectively. The task is to classify the vector x into A or B. Let uA (x) and uB (x) be the degrees of possibility to which x belongs to A and B, respectively. The error possibilities that x belonging to A is assigned to B and vice versa are denoted as E(A → B) = max uB (x)ΠA (x) x E(B → A) = max uA (x)ΠB (x) x (8.4) 8.2 Possibilistic C-Means Approach 126 The total error possibility E can be defined as E = E(A → B) + E(B → A) (8.5) It can further be shown that [Tanaka and Guo 1999] E ≥ max [uB∗ (x)ΠA (x) + uA∗ (x)ΠB (x)] (8.6) x where uA ∗(x) = { 1 ΠA (x) ≥ ΠB (x) 0 otherwise uB ∗(x) = { 1 ΠB (x) > ΠA (x) 0 otherwise (8.7) Then we obtain the maximum possibility rule written as an if-then rule as follows If ΠA (x) ≥ ΠB (x) then x belongs to A, If ΠA (x) < ΠB (x) then x belongs to B. 8.2 8.2.1 (8.8) Possibilistic C-Means Approach Possibilistic C-Means Clustering As shown in the noise clustering method (see Section 2.4.7 on page 39), the FCM method uses the probabilistic constraint that the memberships of a feature vector xt across clusters must sum to one. It is meant to avoid the trivial solution of all memberships being equal to zero. However, since the memberships generated by this constraint are relative numbers, they are not suitable for applications in which the memberships are supposed to represent typicality or compatibility with an elastic constraint. A possibilistic C-means (PCM) clustering has been proposed to generate memberships that have a typicality interpretation [Krishnapuram and Keller 1993]. Following the fuzzy set theory [Zadeh 1965], the membership uit = ui (xt ) is the degree of compatibility of the feature vector xt with cluster i, or the possibility of xt belonging to cluster i. If the clusters represented by the clusters are thought of as a set of fuzzy subsets defined over the domain of discourse X = {x1 , . . . , xt }, then there should be no constraint on the sum of the memberships. Let U = [uit ] be a matrix whose elements are memberships of xt in cluster i, i = 1, . . . , C, t = 1, . . . , T . Possibilistic C-partition space for X is the set of matrices 8.2 Possibilistic C-Means Approach 127 U such that 0 ≤ uit ≤ 1 ∀i, t, max uit > 0 ∀t, 1≤i≤C 0< T ∑ uit < T ∀i t=1 (8.9) The objective function may be formulated as follows Jm (U, λ; X) = T C ∑ ∑ 2 um it dit + i=1 t=1 C ∑ ηi i=1 T ∑ (1 − uit )m (8.10) t=1 where U = {uit } is a possibilistic C-partition of X, m > 1, and ηi , i = 1, . . . , C are suitable positive numbers. Minimising the PCM objective function Jm (U, λ; X) in (8.10) demands the distances in the first term to be as low as possible and the uit in the second term to be as large as possible, thus avoiding the trivial solution. Parameters are estimated similar to those in the NC approach. Minimising (8.10) with respect to uit gives uit = 1 ) 1 d2 m−1 1 + it ηi (8.11) ( Equation (8.11) defines a possibility distribution function Πi for cluster i over the domain of discourse consisting of all feature vectors xt ∈ X. 
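A minimal sketch of the PCM membership update (8.11) is given below, together with the practical choice of the scale parameters η_i in (8.12) discussed in the following paragraph; the array shapes are illustrative assumptions.

    import numpy as np

    def pcm_memberships(d2, eta, m=2.0):
        # Possibilistic C-means membership (8.11):
        # u_it = 1 / (1 + (d_it^2 / eta_i)^(1/(m-1)))
        # d2: (C, T) squared distances, eta: (C,) per-cluster scale parameters.
        d2 = np.asarray(d2, dtype=float)
        eta = np.asarray(eta, dtype=float)[:, None]
        return 1.0 / (1.0 + (d2 / eta) ** (1.0 / (m - 1.0)))

    def pcm_eta(u, d2, m=2.0, K0=1.0):
        # Scale parameters eta_i (8.12): a fuzzily weighted mean of the squared
        # distances within each cluster, typically with K0 = 1.
        w = u ** m
        return K0 * (w * d2).sum(axis=1) / w.sum(axis=1)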
The value of η_i determines the relative degree to which the second term in the objective function is important compared with the first. In general, η_i relates to the overall size and shape of cluster i. In practice, the following definition works well:

$$\eta_i = K_0 \sum_{t=1}^{T} u_{it}^m d_{it}^2 \Big/ \sum_{t=1}^{T} u_{it}^m \qquad (8.12)$$

where typically K_0 is chosen to be one.

8.2.2 PCM Approach to FE-HMMs

For the possibilistic C-means (PCM) approach, based on (8.9) in Section 8.2.1, the matrix U = [u_ijt] is defined by

$$0 \le u_{ijt} \le 1 \;\;\forall i, j, t, \qquad \max_{1 \le i, j \le N} u_{ijt} > 0 \;\;\forall t, \qquad 0 < \sum_{t=1}^{T} u_{ijt} < T \;\;\forall i, j \qquad (8.13)$$

The fuzzy likelihood function in the PCM approach is as follows:

$$L_n(U, \lambda; O) = -\sum_{t=0}^{T-1} \sum_{i=1}^{N} \sum_{j=1}^{N} u_{ijt} d_{ijt}^2 - \sum_{i=1}^{N} \sum_{j=1}^{N} n_{ij} \sum_{t=0}^{T-1} (u_{ijt} \log u_{ijt} - u_{ijt}) \qquad (8.14)$$

where the degree of fuzzy entropy n_ij > 0 is dependent on the states. Since f(u_ijt) = u_ijt log u_ijt − u_ijt is a monotonically decreasing function in [0, 1], maximising the fuzzy likelihood L_n(U, λ; O) in (8.14) over U forces u_ijt to be as large as possible. By setting the derivative of L_n(U, λ; O) with respect to u_ijt to zero, we obtain the algorithm for the FE-DHMM in the PCM approach (PCM-FE-DHMM):

Fuzzy E-Step:

$$u_{ijt} = \exp\Big\{ \frac{-d_{ijt}^2}{n_{ij}} \Big\} = [P(O, s_t = i, s_{t+1} = j \mid \lambda)]^{1/n_{ij}} \qquad (8.15)$$

M-Step: The M-step is identical to the M-step of the FE-DHMM in Section 3.4.2.

Similarly, the FE-CHMM in the PCM approach (PCM-FE-CHMM) is as follows:

Fuzzy E-Step:

$$u_{jkt} = \exp\Big\{ \frac{-d_{jkt}^2}{n_{jk}} \Big\} = [P(O, s_t = j, m_t = k \mid \lambda)]^{1/n_{jk}} \qquad (8.16)$$

M-Step: The M-step is identical to the M-step of the FE-CHMM in Section 3.4.3.

8.2.3 PCM Approach to FCM-HMMs

For the possibilistic C-means (PCM) approach, based on (8.9) in Section 8.2.1 (page 127), the matrix U = [u_ijt] is defined by

$$0 \le u_{ijt} \le 1 \;\;\forall i, j, t, \qquad \max_{1 \le i, j \le N} u_{ijt} > 0 \;\;\forall t, \qquad 0 < \sum_{t=1}^{T} u_{ijt} < T \;\;\forall i, j \qquad (8.17)$$

The fuzzy objective function in the PCM approach is as follows:

$$J_m(U, \lambda; O) = \sum_{t=0}^{T-1} \sum_{i=1}^{N} \sum_{j=1}^{N} u_{ijt}^m d_{ijt}^2 + \sum_{i=1}^{N} \sum_{j=1}^{N} \eta_{ij} \sum_{t=0}^{T-1} (1 - u_{ijt})^m \qquad (8.18)$$

where η_ij are suitable positive numbers. The second term forces u_ijt to be as large as possible. By setting the derivative of J_m(U, λ; O) with respect to u_ijt to zero, we obtain the algorithm for the FCM-DHMM in the PCM approach (PCM-FCM-DHMM):

Fuzzy E-Step:

$$u_{ijt} = \frac{1}{1 + \left( d_{ijt}^2 / \eta_{ij} \right)^{1/(m-1)}} \qquad (8.19)$$

where d_ijt is defined in (3.21).

M-Step: The M-step is identical to the M-step of the FCM-DHMM in Section 4.2.1.

Similarly, the FCM-CHMM in the PCM approach (PCM-FCM-CHMM) is as follows:

Fuzzy E-Step:

$$u_{jkt} = \frac{1}{1 + \left( d_{jkt}^2 / \eta_{jk} \right)^{1/(m-1)}} \qquad (8.20)$$

where d_jkt is defined in (3.30).

M-Step: The M-step is identical to the M-step of the FCM-CHMM in Section 4.2.2.

8.2.4 PCM Approach to FE-GMMs

The matrix U = [u_it] is defined by

$$0 \le u_{it} \le 1 \;\;\forall i, t, \qquad \max_{1 \le i \le K} u_{it} > 0 \;\;\forall t, \qquad 0 < \sum_{t=1}^{T} u_{it} < T \;\;\forall i \qquad (8.21)$$

The fuzzy likelihood function in the PCM approach is as follows:

$$L_n(U, \lambda; X) = -\sum_{t=1}^{T} \sum_{i=1}^{K} u_{it} d_{it}^2 - \sum_{i=1}^{K} n_i \sum_{t=1}^{T} (u_{it} \log u_{it} - u_{it}) \qquad (8.22)$$

where the degree of fuzzy entropy n_i > 0 is dependent on the clusters. The algorithm for the FE-GMM in the PCM approach (PCM-FE-GMM) is as follows:

Fuzzy E-Step:

$$u_{it} = \exp\Big\{ \frac{-d_{it}^2}{n_i} \Big\} = [P(x_t, i \mid \lambda)]^{1/n_i} \qquad (8.23)$$

M-Step: The M-step is identical to the M-step of the FE-GMM in Section 3.5.1.
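The following is a minimal sketch, not from the thesis, of the PCM-FE-GMM fuzzy E-step (8.23). It assumes the general distance d_it² = −log P(x_t, i | λ) with P(x_t, i | λ) = w_i N(x_t; μ_i, Σ_i) for a Gaussian mixture; the function names and the scipy dependency are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def pcm_fe_gmm_e_step(X, weights, means, covs, n):
    """Fuzzy E-step of the PCM-FE-GMM, equation (8.23):
    u_it = exp(-d_it^2 / n_i) = P(x_t, i | lambda)^(1 / n_i),
    taking d_it^2 = -log P(x_t, i | lambda) and
    P(x_t, i | lambda) = w_i N(x_t; mu_i, Sigma_i) (illustrative assumption).

    X: (T, d) feature vectors; weights: (K,) mixture weights;
    means: (K, d); covs: (K, d, d); n: (K,) degrees of fuzzy entropy n_i > 0.
    Returns U of shape (K, T); the memberships of a vector x_t need not
    sum to one over the K clusters."""
    T = X.shape[0]
    K = len(weights)
    U = np.zeros((K, T))
    for i in range(K):
        joint = weights[i] * multivariate_normal.pdf(X, mean=means[i], cov=covs[i])
        U[i] = joint ** (1.0 / n[i])
    return U
```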
8.2.5 PCM Approach to FCM-GMMs

The matrix U = [u_it] is defined by

$$0 \le u_{it} \le 1 \;\;\forall i, t, \qquad \max_{1 \le i \le K} u_{it} > 0 \;\;\forall t, \qquad 0 < \sum_{t=1}^{T} u_{it} < T \;\;\forall i \qquad (8.24)$$

The fuzzy objective function in the PCM approach is as follows:

$$J_m(U, \lambda; X) = \sum_{t=1}^{T} \sum_{i=1}^{K} u_{it}^m d_{it}^2 + \sum_{i=1}^{K} \eta_i \sum_{t=1}^{T} (1 - u_{it})^m \qquad (8.25)$$

where η_i, i = 1, ..., K, are suitable positive numbers. The algorithm for the FCM-GMM in the PCM approach (PCM-FCM-GMM) is as follows:

Fuzzy E-Step:

$$u_{it} = \frac{1}{1 + \left( d_{it}^2 / \eta_i \right)^{1/(m-1)}} \qquad (8.26)$$

M-Step: The M-step is identical to the M-step of the FCM-GMM in Section 4.3.1.

8.2.6 Summary and Conclusion

The theory of possibility, in contrast to heuristic approaches, offers algorithms for composing hypothesis evaluations which are consistent with the axioms of a well-developed theory. Possibility theory, rather than probability theory, relates to the perception of degrees of evidence instead of degrees of likelihood. In any case, using possibilities does not prevent us from using statistics in the estimation of membership functions [De Mori and Laface 1980]. In the PCM approach, the membership values can be interpreted as possibility values or degrees of typicality of vectors in clusters. The possibilistic C-partition defines C distinct possibility distributions, and the PCM algorithm can be used to estimate possibility distributions directly from training data. The role of PCM clustering and PCM models can be seen by comparing Figures 8.1, 8.2 and 8.3 below with the corresponding figures on pages 41, 65 and 75.

[Figure 8.1: PCM clustering in clustering techniques. The figure places PCM and extended PCM alongside HCM, FCM, NC, extended NC, Gustafson-Kessel and Gath-Geva clustering, organised by their membership constraints and parameter sets λ = {µ}, {µ, Σ} and {µ, Σ, w}.]

[Figure 8.2: PCM approach to FE models for speech and speaker recognition. The figure shows the FE discrete and continuous models (FE DHMM, FE CHMM, FE GMM and FE VQ) together with their NC-FE and PCM-FE variants.]

[Figure 8.3: PCM approach to FCM models for speech and speaker recognition. The figure shows the FCM discrete and continuous models (FCM DHMM, FCM CHMM, FCM GMM and FCM VQ) together with their NC-FCM and PCM-FCM variants.]

Chapter 9 Conclusions and Future Research

9.1 Conclusions

Fuzzy approaches to speech and speaker recognition have been proposed and experimentally evaluated in this thesis. To obtain these approaches, the following basic problems have been solved. First, the time-dependent fuzzy membership function has been introduced to hidden Markov modelling to denote the degree of belonging of an observation sequence to a state sequence. Second, a relationship between modelling techniques and clustering techniques has been established by using a general distance defined as a decreasing function of the component probability density. Third, a relationship between fuzzy models and conventional models has also been established by introducing a new technique, fuzzy entropy clustering. Finally, since the roles of the fuzzy membership function and the a posteriori probability in the Bayes decision theory are quite similar, the maximum a posteriori rule can be generalised to the maximum membership rule.
With the above general distance, the use of the maximum membership rule also achieves the minimum recognition error rate for a speech and speaker recogniser.

Fuzzy entropy models are the first set of proposed models in the fuzzy modelling approach. A parameter is introduced as the degree of fuzzy entropy n > 0. With n → 0, we obtain hard models. With n = 1, fuzzy entropy models reduce to conventional models in the maximum likelihood scheme. Thus, statistical models can be viewed as special cases of fuzzy models. Fuzzy entropy hidden Markov models, fuzzy entropy Gaussian mixture models and fuzzy entropy vector quantisation have all been proposed.

Fuzzy C-means models are the second set of proposed models in the fuzzy modelling approach. A parameter is introduced as the degree of fuzziness m > 1. With m → 1, we also obtain hard models. Fuzzy C-means hidden Markov models and fuzzy C-means Gaussian mixture models have both been proposed.

Noise clustering is an interesting fuzzy approach to fuzzy entropy and fuzzy C-means models. This approach is simple but robust and has shown very good results in the experimental evaluations. In general, fuzzy entropy and fuzzy C-means models share a common advantage, namely the adjustable parameters n and m. When conventional models do not work well because of the insufficient training data problem or the complexity of speech data, such as the nine English E-set words, a suitable value of n or m can be found to obtain better models.

Hard models are the third set of proposed models. These models are regarded as a consequence of fuzzy models as the fuzzy parameters n and m tend to their limit values. Hard HMMs are the single-state sequence HMMs. Conventional HMMs using the Viterbi algorithm can be regarded as "pretty" hard HMMs.

The fuzzy approach to speaker verification is an alternative fuzzy approach in this thesis. Based on the use of the fuzzy membership function as the claimed speaker's score and on consideration of the likelihood transformation, six fuzzy membership scores and ten noise clustering-based scores have been proposed. Using the arctan function in computing the score illustrates a theoretical extension of normalisation methods, where not only the logarithm function but also other functions can be used.

Isolated word recognition, speaker identification and speaker verification experiments have been performed on the TI46, ANDOSL and YOHO corpora to evaluate the proposed models and the proposed normalisation methods. In isolated word recognition, experiments on the highly confusable vocabulary of the English E-set letters b, c, d, e, g, p, t, u, v and z have shown very good results for fuzzy entropy and fuzzy C-means hidden Markov models compared to conventional hidden Markov models. In speaker identification, experiments have shown good results for fuzzy entropy vector quantisation codebooks and fuzzy C-means Gaussian mixture models. In speaker verification, experiments have shown better results for the proposed normalisation methods, especially for the noise clustering-based methods. With 2,093,040 test utterances for each ANDOSL result and 728,640 test utterances for each YOHO result, these evaluation experiments are sufficiently reliable.

9.2 Directions for Future Research

Several directions for future research have been suggested which may extend or augment the work in this thesis.
These are: • Possibilistic Pattern Recognition Approach: As shown in Chapter 8, this approach would be worth investigating. A possibility theory framework to replace the Bayes decision theory framework for the minimum recognition error rate task of a recogniser looks very promising. • Fuzzy Entropy Clustering: will be further considered in both theoretical and experimental aspects such as the local minima, convergence, cluster validity, cluster analysis and classifier design in pattern recognition. • Fuzzy Approach to Discriminative Methods: The fuzzy approaches proposed in this thesis are based on maximum likelihood-based methods. Since discriminative methods such as maximum mutual information and generalised probabilistic descent are also effective methods, finding a fuzzy approach to these methods should be studied. • Large Vocabulary Speech Recognition: The speech recognition experiments in this thesis were isolated word recognition experiments on small vocabularies. Therefore, to obtain a better evaluation for the proposed fuzzy models, continuous speech recognition experiments on large vocabularies should be carried out. • Likelihood Transformations: Since speaker verification has many important applications, other likelihood transformations should be studied to find more effective normalisation methods for speaker verification. Bibliography [Abramson 1963] N. Abramson, Information Theory and Coding, McGraw Hill, 1963. [Allerhand 1987] M. Allerhand, Knowledge-Based Speech Pattern Recognition, Kogan Page Ltd, London, 1987. [Ambroise et al. 1997] C. Ambroise, M. Dang and G. Govaert, “Clustering of Spatial Data by the EM Algorithm”, in A. Soares, J. Gomez-Hernandez and R. Froidevaux (eds), geoENV I - Geostatistics for Environmental Applications, vol. 9 of Quantitative Geology and Geostatistics, Kluwer Academic Publisher, pp. 493-504, 1997. [Ambroise and Govaert 1998] C. Ambroise and G. Govaert, “Convergence Proof of an EM-Type Algorithm for Spatial Clustering”, Pattern Reccognition Letters, vol. 19, pp. 919-927, 1998. [Atal 1974] B. S. Atal, “Effective of Linear Prediction Characteristics of Speech Wave for Automatic Speaker Identification and Verification”, J. Acoust. Soc. Am., vol. 55, pp. 1304-1312, 1974. [AT&T’s web site] http://www.research.att.com [Bakis 1976] R. Bakis, “Continuous Speech Word Recognition via Centisecond Acoustic States”, in Proc. ASA Meeting (Washington, DC), April, 1976. [Banon 1981] G. Banon, “Distinction Between Several Subsets of Fuzzy Measures”, Fuzzy Sets and Systems, vol. 5, pp. 291-305, 1981. [Baum 1972] L. E. Baum, “An inequality and associated maximisation technique in statistical estimation for probabilistic functions of a Markov process”, Inequalities, vol. 3, pp. 1-8, 1972. [Baum and Sell 1968] L. E. Baum and G. Sell, “Growth transformations for functions on manifolds”, Pacific J. Maths., vol. 27, pp. 211-227, 1968. [Baum and Eagon 1967] L. E. Baum and J. A. Eagon, “An inequality with applications to statistical estimation for probabilistic functions of a Markov process and to a model for Ecology”, Bull. Amer. Math. Soc., vol. 73, pp. 360-363, 1967. 136 BIBLIOGRAPHY 137 [BBN’s web site] http://www.gte.com/AboutGTE/gto/bbnt/speech/research/technologies/index.html. [Bellegarda 1996] J. R. Bellegarda, “Context dependent vector quantization for speech recognition”, chapter 6 in Automatic Speech and Speaker Recognition, Advanced Topics, edited by Chin-Hui Lee, Frank K. Soong, and Kuldip K. 
Paliwal, Kluwer Academic Publishers, USA, pp. 133-158, 1996. [Bellegarda and Nahamoo 1990] J. R. Bellegarda and D. Nahamoo, “Tied mixture continuous parameter modelling for speech recognition”, in IEEE Trans. Acoustics, Speech, Signal Proc., vol. 38, pp. 2033-2045, 1990. [Bellman et al. 1966] R. Bellman, R. Kalaba and L. A. Zadeh, “Abstraction and Pattern Recognition”, J. Math. Anal. Appl., vol. 13, pp. 1-7, 1966. [Bezdek et al. 1998] J. C. Bezdek, T. R. Reichherzer, G. S. Lim and Y. Attikiouzel, “Multipleprototype classifier design”, IEEE Trans. Syst. Man Cybern., vol. 28, no. 1, pp. 67-79, 1998. [Bezdek 1993] J. C. Bezdek, “A review of probabilistic, fuzzy and neural models for pattern recognition”, J. Intell. and Fuzzy Syst., vol. 1, no. 1, pp. 1-25, 1993. [Bezdek and Pal 1992] J. C. Bezdek and S. K. Pal, Fuzzy Models for Pattern Recognition, IEEE Press, 1992. [Bezdek 1990] J. C. Bezdek, “A Convergence Theorem for the Fuzzy ISODATA Clustering Algorithms”, IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-2, no.1, pp. 1-8, January 1990. [Bezdek 1981] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York and London, 1981. [Bezdek and Castelaz 1977] J. C. Bezdek and P. F. Castelaz, “Prototype classification and feature selection with fuzzy sets”, IEEE Trans. Syst. Man Cybern., vol. SMC-7, no. 2, pp. 87-92, 1977. [Bezdek 1974] J. C. Bezdek, “Cluster validity with fuzzy sets”, J. Cybern., vol. 3, no. 3, pp. 58-72, 1974. [Bezdek 1973] J. C. Bezdek, Fuzzy mathematics in Pattern Classification, Ph.D. thesis, Applied Math. Center, Cornell University, Ithaca, 1973. [Booth and Hobert 1999] J. G. Booth and J. P. Hobert, “Maximizing Generalized Linear Mixed Model Likelihoods with an Automated Monte Carlo EM algorithm”, J. Roy. Stat. Soc., Ser. B, 1999 (to appear). BIBLIOGRAPHY 138 [Cambridge’s web site] http://svr-www.eng.cam.ac.uk/ [Campbell 1997] J. P. Campbell, “Speaker Recognition: A Tutorial”, in Special issue on Automated biometric Syst., Proc. IEEE, vol. 85, no. 9, pp. 1436-1462, 1997. [Chou et al. 1989] P. Chou, T. Lookabaugh and R. Gray, “Entropy-constrained vector quantisation”, IEEE Trans. Acoustic, Speech, and Signal Processing, vol. ASSP-37, pp. 31-42, 1989. [Chou and Oh 1996] H. J. Choi and Y. H. Oh, “Speech recognition using an enhanced FVQ based on codeword dependent distribution normalization and codeword weighting by fuzzy objective function”, in Proceedings of the International Conference on Spoken Language Processing (ICSLP), vol. 1, pp. 354-357, 1996. [Cover and Hart 1967] T. M. Cover and P. E. Hart, “Nearest neighbour pattern classification”, IEEE Trans. Inform. Theory, vol. IT-13, pp. 21-27, 1967. [CSELT’s web site] http://www.cselt.it/ [Dang and Govaert 1998] M. Dang and G. Govaert, “Spatial Fuzzy Clustering using EM and Markov Random Fields”, J. Syst. Research & Inform. Sci., vol. 8, pp. 183-202, 1998. [Das and Picheny 1996] S. K. Das and M. A. Picheny, “Issues in practical large vocabulary isolated word recognition: the IBM Tangora system”, chapter 19 in Automatic Speech and Speaker Recognition, Advanced Topics, edited by Chin-Hui Lee, Frank K. Soong, and Kuldip K. Paliwal, Kluwer Academic Publishers, USA, pp. 457-480, 1996. [Davé and Krishnapuram 1997] R. N. Davé and R. Krishnapuram, “Robust clustering methods: a unified view”, IEEE Trans. Fuzzy Syst., vol. 5, no.2, pp. 270-293, 1997. [Davé and Bhaswan 1992] R. N. Davé and K. 
Bhaswan, “Adaptive fuzzy c-shells clustering and detection of ellipses”, IEEE Trans. Neural Networks, vol. 3, pp. 643-662, May 1992. [Davé 1991] R. N. Davé, “Characterization and detection of noise in clustering”, Pattern Recognition Lett., vol. 12, no. 11, pp. 657-664, 1991. [Davé 1990] R. N. Davé, “Fuzzy-shell clustering and applications to circle detection in digital images”, Int. J. General Systems, vol. 16, pp. 343-355, 1990. [De Luca and Termini 1972] A. de Luca, S. Termini, “A definition of a nonprobabilistic entropy in the setting of fuzzy set theory”, Inform. Control, vol. 20, pp. 301-312, 1972. [De Mori and Laface 1980] R. De Mori and P. Laface, “Use of fuzzy algorithms for phonetic and phonemic labeling of continuous speech”, IEEE trans. Pattern Anal. Machine Intell., vol. PAMI-2, no. 2, pp. 136-148, March 1980. BIBLIOGRAPHY 139 [Dempster et al. 1977] A. P. Dempster, N. M. Laird and D. B. Rubin, “Maximum Likelihood from Incomplete Data via the EM algorithm”, J. Roy. Stat. Soc., Series B, vol. 39, pp. 1-38, 1977. [DRAGON’s web site] http://www.dragonsys.com/products/index.html [Dubois and Prade 1988] D. Dubois and H. Prade, Possibility Theory; An Approach to Computerized Processing of Uncertainty, Plenum Press, New York, 1988. [Duda and Hart 1973] R. O. Duda and P. E. Hart, Pattern classification and scene analysis, John Wiley & Sons, New York, 1973. [Duisburg’s web site] http://www.uni-duisburg.de/e/Forschung/ [Dunn 1974] J. Dunn, “A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Cluster”, J. Cybern., vol. 3, pp. 32-57, 1974. [Doddington 1998] G. R. Doddington, “Speaker Recognition Evaluation Methodology – An Overview and Perspective”, in Proc. Workshop on Speaker Recognition and its Commercial and Forensic Applications (RLA2C), pp. 60-66, 1998. [Fessler and Hero 1994] J. A. Fessler and A. O. Hero, “Space-alternating generalised EM algorithm”, IEEE Trans. Signal Processing, vol. 42, pp. 2664-2677, 1994. [Flanagan 1972] J. L. Flanagan, Speech Analysis, Synthesis, and Perception, 2nd ed., SpringerVerlag, New York, 1972. [Fogel 1995] D. B. Fogel, Evolutionary Computation, Toward A New Philosophy of Machine Intelligence, IEEE Press, New York, 1995. [Freitas 1998] J. F. G. Freitas, M. Niranjan and A. H. Gee, “The EM algorithm and neural networks for nonlinear state space estimation”, Technical Report CUED/F-INFENG/TR 313, Cambridge University, 1998. [Furui 1997] Sadaoki Furui, “Recent advances in speaker recognition”, Patter Recognition Lett., vol. 18, pp. 859-872, 1997. [Furui 1996] Sadaoki Furui, “An Overview of Speaker Recognition Technology”, chapter 2 in Automatic Speech and Speaker Recognition, Advanced Topics, edited by Chin-Hui Lee, Frank K. Soong, and Kuldip K. Paliwal, Kluwer Academic Publishers, USA, pp. 31-56, 1996. [Furui 1994] Sadaoki Furui, “An Overview of Speaker Recognition Technology”, in Proc. ESCA Workshop on Automatic Speaker Recognition, Identification and Verification, pp. 1-9, 1994. [Furui and Sondhi 1991a] Sadaoki Furui and M. Mohan Sondhi, Advances in Speech Signal Processing, Marcel Dekker, Inc., New York, 1991. BIBLIOGRAPHY 140 [Furui 1991b] Sadaoki Furui, “Speaker-independent and speaker-adaptive recognition techniques”, in Advances in Speech Signal Processing, edited by Sadaoki Furui and M. Mohan Sondhi, Marcel Dekker, Inc., New York, pp. 597-622, 1991. 
[Furui 1989] Sadaoki Furui, Digital Speech, Processing, Synthesis, and Recognition, Marcel Dekker, Inc., New York, 1989. [Furui 1981] Sadaoki Furui, “Cepstral Analysis Techniques for Automatic Speaker Verification”, IEEE Trans. Acoustic, Speech, and Signal Processing, vol. 29, pp. 254-272, 1981. [Gath and Geva 1989] I. Gath and A. B. Geva, “Unsupervised optimal fuzzy clustering”, IEEE Trans. Patt. Anal. Mach. Intell., PAMI vol. 11, no. 7, pp. 773-781, 1989. [Ghahramani 1997] Z. Ghahramani, “Factorial Hidden Markov Models”, in Machine Learning, vol. 29, pp. 245-275, Kluwer Academic Publisher, 1997. [Ghahramani 1995] Z. Ghahramani, “Factorial Learning and the EM Algorithm”, in Adv. Neural Inform. Processing Syst. G. Tesauro, D.S. Touretzky and J. Alspector (eds.), vol. 7, pp. 617-624, MIT Press, Cambridge, 1995. [Gish 1990] H. Gish, “Robust discrimination in automatic speaker identification”, in Proc. IEEE Inter. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’90), pp. 289-292, 1990. [Gish et al. 1991] H. Gish, M.-H. Siu and R. Rohlicek, “Segregation of speakers for speech recognition and speaker identification”, in Proc. IEEE Inter. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’91), pp. 873-876, 1991. [Goldberg 1989] D. E. Goldberg, Genetic Algorithms in Search, Optimization & Machine Learning, Addison-Wesley, 1989. [Gravier and Chollet 1998] G. Gravier and G. Chollet, “Comparison of Normalization Techniques for Speaker Verification”, in Proc. on Speaker Recognition and its Commercial and Forensic Applications (RLA2C), pp. 97-100, 1998. [Gustafson and Kessel 1979] D. E. Gustafson and W. Kessel, “Fuzzy clustering with a Fuzzy Covariance Matrix”, in in Proc. IEEE-CDC, (K. S. Fu, ed.), vol. 2, pp. 761-766, IEEE Press, Piscataway, New Jersey, 1979. [Harrington and Cassidy 1996] J. Harrington and S. Cassidy, Techniques in Speech Acoustics, Kluwer Academic Publications, 1996. [Hartigan 1975] J. Hartigan, Clustering Algorithms, Wiley, NewYork, 1975. [Hathaway 1986] R. Hathaway, “Another interpretation of the EM algorithm for mixture distribution”, J. Stat. Prob. Lett., vol. 4, pp. 53-56, 1986. BIBLIOGRAPHY 141 [Higgins et al. 1991] A. L. Higgins, L. Bahler and J. Porter, “Speaker Verification using Randomnized Phrase Prompting”, Digital Signal Processing, vol. 1, pp. 89-106, 1991. [Höppner et al. 1999] F. Höppner, F. Klawonn, R. Kruse and T. Runkler, Fuzzy Cluster Analysis – Methods for classification, Data analysis and Image Recognition, John Wiley & Sons Ltd, 1999. [Huang et al. 1996] X. Huang, A. Acero, F. Alleva, M. Huang, L. Jiang and M. Mahajan, “From SPHINX-II to WHISPER: Making speech recognition usable”, chapter 20 in Automatic Speech and Speaker Recognition, Advanced Topics, edited by Chin-Hui Lee, Frank K. Soong, and Kuldip K. Paliwal, Kluwer Academic Publishers, USA, pp. 481-508, 1996. [Huang et al. 1990] X. D. Huang, Y. Ariki and M. A. Jack, Hidden Markov Models For Speech Recognition, Edinburgh University Press, 1990. [Huang and Jack 1989] X. D. Huang and M. A. Jack, “Semi-Continuous Hidden Markov Models For Speech Signal”, Computer, Speech and Language, vol. 3, pp. 239-251, 1989. [IBM’s web site] http://www-4.ibm.com/software/speech/ [Jaynes 1957] E. T. Jaynes, “Information theory and statistical mechanics”, Phys. Rev., vol. 106, pp. 620-630, 1957. [Juang 1998] B.-H. Juang, “The Past, Present, and Future of Speech Processing”, IEEE Signal Processing Magazine, vol. 15, no. 3, pp. 24-48, 1998. [Juang et al. 1996] B.-H. Juang, W. 
Chou and C.-H. Lee, “Statistical and discriminative methods for speech recognition”, chapter 5 in Automatic Speech and Speaker Recognition, Advanced Topics, edited by Chin-Hui Lee, Frank K. Soong, and Kuldip K. Paliwal, Kluwer Academic Publishers, USA, pp. 109-132, 1996. [Juang and Katagiri 1992] B.-H. Juang and S. Katagiri, “Discriminative learning for minimum error classification”, IEEE Trans. Signal processing, SP-40, no. 12, pp. 3043-3054, 1992. [Juang and Rabiner 1991] B.-H. Juang and L. R. Rabiner, “Issues in using hidden Markov models for speech and speaker recognition”, in Advances in Speech Signal Processing, edited by Sadaoki Furui and M. Mohan Sondhi, Marcel Dekker, Inc., New York, pp. 509-554, 1991. [Juang 1985] B.-H. Juang, “Maximum likelihood estimation for multivariate observations of Markov sources”, AT&T Technical Journal, vol. 64, pp. 1235-1239, 1985. [Kasabov 1998] N. Kasabov, “A framework for intelligent conscious machines and its application to multilingual speech recognition systems”, in Brain-like computing and intelligent information systems, S. Amari and N. Kasabov eds. Singapore, Springer Verlag, pp. 106-126, 1998. BIBLIOGRAPHY 142 [Kasabov et al. 1999] N. Kasabov, R. Kozma, R. Kilgour, M. Laws, J. Taylor, M. Watts and A. Gray, “Hybrid connectionist-based methods and systems for speech data analysis and phoneme-based speech recognition” in Neuro-Fuzzy Techniques for Intelligent Information Processing, N. Kasabov and R.Kozma, eds. Heidelberg, Physica Verlag, 1999. [Katagiri and Juang 1998] S. Katagiri and B.-H. Juang, “Pattern Recognition using a family of design algorithms based upon the generalised probabilistic descent method”, invited paper in Proc. of the IEEE, vol. 86, no. 11, pp. 2345-2373, 1998. [Katagiri et al. 1991] S. Katagiri, C.-H. Lee and B.-H. Juang, “New discriminative training algorithms based on the generalised descent method”, in Proc. of IEEE Workshop on neural networks for signal processing, pp. 299-308, 1991. [Keller et al. 1985] J. M. Keller, M. R. Gray and J. A. Givens, “A fuzzy k-nearest neighbor algorithm”, IEEE Trans. Syst. Man Cybern., vol. SMC-15, no. 4, pp. 580-585, 1985. [Kewley-Port 1995] Diane Kewley-Port, “Speech recognition”, chapter 9 in Applied Speech Technology, edited by A. Syrdal, R. Bennett and S. Greenspan, CRC Press, Inc, USA, 1995. [Koo and Un 1990] M. Koo and C. K. Un, “Fuzzy smoothing of HMM parameters in speech recognition”, Electronic Letters, vol. 26, pp. 7443-7447, 1990. [Kosko 1992] B. Kosko, Neural Networks and Fuzzy Systems, Englewood Cliffs, NJ:Prentice-Hall, 1992. [Krishnapuram and Keller 1993] R. Krishnapuram and J. M. Keller, “A possibilistic approach to clustering”, IEEE Trans. Fuzzy Syst., vol. 1, pp. 98-110, 1993. [Krishnapuram et al. 1992] R. Krishnapuram, O. Nasraoui and H. Frigui, “Fuzzy c-spherical shells algorithm: A new approach”, IEEE Trans. Neural Networks, vol. 3, no. 5, pp. 663-671, 1992. [Kulkarni 1995] V. G. Kulkarni, Modeling and analysis of stochastic systems, Chapman & Hall, UK, 1995. [Kuncheva and Bezdek 1997] L. I. Kuncheva and J. C. Bezdek, “A fuzzy generalised nearest prototype classifier”, in Proc. the 7th IFSA World Congress, Prague, Czech, vol. III, pp. 217-222, 1997. [Kunzel 1994] H. J. Kunzel, “Current approaches to forensic speaker recognition”, in Proc. ESCA Workshop on Automatic Speaker Recognition, Identification and Verification, pp. 135-141, 1994. [Le et al. 1999] T. V. Le, D. Tran and M. 
Wagner, “Fuzzy evolutionary programming for hidden Markov modelling in speaker identification”, in Proc. the Congress on Evolutionary Computation 99, Washington DC, pp. 812-815, 1999. BIBLIOGRAPHY 143 [LIMSI’s web site] http://www.limsi.fr/Recherche/TLP/reco/2pg95-sv/2pg95-sv.html [C.-H. Lee et al. 1996] C.-H. Lee, F. K. Soong and K. K. Paliwal, Automatic speech and speaker recognition, Advanced topics, Kluwer Academic Publishers, USA, 1996. [C.-H. Lee and Gauvain 1996] C.-H. Lee and J.-L. Gauvain, “Bayesian adaptive learning and MAP estimation of HMM”, chapter 4 in Automatic Speech and Speaker Recognition, Advanced Topics, edited by Chin-Hui Lee, Frank K. Soong, and Kuldip K. Paliwal, Kluwer Academic Publishers, USA, pp. 83-108, 1996. [Lee and Leekwang 1995] K. M. Lee and H. Leekwang, “Identification of λ-Fuzzy Measure by Genetic Algorithms”, Fuzzy Sets Syst., vol. 75, pp. 301-309, 1995. [K.-F. Lee and Alleva 1991] K.-F. Lee and Fil Alleva, “Continuous speech recognition”, in Advances in Speech Signal Processing, edited by Sadaoki Furui and M. Mohan Sondhi, Marcel Dekker, Inc., New York, pp. 623-650, 1991. [Leszczynski et al. 1985] K. Leszczynski, P. Penczek and W. Grochulski, “Sugeno’s Fuzzy Measure and Fuzzy Clustering”, Fuzzy Sets Syst., vol. 15, pp. 147-158, 1985. [Levinson et al. 1983] S. E. Levinson, L. R. Rabiner and M. M. Sondhi, “An introduction to the application of the theory of Probabilistic functions of a Markov process to automatic speech recognition”, in The Bell Syst. Tech. Journal, vol. 62, no. 4, 1983, pp 1035-1074, 1983. [Li and Mukaidono 1999] R.-P. Li and M. Mukaidono, “Gaussian clustering method based on maximum-fuzzy-entropy interpretation”, Fuzzy Sets and Systems, vol. 102, pp. 253-258, 1999. [Linde et al. 1980] Y. Linde, A. Buzo and R. M. Gray, “An Algorithm for Vector Quantization”, IEEE Trans. Communications, vol. 28, pp. 84-95, 1980. [Liu et al. 1996] C. S. Liu, H. C. Wang and C.-H. Lee, “Speaker Verification using Normalization Log-Likelihood Score”, IEEE Trans. Speech and Audio Processing, vol. 4, pp. 56-60, 1980. [Liu et al. 1998] C. Liu, D. B. Rubin and Y. N. Wu, “Parameter Expansion to Accelerate EM: the PX-EM algorithm”, Biometrika, 1998 (to appear). [Liu and Rubbin 1994] C. Liu and D. B. Rubin “The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence”, Biometrika, vol. 81, pp. 633-648, 1994. [Markov and Nakagawa 1998a] K. P. Markov and S. Nakagawa, “Discriminative training of GMM using a modified EM algorithm for speaker recognition”, in Proc. Inter. Conf. on Spoken Language Processing (ICSLP’98), vol. 2, pp. 177-180, Sydney, Australia, 1998. BIBLIOGRAPHY 144 [Markov and Nakagawa 1998b] K. P. Markov and S. Nakagawa, “Text-independent speaker recognition using non-linear frame likelihood transformation”, Speech Communication, vol. 24, pp. 193-209, 1998. [Matsui and Furui 1994] T. Matsui and S. Furui, “A new similarity normalisation method for speaker verification based on a posteriori probability”, in Proc. ESCA Workshop on Automatic Speaker Recognition, Identification and Verification, pp. 59-62, 1994. [Matsui and Furui 1993] T. Matsui and S. Furui, “Concatenated Phoneme Models for Text Variable Speaker Recognition”, in Proc. IEEE Inter. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’93), pp. 391-394, 1993. [Matsui and Furui 1992] T. Matsui and S. Furui, “Comparison of text-independent speaker recognition methods using VQ-distortion and discrete/continuous HMMs”, in Proc. IEEE Inter. Conf. 
on Acoustic, Speech, and Signal Processing (ICASSP’92), San Francisco, pp. II-157-160, 1992. [Matsui and Furui 1991] T. Matsui and S. Furui, “A text-independent speaker recognition method robust against utterance variations”, in Proc. IEEE Inter. Conf. on Acoustic, Speech, and Signal Processing (ICASSP’91), pp. 377-380, 1991. [McDermott and Katagiri 1994] E. McDermott and S. Katagiri, “Prototype-based MCE/GPD training for various speech units”, Comp. Speech Language, vol. 8, pp. 351-368, 1994. [Medasani and Krishnapuram 1998] S. Medasani and R. Krishnapuram, “Categorization of Image Databases for Efficient Retrieval Using Robust Mixture Decomposition”, in Proc. IEEE Workshop on Content Based Access of Images and Video Libraries, IEEE Conference on Computer Vision and Pattern Recognition, Santa Barbara, pp. 50-54, 1998. [Meng and Dyk 1997] X. L. Meng and V. Dyk, “The EM algorithm An old folk song sung to a fast new tune (with discussion)”, J. Roy. Stat. Soc., Ser. B, vol. 59, pp. 511-567, 1997. [Meng and Rubin 1993] X. L. Meng and D. B. Rubin, “Maximum likelihood estimation via the ECM algorithm: a general framework”, Biometrika, vol. 80, pp. 267-278, 1993. [MIT’s web site] http://www.sls.lcs.mit.edu/sls/ [Millar et al. 1994] J. B. Millar, J. P. Vonwiller, J. M. Harrington and P. J. Dermody, “The Australian National Database of Spoken Language”, in Proc. Inter. Conf. on Acoustic, Speech, and Signal Processing (ICASSP’94), vol. 1, pp. 97-100, 1994. [Nadas 1983] A. Nadas, “A decision theoretic formulation of atraining problem in speech recognition and a comparison of training by unconditional versus conditional maximum likelihood”, IEEE Trans. Signal Processing, vol. 31, no. 4, pp. 814-817, 1983. BIBLIOGRAPHY 145 [Murofushi and Sugeno 1989] T. Murofushi and M. Sugeno, “An interpretation of Fuzzy Measure and the Choquet Integral as an Integral with respect to a Fuzzy Measure”, Fuzzy Sets Syst., vol. 29, pp. 201-227, 1989. [Normandin 1996] Y. Normandin, “Maximum mutual information estimation of hidden Markov models”, chapter 3 in Automatic Speech and Speaker Recognition, Advanced Topics, edited by Chin-Hui Lee, Frank K. Soong, and Kuldip K. Paliwal, Kluwer Academic Publishers, USA, pp. 57-82, 1996. [Western Ontario’s web site] http://www.uwo.ca/nca/ [O’Shaughnessy 1987] Douglas O’Shaughnessy, Speech Communication, Addison-Wesley, USA, 1987. [Ostendorf et al. 1997] M. Ostendorf, V. V. Digalakis and O. A. Kimball, “From HMM’s to segment models: A unified view of stochastic modeling for speech recognition”, IEEE Trans. Speech & Audio Processing, vol. 4, no. 5, pp. 360-378, 1997. [Ostendorf 1996] M. Ostendorf, “From HMM’s to segment models: stochastic modeling for CSR”, chapter 8 in Automatic Speech and Speaker Recognition, Advanced Topics, edited by Chin-Hui Lee, Frank K. Soong, and Kuldip K. Paliwal, Kluwer Academic Publishers, USA, pp. 185-210, 1996. [Otten and Ginnenken 1989] R. H. J. M. Otten and L. P. P. P. van Ginneken, The Annealing Algorithm, Kluwer, Boston, 1989. [Owens 1993] F. J. Owens, Signal Processing of Speech, McGraw-Hill, Inc., New York, 1993. [Pal and Majumder 1977] S. K. Pal and D. D. Majumder, “Fuzzy sets and decision making approaches in vowel and speaker recognition”, IEEE Trans. Syst. Man Cybern., pp. 625-629, 1977. [Paul 1989] D. B. Paul, “The Lincoln Robust Continuous Speech Recogniser,” Proc. ICASSP 89, Glasgow, Scotland, pp. 449-452, 1989. [Peleg 1980] S. Peleg, “A new probability relaxation scheme”, IEEE Trans. Patt. 
Anal. Mach. Intell., vol. 7, no. 5, pp. 617-623, 1980. [Peleg and Rosenfeld 1978] S. Peleg and A. Rosenfeld, “Determining compatibility coefficients for curve enhancement relaxation processes”, IEEE Trans. Syst. Man Cybern., vol. 8, no. 7, pp. 548-555, 1978 BIBLIOGRAPHY 146 [Rabiner et al. 1996] L. R. Rabiner, B. H. Juang and C. H. Lee, “An Overview of Automatic Speech Recognition”, chapter 1 in Automatic Speech and Speaker Recognition, Advanced Topics, edited by Chin-Hui Lee, Frank K. Soong, and Kuldip K. Paliwal, Kluwer Academic Publishers, USA, pp. 1-30, 1996. [Rabiner and Juang 1993] L. R. Rabiner and B. H. Juang, Fundamentals of speech recognition, Prentice Hall PTR, USA, 1993. [Rabiner 1989] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications speech recognition”, in Proc. IEEE, vol. 77, no. 2, pp. 257-286, 1989. [Rabiner and Juang 1986] L. R. Rabiner and B. H. Juang, “An introduction to hidden Markov models”, IEEE Acoustic, Speech, and Signal Processing Society Magazine, vol. 3, no. 1, pp. 4-16, 1986. [Rabiner et al. 1983] L. R. Rabiner, S. E. Levinson and M. M. Sondhi, “On the application of vector quantisation and hidden Markov models to speaker-independent, isolated word recognition”, The Bell System Technical Journal, vol. 62, no. 4, 1983, pp 1075-1105, 1983. [Rabiner and Schafer 1978] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice Hall PTR, USA, 1978. [Rezaee et al. 1998] M. R. Rezaee, B. P. F. Lelieveldt and J. H. C. Reiber, “A new cluster validity index for the fuzzy c-means”, Patt. Rec. Lett. vol. 19, pp. 237-246, 1998. [Reynolds 1995a] Douglas A. Reynolds, “Speaker identification and verification using Gaussian mixture speaker models”, Speech Communication, vol. 17, pp. 91-108, 1995. [Reynolds 1995b] Douglas A. Reynolds and Richard C. Rose, “Robust text-independent speaker identification using Gaussian mixture models”, IEEE Trans. Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, 1995. [Reynolds 1994] Douglas A. Reynolds, “Speaker identification and verification using Gaussian mixture speaker models”, in Proc. ESCA Workshop on Automatic Speaker Recognition, Identification and Verification, vol. 17, pp. 91-108, 1994. [Reynolds 1992] Douglas A. Reynolds, A Gaussian Mixture Modeling Approach to Text-Independent Speaker Identification, PhD thesis, Georgia Institute of Technology, USA, 1992. [Rosenberg and Soong 1991] A. E. Rosenberg and Frank K. Soong, “Recent research in automatic speaker recognition”, in Advances in Speech Signal Processing, edited by Sadaoki Furui and M. Mohan Sondhi, Marcel Dekker, Inc., New York, pp. 701-740, 1991. BIBLIOGRAPHY 147 [Rosenberg et al. 1992] A. E. Rosenberg, J. Delong, C.-H. Lee, B.-H. Juang and F. K. Soong, “The use of cohort normalised scores for speaker verification”, in Proc. Inter. Conf. on Spoken Language Processing (ICSLP’92), pp. 599-602, 1992. [Ruspini 1969] E. H. Ruspini, “A new approach to clustering”, in Inform. Control, vol. 15, no. 1, pp. 22-32, 1969. [Sagayama 1996] S. Sagayama, “Hidden Markov network for precise acoustic modeling”, chapter 7 in Automatic Speech and Speaker Recognition, Advanced Topics, edited by Chin-Hui Lee, Frank K. Soong, and Kuldip K. Paliwal, Kluwer Academic Publishers, USA, pp. 159-184, 1996. [Schwartz et al. 1996] R. Schwartz, L. Nguyen and J. Makhoul, “Multiple-pass search strategies”, chapter 18 in Automatic Speech and Speaker Recognition, Advanced Topics, edited by ChinHui Lee, Frank K. Soong, and Kuldip K. 
Paliwal, Kluwer Academic Publishers, USA, pp. 429-456, 1996. [Siu et al. 1992] M.-H. Siu, G. Yu and H. Gish, “An unsupervised, sequential learning algorithm for the segmentation of speech waveforms with multiple speakers”, in Proc. IEEE Inter. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’92), pp. I-189-192, 1992. [Soong et al. 1987] F. K. Soong, A. E. Rosenberg, L. R. Rabiner, and B. H. Juang, “A vector quantisation approach to speaker recognition”, AT&T Tech. J., vol. 66, pp. 14-26, 1987. [Syrdal et al. 1995] A. Syrdal, R. Bennett and S. Greenspan, Applied Speech Technology, CRC Press, Inc, USA, 1995. [Tanaka and Guo 1999] H. Tanaka and P. Guo, Possibilistic Data Analysis for Operations Research, Physica-Verlag Heidelberg, Germany, 1999. [Tran and Wagner 2000a] Dat Tran and Michael Wagner, “Fuzzy Modelling Techniques for Speech and Speaker Recognition”, the Special Issue on Recognition Technology of the IEEE Transactions on Fuzzy Systems (accepted subject to revision). [Tran and Wagner 2000b] Dat Tran and Michael Wagner, “Fuzzy Entropy Hidden Markov Models for Speech Recognition”, submitted to the International Conference on Spoken Language Processing (ICSLP2000), Beijing, China. [Tran and Wagner 2000c] Dat Tran and Michael Wagner, “Fuzzy Normalisation Methods for Speaker Verification”, submitted to the International Conference on Spoken Language Processing (ICSLP2000), Beijing, China. BIBLIOGRAPHY 148 [Tran and Wagner 2000d] Dat Tran and Michael Wagner, “A Proposed Likelihood Transformation for Speaker Verification”, the International Conference on Acoustics, Speech & Signal Processing (ICASSP’2000), Turkey (to appear). [Tran and Wagner 2000e] Dat Tran and Michael Wagner, “Frame-Level Hidden Markov Models”, the International Conference on Advances in Intelligent Systems: Theory and Applications (ISTA’2000), Australia (to appear). [Tran and Wagner 2000f] Dat Tran and Michael Wagner, “Fuzzy Entropy Clustering”, the FUZZIEEE’2000 Conference, USA (to appear). [Tran and Wagner 2000g] Dat Tran and Michael Wagner, “An Application of Fuzzy Entropy Clustering In Speaker Identification”, in Proceedings of the Joint Conference on Information Sciences 2000 (Fuzzy Theory and Technology Track), vol. 1, pp. 228-231, 2000, Atlantic City, NJ, USA. [Tran and Wagner 2000h] Dat Tran and Michael Wagner, “A General Approach to Hard, Fuzzy, and Probabilistic Models for Pattern Recognition”, the International Conference on Advances in Intelligent Systems: Theory and Applications (ISTA’2000), Australia (to appear). [Tran et al. 2000a] Dat Tran, Michael Wagner and Tuan Pham, “Hard Hidden Markov Models for Speech Recognition”, the 4rd World Multiconference on Systemetics, Cybernetics and Informatics/ The 6th Int. Conf. Information Systems Analysis and Synthesis (SCI/ISAS’2000), Florida, USA (to appear). [Tran et al. 2000b] Dat Tran, Michael Wagner and Tuan Pham, “Hard Gaussian Mixture Models for Speaker Recognition”, the 4rd World Multiconference on Systemetics, Cybernetics and Informatics/ The 6th Int. Conf. Information Systems Analysis and Synthesis (SCI/ISAS’2000), Florida, USA (to appear). [Tran 1999] Dat Tran, “Fuzzy Entropy Models for Speech Recognition”, the first prize of the 1999 IEEE ACT Section Student Paper Contest, in the Postgraduate Division, Australia. [Tran and Wagner 1999a] Dat Tran and Michael Wagner, “Hidden Markov models using fuzzy estimation”, in Proceedings of the EUROSPEECH’99 Conference, vol. 6, pp. 2749-2752, 1999, Hungary. 
[Tran and Wagner 1999b] Dat Tran and Michael Wagner, “Fuzzy expectation-maximisation algorithm for speech and speaker recognition”, in Proceedings of the 18th International Conference of the North American Fuzzy Information Society (NAFIPS’99), pp. 421-425, 1999, USA. [Tran and Wagner 1999c] Dat Tran and Michael Wagner, “Fuzzy hidden Markov models for speech and speaker recognition”, in Proceedings of the 18th International Conference of the North American Fuzzy Information Society (NAFIPS’99), pp. 426-430, 1999, USA. BIBLIOGRAPHY 149 [Tran and Wagner 1999d] Dat Tran and Michael Wagner, “Fuzzy approach to Gaussian mixture models and generalised Gaussian mixture models”, in Proceedings of the Computation Intelligence Methods and Applications (CIMA’99) Conference, pp. 154-158, 1999, USA. [Tran and Wagner 1999e] Dat Tran and Michael Wagner, “A robust clustering approach to fuzzy Gaussian mixture models for speaker identification”, in Proceedings of the Third International Conference on Knowledge-Based Intelligent Information Engineering Systems (KES’99), pp. 337-340, Adelaide, Australia. [Tran et al. 1999a] Dat Tran, Michael Wagner, and Tongtao Zheng, “A Fuzzy Approach to Statistical Models in Speech and Speaker Recognition”, in 1999 IEEE international Fuzzy Systems Conference Proceedings (FUZZ-IEEE’99), vol. 3, pp. 1275-1280, 1999, Korea. [Tran et al. 1999b] Dat Tran, Michael Wagner and Tongtao Zheng, “Fuzzy nearest prototype classifier applied to speaker identification”, in Proceedings of the European Symposium on Intelligent Techniques (ESIT’99) on CD-ROM, abstract on page 34, 1999, Greece. [Tran et al. 1999c] Dat Tran, Tuan Pham, and Michael Wagner, “Speaker recognition using Gaussian mixture models and relaxation labeling”, in Proceedings of the 3rd World Multiconference on Systemetics, Cybernetics and Informatics/ The 5th Int. Conf. Information Systems Analysis and Synthesis (SCI/ISAS99), vol. 6, pp. 383-389, 1999, USA. [Tran et al. 1999d] Dat Tran, Michael Wagner and Tongtao Zheng, “State mixture modelling applied to speech and speaker recognition”, in a special issue of the Journal of Pattern Recognition Letters (Pattern Recognition in Practice VI), vol. 20, no. 11-13, pp. 1449-1456, 1999. [Tran 1998] Dat Tran “Hidden Markov models using state distribution”, the first prize of the 1998 IEEE ACT Section Student Paper Contest, in the Postgraduate Division, Australia. [Tran and Wagner 1998] Dat Tran and Michael Wagner, “Fuzzy Gaussian Mixture Models for Speaker Recognition”, in Special issue of the Australian Journal of Intelligent Information Processing Systems (AJIIPS), vol. 5, no. 4, pp. 293-300, 1998. [Tran et al. 1998a] Dat Tran, T. VanLe and Michael Wagner, “Fuzzy Gaussian Mixture Models for Speaker Recognition”, in Proceedings of the International Conference on Spoken Language Processing (ICSLP98), vol. 2, pp. 759-762, 1998, Australia. [Tran et al. 1998b] Dat Tran, Michael Wagner and T. VanLe, “A proposed decision rules based on fuzzy c-means clustering for speaker recognition”, in Proceedings of the International Conference on Spoken Language Processing (ICSLP98), vol. 2, pp. 755-758, 1998, Australia. [Tran et al. 1998c] Dat Tran, Michael Wagner and Tuan Pham, “Minimum Classifier Error and Relaxation Labelling for Speaker Recognition”, in Proceedings of the Speech computer Workshop, St Petersburg, (Specom 98), pp. 229-232, 1998, Russia. BIBLIOGRAPHY 150 [Tran et al. 1998d] Dat Tran, Minh Do, Michael Wagner and T. 
VanLe, “A proposed decision rule for speaker identification based on a posteriori probability”, in Proceedings of the ESCA Workshop (RLA2C98), pp. 85-88, 1998, France. [Tseng et al. 1987] H.-P. Tseng, M. J. Sabin and E. A. Lee, “Fuzzy vector quantisation applied to hidden Markov modelling”, in Proc. of the Inter. Conf. on Acoustics, Speech & Signal Processing (ICASSP’87), pp. 641-644, 1987. [Tsuboka and Nakahashi 1994] E. Tsuboka and J. Nakahashi, “On the fuzzy vector quantisation based hidden Markov model”, in Proc. Inter. Conf. on Acoustics, Speech & Signal Processing (ICASSP’94), vol. 1, pp. 537-640, 1994. [Upper 1997] D. R. Upper, Theory and algorithms for hidden Markov models and generalised hidden Markov models, PhD thesis in Mathematics, University of California at Berkeley, 1997. [Varga and Moore 1990] A. P. Varga and R. K. Moore, “Hidden Markov model decomposition of speech and noise”, in Proc. Inter. Conf. on Acoustics, Speech & Signal Processing (ICASSP’90), pp. 845-848, 1990. [Wagner 1996] Michael Wagner, “Combined speech-recognition speaker-verification system with modest training requirements”, in Proc. Sixth Australian International Conf. on Speech Science and Technology, Adelaide, Australia, pp. 139-143, 1996. [Wang 1992] Z. Wang and G. J. Klir, Fuzzy Measure Theory, Plenum Press, 1992. [Wilcox et al. 1994] L. Wilcox, F. Chen, D. Kimber, and V. Balasubramanian, “Segmentation of speech using speaker identification”, in Proc. IEEE Inter. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’94), pp. I-161-164, 1994. [Windham 1983] M. P. Windham, “Geometrical fuzzy clustering algorithms”, Fuzzy sets Syst., vol. 10, pp. 271-279, 1983. [Woodland 1997] P. C. Woodland, “Broadcast news transcription using HTK”, in Proc. Inter. Conf. on Acoustics, Speech & Signal Processing (ICASSP’97), pp. , USA, 1997. [Wu 1983] C. F. J. Wu, “On the convergence properties of the EM algorithm”, Ann. Stat., vol. 11, pp. 95-103, 1983. [Yang and Cheng 1993] M. S. Yang and C. T. Chen, “On strong consistency of the fuzzy generalized nearest neighbour rule”, Fuzzy Sets Syst., vol. 3, no. 60, pp. 273-281, 1993. [Zadeh 1995] L. A. Zadeh, “Discussion: probability theory and fuzzy logic are complementary rather than competitive”, Technometrics, vol. 37, no. 3, pp. 271-276, 1995. BIBLIOGRAPHY 151 [Zadeh 1994] L. A. Zadeh, “Fuzzy logic, neural networks, and soft computing”, Communications of the ACM, vol. 37, no. 3, pp. 77-84, 1994. [Zadeh 1978] L. A. Zadeh, “Fuzzy sets as a basis for a theory of possibility”, Fuzzy Sets and Systems, vol. 1, no. 1, pp. 3-28, 1978. [Zadeh 1977] L. A. Zadeh, “Fuzzy sets and their application to pattern classification and clustering analysis”, Classification and Clustering, edited by J. Van Ryzin, Academic Press Inc, pp. 251-282 & 292-299, 1977. [Zadeh 1976] L. A. Zadeh, “The linguistic approach and its application to decision analysis”, Directions in large scale systems, edited by Y. C. Ho and S. K. Mitter, Plenum Publishing Corporation, pp. 339-370, 1976. [Zadeh 1968] L. A. Zadeh, “Probability measures of fuzzy events”, J. Math. Anal. Appl., vol. 23, no. 2, pp. 421-427, 1968. [Zadeh 1965] L. A. Zadeh, “Fuzzy Sets”, Inf. Control., vol. 8, no. 1, pp. 338-353, 1965. [Zhuang et al. 1989] X. Zhuang, R. M. Haralick and H. Joo, “A simplex-like algorithm for the relaxation labeling process”, IEEE Trans. Patt. Anal. Mach. Intell., vol. 11, pp. 1316-1321, 1989. Appendix A List of Publications 1. 
Dat Tran, “Fuzzy Entropy Models for Speech Recognition”, the first prize of the 1999 IEEE ACT Section Student Paper Contest, in the Postgraduate Division, Australia. 2. Dat Tran “Hidden Markov models using state distribution”, the first prize of the 1998 IEEE ACT Section Student Paper Contest, in the Postgraduate Division, Australia. 3. Dat Tran and Michael Wagner, “Fuzzy Modelling Techniques for Speech and Speaker Recognition”, the Special Issue on Recognition Technology of the IEEE Transactions on Fuzzy Systems (accepted subject to revision). 4. Dat Tran and Michael Wagner, “Fuzzy Entropy Hidden Markov Models for Speech Recognition”, submitted to the International Conference on Spoken Language Processing (ICSLP’2000), Beijing, China. 5. Dat Tran and Michael Wagner, “Fuzzy Normalisation Methods for Speaker Verification”, submitted to the International Conference on Spoken Language Processing (ICSLP’2000), Beijing, China. 6. Dat Tran and Michael Wagner, “A Proposed Likelihood Transformation for Speaker Verification”, the International Conference on Acoustics, Speech & Signal Processing (ICASSP’2000), Turkey (to appear). 7. Dat Tran and Michael Wagner, “Frame-Level Hidden Markov Models”, the International Conference on Advances in Intelligent Systems: Theory and Applications (ISTA’2000), Australia (to appear). 8. Dat Tran and Michael Wagner, “A General Approach to Hard, Fuzzy, and Probabilistic Models for Pattern Recognition”, the International Conference on Advances in Intelligent Systems: Theory and Applications (ISTA’2000), Australia (to appear). 9. Dat Tran and Michael Wagner, “Fuzzy Entropy Clustering”, the FUZZ-IEEE’2000 Conference, USA (to appear). 152 153 10. Dat Tran and Michael Wagner, “An Application of Fuzzy Entropy Clustering In Speaker Identification”, in Proceedings of the Joint Conference on Information Sciences 2000 (Fuzzy Theory and Technology Track), vol. 1, pp. 228-231, 2000, Atlantic City, NJ, USA. 11. Dat Tran and Michael Wagner, “Hidden Markov models using fuzzy estimation”, in Proceedings of the EUROSPEECH’99 Conference, vol. 6, pp. 2749-2752, 1999, Hungary. 12. Dat Tran and Michael Wagner, “Fuzzy expectation-maximisation algorithm for speech and speaker recognition”, in Proceedings of the 18th International Conference of the North American Fuzzy Information Society (NAFIPS’99), pp. 421-425, 1999, USA (Outstanding Student Paper Award, Top Honor). 13. Dat Tran and Michael Wagner, “Fuzzy hidden Markov models for speech and speaker recognition”, in Proceedings of the 18th International Conference of the North American Fuzzy Information Society (NAFIPS’99), pp. 426-430, 1999, USA (Outstanding Student Paper Award, Top Honor). 14. Dat Tran and Michael Wagner, “Fuzzy approach to Gaussian mixture models and generalised Gaussian mixture models”, in Proceedings of the Computation Intelligence Methods and Applications (CIMA’99) Conference, pp. 154-158, 1999, USA. 15. Dat Tran and Michael Wagner, “A robust clustering approach to fuzzy Gaussian mixture models for speaker identification”, in Proceedings of the Third International Conference on Knowledge-Based Intelligent Information Engineering Systems (KES’99), pp. 337-340, Adelaide, Australia. 16. Dat Tran and Michael Wagner, “Fuzzy Gaussian Mixture Models for Speaker Recognition”, in Special issue of the Australian Journal of Intelligent Information Processing Systems (AJIIPS), vol. 5, no. 4, pp. 293-300, 1998. 17. Dat Tran, Michael Wagner and T. 
VanLe, “A proposed decision rules based on fuzzy C-means clustering for speaker recognition”, in Proceedings of the International Conference on Spoken Language Processing (ICSLP’98), vol. 2, pp. 755-758, 1998, Australia. 18. Dat Tran, Michael Wagner and Tuan Pham, “Hard Hidden Markov Models for Speech Recognition”, the 4rd World Multiconference on Systemetics, Cybernetics and Informatics/ The 6th Int. Conf. Information Systems Analysis and Synthesis (SCI/ISAS’2000), Florida, USA (to appear). 19. Dat Tran, Michael Wagner and Tuan Pham, “Hard Gaussian Mixture Models for Speaker Recognition”, the 4rd World Multiconference on Systemetics, Cybernetics and Informatics/ The 6th Int. Conf. Information Systems Analysis and Synthesis (SCI/ISAS’2000), Florida, USA (to appear). 154 20. Dat Tran, Michael Wagner and Tuan Pham, “Minimum classifier Error and Relaxation labelling for speaker recognition”, in Proceedings of the Speech computer Workshop, St Petersburg, (Specom’98), pp. 229-232, 1998, Russia. 21. Dat Tran, Michael Wagner and Tongtao Zheng, “State mixture modelling applied to speech and speaker recognition”, in a special issue of the Journal of Pattern Recognition Letters (Pattern Recognition in Practice VI), vol. 20, no. 11-13, pp. 1449-1456, 1999. 22. Dat Tran, Michael Wagner, and Tongtao Zheng, “A Fuzzy Approach to Statistical Models in Speech and Speaker Recognition”, in 1999 IEEE international Fuzzy Systems Conference Proceedings (FUZZ-IEEE’99), vol. 3, pp. 1275-1280, 1999, Korea. 23. Dat Tran, Michael Wagner and Tongtao Zheng, “Fuzzy nearest prototype classifier applied to speaker identification”, in Proceedings of the European Symposium on Intelligent Techniques (ESIT’99) on CD-ROM, abstract on page 34, 1999, Greece. 24. Dat Tran, Tuan Pham, and Michael Wagner, “Speaker recognition using Gaussian mixture models and relaxation labeling”, in Proceedings of the 3rd World Multiconference on Systemetics, Cybernetics and Informatics/ The 5th Int. Conf. Information Systems Analysis and Synthesis (SCI/ISAS’99), vol. 6, pp. 383-389, 1999, USA. 25. Dat Tran, T. VanLe and Michael Wagner, “Fuzzy Gaussian Mixture Models for Speaker Recognition”, in Proceedings of the International Conference on Spoken Language Processing (ICSLP’98), vol. 2, pp. 759-762, 1998, Australia (paper selected to publish in a special issue of the Australian Journal of Intelligent Information Processing Systems). 26. Dat Tran, Minh Do, Michael Wagner and T. VanLe, “A proposed decision rule for speaker identification based on a posteriori probability”, in Proceedings of the ESCA Workshop (RLA2C’98), pp. 85-88, 1998, France. 27. Tuan Pham, Dat Tran and Michael Wagner, “Optimal fuzzy information fusion for speaker verification”, in Proceedings of the Computation Intelligence Methods and Applications Conference (CIMA’99), pp. 141-146, 1999, USA. 28. Tuan Pham, Dat Tran and Michael Wagner, “Speaker verification using relaxation labeling”, in Proceedings of the ESCA Workshop (RLA2C’98), pp. 29-32, 1998, France. 29. Le, T. V., Tran D., and Wagner, M., “Fuzzy evolutionary programming for hidden Markov modelling in speaker identification”, in Proceedings of the Congress on Evolutionary Computation (CEC’99), Washington DC, pp. 812-815, July 1999.