Fuzzy Approaches to
Speech and Speaker Recognition
A thesis submitted for the degree
of Doctor of Philosophy of
the University of Canberra
Dat Tat Tran
May 2000
Summary of Thesis
Statistical pattern recognition is the most successful approach to automatic speech and
speaker recognition (ASASR). Of all the statistical pattern recognition techniques, the hidden Markov model (HMM) is the most important. The Gaussian mixture model (GMM)
and vector quantisation (VQ) are also effective techniques, especially for speaker recognition
and in conjunction with HMMs, for speech recognition.
However, the performance of these techniques degrades rapidly in the context of insufficient training data and in the presence of noise or distortion. Fuzzy approaches with their
adjustable parameters can reduce such degradation.
Fuzzy set theory is one of the most successful approaches in pattern recognition, where,
based on the idea of a fuzzy membership function, fuzzy C-means (FCM) clustering and
noise clustering (NC) are the most important techniques.
To establish fuzzy approaches to ASASR, the following basic problems are solved. First,
a time-dependent fuzzy membership function is defined for the HMM. Second, a general
distance is proposed to obtain a relationship between modelling and clustering techniques.
Third, fuzzy entropy (FE) clustering is proposed to relate fuzzy models to statistical models. Finally, fuzzy membership functions are proposed as discriminant functions in decision making.
The following models are proposed: 1) the FE-HMM, NC-FE-HMM, FE-GMM, NC-FE-GMM, FE-VQ and NC-FE-VQ in the FE approach, 2) the FCM-HMM, NC-FCM-HMM,
FCM-GMM and NC-FCM-GMM in the FCM approach, and 3) the hard HMM and GMM
as the special models of both FE and FCM approaches. Finally, a fuzzy approach to speaker
verification and a further extension using possibility theory are also proposed.
The evaluation experiments performed on the TI46, ANDOSL and YOHO corpora show
better results for all of the proposed techniques in comparison with the non-fuzzy baseline
techniques.
Certificate of Authorship of Thesis
Except as specially indicated in footnotes, quotations and the bibliography, I certify that I
am the sole author of the thesis submitted today entitled—
Fuzzy Approaches to Speech and Speaker Recognition
in terms of the Statement of Requirements for a Thesis issued by the University Higher
Degrees Committee.
Papers containing some of the material of the thesis have been published as Tran [1999,
1998], Tran and Wagner [2000a-h, 1999a-e, 1998], and Tran et al. [2000a,b, 1999a-d, 1998a-d]. For all of the above joint papers, I certify that the contributions of my co-authors,
Michael Wagner and Tu Van Le, were solely made in their respective roles of primary and
secondary thesis supervisors.
For the joint papers Tran et al. [2000a,b, 1999c, 1998c], my co-author Tuan Pham contributed comments on my literature review and discussions on the theoretical development
while Michael Wagner contributed as a thesis supervisor. For the joint papers Tran et al.
[1999a,b,d], my co-author Tongtao Zheng contributed discussions on the experimental results while Michael Wagner contributed as a thesis supervisor. For the joint paper Tran et al. [1998d], my co-author Minh Do contributed part of the programming and some discussions on the theoretical development and the experimental results while Michael Wagner and Tu Van Le contributed as thesis supervisors.
Acknowledgements
First and foremost, I would like to thank my primary supervisor, Professor Michael Wagner,
for his enormous support and encouragement during my research study at the University
of Canberra. I am also thankful for the advice and guidance he gave me in spite of his
busy schedule, for helping me organise the thesis draft and refine its contents, and for his
patience in answering my inquiries.
I would like to thank my secondary supervisor, Associate Professor Tu Van Le, for his teaching and for the support he gave me in becoming a PhD candidate at the University of Canberra.
I would also like to thank staff members as well as research students at the University
of Canberra, for their support and for maintaining the excellent computing facilities which
were crucial for carrying out my research.
I am grateful for the University of Canberra Research Scholarship, which enabled me
to undertake this research in the period February 1997 - February 2000. I would also like
to thank the School of Computing and Division of Management and Technology which
provided funding for attending several conferences.
I would like to express my gratitude to all my lecturers and colleagues at the Department of Theoretical Physics, Faculty of Physics, and Faculty of Mathematics, University
of Ho Chi Minh City, Viet Nam.
Many special thanks to my family members. I am indebted to my parents for the sacrifices they have made for me. I wish to thank my brothers-in-law and sisters-in-law as well as my wife, Phuong Dao, my son, Nguyen Tran, and my daughter, Thao Tran, for the support they have given me throughout the years of my thesis research.
Finally, this work is dedicated to the memory of my previous supervisor, Professor Phi Van Duong, a scientist of the Abdus Salam International Centre for Theoretical Physics (ICTP), Trieste, Italy. Special thanks for his teaching, love, advice, guidance, support and encouragement throughout my 12 years at the University of Ho Chi Minh City, Viet Nam.
Contents

Summary of Thesis
Acknowledgements
List of Abbreviations

1 Introduction
  1.1 Current Approaches to Speech and Speaker Recognition
    1.1.1 Statistical Pattern Recognition Approach
    1.1.2 Modelling Techniques: HMM, GMM and VQ
  1.2 Fuzzy Set Theory-Based Approach
    1.2.1 The Membership Function
    1.2.2 Clustering Techniques: FCM and NC
  1.3 Problem Statement
  1.4 Contributions of This Thesis
    1.4.1 Fuzzy Entropy Models
    1.4.2 Fuzzy C-Means Models
    1.4.3 Hard Models
    1.4.4 A Fuzzy Approach to Speaker Verification
    1.4.5 Evaluation Experiments and Results
  1.5 Extensions of This Thesis

2 Literature Review
  2.1 Speech Characteristics
    2.1.1 Speech Sounds
    2.1.2 Speech Signals
    2.1.3 Speech Processing
    2.1.4 Summary
  2.2 Speech and Speaker Recognition
    2.2.1 Speech Recognition
    2.2.2 Speaker Recognition
    2.2.3 Summary
  2.3 Statistical Modelling Techniques
    2.3.1 Maximum A Posteriori Rule
    2.3.2 Distribution Estimation Problem
    2.3.3 Maximum Likelihood Estimation
    2.3.4 Hidden Markov Modelling
          Parameters and Types of HMMs
          Three Basic Problems for HMMs
    2.3.5 Gaussian Mixture Modelling
    2.3.6 Vector Quantisation Modelling
    2.3.7 Summary
  2.4 Fuzzy Clustering Techniques
    2.4.1 Fuzzy Sets and the Membership Function
    2.4.2 Maximum Membership Rule
    2.4.3 Membership Estimation Problem
    2.4.4 Pattern Recognition and Cluster Analysis
    2.4.5 Hard C-Means Clustering
    2.4.6 Fuzzy C-Means Clustering
          Fuzzy C-Means Algorithm
          Gustafson-Kessel Algorithm
          Gath-Geva Algorithm
    2.4.7 Noise Clustering
    2.4.8 Summary
  2.5 Fuzzy Approaches in the Literature
    2.5.1 Maximum Membership Rule-Based Approach
    2.5.2 FCM-Based Approach

3 Fuzzy Entropy Models
  3.1 Fuzzy Entropy Clustering
  3.2 Modelling and Clustering Problems
  3.3 Maximum Fuzzy Likelihood Estimation
  3.4 Fuzzy Entropy Hidden Markov Models
    3.4.1 Fuzzy Membership Functions
    3.4.2 Fuzzy Entropy Discrete HMM
    3.4.3 Fuzzy Entropy Continuous HMM
    3.4.4 Noise Clustering Approach
  3.5 Fuzzy Entropy Gaussian Mixture Models
    3.5.1 Fuzzy Entropy GMM
    3.5.2 Noise Clustering Approach
  3.6 Fuzzy Entropy Vector Quantisation
  3.7 A Comparison Between Conventional and Fuzzy Entropy Models
  3.8 Summary and Conclusion

4 Fuzzy C-Means Models
  4.1 Minimum Fuzzy Squared-Error Estimation
  4.2 Fuzzy C-Means Hidden Markov Models
    4.2.1 FCM Discrete HMM
    4.2.2 FCM Continuous HMM
    4.2.3 Noise Clustering Approach
  4.3 Fuzzy C-Means Gaussian Mixture Models
    4.3.1 Fuzzy C-Means GMM
    4.3.2 Noise Clustering Approach
  4.4 Fuzzy C-Means Vector Quantisation
  4.5 Comparison Between FCM and FE Models
  4.6 Summary and Conclusion

5 Hard Models
  5.1 From Fuzzy To Hard Models
  5.2 Hard Hidden Markov Models
    5.2.1 Hard Discrete HMM
    5.2.2 Hard Continuous HMM
  5.3 Hard Gaussian Mixture Models
  5.4 Summary and Conclusion

6 A Fuzzy Approach to Speaker Verification
  6.1 A Speaker Verification System
  6.2 Current Normalisation Methods
  6.3 Proposed Normalisation Methods
  6.4 The Likelihood Transformation
  6.5 Summary and Conclusion

7 Evaluation Experiments and Results
  7.1 Database Description
    7.1.1 The TI46 Database
    7.1.2 The ANDOSL Database
    7.1.3 The YOHO Database
  7.2 Speech Processing
  7.3 Algorithmic Issues
    7.3.1 Initialisation
    7.3.2 Constraints on Parameters during Training
  7.4 Isolated Word Recognition
    7.4.1 E Set Results
    7.4.2 10-Digit & 10-Command Set Results
    7.4.3 46-Word Set Results
  7.5 Speaker Identification
    7.5.1 TI46 Results
    7.5.2 ANDOSL Results
    7.5.3 YOHO Results
  7.6 Speaker Verification
    7.6.1 TI46 Results
    7.6.2 ANDOSL Results
    7.6.3 YOHO Results
  7.7 Summary and Conclusion

8 Extensions of the Thesis
  8.1 Possibility Theory-Based Approach
    8.1.1 Possibility Theory
    8.1.2 Possibility Distributions
    8.1.3 Maximum Possibility Rule
  8.2 Possibilistic C-Means Approach
    8.2.1 Possibilistic C-Means Clustering
    8.2.2 PCM Approach to FE-HMMs
    8.2.3 PCM Approach to FCM-HMMs
    8.2.4 PCM Approach to FE-GMMs
    8.2.5 PCM Approach to FCM-GMMs
    8.2.6 Summary and Conclusion

9 Conclusions and Future Research
  9.1 Conclusions
  9.2 Directions for Future Research

Bibliography

A List of Publications
List of Figures

2.1 The speech signal of the utterance "one" (a) in the long period of time from t = 0.3 sec to t = 0.6 sec and (b) in the short period of time from t = 0.4 sec to t = 0.42 sec
2.2 Block diagram of LPC front-end processor for speech and speaker recognition
2.3 An N-state left-to-right HMM with ∆i = 1
2.4 Relationships between HMM, GMM, and VQ techniques
2.5 A statistical classifier for isolated word recognition and speaker identification
2.6 Clustering techniques and their extended versions
3.1 Generating 3 clusters with different values of n: hard clustering as n → 0, clusters increase their overlap with increasing n > 0, and are identical to a single cluster as n → ∞
3.2 States at each time t = 1, ..., T are regarded as time-dependent fuzzy sets. There are N × T fuzzy states connected by arrows into N^T fuzzy state sequences in the fuzzy HMM
3.3 The observation sequence O belongs to fuzzy state sequences being in fuzzy state i at time t and fuzzy state j at time t + 1
3.4 The observation sequence X belongs to fuzzy state j and fuzzy mixture k at time t in the fuzzy continuous HMM
3.5 From fuzzy entropy models to conventional models
3.6 The membership function u_it with different values of the degree of fuzzy entropy n versus the distance d_it between vector x_t and cluster i
3.7 Fuzzy entropy models for speech and speaker recognition
4.1 The relationship between FE model groups versus the degree of fuzzy entropy n
4.2 The relationship between FCM model groups versus the degree of fuzziness m
4.3 The FCM membership function u_it with different values of the degree of fuzziness m versus the distance d_it between vector x_t and cluster i
4.4 Curves representing the functions used in the FE and FCM memberships, where x = d_it², m = 2 and n = 1
4.5 Fuzzy C-means models for speech and speaker recognition
5.1 From hard VQ to fuzzy VQ: an additional fuzzy entropy term for fuzzy entropy VQ, and a weighting exponent m > 1 on each u_it for fuzzy C-means VQ
5.2 From fuzzy VQ to (hard) VQ: n → 0 for FE-VQ or m → 1 for FCM-VQ or using the minimum distance rule to compute u_it directly
5.3 Mutual relations between fuzzy and hard models
5.4 Possible state sequences in a 3-state Bakis HMM and a 3-state fuzzy Bakis HMM
5.5 A possible single state sequence in a 3-state hard HMM
5.6 A mixture of three Gaussian distributions in the GMM or the fuzzy GMM
5.7 A set of three non-overlapping Gaussian distributions in the hard GMM
5.8 Relationships between hard models
6.1 A typical speaker verification system
6.2 The transformation T where T(P)/P increases and T(P) is non-positive for 0 ≤ P ≤ 1: values of 4 ratios at A, B, C, and D are moved to those at A', B', C', and D'
6.3 Histograms of speaker f7 in the TI46 using 16-mixture GMMs. The EER is 6.67% for Fig. 6.3a and 5.90% for Fig. 6.3b
7.1 Isolated word recognition error (%) versus the number of states N for the digit-set vocabulary, using left-to-right DHMMs, codebook size K = 16, TI46 database
7.2 Speaker identification error (%) versus the degree of fuzziness m using FCM-VQ speaker models, codebook size K = 16, TI46 corpus
7.3 Isolated word recognition error (%) versus the degree of fuzzy entropy n for the E-set vocabulary, using 6-state left-to-right FE-DHMMs, codebook size of 16, TI46 corpus
7.4 Speaker identification error rate (%) versus the number of mixtures for 16 speakers, using conventional GMMs, FCM-GMMs and NC-FCM-GMMs
7.5 Speaker identification error rate (%) versus the codebook size for 16 speakers, using VQ, FE-VQ and NC-FE-VQ codebooks
7.6 EERs (%) for GMM-based speaker verification performed on 16 speakers, using GMMs, FCM-GMMs and NC-FCM-GMMs
7.7 EERs (%) for VQ-based speaker verification performed on 16 speakers, using VQ, FE-VQ and NC-FE-VQ codebooks
8.1 PCM clustering in clustering techniques
8.2 PCM approach to FE models for speech and speaker recognition
8.3 PCM approach to FCM models for speech and speaker recognition
List of Tables

3.1 An example of memberships for the GMM and the FE-GMM
6.1 The likelihood values for 4 input utterances X_1-X_4 against the claimed speaker λ_0 and 3 impostors λ_1-λ_3, where X_1^c, X_2^c are from the claimed speaker and X_3^i, X_4^i are from impostors
6.2 Scores of 4 utterances using L_3(X) and L_8(X)
6.3 Scores of 4 utterances using L_3nc(X) and L_8nc(X)
7.1 Isolated word recognition error rates (%) for the E set
7.2 Speaker-dependent recognition error rates (%) for the E set
7.3 Isolated word recognition error rates (%) for the 10-digit set
7.4 Isolated word recognition error rates (%) for the 10-command set
7.5 Isolated word recognition error rates (%) for the 46-word set
7.6 Speaker-dependent recognition error rates (%) for the 46-word set
7.7 Speaker identification error rates (%) for the ANDOSL corpus using conventional GMMs, FE-GMMs and FCM-GMMs
7.8 Speaker identification error rates (%) for the YOHO corpus using conventional GMMs, hard GMMs and VQ codebooks
7.9 EER results (%) for the ANDOSL corpus using GMMs with different background speaker sets. Rows in bold are the current normalisation methods, others are the proposed methods. The index "nc" denotes noise clustering-based methods
7.10 Equal Error Rate (EER) results (%) for the YOHO corpus. Rows in bold are the current normalisation methods, others are the proposed methods
7.11 Comparisons of EER results (%) for the YOHO corpus using GMMs, hard GMMs and VQ codebooks. Rows in bold are the current normalisation methods, others are the proposed methods
List of Abbreviations

ANN     artificial neural network
CHMM    continuous hidden Markov model
DTW     dynamic time warping
DHMM    discrete hidden Markov model
EM      expectation maximisation
FE      fuzzy entropy
FCM     fuzzy C-means
GMM     Gaussian mixture model
HMM     hidden Markov model
LPC     linear predictive coding
MAP     maximum a posteriori
ML      maximum likelihood
MMI     maximum mutual information
NC      noise clustering
pdf     probability density function
PCM     possibilistic C-means
VQ      vector quantisation
Chapter 1
Introduction
Research in automatic speech and speaker recognition by machine has been conducted
for more than four decades. Speech recognition is the process of automatically recognising the linguistic content in a spoken utterance. Speaker recognition can be classified into two specific tasks: identification and verification. Speaker identification
is the process of determining who is speaking based on information obtained from
the speaker’s speech. Speaker verification is the process of accepting or rejecting the
identity claim of a speaker.
1.1 Current Approaches to Speech and Speaker Recognition
Three current approaches to speech and speaker recognition by machine are the
acoustic-phonetic approach, the pattern-recognition approach and the artificial intelligence approach. The acoustic-phonetic approach is based on the theory of acoustic phonetics which postulates that there exist finite, distinctive phonetic units in
spoken language and that the phonetic units are broadly characterised by sets of
properties that are manifest in the speech signal or its spectrum over time. The
acoustic-phonetic approach is usually based on segmentation of the speech signal and
subsequent feature extraction. The main problem with the acoustic-phonetic approach is the variability of the acoustic properties of a phoneme depending on many
factors including acoustic context, speaker gender, age, emotional state, etc. The
pattern-recognition approach generally uses the speech patterns directly, i.e. without
explicit feature determination and segmentation. This method has two steps: training
of speech patterns, and recognition of patterns via pattern comparison. Finally, the
artificial intelligence approach is a hybrid of the acoustic-phonetic approach and the
pattern-recognition approach in that it exploits ideas and concepts of both methods,
especially the use of an expert system for segmentation and labelling, and the use of
neural networks for learning the relationships between phonetic events and all known
inputs [Rabiner and Juang 1993].
1.1.1 Statistical Pattern Recognition Approach
In general, the pattern-recognition approach is the method of choice for speech and
speaker recognition because of its simplicity of use, proven high performance, and
relative robustness and invariance to different speech vocabularies, users, algorithms
and decision rules. The most successful approach to speech and speaker recognition
is to treat the speech signal as a stochastic pattern and to use a statistical pattern
recognition technique. The statistical formulation has its root in the classical Bayes
decision theory, which links a recognition task to the distribution estimation problem.
In order to be practically implementable, the distributions are usually parameterised
and thus the distribution estimation problem becomes a parameter estimation problem, where a reestimation algorithm using a set of parameter estimation equations
is established to find the right parametric form of the distributions. The unknown
parameters defining the distribution have to be estimated from the training data. To
obtain reliable parameter estimates, the training set needs to be of sufficient size in
relation to the number of parameters. When the amount of the training data is not
sufficient, the quality of the distribution parameter estimates cannot be guaranteed.
In other words, the minimum Bayes risk generally remains an unachievable lower
bound [Huang et al. 1996].
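As a concrete illustration of the link between recognition and distribution estimation, the maximum a posteriori (MAP) rule reviewed in Section 2.3.1 can be written, in generic notation with λ_i the parameterised model of class i and X the observation sequence, as

    i* = argmax_i P(λ_i | X) = argmax_i p(X | λ_i) P(λ_i).

The class-conditional distributions p(X | λ_i) are what the training phase must estimate, which is why the quality of the parameter estimates, and hence how closely the minimum Bayes risk can be approached, depends so directly on the amount of training data.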
1.1.2 Modelling Techniques: HMM, GMM and VQ
In the statistical pattern recognition approach, hidden Markov modelling of the speech
signal is the most important technique. It has been used extensively to model fundamental speech units in speech recognition because the hidden Markov model (HMM)
can adequately characterise both the temporally and spectrally varying nature of
the speech signal [Rabiner and Juang 1993]. In speech recognition, the left-to-right
HMM with only self and forward transitions is the simplest word and subword model.
In text-dependent speaker recognition, where training comprises spectral and temporal representations of specific utterances and testing must use the same utterances, the ergodic HMM with all possible transitions can be used for broad phonetic categorisation [Matsui and Furui 1992]. In text-independent speaker recognition, where training comprises a good representation of all the speaker's speech sounds and testing can be any utterance, there are no constraints on the training and test text and the temporal information has not been shown to be useful [Reynolds 1992]. The Gaussian mixture model (GMM) is used in this case. The GMM uses a mixture of Gaussian densities to model the distribution of the speaker-specific feature vectors. When little data is
available, the vector quantisation (VQ) technique is also effective in characterising
speaker-specific features. The VQ model is a codebook and is generated by clustering
the training feature vectors of each speaker.
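As an illustrative sketch only (not code from this thesis), the following Python fragment shows how a GMM speaker model of the kind described above is typically scored: each speaker is represented by mixture weights, means and diagonal covariances, and the identified speaker is the one whose model gives the highest average log-likelihood over the utterance's feature vectors. All names, and the diagonal-covariance choice, are assumptions made for this example.

    import numpy as np

    def gmm_log_likelihood(X, weights, means, variances):
        """Average log-likelihood of feature vectors X (shape T x d) under a
        diagonal-covariance Gaussian mixture model."""
        log_probs = []
        for w, mu, var in zip(weights, means, variances):
            # log of one weighted diagonal-covariance Gaussian density per frame
            log_det = np.sum(np.log(2.0 * np.pi * var))
            maha = np.sum((X - mu) ** 2 / var, axis=1)
            log_probs.append(np.log(w) - 0.5 * (log_det + maha))
        # log-sum-exp over mixture components, then average over frames
        frame_ll = np.logaddexp.reduce(np.stack(log_probs), axis=0)
        return float(frame_ll.mean())

    def identify_speaker(X, speaker_models):
        """speaker_models maps a speaker name to (weights, means, variances).
        Returns the name of the speaker whose GMM scores the utterance X best."""
        scores = {name: gmm_log_likelihood(X, *model)
                  for name, model in speaker_models.items()}
        return max(scores, key=scores.get)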
1.2 Fuzzy Set Theory-Based Approach

An alternative successful approach in pattern recognition is the fuzzy set theory-based approach. Fuzzy set theory was introduced by Zadeh [1965] to represent and
manipulate data and information that possess nonstatistical uncertainty. Fuzzy set
theory is a generalisation of conventional set theory that was introduced as a new
way of representing the vagueness or imprecision that is ever present in our daily
experience as well as in natural language [Bezdek 1993].
1.2.1 The Membership Function
The membership function is the basic idea in fuzzy set theory. The membership of a
point in a fuzzy set represents the degree to which the point belongs to this fuzzy set.
The first fuzzy approach to speech and speaker recognition was the use of the memberships as discriminant functions in decision making [Pal and Majumder 1977]. In fuzzy
clustering, the memberships are not known in advance and have to be estimated from
a training set of observations with known class labels. The membership estimation
procedure in fuzzy pattern recognition is called the abstraction [Bellman et al. 1966].
1.2.2 Clustering Techniques: FCM and NC
The most successful technique in fuzzy cluster analysis is fuzzy C-means (FCM) clustering, which is widely used in both theory and practical applications of fuzzy clustering techniques to unsupervised classification [Zadeh 1977]. FCM clustering [Dunn 1974,
Bezdek 1973] is an extension of hard C-means (HCM), also known as K-means clustering [Duda and Hart 1973]. A general estimation procedure for the FCM technique
has been established and its convergence has been shown [Bezdek and Pal 1992].
However, the FCM technique has a problem of sensitivity to outliers. The sum
of the memberships of a feature vector across classes is always equal to one both for
clean data and for noisy data. It would be more reasonable that, if the feature vector
comes from noisy data or outliers, the memberships should be as small as possible
for all classes and the sum should be smaller than one. This property is important
since all parameter estimates are computed based on these memberships. An idea of
a noise cluster has been proposed [Davé 1991] in the noise clustering (NC) technique
to deal with noisy data or outliers.
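To make the outlier problem concrete, the standard FCM membership update (reviewed in Section 2.4.6) can be written in generic notation, with d_it the distance between feature vector x_t and cluster i, C the number of clusters and m > 1 the degree of fuzziness:

    u_it = 1 / Σ_{j=1..C} (d_it / d_jt)^{2/(m-1)}

so that Σ_{i=1..C} u_it = 1 for every vector, outliers included. Noise clustering introduces an extra, fictitious noise cluster lying at a fixed distance δ from all vectors, which changes the update to

    u_it = d_it^{-2/(m-1)} / ( Σ_{j=1..C} d_jt^{-2/(m-1)} + δ^{-2/(m-1)} )

so that the memberships of a vector far from all C real clusters sum to less than one, as desired.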
All the above-mentioned subjects are reviewed with respect to speech and speaker recognition, statistical modelling and clustering in Chapter 2.
1.3 Problem Statement
In general, the most successful approach in speech and speaker recognition is the
statistical pattern recognition approach, where the HMM is the most important technique, with the GMM and VQ also used in speaker recognition. However, the performance of these techniques degrades rapidly in the context of insufficient training data and in the presence of noise or distortion. Fuzzy approaches with their adjustable parameters can hopefully reduce such degradation. In pattern recognition, the fuzzy set theory-based approach is one of the most successful approaches and
FCM clustering is the most important technique. Therefore, to obtain fuzzy pattern
recognition approaches to statistical methods in speech and speaker recognition, we
need to solve the following basic problems. In the training phase, these are: 1) how
to determine the fuzzy membership functions for the statistical models and 2) how
to estimate these fuzzy membership functions from a training set of observations?
In the recognition phase, the basic problem is: 3) how to use the fuzzy membership
functions as discriminant functions for recognition?
For the first problem, we begin with the HMM technique in the statistical pattern
recognition approach and through considering the HMM, we show how the concept of
the fuzzy membership function can be used. In hidden Markov modelling, the underlying assumption is that the observation sequence obtained from speech processing can
be characterised as generated by a stochastic process. Each observation is regarded
as the output of another stochastic process—the hidden Markov process—which is
governed by the output probability distribution. A first-order hidden Markov process
consists of a finite state sequence where the initial state is governed by an initial
state distribution and state transitions which occur at discrete time t are governed
by transition probabilities which only depend on the previous state. The observation
sequence is regarded as being produced by different state sequences with corresponding probabilities. This situation can be more effectively represented by using fuzzy
set theory, where states at each time t are regarded as time-dependent fuzzy sets and
called fuzzy states. The time-dependent fuzzy membership u_{s_t=i}(o_t) can be defined as the degree of belonging of the observation o_t to the fuzzy state s_t = i at time t.
However, the observations are always considered in the sequence O and related to
the state sequence S. Fuzzy state sequences are thus also defined as a sequence of
fuzzy states in time and the fuzzy membership function is defined for the observation
sequence O in fuzzy state sequence S, based on the fuzzy membership function of
the observation o_t in the fuzzy state s_t [Tran and Wagner 1999a]. For example, to
compute the state transition matrix A, we consider fuzzy states at time t and time
t+1 included in corresponding fuzzy state sequences and define the fuzzy membership
function u_{s_t=i, s_{t+1}=j}(O). This membership denotes the degree of belonging of the observation sequence O to fuzzy state sequences being in fuzzy state s_t = i at time t and fuzzy state s_{t+1} = j at time t+1. In this approach, probability and fuzziness are complementary rather than competitive. For the HMM, probability deals with stochastic
processes for the observation sequence and state sequences whereas fuzziness deals
with the relationship between these sequences [Tran and Wagner 1999c].
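For readers familiar with hidden Markov model notation, it is worth noting (as an interpretive remark rather than a statement taken from the thesis) that the probabilistic counterparts of these memberships in the conventional HMM are the standard Baum-Welch posteriors

    γ_t(i) = P(s_t = i | O, λ)    and    ξ_t(i, j) = P(s_t = i, s_{t+1} = j | O, λ),

that is, probabilities of the unobservable state variables given the observed sequence. The fuzzy memberships u_{s_t=i}(o_t) and u_{s_t=i, s_{t+1}=j}(O) play the same structural role in the reestimation equations, but they are obtained from a fuzzy clustering criterion rather than from Bayes' rule.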
For the second problem, estimating the fuzzy membership functions is based
on the selection of an optimisation criterion. The minimum squared-error criterion used in the FCM clustering and the NC techniques is very effective in cluster analysis [Bezdek and Pal 1992, Davé 1990], thus it can be applied to statistical modelling techniques. However, there are two sub-problems to be solved before
applying it. The first question is, how to obtain a relationship between the clustering and modelling techniques, since the goal of clustering techniques is to find
optimal partitions of data [Davé and Krishnapuram 1997] whereas the goal of statistical modelling techniques is to find the right parametric form of the distributions
[Juang et al. 1996]. For this sub-problem, a general distance for clustering techniques
is proposed [Tran and Wagner 1999b]. The distance is defined as a decreasing function of the component probability density, and hence grouping similar feature vectors
into a cluster becomes classification of these vectors into a component distribution.
Clusters are now represented by component distribution functions and hence characteristics of a cluster are not only its shape and location, but also the data density
in the cluster and possibly the temporal structure of data if the Markov process is
applied [Tran and Wagner 2000a]. Finding good partitions of data in clustering techniques thus leads to finding the right parametric form of distributions in the modelling
techniques. The second sub-problem is the relationship between the fuzzy and conventional models. Fuzzy models using FCM clustering can reduce to hard models
using HCM clustering if the degree of fuzziness m > 1 tends to 1 [Tran et al. 2000a].
However, the conventional HMM is not a hard model since there is more than one
possible state at each time t; therefore the relationship between the fuzzy HMM and the conventional HMM could not be established in the same way, since the hard HMM had not yet been defined. To solve this problem, we propose an alternative clustering technique called
fuzzy entropy (FE) clustering [Tran and Wagner 2000f] and apply this technique to
the HMM to obtain the fuzzy entropy HMM (FE-HMM) [Tran and Wagner 2000b].
The degree of fuzzy entropy n > 0 is introduced in the FE-HMM. As n tends to 1,
FE-HMMs reduce to conventional HMMs. We also propose the hard HMM where
only the best state sequence is employed by using a binary (zero-one) membership
function [Tran et al. 2000a]. A hard GMM is also proposed, which employs only
the most likely Gaussian distribution among the mixture of Gaussians to represent a
feature vector [Tran et al. 2000b].
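The following small numerical sketch (illustrative only, not code from the thesis) contrasts the three kinds of membership discussed in this section: a hard zero-one membership, the FCM membership with its degree of fuzziness m, and a maximum-entropy (Gibbs-form) membership with an adjustable degree n. The exponential form is an assumption chosen because it reproduces the behaviour described here (hard assignment as n → 0, increasingly uniform memberships as n grows); the thesis's own FE equations are developed in Chapter 3.

    import numpy as np

    def hard_membership(distances):
        """Binary (zero-one) membership: the nearest cluster gets 1, all others 0."""
        u = np.zeros_like(distances)
        u[np.argmin(distances)] = 1.0
        return u

    def fcm_membership(distances, m=2.0):
        """Standard FCM membership with degree of fuzziness m > 1."""
        ratios = (distances[:, None] / distances[None, :]) ** (2.0 / (m - 1.0))
        return 1.0 / ratios.sum(axis=1)

    def entropy_membership(distances, n=1.0):
        """Maximum-entropy (Gibbs-form) membership with adjustable degree n > 0.
        Illustrative assumption only: hard as n -> 0, nearly uniform for large n."""
        w = np.exp(-(distances - distances.min()) / n)  # shift for numerical stability
        return w / w.sum()

    d = np.array([0.5, 1.0, 2.0])        # distances from one vector to 3 clusters
    print(hard_membership(d))             # [1. 0. 0.]
    print(fcm_membership(d, m=2.0))       # soft memberships summing to 1
    print(entropy_membership(d, n=0.1))   # nearly hard
    print(entropy_membership(d, n=10.0))  # nearly uniform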
For the third problem, it can be seen that the roles of the fuzzy membership function in fuzzy set theory and of the a posteriori probability in Bayes decision theory are quite similar. Therefore the currently used maximum a posteriori
(MAP) decision rule can be generalised to the maximum fuzzy membership decision
rule [Tran et al. 1998b, Tran et al. 1998d]. Depending on which fuzzy technique is
applied, we can find a suitable form for the fuzzy membership function. For example,
in speaker verification, the fuzzy membership of an input utterance in the claimed
speaker’s fuzzy set of utterances is used as a similarity score to compare with a given
threshold in order to accept or reject this speaker. The fuzzy membership function
is determined as the ratio of functions of the claimed speaker's and impostors' likelihood functions [Tran and Wagner 2000c, Tran and Wagner 2000d].
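To fix ideas, the conventional score that such a membership score generalises is the normalised log-likelihood; in generic notation (not the thesis's own), with λ_0 the claimed speaker's model and λ_1, ..., λ_B a set of background (impostor) models,

    L(X) = log p(X | λ_0) - log max_{1≤i≤B} p(X | λ_i)

or with the maximum replaced by an average over the background models, and the claim is accepted when L(X) exceeds a preset threshold. The fuzzy membership scores of Chapter 6 are likewise built from ratios of functions of these claimed-speaker and impostor likelihoods.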
1.4 Contributions of This Thesis
Based on solving the above-mentioned problems, fuzzy approaches to speech and
speaker recognition are proposed and evaluated in this thesis as follows.
1.4.1 Fuzzy Entropy Models
This fuzzy approach is presented in Chapter 3. FE models are based on a basic
algorithm termed FE clustering. The goal of this approach is not only to propose
a new fuzzy approach but also to show that statistical models, such as the HMM
and the GMM in the maximum likelihood scheme, can be viewed as fuzzy models, where probabilities of unobservable data, given observable data, are used as
fuzzy membership functions. Fuzzy entropy clustering, the maximum fuzzy likelihood criterion, the fuzzy EM algorithm, the fuzzy membership function as well
as FE-HMMs, FE-GMMs and FE-VQ and their NC versions are proposed in this
chapter [Tran and Wagner 2000a, Tran and Wagner 2000b, Tran and Wagner 2000f,
Tran and Wagner 2000g, Tran 1999].
The adjustment of the degree of fuzzy entropy n in FE models is an advantage.
When conventional models do not work well because of the insufficient training data
problem or the complexity of the speech data, such as the nine English E-set letters,
a suitable value of n can be found to obtain better models.
1.4.2 Fuzzy C-Means Models
Chapter 4 presents this fuzzy approach. It is based on FCM clustering in fuzzy
pattern recognition. FCM models are estimated by the minimum fuzzy squared-error
criterion used in FCM clustering. The fuzzy EM algorithm is reformulated for this
criterion. FCM-HMMs, FCM-GMMs and FCM-VQ are respectively presented as well
as their NC versions. A discussion on the role of fuzzy memberships of FCM models
and a comparison between FCM models and FE models are also presented. Similarly
to FE models, FCM models also have an adjustable parameter called the degree of
fuzziness m > 1. Better models can be obtained in the case of the insufficient training
data problem and the complexity of speech data problem using a suitable value of m
[Tran and Wagner 1999a]-[Tran and Wagner 1999e].
1.4.3 Hard Models
As the degrees of fuzzy entropy and fuzziness tend to their minimum values, both
fuzzy entropy and fuzzy C-means models tend to the same limit which is the corresponding hard model. The simplest hard model is the VQ model, which is effective for speaker recognition. This chapter proposes new hard models—hard HMMs
and hard GMMs. These models emerge as interesting consequences of investigating fuzzy approaches. The hard HMM employs only the best path for estimating
model parameters and for recognition and the hard GMM employs only the most
likely Gaussian distribution among the mixture of Gaussians to represent a feature vector. Hard models can be very useful because they are simple yet efficient
[Tran et al. 2000a, Tran et al. 2000b].
1.4.4 A Fuzzy Approach to Speaker Verification
An even more interesting fuzzy approach is proposed in Chapter 6. The speaker verification process is reconsidered from the viewpoint of fuzzy set theory and hence
a likelihood transformation and seven fuzzy normalisation methods are proposed
[Tran and Wagner 2000c, Tran and Wagner 2000d]. This fuzzy approach also leads
to a noise clustering-based version for all normalisation methods, which improves
speaker verification performance markedly.
1.4.5 Evaluation Experiments and Results
The evaluation of FE, FCM and hard models is presented in Chapter 7. Proposed
normalisation methods for speaker verification are also evaluated in this chapter.
The three speech corpora used in the experiments were the TI46, the ANDOSL and
the YOHO corpora. Isolated word recognition experiments were performed on the
E set, 10-digit set, 10-command set and 46-word set of the TI46 corpus. Speaker
identification and verification experiments were performed on the TI46 (16 speakers),
the ANDOSL (108 speakers) and the YOHO (138 speakers) corpora. Experiments
show that fuzzy models and their noise clustering versions outperform conventional
models in most of the experiments. Hard hidden Markov models also achieved good
results.
1.5 Extensions of This Thesis
The fuzzy membership function in fuzzy set theory and the a posteriori probability in
the Bayes decision theory have very similar meanings; however, it can be shown that the minimum Bayes risk for the recognition process is obtained by the maximum a posteriori probability rule whereas the maximum membership rule does not lead to such a minimum risk. This problem can be overcome by using a well-developed branch of fuzzy set theory, namely possibility theory. A possibilistic pattern recognition approach, in our view, is nearly as developed as the statistical pattern recognition
approach. In the last chapter, we present the fundamentals of possibility theory and
propose a possibilistic C-means approach to speech and speaker recognition. Future
research into the possibility approach is suggested.
Chapter 2
Literature Review
This chapter provides a background review of statistical modelling techniques in speech
and speaker recognition and clustering techniques in pattern recognition. The essential characteristics of speech signals and speech processing are summarised in Section
2.1. An overview of the various disciplines required for understanding aspects of speech
and speaker recognition is presented in Section 2.2. Statistical modelling techniques are
overviewed in Section 2.3. In this section, we first attend to the basic issues in statistical pattern recognition techniques including Bayes decision theory, the distribution
estimation problem, maximum likelihood estimation and the expectation-maximisation
algorithm. Second, three widely used statistical modelling techniques—hidden Markov
modelling, Gaussian mixture modelling and vector quantisation—are described. Fuzzy
cluster analysis techniques are overviewed in Section 2.4. Fuzzy set theory, the fuzzy
membership function, the role of cluster analysis in pattern recognition and three basic
clustering techniques—hard C-means, fuzzy C-means and noise clustering—are reviewed
in this section. The last section, Section 2.5, reviews the literature on fuzzy approaches to speech
and speaker recognition.
2.1 Speech Characteristics
Speech is the most natural means of communication among human beings, therefore it
plays a key role in the development of a natural interface to enhance human-machine
communication. This section briefly presents the nature of speech sounds and features
of speech signals that lead to methods used to process speech.
2.1.1 Speech Sounds
Speech is produced as a sequence of speech sounds corresponding to the message
to be conveyed. The state of the vocal cords as well as the positions, shapes, and
sizes of the various articulators, change over time in the speech production process
[O’Shaughnessy 1987]. There are three states of the vocal cords: silence, unvoiced and
voiced. Unvoiced sounds are produced when the glottis is open and the vocal cords
are not vibrating, so the resulting speech waveform is aperiodic or random in nature.
Voiced sounds are produced by forcing air through the glottis with the tension of the
vocal cords adjusted so that they vibrate in a relaxation oscillation with a resulting
speech waveform which is quasi-periodic [Rabiner and Schafer 1978]. Note that the
segmentation of the waveform into well-defined regions of silence, unvoiced, and voiced
signals is not exact. It is difficult to distinguish a weak, unvoiced sound from silence, or
a weak, voiced sound from unvoiced sounds or even silence [Rabiner and Juang 1993].
Phonemes are the smallest distinctive class of individual speech sounds in a language. The number of phonemes varies according to different linguists. Vowel
sounds are produced by exciting an essentially fixed vocal tract shape with quasiperiodic pulses of air caused by the vibration of the vocal cords. Vowels have the
largest amplitudes among phonemes and range in duration from 50 to 400 ms in normal speech [O’Shaughnessy 1987]. Diphthong sounds are gliding monosyllabic speech
sounds that start at or near the articulatory position for one vowel and move to or toward the position for another. They are produced by varying the vocal tract smoothly
between vowel configurations appropriate to the diphthong [Rabiner and Juang 1993].
The group of sounds consisting of /w/, /l/, /r/, and /j/ is quite difficult to characterise: /l/ and /r/ are called semivowels because of their vowel-like nature, while /w/ and
/j/ are called glides because they are generally characterised by a brief gliding transition of the vocal tract shape between adjacent phonemes. The nasal consonants
/m/, /n/, and /ŋ/ are produced with glottal excitation and the vocal tract totally
constricted at some point along the oral passageway while the velum is open and
allows air flow through the nasal tract. The unvoiced fricatives /f/, /θ/, /s/, and
/sh/ are produced by air flowing over a constriction in the vocal tract, with the location of the constriction determining the particular fricative sound produced. The
voiced fricatives /v/, /th/, /z/, and /zh/ are the counterparts of the unvoiced fricatives /f/, /θ/, /s/, and /sh/, respectively. For voiced fricatives, the vocal cords
are vibrating, and thus one excitation source is at the glottis. The unvoiced stop
consonants /p/, /t/, and /k/, and the voiced stop consonants /b/, /d/, and /g/
are transient, noncontinuant sounds produced by building up pressure behind a total constriction somewhere in the oral tract and then suddenly releasing the pressure. For the voiced stop consonants, the release of sound energy is accompanied
by vibrating vocal cords while for the unvoiced stop consonants the glottis is open
[Harrington and Cassidy 1996, Rabiner and Juang 1993, Flanagan 1972].
2.1.2 Speech Signals
The speech signal produced by the human vocal system is one of the most complex
signals known. In addition to the inherent physiological complexity of the human
vocal tract, the physical production system differs from one person to another. Even
when an utterance is repeated by the same person, the observed speech signal is
different each time. Moreover, the speech signal is influenced by the speaking environment, the channel used to transmit the signal, and, when recording it, also by the
transducer used to capture the signal [Rabiner et al. 1996].
The speech signal is a slowly time-varying signal in the sense that, when examined over a sufficiently short period of time (between 5 and 100 ms, depending on the speech sound), its characteristics are approximately stationary. But over longer periods of time, the signal characteristics are non-stationary. They change to reflect the
sequence of different speech sounds being spoken [Juang et al. 1996]. An illustration
of this characteristic is given in Figure 2.1, which shows the time waveform corresponding to the word “one” as spoken by a female speaker. The non-stationarity is
observed in the long period of time from t = 0.3 sec to t = 0.6 sec (300 msec) in
Figure 2.1.a. In the short period of time from t = 0.4 sec to t = 0.42 sec (20 msec),
Figure 2.1.b shows the stationarity of the speech signal.
The “quasi-stationarity” is the first characteristic of speech that distinguishes
it from other random, non-stationary signals.

[Figure 2.1: The speech signal of the utterance "one" (a) in the long period of time from t = 0.3 sec to t = 0.6 sec and (b) in the short period of time from t = 0.4 sec to t = 0.42 sec]

Based on this characterisation of the speech signal, a reasonable speech model should have the following components. First,
short-time measurements at an interval of the order of 10 ms are to be made along
the pertinent speech dimensions that best carry the relevant information for linguistic or speaker distinction. Second, because of the existence of the quasi-stationary
region, the neighbouring short-time measurements on the order of 100 ms need to
be simultaneously considered, either as a group of identically and independently distributed observations or as a segment of a non-stationary random process covering
two quasi-stationary regions. The last component is a mechanism that describes the
sound change behaviour among the sound segments in the utterance. This characteristic takes into account the implicit structure of the utterance, words, syntax, and so
on in a probability distribution sense [Juang et al. 1996].
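As a hedged illustration of the first of these components (not code from the thesis), short-time analysis is usually implemented by cutting the waveform into overlapping frames; the 25 ms window and 10 ms shift below are common choices consistent with the analysis interval mentioned above.

    import numpy as np

    def frame_signal(signal, sample_rate, frame_ms=25.0, shift_ms=10.0):
        """Cut a speech waveform into overlapping short-time frames.
        Assumes the signal is at least one frame long; window and shift
        lengths here are illustrative choices, not values from the thesis."""
        frame_len = int(sample_rate * frame_ms / 1000.0)
        shift = int(sample_rate * shift_ms / 1000.0)
        n_frames = 1 + (len(signal) - frame_len) // shift
        return np.stack([signal[i * shift: i * shift + frame_len]
                         for i in range(n_frames)])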
2.1.3 Speech Processing
The speech signal can be parametrically represented by a number of variables related
to short time energy, fundamental frequency and sound spectrum. Probably the most
important parametric representation of speech is the short time spectral envelope, in
which the two most common choices of spectral analysis are the filterbank and the
linear predictive coding spectral analysis models. In the filterbank model, the speech
signal is passed through a bank of Q bandpass filters whose coverage spans the frequency range of interest in the signal (e.g., 300-3000 Hz for telephone-quality signals,
50-8000 Hz for broadband signals). The individual filters can overlap in frequency
and the output of a bandpass filter is the short-time spectral representation of the
speech signal. The linear predictive coding (LPC) model performs spectral analysis on blocks of speech (speech frames) with an all-pole modelling constraint. Each
individual frame is windowed so as to minimise the signal discontinuities at the beginning and end of each frame. Thus the output of the LPC spectral analysis block
is a vector of coefficients that specify the spectrum of an all-pole model that best
matches the speech signal [Rabiner and Schafer 1978]. The most common parameter
set derived from either LPC or filterbank spectra is a vector of cepstral coefficients.
The cepstral coefficients c_m(t), m = 1, ..., Q, which are the coefficients of the Fourier
transform representation of the log magnitude spectrum of speech frame t, have been
shown to be a more robust, reliable feature set for speech recognition than the spectral vectors. The temporal cepstral derivatives (also known as delta cepstra) are
often used as additional features to model trajectory information [Campbell 1997].
For each speech frame, the result of the feature analysis is a vector of Q weighted
cepstral coefficients and an appended vector of Q cepstral time derivatives as follows
[Rabiner and Juang 1993]
    x'_t = (ĉ_1(t), ..., ĉ_Q(t), Δĉ_1(t), ..., Δĉ_Q(t))                    (2.1)

where t is the index of the speech frame, x'_t is the transpose of the vector x_t with 2Q components, and ĉ_m(t) = w_m c_m(t), where w_m is a weighting (liftering) function that truncates the computation and de-emphasises c_m(t) around m = 1 and around m = Q:

    w_m = 1 + (Q/2) sin(πm/Q),        1 ≤ m ≤ Q                           (2.2)

If second-order temporal derivatives Δ²ĉ_m(t) are computed, these are appended to the vector x_t, giving a vector with 3Q components.
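A minimal sketch of this feature construction, assuming the raw cepstral coefficients c_m(t) have already been computed for each frame; the delta cepstrum here is a simple first difference, although regression-based estimates are also common in practice:

    import numpy as np

    def lifter_weights(Q):
        """Raised-sine weighting function w_m of equation (2.2)."""
        m = np.arange(1, Q + 1)
        return 1.0 + (Q / 2.0) * np.sin(np.pi * m / Q)

    def build_feature_vectors(cepstra):
        """cepstra: array of shape (T, Q) holding c_m(t) for T frames.
        Returns an array of shape (T, 2Q): weighted cepstra followed by
        their temporal derivatives, as in equation (2.1)."""
        T, Q = cepstra.shape
        c_hat = cepstra * lifter_weights(Q)      # ĉ_m(t) = w_m c_m(t)
        delta = np.zeros_like(c_hat)             # Δĉ_m(t), simple first difference
        delta[1:] = c_hat[1:] - c_hat[:-1]
        return np.hstack([c_hat, delta])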
[Figure 2.2: Block diagram of the LPC front-end processor for speech and speaker recognition: digitised speech → preemphasis → frame blocking → windowing → parameter computing → parameter weighting → temporal derivatives]

A block diagram of the LPC front-end processor is shown in Figure 2.2. The speech signal is preemphasised to flatten the spectrum and is then blocked into frames. Frames
are windowed with a Hamming window, the typical window for the autocorrelation method of LPC. Then the cepstral coefficients c_m(t) weighted by w_m and the temporal cepstral
derivatives are computed for each frame.
2.1.4 Summary
The quasi-stationarity is an important characteristic of speech. The speech signal
after spectral analysis is converted into a feature vector sequence. The current most
commonly used short-term measurements are cepstral coefficients which form a robust, reliable feature set of speech for speech and speaker recognition.
2.2 Speech and Speaker Recognition
Recognising the linguistic content in a spoken utterance and identifying the talker of the utterance through developing algorithms and implementing them on machines are the goals of speech and speaker recognition [C.-H. Lee et al. 1996]. A brief overview
of some of the fundamental aspects of speech and speaker recognition is given in this
section.
2.2.1 Speech Recognition
Broadly speaking, there are three approaches to speech recognition by machine,
namely, the acoustic-phonetic approach, the pattern-recognition approach, and the
artificial intelligence approach. The acoustic-phonetic approach is based on the theory
2.2 Speech and Speaker Recognition
16
of acoustic phonetics that postulates that there exists a finite set of distinctive phonetic units in spoken language and that the phonetic units are broadly characterised
by sets of properties that are manifest in the speech signal or its spectrum over time.
The problem with this approach is the fact that the degree to which each phonetic
property is realised in the acoustic signal varies greatly between speakers, between
phonetic contexts and even between repeated realisations of the phoneme by the same
speaker in the same context. This approach generally requires the segmentation of the
speech signal into acoustic-phonetic units and the identification of those units through
their known properties or features. The pattern-recognition approach is basically one
in which the speech patterns are used directly without explicit feature determination and segmentation. This method has two steps: training of speech patterns, and
recognition of patterns via pattern comparison. The artificial intelligence approach is
a hybrid of the acoustic-phonetic approach and the pattern-recognition approach in
that it exploits ideas and concepts of both methods, especially the use of an expert
system for segmentation and labelling, and the use of neural networks for learning the
relationships between phonetic events and all known inputs. Currently, the pattern-recognition approach is the method of choice for speech recognition because of its
simplicity of use, proven high performance, robustness to different acoustic-phonetic
realisations and invariance to different speech vocabularies, users, algorithms and
decision rules [Rabiner and Juang 1993].
Depending on the mode of speech that the system is designed to handle, three
tasks of speech recognition can be distinguished: isolated-word, connected-word, and
continuous speech recognition. Continuous speech recognition allows natural conversational speech—150-250 words/min, with little or no adaptation of speaking style
imposed on system users. Isolated-word recognition requires the speaker to pause for
at least 100-250 ms after each word. It is unnatural for speakers and slows the processing rate to about 20-100 words/min. Continuous speech recognition is much more
difficult than isolated word recognition due to the absence of word boundary information. Connected-word speech recognition represents a compromise between the two extremes: the speaker need not pause but must pronounce and stress each word clearly
[O’Shaughnessy 1987]. Restrictions on the vocabulary size differentiate speech recognition systems. Small vocabulary is about 100-200 words, large vocabulary is about
2.2 Speech and Speaker Recognition
17
1000 words and very large vocabulary—5000 words or greater. An alternative factor affecting speech recognition performance is speaker dependence/independence. Generally, speaker-dependent systems achieve better recognition performance than speakerindependent systems—identifying speech from many talkers—because of the limited
variability in the speech signal coming from a single speaker. Speaker-dependent systems demonstrate good performance only for speakers who have previously trained
the system [Kewley-Port 1995].
Research in automatic speech and speaker recognition has been conducted for almost four decades; the earliest attempts to devise systems for automatic speech recognition by machine were made in the 1950s. Several fundamental
ideas in speech recognition were published in the 1960s and speech-recognition research achieved a number of significant milestones in the 1970s. Just as isolated
word recognition was a key focus of research in the 1970s, the problem of connected
word recognition was a focus of research in the 1980s. Speech research in the 1980s
was characterised by a shift in technology from template-based approaches to statistical modelling methods, especially the hidden Markov model approach [Rabiner 1989].
Since then, hidden Markov model techniques have become widely applied in virtually
every speech-recognition system.
Speech recognition systems have been developed for a wide variety of applications
both within telecommunications and in the business arena. In telecommunications, a
speech recognition system can provide information or access to data or services over
the telephone line. It can also provide recognition capability on the desktop/office
including voice control of PC and workstation environments. In manufacturing and
business, a recognition capability is provided to aid in the manufacturing processes.
Other applications include the use of speech recognition in toys and games. A significant portion of the research in speech processing in the past few years has gone into
studying practical methods for speech recognition around the world. In the United States, major research efforts have been carried out at AT&T (the Next-Generation Text-to-Speech System) [AT&T’s web site], IBM (ViaVoice Speech Recognition and the TANGORA System) [IBM’s web site, Das and Picheny 1996], BBN (the BYBLOS and
SPIN Systems) [BBN’s web site], Dragon (the Dragon NaturallySpeaking Products)
[DRAGON’s web site], CMU (the SPHINX-II Systems) [Huang et al. 1996], Lincoln
Laboratory [Paul 1989] and MIT (the Spoken Language Systems) [MIT’s web site].
The Hearing Health Care Research Unit Projects [Western Ontario’s web site] and
the INRS 86,000-word isolated word recognition system in Canada as well as the
Philips rail information system, the CSELT system for Eurorail information services [CSELT’s web site], the University of Duisburg [Duisburg’s web site], the Cambridge University systems [Cambridge’s web site], and the LIMSI voice recognition
[LIMSI’s web site] in Europe, are examples of the current activity in speech recognition research. Large vocabulary recognition systems are being developed based on
the concept of interpreting telephony and telephone directory assistance in Japan.
Syllable-recognisers have been designed to handle large vocabulary Mandarin dictation in China and Taiwan [Rabiner et al. 1996].
2.2.2 Speaker Recognition
Compared to speech recognition, there has been much less research in speaker recognition because fewer applications exist than for speech recognition. Speaker recognition
is the process of automatically recognising who is speaking based on information obtained from speech waves. Speaker recognition techniques can be used to verify the
identity claimed by people accessing certain protected systems. It enables access control of various services by voice. Voice dialing, banking over a telephone network,
database access services, security control for confidential information, remote access
of computers and the use for forensic purposes are important applications of speaker
recognition technology [Furui 1997, Kunzel 1994].
Variation in signal characteristics from trial to trial is the most important factor affecting speaker recognition performance. Variations arise not only between
speakers themselves but also from differences in recording and transmission conditions, noise, and from a variety of psychological and physiological functions within
an individual speaker. Normalisation and adaptation techniques have been applied
to compensate for these variations [Matsui and Furui 1993, Rosenberg et al. 1992,
Higgins et al. 1991, Varga and Moore 1990, Gish 1990]. Speaker recognition can be
classified into two specific tasks: identification and verification. Speaker identification is the process of determining which one of the voices known to the system best
matches the input voice sample. When an unknown speaker must be identified as one
of the set of known speakers, the task is known as closed-set speaker identification. If
the input voice sample does not have a close enough match to any one of the known speakers and the system can produce a “no match” decision [Reynolds 1992], the task
is known as open-set speaker identification. Speaker verification is the process of accepting or rejecting the identity claim of a speaker. An identity claim is made by
an unknown speaker, and an utterance of this unknown speaker is compared with
the model for the speaker whose identity is claimed. If the match is good enough,
that is, above a given threshold, the identity claim is accepted. The use of a “cohort
speaker” set that is representative of the population close to the claimed speaker
has been proposed [Rosenberg et al. 1992]. In all verification paradigms, there are
two classes of errors: false rejections and false acceptances. A false rejection occurs
when the system incorrectly rejects a true speaker and a false acceptance occurs when the system incorrectly accepts an impostor. An equal error rate condition is often
used to adjust system parameters so that the two types of errors are equally likely
[O’Shaughnessy 1987].
Speaker recognition methods can also be divided into text-dependent and text-independent. When the same text is used for both training and testing, the system is
said to be text-dependent. For text-independent operation, the text used to test the
system is theoretically unconstrained [Furui 1996]. Both text-dependent and text-independent speaker recognition systems can be defeated by playing back recordings of a
registered speaker. To overcome this problem, a small set of pass phrases can be used,
one of which is randomly chosen every time the system is used [Higgins et al. 1991].
Another method is text-prompted speaker recognition, which prompts the user for a
new pass phrase on every occasion [Matsui and Furui 1993]. An extension of speaker
recognition technology is the automatic extraction of the “turns” of each speaker
from a dialogue involving two or more speakers [Gish et al. 1991, Siu et al. 1992,
Wilcox et al. 1994].
2.2.3 Summary
Statistical pattern recognition is the method of choice for speech and speaker recognition, in which hidden Markov modelling of the speech signal is the most important
technique that has helped to advance the state of the art of automatic speech and
speaker recognition.
2.3 Statistical Modelling Techniques
This section begins with a brief review of the classical Bayes decision theory and
its application to the formulation of statistical pattern recognition problems. The
distribution estimation problem in classifier design is then discussed in the Bayes
decision theory framework. Maximum likelihood estimation is reviewed as one
of the most widely used parametric unsupervised learning methods. Finally, three
widely-used techniques, hidden Markov modelling, Gaussian mixture modelling and
vector quantisation modelling, are reviewed in this framework.
2.3.1 Maximum A Posteriori Rule
The task of a recogniser (classifier) is to achieve the minimum recognition error rate.
A loss function is defined to measure this performance. It is generally non-negative
with a value of zero representing correct recognition [Juang et al. 1996]. Let X be
a random observation sequence from an information source, consisting of M classes
of events. The task of a recogniser is to correctly classify each X into one of the
M classes Ci , i = 1, 2, . . . , M . Suppose that when X truly belongs to class Cj but
the recogniser classifies X as belonging to class Ci , we incur a loss ℓ(Ci |Cj ). Since
the a posteriori probability P (Cj |X) is the probability that the true class is Cj , the
expected loss or risk associated with classifying X to class Ci is [Duda and Hart 1973]
R(Ci|X) = Σ_{j=1}^{M} ℓ(Ci|Cj) P(Cj|X)    (2.3)
In speech and speaker recognition, the following zero-one loss function is usually
chosen
ℓ(Ci|Cj) = 0 if i = j, and 1 if i ≠ j,   i, j = 1, ..., M    (2.4)
This loss function assigns no loss to correct classification and a unit loss to any error
regardless of the class. The conditional loss becomes
R(Ci|X) = Σ_{j≠i} P(Cj|X) = 1 − P(Ci|X)    (2.5)
Therefore in order to achieve the minimum error rate classification, we have to select
the decision that Ci is correct if the a posteriori probability P(Ci|X) is maximum
[Allerhand 1987]
C(X) = Ci   if   P(Ci|X) = max_{1≤j≤M} P(Cj|X)    (2.6)
The decision rule of (2.6) is called the maximum a posteriori (MAP) decision rule
and the minimum error rate achieved by the MAP decision is called Bayes risk
[Duda and Hart 1973].
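As a minimal illustration of the MAP rule with the zero-one loss, the following Python sketch selects the class whose a posteriori probability is maximum. The function name and the numerical values are invented for illustration only and are not part of the thesis.

```python
import numpy as np

def map_decision(likelihoods, priors):
    """Return the index of the class maximising the a posteriori probability.

    likelihoods: array of P(X | C_j), priors: array of P(C_j).
    Under the zero-one loss of (2.4), choosing the class with maximum
    P(C_j | X) (proportional to P(X | C_j) P(C_j)) minimises the risk of (2.5)."""
    posteriors = likelihoods * priors            # numerator of Bayes' rule
    posteriors = posteriors / posteriors.sum()   # normalisation by P(X) does not change the argmax
    return int(np.argmax(posteriors))

# Example with M = 3 classes and made-up numbers
print(map_decision(np.array([0.2, 0.5, 0.1]), np.array([1/3, 1/3, 1/3])))  # -> 1
```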
2.3.2 Distribution Estimation Problem
For the implementation of the MAP rule, the required knowledge for an optimal
classification decision is thus the set of a posteriori probabilities. However, these
probabilities are not known in advance and have to be estimated from a training set
of observations with known class labels. The Bayes decision theory thus effectively
transforms the classifier design problem into the distribution estimation problem. This
is the basis of the statistical approach to pattern recognition [Juang et al. 1996].
The a posteriori probability can be computed by using the Bayes rule
P(Cj|X) = P(X|Cj) P(Cj) / P(X)    (2.7)
It can be seen from (2.7) that decision making based on the a posteriori probability
employs both a priori knowledge from the a priori probability P (Cj ) together with
present observed data from the conditional probability P (X|Cj ). For the simple
case of isolated word recognition, the observations are the word utterances and the
class labels are the word identities. The conditional probability P (X|Cj ) is often
referred to as the acoustic model and the a priori probability P (Cj ) is known as
the language model [C.-H. Lee et al. 1996]. In order to be practically implementable,
the acoustic models are usually parameterised, and thus the distribution estimation
problem becomes a parameter estimation problem, where a reestimation algorithm and
a set of parameter estimation equations are established to find the best parameter set
λi for each class Ci , i = 1, . . . , M based on the given optimisation criterion. The final
task is to determine the right parametric form of the distributions and to estimate
the unknown parameters defining the distribution from the training data. To obtain
reliable parameter estimates, the training set needs to be of sufficient size in relation to
the number of parameters. However, collecting and labelling data are labor intensive
and resource demanding processes. When the amount of the training data is not
sufficient, the quality of the distribution parameter estimates cannot be guaranteed.
In other words, a true MAP decision can rarely be implemented and the minimum
Bayes risk generally remains an unachievable lower bound [Juang et al. 1996].
2.3.3 Maximum Likelihood Estimation
As discussed above, the distribution estimation problem P (X|C) in the acoustic modelling approach becomes the parameter estimation problem P (X|λ). If λj denotes the
parameter set used to model a particular class Cj , the likelihood function of model
λj is defined as the probability P (X|λj ) treated as a function of the model λj . Maximising P (X|λj ) over λj is referred to as the maximum likelihood (ML) estimation
problem. It has been shown that if the model is capable of representing the true
distribution and enough training data is available, the ML estimate will be the best
estimate of the true parameters [Nadas 1983]. However, if the form of the distribution
is not known or the amount of training data is insufficient, the resulting parameter
set is not guaranteed to produce a Bayes classifier.
The expectation-maximisation (EM) algorithm proposed by Dempster, Laird and
Rubin [1977] is a general approach to the iterative computation of ML estimates
when the observation sequence can be viewed as incomplete data. Each iteration
of this algorithm consists of an expectation (E) step followed by a maximisation
(M) step. Many of its extensions and variations are popular tools for modal inference in a wide variety of statistical models in the physical, medical and biological
sciences [Booth and Hobert 1999, Liu et al. 1998, Freitas 1998, Ambroise et al. 1997,
Ghahramani 1995, Fessler and Hero 1994, Liu and Rubin 1994].
In unsupervised learning, information on the class and state is unavailable; therefore the class and state are unobservable and only the data X are observable. Observable data are called incomplete data because they are missing the unobservable
data, and data composed both of observable data and unobservable data are called
complete data [Huang et al. 1990]. The purpose of the EM algorithm is to maximise
the log-likelihood log P (X|λ) from incomplete data. Suppose a measure space Y of
unobservable data exists corresponding to a measure space X of observable (incomplete) data. For given X ∈ X, Y ∈ Y, and the parameter model set λ, let P (X|λ)
and P (Y |λ) be probability distribution functions defined on X and Y respectively.
To maximise the log-likelihood of the observable data X over λ, we obtain
L(X, λ) = log P (X|λ) = log P (X, Y |λ) − log P (Y |X, λ)
(2.8)
For two parameter sets λ and λ̄, the expectation of the incomplete log-likelihood L(X, λ̄) over the complete data (X, Y) conditioned on X and λ is
E[L(X, λ̄)|X, λ] = E[log P(X|λ̄)|X, λ] = log P(X|λ̄) = L(X, λ̄)    (2.9)
where E[·|X, λ] is the expectation over the complete data (X, Y) conditioned on X and λ.
Using (2.8), we obtain
L(X, λ̄) = Q(λ, λ̄) − H(λ, λ̄)    (2.10)
where
Q(λ, λ̄) = E[log P(X, Y|λ̄)|X, λ]   and   H(λ, λ̄) = E[log P(Y|X, λ̄)|X, λ]    (2.11)
The basis of the EM algorithm lies in the fact that if Q(λ, λ̄) ≥ Q(λ, λ) then L(X, λ̄) ≥ L(X, λ), since it follows from Jensen’s inequality that H(λ, λ̄) ≤ H(λ, λ). This implies
that L(X, λ) increases monotonically on any iteration of parameter updates from λ
to λ̄ via maximisation of the Q-function. When Y is discrete, the Q-function and the
H-function are represented as
Q(λ, λ̄) = Σ_{Y∈Y} P(Y|X, λ) log P(X, Y|λ̄)   and   H(λ, λ̄) = Σ_{Y∈Y} P(Y|X, λ) log P(Y|X, λ̄)    (2.12)
The following EM algorithm permits an easy maximisation of the Q-function
instead of maximising L(X, λ) directly.
Algorithm 1 (The EM Algorithm)
1. Initialisation: Fix Y and choose an initial estimate λ
2. E-step: Compute Q(λ, λ̄) based on the given λ
3. M-step: Use a certain optimisation method to determine λ̄, for which Q(λ, λ̄) ≥ Q(λ, λ)
4. Termination: Set λ = λ̄ and repeat from the E-step until the change of Q(λ, λ̄) falls below a preset threshold.
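A skeleton of Algorithm 1 can be sketched in Python as below. The callables e_step and m_step are assumed placeholders supplied by the caller for the model-specific computations (the posteriors of the unobservable data and the parameters maximising the Q-function, respectively); they are not defined in the thesis.

```python
def em(initial_lambda, e_step, m_step, tol=1e-6, max_iter=100):
    """Generic EM loop following Algorithm 1 (a sketch under stated assumptions).

    e_step(lam) is assumed to return (stats, q): the sufficient statistics needed
    by the M-step and the current value of the Q-function.  m_step(stats) is
    assumed to return the new parameter set maximising Q."""
    lam = initial_lambda
    prev_q = None
    for _ in range(max_iter):
        stats, q = e_step(lam)     # E-step: expectations under the current model
        lam = m_step(stats)        # M-step: parameters that maximise the Q-function
        if prev_q is not None and abs(q - prev_q) < tol:
            break                  # Termination: change of Q below a preset threshold
        prev_q = q
    return lam
```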
2.3.4 Hidden Markov Modelling
The underlying assumption of the HMM is that the speech signal can be well characterised as a parametric random process, and that the parameters of the stochastic
process can be estimated in a precise, well-defined manner. The HMM method provides a reliable way of recognising speech for a wide range of applications [Juang 1998,
Ghahramani 1997, Furui 1997, Rabiner et al. 1996, Das and Picheny 1996].
There are two assumptions in the first-order HMM. The first is the Markov assumption, i.e. a new state is entered at each time t based on the transition probability,
which only depends on the previous state. It is used to characterise the sequence of
the time frames of a speech pattern. The second is the output-independence assumption, i.e. the output probability depends only on the state at that time regardless
of when and how the state is entered [Huang et al. 1990]. A process satisfying the
Markov assumption is called a Markov model [Kulkarni 1995]. An observable Markov
model is a process where the output is a set of states at each instant of time and
each state corresponds to an observable event. The hidden Markov model is a doubly
stochastic process with an underlying Markov process which is not directly observable
(hidden) but which can be observed through another set of stochastic processes that
produce observable events in each of the states [Rabiner and Juang 1993].
Parameters and Types of HMMs
Let O = (o1 o2 . . . oT ) be the observation sequence, S = (s1 s2 . . . sT ) the unobservable state sequence, X = (x1 x2 . . . xT ) the continuous vector sequence, V =
{v1 , v2 , . . . , vK } the discrete symbol set, and N the number of states. A compact
notation λ = {π, A, B} is proposed to indicate the complete parameter set of the
HMM [Rabiner and Juang 1993], where
• π = {πi}, πi = P(s1 = i|λ), 1 ≤ i ≤ N: the initial state distribution;
• A = {aij}, aij = P(st+1 = j|st = i, λ), 1 ≤ i, j ≤ N and 1 ≤ t ≤ T − 1: the state transition probability distribution, denoting the transition probability from state i at time t to state j at time t + 1; and
• B = {bj(ot)}, bj(ot) = P(ot|st = j, λ), 1 ≤ j ≤ N and 1 ≤ t ≤ T: the observation probability distribution, denoting the probability of generating an observation ot in state j at time t.
One way to classify types of HMMs is by the structure of the transition matrix A
of the Markov chain [Huang et al. 1990]:
• Ergodic or fully connected HMM: every state can be reached from every other
state in a finite number of steps. The initial state probabilities and the state
transition coefficients have the properties [Rabiner and Juang 1986]
0 ≤ πi ≤ 1,  Σ_{i=1}^{N} πi = 1   and   0 ≤ aij ≤ 1,  Σ_{j=1}^{N} aij = 1    (2.13)
• Left-to-right: as time increases, the state index increases or stays the same.
The state sequence must begin in state 1 and end in state N, i.e. πi = 0 if i ≠ 1 and πi = 1 if i = 1. The state-transition coefficients satisfy the following fundamental properties
aij = 0 for j < i,   0 ≤ aij ≤ 1,   and   Σ_{j=1}^{N} aij = 1    (2.14)
The additional constraint aij = 0 for j > i + ∆i, where ∆i > 0, is often placed on the state-transition coefficients to make sure that large changes in state
indices do not occur [Rabiner 1989]. Such a model is called the Bakis model
[Bakis 1976], i.e. a left-to-right model which allows some states to be skipped.
An alternative way to classify types of HMMs is based on observations and their
representations [Huang et al. 1990]
• Discrete HMM (DHMM): the observations ot, 1 ≤ t ≤ T, are discrete symbols in V = {v1, v2, ..., vK}, which are normally codevector indices of a VQ source-coding technique, and
B = {bj(k)},  bj(k) = P(ot = vk|st = j, λ),  1 ≤ j ≤ N,  1 ≤ k ≤ K,  Σ_{k=1}^{K} bj(k) = 1    (2.15)
• Continuous HMM (CHMM): the observations ot ∈ O are vectors xt ∈ X and the parametric representation of the observation probabilities is a mixture of Gaussian distributions
B = {bj(xt)},  bj(xt) = P(xt|st = j, λ) = Σ_{k=1}^{K} wjk N(xt, µjk, Σjk),  1 ≤ j ≤ N,  1 ≤ t ≤ T,  ∫_X bj(xt) dxt = 1    (2.16)
where wjk is the kth mixture weight in state j satisfying Σ_{k=1}^{K} wjk = 1 and N(xt, µjk, Σjk) is the kth Gaussian component density in state j with mean vector µjk and covariance matrix Σjk (see Section 2.3.5 for detail).
Other variants have been proposed, such as factorial HMMs [Ghahramani 1997],
tied-mixture continuous HMMs [Bellegarda and Nahamoo 1990] and semi-continuous
HMMs [Huang and Jack 1989].
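For illustration, the parameter set λ = {π, A, B} of a small discrete left-to-right HMM might be held in plain arrays as in the following Python sketch; the sizes and numbers are invented purely to show the shapes and the stochastic constraints, and are not taken from the thesis.

```python
import numpy as np

# A compact container for lambda = {pi, A, B} of a discrete HMM with N states
# and K symbols (hypothetical values, chosen only to satisfy the constraints).
N, K = 3, 4
hmm = {
    "pi": np.array([1.0, 0.0, 0.0]),                 # initial state distribution pi_i
    "A":  np.array([[0.6, 0.4, 0.0],                 # a_ij: left-to-right transitions
                    [0.0, 0.7, 0.3],
                    [0.0, 0.0, 1.0]]),
    "B":  np.full((N, K), 1.0 / K),                  # b_j(k) = P(o_t = v_k | s_t = j)
}
assert np.allclose(hmm["pi"].sum(), 1.0)
assert np.allclose(hmm["A"].sum(axis=1), 1.0)        # each row of A sums to one, eq. (2.13)
assert np.allclose(hmm["B"].sum(axis=1), 1.0)        # each row of B sums to one, eq. (2.15)
```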
Three Basic Problems for HMMs
There are three basic problems to be solved for HMMs. The parameter estimation
problem is to train speech and speaker models, the evaluation problem is to compute
likelihood functions for recognition, and the decoding problem is to determine the best
fitting (unobservable) state sequence [Rabiner and Juang 1993, Huang et al. 1990].
The parameter estimation problem: This problem determines the optimal model parameters λ of the HMM according to a given optimisation criterion. A variant of the EM algorithm, known as the Baum-Welch algorithm, yields an iterative
procedure to reestimate the model parameters λ using the ML criterion [Baum 1972,
Baum and Sell 1968, Baum and Eagon 1967]. In the Baum-Welch algorithm, the unobservable data are the state sequence S and the observable data are the observation
sequence O. From (2.12), the Q-function for the HMM is as follows
Q(λ, λ̄) = Σ_S P(S|O, λ) log P(O, S|λ̄)    (2.17)
Computing P(O, S|λ̄) [Rabiner and Juang 1993, Huang et al. 1990], we obtain
Q(λ, λ̄) = Σ_{t=0}^{T−1} Σ_{st} Σ_{st+1} P(st, st+1|O, λ) log[ast st+1 bst+1(ot+1)]    (2.18)
where πs1 is denoted by as0s1 for simplicity. Regrouping (2.18) into three terms for
the π, A, B coefficients, and applying Lagrange multipliers, we obtain the HMM
parameter estimation equations
• For discrete HMM:
πi = γ1(i),   aij = Σ_{t=1}^{T−1} ξt(i, j) / Σ_{t=1}^{T−1} γt(i),   bj(k) = Σ_{t=1 s.t. ot=vk}^{T} γt(j) / Σ_{t=1}^{T} γt(j)    (2.19)
where
γt(i) = Σ_{j=1}^{N} ξt(i, j),   ξt(i, j) = P(st = i, st+1 = j|O, λ) = αt(i) aij bj(ot+1) βt+1(j) / Σ_{i=1}^{N} Σ_{j=1}^{N} αt(i) aij bj(ot+1) βt+1(j)    (2.20)
• For continuous HMM: estimation equations for the π and A distributions are
unchanged, but the output distribution B is estimated via Gaussian mixture
parameters as represented in (2.16)
wjk = Σ_{t=1}^{T} ηt(j, k) / Σ_{t=1}^{T} Σ_{k=1}^{K} ηt(j, k),   µjk = Σ_{t=1}^{T} ηt(j, k) xt / Σ_{t=1}^{T} ηt(j, k),   Σjk = Σ_{t=1}^{T} ηt(j, k)(xt − µjk)(xt − µjk)′ / Σ_{t=1}^{T} ηt(j, k)    (2.21)
where
ηt(j, k) = [αt(j) βt(j) / Σ_{j=1}^{N} αt(j) βt(j)] × [wjk N(xt, µjk, Σjk) / Σ_{k=1}^{K} wjk N(xt, µjk, Σjk)]    (2.22)
Note that for practical implementation, a scaling procedure [Rabiner and Juang 1993] is required to avoid numerical underflow on computers with ordinary floating-point number representations.
The evaluation problem: How can we efficiently compute P (O|λ), the probability that the observation sequence O was produced by the model λ?
For solving this problem, we obtain
P(O|λ) = Σ_{all S} P(O, S|λ) = Σ_{s1, s2, ..., sT} πs1 bs1(o1) as1s2 bs2(o2) ... asT−1sT bsT(oT)    (2.23)
An interpretation of the computation in (2.23) is the following. At time t = 1, we are
in state s1 with probability πs1 , and generate the symbol o1 with probability bs1 (o1 ).
A transition is made from state s1 at time t = 1 to state s2 at time t = 2 with
probability as1 s2 and we generate a symbol o2 with probability bs2 (o2 ). This process
continues in this manner until the last transition at time T from state sT −1 to state sT
is made with probability asT −1 sT and we generate symbol oT with probability bsT (oT ).
Figure 2.3 shows an N -state left-to-right HMM with ∆i in (2.14) set to 1.
Figure 2.3: An N-state left-to-right HMM with ∆i = 1
To reduce computations, the forward and the backward variables are used. The
forward variable αt (i) is defined as
αt(i) = P(o1 o2 ... ot, st = i|λ),   1 ≤ i ≤ N
which can be computed iteratively as
α1(i) = πi bi(o1)   and   αt+1(j) = [Σ_{i=1}^{N} αt(i) aij] bj(ot+1),   1 ≤ j ≤ N,  1 ≤ t ≤ T − 1    (2.24)
and the backward variable βt(i) is defined as
βt(i) = P(ot+1 ot+2 ... oT|st = i, λ),   1 ≤ i ≤ N
which can be computed iteratively as
βT(i) = 1   and   βt(i) = Σ_{j=1}^{N} aij bj(ot+1) βt+1(j),   1 ≤ i ≤ N,  t = T − 1, ..., 1    (2.25)
Using these variables, the probability P(O|λ) can be computed from the forward variable, the backward variable, or both, as follows
P(O|λ) = Σ_{i=1}^{N} αT(i) = Σ_{i=1}^{N} πi bi(o1) β1(i) = Σ_{i=1}^{N} αt(i) βt(i)    (2.26)
The decoding problem: Given the observation sequence O and the model λ,
how do we choose a corresponding state sequence S that is optimal in some sense?
This problem attempts to uncover the hidden part of the model. There are several
possible ways to solve this problem, but the most widely used criterion is to find
the single best state sequence that can be implemented by the Viterbi algorithm. In
practice, it is preferable to base recognition on the maximum likelihood state sequence
since this generalises easily to the continuous speech case. This likelihood is computed
using the same algorithm as the forward algorithm except that the summation is
replaced by a maximum operation.
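The following sketch, assuming the same array layout as the forward example above, replaces the summation by a maximisation in the log domain to score the single best state sequence; backtracking of that sequence is omitted.

```python
import numpy as np

def viterbi_log_score(pi, A, B, obs):
    """Log-likelihood of the single best state sequence (a sketch only).

    The recursion is the forward recursion of (2.24) with the summation
    replaced by a maximisation; zero probabilities become -inf in the log domain."""
    with np.errstate(divide="ignore"):
        log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    delta = log_pi + log_B[:, obs[0]]
    for o in obs[1:]:
        # delta_{t+1}(j) = max_i [delta_t(i) + log a_ij] + log b_j(o_{t+1})
        delta = np.max(delta[:, None] + log_A, axis=0) + log_B[:, o]
    return float(np.max(delta))
```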
2.3.5 Gaussian Mixture Modelling
Gaussian mixture models (GMMs) are effective models capable of achieving high
recognition accuracy for speaker recognition. As discussed above, HMMs can adequately characterise both the temporal and spectral varying nature of the speech
signal [Rabiner and Juang 1993]; however, for speaker recognition, the temporal information has been used effectively only in text-dependent mode. In text-independent
mode, there are no constraints on the training and test text and this temporal information has not been shown to be useful [Reynolds 1992]. On the other hand, the performance of text-independent speaker identification depends mostly on the total number
of mixture components (number of states times number of mixture components assigned to each state) per speaker model [Matsui and Furui 1993, Matsui and Furui 1992,
Reynolds 1992]. Therefore, it can be seen that the N-state M-mixture continuous
ergodic HMM is roughly equivalent to the NM-mixture GMM in text-independent
speaker recognition applications. In this case, the number of states does not play
an important role, and hence for simplicity, the 1-state HMM, i.e. the GMM is currently used for text-independent speaker recognition [Furui 1994]. Although we can
get equations for the GMM from the continuous HMM with the number of states
N = 1, for practical applications, they are summarised as follows.
The parameter estimation problem: The Q-function for the GMM is of the
form [Huang et al. 1990]
Q(λ, λ̄) = Σ_{i=1}^{K} Σ_{t=1}^{T} P(i|xt, λ) log P(xt, i|λ̄) = Σ_{i=1}^{K} Σ_{t=1}^{T} P(i|xt, λ) log[wi N(xt, µi, Σi)]    (2.27)
where P(i|xt, λ) is the a posteriori probability for the ith mixture, i = 1, ..., K, and satisfies
P(i|xt, λ) = P(xt, i|λ) / Σ_{k=1}^{K} P(xt, k|λ) = wi N(xt, µi, Σi) / Σ_{k=1}^{K} wk N(xt, µk, Σk)    (2.28)
λ = {w, µ, Σ} denotes a set of model parameters, where w = {wi}, µ = {µi}, Σ = {Σi}, i = 1, ..., K, wi are mixture weights satisfying Σ_{i=1}^{K} wi = 1, and N(xt, µi, Σi) are the d-variate Gaussian component densities with mean vectors µi and covariance matrices Σi
N(xt, µi, Σi) = (2π)^{−d/2} |Σi|^{−1/2} exp{−(1/2)(xt − µi)′ Σi^{−1} (xt − µi)}    (2.29)
where (xt − µi)′ is the transpose of (xt − µi), Σi^{−1} is the inverse of Σi, and |Σi| is the determinant of Σi.
Setting derivatives of the Q-function with respect to λ to zero, the following reestimation formulas are found [Huang et al. 1990, Reynolds 1995b]
wi = (1/T) Σ_{t=1}^{T} P(i|xt, λ),   µi = Σ_{t=1}^{T} P(i|xt, λ) xt / Σ_{t=1}^{T} P(i|xt, λ),   Σi = Σ_{t=1}^{T} P(i|xt, λ)(xt − µi)(xt − µi)′ / Σ_{t=1}^{T} P(i|xt, λ)    (2.30)
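As a sketch of one EM iteration for the GMM, the following Python code computes the posteriors of (2.28) in the E-step and applies the reestimation formulas of (2.30) in the M-step. It relies on scipy.stats.multivariate_normal and omits practical safeguards such as variance flooring; the function name and interface are assumptions for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em_step(X, w, mu, Sigma):
    """One EM iteration for a K-component full-covariance GMM (a minimal sketch).

    X: (T, d) feature vectors, w: (K,) weights, mu: (K, d) means, Sigma: (K, d, d)."""
    T, d = X.shape
    K = len(w)
    # E-step: responsibilities P(i | x_t, lambda), shape (T, K), eq. (2.28)
    p = np.column_stack([w[i] * multivariate_normal.pdf(X, mu[i], Sigma[i]) for i in range(K)])
    p /= p.sum(axis=1, keepdims=True)
    # M-step: reestimation formulas (2.30)
    Nk = p.sum(axis=0)                          # sum over t of P(i | x_t, lambda)
    w_new = Nk / T
    mu_new = (p.T @ X) / Nk[:, None]
    Sigma_new = np.empty((K, d, d))
    for i in range(K):
        diff = X - mu_new[i]
        Sigma_new[i] = (p[:, i, None] * diff).T @ diff / Nk[i]
    return w_new, mu_new, Sigma_new
```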
The evaluation problem: For a training vector sequence X = (x1 x2 . . . xT ),
the likelihood of the GMM is
log P(X|λ) = Σ_{t=1}^{T} log P(xt|λ) = Σ_{t=1}^{T} log Σ_{i=1}^{K} wi N(xt, µi, Σi)    (2.31)

2.3.6 Vector Quantisation Modelling
Vector quantisation (VQ) is a data reduction method, which is used to convert
a feature vector set into a small set of distinct vectors using a clustering technique.
Advantages of this reduction are reduced storage, reduced computation,
and efficient representation of speech sounds [Furui 1996, Bellegarda 1996]. The
distinct vectors are called codevectors and the set of codevectors that best represents the training vector set is called the codebook. The VQ codebook can be
used as a speech or speaker model and a good recognition performance can be
obtained in many cases [Rabiner et al. 1983, Soong et al. 1987, Tseng et al. 1987,
Tsuboka and Nakahashi 1994, Bellegarda 1996]. Since there is only a finite number
of code vectors, the process of choosing the best representation of a given feature
vector is equivalent to quantising the vector and leads to a certain level of quantisation error. This error decreases as the size of the codebook increases; however, the
storage required for a large codebook is nontrivial. The key point of VQ modelling
is to derive an optimal codebook, which is commonly achieved by using the hard C-means (K-means) algorithm reviewed in Section 2.4.5. A variant of this algorithm is
the LBG algorithm [Linde et al. 1980], which is widely used in speech and speaker
recognition.
The difference between the GMM and VQ is the change from a “soft” mapping in
the GMM to a “hard” mapping in VQ of feature vectors into clusters [Chou et al. 1989].
In the GMM, we obtain
P(xt|λ) = Σ_{i=1}^{K} wi N(xt, µi, Σi)    (2.32)
It means that vector xt can belong to K clusters (soft mapping) represented by K
Gaussian distributions. The degree of belonging of xt to the ith cluster is represented
by the probability P (i|xt , λ) and is determined as in (2.28). For the GMM, we obtain
0 ≤ P (i|xt , λ) ≤ 1. For VQ, vector xt is or is not in the ith cluster and the probability
P (i|xt , λ) is determined as follows [Chou et al. 1989, Duda and Hart 1973]
P(i|xt, λ) = 1 if dit < djt ∀j ≠ i, and 0 otherwise    (2.33)
where ties are broken randomly and dit denotes the distance from vector xt to the ith cluster. If a particular distance is defined and (2.33) is substituted into (2.30), variants of VQ are obtained as follows
• Conventional VQ: the Euclidean distance d²it = (xt − µi)² is used and
µi = (1/Ti) Σ_{xt ∈ cluster i} xt    (2.34)
where Ti is the number of vectors in the ith cluster and Σ_{i=1}^{K} Ti = T (a code sketch of this variant is given after this list).
• Extended VQ: the Mahalanobis distance d²it = (xt − µi)′ Σi^{−1} (xt − µi) is used and
µi = (1/Ti) Σ_{xt ∈ cluster i} xt,   Σi = (1/Ti) Σ_{xt ∈ cluster i} (xt − µi)(xt − µi)′    (2.35)
• Entropy-Constrained VQ (ECVQ): the distance d²it = (xt − µi)′ Σ^{−1} (xt − µi) − 2 log wi is used, assuming that Σi = Σ ∀i with Σ fixed [Chou et al. 1989], and
µi = (1/Ti) Σ_{xt ∈ cluster i} xt,   wi = Ti / T    (2.36)
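A minimal Python sketch of the conventional VQ variant above (Euclidean distance, hard assignment and centroid update) follows; a K-means style training loop is assumed, and the LBG splitting strategy is not shown.

```python
import numpy as np

def train_vq_codebook(X, K, n_iter=20, seed=0):
    """Conventional VQ codebook training (a sketch under stated assumptions).

    Alternates the hard assignment of (2.33) with the Euclidean distance and the
    centroid update of (2.34).  X: (T, d) training vectors, K: codebook size."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    codebook = X[rng.choice(len(X), size=K, replace=False)]     # initial codevectors mu_i
    for _ in range(n_iter):
        # assign each x_t to its nearest codevector (hard membership, eq. 2.33)
        dists = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # recompute each codevector as the mean of its cluster (eq. 2.34)
        for i in range(K):
            members = X[labels == i]
            if len(members) > 0:
                codebook[i] = members.mean(axis=0)
    return codebook
```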
The relationships between the statistical modelling techniques are summarised in
Figure 2.4.
2.3.7 Summary
The statistical classifier based on the Bayes decision theory has been reviewed in this
section. The classifier design problem is to achieve the minimum recognition error
rate, which is performed by the MAP decision rule. Since the a posteriori probabilities
are not known in advance, the problem becomes a distribution estimation problem.
This is solved by determining the a priori probability and the likelihood function. As
discussed in Section 2.3.2, the former is derived from the language models in speech
recognition. In speaker identification, this is often simplified by assuming an equal
Figure 2.4: Relationships between HMM, GMM, and VQ techniques (discrete HMM and continuous HMM with λ = {π, A, B}; GMM as a 1-state HMM with λ = {w, µ, Σ}; and, via the change from soft to hard mapping, extended VQ with λ = {µ, Σ}, ECVQ with λ = {µ, w} and VQ with λ = {µ})
Figure 2.5: A statistical classifier for isolated word recognition and speaker identification (test speech is analysed into X and classified by the MAP rule using the likelihoods P(X|λi) and the prior knowledge P(Ci), 1 ≤ i ≤ M; training speech is analysed and used to estimate the models λ1, ..., λM with a selected modelling technique: HMM, GMM or VQ)
a priori probability for all speakers. The latter is derived from the acoustic models,
which, in order to be practically implementable, are usually parameterised. Now
we need to solve the model parameter estimation problem. Model parameters are
determined such that the likelihood function is maximised. This is performed by the
EM algorithm. The HMM is the most effective model for solving this problem. In
the continuous case, the one-state HMM is identical to the GMM. If a hard mapping
of a vector to the Gaussian distribution is applied rather than a soft mapping in the
GMM, we obtain the VQ model and its variants. The block diagram in Figure 2.5
illustrates what we have summarised.
2.4 Fuzzy Clustering Techniques
This section begins with a brief review of fuzzy set theory and the membership function. The membership estimation problem is then mentioned. The role of cluster
analysis in pattern recognition as well as hard C-means, fuzzy C-means, noise clustering, and possibilistic C-means clustering techniques are then reviewed.
2.4.1 Fuzzy Sets and the Membership Function
Fuzzy set theory was introduced in 1965 by Lotfi Zadeh to represent and manipulate data and information that possess nonstatistical uncertainty. Fuzzy set theory
[Zadeh 1965] is a generalisation of conventional set theory that was introduced as a
new way to represent the vagueness or imprecision that is ever present in our daily
experience as well as in natural language [Bezdek 1993].
Let X be a feature vector space. A set A is called a crisp set if every feature
vector x in X either is in A (x ∈ A) or is not in A (x ∉ A). A set B is called a
fuzzy set in X if it is characterized by a membership function uB (x), taking values
in the interval [0, 1] and representing the “degree of membership” of x in B. With
the ordinary set A, the membership value can take on only two values 0 and 1, with
uA(x) = 1 if x ∈ A or uA(x) = 0 if x ∉ A [Zadeh 1965]. With the fuzzy set B, the
membership function uB (x) can take any value between 0 and 1. The membership
function is the basic idea in fuzzy set theory.
2.4.2 Maximum Membership Rule
A membership function uCi (x) can represent the degree to which an observation x
belongs to a class Ci . In order to correctly classify an unknown observation x into
one of the classes Ci , i = 1, 2, . . . , M , the following maximum membership decision
rule can be used
C(x) = Ci   if   uCi(x) = max_{1≤j≤M} uCj(x)    (2.37)
Therefore in order to achieve the best classification, we have to decide on class Ci , if
the membership function uCi (x) is maximum [Keller et al. 1985].
2.4.3 Membership Estimation Problem
For the implementation of the maximum membership rule, the required knowledge for
an optimal classification decision is that of the membership functions. These functions
are not known in advance and have to be estimated from a training set of observations
with known class labels. The estimation procedure in fuzzy pattern recognition is
called the abstraction and the use of estimates to compute the membership values for
unknown observations not contained in the training set is called the generalisation
procedure [Bellman et al. 1966].
An estimate of the membership function is referred to as an abstracting function.
To generate a “good” abstracting function from the knowledge of its values over
a finite set of observations, we need some a priori information about the class of
functions to which the abstracting function belongs, such that this information in
combination with observations from X is sufficient for the estimation. This approach
involves choosing a family of abstracting functions and finding a member of this
family which fits “best”, in some specified sense, the given observation sequence X.
In most practical situations, the a priori information about the membership function
of a fuzzy class is insufficient to generate an abstracting function, which is “optimal”
in a meaningful sense.
2.4.4 Pattern Recognition and Cluster Analysis
Pattern recognition can be characterised as “a field concerned with machine recognition of meaningful regularities in noisy or complex environments” [Duda and Hart 1973].
A workable definition for pattern recognition is “the search for structure in data”
[Bezdek 1981]. Three main issues of the search for structure in data are: feature
selection, cluster analysis, and classification. Feature selection is the search for structure in data items, or observations xt ∈ X. The feature space X may be compressed
by eliminating redundant and unimportant features via selection or transformation.
Cluster analysis is the search for structure in data sets, or sequences X ∈ X. Since
“optimal” features are not known in advance, we often attempt to discover these by
clustering the feature variables. Finally, classification is the search for structure in
data spaces X. A pattern classifier designed for X is a device or means whereby X
itself is partitioned into “decision regions” [Bezdek 1981].
Clustering is the grouping of similar objects [Hartigan 1975]. Clustering in the
given unlabeled data X is to assign to feature vectors labels that identify “natural
subgroups” in X [Bezdek 1993]. In other words, clustering known as unsupervised
learning in X is a partitioning of X into C subsets or C clusters. The most important requirement is to find a suitable measure of clusters, referred to as a clustering
criterion. Objective function methods allow the most precise formulation of the clustering criterion. To construct an objective function, a similarity measure is required.
A standard way of expressing similarity is through a set of distances between pairs
of feature vectors. Optimising the objective function is performed to find optimal
partitions of data. The partitions generated by a clustering method define for all
data elements to which cluster they belong. The boundaries of partitions are sharp
in the hard clustering method or vague in the fuzzy clustering method. Each feature
vector of a fuzzy partition belongs to different clusters with different membership
values. Cluster validity is an important issue, which deals with the significance of the
structure imposed by a clustering method. It is required in order to determine an
optimal partition in the sense that it best explains the unknown structure in X.
2.4.5 Hard C-Means Clustering
Let U = [uit ] be a matrix whose elements are memberships of xt in the ith cluster,
i = 1, . . . , C, t = 1, . . . , T . Hard C-partition space for X is the set of matrices U such
that [Bezdek 1993]
uit ∈ {0, 1} ∀i, t,   Σ_{i=1}^{C} uit = 1 ∀t,   0 < Σ_{t=1}^{T} uit < T ∀i    (2.38)
where uit = ui(xt) is 1 or 0 according to whether xt is or is not in the ith cluster, Σ_{i=1}^{C} uit = 1 ∀t means each xt is in exactly one of the C clusters, and 0 < Σ_{t=1}^{T} uit < T ∀i means that no cluster is empty and no cluster is all of X because 2 ≤ C < T.
The HCM method [Duda and Hart 1973] is based on minimisation of the sum-of-squared-errors function as follows
J(U, λ; X) = Σ_{i=1}^{C} Σ_{t=1}^{T} uit d²it    (2.39)
where U = {uit} is a hard C-partition of X, λ is a set of prototypes, in the simplest case the set of cluster centres λ = {µ}, µ = {µi}, i = 1, ..., C, and dit is the distance in the A norm (A is any positive definite matrix) from xt to µi, known as a measure of dissimilarity
d²it = ||xt − µi||²_A = (xt − µi)′ A (xt − µi)    (2.40)
Minimising the hard objective function J(U, λ; X) in (2.39) gives
uit = 1 if dit < djt, j = 1, ..., C, j ≠ i, and 0 otherwise    (2.41)
µi = Σ_{t=1}^{T} uit xt / Σ_{t=1}^{T} uit    (2.42)
where ties are broken randomly.
2.4.6 Fuzzy C-Means Clustering
The fuzzy C-means (FCM) method is the most widely used approach in both theory
and practical applications of fuzzy clustering techniques to unsupervised classification
[Zadeh 1977]. It is an extension of the hard C-means method that was first introduced
by Dunn [1974]. A weighting exponent m on each fuzzy membership called the degree
of fuzziness was introduced in the FCM method [Bezdek 1981] and hence a general
estimation procedure for the FCM has been established and its convergence has been
shown [Bezdek 1990, Bezdek and Pal 1992].
Fuzzy C-Means Algorithm
Let U = [uit ] be a matrix whose elements are memberships of xt in cluster i, i =
1, . . . , C, t = 1, . . . , T . Fuzzy C-partition space for X is the set of matrices U such
that [Bezdek 1993]
0 ≤ uit ≤ 1 ∀i, t,   Σ_{i=1}^{C} uit = 1 ∀t,   0 < Σ_{t=1}^{T} uit < T ∀i    (2.43)
where 0 ≤ uit ≤ 1 ∀i, t means it is possible for each xt to have an arbitrary distribution
of membership among the C fuzzy clusters.
The FCM method is based on minimisation of the fuzzy squared-error function as
follows [Bezdek 1981]
Jm(U, λ; X) = Σ_{i=1}^{C} Σ_{t=1}^{T} uit^m d²it    (2.44)
where U = {uit } is a fuzzy C-partition of X, m > 1 is a weighting exponent on
each fuzzy membership uit and is called the degree of fuzziness, λ and dit are defined
as in (2.39). The basic idea of the FCM method is to minimise Jm(U, λ; X) over the
variables U and λ on the assumption that matrices U that are part of optimal pairs
for Jm (U, λ; X) identify good partitions of the data. Minimising the fuzzy objective
function Jm (U, λ; X) in (2.44) gives
uit = 1 / Σ_{k=1}^{C} (d²it/d²kt)^{1/(m−1)}    (2.45)
µi = Σ_{t=1}^{T} uit^m xt / Σ_{t=1}^{T} uit^m    (2.46)
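The alternating updates (2.45) and (2.46) can be sketched in Python as follows, assuming Euclidean distances and random initial centres; convergence tests and cluster-validity checks are omitted, and the function name and defaults are illustrative assumptions.

```python
import numpy as np

def fcm(X, C, m=2.0, n_iter=50, eps=1e-9, seed=0):
    """Fuzzy C-means clustering (a minimal sketch).

    Alternates the membership update (2.45) and the centre update (2.46).
    X: (T, d) feature vectors, C: number of clusters, m: degree of fuzziness."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    mu = X[rng.choice(len(X), size=C, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2) + eps   # d_it^2, shape (T, C)
        u = 1.0 / (d2 ** (1.0 / (m - 1.0)))                              # eq. (2.45) up to normalisation
        u /= u.sum(axis=1, keepdims=True)
        um = u ** m
        mu = (um.T @ X) / um.sum(axis=0)[:, None]                        # eq. (2.46)
    return u, mu
```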
Gustafson-Kessel Algorithm
An interesting modification of the FCM has been proposed by Gustafson and Kessel
[1979]. It attempts to recognise the fact that different clusters in the same data set
X may have differing geometric shapes. A generalisation to a metric that appears
more natural was made through the use of a fuzzy covariance matrix. The distance in (2.40) is replaced by an inner-product-induced norm of the form
d²it = (xt − µi)′ Mi (xt − µi)    (2.47)
where the Mi, i = 1, ..., C, are symmetric and positive definite and subject to the following constraints: |Mi| = ρi, with ρi > 0 and fixed for each i. Define a fuzzy covariance matrix Σi by
Σi = Σ_{t=1}^{T} uit^m (xt − µi)(xt − µi)′ / Σ_{t=1}^{T} uit^m    (2.48)
then we have Mi^{−1} = (|Mi| |Σi|)^{−1/d} Σi, i = 1, ..., C, where |Mi| and |Σi| are the determinants of Mi and Σi, respectively, and d is the vector space dimension.
The parameter set in this algorithm is λ = {µ, Σ}, where µ = {µi } and Σ = {Σi },
i = 1, . . . , C, are computed by (2.46) and (2.48), respectively.
Gath-Geva Algorithm
The algorithm proposed by Gath and Geva [1989] is an extension of the Gustafson-Kessel algorithm that also takes the size and density of the clusters into account. The distance is chosen to be inversely proportional to the probability P(xt, i|λ)
d²it = 1 / P(xt, i|λ) = 1 / [wi N(xt, µi, Σi)]    (2.49)
where the Gaussian distribution N(xt, µi, Σi) is defined in (2.29). The parameter set in this algorithm is λ = {w, µ, Σ}, where µ = {µi} and Σ = {Σi} are computed as in the Gustafson-Kessel algorithm, and the mixture weights w = {wi} are computed as follows
wi = Σ_{t=1}^{T} uit^m / Σ_{t=1}^{T} Σ_{i=1}^{C} uit^m    (2.50)
In contrast to the FCM and the Gustafson-Kessel algorithms, the Gath-Geva algorithm is not based on an objective function, but is a fuzzification of statistical estimators. If the same technique as for the FCM and the Gustafson-Kessel algorithms, i.e. minimising the least-squares function in (2.44), were applied to the Gath-Geva algorithm, the resulting system of equations could not be solved analytically.
In this sense, the Gath-Geva algorithm is a good heuristic on the basis of an analogy
with probability theory [Höppner et al. 1999].
2.4.7 Noise Clustering
Both HCM and FCM clustering methods have a common disadvantage in the problem
of sensitivity to outliers. As can be seen from (2.38) and (2.43), the memberships are
relative numbers. The sum of the memberships of a feature vector xt across classes is
always equal to one both for clean data and for noisy data, i.e. data “contaminated”
by erroneous points or “outliers”. It would be more reasonable that, if the feature
vector xt comes from noisy data or outliers, the memberships should be as small as
possible for all classes and the sum should be smaller than one. This property is
important since all parameter estimates are computed based on these memberships.
An idea of a noise cluster has been proposed by Davé [1991] to deal with noisy data
or outliers for fuzzy clustering methods.
The noise is considered to be a separate class and is represented by a prototype—
a parameter subset characterising a cluster—that has a constant distance δ from all
feature vectors. The membership u•t of a vector xt in the noise cluster is defined to
be
u•t = 1 − Σ_{i=1}^{C} uit,   t = 1, ..., T    (2.51)
Therefore, the membership constraint for the “good” clusters is effectively relaxed to
Σ_{i=1}^{C} uit < 1,   t = 1, ..., T    (2.52)
This allows noisy data and outliers to have arbitrarily small membership values in
good clusters. The objective function in the noise clustering (NC) approach is as
follows
Jm(U, λ; X) = Σ_{i=1}^{C} Σ_{t=1}^{T} uit^m d²it + Σ_{t=1}^{T} δ² (1 − Σ_{i=1}^{C} uit)^m    (2.53)
where U = {uit} is a noise-clustering C-partition of X and m > 1. Since the second term in (2.53) is independent of the parameter set λ and the distance measure, parameters are estimated by minimising the first term, the squared-error function of FCM clustering in (2.44), with respect to λ. Therefore (2.46) and (2.50) still apply to this approach for parameter estimation. Minimising the objective function
Jm (U, λ; X) in (2.53) with respect to uit gives
uit = 1 / [Σ_{k=1}^{C} (d²it/d²kt)^{1/(m−1)} + (d²it/δ²)^{1/(m−1)}]    (2.54)
The second term in the denominator of (2.54) becomes quite large for outliers, resulting in small membership values in all the good clusters for outliers. The advantage of
this approach is that it forms a more robust version of the FCM algorithm and can
be used instead of the FCM algorithm provided a suitable value for constant distance
δ can be found [Davé and Krishnapuram 1997].
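A small Python sketch of the noise-clustering membership update (2.54) follows. The squared distances and the noise distance δ are assumed given, and the numerical example is invented only to show how an outlier receives a low total membership in the good clusters.

```python
import numpy as np

def nc_memberships(d2, delta, m=2.0):
    """Noise-clustering membership update of (2.54) (a minimal sketch).

    d2: (T, C) squared distances d_it^2 to the C good clusters, delta: constant
    noise distance.  The remainder 1 - sum_i u_it is the noise-cluster
    membership of (2.51)."""
    exponent = 1.0 / (m - 1.0)
    d2 = np.asarray(d2, dtype=float)
    ratios = (d2[:, :, None] / d2[:, None, :]) ** exponent      # (d_it^2 / d_kt^2)^(1/(m-1))
    denom = ratios.sum(axis=2) + (d2 / delta**2) ** exponent    # denominator of (2.54)
    return 1.0 / denom

# Example: a point near the clusters versus an outlier far from both
u = nc_memberships(np.array([[1.0, 4.0], [100.0, 90.0]]), delta=3.0)
print(u.sum(axis=1))   # close to 1 for the first point, well below 1 for the outlier
```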
2.4.8 Summary
Fuzzy set theory, membership functions and clustering techniques have been reviewed
in this section. Figure 2.6 illustrates the clustering techniques and shows the constraints on memberships for each technique. Since the Gath-Geva technique is not
based on an objective function as discussed above, corresponding versions of the
Gath-Geva for the NC and the PCM techniques have not been proposed. Extended
versions of the HCM are also extensions of VQ, which have been reviewed in Section
2.3.6.
Figure 2.6: Clustering techniques and their extended versions (memberships satisfy 0 ≤ uit ≤ 1 and 0 < Σ_{t=1}^{T} uit < T, with uit ∈ {0, 1} for HCM, Σ_{i=1}^{C} uit = 1 for FCM and Σ_{i=1}^{C} uit < 1 for NC; the parameter sets extend from λ = {µ}, through λ = {µ, Σ} for the Gustafson-Kessel extension, to λ = {µ, Σ, w} for the Gath-Geva extension)
The fuzzy membership of a feature vector in a cluster depends not only on where
the feature vector is located with respect to the cluster, but also on how far away
it is with respect to other clusters. Therefore fuzzy memberships are spread across
the classes and depend on the number of clusters present. FCM clustering has been
shown to be advantageous over hard clustering. It has become more attractive with
the connection to neural networks [Kosko 1992]. Recent advances in fuzzy clustering
have shown spectacular ability to detect not only hypervolume clusters, but also
clusters which are actually “thin shells” such as curves and surfaces [Davé 1990,
Krishnapuram et al. 1992].
However, the FCM membership values are relative numbers and thus cannot distinguish between feature vectors and outliers. It has been shown that the NC approach is quite successful in improving the robustness of a variety of fuzzy clustering algorithms. A robust-statistical foundation for the NC method was established
by Davé and Krishnapuram [1997]. Another approach is the possibilistic C-means
method, which is presented in Section 8.2.1 as a further extension of fuzzy set theory-based clustering techniques.
2.5 Fuzzy Approaches in the Literature
This section presents some fuzzy approaches to speech and speaker recognition in
the literature. The first approach is to apply the maximum membership decision rule
in Section 2.4.2. The second is the use of the FCM algorithm instead of the HCM
(K-means) algorithm in coding a cepstral vector sequence X for the discrete HMM.
The third approach is to apply fuzzy rules in hybrid neuro-fuzzy systems. The last
approach is not reviewed in this section since it is out of the scope of this thesis.
The works relating to this approach can be found in Kasabov [1998] and Kasabov et al.
[1999].
2.5.1 Maximum Membership Rule-Based Approach
An early application based on fuzzy set theory for decision making was proposed
[Pal and Majumder 1977]. Recognition of vowels and identification of speakers using the first three formants (F1, F2, and F3) were implemented by using the membership function ui(x) associated with an unknown vector x = {x1, ..., xn} for each model
λi, i = 1, ..., M, as follows
ui(x) = 1 / (1 + [d(x, λi)/E]^F)    (2.55)
where E is an arbitrary positive constant, F is any integer, and d(x, λi ) is the weighted
distance from vector x to the nearest prototype of model λi . Prototype points chosen
are the average of the coordinate values corresponding to the entire set of samples in
a particular class. Experiments were carried out on a set of Telugu (one of the major
Indian languages) words containing about 900 commonly used speech units for 10
vowels and uttered by three male informants in the age group of 28-30 years. Overall
recognition is about 82%.
An alternative application, the use of fuzzy algorithms for assigning phonetic and phonemic labels to speech segments, was presented in [De Mori and Laface 1980].
A method consisting of fuzzy restriction for extracting features, fuzzy relations for
relating these features with phonetic and phonemic interpretation, and their use for
interpretation of a speech pattern in terms of possibility theory has been described.
Experimental results showed an overall recognition of about 95% for 400 samples
pronounced by the four talkers.
2.5.2 FCM-Based Approach
This approach investigates the use of the fuzzy C-means algorithm instead of the
K-means (hard C-means) algorithm in coding a spectral vector sequence X for the
discrete HMM. This modification of the VQ is called the fuzzy VQ (FVQ).
Since the input of the discrete HMM is an observation sequence O = (o1 . . . oT )
consisting of discrete symbols, the spectral continuous vector sequence X = (x1 . . . xT )
needs to be transformed into the discrete symbol sequence O. This is normally performed by a VQ source coding technique, where each vector xt is coded into a discrete
symbol vk —the index of the codevector closest to vector xt
ot = vk = arg min_{1≤i≤K} d(xt, µi)    (2.56)
and the observation probability distribution B is of the form defined in (2.15).
The FVQ uses fuzzy C-partitioning of X: each vector xt belongs to all classes with corresponding memberships, so the FVQ maps vector xt into an observation vector ot = (u1t, ..., uCt), where uit is the membership of vector xt in class i and is computed by using (2.45). For the observation probability distribution B = {bj(ot)}, authors
have proposed different computation methods. Following [Tseng et al. 1987], B is
computed as follows
bj(ot) = Σ_{i=1}^{C} uit bij,   1 ≤ j ≤ N,  1 ≤ t ≤ T    (2.57)
where bij is reestimated by
bij = Σ_{t=1}^{T−1} uit αt(j) βt(j) / Σ_{t=1}^{T−1} αt(j) βt(j),   1 ≤ i ≤ C,  1 ≤ j ≤ N    (2.58)
Experiments were conducted to compare three cases: using VQ/HMM, using
FVQ/HMM for training only, and using FVQ/HMM for both training and recognition, where the HMMs are 5-state left-right ones. The highest isolated-word recognition rates for the three cases are 72%, 77%, and 77%, respectively, where the degree
of fuzziness is m = 1.25, 10 training utterances are used, and the vocabulary is the
E-set consisting of 9 English letters {b, c, d, e, g, p, t, v, z}.
To obtain more tractable computation, Tsuboka and Nakahashi [1994] have proposed two alternative methods
1. Multiplication-type FVQ:
bj(ot) = Π_{i=1}^{C} bij^{uit},   1 ≤ j ≤ N,  1 ≤ t ≤ T    (2.59)
and bij is computed as in (2.58)
2. Addition-type FVQ:
bj(ot) = Σ_{i=1}^{C} uit bij,   1 ≤ j ≤ N,  1 ≤ t ≤ T    (2.60)
and bij is computed as
bij = Σ_{t=1}^{T−1} ζij(t),   ζij(t) = [αt(j) βt(j) / Σ_{t=1}^{T−1} αt(j) βt(j)] × [uit bij / Σ_{i=1}^{C} uit bij],   1 ≤ i ≤ C,  1 ≤ j ≤ N    (2.61)
It was reported that for practical applications the multiplication type is more suitable
than the addition type. In isolated-word recognition experiments, the number of states
of each HMM was set to be 1/5 of the average length in frames of training data. The
vocabulary is 100 city names in Japan and the degree of fuzziness is m = 2. The
highest recognition rates reached with a codebook size of 256 were 98.5% for the
multiplication type, 98.2% for the addition type, and 97.5% for the VQ/HMM.
An extension of this approach has been proposed by Chou and Oh [1996] where
a distribution normalisation dependent on the codevectors and a fuzzy contribution
based on weighting and smoothing the codevectors by distance have been applied.
Chapter 3
Fuzzy Entropy Models
This chapter proposes a new fuzzy approach to speech and speaker recognition as well
as to cluster analysis in pattern recognition. Models developed in this approach can be
called fuzzy entropy models since they are based on a basic algorithm called fuzzy entropy
clustering. The goal of this approach is not only to propose a new fuzzy method but
also to show that statistical models, such as HMMs in the maximum likelihood scheme,
can be viewed as fuzzy models, where probabilities of unobservable data given observable
data are used as fuzzy membership functions. An introduction of fuzzy entropy clustering
techniques is presented in Section 3.1. Relationships between clustering and modelling
problems are shown in Section 3.2. Section 3.3 presents an optimisation criterion proposed
as maximum fuzzy likelihood and formulates the fuzzy EM algorithm. Fuzzy entropy
models for HMMs, GMMs and VQ are presented in the next Sections 3.4, 3.5 and 3.6,
respectively. The noise clustering approach is also considered for these models. Section
3.7 presents a comparison between conventional models and fuzzy entropy models.
3.1 Fuzzy Entropy Clustering
Let us consider the following function [Tran and Wagner 2000f]
Hn(U, λ; X) = Σ_{i=1}^{C} Σ_{t=1}^{T} uit d²it + n Σ_{i=1}^{C} Σ_{t=1}^{T} uit log uit    (3.1)
where n > 0, λ is the model parameter set, dit is the distance between vector xt
and cluster i, and U = [uit ] with uit being the membership of vector xt in cluster i.
Assuming that the matrices U satisfy the following conditions
Σ_{i=1}^{C} uit = 1 ∀t,   0 < Σ_{t=1}^{T} uit < T ∀i    (3.2)
which mean that each xt belongs to C clusters, no cluster is empty and no cluster is
all of X because 2 ≤ C < T.
We wish to show that minimising the function Hn (U, λ; X) on U yields solutions
uit ∈ [0, 1], hence all constraints in (2.43) are satisfied. This means that the matrices
U determine the fuzzy C-partition space for X and Hn (U, λ; X) is a fuzzy objective
function.
The first term on the right-hand side in (3.1) is the sum-of-squared-errors function
J1 (U, λ; X) defined in (2.39) for hard C-means clustering. The second term is the
negative of the following function E(U ) multiplied by n
E(U) = − Σ_{i=1}^{C} Σ_{t=1}^{T} uit log uit    (3.3)
The function E(U ) is maximum if uit = 1/C ∀i, and minimum if uit = 1 or 0. On
the other hand, the function J1 (U, λ; X) needs to be minimised to obtain a good
partition for X. Therefore, we can see that uit can take values in the interval [0, 1] if
the function Hn (U, λ; X) is minimised over U . Indeed, with the assumption in (3.2),
the Lagrangian Hn*(U, λ; X) is of the form
Hn*(U, λ; X) = Σ_{i=1}^{C} Σ_{t=1}^{T} uit d²it + n Σ_{i=1}^{C} Σ_{t=1}^{T} uit log uit + Σ_{t=1}^{T} kt (Σ_{i=1}^{C} uit − 1)    (3.4)
Hn*(U, λ; X) is minimised by setting its gradients with respect to U and the Lagrange multipliers {kt} to zero
d²it + n(1 + log uit) + kt = 0   ∀i, t,      Σ_{i=1}^{C} uit = 1   ∀t    (3.5)
This is equivalent to
uit = St e^{−d²it/n},   St = e^{−1−kt/n}    (3.6)
Using the constraint in (3.5), we can compute St and hence
uit = e^{−d²it/n} / Σ_{k=1}^{C} e^{−d²kt/n}    (3.7)
From (3.7), it can be seen that 0 ≤ uit ≤ 1. Therefore, the matrices U determine
a fuzzy C-partition space for X. In this case, the function E(U ) is called the fuzzy
entropy function which has been considered by many authors. Clusters are considered
as fuzzy sets and the fuzzy entropy function expresses the uncertainty of determining
whether xt belongs to a given cluster or not. Measuring the degree of uncertainty
of fuzzy sets themselves was first proposed by De Luca and Termini [1972]. In other
words, the function E(U ) expresses the average degree of nonmembership of members
in a fuzzy set [Li and Mukaidono 1999]. The function E(U ) was also considered by
Hathaway [1986] for mixture distributions in relating the EM algorithm to clustering
techniques. For the function Hn (U, λ; X) in (3.1), the function E(U ) is employed to
“pull” memberships away from values equal to 0 or 1.
Based on the above discussions, this clustering technique is called fuzzy entropy
clustering [Tran and Wagner 2000f] to distinguish it from FCM clustering that has
been reviewed in Chapter 2. In general, the task of fuzzy entropy clustering is to minimise the fuzzy objective function Hn (U, λ; X) over variables U and λ, namely, finding
a pair of (U , λ) such that Hn (U , λ; X) ≤ Hn (U, λ; X). This task is implemented by
an iteration of the two steps: 1) Finding U such that Hn (U , λ; X) ≤ Hn (U, λ; X),
and 2) Finding λ such that Hn (U , λ; X) ≤ Hn (U , λ; X).
U is obtained by using the solution in (3.7), which can be presented in a similar
form to FCM clustering:
uit =
[
C (
∑
2
2
edit edjt
/
j=1
)1/n ]−1
(3.8)
Since the function E(U ) in (3.1) is not dependent on dit , determining λ is performed by minimising the first term, that is the function J1 (U, λ; X). Thus the
parameter estimation equations are identical to those in HCM clustering.
For the Euclidean distance d2it = (xt − µi )2 , we obtain [Tran and Wagner 2000g]
µi =
T
∑
uit xt
t=1
/∑
T
uit
(3.9)
t=1
For the Mahalanobis distance d2it = (xt − µi )′ Σ−1
i (xt − µi ) , we obtain
µi =
T
∑
t=1
uit xt
/∑
T
t=1
uit ,
Σi =
T
∑
t=1
uit (xt − µi )(xt − µi )′
/∑
T
t=1
uit
(3.10)
3.1 Fuzzy Entropy Clustering
49
The function Hn (U, λ; X) also has a physical interpretation. In statistical physics,
Hn (U, λ; X) is known as free energy, the first term in Hn (U, λ; X) is the expected
energy under U and the second one is the entropy of U [Jaynes 1957]. The expression
of uit in (3.8) is of the form of the Boltzmann distribution exp(−ɛ/kB τ ), a special
case of the Gibbs distribution, where ɛ is the energy, kB is the Boltzmann constant,
and τ is the temperature. Based on this property, we can apply a simulated annealing
method [Otten and Ginnenken 1989] to find a global minimum solution for λ (the way
that liquids freeze and crystallise in thermodynamics) by decreasing the temperature
τ , i.e. decreasing the value of n.
The degree of fuzzy entropy n determines the partition of X. As n → ∞, we
have uit → (1/C), each feature vector is equally assigned to C clusters, so we have
only a single cluster. As n → 0, uit → 0 or 1, and the function Hn (U, λ; X) approaches J1 (U, λ; X), it can be said that FE clustering reduces to HCM clustering
[Tran and Wagner 2000f]. Figure 3.1 illustrates the generation of clusters with different values of n. On the other hand, for data having well-separated clusters with
Figure 3.1: Generating 3 clusters with different values of n: hard clustering as n → 0,
clusters increase their overlap with increasing n > 0, and are identical to a single
cluster as n → ∞
memberships converging to the values 0 or 1, the fuzzy entropy term approaches 0
due to 1 log 1 = 0 log 0 = 0, and the FE function itself reduces to the HCM function
for all n > 0.
3.2 Modelling and Clustering Problems
3.2
50
Modelling and Clustering Problems
To apply FE clustering to statistical modelling techniques, we need to determine
relationships between the modelling and clustering problems. The first task for solving
modelling and clustering problems is to establish an optimisation criterion known as
an objective function. For modelling purposes, optimising the objective function
is to find the right parametric form of the distributions. For clustering purposes,
the optimisation is to find optimal partitions of data. Clustering is a geometric
method, where considering data involves considering shapes and locations of clusters.
In statistical modelling, data structure can be described by considering data density.
Instead of finding clusters, we find high data density areas, and thus the consideration
of data structure involves considering data distributions via the use of statistical
distribution functions. A mixture of normal distributions, or Gaussians, is effective
in the approximate description of a complicated distribution. Moreover, an advantage
of statistical modelling is that it can effectively express the temporal structure of data
through the use of a Markov process, a problem which is not addressed by clustering.
It would be useful if we could take advantages of both methods in a single approach. In order to implement this, we first define a general distance dXY for clustering. It denotes a dissimilarity between observable data X and unobservable data
(cluster, state) Y as a decreasing function of the distribution of X on component Y ,
given a model λ
d2XY = − log P (X, Y |λ)
(3.11)
This distance is used to relate the clustering problem to the statistical modelling
problem as well as the minimum distance rule to the maximum likelihood rule. Indeed, since minimising this distance leads to maximising the component distribution
P (X, Y |λ), grouping similar data points into a cluster by the minimum distance rule
thus becomes grouping these into a component distribution by the maximum likelihood rule. Clusters are now represented by component distribution functions and
hence the characteristics of a cluster are not only its shape and location, but also the
data density in the cluster and, possibly, the temporal structure of data if the Markov
process is also applied [Tran and Wagner 2000a].
3.3 Maximum Fuzzy Likelihood Estimation
3.3
51
Maximum Fuzzy Likelihood Estimation
The distance defined in (3.11) is used to transform the clustering problem to a modelling problem. Using this distance, we can relate the FE function in (3.1) to the
likelihood function. For example, we consider the case that feature vectors are assumed to be statistically independent. Using Jensen’s inequality [Ghahramani 1995]
for the log-likelihood L(λ; X), we can show that
L(λ; X) = log P (X|λ) = log
T
∏
P (xt |λ) =
≥
≥
T
∑
log
C
∑
P (xt , i|λ) =
t=1
i=1
C
T
∑∑
t=1 i=1
C
T ∑
∑
T
∑
log
t=1
uit log
log P (xt |λ)
t=1
t=1
=
T
∑
P (xt , i|λ)
uit
uit log P (xt , i|λ) −
C
∑
uit
i=1
C
T ∑
∑
P (xt , i|λ)
uit
uit log uit
(3.12)
t=1 i=1
t=1 i=1
On the other hand, according to (3.11), replacing the distance
d2it = − log P (xt , i|λ)
(3.13)
into the FE function in (3.1) and into the membership in (3.8), we obtain
Hn (U, λ; X) = −
and
uit =
T
C ∑
∑
uit log P (xt , i|λ) + n
i=1 t=1
{∑
C
T
C ∑
∑
[P (xt , i|λ)/P (xt , j|λ)]1/n
j=1
uit log uit
(3.14)
i=1 t=1
}−1
(3.15)
From (3.12), (3.14) and (3.15) we can show that
and
L(λ; X) ≥ −H1 (U, λ; X)
(3.16)
L(λ; X) = −H1 (U , λ; X)
(3.17)
where according to (3.15) we obtain U = {uit }, uit = P (i|xt , λ) as n = 1. The equality
in (3.17) shows that, if we find λ such that H1 (U , λ; X) ≤ H1 (U , λ; X) then we will
have L(λ; X) ≥ L(λ; X). It means that, as n = 1, minimising the FE function in (3.1)
using the distance in (3.13) leads to maximising the likelihood function. Therefore,
3.4 Fuzzy Entropy Hidden Markov Models
52
we can define a function Ln (U, λ; X) as follows [Tran and Wagner 2000a]
Ln (U, λ; X) = −Hn (U, λ; X) =
C
T ∑
∑
uit log P (xt , i|λ) − n
t=1 i=1
C
T ∑
∑
uit log uit (3.18)
t=1 i=1
From the above consideration, Ln (U, λ; X) can be called the fuzzy likelihood function.
Maximising this function (also minimising the FE function) is implemented by the
fuzzy EM algorithm, which is different from the standard EM algorithm in the E-step.
The fuzzy EM algorithm can be formulated as follows
Algorithm 2 (The Fuzzy EM Algorithm)
1. Initialisation: Fix n and choose an initial estimate λ
2. Fuzzy E-step: Compute U and Ln (U , λ; X)
3. M-step: Use a certain optimisation method to determine λ , for which Ln (U , λ; X)
is maximised
4. Termination: Set λ = λ and U = U , repeat from the E-step until the change of
Ln (U, λ; X) falls below a preset threshold.
3.4
Fuzzy Entropy Hidden Markov Models
This section is to apply the proposed fuzzy methods to the parameter estimation
problem for the fuzzy entropy HMM (FE-HMM). The fuzzy EM algorithm can be
viewed as a generalised Baum-Welch algorithm. FE-HMMs reduce to conventional
HMMs as the degree of fuzzy entropy n = 1.
3.4.1
Fuzzy Membership Functions
Fuzzy sets in the fuzzy HMM are determined in this section to compute the matrices U
for the fuzzy EM algorithm in Section 3.3. In the conventional HMM, each observation
ot is in each of N possible states at time t with a corresponding probability. In the
fuzzy HMM, each observation ot is regarded as being in N possible states at time t
with a corresponding degree of belonging known as the fuzzy membership function.
A state at time t is thus considered as a time-dependent fuzzy set or fuzzy state st .
3.4 Fuzzy Entropy Hidden Markov Models
53
Fuzzy states s1 s2 . . . sT are also considered as a fuzzy state sequence S. There are
N fuzzy states at each time t = 1, . . . , T , and a total of N T possible fuzzy state
sequences in the fuzzy HMM. Figure 3.2 illustrates all fuzzy states as well as fuzzy
state sequences in the HMM.
Figure 3.2: States at each time t = 1, . . . , T are regarded as time-dependent fuzzy
sets. There are N ×T fuzzy states connected by arrows into N T fuzzy state sequences
in the fuzzy HMM.
On the other hand, the observations are always considered in the sequence O and
related to the state sequence S. Therefore we define the fuzzy membership function of
the sequence O in fuzzy state sequence S based on the fuzzy membership function of
the observation ot in the fuzzy state st . For example, the fuzzy membership ust =i (O)
denotes the degree of belonging of the observation sequence O to fuzzy state sequences
being in fuzzy state st = i at time t, where i = 1, . . . , N . For computing the state
transition matrix A, we consider 2N fuzzy states at time t and time t + 1 included in
2N fuzzy state sequences and define the fuzzy membership function ust =i st+1 =j (O).
This membership denotes the degree of belonging of the observation sequence O to
fuzzy state sequences being in fuzzy state st = i at time t and fuzzy state st+1 = j at
time t + 1, where i, j = 1, . . . , N . For simplicity, this membership can be rewritten as
3.4 Fuzzy Entropy Hidden Markov Models
54
uijt (O) or uijt . Figure 3.3 illustrates such fuzzy state sequences in the fuzzy HMM.
Figure 3.3: The observation sequence O belongs to fuzzy state sequences being in
fuzzy state i at time t and fuzzy state j at time t + 1.
Similarly, we can determine fuzzy sets for computing the parameters {w, µ, Σ} in
the fuzzy continuous HMM, where the observation sequence O is the vector sequence
X. Fuzzy sets are fuzzy states and fuzzy mixtures at time t. The fuzzy membership
function ust =j mt =k (X), or ujkt for simplicity, denotes the degree of belonging of the
observation sequence X to fuzzy state j and fuzzy mixture k at time t as illustrated
in Figure 3.4.
3.4.2
Fuzzy Entropy Discrete HMM
From (3.18), the fuzzy likelihood function for the fuzzy entropy discrete HMM (FEDHMM) is proposed as follows [Tran and Wagner 2000a]
Ln (U, λ; O) = −
T∑
−1 ∑
∑
t=0 st st+1
ust st+1 d2st st+1 − n
T∑
−1 ∑
∑
ust st+1 log ust st+1
(3.19)
t=0 st st+1
where n > 0, ust st+1 = ust st+1 (O) and d2st st+1 = − log P (O, st , st+1 |λ). Note that πs1
is denoted by as0 s1 in (3.19) for simplicity. Assuming that we are in state i at time t
3.4 Fuzzy Entropy Hidden Markov Models
55
and state j at time t + 1, the function Ln (U, λ; O) can be rewritten as follows
Ln (U, λ; O) = −
N
N ∑
T∑
−1 ∑
uijt d2ijt − n
N
N ∑
T∑
−1 ∑
uijt log uijt
(3.20)
t=0 i=1 j=1
t=0 i=1 j=1
where
[
]
d2ijt = − log P (O, st = i, st+1 = j|λ) = − log αt (i)aij bj (ot+1 )βt+1 (j)
(3.21)
uijt = uijt (O) is the fuzzy membership function denoting the degree to which the
observation sequence O belongs to the fuzzy state sequences being in state i at time
t and state j at time t + 1. From the definition of the fuzzy membership (2.43), we
obtain
0 ≤ uijt ≤ 1
N
N ∑
∑
∀i, j, t,
uijt = 1
i=1 j=1
The inequalities in 0 <
state do not occur.
∑T
t=1
∀t,
0<
T
∑
uiit < T
∀i
t=1
(3.22)
uiit < T mean that the state sequences having only one
Fuzzy E-Step: Since maximising the function Ln (U, λ; O) on U is also minimising
the corresponding function Hn (U, λ; O), the solution U in (3.8) is used. The distance
Figure 3.4: The observation sequence X belongs to fuzzy state j and fuzzy mixture
k at time t in the fuzzy continuous HMM.
3.4 Fuzzy Entropy Hidden Markov Models
56
is defined in (3.21). We obtain
2
uijt =
e−dijt /n
N
N ∑
∑
e
[P (O, st = i, st+1 = j|λ)]1/n
=
N
N ∑
∑
−d2klt /n
k=1 l=1
(3.23)
1/n
[P (O, st = k, st+1 = l|λ)]
k=1 l=1
M-step: Note that the second term of the function Ln (U, λ; O) is not dependent on λ, therefore maximising Ln (U , λ; O) over λ is equivalent to maximising the
following function
L∗n (U , λ; O) = −
N
N ∑
T∑
−1 ∑
uijt d2ijt
(3.24)
t=0 i=1 j=1
Replacing the distance (3.21) into (3.24), we can regroup the function in (3.24) into
four terms as follows
L∗n (U , λ; O)
=
N
N (∑
∑
i=1
)
K (
N ∑
∑
T
∑
j=1
+
uij0 log πj +
j=1 k=1
+
i=1 j=1
t=1
s.t. ot =vk
−1
N T∑
N ∑
∑
−1
N ( T∑
N ∑
∑
N
∑
t=1
)
uijt log aij
)
uijt log bj (k)
i=1
uijt log [αt (i)βt+1 (j)]
(3.25)
i=1 j=1 t=1
where the last term including αt (i)βt+1 (j) can be ignored, since the forward-backward
variables can be computed from π, A, B by the forward-backward algorithm (see Section 2.3.4). Maximising the function L∗n (U , λ; O) on π, A, B is performed by using
Lagrange multipliers and the following constraints
N
∑
πj = 1,
N
∑
aij = 1,
j=1
j=1
K
∑
bj (k) = 1
(3.26)
k=1
We obtain the parameter reestimation equations as follows [Tran 1999]
πj =
N
∑
uij0 ,
i=1
aij =
T∑
−1
uijt
t=1
N
T∑
−1 ∑
,
uijt
t=1 j=1
bj (k) =
T
∑
t=1
s.t. ot =vk
N
∑
uijt
i=1
N
T ∑
∑
(3.27)
uijt
t=1 i=1
In the case of n = 1, the membership function in (3.23) becomes
uijt =
P (O, st = i, st+1 = j|λ)
N
N ∑
∑
k=1 l=1
P (O, st = k, st+1 = l|λ)
= P (st = i, st+1 = j|O, λ) = ξt (i, j)
(3.28)
3.4 Fuzzy Entropy Hidden Markov Models
57
where ξt (i, j) is defined in (2.20). The parameter reestimation equations in (3.27) are
now identical to those obtained by the Baum-Welch algorithm in Section 2.3.4.
3.4.3
Fuzzy Entropy Continuous HMM
Similarly, ujkt = ujkt (X) is defined as the fuzzy membership function denoting the
degree to which the vector sequence X belongs to fuzzy state st = i and fuzzy
Gaussian mixture mt = k at time t, satisfying
0 ≤ ujkt ≤ 1
N ∑
K
∑
∀j, k, t,
ujkt = 1
∀t,
0<
T
∑
ujkt < T
∀j, k
t=1
j=1 k=1
(3.29)
and the distance djkt is of the form
d2jkt = − log P (X, st = j, mt = k|λ) = − log
[∑
N
αt−1 (i)aij wjk N (xt , µjk , Σjk )βt (j)
i=1
]
(3.30)
We obtain the fuzzy EM algorithm for the fuzzy entropy continuous HMM (FECHMM) as follows [Tran and Wagner 2000a]
Fuzzy E-Step:
2
e−djkt /n
ujkt =
N ∑
K
∑
=
−d2ilt /n
e
i=1 l=1
[P (X, st = j, mt = k|λ)]1/n
N ∑
K
∑
[P (X, st = i, mt = l|λ)]
(3.31)
1/n
i=1 l=1
M-step: Similar to the continuous HMM, the parameter estimation equations for
the π and A distributions are unchanged, but the output distribution B is estimated
via Gaussian mixture parameters (w, µ, Σ) as follows
wjk =
T
∑
ujkt
t=1
K
T ∑
∑
,
ujkt
t=1 k=1
µjk =
T
∑
ujkt xt
t=1
T
∑
,
ujkt
t=1
Σjk =
T
∑
ujkt (xt − µjk )(xt − µjk )′
t=1
T
∑
ujkt
t=1
(3.32)
In the case of n = 1, the membership function in (3.31) becomes
ujkt =
P (X, st = j, mt = k|λ)
N
N ∑
∑
i=1 l=1
P (X, st = i, mt = l|λ)
= P (st = j, mt = k|X, λ) = ηt (j, k)
(3.33)
3.4 Fuzzy Entropy Hidden Markov Models
58
where ηt (j, k) is defined in (2.22). The parameter reestimation equations in (3.32)
are now identical to those obtained by the Baum-Welch algorithm in Section 2.3.4.
3.4.4
Noise Clustering Approach
The speech signal is influenced by the speaking environment, the transmission channel, and the transducer used to capture the signal. So there exist some bad observations regarded as outliers, which influence speech recognition performance. For the
fuzzy entropy HMM in the noise clustering approach (NC-FE-HMM), a separate state
is used to represent outliers and is termed the garbage state [Tran and Wagner 1999a].
This state has a constant distance δ from all observation sequences. The membership
u•t of an observation sequence O at time t in the garbage state is defined to be
u•t = 1 −
N
N ∑
∑
uijt
1≤t≤T
(3.34)
i=1 j=1
Therefore, the membership constraint for the “good” states is effectively relaxed to
N
N ∑
∑
uijt < 1
1≤t≤T
(3.35)
i=1 j=1
This allows noisy data and outliers to have arbitrarily small membership values in
good states. The fuzzy likelihood function for the FE-DHMM in the NC approach
(NC-FE-DHMM) is as follows
Ln (U, λ; O) = −
N ∑
T∑
−1 ∑
N
uijt d2ijt − n
−
uijt log uijt
t=0 i=1 j=1
t=0 i=1 j=1
T∑
−1
N
N ∑
T∑
−1 ∑
2
u•t δ − n
T∑
−1
u•t log u•t
(3.36)
t=0
t=0
Replacing u•t in (3.34) into (3.36) and maximising the fuzzy likelihood function over
U , we obtain [Tran and Wagner 2000a]
Fuzzy E-Step:
uijt =
1
N
N ∑
∑
2
2
2
2
(edijt /edklt )1/n + (edijt /eδ )1/n
k=1 l=1
=
N
N ∑
∑
k=1 l=1
[P (O, st = i, st+1 = j|λ)]1/n
[P (O, st = k, st+1 = l|λ)]
1/n
(3.37)
−δ 2 /n
+e
3.5 Fuzzy Entropy Gaussian Mixture Models
59
M-Step: The M-step is identical to the M-step of the FE-DHMM in Section
3.4.2.
where the distance dijt is computed as in (3.21). The second term in the denominator
of (3.37) becomes quite large for outliers, resulting in small membership values in all
the good states for outliers. The advantage of this approach is that it forms a more
robust version of the fuzzy entropy algorithm and can be used instead of the fuzzy
entropy algorithm provided a suitable value for the constant distance δ can be found.
Similarly, the FE-CHMM in the NC approach (NC-FE-CHMM) is as follows
Fuzzy E-Step:
ujkt =
1
N ∑
K
∑
2
2
2
2
(edjkt /edilt )1/n + (edjkt /eδ )1/n
i=1 l=1
=
K
N ∑
∑
[P (X, st = j, mt = k|λ)]1/n
1/n
[P (X, st = i, mt = l|λ)]
+e
(3.38)
−δ 2 /n
i=1 l=1
M-Step: The M-step is identical to the M-step of the FE-CHMM in Section 3.4.3.
3.5
Fuzzy Entropy Gaussian Mixture Models
Although we can obtain equations for the fuzzy entropy GMM (FE-GMM) from the
FE-CHMM with the number of states set to N = 1, for practical applications, they
are summarised as follows.
3.5.1
Fuzzy Entropy GMM
For a training vector sequence X = (x1 x2 . . . xT ), the fuzzy likelihood of the FE-GMM
is
Ln (U, λ; X) = −
T
K ∑
∑
i=1 t=1
uit d2it − n
T
K ∑
∑
uit log uit
(3.39)
i=1 t=1
where
d2it = − log P (xt , i|λ) = − log [wi N (xt , µi , Σi )]
(3.40)
3.5 Fuzzy Entropy Gaussian Mixture Models
60
and uit = ui (xt ) is the fuzzy membership function denoting the degree to which
feature vector xt belongs to fuzzy Gaussian mixture i, satisfying
0 ≤ uit ≤ 1
K
∑
∀i, t,
uit = 1
∀t,
0<
T
∑
uit < T
∀i
(3.41)
t=1
i=1
Fuzzy E-Step: Maximising the fuzzy likelihood function on U gives
[P (xt , i|λ)]1/n
2
uit =
e−dit /n
K
∑
e
=
−d2kt /n
k=1
K
∑
(3.42)
1/n
[P (xt , k|λ)]
k=1
M-Step: Maximising the fuzzy likelihood function on λ gives
wi =
T
∑
uit
t=1
K
T
∑∑
µi =
,
ukt
t=1 k=1
T
∑
uit xt
t=1
T
∑
Σi =
,
T
∑
uit (xt − µi )(xt − µi )′
t=1
uit
t=1
Again if n = 1 we obtain
uit =
uit
t=1
P (xt , i|λ)
K
∑
(3.43)
T
∑
= P (i|xt , λ)
(3.44)
P (xt , k|λ)
k=1
and the FE-GMM reduces to the conventional GMM.
3.5.2
Noise Clustering Approach
The fuzzy likelihood function for the FE-GMM in the NC approach (NC-FE-GMM)
is as follows
Ln (U, λ; X) = −
−
K
T ∑
∑
t=1 i=1
T
∑
uit d2it
and λ gives
∑K
i=1
uit log uit
t=1 i=1
2
u•t δ − n
t=1
where u•t = 1 −
−n
K
T∑
−1 ∑
T
∑
u•t log u•t
(3.45)
t=1
uit , 1 ≤ t ≤ T . Maximising the fuzzy likelihood function on U
Fuzzy E-Step:
[P (xt , i|λ)]1/n
2
uit =
e−dit /n
K
∑
k=1
e
−d2kt /n
+e
=
−δ 2 /n
K
∑
1/n
[P (xt , k|λ)]
(3.46)
−δ 2 /n
+e
k=1
M-Step: the M-step is identical to the M-step of the FE-GMM in Section 3.5.1.
3.6 Fuzzy Entropy Vector Quantisation
3.6
61
Fuzzy Entropy Vector Quantisation
Reestimation algorithms for fuzzy entropy VQ (FE-VQ) are derived from the algorithms for FE-GMMs by using the Euclidean or the Mahalanobis distances. We
obtain the following [Tran and Wagner 2000g, Tran and Wagner 2000h]:
Fuzzy E-Step: Choose one of the following cases
• FE-VQ:
uit =
1
)
d2 − d2kt
exp it
n
k=1
K
∑
(3.47)
(
• FE-VQ with noise clusters (NC-FE-VQ):
uit =
1
K
∑
exp
k=1
(
d2it
−
n
)
d2kt
d2 − δ 2
+ exp it
n
(
)
(3.48)
M-Step: Choose one of the following cases
• Using the Euclidean distance: d2it = (xt − µi )2
µi =
T
∑
uit xt
t=1
T
∑
(3.49)
uit
t=1
• Using the Mahalanobis distance: d2it = (xt − µi )′ Σ−1
i (xt − µi )
µi =
T
∑
uit xt
t=1
T
∑
,
uit
t=1
3.7
Σi =
T
∑
uit (xt − µi )(xt − µi )′
t=1
T
∑
(3.50)
uit
t=1
A Comparison Between Conventional and Fuzzy
Entropy Models
It may be useful for applications to consider the differences between the conventional
models and FE models. In general, the difference is mainly the conventional Estep and the fuzzy E-step in the parameter reestimation procedure. A weighting
3.7 A Comparison Between Conventional and Fuzzy Entropy Models
62
exponent 1/n on each joint probability P (X, Y |λ) between observable data X and
unobservable data Y is introduced in the FE models. If n = 1, the FE models reduce
to the conventional models, as shown in Figure 3.5
[
]1/n
P (X, Y |λ)
uY (X) = ∑ [
]1/n
P (X, Y |λ)
n=1
✲
P (X, Y |λ)
= P (Y |X, λ)
uY (X) = ∑
P (X, Y |λ)
Y
Y
Fuzzy Entropy Models
Conventional Models
Figure 3.5: From fuzzy entropy models to conventional models
The role of the degree of fuzzy entropy n can be considered in depth via its
influence on the parameter reestimation equations. Without loss of generality, let
us consider a problem of GMMs. Given a vector xt , let us assume that the cluster
i among K clusters has the highest component density P (xt , i|λ), i.e. the shortest
distance d2it = − log P (xt , i|λ). Consider the membership uit of the fuzzy entropy
GMM with n > 1. From (3.42), it can be rewritten as
2
uit =
e−dit /n
K
∑
k=1
=
−d2kt /n
e
=
[P (xt , i|λ)]1/n
K
∑
[P (xt , k|λ)]1/n
k=1
[P (xt , i|λ)]1/n
[P (xt , i|λ)]1/n +
K
∑
k=1
k6=i
[P (xt , k|λ)]1/n
=
1+
K
∑
k=1
k6=i
[
1
(3.51)
]
P (xt , k|λ) 1/n
P (xt , i|λ)
For all k = 1, . . . , K and k 6= i, we obtain the following equivalent inequalities:
⇔
⇔
P (xt , k|λ) < P (xt , k|λ)
P (xt , k|λ)
< 1
P (xt , i|λ)
[
]
P (xt , k|λ)
P (xt , k|λ) 1/n
<
P (xt , i|λ)
P (xt , i|λ)
since n > 1
3.7 A Comparison Between Conventional and Fuzzy Entropy Models
⇔
1+
K
∑
k=1
k6=i
[
63
1
1
]1/n <
K
∑ P (xt , k|λ)
P (xt , k|λ)
1+
P (xt , i|λ)
k=1 P (xt , i|λ)
k6=i
⇔
uit < P (i|xt , λ)
(3.52)
The same can be shown easily for the remaining cases. In general, we obtain
• P (xt , i|λ) ≥ P (xt , k|λ) ∀k 6= i : xt is closest to cluster i
uit
> P (i|xt , λ)
0<n<1
= P (i|xt , λ)
n=1
< P (i|xt , λ)
n>1
(3.53)
• P (xt , i|λ) ≤ P (xt , k|λ) ∀k 6= i : xt is furthest from cluster i
uit
< P (i|xt , λ)
0<n<1
= P (i|xt , λ)
n=1
> P (i|xt , λ)
n>1
(3.54)
As discussed above, the distance between vector xt and cluster i is a monotonically
decreasing function of the joint probability P (xt , i|λ) (see (3.40)), therefore we can
have an interpretation for the parameter n. If 0 < n < 1, comparing with the a
posteriori probability P (i|xt , λ) in the GMM, the degree of belonging uit of vector
xt to cluster i is higher than P (i|xt , λ) if xt is close to cluster i and is lower than
P (i|xt , λ) if xt is far from cluster i. The reverse result is obtained for n > 1. Since
model parameters λ = {w, µ, Σ} are determined by memberships (see (3.43) in the
M-step), we can expect that the use of the parameter n will yield a better parametric
form for the GMMs. As discussed in Section 2.3.2, when the amount of the training
data is insufficient, the quality of the distribution parameter estimates cannot be
guaranteed. In reality, this problem often occurs. Therefore fuzzy entropy models
may enhance the above quality by their adjustable parameter n. If we wish to decrease
the influence of vectors far from cluster center, we reduce the value of n to less than 1.
Inversely, values of n greater than 1 increase the influence of those vectors. In general,
there does not exist a best value of n in all cases. For different applications and data,
suitable values of n may be different. The membership function with different values
of n versus the distance between vector xt and cluster i is shown in Figure 3.6. The
limit of n = 0 represents hard models, which will be presented in Chapter 5.
3.8 Summary and Conclusion
64
Figure 3.6: The membership function uit with different values of the degree of fuzzy
entropy n versus the distance dit between vector xt and cluster i
For example, consider the case of 4 clusters. Given P (xt , i|λ), i = 1, 2, 3, 4, we
compute uit with n = 1 for the GMM, and with n = 0.5 and n = 2 for the FE-GMM.
Table 3.1 shows these values for comparison.
Cluster i
i=1
Given P (xt , i|λ)
uit = P (i|xt , λ) for GMM
i=2
i=3
i=4
0.0016 0.0025 0.0036 0.0049
(n = 1.0)
0.13
0.20
0.28
0.39
uit for fuzzy entropy GMM (n = 0.5)
0.06
0.14
0.28
0.52
uit for fuzzy entropy GMM (n = 2.0)
0.18
0.23
0.27
0.32
Table 3.1: An example of memberships for the GMM and the FE-GMM
3.8
Summary and Conclusion
Fuzzy entropy models have been presented in this chapter. Relationships between
fuzzy entropy models are summarised in Figure 3.7 below. A parameter is introduced
for the degree of fuzzy entropy n > 0. With n → 0, we obtain hard models, which
will be presented in Chapter 5. With n → ∞, we obtain maximally fuzzy entropy
3.8 Summary and Conclusion
65
models, equivalent to only a single state or a single cluster. With n = 1, fuzzy entropy
models reduce to conventional models in the maximum likelihood scheme. This result
shows that the statistical models can be viewed as special cases of fuzzy models. An
advantage obtained from this viewpoint is that we can get ideas from fuzzy methods
and apply them to statistical models. For example, by letting n = 1 in the noise
clustering approach, we obtain new models for HMMs, GMMs and VQ without any
fuzzy parameters. Moreover, the adjustibility of the degree of fuzzy entropy n in FE
models is also an advantage. When conventional models do not work well in some
cases due to the insufficiency of the training data or the complexity of the speech data,
such as the nine English E-set words, a suitable value of n can be found to obtain
better models. Experimental results for these models will be reported in Chapter 7.
FE Models
❄
FE
Discrete Models
❄
FE
DHMM
❄
FE
Continuous Models
❄
NC-FE
DHMM
❄
NC
CHMM
❄
NC-FE
CHMM
❄
FE
GMM
❄
NC-FE
GMM
❄
FE
VQ
❄
NC-FE
VQ
Figure 3.7: Fuzzy entropy models for speech and speaker recognition
Chapter 4
Fuzzy C-Means Models
This chapter proposes a fuzzy approach based on fuzzy C-means (FCM) clustering to
speech and speaker recognition. Models in this approach can be called FCM models
and are estimated by the minimum fuzzy squared-error criterion used in FCM clustering.
This criterion is different from the maximum likelihood criterion, hence FCM models
cannot reduce to statistical models. However, the parameter estimation procedure of
FCM models is similar to that of FE models, where distances are defined in the same way.
In this chapter, the fuzzy EM algorithm is reformulated for the minimum fuzzy squarederror criterion in Section 4.1. FCM models for HMMs, GMMs and VQ are presented in
the next Sections 4.2, 4.3 and 4.4, respectively. The noise clustering approach is also
considered for these models. A discussion on the role of fuzzy memberships of FCM
models and a comparison between FCM models and FE models are presented in Section
4.5.
4.1
Minimum Fuzzy Squared-Error Estimation
The fuzzy squared-error function in (2.44) page 38 is used as an optimisation criterion
to estimate FCM models. For convenience, we reintroduce this function as follows
Jm (U, λ; X) =
T
C ∑
∑
2
um
it dit
(4.1)
i=1 t=1
where U = {uit } is a fuzzy C-partition of X, m > 1 is a weighting exponent on
each fuzzy membership uit and is called the degree of fuzziness, λ and dit are defined
66
4.2 Fuzzy C-Means Hidden Markov Models
67
for particular models. Minimising the function Jm (U, λ; X) over the variables U
and λ is implemented by an iteration of the two steps: 1) Finding U such that
Jm (U , λ; X) ≤ Jm (U, λ; X), and 2) Finding λ such that Jm (U , λ; X) ≤ Jm (U , λ; X).
The fuzzy EM algorithm on page 52 is reformulated for FCM models as follows
[Tran and Wagner 1999b]
Algorithm 3 (The Fuzzy EM Algorithm)
1. Initialisation: Fix m and choose an initial estimate λ
2. Fuzzy E-step: Compute U and Jm (U , λ; X)
3. M-step: Use a certain minimisation method to determine λ , for which Jm (U , λ; X)
is minimised
4. Termination: Set λ = λ and U = U , repeat from the E-step until the change of
Jm (U, λ; X) falls below a preset threshold.
4.2
Fuzzy C-Means Hidden Markov Models
This section proposes a parameter estimation procedure for the HMM using the minimum fuzzy squared-error estimation. Noise clustering and possibilistic C-means
approaches are also considered for both discrete and continuous HMMs.
4.2.1
FCM Discrete HMM
From (4.1), the fuzzy squared-error function for the FCM discrete HMM (FCMDHMM) is proposed as follows [Tran and Wagner 1999c]
Jm (U, λ; O) =
T∑
−1 ∑
∑
t=0 st st+1
2
um
st st+1 dst st+1
(4.2)
where m > 1, ust st+1 = ust st+1 (O) and d2st st+1 = − log P (O, st , st+1 |λ). Note that πs1
is denoted by as0 s1 in (4.2) for simplicity. Assuming that we are in state i at time t
and state j at time t + 1, the function Jm (U, λ; O) can be rewritten as follows
Jm (U, λ; O) =
N
N ∑
T∑
−1 ∑
t=0 i=1 j=1
2
um
ijt dijt
(4.3)
4.2 Fuzzy C-Means Hidden Markov Models
68
where uijt and dijt are defined in (3.22) and (3.21), respectively. Minimising the
function Jm (U, λ; O) over U and λ gives the following fuzzy EM algorithm for the
FCM-DHMM
Fuzzy E-Step: Minimising the function Jm (U, λ; O) in (4.2) over U gives
uijt =
[
N (
N ∑
∑
d2ijt
k=1 l=1
/
d2klt
)1/(m−1) ]−1
(4.4)
M-step: Replacing the distance (3.21) into (4.3), we can regroup the function in
(4.3) into four terms as follows
Jm (U , λ; O) = −
N
N (∑
∑
i=1
j=1
−
K
N ∑
∑
j=1 k=1
−
)
um
ij0 log πj −
(
i=1 j=1
T
∑
t=1
s.t. ot =vk
−1
N T∑
N ∑
∑
−1
N ( T∑
N ∑
∑
N
∑
)
um
ijt log aij
t=1
)
um
ijt log bj (k)
i=1
um
ijt log [αt (i)βt+1 (j)]
(4.5)
i=1 j=1 t=1
where, similar to the FE-DHMM, the last term including αt (i)βt+1 (j) is ignored,
since the forward-backward variables can be computed from π, A, B by the forwardbackward algorithm (see Section 2.3.4). Minimising the function Jm (U , λ; O) on
π, A, B is performed by using Lagrange multipliers and the same constraints as in
(3.26) [Tran and Wagner 1999c]. We obtain the parameter reestimation equations as
follows
πj =
N
∑
um
ij0 ,
aij =
i=1
T∑
−1
um
ijt
t=1
N
T∑
−1 ∑
,
bj (k) =
um
ijt
t=1 j=1
4.2.2
T
∑
N
∑
um
ijt
t=1
i=1
s.t. ot =vk
N
T ∑
∑
um
ijt
t=1 i=1
(4.6)
FCM Continuous HMM
Similarly, using the fuzzy membership function ujkt and the distance djkt in (3.29)
and (3.30), respectively, we obtain the fuzzy EM algorithm for the FCM continuous
HMM (FCM-CHMM) as follows [Tran and Wagner 1999a]
Fuzzy E-Step:
ujkt =
[
K (
N ∑
∑
i=1 l=1
d2jkt
/
d2ilt
)1/(m−1) ]−1
(4.7)
4.2 Fuzzy C-Means Hidden Markov Models
69
M-step: Similar to the FE-CHMM, the parameter estimation equations for the
π and A distributions are unchanged, but the output distribution B is estimated via
Gaussian mixture parameters (w, µ, Σ) as follows
wjk =
T
∑
um
jkt
t=1
K
T
∑∑
,
µjk =
um
jkt
t=1 k=1
4.2.3
T
∑
um
jkt xt
t=1
T
∑
,
Σjk =
T
∑
′
um
jkt (xt − µjk )(xt − µjk )
t=1
T
∑
um
jkt
t=1
um
jkt
t=1
(4.8)
Noise Clustering Approach
The concept of the garbage state in Section 3.4.4, page 58 is applied to the FCMDHMM. The fuzzy objective function for the FCM-DHMM in the NC approach (NCFCM-DHMM) is as follows [Tran and Wagner 1999a]
Jm (U, λ; O) =
N ∑
N
T∑
−1 ∑
2
um
ijt dijt +
T∑
−1
t=0
t=0 i=1 j=1
(1 −
N ∑
N
∑
uijt )m δ 2
(4.9)
i=1 j=1
The fuzzy EM algorithm for the NC-FCM-CHMM is as follows
Fuzzy E-Step:
uijt =
1
N
N ∑
∑
(d2ijt /d2klt )1/(m−1)
(4.10)
+
(d2ijt /δ 2 )1/(m−1)
k=1 l=1
where dijt is defined in (3.21). The second term in the denominator of (4.10) becomes
quite large for outliers, resulting in small membership values in all the good states
for outliers.
M-Step: identical to the M-step of the FCM-DHMM in Section 4.2.1.
Similarly, the FCM-CHMM in the NC approach (NC-FCM-CHMM) is as follows
Fuzzy E-Step:
ujkt =
1
K
N ∑
∑
(d2jkt /d2ilt )1/(m−1)
(4.11)
+
(d2jkt /δ 2 )1/(m−1)
i=1 l=1
where djkt is defined in (3.30).
M-Step: identical to the M-step of the FCM-CHMM in Section 4.2.2.
4.3 Fuzzy C-Means Gaussian Mixture Models
4.3
70
Fuzzy C-Means Gaussian Mixture Models
Similar to FE-GMMs, FCM Gaussian mixture models (FCM-GMMs) are summarised
as follows.
4.3.1
Fuzzy C-Means GMM
For a training vector sequence X = (x1 x2 . . . xT ), the fuzzy objective function of the
fuzzy C-means GMM (FCM-GMM) is [Tran et al. 1998a]
Jm (U, λ; X) =
T
K ∑
∑
2
um
it dit
(4.12)
i=1 t=1
where
d2it = − log P (xt , i|λ) = − log [wi N (xt , µi , Σi )]
(4.13)
and uit = ui (xt ) is the fuzzy membership function denoting the degree to which
feature vector xt belongs to Gaussian distribution i, satisfying
0 ≤ uit ≤ 1
K
∑
∀i, t,
uit = 1
∀t,
0<
T
∑
uit < T
∀i
(4.14)
t=1
i=1
The fuzzy EM algorithm for the FCM-GMM is as follows [Tran and Wagner 1998]
Fuzzy E-Step: Minimising the fuzzy objective function over U gives
uit =
[
K (
∑
d2it
k=1
/
d2kt
)1/(m−1) ]−1
(4.15)
M-Step: Minimising the fuzzy objective function over λ gives
wi =
T
∑
um
it
t=1
K
T ∑
∑
t=1 k=1
4.3.2
,
um
kt
µi =
T
∑
um
it xt
t=1
T
∑
,
Σi =
T
∑
′
um
it (xt − µi )(xt − µi )
t=1
T
∑
um
it
t=1
(4.16)
um
it
t=1
Noise Clustering Approach
The fuzzy objective function for the FCM-GMM in the NC approach (NC-FCMGMM) is as follows [Tran and Wagner 1999e]
Jm (U, λ; X) =
K
T ∑
∑
t=1 i=1
uit d2it +
T
∑
t=1
(1 −
K
∑
i=1
uit )m δ 2
(4.17)
4.4 Fuzzy C-Means Vector Quantisation
71
Minimising the fuzzy objective function on U and λ gives
Fuzzy E-Step:
uit =
1
K
∑
1/(m−1)
(d2it /d2kt )
(4.18)
2 1/(m−1)
+ (d2it /δ )
k=1
M-Step: identical to the M-step of the FCM-GMM in Section 4.3.1.
4.4
Fuzzy C-Means Vector Quantisation
The reestimation algorithms for fuzzy C-means VQ (FCM-VQ) are identical to the
FCM algorithms reviewed in Section 2.4.6, page 37.
4.5
Comparison Between FCM and FE Models
We have considered three kinds of models: 1) Conventional models in Chapter 2,
2) FE models in Chapter 3, and 3) FCM models in this chapter. The relationship
between conventional and FE models has been discussed in Section 3.7 page 61 and
can be summarised in Figure 4.1. Conventional models (HMMs, GMMs and VQ) are
considered as a special group with a value of the degree of fuzzy entropy n = 1 within
the infinite family of FE model groups with values of n in the range (0, ∞).
0
•
✻
FE Models
Hard Models
1
•
✻
FE Models
n
✲
Conventional Models
Figure 4.1: The relationship between FE model groups versus the degree of fuzzy entropy
n
Such a relationship is not available for conventional models and FCM models. This
means that no suitable value of the degree of fuzziness m in (1, ∞) can be set for
FCM models to reduce to the conventional models. However, a similar relationship
can be established between the group well-known in pattern recognition with m = 2
4.5 Comparison Between FCM and FE Models
1
•
✻
FCM Models
Hard Models
2
•
✻
FCM Models
72
m
✲
Typical Models
Figure 4.2: The relationship between FCM model groups versus the degree of fuzziness m
and other FCM model groups with m > 1 and m 6= 2. Figure 4.2 shows the similar
relationship between FCM models to that in Figure 4.1.
Therefore we discuss this relationship before comparing FCM and FE models.
Without loss of generality, let us consider the following problem for GMMs. Given
a vector xt , we assume that xt is closest to the cluster i among K clusters. Consider
the membership uit of the FCM-GMM. From (4.15), it can be rewritten as
uit =
1
K
∑
k=1
(
)
d2it 1/(m−1)
d2kt
=
1+
K
∑
k=1
k6=i
(
1
)
d2it 1/(m−1)
d2kt
(4.19)
Writing u∗it for the membership if m = 2, we obtain
u∗it
/[
=1
1+
]
K ( 2 )
∑
dit
k=1
k6=i
d2kt
(4.20)
1
d2
< 1. From the above assumption for xt , we obtain 2it < 1
m−1
dkt
∀k =
6 i. Since the function ax decreases for a < 1, we can show that
As m > 2, we have
uit < u∗it
(4.21)
It can be easily shown for the remaining cases. In general, we obtain
• d2it ≤ d2kt
∀k 6= i : xt is closest to cluster i
uit
• d2it ≥ d2kt
> u∗it
=
<
u∗it
u∗it
1<m<2
m=2
(4.22)
m>2
∀k 6= i : xt is furthest from cluster i
uit
< u∗it
1<m<2
= u∗it
m=2
u∗it
m>2
>
(4.23)
4.5 Comparison Between FCM and FE Models
73
Comparing with the typical FCM model with m = 2, if we wish to decrease the
influence of vectors far from the cluster center, we reduce the value of m to less than
2. Inversely, values of m > 2 increase the influence of those vectors. The membership
function with different values of m versus the distance between vector xt and cluster
i is demonstrated in Figure 4.3, which is quite similar to that demonstrated in Figure
3.3 page 64 for FE models. The limit of m = 1 represents hard models, which will be
presented in Chapter 5.
Figure 4.3: The FCM membership function uit with different values of the degree of
fuzziness m versus the distance dit between vector xt and cluster i
It is known that model parameters are estimated by the memberships in the Mstep of reestimation algorithms, therefore selecting a suitable value of the degree of
fuzziness m is necessary to obtain optimum model parameter estimates. This can be
a solution for the insufficient training data problem. In this case, the quality of the
model parameter estimates trained by conventional methods cannot be guaranteed
and hence FCM models with the adjustable parameter m can be employed to find
better estimates. Although FE models also have the advantage of the adjustable
parameter n, the membership functions of FE and FCM models are different because
of the employed different optimisation criteria. We compare the expressions of FE
4.5 Comparison Between FCM and FE Models
74
and FCM memberships taken from (3.42) and (4.15) as follows
−d2it /n
FE:
uit =
e
K
∑
FCM:
e
−d2kt /n
k=1
uit =
(
1
d2it
)1/(m−1)
)
K (
∑
1 1/(m−1)
k=1
(4.24)
d2kt
For simplicity, we consider the typical cases where n = 1 and m = 2. It can be
seen that the FE membership employs the function f (x) = e−x whereas the FCM
membership employs the function f (x) = 1/x, where x = d2it . Figure 4.4 shows
curves representing these functions.
Figure 4.4: Curves representing the functions used in the FE and FCM memberships,
where x = d2it , m = 2 and n = 1
We can see that the change of the FCM membership is more rapid than the one of
the FE membership for short distances (0 < x < 1). Applied to cluster analysis, this
means that the FCM memberships of feature vectors close to the cluster center can
have very different values even if these vectors are close together, whereas their FE
memberships are not very different. On the other hand, in the M-step, FE and FCM
model parameters are estimated by uit and um
it , respectively. For long distances, i.e.
for feature vectors very far from the cluster center, the difference between uit for the
FE membership and um
it for the FCM membership is not significant. Indeed, although
Figure 4.4 shows the FCM membership value is much greater than the FE membership
4.6 Summary and Conclusion
75
value for long distances, for estimating the model parameters in the M-step, the FCM
membership value is reduced due to the weighting exponent m (um
it < uit as m > 1
and 0 ≥ uit ≥ 1).
4.6
Summary and Conclusion
Fuzzy C-means models have been proposed in this chapter. A parameter is introduced
as the degree of fuzziness m > 1. With m → 1, we obtain hard models, which will
be presented in Chapter 5. With m → ∞, we obtain maximally fuzzy models with
only a single state or a single cluster. Typical models with m = 2 are well known
in pattern recognition. The main differences between FE models and FCM models
are the optimisation criteria and the reestimation of the fuzzy membership functions.
The advantage of the FE and FCM models is that they have adjustable parameters
m and n, which may be useful for finding optimum models for solving the insufficient
training data problem. Relationships between the FCM models are summarised in
Figure 4.5 below. Experimental results for these models will be reported in Chapter
6.
FCM Models
❄
FCM
Discrete Models
❄
FCM
DHMM
❄
NC-FCM
DHMM
❄
FCM
Continuous Models
❄
NC-FCM
CHMM
❄
NC-FCM
CHMM
❄
FCM
GMM
❄
NC-FCM
GMM
❄
FCM
VQ
❄
NC-FCM
VQ
Figure 4.5: Fuzzy C-means models for speech and speaker recognition
Chapter 5
Hard Models
As the degrees of fuzzy entropy and fuzziness tend to their minimum values, both the
fuzzy entropy and the fuzzy C-means models approach the hard model. For fuzzy and
conventional models, the model structures are not fundamentally different, except for the
use of fuzzy optimisation criteria and the fuzzy membership. However, a different model
structure applies to the hard models because of the binary (zero-one) membership function.
For example, the hard HMM employs only the best path for estimating model parameters
and for recognition, and the hard GMM employs only the most likely Gaussian distribution
among the mixture of Gaussians to represent a feature vector. Although the smoothed
(fuzzy) membership is more successful than the binary (hard) membership in describing
the model structures, hard models can also be used because they are simple yet effective.
The simplest hard model is the VQ model, which is effective for speaker recognition.
This chapter proposes new hard models—hard HMMs and hard GMMs. These models
emerge as interesting consequences of investigating fuzzy approaches. Sections 5.2 and
5.3 present hard models for HMMs and GMMs, respectively. The last section gives a
summary and a conclusion for hard models.
5.1
From Fuzzy To Hard Models
Fuzzy and hard models have a mutual relation: fuzzy models are obtained by fuzzifying hard models, or inversely we can derive hard models by defuzzification of fuzzy
models. We consider this relation for the simplest models: VQ (hard C-means) and
76
5.1 From Fuzzy To Hard Models
77
fuzzy VQ (FE-VQ and FCM-VQ). As presented in Chapters 3 and 4, the fuzzification of the hard objective function (the sum-of-squared-error function) to the fuzzy
objective function is achieved by adding a fuzzy entropy term to the hard objective
function for FE-VQ or applying a weighting exponent m > 1 to each uit for FCM-VQ.
Figure 5.1 shows this fuzzification. Inversely, the defuzzification of FE-VQ by letting
n → 0 or FCM-VQ by letting m → 1 results in the same VQ. To obtain a simpler
calculation, a more convenient way is to use the minimum distance rule mentioned
in Equation (2.41) on page 37 to implement the defuzzification. Figure 5.2 shows the
defuzzification methods.
FE
J(U, λ; X) =
T
C ∑
∑
✲ Hn (U, λ; X) =
T
C ∑
∑
uit d2it + n
T
C ∑
∑
uit log uit
i=1 t=1
i=1 t=1
uit d2it ✲
i=1 t=1
FCM
✲
Jm (U, λ; X) =
T
C ∑
∑
2
um
it dit
i=1 t=1
Hard: uit = 0 or 1
Fuzzy: uit ∈ [0, 1]
Figure 5.1: From hard VQ to fuzzy VQ: an additional fuzzy entropy term for fuzzy
entropy VQ, and a weighting exponent m > 1 on each uit for fuzzy C-means VQ.
The mutual relation between fuzzy and hard VQ models has already been reported
in the literature. In this thesis, we show that this relation can be applied to more
general models, such as the GMM and the HMM. Indeed, the fuzzy HMMs and
FE-VQ:
n→0
FCM-VQ:
m→1
{
1 if dit < dkt ∀k
Minimum distance rule: uit =
0 otherwise
✲
VQ
Figure 5.2: From fuzzy VQ to (hard) VQ: n → 0 for FE-VQ or m → 1 for FCM-VQ
or using the minimum distance rule to compute uit directly.
5.2 Hard Hidden Markov Models
78
the fuzzy GMMs presented in Chapters 3 and 4 are generalised from fuzzy VQ by
using the distance-probability relation d2XY = − log P (X, Y ) that relates clustering
to modelling. In this chapter, the distance-probability relation is used to derive hard
HMMs and hard GMMs from fuzzy HMMs and fuzzy GMMs, respectively. Applying
this relation to the minimum distance rule, we obtain a probabilistic rule to determine
the hard membership value uY (X) as follows
uY (X) =
{
1
if P (X, Y ) > P (X, Z) ∀Z 6= Y
0
otherwise
(5.1)
where X denotes observable data and Y , Z denote unobservable data. This rule can
be called the maximum joint probability rule. Figure 5.3 shows the mutual relations
between fuzzy and hard models used in this thesis.
Hard HMMs
Hard GMMs
✛
Minimum distance rule
Fuzzy HMMs
Fuzzy GMMs
✻2
dXY = − log P (X, Y )
VQ
✛
Fuzzification
✲
Fuzzy VQ
Defuzzification
Figure 5.3: Mutual relations between fuzzy and hard models
5.2
Hard Hidden Markov Models
As discussed in Section 3.4.1, the membership uijt denotes the belonging of an observation sequence O to fuzzy state sequences being in state i at time t and state j at
time t + 1. There are N possible fuzzy states at each time t and they are concatenated into fuzzy state sequences. Figure 5.4 illustrates possible fuzzy state sequences
starting from state 1 at time t = 1 and ending at state 3 at time t = T in a 3-state
left-to-right HMM with ∆i = 1 (the Bakis HMM) and also in the corresponding fuzzy
HMM. As discussed above, the hard membership function takes only two values 0 and
1. This means that in hard HMMs, each observation ot belongs only to the most likely
state at each time t. In other words, the sequence O is in the most likely single state
5.2 Hard Hidden Markov Models
79
Figure 5.4: Possible state sequences in a 3-state Bakis HMM and a 3-state fuzzy
Bakis HMM
sequence. This is quite similar to the state sequence in HMMs using the Viterbi
algorithm, where if we round maximum probability values to 1 and others to 0, we
obtain hard HMMs. Therefore, conventional HMMs using the Viterbi algorithm can
be regarded as “pretty” hard HMMs. This remark gives an alternative approach to
hard HMMs from conventional HMMs based on the Viterbi algorithm. Figure 5.5
illustrates a possible single state sequence in the hard HMM [Tran et al. 2000a].
Figure 5.5: A possible single state sequence in a 3-state hard HMM
5.2 Hard Hidden Markov Models
5.2.1
80
Hard Discrete HMM
This section shows the reestimation algorithm for the hard discrete HMM (H-DHMM),
where the maximum joint probability rule in (5.1) is formulated as the “hard” E-step.
In previous chapters, we have used the fuzzy membership uijt for FE-DHMMs and
FCM-DHMMs. However, for reducing calculation, the hard membership uijt can be
computed by the product of the memberships uit and uj(t+1) . Indeed, uijt = 1 means
that the observation sequence O is in state i at time t (uit = 1) and state j at time
t + 1 (uj(t+1) = 1), thus uijt = uit uj(t+1) = 1. It is quite similar for the remaining
cases. Therefore we use the following distance for the membership uit
d2it = − log P (O, st = i|λ)
(5.2)
From (2.26) page 29, we can show that
P (O, st = i|λ) = αt (i)βt (i)
(5.3)
where αt (i) and βt (i) are computed in (2.24) and (2.25) page 28. Using the maximum
joint probability rule in (5.1), the reestimation algorithm for H-DHMM is as follows
Hard E-Step:
uit =
{
1
if αt (i)βt (i) > αt (k)βt (k) ∀k 6= i
0
otherwise
(5.4)
Ties are broken randomly.
M-Step:
π j = uj1 ,
aij =
T∑
−1
uit uj(t+1)
t=1
T∑
−1
,
uit
t=1
5.2.2
bj (k) =
T
∑
ujt
t=1
s.t. ot =vk
T
∑
(5.5)
ujt
t=1
Hard Continuous HMM
The distance in (3.30) page 57 defined for the FE-CHMM and the FCM-CHMM is
used for the hard continuous HMM (H-CHMM). Using the maximum joint probability
rule in (5.1), the reestimation algorithm for H-CHMM is as follows
5.3 Hard Gaussian Mixture Models
81
Hard E-Step:
ujkt =
{
1 if P (X, st = j, mt = k|λ) > P (X, st = h, mt = l|λ) ∀(h, l) 6= (j, k)
(5.6)
0 otherwise
where ties are broken randomly and P (X, st = j, mt = k|λ) is computed in (3.30).
M-Step:
wjk =
T
∑
ujkt
t=1
K
T ∑
∑
µjk =
,
ujkt
t=1 k=1
5.3
T
∑
ujkt xt
t=1
T
∑
,
Σjk =
T
∑
ujkt (xt − µjk )(xt − µjk )′
t=1
ujkt
t=1
T
∑
ujkt
t=1
(5.7)
Hard Gaussian Mixture Models
In fuzzy Gaussian mixture models, the membership uit denotes the belonging of vector
xt to cluster i represented by a Gaussian distribution. Since 0 ≤ uit ≤ 1, vector xt
is regarded as belonging to a mixture of Gaussian distributions. In hard Gaussian
mixture models (H-GMMs), the membership uit takes only two values 0 and 1. This
means that vector xt belongs to only one Gaussian distribution, or in other words,
Gaussian distributions are not mixed, they are separated by boundaries as in VQ.
Figure 5.6 illustrates a mixture of three Gaussian distributions in fuzzy GMMs and
Figure 5.7 illustrates three separate Gaussian distributions in the H-GMM. With
this interpretation, H-GMMs should be termed hard C-Gaussians models (from the
terminology “hard C-means”) or K-Gaussians models (from the terminology “Kmeans”) [Tran et al. 2000b].
The distance in (3.40) page 59 defined for the FE-GMM is used for the H-GMM.
From the maximum joint probability rule in (5.1), the reestimation algorithm for the
H-GMM is formulated as follows
Hard E-Step:
uit =
{
1
if P (xt , i|λ) > P (xt , k|λ) ∀i 6= k
0
otherwise
(5.8)
where ties are broken randomly and
P (xt , i|λ) = wi N (xt , µi , Σi )
(5.9)
5.3 Hard Gaussian Mixture Models
82
Figure 5.6: A mixture of three Gaussian distributions in the GMM or the fuzzy GMM.
Figure 5.7: A set of three non overlapping Gaussian distributions in the hard GMM.
5.4 Summary and Conclusion
83
wi and N (xt , µi , Σi ) are defined in Section 2.3.5, page 29.
M-Step:
wi =
T
∑
uit
t=1
K
T
∑∑
,
ukt
t=1 k=1
µi =
T
∑
uit xt
t=1
T
∑
,
Σi =
T
∑
uit (xt − µi )(xt − µi )′
t=1
uit
t=1
T
∑
(5.10)
uit
t=1
It can be seen that the above reeestimation algorithm is the most generalised
algorithm of the VQ algorithms mentioned in Section 2.3.6 page 31. Indeed, let Ti be
the number of vectors in cluster i, from (5.8) we obtain
T
∑
uit = Ti
(5.11)
t=1
Replacing (5.11) into (5.10) gives an alternative form for the M-step as follows
M-Step:
wi =
Ti
,
T
µi =
1 ∑
xt ,
Ti xt ∈Ci
Σi =
1 ∑
(xt − µi )(xt − µi )′
Ti xt ∈Ci
(5.12)
Since covariance matrices are also reestimated, this algorithm is more generalised
than the ECVQ algorithm reviewed in (2.36), page 32.
5.4
Summary and Conclusion
Hard models have been presented in this chapter. There are three ways to obtain hard
models: 1) Using the fuzzy entropy algorithm with n ≈ 0; 2) Using the fuzzy C-means
algorithm with m ≈ 1; and 3) Using the nearest prototype rule. The third way has
simpler calculations. We have proposed hard HMMs and hard GMMs, where only
the best state sequence is employed in hard HMMs and a non overlapping Gaussian
distribution set is employed in hard GMMs. Conventional HMMs using the Viterbi
algorithm can be regarded as “pretty” hard HMMs. Hard GMMs are regarded as the
most generalised VQ models from which VQ, extended VQ and entropy-constrained
VQ can be derived. Conventional HMMs using the Viterbi algorithm and VQ models
are widely used, which means that hard models play an important role in speech
and speaker recognition. Relationships between the hard models are summarised in
5.4 Summary and Conclusion
84
Figure 5.8 and experimental results for hard models will be reported in Chapter 6.
Hard Models
❄
❄
Hard
DHMM
λ = {π, A, B}
❄
Hard
CHMM
❄
Hard
GMM
❄
ECVQ
λ = {µ, w}
λ = {π, A, B = {w, µ, Σ}}
λ = {w, µ, Σ}
❄
Extended
VQ
λ = {µ, Σ}
❄
VQ
λ = {µ}
Figure 5.8: Relationships between hard models
Chapter 6
A Fuzzy Approach to Speaker
Verification
Fuzzy approaches have been in Chapters 3 and 4 to train speech and speaker models.
This chapter proposes an alternative fuzzy approach to speaker verification. For an input
utterance and a claimed identity, most of the current methods compute a claimed speaker’s
score, which is the ratio of the claimed speaker’s and the impostors’ likelihood functions,
and compare this score with a given threshold to accept or reject this speaker. Considering
the speaker verification problem based on fuzzy set theory, the claimed speaker’s score is
viewed as the fuzzy membership function of the input utterance in the claimed speaker’s
fuzzy set of utterances. Fuzzy entropy and fuzzy C-means membership functions are
proposed as fuzzy membership scores, which are the ratios of f unctions of the claimed
speaker’s and impostors’ likelihood functions. So a likelihood transformation is considered
to relate current likelihood and fuzzy membership scores. Based on this consideration,
more fuzzy scores are proposed to compare with current methods. Furthermore, the
noise clustering method supplies a very effective modification to all methods, which can
overcome some of the problems of ratio-type scores and greatly reduce the false acceptance
rate.
Some basic concepts of speaker verification relevant to this chapter have been reviewed
in Section 2.2.2 page 18. A more detailed analysis of a speaker verification system and of
current normalisation methods are provided in this chapter.
85
6.1 A Speaker Verification System
6.1
86
A Speaker Verification System
Let λ0 be the claimed speaker model and λ be a model representing all other possible
speakers, i.e. impostors 1 . For a given input utterance X and a claimed identity,
the choice is between the hypothesis H0 : X is from the claimed speaker λ0 , and
the alternative hypothesis H1 : X is from the impostors λ. A claimed speaker’s score
S(X) is computed to reject or accept the speaker claim. Depending on the meaning of
the score, we can distinguish between similarity scores L(X) and dissimilarity scores
D(X) between X and λ0 . Likelihood scores are included in L(X) and VQ distortion
scores are included in D(X). These scores satisfy the following rules
L(X)
{
> θL
accept
≤ θL
reject
D(X)
{
< θD
accept
and
where θL and θD
(6.1)
(6.2)
≥ θD
reject
are the decision thresholds. Figure 6.1 presents a typical speaker
verification system.
Claimed Identity
Input Speech
✲
Speech
Processing
✗
Speaker
✲
Models
✖
✲
✔✗
Threshold
✕✖
❄
Score
Determination
✲
❄
Hypothesis
Testing
✔
✕
Accept
✲
or Reject
Figure 6.1: A Typical Speaker Verification System
Assuming that speaker models are given, this chapter solves the problem of finding
the effective scores of the claimed speaker such that the equal error rate (EER)
mentioned in Section 2.2.2 page 18 is minimised. We define equivalent scores as
scores giving the same equal error rate (EER) even though they possibly use different
thresholds. For example, Sa (X) = P (X|λ0 ) and Sb (X) = log P (X|λ0 ) are equivalent
scores, but use thresholds θ and log θ, respectively .
1
In this context, we use the term imposter for any speaker other than the claimed speaker and
without any implication of fraudulent intent or active voice manipulation
6.2 Current Normalisation Methods
6.2
87
Current Normalisation Methods
The simplest method of scoring is to use the absolute likelihood score (unnormalised
score) of an utterance. In the log domain, that is
L0 (X) = log P (X|λ0 )
(6.3)
This score is strongly influenced by variations in the test utterance such as the
speaker’s vocal characteristics, the linguistic content and the speech quality. It is
very difficult to set a common decision threshold to be used over different tests. This
drawback is overcome to some extent by using normalisation. According to the Bayes
decision rule for minimum risk, a likelihood ratio
L1 (X) =
P (X|λ0 )
P (X|λ)
(6.4)
is used. This ratio produces a relative score which is less volatile to non-speaker
utterance variations [Reynolds 1995a]. In the log domain, (6.4) is equivalent to the
following normalisation technique, proposed by Higgins et al., 1991
L1 (X) = log P (X|λ0 ) − log P (X|λ)
(6.5)
The term log P (X|λ) in (6.5) is called the normalisation term and requires calculation
of all impostors’ likelihood functions. An appoximation of this method is to use only
the closest impostor model for calculating the normalisation term [Liu et al. 1996]
L2 (X) = log P (X|λ0 ) − max log P (X|λ)
λ6=λ0
(6.6)
However when the size of the population increases, both of these normalisation methods L1 (X) and L2 (X) are unrealistic since all impostors’ likelihood functions must
be calculated for determining the value of the normalisation term. Therefore a subset of the impostor models is used. This subset consists of B “background” speaker
models λi , i = 1, . . . , B and is representative of the population close to the claimed
speaker, i.e. the “cohort speaker” set [Rosenberg et al. 1992]. Depending on the approximation of P (X|λ) in (6.4) by the likelihood functions of the background model
set P(X|λi), i = 1, . . . , B, we obtain different normalisation methods. One approximation [Reynolds 1995a] uses the arithmetic mean (average) of the likelihood functions of the B background speaker models. The corresponding score for this approximation is
L_3(X) = \log P(X|\lambda_0) - \log\left[\frac{1}{B}\sum_{i=1}^{B} P(X|\lambda_i)\right]    (6.7)
If the claimed speaker’s likelihood function is also included in the above arithmetic
mean, we obtain the normalisation method based on the a posteriori probability
[Matsui and Furui 1993]
L_4(X) = \log P(X|\lambda_0) - \log\sum_{i=0}^{B} P(X|\lambda_i)    (6.8)
Note that i = 0 in (6.8) denotes the claimed speaker model and the constant term
1/B is accounted for in the decision threshold. If the geometric mean is used instead
of the arithmetic mean to approximate P (X|λ), we obtain the normalisation method
[Liu et al. 1996] as follows
L_5(X) = \log P(X|\lambda_0) - \frac{1}{B}\sum_{i=1}^{B} \log P(X|\lambda_i)    (6.9)
Normalisation methods can also be applied to the likelihood function of each vector
xt , t = 1, . . . , T in X, and such methods are called frame level normalisation methods.
Such a method has been proposed as follows [Markov and Nakagawa 1998b]
L_6(X) = \sum_{t=1}^{T}\left[\log P(x_t|\lambda_0) - \log\sum_{i=1}^{B} P(x_t|\lambda_i)\right]    (6.10)
For VQ-based speaker verification systems, the following score is widely used
D_1(X) = D(X, \lambda_0) - \frac{1}{B}\sum_{i=1}^{B} D(X, \lambda_i)    (6.11)
where
D(X, \lambda_i) = \sum_{t=1}^{T} d_{kt}^{(i)}    (6.12)

where d_{kt}^{(i)} is the VQ distance between vector x_t ∈ X and the nearest codevector k in
the codebook λi . It can be seen that this score is equivalent to L5 (X) if we replace
D(X, λi ) in (6.11) by [− log P (X|λi )], i = 0, 1, . . . , B.
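As a rough illustration of these normalisation scores, the sketch below computes L0(X), L3(X), L5(X) and D1(X) from per-utterance log-likelihoods and VQ distortions. It assumes the model likelihoods have already been evaluated; the function names are illustrative and not taken from the thesis software.

```python
import numpy as np

def unnormalised_score(ll_claim):
    """L0(X) = log P(X|lambda_0), equation (6.3)."""
    return ll_claim

def arithmetic_mean_score(ll_claim, ll_background):
    """L3(X) = log P(X|lambda_0) - log[(1/B) sum_i P(X|lambda_i)], equation (6.7)."""
    b = np.asarray(ll_background, dtype=float)
    # log of the arithmetic mean of the background likelihoods, computed in the log domain
    log_mean = np.logaddexp.reduce(b) - np.log(len(b))
    return ll_claim - log_mean

def geometric_mean_score(ll_claim, ll_background):
    """L5(X) = log P(X|lambda_0) - (1/B) sum_i log P(X|lambda_i), equation (6.9)."""
    return ll_claim - float(np.mean(ll_background))

def vq_score(d_claim, d_background):
    """D1(X) = D(X, lambda_0) - (1/B) sum_i D(X, lambda_i), equation (6.11)."""
    return d_claim - float(np.mean(d_background))

# Example: a claimed-speaker log-likelihood against three background models
print(arithmetic_mean_score(-50.2, [-55.1, -57.3, -60.0]))
```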
6.3 Proposed Normalisation Methods
Consider the speaker verification problem in fuzzy set theory. To accept or reject
the claimed speaker, the task is to make a decision whether the input utterance X is
either from the claimed speaker λ0 or from the set of impostors λ, based on comparing
the score for X and a decision threshold θ. Thus the space of input utterances can
be considered as consisting of two fuzzy sets: C for the claimed speaker and I for
impostors. Degrees of belonging of X to these fuzzy sets are denoted by the fuzzy
membership functions of X, where the fuzzy membership of X in C can be regarded
as a claimed speaker’s score satisfying the rule in (6.1). Making a (hard) decision is
thus a defuzzification process where X is completely in C if the fuzzy membership of
X in C is sufficiently high, i.e. greater than the threshold θ.
In theory, there are many ways to define the fuzzy membership function, therefore
it can be said that this fuzzy approach proposes more general scores than the current
likelihood ratio scores for speaker verification. These are termed fuzzy membership
scores, which can denote the belonging of X to the claimed speaker. Their values
need not be scaled into the interval [0, 1] because of the above-mentioned equivalence
of scores. Based on this discussion, all of the above-mentioned likelihood-based scores
can also be viewed as fuzzy membership scores.
The next task is to find effective fuzzy membership scores. To do this, we need
to know what the shortcoming of likelihood ratio scores is. The main problem comes
from the relative nature of a ratio. Indeed, assuming L3 (X) in (6.7) is used, consider
the two equal likelihood ratios in the following example
L_3(X_1) = \frac{0.07}{0.03} = L_3(X_2) = \frac{0.0000007}{0.0000003}    (6.13)
where both X1 and X2 are accepted, assuming the given threshold is 2. The first ratio can lead to a correct decision that the input utterance X1 is from the claimed speaker (true acceptance). However, it is improbable that X2 is from the claimed speaker or from any of the background speakers since both likelihood values in the second ratio are
very low. X2 is probably from an impostor and thus a false acceptance can occur on
the basis of the likelihood ratio. This is a similar problem to that addressed by Chen
et al. [1994].
Two fuzzy membership scores proposed in previous chapters can overcome this problem.

Utterance | P(X|λ0) | P(X|λ1) | P(X|λ2) | P(X|λ3)
X1c | 3.02 × 10−4 | 2.23 × 10−5 | 4.16 × 10−5 | 1.52 × 10−6
X2c | 1.11 × 10−3 | 1.40 × 10−4 | 1.42 × 10−4 | 2.03 × 10−5
X3i | 7.91 × 10−6 | 1.17 × 10−7 | 1.24 × 10−6 | 7.23 × 10−8
X4i | 1.20 × 10−4 | 7.16 × 10−6 | 6.64 × 10−8 | 6.51 × 10−8

Table 6.1: The likelihood values for 4 input utterances X1−X4 against the claimed speaker λ0 and 3 impostors λ1−λ3, where X1c, X2c are from the claimed speaker and X3i, X4i are from impostors

The fuzzy entropy (FE) score L7(X) and the fuzzy C-means (FCM) score
L8 (X) are rewritten as follows
L_7(X) = \frac{[P(X|\lambda_0)]^{1/n}}{\sum_{i=0}^{B}[P(X|\lambda_i)]^{1/n}}    (6.14)

L_8(X) = \frac{[-\log P(X|\lambda_0)]^{1/(1-m)}}{\sum_{i=0}^{B}[-\log P(X|\lambda_i)]^{1/(1-m)}}    (6.15)
where m > 1, n > 0, and the impostors’ fuzzy set I is approximately represented by
B background speakers’ fuzzy subsets. Note that in the log domain, as n = 1, the
score L7 (X) after taking the logarithm reduces to the score based on the a posteriori
probability L4 (X) in (6.8). Experimental results in the next chapter show low EERs
for these effective scores.
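The two scores can be evaluated directly from log-likelihoods; the following is a minimal sketch (not the thesis implementation), working in the log domain so that the powers [P]^{1/n} do not underflow. The final lines reproduce the first column of Table 6.2 for m = 2.

```python
import numpy as np

def fe_score(ll_claim, ll_background, n=1.0):
    """L7(X): FE membership score, equation (6.14).
    Arguments are log-likelihoods; index 0 is the claimed speaker."""
    ll = np.array([ll_claim] + list(ll_background), dtype=float)
    powered = ll / n                         # log of [P(X|lambda_i)]^(1/n)
    log_den = np.logaddexp.reduce(powered)   # log of the denominator sum
    return float(np.exp(powered[0] - log_den))

def fcm_score(ll_claim, ll_background, m=2.0):
    """L8(X): FCM membership score, equation (6.15)."""
    ll = np.array([ll_claim] + list(ll_background), dtype=float)
    terms = (-ll) ** (1.0 / (1.0 - m))       # [-log P(X|lambda_i)]^(1/(1-m))
    return float(terms[0] / terms.sum())

# Reproducing the first row of Table 6.2 (utterance X1c, m = 2)
lls = np.log([3.02e-4, 2.23e-5, 4.16e-5, 1.52e-6])
print(round(fcm_score(lls[0], lls[1:], m=2.0), 3))   # approximately 0.316
```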
To illustrate the effectiveness of these scores, for simplicity, we compare L3 (X)
with L8 (X) in a numerical example. Table 6.1 presents the likelihood values for 4
input utterances X1 − X4 against the claimed speaker λ0 and 3 impostors λ1 − λ3 ,
where X1c , X2c are from the claimed speaker and X3i , X4i are from impostors (these are
real values from experiments on the TI46 database). Given a score L(X), the EER
= 0 in the case that all the scores for X1c and X2c are greater than all those for X3i
and X4i .
Table 6.2 shows scores in (6.7) and (6.15) computed using these likelihood values
where m = 2 is applied to (6.15). It can be seen that with the score L3 (X), we always
have the EER ≠ 0 since scores for X3i and X4i are higher than those for X1c and X2c.
Score | X1c | X2c | X3i | X4i
L3(X) | 2.628 | 2.399 | 2.810 | 3.899
L8(X) | 0.316 | 0.316 | 0.302 | 0.350

Table 6.2: Scores of 4 utterances using L3(X) and L8(X)
However using L8 (X), the EER is reduced since the score for X3i is lower than those
for X1c and X2c .
A more robust method proposed in this chapter is to use fuzzy membership scores
based on the noise clustering (NC) method. Indeed, this fuzzy approach can reduce
the false acceptance error by forcing the membership value of the input utterance X
to become as small as possible if X is really from impostors, not from the claimed
speaker or background speakers. This fuzzy approach is simple but very effective: a suitable constant value ε > 0 (similar to the constant distance δ in the NC method) is added to the denominators of the ratios, i.e. to the normalisation terms, as follows

Change \sum(\cdot) in the normalisation term to \sum(\cdot) + \epsilon    (6.16)
Note that the NC method can be applied not only to fuzzy membership scores but also to likelihood ratio scores. For example, NC-based versions of L3(X), L7(X) and L8(X) are as follows
L_{3nc}(X) = \log P(X|\lambda_0) - \log\left[\frac{1}{B}\sum_{i=1}^{B} P(X|\lambda_i) + \epsilon_3\right]    (6.17)

L_{7nc}(X) = \frac{[P(X|\lambda_0)]^{1/n}}{\sum_{i=0}^{B}[P(X|\lambda_i)]^{1/n} + \epsilon_7}    (6.18)

L_{8nc}(X) = \frac{[-\log P(X|\lambda_0)]^{1/(1-m)}}{\sum_{i=0}^{B}[-\log P(X|\lambda_i)]^{1/(1-m)} + \epsilon_8}    (6.19)
where the index “nc” means “noise clustering”. For illustration, applying NC-based
Score | X1c | X2c | X3i | X4i
L3nc(X) | 2.251 | 1.710 | −2.547 | 0.158
L8nc(X) | 0.139 | 0.152 | 0.109 | 0.136

Table 6.3: Scores of 4 utterances using L3nc(X) and L8nc(X)
scores for the first example in (6.13) with ɛ3 = 0.01 gives
L_{3nc}(X_1^c) = \frac{0.07}{0.03 + 0.01} = 1.75 \;>\; L_{3nc}(X_2^c) = \frac{0.0000007}{0.0000003 + 0.01} = 0.000069    (6.20)
For the second example (the likelihood values in Table 6.1), with ε3 = 10−4 and ε8 = 0.5, the NC-based scores are shown in Table 6.3. We can see that with thresholds θ3 = 1.0 and θ8 = 0.137, the
EER is 0 for both of these NC-based scores.
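Since the NC modification in (6.16) only adds a constant to the normalisation term, it is a one-line change in code. A sketch, assuming per-utterance likelihood values are available; the names eps3 and eps8 simply mirror ε3 and ε8, and the last two lines reproduce the ratio example in (6.20).

```python
import numpy as np

def l3nc(p_claim, p_background, eps3):
    """L3nc(X) = log P(X|lambda_0) - log[(1/B) sum P(X|lambda_i) + eps3], equation (6.17)."""
    return np.log(p_claim) - np.log(np.mean(p_background) + eps3)

def l8nc(p_claim, p_background, m, eps8):
    """L8nc(X): NC version of the FCM membership score, equation (6.19)."""
    p = np.array([p_claim] + list(p_background), dtype=float)
    terms = (-np.log(p)) ** (1.0 / (1.0 - m))
    return float(terms[0] / (terms.sum() + eps8))

# The first example in (6.20), written as plain ratios for readability
print(0.07 / (0.03 + 0.01))             # 1.75   (claimed speaker, kept above threshold)
print(0.0000007 / (0.0000003 + 0.01))   # ~7e-5  (impostor, pushed towards zero)
```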
We have illustrated the effectiveness of fuzzy membership scores by way of some
numerical examples. To be more rigorous, this approach should be considered in
theoretical terms. Let us consider the expressions of the likelihood ratio scores and the fuzzy membership scores. The former are ratios of likelihood functions whereas the latter are ratios of functions of likelihood functions. Indeed, denoting P as the likelihood function, we can see that the FE score employs the function f(P) = P^{1/n} and the FCM score employs f(P) = (−log P)^{1/(m−1)}. In other words, likelihood
ratio scores are transformed to FE and FCM scores by using these functions. Such a
transformation is considered in the next section.
6.4 The Likelihood Transformation
Consider a transformation T : P → T (P ), where P is the likelihood function and
T (P ) is a certain continuous function of P . For example, T [P (X|λ0 )] = log P (X|λ0 ).
Applying this transformation to the likelihood ratio score L1 (X) gives an alternative
score S(X)
S(X) = \frac{T[P(X|\lambda_0)]}{T[P(X|\lambda)]}    (6.21)
The difference between S(X) and L1 (X) is
S(X) - L_1(X) = \frac{P(X|\lambda_0)}{T[P(X|\lambda)]}\left[\frac{T[P(X|\lambda_0)]}{P(X|\lambda_0)} - \frac{T[P(X|\lambda)]}{P(X|\lambda)}\right]    (6.22)
Assume that in (6.22) T(P)/P is an increasing function, i.e. if P1/P2 > 1, or P1 > P2, then T(P1)/P1 > T(P2)/P2, and that T(P) is negative for 0 ≤ P ≤ 1. The two following cases can occur:

• P(X|λ0)/P(X|λ) > 1: the expression on the right-hand side of (6.22) is negative, thus

\frac{T[P(X|\lambda_0)]}{T[P(X|\lambda)]} < \frac{P(X|\lambda_0)}{P(X|\lambda)} \;\Leftrightarrow\; S(X) < L_1(X)    (6.23)

• P(X|λ0)/P(X|λ) ≤ 1: the expression on the right-hand side of (6.22) is non-negative, thus

\frac{T[P(X|\lambda_0)]}{T[P(X|\lambda)]} \ge \frac{P(X|\lambda_0)}{P(X|\lambda)} \;\Leftrightarrow\; S(X) \ge L_1(X)    (6.24)
In the first case, the transformation moves likelihood ratios greater than 1 to transformed likelihood ratios less than 1; in the second case, vice versa, ratios less than 1 are moved to transformed ratios greater than 1. Figure 6.2 illustrates this transformation.

Figure 6.2: The transformation T where T(P)/P increases and T(P) is non-positive for 0 ≤ P ≤ 1: values of 4 ratios at A, B, C, and D are moved to those at A', B', C', and D'

This transformation is a nonlinear mapping since distances between different ratios and their transformations are different. For example, distances AA', BB', CC' and DD' in Figure 6.2 are different. If T(P) = log P, a numerical example for these ratios is as follows
A: \frac{0.011}{0.009} \simeq 1.222 \;\rightarrow\; A': \frac{\log 0.011}{\log 0.009} \simeq 0.957
B: \frac{0.0009}{0.0008} \simeq 1.125 \;\rightarrow\; B': \frac{\log 0.0009}{\log 0.0008} \simeq 0.9834
C: \frac{0.01}{0.009} \simeq 1.111 \;\rightarrow\; C': \frac{\log 0.01}{\log 0.009} \simeq 0.978
D: \frac{0.044}{0.051} \simeq 0.863 \;\rightarrow\; D': \frac{\log 0.044}{\log 0.051} \simeq 1.049
    (6.25)
Comparisons of the order of the 4 points A, B, C, D with the order of A', B', C', D', and of the likelihood values at point B (0.0009 and 0.0008) with other values, show
that this transformation T can “recognise” ratios of small likelihood values and thus
leads to a reduction of the false acceptance error rate (and the EER also) as shown
in the previous section’s examples. Figure 6.3 shows histograms of a female speaker
labelled f7 in the TI46 corpus using 16-mixture GMMs, where the score L_3(X) = \log\left[P(X|\lambda_0) \middle/ \frac{1}{B}\sum_{i=1}^{B} P(X|\lambda_i)\right] in Figure 6.3a is transformed by T(P) = \log P to the score D_3(X) = \log P(X|\lambda_0) \,\Big/\, \log\left\{\frac{1}{B}\sum_{i=1}^{B} P(X|\lambda_i)\right\} in Figure 6.3b.
Figure 6.3: Histograms of speaker f7 in the TI46 using 16-mixture GMMs. The EER
is 6.67% for Fig. 6.3a and is 5.90% for Fig. 6.3b.
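The values in (6.25) can be checked with a few lines of code; the snippet below is only a numerical illustration of the transformation T(P) = log P applied to the four ratios A-D.

```python
import math

pairs = {"A": (0.011, 0.009), "B": (0.0009, 0.0008),
         "C": (0.01, 0.009),  "D": (0.044, 0.051)}

for name, (p1, p2) in pairs.items():
    ratio = p1 / p2                          # likelihood ratio
    t_ratio = math.log(p1) / math.log(p2)    # transformed ratio T(P1)/T(P2)
    print(f"{name}: {ratio:.3f} -> {t_ratio:.3f}")
# A: 1.222 -> 0.957, B: 1.125 -> 0.983, C: 1.111 -> 0.978, D: 0.863 -> 1.049
```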
The cases where T(P)/P is a decreasing function or T(P) is a positive function can be treated similarly. In general, there may not exist a function T(P) for the transformation T such that the EER is 0. However, a function T(P) that reduces the EER may exist. For convenience in calculating products of probabilities, the
function T (P ) should be related to the logarithm function. This is also convenient
for applying these methods to VQ-based speaker verification since the distance in VQ
can be defined as the negative logarithm of the corresponding likelihood function.
Based on this transformation and to compare with current likelihood ratio scores,
we also propose more fuzzy scores for the decision rule in (6.2) as follows
D_2(X) = \frac{\log P(X|\lambda_0)}{\max_{\lambda \neq \lambda_0} \log P(X|\lambda)}    (6.26)

D_3(X) = \frac{\log P(X|\lambda_0)}{\log\left\{\frac{1}{B}\sum_{i=1}^{B} P(X|\lambda_i)\right\}}    (6.27)

D_4(X) = \frac{\log P(X|\lambda_0)}{\log\left\{\frac{1}{B}\sum_{i=0}^{B} P(X|\lambda_i)\right\}}    (6.28)

D_5(X) = \frac{\log P(X|\lambda_0)}{\frac{1}{B}\sum_{i=1}^{B} \log P(X|\lambda_i)}    (6.29)

D_6(X) = \frac{\arctan[\log P(X|\lambda_0)]}{\frac{1}{B}\sum_{i=1}^{B} \arctan[\log P(X|\lambda_i)]}    (6.30)
where, in D2(X), . . . , D5(X), T[P(X|λ)] of the impostors is approximated by the transformed likelihood functions of the background speakers in the same manner as for L2(X), . . . , L5(X). Note that the factor 1/B in D4(X) is not accounted for in the decision threshold, unlike for L4(X) in (6.8). The effectiveness of the score in (6.30) using the arctan function will be shown in the next chapter. It should again be noted
that the NC-based versions of these scores are derived following the method in (6.16).
To apply these scores to VQ-based speaker verification systems, likelihood functions should be changed to VQ distortions as shown for the score D1 (X) in (6.11)
and (6.12).
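A sketch of three of the proposed transformed scores, (6.27), (6.29) and (6.30), computed from per-utterance log-likelihoods; the function names are illustrative, and the NC-based versions follow by adding a constant to each denominator as in (6.16).

```python
import numpy as np

def d3(ll_claim, ll_background):
    """D3(X) = log P(X|lambda_0) / log[(1/B) sum P(X|lambda_i)], equation (6.27)."""
    b = np.asarray(ll_background, dtype=float)
    log_mean = np.logaddexp.reduce(b) - np.log(len(b))   # log of the mean likelihood
    return ll_claim / log_mean

def d5(ll_claim, ll_background):
    """D5(X) = log P(X|lambda_0) / [(1/B) sum log P(X|lambda_i)], equation (6.29)."""
    return ll_claim / float(np.mean(ll_background))

def d6(ll_claim, ll_background):
    """D6(X): arctan-transformed score, equation (6.30)."""
    return float(np.arctan(ll_claim) / np.mean(np.arctan(ll_background)))

# Example with a claimed-speaker log-likelihood and three background log-likelihoods
print(d6(-50.2, [-55.1, -57.3, -60.0]))
```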
6.5 Summary and Conclusion
A fuzzy approach to speaker verification has been proposed in this chapter. Using
fuzzy set theory, fuzzy membership scores are proposed as scores more general than
current likelihood scores. The likelihood transformation and 7 new scores have been
proposed. This fuzzy approach also leads to a noise clustering-based version for all
scores, which improves speaker verification performance markedly. Using the arctan
function in computing the score illustrates a theoretical extension for normalisation
methods, where not only the logarithm function but also other functions can be used.
Chapter 7
Evaluation Experiments and Results
This chapter presents experiments performed to evaluate proposed models for speech and
speaker recognition as well as proposed normalisation methods for speaker verification.
Three speech corpora were used in the evaluation experiments: TI46, ANDOSL and YOHO. Isolated word recognition experiments were performed on the word sets—
E set, 10-digit set, 10-command set and 46-word set—taken from the TI46 corpus using
conventional, fuzzy and hard hidden Markov models. Speaker identification and verification experiments were performed on 16 speakers of TI46, 108 speakers of ANDOSL and
138 speakers of YOHO using conventional, fuzzy and hard Gaussian mixture models as
well as vector quantisation. Experiments demonstrate that fuzzy models and their noise
clustering versions outperform conventional models. Hard hidden Markov models also obtained good results.
7.1 Database Description
7.1.1 The TI46 Database
The TI46 corpus was designed and collected at Texas Instruments (TI). The speech
was produced by 16 speakers, 8 females and 8 males, labelled f1-f8 and m1-m8 respectively. The corpus consists of two vocabularies—TI-20 and TI-alphabet. The TI-20 vocabulary
contains the ten digits from 0 to 9 and ten command words: enter, erase, go, help,
no, rubout, repeat, stop, start, and yes. The TI-alphabet vocabulary contains the
names of the 26 letters of the alphabet from a to z. For each vocabulary item, each
speaker produced 10 tokens in a single training session and another two tokens in each
of 8 testing sessions. The words in the TI-20 vocabulary are highly discriminable,
with the majority of confusions occurring between go and no. By comparison, the
TI-alphabet is a much more difficult vocabulary since it contains several confusable
subsets of letters, such as the E-set {b, c, d, e, g, p, t, v, z} and the A-set {a, j, k}. The
TI-20 vocabulary is a good choice because it has been used for other tests and therefore can serve as a standard benchmark of the performance [Syrdal et al. 1995]. The
corpus was sampled at 12500 samples per second and 12 bits per sample.
7.1.2 The ANDOSL Database
The Australian National Database of Spoken Language (ANDOSL) [Millar et al. 1994]
corpus comprises carefully balanced material for Australian speakers, both Australian-born and overseas-born migrants. The aim was to represent as many significant
speaker groups within the Australian population as possible. Current holdings are
divided into those from native speakers of Australian English (born and fully educated in Australia) and those from non-native speakers of Australian English (first
generation migrants having a non-English native language). A subset used for speaker
recognition experiments in this thesis consists of 108 native speakers, divided into 36
speakers of General Australian English, 36 speakers of Broad Australian English, and
36 speakers of Cultivated Australian English comprising 6 speakers of each gender
in each of three age ranges (18-30, 31-45 and 46+). So there is a total of 18 groups of 6 speakers labelled "ijk", where i denotes f (female) or m (male), j denotes y
(young) or m (medium) or e (elder), and k denotes g (general) or b (broad) or c
(cultivated). For example, the group fyg contains 6 female young general Australian
English speakers. Each speaker contributed 200 phonetically rich sentences in a single session. The average duration of each sentence is approximately 4 seconds. The
speech was recorded in an anechoic chamber using a B&K 4155 microphone and a
DSC-2230 VU-meter as a preamplifier, and was digitised directly to computer disk using a DSP32C analog-to-digital converter mounted in a PC. All waveforms were sampled at 20 kHz with 16 bits per sample. For processing as telephone speech, all waveforms were converted from 20 kHz to 8 kHz bandwidth. Low and high pass
cut-offs were set to 300 Hz and 3400 Hz.
7.1.3 The YOHO Database
The YOHO corpus was collected by ITT under a US government contract and was
designed for speaker verification systems in office environments with limited vocabulary. There are 138 speakers, 108 males and 30 females. The vocabulary consists
of 56 two-digit numbers ranging from 21 to 97, pronounced as "twenty-one", ..., "ninety-seven", and spoken continuously in sets of three, for example "36-45-89", in each
utterance. There are four enrolment sessions per speaker, numbered 1 through 4,
and each session contains 24 utterances. There are also ten verification sessions,
numbered 1 through 10, and each session contains 4 utterances. All waveforms are
low-pass filtered at 3.8 kHz and sampled at 8 kHz.
7.2 Speech Processing
For the TI46 corpus, the data were processed in 20.48 ms frames (256 samples) at
a frame rate of 10 ms. Frames were Hamming windowed and preemphasised with
m = 0.9. For each frame, 46 mel-spectral bands of a width of 110 mel and 20 mel-frequency cepstral coefficients (MFCC) were determined [Wagner 1996] and a feature
vector with a dimension of 20 was generated for individual frames.
For the ANDOSL and YOHO corpora, speech processing was performed using
HTK V2.0 [Woodland 1997], a toolkit for building HMMs. The data were processed
in 32 ms frames at a frame rate of 10 ms. Frames were Hamming windowed and
preemphasised with m = 0.97. The basic feature set consisted of 12th-order MFCCs
and the normalised short-time energy, augmented by the corresponding delta MFCCs
to form a final feature vector of dimension 26 for individual frames.
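For readers who wish to reproduce a comparable front end without HTK, a rough sketch using the librosa library is given below. The frame length, frame rate and preemphasis coefficient follow the description above; using the zeroth cepstral coefficient in place of the normalised short-time energy, and librosa's default filterbank, are assumptions rather than an exact replication of the HTK V2.0 configuration.

```python
import numpy as np
import librosa

def mfcc_features(wav_path, sr=8000):
    """Rough ANDOSL/YOHO-style front end: 32 ms Hamming frames, 10 ms shift,
    13 MFCCs (c0 standing in for the energy term) plus deltas = 26 coefficients."""
    y, sr = librosa.load(wav_path, sr=sr)
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])              # preemphasis, m = 0.97
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.032 * sr),       # 32 ms frames
                                hop_length=int(0.010 * sr),  # 10 ms frame rate
                                window="hamming")
    delta = librosa.feature.delta(mfcc)                      # corresponding delta MFCCs
    return np.vstack([mfcc, delta]).T                        # (frames, 26) feature matrix
```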
7.3 Algorithmic Issues
7.3.1 Initialisation
It has been shown in the literature that different initialisation methods lead to no significant difference in isolated-word recognition and speaker identification performance [Rabiner et al. 1983, Reynolds 1995b]. HMMs, GMMs and VQ codebooks were therefore initialised as follows:
• HMMs and fuzzy HMMs: Widely-used HMMs for training are left-to-right
HMMs as defined in (2.14), i.e. state sequences begin in state 1 and end in
state N, therefore the discrete HMM parameter set in our experiments was
initialised as follows
\pi_1 = 1, \qquad \pi_i = 0, \;\; 2 \le i \le N
a_{ij} = 0, \;\; 1 \le i \le N, \;\; j < i \text{ or } j > i+1
a_{ij} = 0.5, \;\; 1 \le i \le N, \;\; i \le j \le i+1
b_j(k) = 1/K, \;\; 1 \le j \le N, \;\; 1 \le k \le K
    (7.1)
Fuzzy membership functions were initialised with essentially random choices.
Gaussian parameters in continuous HMMs were initialised as those in GMMs.
• GMMs and fuzzy GMMs: mixture weights, mean vectors, covariance matrices
and fuzzy membership functions were initialised with essentially random choices.
Covariance matrices are diagonal, i.e. [Σ_k]_{ii} = σ_k^2 and [Σ_k]_{ij} = 0 if i ≠ j, where σ_k^2, 1 ≤ k ≤ K, are variances.
• VQ and fuzzy VQ: only fuzzy membership functions in fuzzy VQ models were
randomly initialised. VQ models were trained by the LBG algorithm—a widely
used version of the K-means algorithm [Linde et al. 1980], where starting from
the codebook size of 1, a binary split procedure is performed to double the
codebook size at the end of several iterations.
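The binary-split LBG training mentioned in the last item can be sketched as follows; the split perturbation factor and the number of K-means iterations per stage are illustrative choices, not the exact values used in the experiments, and the target codebook size is assumed to be a power of two.

```python
import numpy as np

def lbg_codebook(X, target_size=16, n_iter=10, eps=1e-3):
    """LBG training: start from one codevector and binary-split until the target
    size is reached, running a few K-means iterations after each split
    (a sketch of the algorithm of [Linde et al. 1980])."""
    X = np.asarray(X, dtype=float)
    codebook = X.mean(axis=0, keepdims=True)           # codebook of size 1
    while len(codebook) < target_size:
        codebook = np.vstack([codebook * (1 + eps),    # binary split: perturb each
                              codebook * (1 - eps)])   # codevector in two directions
        for _ in range(n_iter):                        # K-means refinement
            d = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            nearest = d.argmin(axis=1)
            for k in range(len(codebook)):
                if np.any(nearest == k):
                    codebook[k] = X[nearest == k].mean(axis=0)
    return codebook
```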
7.3.2 Constraints on parameters during training
• HMMs and fuzzy HMMs: If the B matrix is left completely unconstrained, a finite training sequence may result in some bj(k) = 0; the probability of any test sequence containing such an observation is then equal to 0, and hence a recognition error must occur. This problem was handled by using post-estimation constraints on the bj(k)'s of the form bj(k) ≥ ε, where ε is a suitably chosen threshold value [Rabiner et al. 1983].
The number of states N = 6 was chosen for discrete HMMs based on an experiment considering the recognition error for different numbers of states, as shown in Figure 7.1.

Figure 7.1: Isolated word recognition error (%) versus the number of states N for the digit-set vocabulary, using left-to-right DHMMs, codebook size K = 16, TI46 database
• GMMs and fuzzy GMMs: Similarly, a variance limiting constraint was applied to all GMMs using diagonal covariance matrices. This constraint places a minimum variance value σ_min^2 on the elements of all variance vectors in the GMM, that is, σ_i^2 = σ_min^2 if σ_i^2 ≤ σ_min^2 [Reynolds 1995b]. In our experiments, ε = 10^{-5} and σ_min^2 = 10^{-2}. The chosen numbers of mixtures were 16, 32, 64, and 128, to allow comparison with VQ models trained using the LBG algorithm.
• VQ and fuzzy VQ: Codebook sizes of 16, 32, 64, and 128 were chosen for all
experiments.
• Fuzzy parameters: Choosing the appropriate values of the degree of fuzziness
m and the degree of fuzzy entropy n was based on considering recognition error
for each database and each model. For example, in speaker identification using FCM-VQ models and the TI46 database, an experiment was carried out to examine the identification error for different values of the degree of fuzziness m, with the codebook size fixed. Results are shown in Figure 7.2.

Figure 7.2: Speaker Identification Error (%) versus the degree of fuzziness m using FCM-VQ speaker models, codebook size K = 16, TI46 corpus

Therefore the value m = 1.2 was chosen for FCM-VQ models using the TI46 database.
In general, our experiments showed that the degree of fuzziness m depends on
models and weakly on speech databases. Suitable values were m = 1.2 for
FCM-VQ, m = 1.05 for FCM-GMMs and m = 1.2 for FCM-HMMs.
Similarly, suitable values for the degree of fuzzy entropy n were n = 1 for FE-VQ, n = 1.2 for FE-GMMs and n = 2.5 for FE-HMMs. Figure 7.3 shows the word recognition error versus the degree of fuzzy entropy n.
7.4 Isolated Word Recognition
The first procedure in building an isolated word recognition system is training an HMM for each word. The speech signals for training HMMs were converted into
feature vector sequences as in (2.1) using the LPC analysis mentioned in Section
2.1.3. For training discrete models, these vector sequences were used to generate a
VQ codebook by means of the LBG algorithm and were encoded into observation
Figure 7.3: Isolated word recognition error (%) versus the degree of fuzzy entropy n
for the E-set vocabulary, using 6-state left-to-right FE-DHMMs, codebook size of 16,
TI46 corpus
sequences by this VQ codebook. Observation sequences were used to train conventional DHMMs, FE-DHMMs, NC-FE-DHMMs, FCM-DHMMs, NC-FCM-DHMMs
and H-DHMMs following reestimation algorithms in Section 2.3.4 on page 26, Section
3.4.2 on page 54, Section 4.2.1 on page 67 and Section 5.2.1 on page 80. For training continuous models, vector sequences were directly used as observation sequences
to train conventional CHMMs, FE-CHMMs, NC-FE-CHMMs, FCM-CHMMs, NCFCM-CHMMs and H-CHMMs following reestimation algorithms in Section 2.3.4 on
page 26, Section 3.4.3 on page 57, Section 4.2.2 on page 68 and Section 5.2.2 on page
80. The second procedure is that recognising unknown words. Assuming that we
have a vocabulary of M words to be recognised and that a HMM was trained for each
word, the speech signal of unknown word is converted into an observation sequence O
using the above codebook. The probabilities P (O|λi ), i = 1, . . . , M were calculated
using (2.26) and the recognised word is the word whose probability is highest. The
three data sets of the TI46 corpus used for training discrete models are the E set,
the 10-digit set, and the 10-command set. The whole 46 words were used to train
continuous models.
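The recognition step itself is a simple argmax over the word models; a sketch, assuming each trained model exposes a hypothetical log_probability(O) method implementing (2.26):

```python
import numpy as np

def recognise_word(observation_sequence, word_models, vocabulary):
    """Return the word whose HMM gives the highest P(O|lambda_i)."""
    log_probs = [model.log_probability(observation_sequence) for model in word_models]
    return vocabulary[int(np.argmax(log_probs))]
```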
7.4.1 E set Results
Table 7.1 presents the experimental results for the recognition of the E set using 6-state left-to-right HMMs in speaker-dependent mode. In the training phase, 10 training tokens (1 training session x 10 repetitions) of each word were used to train conventional DHMMs, FE-DHMMs, NC-FE-DHMMs, FCM-DHMMs, NC-FCM-DHMMs
and H-DHMMs using VQ codebook sizes of 16, 32, 64, and 128. In the recognition phase, isolated word recognition was carried out by testing all 160 test tokens
(10 utterances x 8 testing sessions x 2 repetitions) against conventional DHMMs,
FE-DHMMs, NC-FE-DHMMs, FCM-DHMMs, NC-FCM-DHMMs and H-DHMMs of
each of 16 speakers. For continuous models, 736 test tokens (46 utterances x 8 testing
sessions x 2 repetitions) were tested against conventional CHMMs, FE-CHMMs and
FCM-CHMMs.
Codebook Size | Conventional DHMM | FE-DHMM | NC-FE-DHMM | FCM-DHMM | NC-FCM-DHMM | Hard DHMM
16 | 54.54 | 41.51 | 41.48 | 51.97 | 42.74 | 44.51
32 | 39.41 | 34.38 | 34.36 | 37.46 | 33.46 | 36.72
64 | 33.84 | 29.92 | 29.88 | 30.54 | 27.87 | 30.89
128 | 33.98 | 31.28 | 31.25 | 32.27 | 31.85 | 33.07

Table 7.1: Isolated word recognition error rates (%) for the E set
In general, the errors are very different for the codebook size of 16, where the
highest error is for conventional models. This can be interpreted as follows: with small codebooks, part of the information in the training data was lost and was then not sufficient for training conventional DHMMs. As discussed before, with a suitable degree of fuzzy entropy or fuzziness, fuzzy models can reduce the errors due to this problem. The average recognition error reductions by FE-DHMMs, NC-FE-DHMMs, FCM-DHMMs, and NC-FCM-DHMMs in comparison with conventional DHMMs are (54.54 − 41.51)% = 13.03%, (54.54 − 41.48)% = 13.06%, (54.54 − 51.97)% = 2.57%, and (54.54 − 42.74)% = 11.80%, respectively. As the codebook size increases, the
errors for all models are reduced as shown with the codebook sizes of 32 and 64. The
error difference between these models also decreases as the codebook size increases
since the information obtained from the training data is sufficient for conventional
DHMMs. However, the number of HMM parameters is proportional to the codebook
size because of the matrix B, therefore with the large codebook size, e.g. 128, the
training data is now insufficient for these HMMs and hence the errors for the codebook
size of 128 become larger than those for the codebook size of 64. We should pay
attention to the results of the hard DHMMs. These results are somewhat surprising, since hard models often perform worse than other models.
The results also show that FE models performed better than FCM models for
the E set. However for their noise clustering versions, NC-FE-DHMMs are slightly
better than FE-DHMMs whereas NC-FCM-DHMMs achieved significantly better results in comparison with FCM-DHMMs. The lowest recognition error, 27.87%, is obtained by NC-FCM-DHMMs using a VQ codebook size of 64. This can be interpreted as follows: the exponential function in FE memberships decreases faster than the inverse function in FCM memberships for long distances (see Figure 4.5 on page 74), so FCM models are more sensitive to outliers than FE models.
Experiments were in speaker-dependent mode, so recognition performance should
be evaluated on a per-speaker basis. Table 7.2 shows these speaker-dependent recognition results using the codebook size of 16. Recognition error rates were significantly reduced by using fuzzy models for the female speakers labelled f1, f3, f7, and
male speakers labelled m3, m6, m7, m8. Hard models were also more effective than
conventional models for speakers labelled f3, f6, f7, m3, m7 and m8, but for speakers
labelled f2, m1 and m5, they were worse than conventional models. In general, the results of the noise clustering-based fuzzy models are not very dependent on the speaker.
7.4.2 10-Digit & 10-Command Set Results
Tables 7.3 and 7.4 present the experimental results for the recognition of the 10-digit set and the 10-command set using 6-state left-to-right models in speaker-dependent mode. These sets do not contain highly confusable vocabulary like the E set, therefore the recognition error rates are low for all models. In these two experiments we can see that FCM-DHMMs and NC-FCM-DHMMs, or FE-DHMMs and NC-FE-DHMMs, produce very similar results. This is probably due to the fact that clusters in the 10-digit and 10-command sets are better separated than those in the E set.
7.4.3 46-Word Set Results
The whole vocabulary of 46 words in the TI46 corpus was used to train continuous
models. Based on the results for the 10-digit set and the 10-command set, recognition
performance on the 46-word set using noise-clustering-based fuzzy models was not expected to be much improved in comparison with fuzzy models. So in this experiment only the 3-state 2-mixture and 5-state 2-mixture models, comprising conventional CHMMs, FE-CHMMs and FCM-CHMMs, were trained for each word.
Speaker | Conventional DHMM | FE-DHMM | NC-FE-DHMM | FCM-DHMM | NC-FCM-DHMM | Hard DHMM
f1 | 68.06 | 54.17 | 54.11 | 62.76 | 43.75 | 65.28
f2 | 43.75 | 47.22 | 47.15 | 47.25 | 47.22 | 44.44
f3 | 63.19 | 43.75 | 43.75 | 59.88 | 54.86 | 51.39
f4 | 50.00 | 31.25 | 31.24 | 48.35 | 39.58 | 34.72
f5 | 55.56 | 54.86 | 54.81 | 54.83 | 54.17 | 53.47
f6 | 54.17 | 29.86 | 29.86 | 47.97 | 28.47 | 33.33
f7 | 69.44 | 44.44 | 44.41 | 59.23 | 40.97 | 43.06
f8 | 49.31 | 31.94 | 31.92 | 44.25 | 31.94 | 40.28
m1 | 27.78 | 26.39 | 26.39 | 28.89 | 28.47 | 36.11
m2 | 52.08 | 29.17 | 29.15 | 49.86 | 38.19 | 31.25
m3 | 62.50 | 44.44 | 44.42 | 57.68 | 49.31 | 49.31
m4 | 44.29 | 37.14 | 37.05 | 47.67 | 46.43 | 35.71
m5 | 33.33 | 39.01 | 38.89 | 33.21 | 31.21 | 39.01
m6 | 63.57 | 50.71 | 50.71 | 61.59 | 52.14 | 48.57
m7 | 63.38 | 40.14 | 40.12 | 59.45 | 45.77 | 43.66
m8 | 72.22 | 59.72 | 59.69 | 68.64 | 51.39 | 62.50
Female | 56.69 | 42.19 | 42.16 | 53.07 | 42.62 | 45.75
Male | 52.39 | 40.84 | 40.80 | 50.87 | 42.86 | 43.27
Average | 54.54 | 41.51 | 41.48 | 51.97 | 42.74 | 44.51

Table 7.2: Speaker-dependent recognition error rates (%) for the E set
Codebook Size | Conventional DHMM | FE-DHMM | NC-FE-DHMM | FCM-DHMM | NC-FCM-DHMM | Hard DHMM
16 | 6.21 | 4.84 | 4.32 | 5.83 | 4.74 | 5.03
32 | 2.25 | 1.81 | 1.72 | 2.16 | 2.06 | 1.81
64 | 0.43 | 0.37 | 0.34 | 0.39 | 0.38 | 0.43
128 | 0.39 | 0.35 | 0.35 | 0.38 | 0.38 | 0.43

Table 7.3: Isolated word recognition error rates (%) for the 10-digit set
Codebook Size | Conventional DHMM | FE-DHMM | NC-FE-DHMM | FCM-DHMM | NC-FCM-DHMM | Hard DHMM
16 | 15.74 | 6.45 | 6.32 | 13.78 | 13.74 | 9.07
32 | 4.36 | 3.64 | 3.52 | 3.88 | 3.76 | 3.50
64 | 2.43 | 2.27 | 2.24 | 2.28 | 2.28 | 2.18
128 | 1.65 | 1.45 | 1.43 | 1.60 | 1.58 | 1.73

Table 7.4: Isolated word recognition error rates (%) for the 10-command set
Model | Conventional CHMM | FE-CHMM | FCM-CHMM
3 states 2 mixtures | 9.59 | 9.42 | 8.96
5 states 2 mixtures | 7.19 | 7.15 | 6.57

Table 7.5: Isolated word recognition error rates (%) for the 46-word set
Table 7.5 presents the recognition performance of these models. In this experiment FCM-CHMMs performed better than FE-CHMMs and conventional CHMMs. Speaker-dependent results for the conventional CHMMs and FCM-CHMMs are presented in Table 7.6. In comparison with conventional CHMMs, recognition error rates for the speakers labelled f2, f3, f4, m2, m3 and m6 were improved in both cases, using 3-state 2-mixture and 5-state 2-mixture models. The only poor result was for the speaker labelled f8.
7.5 Speaker Identification
The TI46, the ANDOSL, and the YOHO corpora were used for speaker identification.
GMM-based speaker identification was performed using conventional GMMs, FE-GMMs, NC-FE-GMMs, FCM-GMMs and NC-FCM-GMMs. For VQ-based speaker
identification, FE-VQ, FCM-VQ and (hard) VQ codebooks were also used. The vector
sequences obtained after the LPC analysis of speech signals were used for training
speaker models. Reestimation formulas have been presented in previous sections such
as (2.30) page 30 for GMMs, (3.42) and (3.43) page 60 for FE-GMMs, (4.15) and
(4.16) page 70 for FCM-GMMs. For GMM-based speaker identification, testing was
performed by calculating the probabilities P (X|λi ), i = 1, . . . , M for the unknown X
with M speaker models using (2.31) and the recognised speaker is the speaker whose
probability is highest. For VQ-based speaker identification, testing was performed by
calculating the average distortion D(X, λi ), i = 1, . . . , M for the unknown X with
M speaker models using (6.12) on page 88 and the recognised speaker is the speaker
whose average distortion is lowest.
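For VQ-based identification, the decision reduces to computing (6.12) for each codebook and taking the minimum; a sketch, assuming a Euclidean frame-to-codevector distance (the distance measure actually used in the experiments may differ):

```python
import numpy as np

def vq_distortion(X, codebook):
    """D(X, lambda_i) as in (6.12): distance from each frame to its nearest codevector."""
    X = np.asarray(X, dtype=float)
    C = np.asarray(codebook, dtype=float)
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # squared distances (T, K)
    return float(np.sqrt(d2.min(axis=1)).sum())               # nearest codevector per frame

def identify_speaker(X, codebooks, speakers):
    """VQ-based identification: the speaker with the lowest distortion wins."""
    distortions = [vq_distortion(X, cb) for cb in codebooks]
    return speakers[int(np.argmin(distortions))]
```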
7.5.1 TI46 Results
The vocabulary was the 10-command set. In the training phase, 100 training tokens
(10 utterances x 1 training session x 10 repetitions) of each speaker were used to train
conventional GMMs, FCM-GMMs and NC-FCM-GMMs of 16, 32, 64, and 128 mixtures for GMM-based speaker identification, and VQ, FE-VQ, NC-FE-VQ codebooks
of 16, 32, 64, 128 codevectors for VQ-based speaker identification. The degrees of
fuzzy entropy and fuzziness were chosen as n = 1 and m = 1.06, respectively. Speaker
identification was carried out in text-independent mode by testing all 2560 test tokens
(16 speakers x 10 utterances x 8 testing sessions x 2 repetitions) against conventional
GMMs, FCM-GMMs and NC-FCM-GMMs of all 16 speakers for GMM-based speaker
identification, and against VQ, FE-VQ, NC-FE-VQ codebooks of all 16 speakers for
VQ-based speaker identification.
Figure 7.4: Speaker identification error rate (%) versus the number of mixtures for
16 speakers, using conventional GMMs, FCM-GMMs and NC-FCM-GMMs
The experimental results for GMM-based speaker identification are plotted in Figure 7.4. The identification error rate was improved by using 64-mixture and 128-mixture FCM-GMMs and NC-FCM-GMMs. The identification error rates of FCM-GMMs and NC-FCM-GMMs were about 2% and 3% lower than the error rate of conventional GMMs using 64 mixtures [Tran and Wagner 1998]. Figure 7.5 shows experimental results for VQ-based speaker identification. FE-VQ and NC-FE-VQ also obtained similarly good results to FCM-GMMs and NC-FCM-GMMs.
Figure 7.5: Speaker identification error rate (%) versus the codebook size for 16
speakers, using VQ, FE-VQ and NC-FE-VQ codebooks
7.5.2 ANDOSL Results
For the ANDOSL corpus, the subset of 108 native speakers was used. For each
speaker, the set of 200 long sentences was divided into two subsets. The training
set, consisting of 10 sentences numbered from 001 to 020, was used to train conventional GMMs, FE-GMMs and FCM-GMMs of 32 mixtures. Very low error rates were obtained for all models, thus noise clustering-based fuzzy models were not considered in this experiment. The test set, consisting of 180 sentences numbered from 021 to 200, was used to test the above models. Speaker identification was carried out by testing all 19440 tokens (180 utterances x 108 speakers) against conventional GMMs, FE-GMMs and FCM-GMMs of all 108 speakers. The degrees of fuzzy entropy and fuzziness were set to n = 1.01 for FE-GMMs and m = 1.026 for FCM-GMMs, respectively. The experimental results are presented in Table 7.7 for speaker
groups mentioned in Section 7.1.2. Results for the ANDOSL corpus are better than
those for the TI46 corpus since a larger training data set was used. Female speakers
have higher error rates than male speakers. Elder speakers have lower error rates
than young and medium speakers. Broad speakers have the lowest error rates in
comparison with cultivated and general speakers.
7.5.3 YOHO Results
For the experiments performed on the YOHO corpus, each speaker was modelled by
using 48 training tokens in enrolment sessions 1 and 2 only. Using all four enrolment
sessions, as done for example by [Reynolds 1995a], resulted in error rates that were
too low to allow meaningful comparisons between the different normalisation methods for speaker verification. Conventional GMMs, hard GMMs and VQ codebooks
were trained in the training phase. Speaker identification was carried out in text-independent mode by testing all 8280 test tokens (138 speakers x 10 utterances x 4 sessions) against conventional GMMs, hard GMMs and VQ codebooks of all 138 speakers. Table 7.8 shows identification error rates for these models. As expected,
results show that hard GMMs performed better than VQ but worse than conventional
GMMs.
7.6 Speaker Verification
As discussed in Chapter 6, speaker verification performance depends on speaker models and on the normalisation method used to verify speakers. So experiments were
performed in three cases: 1) Using proposed speaker models and current normalisation methods, 2) Using conventional speaker models and proposed normalisation
methods, and 3) Using proposed speaker models and proposed normalisation methods. Three speech corpora—TI46, ANDOSL, and YOHO—were used and all experiments were in text-independent mode. Speaker models for speaker verification are
also those for speaker identification. Verification rules in (6.1) and (6.2) on page
86 were used for similarity-based scores Li (X), i = 3, . . . , 6 and dissimilarity-based
scores Dj (X), j = 3, . . . , 9 in this chapter.
7.6.1 TI46 Results
The speaker verification experiments were performed on the TI46 corpus to evaluate
proposed speaker models. Conventional GMMs, FCM-GMMs and NC-FCM-GMMs
of 16, 32, 64, 128 mixtures and VQ, FE-VQ, NC-FE-VQ codebooks of 16, 32, 64, 128
codevectors in speaker identification were used. The normalisation method is L3 (X)
7.6 Speaker Verification
112
in (6.8) page 88 for GMM-based speaker verification, and is D1 (X) in (6.11) page
88 for VQ-based speaker verification. Since the TI46 corpus has a small speaker set
consisting of 8 female and 8 male speakers, therefore for a verification experiment, each
speaker was used as a claimed speaker with the remaining speakers (including 7 samegender background speakers) acting as impostors and rotating through all speakers.
So the total number of claimed test utterances and impostor test utterances are 2560
(16 claimed speakers x 10 test utterances x 8 sessions x 2 repetitions) and 38400 ((16
x 15) impostors x 10 test utterances x 8 sessions x 2 repetitions), respectively.
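For reference, the EER reported throughout this chapter can be estimated from lists of claimed-speaker and impostor scores by sweeping the decision threshold; the sketch below assumes similarity scores under rule (6.1) and is not the evaluation code actually used for these experiments.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER: sweep a threshold and find where FAR and FRR are closest."""
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    far = np.array([(impostor > t).mean() for t in thresholds])   # false acceptance rate
    frr = np.array([(genuine <= t).mean() for t in thresholds])   # false rejection rate
    i = int(np.argmin(np.abs(far - frr)))
    return (far[i] + frr[i]) / 2.0
```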
Figure 7.6: EERs (%) for GMM-based speaker verification performed on 16 speakers,
using GMMs, FCM-GMMs and NC-FCM-GMMs.
Experimental results for GMM-based speaker verification are plotted in Figure 7.6. The lowest EER, about 3%, was obtained for 128-mixture NC-FCM-GMMs. Both FCM-GMMs and NC-FCM-GMMs show better results than conventional GMMs, with the largest improvement of about 1% found for 64-mixture NC-FCM-GMMs. Figure 7.7 shows EERs for VQ-based speaker verification using VQ, FE-VQ and NC-FE-VQ codebooks. Similarly, with codebook sizes of 16 and 32, FE-VQ and NC-FE-VQ obtain large improvements over VQ codebooks.
7.6.2 ANDOSL Results
Experiments were performed on 108 speakers using each speaker as a claimed speaker
with 5 closest background speakers or 5 same-subgroup background speakers as indicated in Section 7.1.2, and 102 mixed-gender impostors (excluding the 5 background speakers), rotating through all speakers. The total numbers of claimed test utterances and impostor test utterances are 20520 (108 claimed speakers x 190 test utterances) and 2093040 ((108 x 102) impostors x 190 test utterances), respectively.

Figure 7.7: EERs (%) for VQ-based speaker verification performed on 16 speakers, using VQ, FE-VQ and NC-FE-VQ codebooks
In general, EERs obtained using the top 5 background speaker set are lower
than those obtained using the 5 subgroup background speaker set. For both 16
and 32 mixture GMMs, proposed normalisation methods D3 (X), D4 (X), . . . , D9 (X),
especially D8(X) (FCM membership) and D9(X) (using the arctan function), produce lower EERs than the current normalisation methods L3(X), . . . , L6(X). More interestingly, the noise clustering-based normalisation methods D3nc(X), D4nc(X), . . . , D9nc(X) and L3nc(X), . . . , L6nc(X) produce better results than both the current and the proposed methods. The reason was discussed in the previous chapter: noise clustering-based normalisation methods reduce false acceptances and hence also the EER. For 16-mixture GMMs, the geometric mean normalisation method L5(X) using the 5 subgroup background speaker set produces the highest EER of 3.52% whereas the proposed method D8nc(X) using the top 5 background speaker set produces the lowest EER of 1.66%. Similarly, for 32-mixture GMMs, L5(X) using the 5 subgroup background speaker set produces the highest EER of 3.26% and D8nc(X) using the top 5 background speaker set produces the lowest EER of 1.26%.
7.6.3 YOHO Results
Experiments were performed on 138 speakers using each speaker as a claimed speaker
with 5 closest background speakers and 132 mixed-gender impostors (excluding 5
background speakers) and rotating through all speakers. The total number of claimed
test utterances and impostor test utterances are 5520 (138 claimed speakers x 40 test
utterances) and 728640 ((138 x 132) impostors x 40 test utterances), respectively.
Results are shown in two tables as follows.
Table 7.10 shows results for conventional GMMs and VQ codebooks to compare EERs obtained using current and proposed normalisation methods. Similar to the ANDOSL results, for 16, 32 and 64 mixture GMMs, the proposed normalisation methods D3(X), D4(X), . . . , D9(X), especially D9(X) (using the arctan function), produced lower EERs than the current normalisation methods L3(X), . . . , L6(X).
Noise clustering-based normalisation methods D3nc (X), D4nc (X), . . . , D9nc (X) and
L3nc(X), . . . , L6nc(X) also produced better results than both the current and the proposed
methods. The current normalisation method L5 (X) produced the highest EER of
4.47% for 16-mixture GMMs and the proposed method D9nc (X) produced the lowest
EER of 1.83% for 64-mixture GMMs. Better performance was also obtained
for proposed methods using VQ codebooks. The highest EER of 6.80% was obtained
using the current method L5 (X) for VQ codebook size of 16 and the lowest EER of
2.74% was obtained using the proposed methods D5nc(X) and D9nc(X) with a codebook size of 64.
Table 7.11 shows results for conventional GMMs, hard GMMs and VQ codebooks to compare EERs obtained using conventional and proposed models as well as current and proposed normalisation methods. Similar to speaker identification, conventional GMMs produced better results than hard GMMs, and hard GMMs produced better results than VQ codebooks. The results again favour the proposed normalisation methods. For hard GMMs, the current normalisation method L5(X)
produced the highest EER of 5.90% for 16-component hard GMMs and the proposed
method L5nc (X) produced the lowest EER of 3.10% for 32-component hard GMMs.
7.7 Summary and Conclusion
Isolated word recognition, speaker identification and speaker verification experiments
have been performed on the TI46, ANDOSL and YOHO corpora to evaluate proposed
models and proposed normalisation methods. Fuzzy entropy and fuzzy C-means
methods as well as their noise clustering-based versions and hard methods have been
applied to train discrete and continuous hidden Markov models, Gaussian mixture
models and vector quantisation codebooks.
In isolated word recognition, experiments have shown significant results for fuzzy
entropy and fuzzy C-means hidden Markov models compared to conventional hidden
Markov models performed on the highly confusable vocabulary consisting of the nine
English E-set letters: b, c, d, e, g, p, t, v and z. The lowest recognition error
of 27.87% was obtained for noise clustering-based fuzzy C-means 6-state left-to-right
discrete hidden Markov models using VQ codebook size of 64. The highest recognition
error of 54.54% was obtained for conventional 6-state left-to-right discrete hidden
Markov models using VQ codebook size of 16. For the 10-digit and 10-command sets,
the lowest errors of 0.34% and 1.43% were obtained for noise clustering-based fuzzy
entropy 6-state left-to-right discrete hidden Markov models using VQ codebook size
of 64 and 128, respectively. Fuzzy C-means continuous hidden Markov models have
shown good results in 46-word recognition with the lowest error of 6.57% obtained
for 5-state 2-mixture fuzzy C-means continuous hidden Markov models.
In speaker identification, experiments have shown good results for fuzzy entropy
vector quantisation codebooks and fuzzy C-means Gaussian mixture models. For 16
speakers of the TI46 corpus, noise clustering-based fuzzy C-means Gaussian mixture
models obtained the lowest error of 12% using 128 mixtures. Noise clustering-based fuzzy entropy vector quantisation codebooks also obtained the lowest error
of 10.13% using codebook size of 128. For 108 speakers of the ANDOSL corpus, the
lowest error of 0.65% was obtained for fuzzy C-means Gaussian
mixture models using 32 mixtures. Results for 138 speakers of the YOHO corpus have
evaluated hard Gaussian mixture models as intermediate models in the reduction of
Gaussian mixture models to vector quantisation codebooks.
In speaker verification, proposed models and proposed normalisation methods
have been evaluated. For proposed models using current normalisation methods,
the lowest EERs obtained for 16 speakers of the TI46 corpus were about 3% using 128-mixture noise clustering-based fuzzy C-means Gaussian mixture models and
about 3.2% using 128-codevector noise clustering-based fuzzy entropy vector quantisation codebooks. For proposed normalisation methods using conventional models,
the lowest EER obtained for 108 speakers of the ANDOSL corpus was 1.26% for the
noise clustering-based normalisation method using the fuzzy C-means membership
function as a similarity score for 32-mixture Gaussian mixture speaker models. The
lowest EER for 138 speakers of the YOHO corpus was 1.83% for the noise clustering-based normalisation method using the arctan function as a dissimilarity score for
64-mixture Gaussian mixture speaker models.
Speaker | Conventional CHMM (3 states 2 mixtures) | Conventional CHMM (5 states 2 mixtures) | FCM-CHMM (3 states 2 mixtures) | FCM-CHMM (5 states 2 mixtures)
f1 | 12.26 | 7.22 | 11.31 | 9.40
f2 | 8.03 | 6.53 | 5.03 | 5.85
f3 | 12.24 | 9.93 | 10.61 | 7.62
f4 | 7.07 | 7.48 | 5.99 | 3.81
f5 | 15.22 | 9.92 | 14.40 | 10.60
f6 | 10.19 | 7.07 | 8.97 | 7.47
f7 | 7.47 | 5.43 | 6.52 | 5.30
f8 | 6.39 | 4.49 | 7.89 | 4.90
m1 | 5.77 | 4.12 | 4.81 | 4.40
m2 | 3.94 | 2.17 | 2.72 | 1.09
m3 | 12.96 | 10.09 | 10.23 | 8.87
m4 | 8.90 | 5.98 | 9.04 | 4.45
m5 | 5.69 | 3.61 | 6.24 | 3.19
m6 | 13.91 | 11.85 | 13.36 | 11.02
m7 | 9.74 | 6.86 | 10.70 | 6.31
m8 | 13.59 | 12.23 | 15.49 | 10.73
Female | 9.86 | 7.26 | 8.84 | 6.87
Male | 9.32 | 7.12 | 9.08 | 6.27
Average | 9.59 | 7.19 | 8.96 | 6.57

Table 7.6: Speaker-dependent recognition error rates (%) for the 46-word set
Speaker Groups | GMM 16 mixtures | GMM 32 mixtures | FE-GMM 16 mixtures | FE-GMM 32 mixtures | FCM-GMM 16 mixtures | FCM-GMM 32 mixtures
fyb | 0.83 | 0.54 | 0.83 | 0.52 | 0.50 | 0.58
fyc | 3.95 | 2.55 | 3.91 | 2.51 | 3.92 | 2.50
fyg | 3.38 | 2.78 | 3.35 | 2.75 | 3.08 | 2.17
fmb | 2.03 | 0.98 | 2.01 | 0.89 | 1.75 | 0.42
fmc | 2.25 | 1.95 | 2.25 | 1.89 | 2.17 | 1.50
fmg | 2.08 | 1.87 | 2.05 | 1.85 | 1.83 | 1.50
feb | 0.75 | 0.08 | 0.71 | 0.05 | 0.00 | 0.00
fec | 3.92 | 1.74 | 3.90 | 1.70 | 3.92 | 1.58
feg | 1.17 | 0.38 | 1.13 | 0.35 | 0.42 | 0.33
myb | 1.17 | 0.44 | 1.06 | 0.40 | 0.92 | 0.25
myc | 0.42 | 0.29 | 0.37 | 0.28 | 0.25 | 0.17
myg | 0.42 | 0.30 | 0.38 | 0.28 | 0.58 | 0.25
mmb | 0.92 | 0.17 | 0.86 | 0.16 | 0.50 | 0.17
mmc | 0.42 | 0.09 | 0.47 | 0.09 | 0.33 | 0.08
mmg | 0.33 | 0.07 | 0.30 | 0.05 | 0.08 | 0.00
meb | 0.25 | 0.00 | 0.21 | 0.00 | 0.17 | 0.00
mec | 0.08 | 0.04 | 0.06 | 0.04 | 0.08 | 0.00
meg | 0.08 | 0.08 | 0.08 | 0.06 | 0.08 | 0.17
female | 2.26 | 1.43 | 2.24 | 1.39 | 1.95 | 1.18
male | 0.45 | 0.16 | 0.42 | 0.15 | 0.33 | 0.12
young | 1.70 | 1.15 | 1.65 | 1.12 | 1.54 | 0.99
medium | 1.34 | 0.86 | 1.32 | 0.82 | 1.11 | 0.61
elder | 1.04 | 0.39 | 1.02 | 0.37 | 0.78 | 0.35
broad | 0.99 | 0.37 | 0.95 | 0.34 | 0.64 | 0.24
cultivated | 1.84 | 1.11 | 1.83 | 1.09 | 1.78 | 0.97
general | 1.24 | 0.91 | 1.22 | 0.89 | 1.01 | 0.74
Average | 1.36 | 0.80 | 1.33 | 0.77 | 1.14 | 0.65

Table 7.7: Speaker identification error rates (%) for the ANDOSL corpus using conventional GMMs, FE-GMMs and FCM-GMMs.
Speaker | GMM 16 mixtures | GMM 32 mixtures | Hard GMM 16 components | Hard GMM 32 components | VQ 16 vectors | VQ 32 vectors
Female | 15.08 | 7.27 | 14.14 | 6.88 | 17.11 | 8.91
Male | 11.88 | 5.09 | 13.94 | 9.25 | 15.90 | 7.36
Average | 13.48 | 6.18 | 14.04 | 8.06 | 16.5 | 8.13

Table 7.8: Speaker identification error rates (%) for the YOHO corpus using conventional GMMs, hard GMMs and VQ codebooks.
Normalisation Methods | Top 5 Background Set (16-mixture GMM) | Top 5 Background Set (32-mixture GMM) | 5 Subgroup Background Set (16-mixture GMM) | 5 Subgroup Background Set (32-mixture GMM)
L3(X) | 2.16 | 1.79 | 2.68 | 2.11
D3(X) | 1.84 | 1.51 | 2.24 | 1.61
L3nc(X) | 1.81 | 1.40 | 2.00 | 1.41
D3nc(X) | 1.79 | 1.35 | 1.89 | 1.33
L4(X) | 2.16 | 1.79 | 2.68 | 2.10
D4(X) | 1.90 | 1.47 | 2.15 | 1.46
L4nc(X) | 1.81 | 1.40 | 2.00 | 1.41
D4nc(X) | 1.81 | 1.36 | 1.88 | 1.32
L5(X) | 3.16 | 3.03 | 3.51 | 3.26
D5(X) | 2.61 | 2.20 | 2.77 | 2.14
L5nc(X) | 2.02 | 1.97 | 2.32 | 2.10
D5nc(X) | 1.99 | 1.71 | 2.27 | 1.68
D7(X) | 2.11 | 1.84 | 2.63 | 2.07
D7nc(X) | 1.97 | 1.51 | 2.12 | 1.59
D8(X) | 1.91 | 1.53 | 2.34 | 1.69
D8nc(X) | 1.66 | 1.26 | 1.86 | 1.32
D9(X) | 2.08 | 1.66 | 2.33 | 1.69
D9nc(X) | 1.89 | 1.54 | 2.23 | 1.57

Table 7.9: EER results (%) for the ANDOSL corpus using GMMs with different background speaker sets. Rows in bold are the current normalisation methods, others are the proposed methods. The index "nc" denotes noise clustering-based methods.
Normalisation Methods | GMM 16 mixtures | GMM 32 mixtures | GMM 64 mixtures | VQ 16 vectors | VQ 32 vectors | VQ 64 vectors
L3(X) | 4.42 | 3.12 | 2.43 | 5.87 | 4.60 | 3.60
D3(X) | 4.12 | 2.94 | 1.98 | 5.41 | 4.06 | 3.14
L3nc(X) | 4.20 | 2.89 | 1.96 | 4.96 | 4.25 | 3.50
D3nc(X) | 4.25 | 2.90 | 1.96 | 4.76 | 3.84 | 3.09
L4(X) | 4.41 | 3.13 | 2.43 | 5.85 | 4.58 | 3.58
D4(X) | 4.20 | 2.97 | 1.99 | 5.21 | 3.90 | 3.06
L4nc(X) | 4.20 | 2.89 | 1.96 | 4.96 | 4.23 | 3.49
D4nc(X) | 4.21 | 2.87 | 1.97 | 4.72 | 3.76 | 3.04
L5(X) | 4.47 | 3.30 | 2.44 | 6.80 | 5.06 | 4.22
D5(X) | 4.10 | 2.98 | 2.05 | 5.32 | 3.81 | 3.10
L5nc(X) | 3.87 | 2.74 | 1.87 | 4.48 | 3.36 | 2.91
D5nc(X) | 3.75 | 2.76 | 1.88 | 4.49 | 3.33 | 2.74
D7(X) | 4.41 | 3.20 | 2.44 | 6.09 | 4.75 | 3.62
D7nc(X) | 4.30 | 3.10 | 2.27 | 5.10 | 4.39 | 3.54
D8(X) | 4.17 | 2.97 | 2.05 | 5.29 | 3.99 | 3.11
D8nc(X) | 4.16 | 2.95 | 2.04 | 4.50 | 3.41 | 2.97
D9(X) | 3.89 | 2.84 | 1.85 | 4.51 | 3.31 | 2.76
D9nc(X) | 3.86 | 2.81 | 1.83 | 4.34 | 3.27 | 2.74

Table 7.10: Equal Error Rate (EER) results (%) for the YOHO corpus. Rows in bold are the current normalisation methods, others are the proposed methods.
Normalisation Methods | GMM 16 mixtures | GMM 32 mixtures | Hard GMM 16 components | Hard GMM 32 components | VQ 16 vectors | VQ 32 vectors
L3(X) | 4.42 | 3.12 | 5.22 | 3.99 | 5.87 | 4.60
D3(X) | 4.12 | 2.94 | 4.60 | 3.40 | 5.41 | 4.06
L3nc(X) | 4.20 | 2.89 | 5.07 | 3.88 | 4.96 | 4.25
D3nc(X) | 4.25 | 2.90 | 4.38 | 3.33 | 4.76 | 3.84
L4(X) | 4.41 | 3.13 | 5.20 | 4.00 | 5.85 | 4.58
D4(X) | 4.20 | 2.97 | 4.66 | 3.48 | 5.21 | 3.90
L4nc(X) | 4.20 | 2.89 | 4.57 | 3.89 | 4.96 | 4.23
D4nc(X) | 4.21 | 2.87 | 4.57 | 3.45 | 4.72 | 3.76
L5(X) | 4.47 | 3.30 | 5.90 | 4.42 | 6.80 | 5.06
D5(X) | 4.10 | 2.98 | 4.81 | 3.41 | 5.32 | 3.81
L5nc(X) | 3.87 | 2.74 | 4.26 | 3.10 | 4.48 | 3.36
D5nc(X) | 3.75 | 2.76 | 4.30 | 3.19 | 4.49 | 3.33
D7(X) | 4.41 | 3.20 | 5.22 | 3.97 | 6.09 | 4.75
D7nc(X) | 4.30 | 3.10 | 5.03 | 3.92 | 5.10 | 4.39
D8(X) | 4.17 | 2.97 | 4.86 | 3.62 | 5.29 | 3.99
D8nc(X) | 4.16 | 2.95 | 4.45 | 3.21 | 4.50 | 3.41
D9(X) | 3.89 | 2.84 | 4.36 | 3.18 | 4.51 | 3.31
D9nc(X) | 3.86 | 2.81 | 4.32 | 3.13 | 4.34 | 3.27

Table 7.11: Comparisons of EER results (%) for the YOHO corpus using GMMs, hard GMMs and VQ codebooks. Rows in bold are the current normalisation methods, others are the proposed methods.
Chapter 8
Extensions of the Thesis
The fuzzy membership function in fuzzy set theory and the a posteriori probability in
the Bayes decision theory have similar roles. However the minimum recognition error rate
for a recogniser is obtained by the maximum a posteriori probability rule whereas the
maximum membership rule does not lead to such a minimum error rate. This problem
can be solved by using a recently developed branch of fuzzy set theory, namely possibility
theory. The development of this theory has led to a theoretical framework similar to that of probability theory. Therefore, a possibilistic pattern recognition approach to speech and speaker recognition may be developed which is as powerful as the statistical pattern recognition approach. In this chapter, possibility theory is introduced briefly and a small application of this theory, namely a possibilistic C-means approach to speech and speaker recognition, is proposed. It is suggested that future research into the possibilistic pattern
recognition approach to speech and speaker recognition may be very promising.
8.1 Possibility Theory-Based Approach
8.1.1 Possibility Theory
Let us review the basis of the statistical pattern recognition approach in Section 2.3
page 20. The goal of a recogniser is to achieve the minimum recognition error rate.
The maximum a posteriori probability (MAP) decision rule is selected to implement
this goal. However, these probabilities are not known in advance and have to be
estimated from a training set of observations with known class labels. The Bayes
decision theory thus effectively transforms the recogniser design problem into the
distribution estimation problem and the MAP rule is transformed to the maximum
likelihood rule. The distributions are usually parameterised in order to be practically
implementable. The main task is to determine the right parametric form of the
distributions from the training data.
For the fuzzy pattern recognition approach in Section 2.4 page 34, the maximum
membership rule is selected to solve the recogniser design problem. This rule is transformed to the minimum distance rule in fuzzy cluster analysis. However, it has not
been shown that the implementation of these rules leads to the goal of the recogniser, i.e. the minimum recognition error rate as in the statistical pattern recognition
approach.
The fuzzy approaches proposed in this thesis have shown a solution for this problem. By defining the general distance as a decreasing function of the component
distribution, the minimum distance rule becomes the maximum likelihood rule. This
means that the minimum recognition error rate can be achieved in the proposed
approaches.
An alternative approach that is also based on fuzzy set theory is the use of possibility theory introduced by Zadeh [1978], where fuzzy variables are associated with
possibility distributions in a similar way to that in which random variables are associated with probability distributions. A possibility distribution is a representation
of knowledge and information. It can be said that probability theory is used to
deal with randomness and possibility is used to deal with vagueness and ambiguity
[Tanaka and Guo 1999]. Possibility theory is more concerned with the modelling of
partial belief which is due to incomplete data rather than that which is due to the
presence of random phenomena [Dubois and Prade 1988].
In possibilistic data analysis, the total error possibility is defined and the maximum
possibility rule is formulated. The possibility distributions are also parameterised and
the right parametric form can be determined from the training data. In a new application of the possibilistic approach to operations research [Tanaka and Guo 1999],
an exponential possibility distribution that is similar to a Gaussian distribution has
been proposed. Similarly, applying the possibilistic approach to speech and speaker
recognition would be worth investigating.
8.1.2 Possibility Distributions
A possibility distribution on a one-dimensional space is a fuzzy membership function
of a point x in a fuzzy set A and is denoted as ΠA (x). For a set of n numbers
h1 , . . . , hn , let the h-level sets Ahi = {x|ΠA (x) ≥ hi } be conventional sets (intervals)
such that if h1 ≤ h2 ≤ . . . ≤ hn then Ah1 ⊇ Ah2 ⊇ . . . ⊇ Ahn . The distribution ΠA (x)
should satisfy the following conditions
• There exists an x such that ΠA (x) = 1 (normality),
• h-level sets of fuzzy numbers are convex (convexity),
• ΠA (x) is piecewise continuous (continuity).
The possibility function is a unimodal function. The possibility distribution on the d-dimensional space is similarly defined. Let A be a fuzzy vector defined as

    A = {(x_1, . . . , x_d) | x_1 ∈ A_1, . . . , x_d ∈ A_d}    (8.1)

where A_1, . . . , A_d are fuzzy sets. Denoting x = (x_1, . . . , x_d)′, the possibility distribution of A can be defined by

    Π_A(x) = Π_{A_1}(x_1) ∧ . . . ∧ Π_{A_d}(x_d)    (8.2)

For example, an exponential possibility distribution on the d-dimensional space can be described as

    Π_A(x) = exp{ −(x − m)′ S_A^{-1} (x − m) }    (8.3)

where m is a centre vector and S_A is a symmetric positive definite matrix. The parametric representation of the exponential possibility distribution is λ = (m, S_A).
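As an illustration of (8.3), the sketch below evaluates an exponential possibility distribution directly from its parameters λ = (m, S_A). This is a minimal Python/NumPy sketch; the function name and the example parameters are illustrative assumptions and are not part of the proposed models.

```python
import numpy as np

def exponential_possibility(x, m, S):
    """Exponential possibility distribution (8.3):
    Pi_A(x) = exp{ -(x - m)' S^{-1} (x - m) },
    with centre vector m and symmetric positive definite matrix S."""
    d = x - m
    return float(np.exp(-d @ np.linalg.solve(S, d)))

# Hypothetical two-dimensional example; Pi_A(m) = 1 at the centre (normality).
m = np.array([0.0, 0.0])
S = np.array([[2.0, 0.5],
              [0.5, 1.0]])
print(exponential_possibility(np.array([0.0, 0.0]), m, S))   # 1.0
print(exponential_possibility(np.array([1.0, -1.0]), m, S))  # value below 1
```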
8.1.3 Maximum Possibility Rule
Consider the simplest case where two classes A and B are characterised by two possibility distributions ΠA (x) and ΠB (x), respectively. The task is to classify the vector
x into A or B. Let uA (x) and uB (x) be the degrees of possibility to which x belongs
to A and B, respectively. The error possibilities, namely the possibility that x belonging to A is assigned to B and vice versa, are denoted as

    E(A → B) = max_x u_B(x) Π_A(x)
    E(B → A) = max_x u_A(x) Π_B(x)    (8.4)
The total error possibility E can be defined as

    E = E(A → B) + E(B → A)    (8.5)
It can further be shown that [Tanaka and Guo 1999]
    E ≥ max_x [u_{A*}(x) Π_B(x) + u_{B*}(x) Π_A(x)]    (8.6)

where

    u_{A*}(x) = 1 if Π_A(x) ≥ Π_B(x), and 0 otherwise
    u_{B*}(x) = 1 if Π_B(x) > Π_A(x), and 0 otherwise    (8.7)
Then we obtain the maximum possibility rule written as an if-then rule as follows

    If Π_A(x) ≥ Π_B(x) then x belongs to A,
    If Π_A(x) < Π_B(x) then x belongs to B.    (8.8)
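A minimal sketch of the maximum possibility rule (8.8), assuming the two classes are described by exponential possibility distributions of the form (8.3) and reusing the evaluation function from the previous sketch; the function and variable names are illustrative only.

```python
def classify_max_possibility(x, params_A, params_B):
    """Maximum possibility rule (8.8): assign x to A if Pi_A(x) >= Pi_B(x), else to B.
    params_A and params_B are (centre, matrix) pairs of exponential possibility
    distributions as in (8.3); exponential_possibility is the previous sketch."""
    pi_A = exponential_possibility(x, *params_A)
    pi_B = exponential_possibility(x, *params_B)
    return "A" if pi_A >= pi_B else "B"
```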
8.2 Possibilistic C-Means Approach
8.2.1 Possibilistic C-Means Clustering
As shown in the noise clustering method (see Section 2.4.7 on page 39), the FCM
method uses the probabilistic constraint that the memberships of a feature vector
xt across clusters must sum to one. It is meant to avoid the trivial solution of all
memberships being equal to zero. However, since the memberships generated by this
constraint are relative numbers, they are not suitable for applications in which the
memberships are supposed to represent typicality or compatibility with an elastic
constraint. A possibilistic C-means (PCM) clustering has been proposed to generate
memberships that have a typicality interpretation [Krishnapuram and Keller 1993].
Following the fuzzy set theory [Zadeh 1965], the membership uit = ui (xt ) is the
degree of compatibility of the feature vector xt with cluster i, or the possibility of
xt belonging to cluster i. If the clusters are thought of as a set of fuzzy subsets defined over the domain of discourse X = {x_1, . . . , x_T}, then
there should be no constraint on the sum of the memberships.
Let U = [uit ] be a matrix whose elements are memberships of xt in cluster i,
i = 1, . . . , C, t = 1, . . . , T . Possibilistic C-partition space for X is the set of matrices
U such that

    0 ≤ u_it ≤ 1  ∀i, t,    max_{1≤i≤C} u_it > 0  ∀t,    0 < ∑_{t=1}^{T} u_it < T  ∀i    (8.9)
The objective function may be formulated as follows

    J_m(U, λ; X) = ∑_{i=1}^{C} ∑_{t=1}^{T} u_it^m d_it^2 + ∑_{i=1}^{C} η_i ∑_{t=1}^{T} (1 − u_it)^m    (8.10)
where U = {uit } is a possibilistic C-partition of X, m > 1, and ηi , i = 1, . . . , C
are suitable positive numbers. Minimising the PCM objective function Jm (U, λ; X)
in (8.10) demands the distances in the first term to be as low as possible and the
uit in the second term to be as large as possible, thus avoiding the trivial solution.
Parameters are estimated similarly to those in the NC approach. Minimising (8.10) with respect to u_it gives

    u_it = 1 / [1 + (d_it^2 / η_i)^{1/(m−1)}]    (8.11)
Equation (8.11) defines a possibility distribution function Πi for cluster i over the domain of discourse consisting of all feature vectors xt ∈ X. The value of ηi determines
the relative degree to which the second term in the objective function is important
compared with the first. In general, ηi relates to the overall size and shape of cluster
i. In practice, the following definition works well
    η_i = K_0 ∑_{t=1}^{T} u_it^m d_it^2 / ∑_{t=1}^{T} u_it^m    (8.12)
where typically K0 is chosen to be one.
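A minimal sketch of PCM clustering under these equations is given below, assuming Euclidean distances, an FCM-style centre update weighted by u_it^m, and a fixed number of iterations; initial centres would typically come from FCM or k-means. The function name, the provisional initial memberships and the stopping rule are assumptions for illustration, not the exact procedure of [Krishnapuram and Keller 1993].

```python
import numpy as np

def pcm_cluster(X, centres, m=2.0, K0=1.0, n_iter=20):
    """Sketch of possibilistic C-means clustering with Euclidean distances.
    X: (T, d) feature vectors; centres: (C, d) initial centres (e.g. from FCM)."""
    u = np.full((centres.shape[0], X.shape[0]), 0.5)   # provisional memberships
    for _ in range(n_iter):
        # squared Euclidean distances d_it^2 between centres and feature vectors
        d2 = ((centres[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)      # (C, T)
        # reference distances eta_i, equation (8.12), with K0 typically one
        eta = K0 * (u ** m * d2).sum(axis=1) / (u ** m).sum(axis=1)
        # membership update, equation (8.11)
        u = 1.0 / (1.0 + (d2 / eta[:, None]) ** (1.0 / (m - 1.0)))
        # centre update as in FCM, weighted by u_it^m (assumed M-step)
        w = u ** m
        centres = (w @ X) / w.sum(axis=1, keepdims=True)
    return u, centres
```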
8.2.2 PCM Approach to FE-HMMs
For the possibilistic C-Means (PCM) approach, based on (8.9) in Section 8.2.1, the
matrix U = [uijt ] is defined as
    0 ≤ u_ijt ≤ 1  ∀i, j, t,    max_{1≤i,j≤N} u_ijt > 0  ∀t,    0 < ∑_{t=1}^{T} u_iit < T  ∀i    (8.13)
The fuzzy likelihood function in the PCM approach is as follows
    L_n(U, λ; O) = − ∑_{t=0}^{T−1} ∑_{i=1}^{N} ∑_{j=1}^{N} u_ijt d_ijt^2 − ∑_{i=1}^{N} ∑_{j=1}^{N} n_ij ∑_{t=0}^{T−1} (u_ijt log u_ijt − u_ijt)    (8.14)
where the degree of fuzzy entropy nij > 0 is dependent on the states. Since f (uijt ) =
uijt log uijt − uijt is a monotonically decreasing function in [0, 1], maximising the fuzzy
likelihood Ln (U, λ; O) in (8.14) over U forces uijt to be as large as possible. By setting
the derivative of Ln (U, λ; O) with respect to uijt to zero, we obtain the algorithm for
the FE-DHMM in the PCM approach (PCM-FE-DHMM)
Fuzzy E-Step:

    u_ijt = exp{ −d_ijt^2 / n_ij } = [P(O, s_t = i, s_{t+1} = j | λ)]^{1/n_ij}    (8.15)
M-Step: The M-step is identical to the M-step of the FE-DHMM in Section
3.4.2.
Similarly, the FE-CHMM in the PCM approach (PCM-FE-CHMM) is as follows
Fuzzy E-Step:
uijt
−d2jkt }
= exp
= [P (O, st = j, mt = k|λ)]1/njk
njk
{
(8.16)
M-Step: The M-step is identical to the M-step of the FE-CHMM in Section 3.4.3.
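For a concrete view of the fuzzy E-step (8.15), the joint probability P(O, s_t = i, s_{t+1} = j | λ) of a discrete HMM can be computed from the standard forward and backward variables as α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j). The sketch below assumes these variables have already been obtained from the usual forward-backward recursions; the function and variable names are illustrative assumptions rather than the exact implementation used in the experiments.

```python
import numpy as np

def pcm_fe_dhmm_estep(alpha, beta, A, B, obs, n):
    """Fuzzy E-step (8.15) of the PCM-FE-DHMM:
    u_ijt = [P(O, s_t = i, s_{t+1} = j | lambda)]^(1 / n_ij), where
    P(O, s_t = i, s_{t+1} = j | lambda) = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j).
    alpha, beta: (T, N) forward/backward variables; A: (N, N) transition matrix;
    B: (N, M) discrete emission probabilities; obs: length-T symbol sequence;
    n: (N, N) matrix of fuzzy entropy degrees n_ij > 0."""
    T, N = alpha.shape
    u = np.zeros((N, N, T - 1))
    for t in range(T - 1):
        joint = alpha[t][:, None] * A * B[:, obs[t + 1]][None, :] * beta[t + 1][None, :]
        u[:, :, t] = joint ** (1.0 / n)
    return u
```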
8.2.3 PCM Approach to FCM-HMMs
For the possibilistic C-Means (PCM) approach, based on (8.9) in Section 8.2.1 page
127, the matrix U = [uijt ] is defined as
    0 ≤ u_ijt ≤ 1  ∀i, j, t,    max_{1≤i,j≤N} u_ijt > 0  ∀t,    0 < ∑_{t=1}^{T} u_iit < T  ∀i    (8.17)
The fuzzy objective function in the PCM approach is as follows
    J_m(U, λ; O) = ∑_{t=0}^{T−1} ∑_{i=1}^{N} ∑_{j=1}^{N} u_ijt^m d_ijt^2 + ∑_{i=1}^{N} ∑_{j=1}^{N} η_ij ∑_{t=0}^{T−1} (1 − u_ijt)^m    (8.18)
where ηij are suitable positive numbers. The second term forces uijt to be as large as
possible. By setting the derivative of Jm (U, λ; O) with respect to uijt to zero, we obtain
the algorithm for the FCM-DHMM in the PCM approach (PCM-FCM-DHMM)
Fuzzy E-Step:

    u_ijt = 1 / [1 + (d_ijt^2 / η_ij)^{1/(m−1)}]    (8.19)
where dijt is defined in (3.21).
M-Step: The M-step is identical to the M-step of the FCM-DHMM in Section
4.2.1.
Similarly, the FCM-CHMM in the PCM approach (PCM-FCM-CHMM) is as follows
Fuzzy E-Step:

    u_jkt = 1 / [1 + (d_jkt^2 / η_jk)^{1/(m−1)}]    (8.20)
where djkt is defined in (3.30).
M-Step: The M-step is identical to the M-step of the FCM-CHMM in Section
4.2.2.
8.2.4 PCM Approach to FE-GMMs
The matrix U = [uit ] is defined as
    0 ≤ u_it ≤ 1  ∀i, t,    max_{1≤i≤K} u_it > 0  ∀t,    0 < ∑_{t=1}^{T} u_it < T  ∀i    (8.21)
The fuzzy likelihood function in the PCM approach is as follows
    L_n(U, λ; X) = − ∑_{t=1}^{T} ∑_{i=1}^{K} u_it d_it^2 − ∑_{i=1}^{K} n_i ∑_{t=1}^{T} (u_it log u_it − u_it)    (8.22)
where the degree of fuzzy entropy ni > 0 is dependent on the clusters. The algorithm
for the FE-GMM in the PCM approach (PCM-FE-GMM) is as follows
Fuzzy E-Step:

    u_it = exp{ −d_it^2 / n_i } = [P(x_t, i | λ)]^{1/n_i}    (8.23)

M-Step: The M-step is identical to the M-step of the FE-GMM in Section 3.5.1.
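Since (8.23) implies d_it^2 = −log P(x_t, i | λ), the fuzzy E-step of the PCM-FE-GMM only requires the log joint densities of the mixture. A minimal sketch follows, taking P(x_t, i | λ) as the joint density w_i N(x_t; μ_i, Σ_i) and assuming diagonal covariance matrices; the variable names are illustrative assumptions.

```python
import numpy as np

def pcm_fe_gmm_estep(X, weights, means, variances, n):
    """Fuzzy E-step (8.23) of the PCM-FE-GMM: u_it = [P(x_t, i | lambda)]^(1/n_i),
    with P(x_t, i | lambda) = w_i N(x_t; mu_i, Sigma_i) and diagonal covariances.
    X: (T, d); weights: (K,); means, variances: (K, d); n: (K,) entropy degrees."""
    diff = X[None, :, :] - means[:, None, :]                        # (K, T, d)
    log_gauss = -0.5 * ((diff ** 2) / variances[:, None, :]
                        + np.log(2 * np.pi * variances[:, None, :])).sum(axis=2)
    log_joint = np.log(weights)[:, None] + log_gauss                # log P(x_t, i | lambda)
    return np.exp(log_joint / n[:, None])                           # u_it, shape (K, T)
```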
8.2.5 PCM Approach to FCM-GMMs
The matrix U = [uit ] is defined as
    0 ≤ u_it ≤ 1  ∀i, t,    max_{1≤i≤K} u_it > 0  ∀t,    0 < ∑_{t=1}^{T} u_it < T  ∀i    (8.24)
The fuzzy objective function in the PCM approach is as follows
    J_m(U, λ; X) = ∑_{t=1}^{T} ∑_{i=1}^{K} u_it^m d_it^2 + ∑_{i=1}^{K} η_i ∑_{t=1}^{T} (1 − u_it)^m    (8.25)
where η_i, i = 1, . . . , K are suitable positive numbers. The algorithm for the FCM-GMM in the PCM approach (PCM-FCM-GMM) is as follows
Fuzzy E-Step:

    u_it = 1 / [1 + (d_it^2 / η_i)^{1/(m−1)}]    (8.26)

M-Step: The M-step is identical to the M-step of the FCM-GMM in Section 4.3.1.
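The fuzzy E-step (8.26) of the PCM-FCM-GMM replaces the exponential mapping of (8.23) with the Cauchy-type membership of PCM clustering. A minimal sketch follows, again assuming diagonal covariances and d_it^2 = −log P(x_t, i | λ) as the general distance; the small guard against negative distances is an implementation assumption.

```python
import numpy as np

def pcm_fcm_gmm_estep(X, weights, means, variances, eta, m=2.0):
    """Fuzzy E-step (8.26) of the PCM-FCM-GMM:
    u_it = 1 / (1 + (d_it^2 / eta_i)^(1/(m-1))), with d_it^2 = -log P(x_t, i | lambda).
    X: (T, d); weights: (K,); means, variances: (K, d); eta: (K,) positive numbers."""
    diff = X[None, :, :] - means[:, None, :]                        # (K, T, d)
    log_gauss = -0.5 * ((diff ** 2) / variances[:, None, :]
                        + np.log(2 * np.pi * variances[:, None, :])).sum(axis=2)
    d2 = -(np.log(weights)[:, None] + log_gauss)                    # general distances
    d2 = np.maximum(d2, 1e-12)   # guard: keep distances positive (assumption)
    return 1.0 / (1.0 + (d2 / eta[:, None]) ** (1.0 / (m - 1.0)))   # u_it, shape (K, T)
```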
8.2.6 Summary and Conclusion
The theory of possibility, in contrast to heuristic approaches, offers algorithms for
composing hypothesis evaluations which are consistent with axioms in a well-developed
theory. Possibility theory, rather than probability theory, relates to the perception of
degrees of evidence instead of degrees of likelihood. In any case, using possibilities
does not prevent us from using statistics in the estimation of membership functions
[De Mori and Laface 1980].
In the PCM approach, the membership values can be interpreted as possibility
values or degrees of typicality of vectors in clusters. The possibilistic C-partition
defines C distinct possibility distributions and the PCM algorithm can be used to
estimate possibility distributions directly from training data. The role of PCM clustering and PCM models can be seen by comparing the following figures 8.1, 8.2 and
8.3 with the corresponding figures on pages 41, 65 and 75.
[Figure 8.1: PCM Clustering in Clustering Techniques. The diagram places HCM, FCM, NC, PCM, Gustafson-Kessel, Gath-Geva, Extended NC and Extended PCM according to their membership constraints and their parameter sets λ = {µ}, λ = {µ, Σ} and λ = {µ, Σ, w}.]
[Figure 8.2: PCM Approach to FE models for speech and speaker recognition. The diagram extends the FE models with their PCM counterparts: FE-DHMM, NC-FE-DHMM, PCM-FE-DHMM; FE-CHMM, NC-FE-CHMM, PCM-FE-CHMM; FE-GMM, NC-FE-GMM, PCM-FE-GMM; FE-VQ, NC-FE-VQ, PCM-FE-VQ.]
[Figure 8.3: PCM Approach to FCM models for speech and speaker recognition. The diagram extends the FCM models with their PCM counterparts: FCM-DHMM, NC-FCM-DHMM, PCM-FCM-DHMM; FCM-CHMM, NC-FCM-CHMM, PCM-FCM-CHMM; FCM-GMM, NC-FCM-GMM, PCM-FCM-GMM; FCM-VQ, NC-FCM-VQ, PCM-FCM-VQ.]
Chapter 9
Conclusions and Future Research
9.1 Conclusions
Fuzzy approaches to speech and speaker recognition have been proposed and experimentally evaluated in this thesis. To obtain these approaches, the following basic
problems have been solved. First, the time-dependent fuzzy membership function
has been introduced to hidden Markov modelling to denote the degree of belonging of an observation sequence to a state sequence. Second, a relationship between
modelling techniques and clustering techniques has been established by using a general distance defined as a decreasing function of the component probability density.
Third, a relationship between fuzzy models and conventional models has also been established by introducing a new technique, fuzzy entropy clustering. Finally, since the
roles of the fuzzy membership function and the a posteriori probability in the Bayes
decision theory are quite similar, the maximum a posteriori rule can be generalised
to the maximum membership rule. With the above general distance, the use of the
maximum membership rule also achieves the minimum recognition error rate for a
speech and speaker recogniser.
Fuzzy entropy models are the first set of proposed models in the fuzzy modelling
approach. A parameter is introduced as the degree of fuzzy entropy n > 0. With
n → 0, we obtain hard models. With n = 1, fuzzy entropy models reduce to conventional models in the maximum likelihood scheme. Thus, statistical models can
be viewed as special cases of fuzzy models. Fuzzy entropy hidden Markov models,
fuzzy entropy Gaussian mixture models and fuzzy entropy vector quantisation have
all been proposed.
Fuzzy C-means models are the second set of proposed models in the fuzzy modelling approach. A parameter is introduced as the degree of fuzziness m > 1. With
m → 1, we also obtain hard models. Fuzzy C-means hidden Markov models and
fuzzy C-means Gaussian mixture models have also been proposed.
Noise clustering is an interesting fuzzy approach to fuzzy entropy and fuzzy C-means models. This approach is simple but robust and performed very well in the experimental evaluations.
In general, fuzzy entropy and fuzzy C-means models share a common advantage, namely their adjustable parameters n and m. When conventional models do
not work well because of the insufficient training data problem or the complexity of
speech data, such as the nine English E-set words, a suitable value of n or m can be
found to obtain better models.
Hard models are the third set of proposed models. These models are regarded
as a consequence of fuzzy models as fuzzy parameters n and m tend to their limit
values. Hard HMMs are the single-state sequence HMMs. Conventional HMMs using
the Viterbi algorithm can be regarded as “pretty” hard HMMs.
The fuzzy approach to speaker verification is an alternative fuzzy approach in this
thesis. Based on the use of the fuzzy membership function as the claimed speaker’s
score and consideration of the likelihood transformation, six fuzzy membership scores
and ten noise clustering-based scores have been proposed. Using the arctan function
in computing the score illustrates a theoretical extension for normalisation methods,
where not only the logarithm function but also other functions can be used.
Isolated word recognition, speaker identification and speaker verification experiments have been performed on the TI46, ANDOSL and YOHO corpora to evaluate
proposed models and proposed normalisation methods. In isolated word recognition,
experiments have shown very good results for fuzzy entropy and fuzzy C-means hidden Markov models compared to conventional hidden Markov models performed on
the highly confusable vocabulary of the nine English E-set letters: b, c, d, e, g, p, t, u,
v and z. In speaker identification, experiments have shown good results for fuzzy entropy vector quantisation codebooks and fuzzy C-means Gaussian mixture models. In
speaker verification, experiments have shown better results for the proposed normalisation methods, especially for the noise clustering-based methods. With 2,093,040
test utterances for each ANDOSL result and 728,640 test utterances for each YOHO
result, these evaluation experiments are sufficiently reliable.
9.2 Directions for Future Research
Several directions for future research have been suggested which may extend or augment the work in this thesis. These are:
• Possibilistic Pattern Recognition Approach: As shown in Chapter 8, this
approach would be worth investigating. A possibility theory framework to replace the Bayes decision theory framework for the minimum recognition error
rate task of a recogniser looks very promising.
• Fuzzy Entropy Clustering: This technique will be further investigated in both theoretical and
experimental aspects such as the local minima, convergence, cluster validity,
cluster analysis and classifier design in pattern recognition.
• Fuzzy Approach to Discriminative Methods: The fuzzy approaches proposed in this thesis are based on maximum likelihood-based methods. Since
discriminative methods such as maximum mutual information and generalised
probabilistic descent are also effective methods, finding a fuzzy approach to
these methods should be studied.
• Large Vocabulary Speech Recognition: The speech recognition experiments in this thesis were isolated word recognition experiments on small vocabularies. Therefore, to obtain a better evaluation for the proposed fuzzy models,
continuous speech recognition experiments on large vocabularies should be carried out.
• Likelihood Transformations: Since speaker verification has many important
applications, other likelihood transformations should be studied to find more
effective normalisation methods for speaker verification.
Bibliography
[Abramson 1963] N. Abramson, Information Theory and Coding, McGraw Hill, 1963.
[Allerhand 1987] M. Allerhand, Knowledge-Based Speech Pattern Recognition, Kogan Page Ltd,
London, 1987.
[Ambroise et al. 1997] C. Ambroise, M. Dang and G. Govaert, “Clustering of Spatial Data by the
EM Algorithm”, in A. Soares, J. Gomez-Hernandez and R. Froidevaux (eds), geoENV I - Geostatistics for Environmental Applications, vol. 9 of Quantitative Geology and Geostatistics,
Kluwer Academic Publisher, pp. 493-504, 1997.
[Ambroise and Govaert 1998] C. Ambroise and G. Govaert, “Convergence Proof of an EM-Type
Algorithm for Spatial Clustering”, Pattern Recognition Letters, vol. 19, pp. 919-927, 1998.
[Atal 1974] B. S. Atal, “Effectiveness of Linear Prediction Characteristics of the Speech Wave for Automatic
Speaker Identification and Verification”, J. Acoust. Soc. Am., vol. 55, pp. 1304-1312, 1974.
[AT&T’s web site] http://www.research.att.com
[Bakis 1976] R. Bakis, “Continuous Speech Word Recognition via Centisecond Acoustic States”, in
Proc. ASA Meeting (Washington, DC), April, 1976.
[Banon 1981] G. Banon, “Distinction Between Several Subsets of Fuzzy Measures”, Fuzzy Sets and
Systems, vol. 5, pp. 291-305, 1981.
[Baum 1972] L. E. Baum, “An inequality and associated maximisation technique in statistical estimation for probabilistic functions of a Markov process”, Inequalities, vol. 3, pp. 1-8, 1972.
[Baum and Sell 1968] L. E. Baum and G. Sell, “Growth transformations for functions on manifolds”,
Pacific J. Maths., vol. 27, pp. 211-227, 1968.
[Baum and Eagon 1967] L. E. Baum and J. A. Eagon, “An inequality with applications to statistical
estimation for probabilistic functions of a Markov process and to a model for Ecology”, Bull.
Amer. Math. Soc., vol. 73, pp. 360-363, 1967.
[BBN’s web site] http://www.gte.com/AboutGTE/gto/bbnt/speech/research/technologies/index.html.
[Bellegarda 1996] J. R. Bellegarda, “Context dependent vector quantization for speech recognition”,
chapter 6 in Automatic Speech and Speaker Recognition, Advanced Topics, edited by Chin-Hui
Lee, Frank K. Soong, and Kuldip K. Paliwal, Kluwer Academic Publishers, USA, pp. 133-158,
1996.
[Bellegarda and Nahamoo 1990] J. R. Bellegarda and D. Nahamoo, “Tied mixture continuous parameter modelling for speech recognition”, in IEEE Trans. Acoustics, Speech, Signal Proc.,
vol. 38, pp. 2033-2045, 1990.
[Bellman et al. 1966] R. Bellman, R. Kalaba and L. A. Zadeh, “Abstraction and Pattern Recognition”, J. Math. Anal. Appl., vol. 13, pp. 1-7, 1966.
[Bezdek et al. 1998] J. C. Bezdek, T. R. Reichherzer, G. S. Lim and Y. Attikiouzel, “Multipleprototype classifier design”, IEEE Trans. Syst. Man Cybern., vol. 28, no. 1, pp. 67-79, 1998.
[Bezdek 1993] J. C. Bezdek, “A review of probabilistic, fuzzy and neural models for pattern recognition”, J. Intell. and Fuzzy Syst., vol. 1, no. 1, pp. 1-25, 1993.
[Bezdek and Pal 1992] J. C. Bezdek and S. K. Pal, Fuzzy Models for Pattern Recognition, IEEE
Press, 1992.
[Bezdek 1990] J. C. Bezdek, “A Convergence Theorem for the Fuzzy ISODATA Clustering Algorithms”, IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-2, no.1, pp. 1-8, January
1990.
[Bezdek 1981] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum
Press, New York and London, 1981.
[Bezdek and Castelaz 1977] J. C. Bezdek and P. F. Castelaz, “Prototype classification and feature
selection with fuzzy sets”, IEEE Trans. Syst. Man Cybern., vol. SMC-7, no. 2, pp. 87-92,
1977.
[Bezdek 1974] J. C. Bezdek, “Cluster validity with fuzzy sets”, J. Cybern., vol. 3, no. 3, pp. 58-72,
1974.
[Bezdek 1973] J. C. Bezdek, Fuzzy mathematics in Pattern Classification, Ph.D. thesis, Applied
Math. Center, Cornell University, Ithaca, 1973.
[Booth and Hobert 1999] J. G. Booth and J. P. Hobert, “Maximizing Generalized Linear Mixed
Model Likelihoods with an Automated Monte Carlo EM algorithm”, J. Roy. Stat. Soc., Ser.
B, 1999 (to appear).
[Cambridge’s web site] http://svr-www.eng.cam.ac.uk/
[Campbell 1997] J. P. Campbell, “Speaker Recognition: A Tutorial”, in Special issue on Automated
biometric Syst., Proc. IEEE, vol. 85, no. 9, pp. 1436-1462, 1997.
[Chou et al. 1989] P. Chou, T. Lookabaugh and R. Gray, “Entropy-constrained vector quantisation”, IEEE Trans. Acoustic, Speech, and Signal Processing, vol. ASSP-37, pp. 31-42, 1989.
[Choi and Oh 1996] H. J. Choi and Y. H. Oh, “Speech recognition using an enhanced FVQ based on
codeword dependent distribution normalization and codeword weighting by fuzzy objective
function”, in Proceedings of the International Conference on Spoken Language Processing
(ICSLP), vol. 1, pp. 354-357, 1996.
[Cover and Hart 1967] T. M. Cover and P. E. Hart, “Nearest neighbour pattern classification”,
IEEE Trans. Inform. Theory, vol. IT-13, pp. 21-27, 1967.
[CSELT’s web site] http://www.cselt.it/
[Dang and Govaert 1998] M. Dang and G. Govaert, “Spatial Fuzzy Clustering using EM and Markov
Random Fields”, J. Syst. Research & Inform. Sci., vol. 8, pp. 183-202, 1998.
[Das and Picheny 1996] S. K. Das and M. A. Picheny, “Issues in practical large vocabulary isolated
word recognition: the IBM Tangora system”, chapter 19 in Automatic Speech and Speaker
Recognition, Advanced Topics, edited by Chin-Hui Lee, Frank K. Soong, and Kuldip K.
Paliwal, Kluwer Academic Publishers, USA, pp. 457-480, 1996.
[Davé and Krishnapuram 1997] R. N. Davé and R. Krishnapuram, “Robust clustering methods: a
unified view”, IEEE Trans. Fuzzy Syst., vol. 5, no.2, pp. 270-293, 1997.
[Davé and Bhaswan 1992] R. N. Davé and K. Bhaswan, “Adaptive fuzzy c-shells clustering and
detection of ellipses”, IEEE Trans. Neural Networks, vol. 3, pp. 643-662, May 1992.
[Davé 1991] R. N. Davé, “Characterization and detection of noise in clustering”, Pattern Recognition
Lett., vol. 12, no. 11, pp. 657-664, 1991.
[Davé 1990] R. N. Davé, “Fuzzy-shell clustering and applications to circle detection in digital images”, Int. J. General Systems, vol. 16, pp. 343-355, 1990.
[De Luca and Termini 1972] A. de Luca, S. Termini, “A definition of a nonprobabilistic entropy in
the setting of fuzzy set theory”, Inform. Control, vol. 20, pp. 301-312, 1972.
[De Mori and Laface 1980] R. De Mori and P. Laface, “Use of fuzzy algorithms for phonetic and
phonemic labeling of continuous speech”, IEEE trans. Pattern Anal. Machine Intell., vol.
PAMI-2, no. 2, pp. 136-148, March 1980.
[Dempster et al. 1977] A. P. Dempster, N. M. Laird and D. B. Rubin, “Maximum Likelihood from
Incomplete Data via the EM algorithm”, J. Roy. Stat. Soc., Series B, vol. 39, pp. 1-38, 1977.
[DRAGON’s web site] http://www.dragonsys.com/products/index.html
[Dubois and Prade 1988] D. Dubois and H. Prade, Possibility Theory; An Approach to Computerized Processing of Uncertainty, Plenum Press, New York, 1988.
[Duda and Hart 1973] R. O. Duda and P. E. Hart, Pattern classification and scene analysis, John
Wiley & Sons, New York, 1973.
[Duisburg’s web site] http://www.uni-duisburg.de/e/Forschung/
[Dunn 1974] J. Dunn, “A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact
Well-Separated Cluster”, J. Cybern., vol. 3, pp. 32-57, 1974.
[Doddington 1998] G. R. Doddington, “Speaker Recognition Evaluation Methodology – An
Overview and Perspective”, in Proc. Workshop on Speaker Recognition and its Commercial
and Forensic Applications (RLA2C), pp. 60-66, 1998.
[Fessler and Hero 1994] J. A. Fessler and A. O. Hero, “Space-alternating generalised EM algorithm”,
IEEE Trans. Signal Processing, vol. 42, pp. 2664-2677, 1994.
[Flanagan 1972] J. L. Flanagan, Speech Analysis, Synthesis, and Perception, 2nd ed., Springer-Verlag, New York, 1972.
[Fogel 1995] D. B. Fogel, Evolutionary Computation, Toward A New Philosophy of Machine Intelligence, IEEE Press, New York, 1995.
[Freitas 1998] J. F. G. Freitas, M. Niranjan and A. H. Gee, “The EM algorithm and neural networks for nonlinear state space estimation”, Technical Report CUED/F-INFENG/TR 313,
Cambridge University, 1998.
[Furui 1997] Sadaoki Furui, “Recent advances in speaker recognition”, Pattern Recognition Lett., vol.
18, pp. 859-872, 1997.
[Furui 1996] Sadaoki Furui, “An Overview of Speaker Recognition Technology”, chapter 2 in Automatic Speech and Speaker Recognition, Advanced Topics, edited by Chin-Hui Lee, Frank K.
Soong, and Kuldip K. Paliwal, Kluwer Academic Publishers, USA, pp. 31-56, 1996.
[Furui 1994] Sadaoki Furui, “An Overview of Speaker Recognition Technology”, in Proc. ESCA
Workshop on Automatic Speaker Recognition, Identification and Verification, pp. 1-9, 1994.
[Furui and Sondhi 1991a] Sadaoki Furui and M. Mohan Sondhi, Advances in Speech Signal Processing, Marcel Dekker, Inc., New York, 1991.
[Furui 1991b] Sadaoki Furui, “Speaker-independent and speaker-adaptive recognition techniques”,
in Advances in Speech Signal Processing, edited by Sadaoki Furui and M. Mohan Sondhi,
Marcel Dekker, Inc., New York, pp. 597-622, 1991.
[Furui 1989] Sadaoki Furui, Digital Speech, Processing, Synthesis, and Recognition, Marcel Dekker,
Inc., New York, 1989.
[Furui 1981] Sadaoki Furui, “Cepstral Analysis Techniques for Automatic Speaker Verification”,
IEEE Trans. Acoustic, Speech, and Signal Processing, vol. 29, pp. 254-272, 1981.
[Gath and Geva 1989] I. Gath and A. B. Geva, “Unsupervised optimal fuzzy clustering”, IEEE
Trans. Patt. Anal. Mach. Intell., PAMI vol. 11, no. 7, pp. 773-781, 1989.
[Ghahramani 1997] Z. Ghahramani, “Factorial Hidden Markov Models”, in Machine Learning, vol.
29, pp. 245-275, Kluwer Academic Publisher, 1997.
[Ghahramani 1995] Z. Ghahramani, “Factorial Learning and the EM Algorithm”, in Adv. Neural
Inform. Processing Syst. G. Tesauro, D.S. Touretzky and J. Alspector (eds.), vol. 7, pp.
617-624, MIT Press, Cambridge, 1995.
[Gish 1990] H. Gish, “Robust discrimination in automatic speaker identification”, in Proc. IEEE
Inter. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’90), pp. 289-292, 1990.
[Gish et al. 1991] H. Gish, M.-H. Siu and R. Rohlicek, “Segregation of speakers for speech recognition and speaker identification”, in Proc. IEEE Inter. Conf. on Acoustics, Speech, and Signal
Processing (ICASSP’91), pp. 873-876, 1991.
[Goldberg 1989] D. E. Goldberg, Genetic Algorithms in Search, Optimization & Machine Learning,
Addison-Wesley, 1989.
[Gravier and Chollet 1998] G. Gravier and G. Chollet, “Comparison of Normalization Techniques
for Speaker Verification”, in Proc. on Speaker Recognition and its Commercial and Forensic
Applications (RLA2C), pp. 97-100, 1998.
[Gustafson and Kessel 1979] D. E. Gustafson and W. Kessel, “Fuzzy clustering with a Fuzzy Covariance Matrix”, in in Proc. IEEE-CDC, (K. S. Fu, ed.), vol. 2, pp. 761-766, IEEE Press,
Piscataway, New Jersey, 1979.
[Harrington and Cassidy 1996] J. Harrington and S. Cassidy, Techniques in Speech Acoustics,
Kluwer Academic Publications, 1996.
[Hartigan 1975] J. Hartigan, Clustering Algorithms, Wiley, New York, 1975.
[Hathaway 1986] R. Hathaway, “Another interpretation of the EM algorithm for mixture distribution”, J. Stat. Prob. Lett., vol. 4, pp. 53-56, 1986.
[Higgins et al. 1991] A. L. Higgins, L. Bahler and J. Porter, “Speaker Verification using Randomnized Phrase Prompting”, Digital Signal Processing, vol. 1, pp. 89-106, 1991.
[Höppner et al. 1999] F. Höppner, F. Klawonn, R. Kruse and T. Runkler, Fuzzy Cluster Analysis
– Methods for classification, Data analysis and Image Recognition, John Wiley & Sons Ltd,
1999.
[Huang et al. 1996] X. Huang, A. Acero, F. Alleva, M. Huang, L. Jiang and M. Mahajan, “From
SPHINX-II to WHISPER: Making speech recognition usable”, chapter 20 in Automatic Speech
and Speaker Recognition, Advanced Topics, edited by Chin-Hui Lee, Frank K. Soong, and
Kuldip K. Paliwal, Kluwer Academic Publishers, USA, pp. 481-508, 1996.
[Huang et al. 1990] X. D. Huang, Y. Ariki and M. A. Jack, Hidden Markov Models For Speech
Recognition, Edinburgh University Press, 1990.
[Huang and Jack 1989] X. D. Huang and M. A. Jack, “Semi-Continuous Hidden Markov Models
For Speech Signal”, Computer, Speech and Language, vol. 3, pp. 239-251, 1989.
[IBM’s web site] http://www-4.ibm.com/software/speech/
[Jaynes 1957] E. T. Jaynes, “Information theory and statistical mechanics”, Phys. Rev., vol. 106,
pp. 620-630, 1957.
[Juang 1998] B.-H. Juang, “The Past, Present, and Future of Speech Processing”, IEEE Signal
Processing Magazine, vol. 15, no. 3, pp. 24-48, 1998.
[Juang et al. 1996] B.-H. Juang, W. Chou and C.-H. Lee, “Statistical and discriminative methods
for speech recognition”, chapter 5 in Automatic Speech and Speaker Recognition, Advanced
Topics, edited by Chin-Hui Lee, Frank K. Soong, and Kuldip K. Paliwal, Kluwer Academic
Publishers, USA, pp. 109-132, 1996.
[Juang and Katagiri 1992] B.-H. Juang and S. Katagiri, “Discriminative learning for minimum error
classification”, IEEE Trans. Signal processing, SP-40, no. 12, pp. 3043-3054, 1992.
[Juang and Rabiner 1991] B.-H. Juang and L. R. Rabiner, “Issues in using hidden Markov models
for speech and speaker recognition”, in Advances in Speech Signal Processing, edited by
Sadaoki Furui and M. Mohan Sondhi, Marcel Dekker, Inc., New York, pp. 509-554, 1991.
[Juang 1985] B.-H. Juang, “Maximum likelihood estimation for multivariate observations of Markov
sources”, AT&T Technical Journal, vol. 64, pp. 1235-1239, 1985.
[Kasabov 1998] N. Kasabov, “A framework for intelligent conscious machines and its application to
multilingual speech recognition systems”, in Brain-like computing and intelligent information
systems, S. Amari and N. Kasabov eds. Singapore, Springer Verlag, pp. 106-126, 1998.
[Kasabov et al. 1999] N. Kasabov, R. Kozma, R. Kilgour, M. Laws, J. Taylor, M. Watts and
A. Gray, “Hybrid connectionist-based methods and systems for speech data analysis and
phoneme-based speech recognition” in Neuro-Fuzzy Techniques for Intelligent Information
Processing, N. Kasabov and R.Kozma, eds. Heidelberg, Physica Verlag, 1999.
[Katagiri and Juang 1998] S. Katagiri and B.-H. Juang, “Pattern Recognition using a family of
design algorithms based upon the generalised probabilistic descent method”, invited paper in
Proc. of the IEEE, vol. 86, no. 11, pp. 2345-2373, 1998.
[Katagiri et al. 1991] S. Katagiri, C.-H. Lee and B.-H. Juang, “New discriminative training algorithms based on the generalised descent method”, in Proc. of IEEE Workshop on neural
networks for signal processing, pp. 299-308, 1991.
[Keller et al. 1985] J. M. Keller, M. R. Gray and J. A. Givens, “A fuzzy k-nearest neighbor algorithm”, IEEE Trans. Syst. Man Cybern., vol. SMC-15, no. 4, pp. 580-585, 1985.
[Kewley-Port 1995] Diane Kewley-Port, “Speech recognition”, chapter 9 in Applied Speech Technology, edited by A. Syrdal, R. Bennett and S. Greenspan, CRC Press, Inc, USA, 1995.
[Koo and Un 1990] M. Koo and C. K. Un, “Fuzzy smoothing of HMM parameters in speech recognition”, Electronics Letters, vol. 26, pp. 7443-7447, 1990.
[Kosko 1992] B. Kosko, Neural Networks and Fuzzy Systems, Englewood Cliffs, NJ:Prentice-Hall,
1992.
[Krishnapuram and Keller 1993] R. Krishnapuram and J. M. Keller, “A possibilistic approach to
clustering”, IEEE Trans. Fuzzy Syst., vol. 1, pp. 98-110, 1993.
[Krishnapuram et al. 1992] R. Krishnapuram, O. Nasraoui and H. Frigui, “Fuzzy c-spherical shells
algorithm: A new approach”, IEEE Trans. Neural Networks, vol. 3, no. 5, pp. 663-671, 1992.
[Kulkarni 1995] V. G. Kulkarni, Modeling and analysis of stochastic systems, Chapman & Hall, UK,
1995.
[Kuncheva and Bezdek 1997] L. I. Kuncheva and J. C. Bezdek, “A fuzzy generalised nearest prototype classifier”, in Proc. the 7th IFSA World Congress, Prague, Czech, vol. III, pp. 217-222,
1997.
[Kunzel 1994] H. J. Kunzel, “Current approaches to forensic speaker recognition”, in Proc. ESCA
Workshop on Automatic Speaker Recognition, Identification and Verification, pp. 135-141,
1994.
[Le et al. 1999] T. V. Le, D. Tran and M. Wagner, “Fuzzy evolutionary programming for hidden
Markov modelling in speaker identification”, in Proc. the Congress on Evolutionary Computation 99, Washington DC, pp. 812-815, 1999.
[LIMSI’s web site] http://www.limsi.fr/Recherche/TLP/reco/2pg95-sv/2pg95-sv.html
[C.-H. Lee et al. 1996] C.-H. Lee, F. K. Soong and K. K. Paliwal, Automatic speech and speaker
recognition, Advanced topics, Kluwer Academic Publishers, USA, 1996.
[C.-H. Lee and Gauvain 1996] C.-H. Lee and J.-L. Gauvain, “Bayesian adaptive learning and MAP
estimation of HMM”, chapter 4 in Automatic Speech and Speaker Recognition, Advanced
Topics, edited by Chin-Hui Lee, Frank K. Soong, and Kuldip K. Paliwal, Kluwer Academic
Publishers, USA, pp. 83-108, 1996.
[Lee and Leekwang 1995] K. M. Lee and H. Leekwang, “Identification of λ-Fuzzy Measure by Genetic Algorithms”, Fuzzy Sets Syst., vol. 75, pp. 301-309, 1995.
[K.-F. Lee and Alleva 1991] K.-F. Lee and Fil Alleva, “Continuous speech recognition”, in Advances
in Speech Signal Processing, edited by Sadaoki Furui and M. Mohan Sondhi, Marcel Dekker,
Inc., New York, pp. 623-650, 1991.
[Leszczynski et al. 1985] K. Leszczynski, P. Penczek and W. Grochulski, “Sugeno’s Fuzzy Measure
and Fuzzy Clustering”, Fuzzy Sets Syst., vol. 15, pp. 147-158, 1985.
[Levinson et al. 1983] S. E. Levinson, L. R. Rabiner and M. M. Sondhi, “An introduction to the
application of the theory of Probabilistic functions of a Markov process to automatic speech
recognition”, in The Bell Syst. Tech. Journal, vol. 62, no. 4, 1983, pp 1035-1074, 1983.
[Li and Mukaidono 1999] R.-P. Li and M. Mukaidono, “Gaussian clustering method based on
maximum-fuzzy-entropy interpretation”, Fuzzy Sets and Systems, vol. 102, pp. 253-258, 1999.
[Linde et al. 1980] Y. Linde, A. Buzo and R. M. Gray, “An Algorithm for Vector Quantization”,
IEEE Trans. Communications, vol. 28, pp. 84-95, 1980.
[Liu et al. 1996] C. S. Liu, H. C. Wang and C.-H. Lee, “Speaker Verification using Normalization
Log-Likelihood Score”, IEEE Trans. Speech and Audio Processing, vol. 4, pp. 56-60, 1996.
[Liu et al. 1998] C. Liu, D. B. Rubin and Y. N. Wu, “Parameter Expansion to Accelerate EM: the
PX-EM algorithm”, Biometrika, 1998 (to appear).
[Liu and Rubin 1994] C. Liu and D. B. Rubin, “The ECME algorithm: a simple extension of EM
and ECM with faster monotone convergence”, Biometrika, vol. 81, pp. 633-648, 1994.
[Markov and Nakagawa 1998a] K. P. Markov and S. Nakagawa, “Discriminative training of GMM
using a modified EM algorithm for speaker recognition”, in Proc. Inter. Conf. on Spoken
Language Processing (ICSLP’98), vol. 2, pp. 177-180, Sydney, Australia, 1998.
[Markov and Nakagawa 1998b] K. P. Markov and S. Nakagawa, “Text-independent speaker recognition using non-linear frame likelihood transformation”, Speech Communication, vol. 24, pp.
193-209, 1998.
[Matsui and Furui 1994] T. Matsui and S. Furui, “A new similarity normalisation method for
speaker verification based on a posteriori probability”, in Proc. ESCA Workshop on Automatic Speaker Recognition, Identification and Verification, pp. 59-62, 1994.
[Matsui and Furui 1993] T. Matsui and S. Furui, “Concatenated Phoneme Models for Text Variable
Speaker Recognition”, in Proc. IEEE Inter. Conf. on Acoustics, Speech, and Signal Processing
(ICASSP’93), pp. 391-394, 1993.
[Matsui and Furui 1992] T. Matsui and S. Furui, “Comparison of text-independent speaker recognition methods using VQ-distortion and discrete/continuous HMMs”, in Proc. IEEE Inter.
Conf. on Acoustic, Speech, and Signal Processing (ICASSP’92), San Francisco, pp. II-157-160,
1992.
[Matsui and Furui 1991] T. Matsui and S. Furui, “A text-independent speaker recognition method
robust against utterance variations”, in Proc. IEEE Inter. Conf. on Acoustic, Speech, and
Signal Processing (ICASSP’91), pp. 377-380, 1991.
[McDermott and Katagiri 1994] E. McDermott and S. Katagiri, “Prototype-based MCE/GPD
training for various speech units”, Comp. Speech Language, vol. 8, pp. 351-368, 1994.
[Medasani and Krishnapuram 1998] S. Medasani and R. Krishnapuram, “Categorization of Image
Databases for Efficient Retrieval Using Robust Mixture Decomposition”, in Proc. IEEE Workshop on Content Based Access of Images and Video Libraries, IEEE Conference on Computer
Vision and Pattern Recognition, Santa Barbara, pp. 50-54, 1998.
[Meng and Dyk 1997] X. L. Meng and V. Dyk, “The EM algorithm An old folk song sung to a fast
new tune (with discussion)”, J. Roy. Stat. Soc., Ser. B, vol. 59, pp. 511-567, 1997.
[Meng and Rubin 1993] X. L. Meng and D. B. Rubin, “Maximum likelihood estimation via the
ECM algorithm: a general framework”, Biometrika, vol. 80, pp. 267-278, 1993.
[MIT’s web site] http://www.sls.lcs.mit.edu/sls/
[Millar et al. 1994] J. B. Millar, J. P. Vonwiller, J. M. Harrington and P. J. Dermody, “The Australian National Database of Spoken Language”, in Proc. Inter. Conf. on Acoustic, Speech,
and Signal Processing (ICASSP’94), vol. 1, pp. 97-100, 1994.
[Nadas 1983] A. Nadas, “A decision theoretic formulation of a training problem in speech recognition
and a comparison of training by unconditional versus conditional maximum likelihood”, IEEE
Trans. Signal Processing, vol. 31, no. 4, pp. 814-817, 1983.
[Murofushi and Sugeno 1989] T. Murofushi and M. Sugeno, “An interpretation of Fuzzy Measure
and the Choquet Integral as an Integral with respect to a Fuzzy Measure”, Fuzzy Sets Syst.,
vol. 29, pp. 201-227, 1989.
[Normandin 1996] Y. Normandin, “Maximum mutual information estimation of hidden Markov
models”, chapter 3 in Automatic Speech and Speaker Recognition, Advanced Topics, edited by
Chin-Hui Lee, Frank K. Soong, and Kuldip K. Paliwal, Kluwer Academic Publishers, USA,
pp. 57-82, 1996.
[Western Ontario’s web site] http://www.uwo.ca/nca/
[O’Shaughnessy 1987] Douglas O’Shaughnessy, Speech Communication, Addison-Wesley, USA,
1987.
[Ostendorf et al. 1997] M. Ostendorf, V. V. Digalakis and O. A. Kimball, “From HMM’s to segment
models: A unified view of stochastic modeling for speech recognition”, IEEE Trans. Speech
& Audio Processing, vol. 4, no. 5, pp. 360-378, 1997.
[Ostendorf 1996] M. Ostendorf, “From HMM’s to segment models: stochastic modeling for CSR”,
chapter 8 in Automatic Speech and Speaker Recognition, Advanced Topics, edited by Chin-Hui
Lee, Frank K. Soong, and Kuldip K. Paliwal, Kluwer Academic Publishers, USA, pp. 185-210,
1996.
[Otten and Ginnenken 1989] R. H. J. M. Otten and L. P. P. P. van Ginneken, The Annealing Algorithm, Kluwer, Boston, 1989.
[Owens 1993] F. J. Owens, Signal Processing of Speech, McGraw-Hill, Inc., New York, 1993.
[Pal and Majumder 1977] S. K. Pal and D. D. Majumder, “Fuzzy sets and decision making approaches in vowel and speaker recognition”, IEEE Trans. Syst. Man Cybern., pp. 625-629,
1977.
[Paul 1989] D. B. Paul, “The Lincoln Robust Continuous Speech Recogniser,” Proc. ICASSP 89,
Glasgow, Scotland, pp. 449-452, 1989.
[Peleg 1980] S. Peleg, “A new probability relaxation scheme”, IEEE Trans. Patt. Anal. Mach. Intell., vol. 7, no. 5, pp. 617-623, 1980.
[Peleg and Rosenfeld 1978] S. Peleg and A. Rosenfeld, “Determining compatibility coefficients for
curve enhancement relaxation processes”, IEEE Trans. Syst. Man Cybern., vol. 8, no. 7, pp.
548-555, 1978
[Rabiner et al. 1996] L. R. Rabiner, B. H. Juang and C. H. Lee, “An Overview of Automatic Speech
Recognition”, chapter 1 in Automatic Speech and Speaker Recognition, Advanced Topics,
edited by Chin-Hui Lee, Frank K. Soong, and Kuldip K. Paliwal, Kluwer Academic Publishers,
USA, pp. 1-30, 1996.
[Rabiner and Juang 1993] L. R. Rabiner and B. H. Juang, Fundamentals of speech recognition, Prentice Hall PTR, USA, 1993.
[Rabiner 1989] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications speech
recognition”, in Proc. IEEE, vol. 77, no. 2, pp. 257-286, 1989.
[Rabiner and Juang 1986] L. R. Rabiner and B. H. Juang, “An introduction to hidden Markov
models”, IEEE Acoustic, Speech, and Signal Processing Society Magazine, vol. 3, no. 1, pp.
4-16, 1986.
[Rabiner et al. 1983] L. R. Rabiner, S. E. Levinson and M. M. Sondhi, “On the application of vector
quantisation and hidden Markov models to speaker-independent, isolated word recognition”,
The Bell System Technical Journal, vol. 62, no. 4, 1983, pp 1075-1105, 1983.
[Rabiner and Schafer 1978] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals,
Prentice Hall PTR, USA, 1978.
[Rezaee et al. 1998] M. R. Rezaee, B. P. F. Lelieveldt and J. H. C. Reiber, “A new cluster validity
index for the fuzzy c-means”, Patt. Rec. Lett. vol. 19, pp. 237-246, 1998.
[Reynolds 1995a] Douglas A. Reynolds, “Speaker identification and verification using Gaussian mixture speaker models”, Speech Communication, vol. 17, pp. 91-108, 1995.
[Reynolds 1995b] Douglas A. Reynolds and Richard C. Rose, “Robust text-independent speaker
identification using Gaussian mixture models”, IEEE Trans. Speech and Audio Processing,
vol. 3, no. 1, pp. 72-83, 1995.
[Reynolds 1994] Douglas A. Reynolds, “Speaker identification and verification using Gaussian mixture speaker models”, in Proc. ESCA Workshop on Automatic Speaker Recognition, Identification and Verification, vol. 17, pp. 91-108, 1994.
[Reynolds 1992] Douglas A. Reynolds, A Gaussian Mixture Modeling Approach to Text-Independent
Speaker Identification, PhD thesis, Georgia Institute of Technology, USA, 1992.
[Rosenberg and Soong 1991] A. E. Rosenberg and Frank K. Soong, “Recent research in automatic
speaker recognition”, in Advances in Speech Signal Processing, edited by Sadaoki Furui and
M. Mohan Sondhi, Marcel Dekker, Inc., New York, pp. 701-740, 1991.
[Rosenberg et al. 1992] A. E. Rosenberg, J. Delong, C.-H. Lee, B.-H. Juang and F. K. Soong, “The
use of cohort normalised scores for speaker verification”, in Proc. Inter. Conf. on Spoken
Language Processing (ICSLP’92), pp. 599-602, 1992.
[Ruspini 1969] E. H. Ruspini, “A new approach to clustering”, in Inform. Control, vol. 15, no. 1,
pp. 22-32, 1969.
[Sagayama 1996] S. Sagayama, “Hidden Markov network for precise acoustic modeling”, chapter
7 in Automatic Speech and Speaker Recognition, Advanced Topics, edited by Chin-Hui Lee,
Frank K. Soong, and Kuldip K. Paliwal, Kluwer Academic Publishers, USA, pp. 159-184,
1996.
[Schwartz et al. 1996] R. Schwartz, L. Nguyen and J. Makhoul, “Multiple-pass search strategies”,
chapter 18 in Automatic Speech and Speaker Recognition, Advanced Topics, edited by ChinHui Lee, Frank K. Soong, and Kuldip K. Paliwal, Kluwer Academic Publishers, USA, pp.
429-456, 1996.
[Siu et al. 1992] M.-H. Siu, G. Yu and H. Gish, “An unsupervised, sequential learning algorithm for
the segmentation of speech waveforms with multiple speakers”, in Proc. IEEE Inter. Conf.
on Acoustics, Speech, and Signal Processing (ICASSP’92), pp. I-189-192, 1992.
[Soong et al. 1987] F. K. Soong, A. E. Rosenberg, L. R. Rabiner, and B. H. Juang, “A vector
quantisation approach to speaker recognition”, AT&T Tech. J., vol. 66, pp. 14-26, 1987.
[Syrdal et al. 1995] A. Syrdal, R. Bennett and S. Greenspan, Applied Speech Technology, CRC Press,
Inc, USA, 1995.
[Tanaka and Guo 1999] H. Tanaka and P. Guo, Possibilistic Data Analysis for Operations Research,
Physica-Verlag Heidelberg, Germany, 1999.
[Tran and Wagner 2000a] Dat Tran and Michael Wagner, “Fuzzy Modelling Techniques for Speech
and Speaker Recognition”, the Special Issue on Recognition Technology of the IEEE Transactions on Fuzzy Systems (accepted subject to revision).
[Tran and Wagner 2000b] Dat Tran and Michael Wagner, “Fuzzy Entropy Hidden Markov Models
for Speech Recognition”, submitted to the International Conference on Spoken Language
Processing (ICSLP2000), Beijing, China.
[Tran and Wagner 2000c] Dat Tran and Michael Wagner, “Fuzzy Normalisation Methods for
Speaker Verification”, submitted to the International Conference on Spoken Language Processing (ICSLP2000), Beijing, China.
[Tran and Wagner 2000d] Dat Tran and Michael Wagner, “A Proposed Likelihood Transformation
for Speaker Verification”, the International Conference on Acoustics, Speech & Signal Processing (ICASSP’2000), Turkey (to appear).
[Tran and Wagner 2000e] Dat Tran and Michael Wagner, “Frame-Level Hidden Markov Models”,
the International Conference on Advances in Intelligent Systems: Theory and Applications
(ISTA’2000), Australia (to appear).
[Tran and Wagner 2000f] Dat Tran and Michael Wagner, “Fuzzy Entropy Clustering”, the FUZZIEEE’2000 Conference, USA (to appear).
[Tran and Wagner 2000g] Dat Tran and Michael Wagner, “An Application of Fuzzy Entropy Clustering In Speaker Identification”, in Proceedings of the Joint Conference on Information Sciences 2000 (Fuzzy Theory and Technology Track), vol. 1, pp. 228-231, 2000, Atlantic City,
NJ, USA.
[Tran and Wagner 2000h] Dat Tran and Michael Wagner, “A General Approach to Hard, Fuzzy, and
Probabilistic Models for Pattern Recognition”, the International Conference on Advances in
Intelligent Systems: Theory and Applications (ISTA’2000), Australia (to appear).
[Tran et al. 2000a] Dat Tran, Michael Wagner and Tuan Pham, “Hard Hidden Markov Models for
Speech Recognition”, the 4th World Multiconference on Systemetics, Cybernetics and Informatics/ The 6th Int. Conf. Information Systems Analysis and Synthesis (SCI/ISAS’2000),
Florida, USA (to appear).
[Tran et al. 2000b] Dat Tran, Michael Wagner and Tuan Pham, “Hard Gaussian Mixture Models
for Speaker Recognition”, the 4th World Multiconference on Systemetics, Cybernetics and Informatics/ The 6th Int. Conf. Information Systems Analysis and Synthesis (SCI/ISAS’2000),
Florida, USA (to appear).
[Tran 1999] Dat Tran, “Fuzzy Entropy Models for Speech Recognition”, the first prize of the 1999
IEEE ACT Section Student Paper Contest, in the Postgraduate Division, Australia.
[Tran and Wagner 1999a] Dat Tran and Michael Wagner, “Hidden Markov models using fuzzy estimation”, in Proceedings of the EUROSPEECH’99 Conference, vol. 6, pp. 2749-2752, 1999,
Hungary.
[Tran and Wagner 1999b] Dat Tran and Michael Wagner, “Fuzzy expectation-maximisation algorithm for speech and speaker recognition”, in Proceedings of the 18th International Conference
of the North American Fuzzy Information Society (NAFIPS’99), pp. 421-425, 1999, USA.
[Tran and Wagner 1999c] Dat Tran and Michael Wagner, “Fuzzy hidden Markov models for speech
and speaker recognition”, in Proceedings of the 18th International Conference of the North
American Fuzzy Information Society (NAFIPS’99), pp. 426-430, 1999, USA.
[Tran and Wagner 1999d] Dat Tran and Michael Wagner, “Fuzzy approach to Gaussian mixture
models and generalised Gaussian mixture models”, in Proceedings of the Computation Intelligence Methods and Applications (CIMA’99) Conference, pp. 154-158, 1999, USA.
[Tran and Wagner 1999e] Dat Tran and Michael Wagner, “A robust clustering approach to fuzzy
Gaussian mixture models for speaker identification”, in Proceedings of the Third International
Conference on Knowledge-Based Intelligent Information Engineering Systems (KES’99), pp.
337-340, Adelaide, Australia.
[Tran et al. 1999a] Dat Tran, Michael Wagner, and Tongtao Zheng, “A Fuzzy Approach to Statistical Models in Speech and Speaker Recognition”, in 1999 IEEE international Fuzzy Systems
Conference Proceedings (FUZZ-IEEE’99), vol. 3, pp. 1275-1280, 1999, Korea.
[Tran et al. 1999b] Dat Tran, Michael Wagner and Tongtao Zheng, “Fuzzy nearest prototype classifier applied to speaker identification”, in Proceedings of the European Symposium on Intelligent Techniques (ESIT’99) on CD-ROM, abstract on page 34, 1999, Greece.
[Tran et al. 1999c] Dat Tran, Tuan Pham, and Michael Wagner, “Speaker recognition using Gaussian mixture models and relaxation labeling”, in Proceedings of the 3rd World Multiconference
on Systemetics, Cybernetics and Informatics/ The 5th Int. Conf. Information Systems Analysis and Synthesis (SCI/ISAS99), vol. 6, pp. 383-389, 1999, USA.
[Tran et al. 1999d] Dat Tran, Michael Wagner and Tongtao Zheng, “State mixture modelling applied to speech and speaker recognition”, in a special issue of the Journal of Pattern Recognition Letters (Pattern Recognition in Practice VI), vol. 20, no. 11-13, pp. 1449-1456, 1999.
[Tran 1998] Dat Tran “Hidden Markov models using state distribution”, the first prize of the 1998
IEEE ACT Section Student Paper Contest, in the Postgraduate Division, Australia.
[Tran and Wagner 1998] Dat Tran and Michael Wagner, “Fuzzy Gaussian Mixture Models for
Speaker Recognition”, in Special issue of the Australian Journal of Intelligent Information
Processing Systems (AJIIPS), vol. 5, no. 4, pp. 293-300, 1998.
[Tran et al. 1998a] Dat Tran, T. VanLe and Michael Wagner, “Fuzzy Gaussian Mixture Models for
Speaker Recognition”, in Proceedings of the International Conference on Spoken Language
Processing (ICSLP98), vol. 2, pp. 759-762, 1998, Australia.
[Tran et al. 1998b] Dat Tran, Michael Wagner and T. VanLe, “A proposed decision rules based
on fuzzy c-means clustering for speaker recognition”, in Proceedings of the International
Conference on Spoken Language Processing (ICSLP98), vol. 2, pp. 755-758, 1998, Australia.
[Tran et al. 1998c] Dat Tran, Michael Wagner and Tuan Pham, “Minimum Classifier Error and Relaxation Labelling for Speaker Recognition”, in Proceedings of the Speech computer Workshop,
St Petersburg, (Specom 98), pp. 229-232, 1998, Russia.
[Tran et al. 1998d] Dat Tran, Minh Do, Michael Wagner and T. VanLe, “A proposed decision rule
for speaker identification based on a posteriori probability”, in Proceedings of the ESCA
Workshop (RLA2C98), pp. 85-88, 1998, France.
[Tseng et al. 1987] H.-P. Tseng, M. J. Sabin and E. A. Lee, “Fuzzy vector quantisation applied
to hidden Markov modelling”, in Proc. of the Inter. Conf. on Acoustics, Speech & Signal
Processing (ICASSP’87), pp. 641-644, 1987.
[Tsuboka and Nakahashi 1994] E. Tsuboka and J. Nakahashi, “On the fuzzy vector quantisation
based hidden Markov model”, in Proc. Inter. Conf. on Acoustics, Speech & Signal Processing
(ICASSP’94), vol. 1, pp. 537-640, 1994.
[Upper 1997] D. R. Upper, Theory and algorithms for hidden Markov models and generalised hidden
Markov models, PhD thesis in Mathematics, University of California at Berkeley, 1997.
[Varga and Moore 1990] A. P. Varga and R. K. Moore, “Hidden Markov model decomposition
of speech and noise”, in Proc. Inter. Conf. on Acoustics, Speech & Signal Processing
(ICASSP’90), pp. 845-848, 1990.
[Wagner 1996] Michael Wagner, “Combined speech-recognition speaker-verification system with
modest training requirements”, in Proc. Sixth Australian International Conf. on Speech Science and Technology, Adelaide, Australia, pp. 139-143, 1996.
[Wang 1992] Z. Wang and G. J. Klir, Fuzzy Measure Theory, Plenum Press, 1992.
[Wilcox et al. 1994] L. Wilcox, F. Chen, D. Kimber, and V. Balasubramanian, “Segmentation of
speech using speaker identification”, in Proc. IEEE Inter. Conf. on Acoustics, Speech, and
Signal Processing (ICASSP’94), pp. I-161-164, 1994.
[Windham 1983] M. P. Windham, “Geometrical fuzzy clustering algorithms”, Fuzzy sets Syst., vol.
10, pp. 271-279, 1983.
[Woodland 1997] P. C. Woodland, “Broadcast news transcription using HTK”, in Proc. Inter. Conf.
on Acoustics, Speech & Signal Processing (ICASSP’97), pp. , USA, 1997.
[Wu 1983] C. F. J. Wu, “On the convergence properties of the EM algorithm”, Ann. Stat., vol. 11,
pp. 95-103, 1983.
[Yang and Cheng 1993] M. S. Yang and C. T. Chen, “On strong consistency of the fuzzy generalized
nearest neighbour rule”, Fuzzy Sets Syst., vol. 3, no. 60, pp. 273-281, 1993.
[Zadeh 1995] L. A. Zadeh, “Discussion: probability theory and fuzzy logic are complementary rather
than competitive”, Technometrics, vol. 37, no. 3, pp. 271-276, 1995.
[Zadeh 1994] L. A. Zadeh, “Fuzzy logic, neural networks, and soft computing”, Communications of
the ACM, vol. 37, no. 3, pp. 77-84, 1994.
[Zadeh 1978] L. A. Zadeh, “Fuzzy sets as a basis for a theory of possibility”, Fuzzy Sets and Systems,
vol. 1, no. 1, pp. 3-28, 1978.
[Zadeh 1977] L. A. Zadeh, “Fuzzy sets and their application to pattern classification and clustering
analysis”, Classification and Clustering, edited by J. Van Ryzin, Academic Press Inc, pp.
251-282 & 292-299, 1977.
[Zadeh 1976] L. A. Zadeh, “The linguistic approach and its application to decision analysis”, Directions in large scale systems, edited by Y. C. Ho and S. K. Mitter, Plenum Publishing
Corporation, pp. 339-370, 1976.
[Zadeh 1968] L. A. Zadeh, “Probability measures of fuzzy events”, J. Math. Anal. Appl., vol. 23,
no. 2, pp. 421-427, 1968.
[Zadeh 1965] L. A. Zadeh, “Fuzzy Sets”, Inf. Control., vol. 8, no. 1, pp. 338-353, 1965.
[Zhuang et al. 1989] X. Zhuang, R. M. Haralick and H. Joo, “A simplex-like algorithm for the
relaxation labeling process”, IEEE Trans. Patt. Anal. Mach. Intell., vol. 11, pp. 1316-1321,
1989.
Appendix A
List of Publications
1. Dat Tran, “Fuzzy Entropy Models for Speech Recognition”, the first prize of the 1999 IEEE
ACT Section Student Paper Contest, in the Postgraduate Division, Australia.
2. Dat Tran “Hidden Markov models using state distribution”, the first prize of the 1998 IEEE
ACT Section Student Paper Contest, in the Postgraduate Division, Australia.
3. Dat Tran and Michael Wagner, “Fuzzy Modelling Techniques for Speech and Speaker Recognition”, the Special Issue on Recognition Technology of the IEEE Transactions on Fuzzy Systems
(accepted subject to revision).
4. Dat Tran and Michael Wagner, “Fuzzy Entropy Hidden Markov Models for Speech Recognition”, submitted to the International Conference on Spoken Language Processing (ICSLP’2000), Beijing, China.
5. Dat Tran and Michael Wagner, “Fuzzy Normalisation Methods for Speaker Verification”,
submitted to the International Conference on Spoken Language Processing (ICSLP’2000),
Beijing, China.
6. Dat Tran and Michael Wagner, “A Proposed Likelihood Transformation for Speaker Verification”, the International Conference on Acoustics, Speech & Signal Processing (ICASSP’2000),
Turkey (to appear).
7. Dat Tran and Michael Wagner, “Frame-Level Hidden Markov Models”, the International
Conference on Advances in Intelligent Systems: Theory and Applications (ISTA’2000), Australia (to appear).
8. Dat Tran and Michael Wagner, “A General Approach to Hard, Fuzzy, and Probabilistic
Models for Pattern Recognition”, the International Conference on Advances in Intelligent
Systems: Theory and Applications (ISTA’2000), Australia (to appear).
9. Dat Tran and Michael Wagner, “Fuzzy Entropy Clustering”, the FUZZ-IEEE’2000 Conference, USA (to appear).
10. Dat Tran and Michael Wagner, “An Application of Fuzzy Entropy Clustering In Speaker
Identification”, in Proceedings of the Joint Conference on Information Sciences 2000 (Fuzzy
Theory and Technology Track), vol. 1, pp. 228-231, 2000, Atlantic City, NJ, USA.
11. Dat Tran and Michael Wagner, “Hidden Markov models using fuzzy estimation”, in Proceedings of the EUROSPEECH’99 Conference, vol. 6, pp. 2749-2752, 1999, Hungary.
12. Dat Tran and Michael Wagner, “Fuzzy expectation-maximisation algorithm for speech and
speaker recognition”, in Proceedings of the 18th International Conference of the North American Fuzzy Information Society (NAFIPS’99), pp. 421-425, 1999, USA (Outstanding Student
Paper Award, Top Honor).
13. Dat Tran and Michael Wagner, “Fuzzy hidden Markov models for speech and speaker recognition”, in Proceedings of the 18th International Conference of the North American Fuzzy
Information Society (NAFIPS’99), pp. 426-430, 1999, USA (Outstanding Student Paper
Award, Top Honor).
14. Dat Tran and Michael Wagner, “Fuzzy approach to Gaussian mixture models and generalised
Gaussian mixture models”, in Proceedings of the Computation Intelligence Methods and
Applications (CIMA’99) Conference, pp. 154-158, 1999, USA.
15. Dat Tran and Michael Wagner, “A robust clustering approach to fuzzy Gaussian mixture
models for speaker identification”, in Proceedings of the Third International Conference on
Knowledge-Based Intelligent Information Engineering Systems (KES’99), pp. 337-340, Adelaide, Australia.
16. Dat Tran and Michael Wagner, “Fuzzy Gaussian Mixture Models for Speaker Recognition”,
in Special issue of the Australian Journal of Intelligent Information Processing Systems (AJIIPS), vol. 5, no. 4, pp. 293-300, 1998.
17. Dat Tran, Michael Wagner and T. VanLe, “A proposed decision rules based on fuzzy C-means
clustering for speaker recognition”, in Proceedings of the International Conference on Spoken
Language Processing (ICSLP’98), vol. 2, pp. 755-758, 1998, Australia.
18. Dat Tran, Michael Wagner and Tuan Pham, “Hard Hidden Markov Models for Speech Recognition”, the 4th World Multiconference on Systemetics, Cybernetics and Informatics/ The 6th
Int. Conf. Information Systems Analysis and Synthesis (SCI/ISAS’2000), Florida, USA (to
appear).
19. Dat Tran, Michael Wagner and Tuan Pham, “Hard Gaussian Mixture Models for Speaker
Recognition”, the 4th World Multiconference on Systemetics, Cybernetics and Informatics/
The 6th Int. Conf. Information Systems Analysis and Synthesis (SCI/ISAS’2000), Florida,
USA (to appear).
20. Dat Tran, Michael Wagner and Tuan Pham, “Minimum classifier Error and Relaxation labelling for speaker recognition”, in Proceedings of the Speech computer Workshop, St Petersburg, (Specom’98), pp. 229-232, 1998, Russia.
21. Dat Tran, Michael Wagner and Tongtao Zheng, “State mixture modelling applied to speech
and speaker recognition”, in a special issue of the Journal of Pattern Recognition Letters
(Pattern Recognition in Practice VI), vol. 20, no. 11-13, pp. 1449-1456, 1999.
22. Dat Tran, Michael Wagner, and Tongtao Zheng, “A Fuzzy Approach to Statistical Models
in Speech and Speaker Recognition”, in 1999 IEEE international Fuzzy Systems Conference
Proceedings (FUZZ-IEEE’99), vol. 3, pp. 1275-1280, 1999, Korea.
23. Dat Tran, Michael Wagner and Tongtao Zheng, “Fuzzy nearest prototype classifier applied to
speaker identification”, in Proceedings of the European Symposium on Intelligent Techniques
(ESIT’99) on CD-ROM, abstract on page 34, 1999, Greece.
24. Dat Tran, Tuan Pham, and Michael Wagner, “Speaker recognition using Gaussian mixture
models and relaxation labeling”, in Proceedings of the 3rd World Multiconference on Systemetics, Cybernetics and Informatics/ The 5th Int. Conf. Information Systems Analysis
and Synthesis (SCI/ISAS’99), vol. 6, pp. 383-389, 1999, USA.
25. Dat Tran, T. VanLe and Michael Wagner, “Fuzzy Gaussian Mixture Models for Speaker
Recognition”, in Proceedings of the International Conference on Spoken Language Processing
(ICSLP’98), vol. 2, pp. 759-762, 1998, Australia (paper selected to publish in a special issue
of the Australian Journal of Intelligent Information Processing Systems).
26. Dat Tran, Minh Do, Michael Wagner and T. VanLe, “A proposed decision rule for speaker
identification based on a posteriori probability”, in Proceedings of the ESCA Workshop
(RLA2C’98), pp. 85-88, 1998, France.
27. Tuan Pham, Dat Tran and Michael Wagner, “Optimal fuzzy information fusion for speaker
verification”, in Proceedings of the Computation Intelligence Methods and Applications Conference (CIMA’99), pp. 141-146, 1999, USA.
28. Tuan Pham, Dat Tran and Michael Wagner, “Speaker verification using relaxation labeling”,
in Proceedings of the ESCA Workshop (RLA2C’98), pp. 29-32, 1998, France.
29. Le, T. V., Tran D., and Wagner, M., “Fuzzy evolutionary programming for hidden Markov
modelling in speaker identification”, in Proceedings of the Congress on Evolutionary Computation (CEC’99), Washington DC, pp. 812-815, July 1999.