Evaluation of a speaker identification system with and without fusion using three databases in the presence of noise and handset effects
EURASIP Journal on Advances in Signal Processing volume 2017, Article number: 80 (2017)
Abstract
In this study, a speaker identification system is considered consisting of a feature extraction stage which utilizes both power normalized cepstral coefficients (PNCCs) and Mel frequency cepstral coefficients (MFCC). Normalization is applied by employing cepstral mean and variance normalization (CMVN) and feature warping (FW), together with acoustic modeling using a Gaussian mixture model-universal background model (GMM-UBM). The main contributions are comprehensive evaluations of the effect of both additive white Gaussian noise (AWGN) and non-stationary noise (NSN) (with and without a G.712 type handset) upon identification performance. In particular, three NSN types with varying signal to noise ratios (SNRs) were tested corresponding to street traffic, a bus interior, and a crowded talking environment. The performance evaluation also considered the effect of late fusion techniques based on score fusion, namely, mean, maximum, and linear weighted sum fusion. The databases employed were TIMIT, SITW, and NIST 2008; and 120 speakers were selected from each database to yield 3600 speech utterances. As recommendations from the study, mean fusion is found to yield overall best performance in terms of speaker identification accuracy (SIA) with noisy speech, whereas linear weighted sum fusion is overall best for original database recordings.
1 Introduction
Speaker identification is one important application of biometrics and forensics to identify speakers based on their unique voice pattern [1–3]. According to [4], feature extraction within speaker identification should be less influenced by noise or the person’s health. However, to improve the speaker identification accuracy (SIA), Mel frequency cepstral coefficients (MFCC) features were fused with inverse MFCC features (IMFCC) in [5], but the approach was limited by the number of GMM components. An overview of speaker identification was presented in [6] and increasing the number of speakers and using different types of realistic non-stationary noise (NSN) in evaluation was suggested to develop the field along with exploiting fusion techniques. Nakagawa et al. [7] proposed combining phase information with MFCC features to improve speaker identification. Despite this research, recognition rate is still a subject of focus. Murty and Yegnanarayana [8] elucidate improvements in a speaker verification system by combining the residual phase derived from linear prediction analysis of the speech signal with the spectral MFCC features. In addition, the National Institute of Standards and Technology (NIST) 2003 database [8] was used; a 14% equal error rate (EER) performance was achieved for MFCC and a 22% rate for the residual phase. Although the combination was better than the individual features alone, the system was not subjected to realistic noise conditions and channel variabilities. Similar to this approach, Wang et al. [9] used a linear weighted sum for the score fusion but the work did not consider noise, and likewise in [10] channel distortion seems to have been ignored. In [11], different feature combinations were presented using MFCC and linear prediction cepstrum coefficients (LPCC) to improve the recognition rate. However, a limited number of speakers was used, only digit speech was employed, and the system was only tested in ideal conditions.
Bhardwaj et al. [12] presented three scenarios for speaker identification, exploiting the generalized fuzzy model (GFM). However, the identification rate using the NIST 2003 database was poor. In [13], approximately 1000 speakers were selected and recordings were made, including in an acoustics room, with noise, and with varying microphone distance. However, the conditions were perhaps unfair and a non-standard database (derived from YouTube) was used. In addition, the tested system performed best with the Texas Instruments and Massachusetts Institute of Technology (TIMIT) database, with a reduction of 10% for the NIST 2002 database, and approximately 30% with the telephone bandwidth version of TIMIT, or Network TIMIT (NTIMIT). However, the system was not evaluated under different environmental noise conditions. In [14], a mean clustering approach was proposed for GMM speaker models, but the time complexity of the log-likelihood calculation was a bottleneck for the testing phase. The system achieved highest performance with TIMIT, with 10 and 30% reductions for the NIST 2002, and NTIMIT databases, respectively. Again, however, the system was not evaluated under different environmental noise conditions.
In another study, fuzzy clustering was presented in [15], which employed hierarchical tree decisions for speaker identification. The study involved 3805 speakers subjected to AWGN, and it was also noted that the system could be improved using fusion; however, no tests for realistic noise were conducted. In [16], both the NIST 2008 and TIMIT databases were employed to achieve robust speaker identification and mitigate room reverberation and additive noise, but again handset effects were ignored. Also, to accomplish robust speaker identification, Li and Huang [17] employed Cochlear filter cepstral coefficients (CFCCs) and used the NTIMIT and Speech Separation Challenge databases, although fusion can also be used to enhance the identification performance. Various neural network-based approaches were proposed in [18], without considering different noise and handset conditions. Furthermore, other researchers have employed deep neural network (DNN) analysis for speaker identification [19]. In [20], the authors selected 100 speakers from the TIMIT and self-collected databases using novel fuzzy vector quantization (NFVQ) techniques to enhance the speaker identification system (SIS). However, increasing the number of speakers reduced the recognition rate, and there was no testing under realistic noise and channel distortion conditions. Moreover, [21] produced a multi-modal neural network by exploiting wavelet analysis, without testing for noise and channel effects and only using 34 speakers. Other researchers have focused on speaker identification and verification applications with background noise to improve and create robust speaker recognition [22]. Khanteymoori et al. [23] utilized a dynamic Bayesian network (DBN) to model speakers and improve identification compared with GMMs, but a limited number of speakers was used. 
Furthermore, a discriminative likelihood score weighting technique was proposed for the speaker identification task in [24]. In [25], a state of the art speech recognition system was exploited for noisy environments and reverberation. In addition, an empirical study was presented by Reynolds [26], which included handset variability effects in speaker recognition using the Switchboard corpus. On the other hand, Reynolds et al. [27] focused on two issues in the speaker identification task, the size of the population and the degradation produced by the noisy telephone channel; their study used the TIMIT and the NTIMIT databases. However, only a limited number of studies have involved a handset, AWGN, and NSN types in conjunction with fusion strategies. In this work we extend our previous work in [28, 29] with four combinations of features and their score fusion methods for the original recordings, and with AWGN and three types of NSN (street traffic, bus interior, and crowd talk), with and without the G.712 type handset at 16 kHz, to provide a wide range of environmental noise conditions. We emphasize that, although the GMM-UBM approach is well established, no previous study has comprehensively considered three databases, one of which only appeared in 2016, nor the effect of such a wide range of NSN and handset effects.
Section 2 contextualises robust biometric speaker identification; Section 3 describes adding the noise and applying the handset; Section 4 explains the databases and simulation setup; Section 5 presents the simulation results and discussion; Section 6 includes comparisons with related work; Section 7 presents conclusions and future work.
2 An overview of a robust biometric speaker identification system
The main system used in this paper is represented in Fig. 1. The figure has three sections: feature extraction and normalization, speaker modeling and matching, and fusion strategies; it also shows test signals.
2.1 Feature extraction and compensation
In our work, to mimic human ear perception, MFCC features are used [30] and combined with the corresponding power normalized cepstral coefficient (PNCC) features presented for speech recognition systems; these provide robustness [31] and are expected to improve SIA in the presence of background noise. A feature dimension of 16 was used to mirror our work in [29, 32], which used both MFCC and PNCC. The MFCC features included the zero order \(C_{0}\) coefficient and the PNCC features included the \(Pc_{0}\) coefficient. A pre-emphasis finite impulse response (FIR) filter realizing a first order high-pass filter was employed to filter the speech samples with emphasis coefficient 0.96 [5]. In addition, framing and Hamming windowing were employed with a frame length of 16 ms and an inter-frame overlap of 8 ms [33]. Moreover, this work exploits a triangular/Mel filter bank (MFB) and the logarithmic non-linearity used in MFCC [34], as well as the Gammatone filter bank (GFB) and power law non-linearity for PNCC [31, 35, 36]. We focus on using the PNCC by exploiting the GFB to improve SIA in the presence of stationary AWGN and NSN background noise. Temporal masking, asymmetric noise suppression (ANS), power law non-linearity with a 1/15 exponent, and the GFB were the main elements in the PNCC construction. Further information about PNCC features is provided in [32, 37, 38]. Feature compensation (normalization) is widely and effectively used for speaker verification and identification tasks. The main aims of using normalization are to reduce the effects of noise, channel, and handset transducers and to alleviate linear and non-linear channel effects. In this study, feature warping (FW) and cepstral mean and variance normalization (CMVN) over a sliding window are used [39, 40] to reduce the noise and handset effects and mitigate linear channel effects; this gives improvements and robustness to SIA [6].
The features and feature normalization are as employed in [29].
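The front-end steps described above (pre-emphasis, framing, and Hamming windowing) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, a 16 kHz sampling rate is assumed, and the hop is taken as the frame length minus the stated overlap (16 ms - 8 ms = 8 ms).

```python
import math

def pre_emphasis(signal, coeff=0.96):
    """First-order high-pass FIR filter: y[n] = x[n] - coeff * x[n-1]."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - coeff * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_and_window(signal, sample_rate=16000, frame_ms=16, hop_ms=8):
    """Split a signal into overlapping frames and apply a Hamming window.

    Assumes 16 kHz audio, so a 16 ms frame is 256 samples and an
    8 ms hop is 128 samples. Trailing samples that do not fill a
    complete frame are dropped.
    """
    frame_len = sample_rate * frame_ms // 1000
    hop = sample_rate * hop_ms // 1000
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + n] * window[n] for n in range(frame_len)])
    return frames
```

Under these assumptions, a fixed-length utterance of 129,250 samples yields 1008 frames of 256 samples each. The filter banks, non-linearities, and normalization stages are not covered by this sketch.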
2.2 Speaker modeling and matching
2.2.1 Gaussian mixture model (GMM)
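In GMMs, each speaker can be represented by the multivariate parameters of the Gaussian components, namely, the means, covariances, and mixture weights. The weighted sum of the M component densities is called a Gaussian mixture density, as represented in Eqs. (1) and (2) in [29]:
\[ p(\boldsymbol{x} \mid \lambda_{j}) = \sum_{i=1}^{M} \omega_{i}\, b_{i}(\boldsymbol{x}) \qquad (1) \]
where j=1,…,S and S is the number of speakers, \(\omega_{i}\) is the i-th mixture weight, and
\[ b_{i}(\boldsymbol{x}) = \frac{1}{(2\pi)^{D/2} |\Sigma_{i}|^{1/2}} \exp \left\{ -\frac{1}{2} (\boldsymbol{x} - \mu_{i})^{T} \Sigma_{i}^{-1} (\boldsymbol{x} - \mu_{i}) \right\} \qquad (2) \]
where \(\boldsymbol{x}\) is a D-dimensional random feature vector, \(b_{i}(\boldsymbol{x})\) is the i-th component density, and M is the number of Gaussian mixture components. The parameter set for each speaker model is \(\lambda_{j} = \{\omega_{i}, \mu_{i}, \Sigma_{i} \mid i=1,\ldots,M\}\); \(\mu_{i}\) and \(\Sigma_{i}\) are, respectively, the mean and covariance parameters of the i-th component density and \((\cdot)^{T}\) denotes the transpose operator. In this paper, we used nodal, diagonal covariance matrices instead of full covariance matrices, as used in [6, 29]. In speaker modeling, the expectation maximization (EM) method estimates the parameters for each mixture.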
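To make the mixture density described above concrete, the following sketch evaluates the log density of a diagonal-covariance GMM for one feature vector and sums it over the frames of an utterance. It is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, each covariance is stored as a list of per-dimension variances, and frames are treated as independent.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a D-dimensional Gaussian with diagonal covariance."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2 * math.pi) + log_det + maha)

def log_gmm_density(x, weights, means, variances):
    """log of sum_i w_i * N(x; mu_i, Sigma_i), using the log-sum-exp trick."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def utterance_log_likelihood(frames, weights, means, variances):
    """Sum of per-frame log densities (frames assumed independent)."""
    return sum(log_gmm_density(x, weights, means, variances) for x in frames)
```

Working in the log domain avoids numerical underflow when many frames and mixture components are involved.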
2.2.2 Gaussian mixture model-universal background model (GMM-UBM)
A Gaussian mixture model-universal background model (GMM-UBM) was used as in [29] and was trained offline with a large amount of data through EM. Furthermore, maximum a posteriori (MAP) adaptation was employed to train the individual speaker models; this adaptation was initialized by the UBM and then coupled with the training data for each speaker. The coupling between large training data (UBM) and a small amount of class-specific data (individual speaker models) makes the GMM-UBM able to estimate a larger number of parameters, which increases the mixture size dimension, and thus the SIA. As in our previous work [28], adaptation coefficients are used in the learning of the means, weights, and variances of the GMM models, which can be represented by \(\alpha _{i}^{m}, \alpha _{i}^{w}, \alpha _{i}^{v}\), respectively, where i=1,…,S. The parameters and adaptation coefficients used in the paper are as follows: for the initial UBM training, finaliter=20; for the MAP adaptation, the relevance factor \(r^{\rho } = 10\), \(\rho \in \{m, w, v\}\); Nmix \(\in \{8, 16, 32, 64, 128, 256, 512\}\); \(d_{s} = 1\); and \(\alpha _{i}^{\rho } \in [0,1]\) is calculated as \(\frac {n_{i}}{n_{i}+r^{\rho }}\), where \(n_{i}=\sum _{t=1}^{T_{\mathcal {F}}} \mathbf {Pr}(i{\mid \boldsymbol {x}}_{t})\) and \(T_{\mathcal {F}}\) is the number of feature vectors. Here, Nmix is the number of Gaussian components, \(d_{s}\) is the feature sub-sampling factor (every \(d_{s}\)-th frame is used), and finaliter is the number of expectation maximization (EM) iterations. More details of the parameters and how they are used in the adaptation of speaker models can be found in [28, 41].
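The adaptation coefficient defined above can be illustrated with a short sketch. The soft count and the coefficient follow the expressions given in the text; the mean update shown is the standard GMM-UBM MAP form and is an assumption here, since the exact update rules are deferred to [28, 41]. All function names are hypothetical.

```python
def soft_counts(posteriors):
    """n_i = sum over frames t of Pr(i | x_t).

    posteriors[t][i] is the posterior of mixture i for frame t.
    """
    n_mix = len(posteriors[0])
    return [sum(frame[i] for frame in posteriors) for i in range(n_mix)]

def adaptation_coefficient(n_i, relevance=10.0):
    """alpha_i = n_i / (n_i + r), which always lies in [0, 1)."""
    return n_i / (n_i + relevance)

def adapt_mean(ubm_mean, data_mean, alpha):
    """Assumed MAP mean update: alpha * E_i[x] + (1 - alpha) * mu_ubm."""
    return [alpha * d + (1.0 - alpha) * u for d, u in zip(data_mean, ubm_mean)]
```

With a relevance factor of 10, a mixture that accumulates a soft count of 10 frames gets a coefficient of 0.5, so its adapted mean sits halfway between the UBM mean and the speaker's data mean; mixtures that see little speaker data stay close to the UBM.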
2.2.3 Maximum log-likelihood scores
Matching between the models built during training and the evaluation datasets was carried out using log-likelihood ratios (LLRs). In our evaluation studies, 120 speakers were selected from each database. Each speaker has 10 speech utterances; 6 were employed for training, while the remaining 4 were used for testing. In total, 720 utterances were used for training (6 training files for each of the 120 speakers = 6 × 120) and 480 speech utterances were used for testing (4 tests for each of the 120 speakers = 4 × 120). The model-test set, of length 57,600, represents the product of the 120 models and the 480 tests (120 × 480). The log-likelihood ratios were calculated as in [29].
where X contains the corresponding \(T_{\mathcal {F}}\) feature vectors, \(X = [\boldsymbol {x}_{1},\ldots,\boldsymbol {x}_{T_{\mathcal {F}}}]\). Four sets of LLRs were found based on feature and normalization types as described in the next section. A maximum likelihood approach was used to identify speakers as a final decision, as in [6, 42].
The SIA can be calculated as in Eq. (4) [5, 43]:
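A minimal sketch of the decision rule and accuracy computation described above, assuming that each test utterance is assigned to the speaker model with the highest score and that the SIA is the percentage of test utterances assigned to the correct speaker. The function names are hypothetical; the counts come from the partitioning stated in the text.

```python
N_SPEAKERS = 120
TRAIN_PER_SPEAKER = 6
TEST_PER_SPEAKER = 4

train_utterances = N_SPEAKERS * TRAIN_PER_SPEAKER   # 720
test_utterances = N_SPEAKERS * TEST_PER_SPEAKER     # 480
model_test_pairs = N_SPEAKERS * test_utterances     # 57,600

def identify(scores):
    """Pick the best-scoring speaker model for each test utterance.

    scores[t][s] is the score of test utterance t against model s.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in scores]

def speaker_identification_accuracy(predicted, truth):
    """Percentage of test utterances assigned to the correct speaker."""
    correct = sum(1 for p, t in zip(predicted, truth) if p == t)
    return 100.0 * correct / len(truth)
```

With 480 test utterances per database, each additional misidentified utterance lowers the accuracy by 100/480, roughly 0.21 percentage points.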
2.3 Fusion strategies
Three methods to form a late fusion score were employed as in [29]: weighted sum, maximum, and mean fusion. The two normalization methods were applied to the MFCC features to produce FWMFCC and CMVNMFCC; likewise, they were applied to the PNCC features to produce FWPNCC and CMVNPNCC. Four sets of score vectors could therefore be calculated and are denoted as [28, 29]: \(f_{1}\) = feature warping MFCC scores vector (FWMFCC), \(f_{2}\) = CMVN MFCC scores vector (CMVNMFCC), \(g_{1}\) = feature warping PNCC scores vector (FWPNCC), and \(g_{2}\) = CMVN PNCC scores vector (CMVNPNCC). The maximum fusion of these score vectors adopted the row-wise maximum as in Eq. (5):
\[ \mathrm{fmax}_{ij} = \max \left( f_{i}, g_{j} \right) \qquad (5) \]
where \(\mathrm{fmax}_{ij}\) represents the score vectors for the maximum fusion.
Likewise, mean fusion is presented as:
\[ \mathrm{fmean}_{ij} = \frac{f_{i} + g_{j}}{2} \qquad (6) \]
where \(\mathrm{fmean}_{ij}\) denotes the score vectors for the mean fusion.
In addition, a linear weighted sum score fusion takes the form:
\[ \mathrm{fweight}_{ij} = \omega_{\beta}\, f_{i} + (1 - \omega_{\beta})\, g_{j} \qquad (7) \]
where both i and j take the values 1 and 2; therefore \(\mathrm{fweight}_{ij}\) takes one of four values, \(\mathrm{fweight}_{11}\), \(\mathrm{fweight}_{12}\), \(\mathrm{fweight}_{21}\), and \(\mathrm{fweight}_{22}\), where \(\mathrm{fweight}_{11}\) is the linear combination of \(f_{1}\) and \(g_{1}\), \(\mathrm{fweight}_{12}\) is the linear combination of \(f_{1}\) and \(g_{2}\), and so on. For each \(\mathrm{fweight}_{ij}\), \(\omega_{\beta}\) can take on one of four values, namely, \(\omega_{\beta} \in \{0.9, 0.8, 0.77, 0.7\}\), chosen empirically to give the best SIA. We limit \(\omega_{\beta}\) to these four values as lower values have been found to be unsuitable for high SIA performance, because MFCC coefficients are more important in the speaker identification task with clean speech. Further details of the fusion strategies can be found in [44, 45].
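The three late fusion rules described above can be sketched element-wise on a pair of score vectors. This is an illustration only: the function names are hypothetical, and it assumes the MFCC-based vector f and the PNCC-based vector g have the same length and are on comparable scales.

```python
def fuse_max(f, g):
    """Maximum fusion: element-wise (row-wise) maximum of the two scores."""
    return [max(a, b) for a, b in zip(f, g)]

def fuse_mean(f, g):
    """Mean fusion: element-wise average of the two scores."""
    return [(a + b) / 2.0 for a, b in zip(f, g)]

def fuse_weighted(f, g, weight):
    """Linear weighted sum: weight * f + (1 - weight) * g.

    The text restricts the weight to one of 0.9, 0.8, 0.77, or 0.7,
    so the MFCC-based scores always dominate.
    """
    return [weight * a + (1.0 - weight) * b for a, b in zip(f, g)]
```

For example, `fuse_weighted(f1, g1, 0.9)` weights the FWMFCC scores by 0.9 and the FWPNCC scores by 0.1; the identified speaker is then the one with the highest fused score.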
3 Adding noise and applying the G.712 type handset
3.1 Adding stationary AWGN and non-stationary noise
Non-stationary noise (NSN) recordings available online from the websites [46, 47] were used to test the system. Both AWGN and NSN were trimmed to the same fixed length of 129,250 samples (8 s). The different background noise types, as well as AWGN, were added only in the testing phase at seven SNR levels from 0 to 30 dB in steps of 5 dB, set according to the corresponding noise power, as in [29].
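One common way to mix noise into speech at a prescribed SNR is sketched below. It assumes the SNR is defined as the ratio of average speech power to average noise power over the whole utterance, which may differ in detail from the procedure in [29]; the function names are hypothetical and the Gaussian noise generator stands in for both AWGN and recorded NSN.

```python
import math
import random

SNR_LEVELS_DB = list(range(0, 31, 5))  # seven levels: 0, 5, ..., 30 dB

def average_power(x):
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in x) / len(x)

def white_gaussian_noise(n_samples, seed=0):
    """Zero-mean, unit-variance Gaussian samples (seeded for repeatability)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def add_noise_at_snr(speech, noise, snr_db):
    """Scale the noise so that 10*log10(P_speech / P_noise) = snr_db, then add it."""
    target_noise_power = average_power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / average_power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Because the noise is scaled relative to the measured speech power, the same routine applies to any noise recording that has been trimmed to the utterance length.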
3.2 G.712 type handset
A G.712 type handset at 16 kHz with a fourth order linear IIR filter was derived from the Z transform multiplication of two second order cascaded filters as previously exploited in [6]. We applied the G.712 handset to the normalized speech signal for both training and testing phases as employed in [29]. The main reason for applying and testing this channel distortion was to achieve robust SIA under clean, AWGN noisy speech, and realistic NSN conditions. The transfer function of the IIR filter in the z-domain is given as:
where the numerator parameters are [1, −0.0216047, −1.92904276, −0.0216047, 1] and the denominator parameters are [1, −0.2288945, −1.29745904, 0.06100624, 0.57315888].
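As an illustrative sketch, the coefficients above can be applied to a signal with a direct-form difference equation. This assumes both coefficient lists are ordered by increasing powers of \(z^{-1}\), with the leading denominator coefficient normalized to 1; the names `B`, `A`, and `iir_filter` are hypothetical and this is not the authors' implementation.

```python
# Fourth-order IIR coefficients as listed in the text
# (assumed order: z^0, z^-1, ..., z^-4).
B = [1.0, -0.0216047, -1.92904276, -0.0216047, 1.0]
A = [1.0, -0.2288945, -1.29745904, 0.06100624, 0.57315888]

def iir_filter(x, b=B, a=A):
    """Apply the filter H(z) = B(z) / A(z) to the sequence x.

    Implements a[0]*y[n] = sum_k b[k]*x[n-k] - sum_{k>=1} a[k]*y[n-k],
    with zero initial conditions.
    """
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc / a[0])
    return y
```

Calling `iir_filter` on a unit impulse returns the first samples of the filter's impulse response, which is a quick way to check the coefficient ordering against a reference implementation.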
4 Databases and simulation setups
4.1 Databases
4.1.1 TIMIT acoustic-phonetic continuous speech corpus-1993
The TIMIT database is one of the most familiar and widespread speech corpora used for speech recognition [6] and is available online at the Linguistic Data Consortium website [48]. This corpus has 630 speakers recorded in 8 main dialects of American English. In this work, 120 speakers were selected from dialect regions 1 and 4 to mirror the work in [5] and our previous study in [29]. Each speaker has 10 speech utterances; 6 were used for training and 4 for testing. A fixed speech length of 129,250 samples (8 s) was adopted for all 1200 speech utterances of the 120 speakers; concatenation was used when necessary.
4.1.2 The Speakers in the Wild (SITW) speaker recognition challenge 2016
This challenging database was collected to encourage researchers to develop novel algorithms for benchmarking speaker recognition technology and is available at [49]. The SITW database was collected under different challenging conditions for open source media: clean interview, outdoor conditions, stadium conditions, and red carpet interviews for single and multi-speakers. In the current study, we selected 120 speakers; most were single speakers, but some were unbalanced multi-speakers. In this case, the target speaker was selected so as to obtain a single speaker, using Goldwave and Audacity software. In addition, we divided each speech file into 10 equal lengths, with a fixed length (129,250 samples), to mirror our previous work. However, speech files of less than 8 s were concatenated to achieve the same fixed length. Six files were used for training and four for testing.
4.1.3 2008 NIST speaker recognition evaluation training set part 2-2011
The database is available at [50], and its sources are multilingual telephone and microphone speech of native and bilingual English interview speakers. We converted the sampling frequency from the original 8 to 16 kHz, and 120 English only microphone channel speakers were selected for comparison with the TIMIT and the SITW databases. Again, we selected only single speakers by deleting the interviewers and created six training files and four testing utterances with a fixed length of 8 s.
4.2 Simulation setups
Six main simulations were performed utilizing the TIMIT, SITW, and NIST 2008 databases. Simulation one tested the system without additional noise and handset effects, while simulation two evaluated noisy speech with both AWGN and the G.712 type handset at 16 kHz. Simulations 3–5 employed street traffic, a bus interior, and crowd talk NSN, with handset at 16 kHz, respectively. In simulation 6, we created PRSIA to measure the reduction caused by noise and handset effects. Table 1 explains the parameters used in the simulations for the three databases, as well as system details, conditions, databases, and methods.
5 Simulation results and discussion
In this section, the simulations will be considered in two groups, A and B. Part A includes the five simulations using the three databases: original speech recordings, AWGN with handset, street NSN with handset, bus NSN with handset and crowd talking NSN with handset, respectively. Part B includes further examination of the effects of noise and handset on SIA based on features and fusion methods.
In part A, simulation 1 shows the effect of the number of Gaussian mixture components (GMCs), namely {8, 16, 32, 64, 128, 256, 512}, upon SIA for speech utterances from the three databases, without noise or a handset. All other simulations in part A were on noisy speech, with seven SNR levels between 0 and 30 dB for the same databases at mixture size 256. This noisy speech included the G.712 type handset at 16 kHz under AWGN and three NSN types: street traffic, bus interior, and crowd talking.
In part B, PRSIA is used to give further quantitative perspective on each feature type (without fusion) and each fusion technique. In general, all simulations for parts A and B present the SIA for the four feature combinations based on MFCC and PNCC: FWMFCC, CMVNMFCC, FWPNCC, and CMVNPNCC. The scores for one of the MFCC features (FWMFCC (\(f_{1}\)) or CMVNMFCC (\(f_{2}\))) were fused with the scores for one of the PNCC features (FWPNCC (\(g_{1}\)) or CMVNPNCC (\(g_{2}\))), choosing the pair that gives the best SIA.
In Tables 2, 3, 4, 5, and 6, the row corresponding to the fusion decision defines which f and g vectors yield the highest SIA, and therefore only two score vectors were fused. For example, for \(\mathrm{fweight}_{ij}\), i equal to 1 or 2 means that either \(f_{1}\) or \(f_{2}\) is included, and j equal to 1 or 2 means that either \(g_{1}\) or \(g_{2}\) is used, respectively. When the fusion decision is given as \(f_{1}\)−\(g_{1}\) and \(\omega_{\beta}\) equals 0.9, then \(\mathrm{fweight}_{11} = 0.9 \times f_{1} + 0.1 \times g_{1}\). The selection is based upon achieving the highest SIA. Furthermore, mixture sizes of 1024 and 2048 are not considered in this work, because there are insufficient data for training; utilizing these mixture sizes causes a decline in the SIA performance.
5.1 Simulations and experiments for part A
In all experiments of parts A and B, the GMM-UBM is trained and tested using 120 speakers (1200 speech utterances split into 720 for training and 480 for testing) from the TIMIT database in order to produce the SIA for TIMIT. Likewise, the same partitioning method for training and testing, and the same number of speakers, were applied to the two additional databases, SITW and NIST 2008.
5.1.1 Evaluation of speech data from TIMIT, SITW, and NIST 2008 without handset and noise (part A)
In this subsection, Table 2 shows the relationship between SIA and GMCs for the three databases according to feature combinations (without fusion), based on MFCC and PNCC features; various fusion schemes are also considered. According to Table 2, the best SIA values were achieved using the same fusion decision (\(f_{1}\)−\(g_{2}\)) for all three databases: 95.83% at mixture size 64 for NIST 2008, 95% at mixture size 512 for TIMIT, and 82.5% at mixture size 512 for SITW. The best SIAs for the TIMIT and NIST 2008 databases were obtained with weighted sum fusion and \(\omega_{\beta}\) equal to 0.9, while for the SITW database the best SIA was also obtained with weighted sum fusion but with \(\omega_{\beta}\) equal to 0.7. Additionally, from the results of simulation 1 in Table 2, we formed the plots in Fig. 2 to support further analysis and discussion. In Fig. 2, we selected the highest SIA at each mixture size, regardless of feature type (without fusion) or fusion method, for the TIMIT, SITW, and NIST 2008 databases. On this basis, we made the following observations.
Firstly, increasing the number of GMCs generally increases the SIA for all databases, as in simulations 1 A, 1 B, and 1 C, except at mixture size 64 for the NIST 2008 database, which obtains better SIA than the other mixture sizes. This is because the GMM-UBM system was trained on a large number of speakers through the UBM, and individual speaker models were adapted through the GMMs. This coupling increases the dimensionality of the GMC to cover all speakers. Hence, this generally improves the SIA.
Secondly, the NIST 2008 evaluation, which is represented by the violet curve in Fig. 2 attained the best SIA performance followed by the red curve for the TIMIT database. In contrast, the evaluation of the SITW database (blue curve) has the lowest SIA performance, as expected, most probably due to the wild and challenging environments compared to the semi-ideal TIMIT database and the less challenging conditions of NIST 2008.
Finally, in Fig. 2 the NIST 2008 database curve has the smallest variation between its highest SIA (at mixture size 64) and its lowest SIA (at mixture size 8). The second smallest variation is for the SITW database, and the largest variation was obtained with the TIMIT database. The main reason for this is that TIMIT is clean speech (an ideal database as described by [6]), so the highest SIA was achieved with the largest mixture size (512), which gives very accurate modeling, whereas modeling with the smallest mixture size (8) was not very accurate, thereby giving the lowest SIA. On the other hand, for the other databases, which do not contain clean speech, such accurate speech modeling is not possible and therefore less variation in SIA as a function of mixture size is generally observed.
5.1.2 Evaluation of noisy speech data from TIMIT, SITW, and NIST 2008 with handset and noise (part A)
This subsection is represented by Tables 3, 4, 5, and 6, which show the evaluation of TIMIT, SITW, and NIST 2008 for noisy speech with handset using different background noises: AWGN, street traffic NSN, bus interior NSN, and crowd talking NSN, respectively.
In addition, the handset used in all simulations was the G.712 type handset at 16 kHz. From time-frequency analysis of the three types of NSN, we observed that street traffic and crowd talking noise have broad spectra and therefore have a similar effect to AWGN. On the other hand, the dominant energy of the bus interior noise is at low frequencies and it therefore has the least effect on the speech when it is added. Therefore, for AWGN, street traffic, and crowd talking noise, we only consider the reduction in SIA performance between 30 and 10 dB, whereas for bus interior noise we consider the range between 30 and 0 dB. From Tables 3, 4, 5, and 6, the highest SIA results are selected for each SNR level regardless of feature type (without fusion) or fusion method. These results are shown in Fig. 3.
Firstly, for AWGN and the G.712 type handset (Table 3), the bar charts in Fig. 3a can be used to analyze the results. The figure shows that the SIA fell from 75.83% at 30 dB to 7.5% at 10 dB for the TIMIT database, while for SITW it fell from 78.33% at 30 dB to 25% at 10 dB. In contrast, NIST 2008 had the lowest SIA of all databases at 30 dB, 26.67%, which was reduced to 3.33% at 10 dB. All databases were therefore affected by stationary noise with a constant spectrum profile. The particular sensitivity to such noise when applied to the NIST 2008 database may be due to the natural characteristics of the interview speech.
Secondly, for street traffic NSN with handset (Table 4), Fig. 3b shows that the SIA fell from 90% at 30 dB to 31.67% at 10 dB for the TIMIT database. Similarly, the SIA obtained with the NIST 2008 database fell from 80 to 15%. In contrast, the lowest reduction in accuracy was attained using the SITW database, with the SIA dropping from 81.67% at 30 dB to 46.88% at 10 dB. As a consequence, the SITW database has the lowest reduction in SIA of the three databases used for the evaluation.
Thirdly, for the bus interior NSN (Table 5), Fig. 3c illustrates that the SIA fell from 91.67% at 30 dB to 56.67% at 0 dB for the TIMIT database. Likewise, for the SITW database the SIA fell from 80.83 to 66.67% at 30 and 0 dB, respectively. However, the highest reduction in SIA was for the NIST 2008 database, from 92.5% at 30 dB to 22.5% at 0 dB.
Finally, the results in Table 6 and Fig. 3d show that the evaluation of the crowd talking NSN with the handset was similar to that of the street NSN. For the TIMIT database, the SIA fell from 90% at 30 dB to 39.17% at 10 dB. Similarly, the figures for the NIST 2008 database were 84.17% at 30 dB and 30% at 10 dB. In contrast, for the SITW database, the SIA fell from 82.5 to 53.33%. Considering the reduction in SIA for all simulations as a result of noise and handset effects, the most important issue is the relative sensitivities of the various methods to the environments. To address this point, we consider further comparative analysis.
5.2 Simulations and experiments for part B
In this study, based on the feature types (using four feature combinations without fusion) and fusion methods, the quantitative perspectives were measured by calculating the PRSIA.
5.2.1 Quantitative perspective for noise and handset effects in part B
The PRSIA was calculated for different conditions as in Eq. (9):
where cond ∈ {1, 2, 3, 4 }, 1 refers to the AWGN and handset, 2 to street traffic NSN and handset, 3 to the bus interior NSN and handset, and 4 to the crowded talking NSN and handset. The handset used was G.712 type at 16 kHz. This equation measured the SIAclean at mixture size 256 for the original recordings in TIMIT, SITW, and NIST 2008, without noise and handset conditions. Then, we measured the SIAcond under the four conditions in the testing phase. Table 7 presents the results of PRSIA for each condition, depending on the noise type with handset, each feature type, and each fusion method. The negative sign “-” refers to reduction, while “+” refers to increase. It is surprising to see a few positive sign values in Table 7, as we are considering different background noise with handset effects, and the system should generally be degraded; but at SNR 30 dB, the very small amount of noise may have a stabilization effect on the speaker identification system. Moreover, all positive sign values in Table 7 are for the challenging new database (SITW). Generally, however, we can notice from Table 7 that all the results for TIMIT and NIST 2008 at SNR 30 dB have negative sign values, meaning a reduction in the SIA as a result of the noise and handset effects. Secondly, most of the fusion methods reduced the PRSIA for all databases used.
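The sign convention described above (negative for a reduction, positive for an increase) is consistent with a signed relative change in SIA. The sketch below assumes that reading, namely the difference between the SIA under a given condition and the clean SIA, expressed as a percentage of the clean SIA. The exact definition is the one in Eq. (9); a plain difference in percentage points is another possible reading, and the function name is hypothetical.

```python
def prsia(sia_cond, sia_clean):
    """Assumed PRSIA: signed relative change in SIA, in percent.

    Negative values mean the condition reduced the accuracy relative to
    the clean recordings; positive values mean it increased.
    """
    return 100.0 * (sia_cond - sia_clean) / sia_clean
```

Under this assumption, a condition that halves the clean accuracy gives a PRSIA of -50, and an unchanged accuracy gives 0.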
Further, and most importantly, NIST 2008 is more sensitive to noise, especially AWGN, and has a higher reduction in PRSIA compared with TIMIT and SITW. In contrast, SITW seems relatively robust against noise. The fusion mean seems to have the lowest reduction in SIA compared with other fusion methods. However, MFCC features have less reduction in SIA for the TIMIT database, while this position is reversed for SITW and NIST 2008. For PNCC, the features have less reduction than MFCC in terms of the SIA. Finally, the highest reduction in all databases occurred under the AWGN with handset condition, which is due to the uniformity of the spectrum effect of the noise. The bus interior NSN and handset has the lowest reduction which as stated earlier is due to its low frequency nature. The results for other noise conditions (street and crowded talking) are between the AWGN and bus NSN effects.
6 Related works based on the proposed speaker identification system
Table 8 summarizes results mostly at SNR 30 dB, where cond.1 is speech files from TIMIT, SITW, and NIST 2008 without handset and noise, termed clean speech; cond.2 is noisy speech by AWGN and handset; cond.3 is street NSN and handset; cond.4 is bus NSN and handset; cond.5 is crowded talking NSN and handset. The handset used in all noise conditions is G.712 type at 16 kHz. Comparisons show improvement in SIA with the TIMIT database in cond.1 over the state of the art methods due to Kumar et al. [5] and Togneri and Pullella [6]. However, Ming et al. in their earlier work in [29] attain higher SIA in cond.1 with TIMIT but only with a GMM model and 630 speakers, but they do not consider a handset in cond. 3. New benchmark figures contributed from this study for a range of environmental noise conditions with the three databases are provided by cond.2 – cond.5.
7 Conclusions
In this study, we provided a comprehensive evaluation of text-independent closed-set speaker identification in the presence of AWGN and NSN types with a G.712 type handset at 16 kHz, to provide benchmark evaluations on three different databases. We presented different feature combinations based on MFCC and PNCC, modeled by the GMM-UBM approach with and without fusion techniques (maximum, mean, and weighted sum fusion). The evaluations were conducted under challenging environments including the presence of the G.712 handset, AWGN, and various NSN types. Three databases (TIMIT, NIST 2008, and SITW) were employed with seven SNR levels from 0 to 30 dB in steps of 5 dB. In addition, a wide range of Gaussian mixture components {8, 16, 32, 64, 128, 256, 512} was considered for clean speech. These results are intended as benchmarks on the three databases for other researchers working in the speaker identification area. The major findings from this study are:
- On the basis of the evaluations of the three databases without the noise and handset conditions, the best speaker identification method for all three databases was weighted sum fusion.
- Without the noise and handset conditions, the order for best SIA was NIST 2008, TIMIT, and SITW with 95.83, 95, and 82.5%, respectively, at mixture sizes 64, 512, and 512, respectively. These SIAs were achieved by using weighted sum fusion with 90% from FWMFCC features and 10% from the corresponding CMVNPNCC features for both the TIMIT and NIST 2008 databases. In the SITW database, on the other hand, 70% from FWMFCC features was fused with 30% from the corresponding CMVNPNCC features. The weighting should therefore be chosen as a function of the fidelity of the speech recordings.
- The evaluations in noisy conditions suggest that mean fusion of pairs of features drawn from FWMFCC, CMVNMFCC, FWPNCC, and CMVNPNCC (four combinations in total) is the most robust method for a practical speaker identification system, although there is no consistently best pairing.
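The three late fusion rules compared in these findings can be sketched as follows. This is a minimal illustration, not the implementation used in the study: it assumes that each feature stream yields one score per enrolled speaker, that the two score lists are already on comparable scales, and that the identified speaker is the one with the highest fused score. The function names and the score values are hypothetical; only the 90%/10% weighting is taken from the text.

```python
def fuse_scores(scores_a, scores_b, method="mean", weight_a=0.9):
    """Fuse two per-speaker score lists with mean, maximum, or weighted sum."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both score lists must cover the same speakers")
    if method == "mean":
        return [(a + b) / 2.0 for a, b in zip(scores_a, scores_b)]
    if method == "max":
        return [max(a, b) for a, b in zip(scores_a, scores_b)]
    if method == "weighted":
        # Linear weighted sum: weight_a for stream A, (1 - weight_a) for stream B.
        return [weight_a * a + (1.0 - weight_a) * b
                for a, b in zip(scores_a, scores_b)]
    raise ValueError(f"unknown fusion method: {method}")


def identify(fused_scores):
    """Index of the speaker with the highest fused score (closed-set decision)."""
    return max(range(len(fused_scores)), key=fused_scores.__getitem__)


# Hypothetical scores for four enrolled speakers from two feature streams.
fwmfcc_scores = [0.20, 0.90, 0.40, 0.35]
cmvnpncc_scores = [0.30, 0.50, 0.95, 0.40]

for method in ("mean", "max", "weighted"):
    fused = fuse_scores(fwmfcc_scores, cmvnpncc_scores, method=method)
    print(method, "-> speaker index", identify(fused))
```

With these toy scores the two streams disagree, so the decision depends on the rule: maximum fusion follows whichever stream is most confident, while mean and weighted sum fusion balance the two. Setting `weight_a=0.7` would correspond to the 70%/30% weighting reported for SITW.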
Future work will consider a similar extensive evaluation for a speaker identification system built from an I-vector approach [4].
References
E Gopi, Digital speech processing using Matlab (Springer, India, 2014).
T Herbig, F Gerl, W Minker, Self-learning speaker identification: a system for enhanced speech recognition (2011).
FEA El-Samie, Information security for automatic speaker identification (Springer-Verlag, New York, 2011).
P Verma, PK Das, I-Vectors in speech processing applications: a survey. Intl. J. Speech Technol. 18(4), 529–546 (2015).
RSS Kumari, SS Nidhyananthan, et al., Fused MEL feature sets based text-independent speaker identification using Gaussian mixture model. Procedia Eng. 30, 319–326 (2012).
R Togneri, D Pullella, An overview of speaker identification: Accuracy and robustness issues. Circ. Syst. Mag. IEEE. 11(2), 23–61 (2011).
S Nakagawa, L Wang, S Ohtsuka, Speaker identification and verification by combining MFCC and phase information. IEEE Trans. Audio Speech Lang. Process. 20(4), 1085–1095 (2012).
KSR Murty, B Yegnanarayana, Combining evidence from residual phase and MFCC features for speaker recognition. IEEE Signal Process. Lett. 13(1), 52–55 (2006).
L Wang, N Kitaoka, S Nakagawa, Robust distant speaker recognition based on position-dependent CMN by combining speaker-specific GMM with speaker-adapted HMM. Speech Commun. 49(6), 501–513 (2007).
L Wang, K Minami, K Yamamoto, S Nakagawa, Speaker recognition by combining MFCC and phase information in noisy conditions. IEICE Trans. Inf. Syst. 93(9), 2397–2406 (2010).
Y Yujin, Z Peihua, Z Qun, in 2010 IEEE International Conference on Intelligent computing and intelligent systems (ICIS). Research of speaker recognition based on combination of LPCC and MFCC, vol 3 (IEEE, Xiamen, 2010), pp. 765–767.
S Bhardwaj, S Srivastava, M Hanmandlu, J Gupta, GFM-based methods for speaker identification. IEEE Trans. Cybernet. 43(3), 1047–1058 (2013).
L Schmidt, M Sharifi, I Lopez Moreno, in Acoustics, speech and signal processing (ICASSP), 2014 IEEE International Conference on. Large-scale speaker identification (IEEE, Florence, 2014), pp. 1650–1654.
VR Apsingekar, PL De Leon, Speaker model clustering for efficient speaker identification in large population applications. IEEE Trans. Audio Speech Lang. Process. 17(4), 848–853 (2009).
Y Hu, D Wu, A Nucci, Fuzzy-clustering-based decision tree approach for large population speaker identification. IEEE Trans. Audio Speech Lang. Process. 21(4), 762–774 (2013).
X Zhao, Y Wang, D Wang, Robust speaker identification in noisy and reverberant conditions. IEEE/ACM Trans. Audio Speech Lang. Process. 22(4), 836–845 (2014).
Q Li, Y Huang, An auditory-based feature extraction algorithm for robust speaker identification under mismatched conditions. IEEE Trans. Audio Speech Lang. Process. 19(6), 1791–1801 (2011).
Z Zhang, L Wang, A Kai, T Yamada, W Li, M Iwahashi, Deep neural network-based bottleneck feature and denoising autoencoder-based dereverberation for distant-talking speaker identification. EURASIP J. Audio Speech Music Process. 2015(1), 1–13 (2015).
P Matějka, O Glembek, O Novotnỳ, O Plchot, F Grézl, L Burget, JH Cernockỳ, in 2016 IEEE International Conference on Acoustics, speech and signal processing (ICASSP). Analysis of DNN approaches to speaker identification (IEEE, Shanghai, 2016), pp. 5100–5104.
S Singh, MH Assaf, SR Das, SN Biswas, EM Petriu, V Groza, in 2016 IEEE International Instrumentation and Measurement Technology Conference Proceedings. Short duration voice data speaker recognition system using novel fuzzy vector quantization algorithm (IEEE, Taipei, 2016), pp. 1–6.
N Almaadeed, A Aggoun, A Amira, Speaker identification using multimodal neural networks and wavelet analysis. IET Biometrics. 4(1), 18–28 (2015).
N Wang, P Ching, N Zheng, T Lee, Robust speaker recognition using denoised vocal source and vocal tract features. IEEE Trans. Audio Speech Lang. Process. 19(1), 196–205 (2011).
A Khanteymoori, M Homayounpour, M Menhaj, in Computer Conference, 2009. CSICC 2009. 14th International CSI. Speaker identification in noisy environments using dynamic Bayesian networks (2009), pp. 601–606.
Y Suh, H Kim, Discriminative likelihood score weighting based on acoustic-phonetic classification for speaker identification. EURASIP J. Adv. Signal Process. 2014(1), 126 (2014).
MJ Alam, V Gupta, P Kenny, P Dumouchel, Speech recognition in reverberant and noisy environments employing multiple feature extractors and I-vector speaker adaptation. EURASIP J. Adv. Signal Process. 2015(1), 50 (2015).
DA Reynolds, in 1996 IEEE International Conference on Acoustics, speech, and signal processing (ICASSP). The effects of handset variability on speaker recognition performance: experiments on the switchboard corpus, vol 1 (IEEE, 1996), pp. 113–116.
DA Reynolds, MA Zissman, TF Quatieri, GC O’Leary, BA Carlson, in 1995 IEEE International Conference on Acoustics, speech, and signal processing (ICASSP). The effects of telephone transmission degradations on speaker recognition performance, vol 1 (IEEE, 1995), pp. 329–332.
MTS Al-Kaltakchi, WL Woo, SS Dlay, JA Chambers, in 2016 4th International Conference on Biometrics and Forensics (IWBF). Study of fusion strategies and exploiting the combination of MFCC and PNCC features for robust biometric speaker identification (Limassol, 2016), pp. 1–6.
MTS Al-Kaltakchi, WL Woo, SS Dlay, JA Chambers, in 2016 IEEE Statistical signal processing workshop (SSP). Study of statistical robust closed set speaker identification with feature and score-based fusion (IEEE, Palma de Mallorca, 2016), pp. 1–5.
CS Kumar, PM Rao, Design of an automatic speaker recognition system using MFCC, vector quantization and LBG algorithm. Int. J. Comput. Sci. Eng. 3(8), 2942 (2011).
C Kim, RM Stern, in 2012 IEEE International Conference on Acoustics, speech and signal processing (ICASSP). Power-normalized cepstral coefficients (PNCC) for robust speech recognition (IEEE, Kyoto, 2012), pp. 4101–4104.
E Ambikairajah, JMK Kua, V Sethu, H Li, in Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC), 2012 Asia-Pacific. PNCC-Ivector-SRC based speaker verification (IEEE, Hollywood, 2012), pp. 1–7.
G Nijhawan, M Soni, A new design approach for speaker recognition using MFCC and VAD. Int. J. Image Graphics Signal Process. (IJIGSP). 5(9), 43–49 (2013).
A Rashed, WM Bahgat, Modified technique for speaker recognition using ANN. Intl. J. Comput. Sci. Netw. Security (IJCSNS). 13(8), 8 (2013).
M Sumithra, A Devika, in 2012 International Conference on Computer communication and informatics (ICCCI). A study on feature extraction techniques for text independent speaker identification (IEEE, Coimbatore, 2012), pp. 1–5.
I Trabelsi, D Ben Ayed, in 2012 6th International Conference on Sciences of Electronics, Technologies of Information and Telecommunications (SETIT). On the use of different feature extraction methods for linear and non linear kernels (IEEE, Sousse, 2012), pp. 797–802.
K Kumar, C Kim, RM Stern, in 2011 IEEE International Conference on Acoustics, speech and signal processing (ICASSP). Delta-spectral cepstral coefficients for robust speech recognition (IEEE, Prague, 2011), pp. 4784–4787.
C Kim, R Stern, in 2010 IEEE International Conference on Acoustics, speech and signal processing (ICASSP). Feature extraction for robust speech recognition based on maximizing the sharpness of the power distribution and on power flooring (IEEE, Dallas, 2010), pp. 4574–4577.
NV Prasad, S Umesh, in 2013 IEEE Workshop on Automatic speech recognition and understanding (ASRU). Improved cepstral mean and variance normalization using Bayesian framework (IEEE, Olomouc, 2013), pp. 156–161.
H Beigi, Fundamentals of speaker recognition (Springer, USA, 2011).
DA Reynolds, TF Quatieri, RB Dunn, Speaker verification using adapted Gaussian mixture models. Digital Signal Process. 10(1-3), 19–41 (2000).
VR Apsingekar, PL De Leon, in 2009 Conference Record of the Forty-Third Asilomar Conference on Signals, systems and computers. Support vector machine based speaker identification systems using GMM parameters (IEEE, Pacific Grove, 2009), pp. 1766–1769.
SS Nidhyananthan, R Kumari, G Jaffino, in 2012 International Conference on Devices, circuits and systems (ICDCS). Robust speaker identification using vocal source information (IEEE, Coimbatore, 2012), pp. 182–186.
AA Ross, K Nandakumar, A Jain, Handbook of multibiometrics, vol. 6. (Springer, USA, 2006).
A Ross, A Jain, Information fusion in biometrics. Pattern Recognit. Lett. 24(13), 2115–2125 (2003).
Findsounds. [Online]. Available http://www.findsounds.com/.
Freesfx. [Online]. Available http://www.freesfx.co.uk/.
J Garofolo, L Lamel, W Fisher, J Fiscus, D Pallett, N Dahlgren, V Zue, TIMIT Acoustic-phonetic continuous speech corpus. Linguistic Data Consortium (1993). [Online]. Available https://catalog.ldc.upenn.edu/ldc93s1/.
SITW database. [Online]. Available http://www.speech.sri.com/projects/sitw/.
NIST 2008 database. [Online]. Available https://catalog.ldc.upenn.edu/LDC2011S07.
J Ming, TJ Hazen, JR Glass, D Reynolds, et al, Robust speaker recognition in noisy conditions. IEEE Trans. Audio Speech Lang. Process. 15(5), 1711–1723 (2007).
Acknowledgements
The first author Musab Tahseen Salahaldeen Al-Kaltakchi would like to thank the Ministry of Higher Education and Scientific Research (MoHESR) in Iraq for funding his PhD scholarship.
Author information
Contributions
The main author is MTSA-K and the other authors are his supervisors. JAC, WLW, SD are the first, second and third supervisors, respectively.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
S. Al-Kaltakchi, M., Woo, W., Dlay, S. et al. Evaluation of a speaker identification system with and without fusion using three databases in the presence of noise and handset effects. EURASIP J. Adv. Signal Process. 2017, 80 (2017). https://doi.org/10.1186/s13634-017-0515-7
DOI: https://doi.org/10.1186/s13634-017-0515-7