
Assessing the effectiveness of ensembles in Speech Emotion Recognition: Performance analysis under challenging scenarios

Published: 01 June 2024

Abstract

Speech Emotion Recognition (SER) is an important application in areas such as online gaming, e-learning, and medical care. However, recognizing emotion in speech is computationally demanding, since it requires an extensive search over feature subsets, algorithm hyperparameters, and algorithm combinations, which makes ensembles an attractive option. Although ensembles are frequently employed in SER, their application has not been explored in depth, and their potential benefits for recognition accuracy and robustness to variability in speech signals have not been fully realized. The purpose of this article is to assess the effectiveness of ensembles in SER by analyzing their performance under challenging scenarios. The experiments in this study evaluated speech samples from several languages using an out-of-date feature set and simple algorithms with default hyperparameters. A basic ensemble technique with decision-level voting was applied, with classifier set selection performed by a rudimentary heuristic.
The results indicated that ensembles of basic classifiers significantly improved the SER rate, with absolute improvements ranging from 0.57% to 9.89%. The proposed ensemble approach outperformed state-of-the-art SER methods, including deep learning-based ones, in recognition rate. These findings justify the use of ensembles in SER applications, particularly when data are scarce or only out-of-date features and algorithms are available, and they motivate further investigation of ensembles to enhance recognition accuracy and robustness to speech signal variability.
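To make the approach concrete, the sketch below shows decision-level (hard) voting over simple classifiers with default hyperparameters, combined with a rudimentary greedy heuristic for classifier set selection, in the spirit of the technique described above. This is a minimal illustrative sketch, not the authors' implementation: the scikit-learn stack, the synthetic feature matrix, and the greedy acceptance rule are all assumptions introduced here for illustration.

```python
# Sketch (assumed, not the paper's code): decision-level "hard" voting
# over simple classifiers with default hyperparameters, plus a
# rudimentary greedy heuristic for classifier set selection.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in for acoustic feature vectors and emotion labels; a real SER
# pipeline would extract these features from speech recordings.
X, y = make_classification(n_samples=400, n_features=30, n_classes=4,
                           n_informative=12, random_state=0)

candidates = [
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("nb", GaussianNB()),
    ("knn", KNeighborsClassifier()),
]

# Rudimentary heuristic: greedily keep a candidate only if adding it
# improves the mean cross-validated accuracy of the majority vote.
selected, best_score = [], 0.0
for name, clf in candidates:
    trial = selected + [(name, clf)]
    vote = VotingClassifier(estimators=trial, voting="hard")
    score = cross_val_score(vote, X, y, cv=5).mean()
    if score > best_score:
        selected, best_score = trial, score

print([n for n, _ in selected], round(best_score, 3))
```

In a full SER setup, the candidate pool would contain the complete set of base learners under evaluation, and the selection heuristic would be scored on held-out speech data rather than synthetic features.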

Highlights

Assessed the effectiveness of ensembles in SER under challenging scenarios.
Evaluated speech samples from various languages.
The suggested ensemble approach outperformed state-of-the-art SER methods.
The findings justify the use of ensembles in SER applications.
The work recommends further investigation of ensembles.



Published In

Expert Systems with Applications: An International Journal, Volume 243, Issue C, June 2024, 1588 pages

Publisher

Pergamon Press, Inc., United States


Author Tags

  1. Speech
  2. Emotion recognition
  3. Classifier set selection

Qualifiers

  • Research-article
