DOI: 10.1145/3628797.3628887

SER-Fuse: An Emotion Recognition Application Utilizing Multi-Modal, Multi-Lingual, and Multi-Feature Fusion

Published: 07 December 2023

Abstract

Speech emotion recognition (SER) is a crucial aspect of affective computing and human-computer interaction, yet effectively identifying emotions across different speakers and languages remains challenging. This paper introduces SER-Fuse, a multi-modal SER application designed to address the complexities of multiple speakers and languages. Our approach leverages diverse audio/speech embeddings and text embeddings to extract optimal features for multi-modal SER. We then employ multi-feature fusion to integrate embedding features across modalities and languages. Experimental results achieved on the English-Chinese emotional speech (ECES) dataset reveal that SER-Fuse attains competitive performance with the multi-lingual approach compared to the single-lingual approaches. Furthermore, we provide the implementation of SER-Fuse for download at https://github.com/nhattruongpham/SER-Fuse to support reproducibility and local deployment.
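
As a rough illustration of the fusion idea described in the abstract, the following minimal PyTorch sketch projects a precomputed audio/speech embedding and a text embedding into a shared space, concatenates them, and classifies the fused vector into emotion categories. This is a hypothetical simplification, not the actual SER-Fuse architecture; the embedding dimensions, layer sizes, and number of emotion classes are assumptions for demonstration only.

# Minimal, hypothetical sketch of multi-feature fusion for multi-modal SER.
# This is NOT the SER-Fuse architecture; it only illustrates fusing audio/speech
# and text embeddings before a shared emotion classifier. Dimensions are assumed.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, audio_dim=192, text_dim=768, num_classes=5):
        super().__init__()
        # Project each modality into a common 256-dimensional space.
        self.audio_proj = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU())
        self.text_proj = nn.Sequential(nn.Linear(text_dim, 256), nn.ReLU())
        # Classify the concatenated (fused) representation.
        self.classifier = nn.Sequential(
            nn.Linear(512, 128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, num_classes),
        )

    def forward(self, audio_emb, text_emb):
        # Fuse by concatenating the projected modality features.
        fused = torch.cat([self.audio_proj(audio_emb), self.text_proj(text_emb)], dim=-1)
        return self.classifier(fused)  # logits over emotion classes

if __name__ == "__main__":
    model = FusionClassifier()
    audio = torch.randn(4, 192)   # e.g., embeddings from a pre-trained speech encoder
    text = torch.randn(4, 768)    # e.g., embeddings from a pre-trained text encoder
    print(model(audio, text).shape)  # torch.Size([4, 5])

In practice, SER-Fuse integrates multiple embedding types across modalities and languages; the sketch above only shows the basic concatenation-based fusion step under the stated assumptions.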

Published In

SOICT '23: Proceedings of the 12th International Symposium on Information and Communication Technology
December 2023
1058 pages
ISBN:9798400708916
DOI:10.1145/3628797

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 December 2023

Author Tags

  1. Affective computing
  2. human-computer interaction
  3. multi-feature fusion
  4. multi-lingual analysis
  5. multi-modal analysis
  6. speech emotion recognition

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SOICT 2023

Acceptance Rates

Overall Acceptance Rate 147 of 318 submissions, 46%
