DOI: 10.1145/3628797.3628887

SER-Fuse: An Emotion Recognition Application Utilizing Multi-Modal, Multi-Lingual, and Multi-Feature Fusion

Published: 07 December 2023

Abstract

Speech emotion recognition (SER) is a crucial aspect of affective computing and human-computer interaction, yet effectively identifying emotions across different speakers and languages remains challenging. This paper introduces SER-Fuse, a multi-modal SER application designed to address the complexities of multiple speakers and languages. Our approach leverages diverse audio/speech embeddings and text embeddings to extract optimal features for multi-modal SER. We then employ multi-feature fusion to integrate embedding features across modalities and languages. Experimental results achieved on the English-Chinese emotional speech (ECES) dataset reveal that SER-Fuse attains competitive performance with the multi-lingual approach compared to the single-lingual approaches. Furthermore, we provide the implementation of SER-Fuse for download at https://github.com/nhattruongpham/SER-Fuse to support reproducibility and local deployment.
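
As a rough illustration of the fusion idea described in the abstract, the following minimal PyTorch sketch projects a precomputed audio/speech embedding and a text embedding into a shared space, concatenates them, and classifies the fused vector into emotion categories. This is a hypothetical simplification, not the actual SER-Fuse architecture; the embedding dimensions, layer sizes, and number of emotion classes are assumptions for demonstration only.

# Minimal, hypothetical sketch of multi-feature fusion for multi-modal SER.
# This is NOT the SER-Fuse architecture; it only illustrates fusing audio/speech
# and text embeddings before a shared emotion classifier. Dimensions are assumed.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, audio_dim=192, text_dim=768, num_classes=5):
        super().__init__()
        # Project each modality into a common 256-dimensional space.
        self.audio_proj = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU())
        self.text_proj = nn.Sequential(nn.Linear(text_dim, 256), nn.ReLU())
        # Classify the concatenated (fused) representation.
        self.classifier = nn.Sequential(
            nn.Linear(512, 128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, num_classes),
        )

    def forward(self, audio_emb, text_emb):
        # Fuse by concatenating the projected modality features.
        fused = torch.cat([self.audio_proj(audio_emb), self.text_proj(text_emb)], dim=-1)
        return self.classifier(fused)  # logits over emotion classes

if __name__ == "__main__":
    model = FusionClassifier()
    audio = torch.randn(4, 192)   # e.g., embeddings from a pre-trained speech encoder
    text = torch.randn(4, 768)    # e.g., embeddings from a pre-trained text encoder
    print(model(audio, text).shape)  # torch.Size([4, 5])

In practice, SER-Fuse integrates multiple embedding types across modalities and languages; the sketch above only shows the basic concatenation-based fusion step under the stated assumptions.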

Published In

SOICT '23: Proceedings of the 12th International Symposium on Information and Communication Technology
December 2023
1058 pages
ISBN:9798400708916
DOI:10.1145/3628797

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 December 2023

Author Tags

  1. Affective computing
  2. human-computer interaction
  3. multi-feature fusion
  4. multi-lingual analysis
  5. multi-modal analysis
  6. speech emotion recognition

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SOICT 2023

Acceptance Rates

Overall Acceptance Rate 147 of 318 submissions, 46%
