Does Visual Self-Supervision Improve Learning of Speech Representations for Emotion Recognition?

Published: 01 January 2023

Abstract

Self-supervised learning has attracted considerable recent research interest. However, most work on self-supervision in speech is unimodal, and there has been limited study of the interaction between the audio and visual modalities for cross-modal self-supervision. This article (1) investigates visual self-supervision via face reconstruction to guide the learning of audio representations; (2) proposes an audio-only self-supervision approach for speech representation learning; (3) shows that a multi-task combination of the proposed visual and audio self-supervision is beneficial for learning richer features that are more robust in noisy conditions; (4) shows that self-supervised pretraining can outperform fully supervised training and is especially useful for preventing overfitting on smaller datasets. We evaluate our learned audio representations on discrete emotion recognition, continuous affect recognition, and automatic speech recognition, and we outperform existing self-supervised methods on all tested downstream tasks. Our results demonstrate the potential of visual self-supervision for audio feature learning and suggest that joint visual and audio self-supervision yields more informative audio representations for speech and emotion recognition.
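To make the multi-task idea concrete, below is a minimal PyTorch sketch of one training step, assuming a shared audio encoder whose embedding feeds (a) a face-reconstruction head (the visual pretext task) and (b) an audio-reconstruction head (the audio-only pretext task), trained with a weighted sum of the two losses. The module names, tensor shapes, targets, and loss weights are illustrative assumptions, not the architecture used in the article.

```python
# Hypothetical multi-task self-supervision sketch: one audio encoder, two pretext heads.
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Encodes a raw-audio window into a fixed-size embedding (illustrative layout)."""
    def __init__(self, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=80, stride=4), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.proj = nn.Linear(128, emb_dim)

    def forward(self, wav):                 # wav: (B, 1, T)
        h = self.net(wav).squeeze(-1)       # (B, 128)
        return self.proj(h)                 # (B, emb_dim)

class FaceDecoder(nn.Module):
    """Reconstructs a low-resolution face frame from the audio embedding (visual pretext)."""
    def __init__(self, emb_dim=256, img_size=32):
        super().__init__()
        self.img_size = img_size
        self.net = nn.Sequential(
            nn.Linear(emb_dim, 512), nn.ReLU(),
            nn.Linear(512, 3 * img_size * img_size), nn.Sigmoid(),
        )

    def forward(self, z):                   # z: (B, emb_dim)
        return self.net(z).view(-1, 3, self.img_size, self.img_size)

class AudioDecoder(nn.Module):
    """Reconstructs a coarse audio target from the embedding (audio-only pretext)."""
    def __init__(self, emb_dim=256, out_len=1000):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim, 512), nn.ReLU(), nn.Linear(512, out_len))

    def forward(self, z):
        return self.net(z)

# One illustrative training step on dummy data.
enc, face_dec, audio_dec = AudioEncoder(), FaceDecoder(), AudioDecoder()
params = list(enc.parameters()) + list(face_dec.parameters()) + list(audio_dec.parameters())
opt = torch.optim.Adam(params, lr=1e-4)

wav = torch.randn(8, 1, 16000)            # batch of 1-second, 16 kHz audio windows
target_face = torch.rand(8, 3, 32, 32)    # face frames paired with the audio (assumed targets)
target_audio = torch.randn(8, 1000)       # coarse audio reconstruction target (assumed)

z = enc(wav)
loss_visual = nn.functional.l1_loss(face_dec(z), target_face)
loss_audio = nn.functional.mse_loss(audio_dec(z), target_audio)
lambda_v, lambda_a = 1.0, 1.0             # multi-task weights; equal weighting assumed here
loss = lambda_v * loss_visual + lambda_a * loss_audio

opt.zero_grad()
loss.backward()
opt.step()
```

After pretraining with both pretext losses, the decoders would be discarded and the encoder's embedding used as the audio representation for downstream tasks such as emotion recognition or speech recognition.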



Published In

IEEE Transactions on Affective Computing  Volume 14, Issue 1
Jan.-March 2023
863 pages

Publisher

IEEE Computer Society Press

Washington, DC, United States
