
MEAD: A Large-Scale Audio-Visual Dataset for Emotional Talking-Face Generation

Published: 23 August 2020

Abstract

The synthesis of natural emotional reactions is an essential criterion in vivid talking-face video generation. This criterion has nevertheless seldom been considered in previous work, owing to the absence of a large-scale, high-quality emotional audio-visual dataset. To address this issue, we build the Multi-view Emotional Audio-visual Dataset (MEAD), a talking-face video corpus featuring 60 actors and actresses talking with eight different emotions at three different intensity levels. High-quality audio-visual clips are captured from seven different view angles in a strictly controlled environment. Together with the dataset, we release an emotional talking-face generation baseline that enables the manipulation of both emotion and its intensity. Our dataset could benefit a number of research fields, including conditional generation, cross-modal understanding, and expression recognition. Code, models, and data are publicly available on our project page: https://wywu.github.io/projects/MEAD/MEAD.html.
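
As a rough illustration of the corpus layout described in the abstract, the short Python sketch below enumerates the recording conditions per actor (eight emotions at three intensity levels, captured from seven view angles). The emotion and view label strings are placeholder assumptions for illustration, not the dataset's official naming.

    from itertools import product

    # Emotion categories, intensity levels, and camera views as described in the
    # abstract; the label strings below are illustrative assumptions, not the
    # dataset's official naming.
    EMOTIONS = ["neutral", "angry", "contempt", "disgusted",
                "fear", "happy", "sad", "surprised"]      # eight emotions
    INTENSITIES = [1, 2, 3]                               # three intensity levels
    VIEWS = ["front", "left_30", "left_60",
             "right_30", "right_60", "up", "down"]        # seven views (hypothetical names)

    # Every recording condition a single actor appears under.
    conditions = list(product(EMOTIONS, INTENSITIES, VIEWS))
    print(len(conditions))  # 8 * 3 * 7 = 168 conditions per actor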




        Published In

        Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI
August 2020, 831 pages
ISBN: 978-3-030-58588-4
DOI: 10.1007/978-3-030-58589-1

        Publisher

Springer-Verlag, Berlin, Heidelberg

        Author Tags

        1. Video generation
        2. Generative adversarial networks
        3. Representation disentanglement

        Qualifiers

        • Article

        Cited By

• (2024) TALK-Act: Enhance Textural-Awareness for 2D Speaking Avatar Reenactment with Diffusion Model. SIGGRAPH Asia 2024 Conference Papers, pp. 1–11. DOI: 10.1145/3680528.3687571. Online publication date: 3-Dec-2024
• (2024) GaussianTalker: Speaker-specific Talking Head Synthesis via 3D Gaussian Splatting. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 3548–3557. DOI: 10.1145/3664647.3681675. Online publication date: 28-Oct-2024
• (2024) Cross-Modal Meta Consensus for Heterogeneous Federated Learning. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 975–984. DOI: 10.1145/3664647.3681510. Online publication date: 28-Oct-2024
• (2024) MMHead: Towards Fine-grained Multi-modal 3D Facial Animation. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 7966–7975. DOI: 10.1145/3664647.3681366. Online publication date: 28-Oct-2024
• (2024) ECAvatar: 3D Avatar Facial Animation with Controllable Identity and Emotion. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 10468–10476. DOI: 10.1145/3664647.3681328. Online publication date: 28-Oct-2024
• (2024) FD2Talk: Towards Generalized Talking Head Generation with Facial Decoupled Diffusion Model. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 3411–3420. DOI: 10.1145/3664647.3681238. Online publication date: 28-Oct-2024
• (2024) ListenFormer: Responsive Listening Head Generation with Non-autoregressive Transformers. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 7094–7103. DOI: 10.1145/3664647.3681182. Online publication date: 28-Oct-2024
• (2024) Self-Supervised Emotion Representation Disentanglement for Speech-Preserving Facial Expression Manipulation. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 3800–3808. DOI: 10.1145/3664647.3681017. Online publication date: 28-Oct-2024
• (2024) FacialFlowNet: Advancing Facial Optical Flow Estimation with a Diverse Dataset and a Decomposed Model. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 2194–2203. DOI: 10.1145/3664647.3680921. Online publication date: 28-Oct-2024
• (2024) Expressive 3D Facial Animation Generation Based on Local-to-Global Latent Diffusion. IEEE Transactions on Visualization and Computer Graphics, 30(11), pp. 7397–7407. DOI: 10.1109/TVCG.2024.3456213. Online publication date: 1-Nov-2024
