Research Article
DOI: 10.1145/3664647.3681631

Temporal Enhancement for Video Affective Content Analysis

Published: 28 October 2024

Abstract

With the popularity and advancement of the Internet and video-sharing platforms, video affective content analysis has developed rapidly. Temporal information is crucial for this task. Nevertheless, existing methods often overlook the fact that videos contain substantial irrelevant information and that modalities contribute unevenly to emotional tasks. This can introduce noise from both temporal fragments and modalities, weakening a model's ability to identify crucial temporal fragments and recognize emotions. To tackle these issues, we propose a Temporal Enhancement (TE) method. Specifically, we utilize three encoders to extract features at various levels and employ temporal sampling to augment the temporal data, thereby enriching the video representation and improving the model's robustness to noise. We then design a cross-modal temporal enhancement module that enhances the temporal information of each modality's features; this module interacts with multiple modalities simultaneously to emphasize critical temporal fragments while suppressing irrelevant ones. Experimental results on four benchmark datasets show that the proposed method achieves state-of-the-art video affective content analysis performance, and ablation experiments confirm the effectiveness of each module.
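
To make the pipeline more concrete, the sketch below illustrates the two core ideas the abstract describes: temporal sampling as data augmentation, and a cross-modal temporal-enhancement step in which one modality attends to the other modalities so that emotionally salient temporal fragments are emphasized and irrelevant ones suppressed. This is a minimal PyTorch sketch for illustration only; the function and class names (temporal_sample, CrossModalTemporalEnhancement), the dimensions, and the sigmoid gating are assumptions and do not reproduce the paper's actual encoders or module design.

```python
# Hypothetical sketch, not the authors' released implementation.
import torch
import torch.nn as nn


def temporal_sample(features: torch.Tensor, num_segments: int) -> torch.Tensor:
    """Randomly pick one frame-level feature per uniform temporal segment.

    features: (batch, time, dim) sequence of frame/clip features.
    Returns a (batch, num_segments, dim) augmented sequence.
    """
    b, t, d = features.shape
    # Split the timeline into equal segments and draw a random index in each.
    bounds = torch.linspace(0, t, num_segments + 1).long()
    idx = torch.stack([
        torch.randint(int(bounds[i]), max(int(bounds[i + 1]), int(bounds[i]) + 1), (1,))
        for i in range(num_segments)
    ]).squeeze(1)
    return features[:, idx.clamp(max=t - 1), :]


class CrossModalTemporalEnhancement(nn.Module):
    """One modality queries the other modalities to reweight its time steps."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # A sigmoid gate keeps or suppresses each temporal fragment.
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, query_mod: torch.Tensor, other_mods: torch.Tensor) -> torch.Tensor:
        # query_mod: (batch, t_q, dim); other_mods: (batch, t_kv, dim),
        # e.g. the remaining modalities concatenated along the time axis.
        context, _ = self.attn(query_mod, other_mods, other_mods)
        enhanced = self.norm(query_mod + context)
        return enhanced * self.gate(enhanced)  # emphasize salient fragments


if __name__ == "__main__":
    visual = torch.randn(2, 64, 256)   # e.g. frame-level visual features
    audio = torch.randn(2, 32, 256)    # e.g. clip-level audio features
    visual = temporal_sample(visual, num_segments=16)
    enhancer = CrossModalTemporalEnhancement()
    print(enhancer(visual, audio).shape)  # torch.Size([2, 16, 256])
```

In this reading, the cross-attention supplies the interaction across modalities and the gate plays the role of suppressing irrelevant fragments; the paper's module may differ in both respects.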

Supplemental Material

MP4 File - Demonstration video for "Temporal Enhancement in Video Affective Content Analysis"
The video showcases our research titled "Temporal Enhancement for Video Affective Content Analysis." In it, we delve into the task definition, identify key challenges, discuss noise issues, and review recent advancements in the field. Additionally, we highlight the motivation behind our study, outline our framework, and present the final experimental results.



    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647


    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. crossmodal attention
    2. temporal enhancement
    3. video affective content analysis



    Conference

    MM '24: The 32nd ACM International Conference on Multimedia
    October 28 - November 1, 2024
    Melbourne VIC, Australia

    Acceptance Rates

    MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;
    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
