DOI:10.1145/3474085.3479214
short-paper

Joint Learning for Relationship and Interaction Analysis in Video with Multimodal Feature Fusion

Published: 17 October 2021 Publication History

Abstract

To comprehend long-duration videos, the deep video understanding (DVU) task was proposed: recognize interactions at the scene level and relationships at the movie level, and answer questions at both levels. In this paper, we propose a solution to the DVU task that combines joint learning of interaction and relationship prediction with multimodal feature fusion. Our solution decomposes the DVU task into three jointly learned sub-tasks: scene sentiment classification, scene interaction recognition, and super-scene video relationship recognition. All three use text, visual, and audio features and predict representations in a semantic space. Because sentiment, interaction, and relationship are related to one another, we train them in a unified framework with joint learning. We then answer the DVU video analysis questions from the results of the three sub-tasks. We conduct experiments on the HLVU dataset to evaluate the effectiveness of our method.
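
As a rough illustration of the idea in the abstract, the sketch below fuses text, visual, and audio features and scores three task heads with one joint loss. It is not the authors' implementation: concatenation fusion, linear heads, the cosine-based loss, equal task weights, and every name and number in it are assumptions chosen only to make the example runnable.

```python
import math
import random


def fuse(text_feat, visual_feat, audio_feat):
    """Fuse modalities by concatenation (assumed; the paper may fuse differently)."""
    return list(text_feat) + list(visual_feat) + list(audio_feat)


def project(weights, features):
    """Linear map from the fused feature into a semantic embedding space."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def task_loss(predicted, target_embedding):
    """Distance to the label embedding: 1 - cosine similarity (assumed loss)."""
    return 1.0 - cosine(predicted, target_embedding)


def joint_loss(fused, heads, targets, task_weights):
    """Weighted sum of per-task losses computed from the same fused input."""
    return sum(
        task_weights[name] * task_loss(project(heads[name], fused), targets[name])
        for name in heads
    )


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned head parameters."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
# Toy per-modality features for one scene (made-up values).
fused = fuse([0.2, 0.1], [0.5, 0.3, 0.9], [0.4])
tasks = ("sentiment", "interaction", "relationship")
heads = {name: random_matrix(4, len(fused), rng) for name in tasks}
# Made-up label embeddings in a 4-dimensional semantic space.
targets = {name: [rng.uniform(-1, 1) for _ in range(4)] for name in tasks}
weights = {name: 1.0 for name in tasks}

loss = joint_loss(fused, heads, targets, weights)
print(f"joint loss: {loss:.3f}")
```

Because all three heads read the same fused vector and contribute to one scalar loss, a gradient step on that loss would update the shared representation with signal from all three sub-tasks, which is the intuition behind training them jointly.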

Supplementary Material

MP4 File (MM21-gch3339.mp4)
A joint learning method that predicts relationships and interactions simultaneously. Based on the resulting relationship and interaction knowledge graph, we can answer different types of deep video understanding queries, such as filling in part of the graph, answering multiple-choice questions, and finding the target video that matches a description. Because the development set consists of a small number of long videos, our method is also applicable to low-shot learning.

    Published In

    MM '21: Proceedings of the 29th ACM International Conference on Multimedia
    October 2021
    5796 pages
    ISBN:9781450386517
    DOI:10.1145/3474085

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. deep video understanding
    2. interaction analysis
    3. multimodal feature fusion
    4. relationship analysis

    Qualifiers

    • Short-paper

    Conference

    MM '21
    MM '21: ACM Multimedia Conference
    October 20 - 24, 2021
    Virtual Event, China

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
