
InteractNet: Social Interaction Recognition for Semantic-rich Videos

Published: 12 June 2024

Abstract

The surge of online video platforms has created an urgent need for social interaction recognition techniques. Compared with simple short-term actions, long-term social interactions in semantic-rich videos can reflect more complex semantics, such as character relationships or emotions, and thus better support downstream applications such as story summarization and fine-grained clip retrieval. However, social interactions last longer, overlap heavily with one another, and involve multiple characters, dynamic scenes, and multi-modal cues, among other factors, so traditional solutions for short-term action recognition are likely to fail at this task. To address these challenges, in this article we propose a hierarchical graph-based system, named InteractNet, to recognize social interactions from a multi-modal perspective. Specifically, our approach first generates a semantic graph for each sampled frame by integrating multi-modal cues and then learns the node representations as short-term interaction patterns via an adapted GCN module. Along this line, global interaction representations are accumulated through a sub-clip identification module, which filters out irrelevant information and resolves temporal overlaps between interactions. Finally, the associations among simultaneous interactions are captured and modeled by constructing a global-level character-pair graph to predict the final social interactions. Comprehensive experiments on publicly available datasets demonstrate the effectiveness of our approach compared with state-of-the-art baseline methods.
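
To make the frame-level step more concrete, the following is a minimal sketch of one standard graph-convolution propagation step (the Kipf and Welling rule) applied to a toy frame graph. It is an illustration under stated assumptions, not InteractNet's adapted GCN module: the graph layout (two character nodes and one scene node), the feature values, the weight matrix, and the names `matmul` and `gcn_layer` are all hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj, feats, weight):
    """One propagation step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adj)
    # Add self-loops so each node keeps part of its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Degrees of the self-looped graph (always >= 1).
    deg = [sum(row) for row in a_hat]
    # Symmetric normalization of the adjacency matrix.
    a_norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    hidden = matmul(matmul(a_norm, feats), weight)
    # ReLU non-linearity.
    return [[max(v, 0.0) for v in row] for row in hidden]


# Hypothetical frame graph: nodes 0 and 1 are characters, node 2 is the scene.
adj = [[0.0, 1.0, 1.0],
       [1.0, 0.0, 1.0],
       [1.0, 1.0, 0.0]]
# Hypothetical 2-dimensional node features and a 2x2 weight matrix.
feats = [[1.0, 0.0],
         [0.0, 1.0],
         [1.0, 1.0]]
weight = [[0.5, -1.0],
          [0.25, 2.0]]

node_repr = gcn_layer(adj, feats, weight)
```

In a pipeline like the one described above, such per-frame node representations would then be aggregated over time; how InteractNet performs that aggregation is not specified in the abstract and is not modeled here.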


Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 8
August 2024
726 pages
EISSN:1551-6865
DOI:10.1145/3618074
  • Editor: Abdulmotaleb El Saddik

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 June 2024
Online AM: 03 May 2024
Accepted: 24 April 2024
Revised: 28 January 2024
Received: 31 July 2023
Published in TOMM Volume 20, Issue 8


Author Tags

  1. Multi-modal analysis
  2. video-and-language understanding
  3. graph convolutional network

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China


Article Metrics

  • Total Citations: 0
  • Total Downloads: 259
  • Downloads (Last 12 months): 259
  • Downloads (Last 6 weeks): 32

Reflects downloads up to 21 Jan 2025
