
Social Context-aware Person Search in Videos via Multi-modal Cues

Published: 22 November 2021

Abstract

Person search has long been treated as a crucial and challenging task that supports deeper insight into personalized summarization and personality discovery. Traditional methods, such as person re-identification and face recognition, profile video characters from visual information alone. They are often limited to relatively fixed poses or small viewpoint variations, and they struggle in more realistic scenes with high motion complexity (e.g., movies). At the same time, long videos such as movies often follow logical story lines and consist of continuously developing plots. In such videos, different persons usually meet on specific occasions, where they exhibit informative social cues. We observe that these social cues can semantically profile a person’s personality and benefit the person search task in two ways. First, persons with certain relationships usually co-occur within short intervals; if one of them is easier to identify, the social relation cues extracted from their co-occurrences can help identify the harder ones. Second, social relations can reveal the association between certain scenes and characters (e.g., a classmate relationship may exist only among students), which narrows the candidates down to persons with a specific relationship. In this way, high-level social relation cues can improve the effectiveness of person search. Along this line, in this article we propose a social context-aware framework that fuses visual and social contexts to profile persons from more semantic perspectives and to better handle the person search task in complex scenarios. Specifically, we first segment videos into several independent scene units and extract the social contexts within these scene units. Then, for each scene unit, we construct inner-personal links through a graph formulation operation that considers both visual cues and relation cues. Finally, we perform relation-aware label propagation to identify characters’ occurrences, combining low-level semantic cues (i.e., visual cues) and high-level semantic cues (i.e., relation cues) to further improve accuracy. Experiments on real-world datasets validate that our solution outperforms several competitive baselines.
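The last two steps of the abstract (a graph over person instances, then label propagation that mixes visual and relation cues) can be illustrated with a toy sketch. This is not the article's model: it is plain clamped label propagation over a hand-written graph, and the names `fuse_affinity`, `propagate_labels`, the weight `alpha`, and all matrix values are assumptions made up for illustration.

```python
# Toy sketch: generic clamped label propagation on a fused affinity graph.
# All names and numbers here are hypothetical, not taken from the article.

def fuse_affinity(visual, relation, alpha=0.7):
    """Blend two symmetric affinity matrices with an assumed weight alpha."""
    n = len(visual)
    return [[alpha * visual[i][j] + (1 - alpha) * relation[i][j]
             for j in range(n)] for i in range(n)]


def propagate_labels(affinity, seeds, num_classes, iterations=50):
    """Spread seed identities over the graph; seed nodes stay clamped."""
    n = len(affinity)
    # Row-normalise affinities so each row acts as transition probabilities.
    trans = []
    for row in affinity:
        total = sum(row)
        trans.append([w / total if total > 0 else 0.0 for w in row])
    # Seeds start one-hot; every other node starts with a uniform score.
    scores = [[1.0 / num_classes] * num_classes for _ in range(n)]
    for node, label in seeds.items():
        scores[node] = [1.0 if c == label else 0.0 for c in range(num_classes)]
    for _ in range(iterations):
        updated = []
        for i in range(n):
            if i in seeds:
                updated.append(scores[i])  # keep known identities fixed
            else:
                updated.append([
                    sum(trans[i][j] * scores[j][c] for j in range(n))
                    for c in range(num_classes)
                ])
        scores = updated
    # Each node takes the identity with the highest propagated score.
    return [max(range(num_classes), key=lambda c: s[c]) for s in scores]


# Four person instances in one scene unit (hypothetical values).
# Instances 0 and 3 are labelled seeds for identities 0 and 1.
VISUAL = [
    [0.0, 0.9, 0.5, 0.1],
    [0.9, 0.0, 0.2, 0.1],
    [0.5, 0.2, 0.0, 0.5],
    [0.1, 0.1, 0.5, 0.0],
]
# A relation cue links instance 2 to instance 3 only.
RELATION = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
SEEDS = {0: 0, 3: 1}
```

Calling `propagate_labels(fuse_affinity(VISUAL, RELATION, 1.0), SEEDS, 2)` uses visual cues only, while an `alpha` of 0.7 lets the relation edge contribute. In this toy graph, instance 2 is visually equidistant from both seeds, so the relation edge is what pulls it toward the identity of instance 3.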




      Published In

      ACM Transactions on Information Systems  Volume 40, Issue 3
      July 2022
      650 pages
      ISSN:1046-8188
      EISSN:1558-2868
      DOI:10.1145/3498357

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 22 November 2021
      Accepted: 01 August 2021
      Revised: 01 June 2021
      Received: 01 November 2020


      Author Tags

      1. Person search
      2. graph modeling
      3. user profile
      4. label propagation
      5. social relation
      6. neural network

      Qualifiers

      • Research-article
      • Refereed

      Funding Sources

      • National Key Research and Development Program of China
      • National Natural Science Foundation of China
