Explicit Granularity and Implicit Scale Correspondence Learning for Point-Supervised Video Moment Localization

Published: 28 October 2024

Abstract

Video moment localization (VML) aims to identify the temporal boundary of the moment that semantically matches a given query. Point-supervised VML strikes a balance between localization accuracy and annotation cost, but it remains immature owing to two issues: granularity alignment and scale perception. To this end, we propose a Semantic Granularity and Scale Correspondence Integration (SG-SCI) framework that leverages the limited single-frame annotation for correspondence learning. It explicitly models the semantic relations among features of different granularities and adaptively mines the implicit semantic scale, thereby enhancing feature representations of varying granularities and scales. Specifically, SG-SCI employs a granularity correspondence alignment module to align semantics via latent prior knowledge, and a scale correspondence learning module to identify and address semantic scale differences. Extensive experiments on benchmark datasets demonstrate the promising performance of our model over several state-of-the-art competitors.
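For concreteness, the sketch below illustrates one plausible reading of the two components named above, written in PyTorch. The module names (GranularityAlignment, ScaleHead), the feature dimensions, and the Gaussian moment mask centred on the annotated frame are illustrative assumptions, not the authors' published architecture.

```python
# Hypothetical sketch of the two components described in the abstract.
# All names, shapes, and the Gaussian scale head are assumptions for
# illustration; the paper's actual SG-SCI architecture may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GranularityAlignment(nn.Module):
    """Aligns clip-level video features with word- and sentence-level
    query features, one plausible form of 'granularity correspondence
    alignment' across feature granularities."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.word_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sent_proj = nn.Linear(dim, dim)

    def forward(self, clip_feats, word_feats, sent_feat):
        # Fine granularity: attend each video clip to the query words.
        fine, _ = self.word_attn(clip_feats, word_feats, word_feats)
        # Coarse granularity: gate clips with the sentence embedding.
        coarse = clip_feats * torch.sigmoid(self.sent_proj(sent_feat)).unsqueeze(1)
        return fine + coarse  # (B, T, dim)

class ScaleHead(nn.Module):
    """Predicts a per-query temporal scale (Gaussian width) around the
    single annotated frame, standing in for 'implicit scale
    correspondence learning'."""
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, aligned_feats, point_idx):
        # Feature at the annotated frame estimates the moment's scale.
        anchor = aligned_feats[torch.arange(aligned_feats.size(0)), point_idx]
        sigma = F.softplus(self.mlp(anchor)) + 1e-3  # (B, 1), strictly positive
        T = aligned_feats.size(1)
        t = torch.arange(T, device=aligned_feats.device).float().unsqueeze(0)
        # Soft moment mask centred on the single-frame annotation.
        mask = torch.exp(-0.5 * ((t - point_idx.float().unsqueeze(1)) / sigma) ** 2)
        return mask  # (B, T), higher inside the predicted moment

if __name__ == "__main__":
    # Minimal shape check with random tensors.
    B, T, L, D = 2, 64, 12, 256
    feats = GranularityAlignment(D)(torch.randn(B, T, D),
                                    torch.randn(B, L, D),
                                    torch.randn(B, D))
    mask = ScaleHead(D)(feats, torch.tensor([10, 40]))
    print(mask.shape)  # torch.Size([2, 64])
```

Under this hypothetical reading, the soft mask could be thresholded or used to weight clip-level matching scores, recovering a moment boundary from a single annotated frame.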



Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024, 11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647

Publisher

Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. correspondence learning
      2. cross-modal moment localization
      3. cross-modal retrieval

      Qualifiers

      • Research-article

Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions, 26%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%
