
Sequential Action Retrieval for Generating Narratives from Long Videos

Published: 29 October 2023

Abstract

In this paper, we propose a novel event retrieval method called Sequential Action Retrieval, a work in progress toward generating video and text narratives of long-term events from long videos. Summarizing events of user interest from long videos is a challenging problem. Our method detects long-term human activities defined as sequences of action elements. By searching for these action elements in a semantic video graph that structures the objects appearing in a video and their relationships, our method can recognize complex action events, such as changes of object ownership involving two or more people. We conducted an initial evaluation of event-related person detection on the Narrative dataset, and we introduce a new evaluation metric, KP-IDF1, to measure how accurately the appearances of event-related persons are identified. Our method achieves a KP-IDF1 of 76.4% on the bicycle theft case.
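The abstract specifies neither the graph schema nor the matching procedure, so the sketch below only illustrates the general idea under stated assumptions: a semantic video graph flattened into time-stamped relation triples, a greedy search for a temporally ordered sequence of action elements, and a KP-IDF1-style score computed as IDF1 restricted to key (event-related) persons. Every name and schema here (Triple, ActionElement, retrieve_sequence, kp_idf1) is hypothetical, not the authors' implementation.

```python
# Minimal sketch of sequential action retrieval over a semantic video graph.
# Hypothetical schema: the graph is flattened into time-stamped relation
# triples (frame, subject track id, relation label, object label).
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[int, str, str, str]  # (frame, subject, relation, object)

@dataclass
class ActionElement:
    relation: str      # e.g. "parks", "takes"
    object_label: str  # e.g. "bicycle"

def retrieve_sequence(graph: List[Triple],
                      pattern: List[ActionElement]) -> Optional[List[Triple]]:
    """Greedily find a temporally ordered occurrence of the action-element
    pattern in the graph; return the matched triples, or None if absent."""
    matched: List[Triple] = []
    last_frame, step = -1, 0
    for frame, subj, rel, obj in sorted(graph):
        if step == len(pattern):
            break
        elem = pattern[step]
        if frame > last_frame and rel == elem.relation and obj == elem.object_label:
            matched.append((frame, subj, rel, obj))
            last_frame, step = frame, step + 1
    return matched if step == len(pattern) else None

# A bicycle-theft-like event: someone parks a bicycle, then a *different*
# person takes it away (an object-ownership change between two people).
pattern = [ActionElement("parks", "bicycle"), ActionElement("takes", "bicycle")]
graph = [(10, "person_A", "parks", "bicycle"),
         (55, "person_B", "takes", "bicycle")]
hit = retrieve_sequence(graph, pattern)
if hit and len({subj for _, subj, _, _ in hit}) > 1:
    print("ownership-change event detected:", hit)

def kp_idf1(gt: Dict[str, Set[Tuple[int, str]]],
            pred: Dict[str, Set[Tuple[int, str]]],
            key_ids: Set[str]) -> float:
    """IDF1 restricted to key (event-related) persons. Assumes predicted
    tracks are already aligned to ground-truth identities; the standard
    IDF1 instead finds an optimal identity assignment first."""
    idtp = sum(len(gt.get(i, set()) & pred.get(i, set())) for i in key_ids)
    idfn = sum(len(gt.get(i, set()) - pred.get(i, set())) for i in key_ids)
    idfp = sum(len(pred.get(i, set()) - gt.get(i, set())) for i in key_ids)
    return 2 * idtp / (2 * idtp + idfp + idfn)

# Example: one correct key-person detection, one miss, one false positive
# -> KP-IDF1 = 2*1 / (2*1 + 1 + 1) = 0.5.
gt = {"person_B": {(10, "d1"), (55, "d2")}}
pred = {"person_B": {(10, "d1"), (60, "d9")}}
print(kp_idf1(gt, pred, key_ids={"person_B"}))
```

Note that the standard IDF1 measure from multi-object tracking first computes an optimal bipartite matching between ground-truth and predicted identities; the sketch assumes identities are already aligned, which keeps the example short but is a simplification.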


Cited By

  • (2024) P-RAG: Progressive Retrieval Augmented Generation for Planning on Embodied Everyday Task. In Proceedings of the 32nd ACM International Conference on Multimedia, 6969-6978. DOI: 10.1145/3664647.3680661. Online publication date: 28 October 2024.

Published In

NarSUM '23: Proceedings of the 2nd Workshop on User-centric Narrative Summarization of Long Videos
October 2023
82 pages
ISBN: 9798400702778
DOI: 10.1145/3607540

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. image retrieval
  2. narrative generation
  3. video graph

Qualifiers

  • Research-article

Conference

MM '23


