DOI: 10.1145/3607540.3617142
research-article
Open access

Narrative Graph for Narrative Generation from Long Videos

Published: 29 October 2023

Abstract

Advancements in camera technology and cloud storage have led to a surge in video content creation, making videos more accessible. However, consuming raw, unprocessed, and lengthy videos can be unengaging. While videos with human-authored narratives (such as videos on YouTube) are captivating, creating such videos requires a tremendous amount of effort and skill, and scaling that process remains a bottleneck. To address this, we propose an algorithmic narrator that generates topic-specific narratives in real time from raw videos, inspired by ChatGPT's natural language processing capabilities. Specifically, we propose a novel narrative graph structure that captures narrative-worthy and semantically enriched factual information and establishes temporal and causal links between narrative segments. The narrative graph is then fed to the algorithmic narrator to generate a textual narrative summary. Our comprehensive empirical study demonstrates the potential of algorithmic narrators and narrative graphs in creating engaging and coherent narratives, offering insights for the future of video content consumption.
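
The abstract does not give the schema of the narrative graph, so the following is only a minimal sketch of the idea it describes: nodes hold factual information about narrative segments, and typed edges record temporal or causal links before the graph is serialized for a text-generating narrator. Every name here (`Segment`, `Link`, `NarrativeGraph`, `linearize`) and the example facts are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from enum import Enum


class Link(Enum):
    """Hypothetical edge types; the abstract names only temporal and causal links."""
    TEMPORAL = "then"       # source segment happens before target segment
    CAUSAL = "therefore"    # source segment leads to target segment


@dataclass
class Segment:
    """One narrative segment with the facts extracted for it (assumed layout)."""
    seg_id: str
    start_s: float          # start time in the raw video, in seconds
    facts: list[str]


@dataclass
class NarrativeGraph:
    segments: dict[str, Segment] = field(default_factory=dict)
    edges: list[tuple[str, str, Link]] = field(default_factory=list)

    def add_segment(self, seg: Segment) -> None:
        self.segments[seg.seg_id] = seg

    def link(self, src: str, dst: str, kind: Link) -> None:
        # Refuse edges that point at segments the graph does not contain.
        if src not in self.segments or dst not in self.segments:
            raise KeyError(f"unknown segment in edge {src} -> {dst}")
        self.edges.append((src, dst, kind))

    def linearize(self) -> str:
        """Serialize segments in chronological order, followed by the links,
        as plain text that could be handed to a language-model narrator."""
        lines = []
        for seg in sorted(self.segments.values(), key=lambda s: s.start_s):
            lines.append(f"[{seg.seg_id} @ {seg.start_s:.0f}s] " + "; ".join(seg.facts))
        for src, dst, kind in self.edges:
            lines.append(f"{src} {kind.value} {dst}")
        return "\n".join(lines)


# Usage with made-up facts.
graph = NarrativeGraph()
graph.add_segment(Segment("s2", 42.0, ["the person chops vegetables"]))
graph.add_segment(Segment("s1", 0.0, ["a person enters the kitchen"]))
graph.link("s1", "s2", Link.TEMPORAL)
prompt = graph.linearize()
```

Keeping the links as explicit typed edges, rather than relying on segment order alone, is one way to let the narrator distinguish "this happened next" from "this happened because of that".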

        Published In

        NarSUM '23: Proceedings of the 2nd Workshop on User-centric Narrative Summarization of Long Videos
        October 2023
        82 pages
        ISBN: 9798400702778
        DOI: 10.1145/3607540
        This work is licensed under a Creative Commons Attribution International 4.0 License.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. computational narrative generation
        2. narrative graph
        3. storytelling

        Qualifiers

        • Research-article

        Funding Sources

        • National Research Foundation, Singapore

        Conference

        MM '23
        Bibliometrics

        Article Metrics

        • Total Citations: 0
        • Total Downloads: 333
        • Downloads (Last 12 months): 277
        • Downloads (Last 6 weeks): 37
        Reflects downloads up to 14 Jan 2025
