DOI: 10.1145/3503161.3549202
Research article | Open access

Compute to Tell the Tale: Goal-Driven Narrative Generation

Published: 10 October 2022

Abstract

Man is by nature a social animal. One important facet of human evolution is narrative imagination, fictional or factual, and the telling of the tale to other individuals. Factual narratives, such as news, journalism, and field reports, are based on real-world events and often require extensive human effort to create. In the era of big data, where video capture devices are available everywhere, a massive amount of raw video (including life-logging, dashcam, and surveillance footage) is generated daily. As a result, it is practically impossible for humans to digest and analyze all of this video data. This paper reviews the problem of computational narrative generation, in which a goal-driven narrative (in the form of text, with or without video) is generated from one or more long videos. Importantly, the narrative generation problem is distinguished from the existing literature by its focus on a comprehensive understanding of the user's goal, narrative structure, and open-domain input. We tentatively outline a general narrative generation framework and discuss the potential research problems and challenges in this direction. Informed by the real-world impact of narrative generation, we then illustrate several practical use cases on a Video Logging as a Service platform that enables users to get more out of their data through a goal-driven, intelligent storytelling AI agent.
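
The paper outlines its narrative generation framework only at a conceptual level. Purely as an illustration of the kind of pipeline the abstract describes (user goal, perception over long videos, goal-conditioned event selection, narrative planning, text realization), one might organize it as in the following Python sketch. Every class, function, and parameter name here is a hypothetical assumption for exposition, not the authors' implementation.

```python
from dataclasses import dataclass, field

# Hypothetical sketch only: the paper describes the framework conceptually,
# so all names and signatures below are illustrative assumptions.

@dataclass
class UserGoal:
    intent: str                                   # e.g. "report unattended objects"
    keywords: list = field(default_factory=list)  # entities/actions of interest

@dataclass
class Event:
    start: float          # seconds into the footage
    end: float
    description: str      # textual description produced by perception models

def perceive(video_paths):
    """Stand-in for perception (detection, tracking, captioning) that turns
    raw long videos into time-stamped events; a real system would plug in
    trained vision models here."""
    raise NotImplementedError

def select_relevant(events, goal):
    """Goal-conditioned filtering: keep events mentioning the user's keywords."""
    return [e for e in events if any(k in e.description for k in goal.keywords)]

def plan_narrative(events):
    """Impose a narrative structure; chronological order is the simplest choice."""
    return sorted(events, key=lambda e: e.start)

def narrate(events, goal):
    """Surface realization: turn the ordered events into a short report."""
    lines = [f"Report on: {goal.intent}"]
    lines += [f"[{e.start:.0f}s-{e.end:.0f}s] {e.description}" for e in events]
    return "\n".join(lines)

if __name__ == "__main__":
    # Dummy events in place of real perception output.
    events = [
        Event(120.0, 135.0, "a person leaves a bag near the entrance"),
        Event(30.0, 42.0, "a delivery truck parks outside"),
    ]
    goal = UserGoal(intent="unattended objects", keywords=["bag"])
    print(narrate(plan_narrative(select_relevant(events, goal)), goal))
```

In such a design the user goal conditions both the retrieval step and the final text, which is what separates goal-driven narration from generic video captioning or summarization.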

Supplementary Material

MP4 File (mmbni17.mp4)
This work introduces a novel goal-driven computational factual narrative generation task for long videos. We draw inspiration from the social science literature and discuss the challenges of the proposed task. This video provides a brief overview of the paper.




Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN: 9781450392037
DOI: 10.1145/3503161
This work is licensed under a Creative Commons Attribution 4.0 International License.


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. computational narrative generation
  2. video analytics

Qualifiers

  • Research-article

Funding Sources

  • National Research Foundation Singapore

Conference

MM '22

Acceptance Rates

Overall acceptance rate: 2,145 of 8,556 submissions (25%)

