DOI: 10.1145/3581783.3612417

TeViS: Translating Text Synopses to Video Storyboards

Published: 27 October 2023

Abstract

A video storyboard is a roadmap for video creation that consists of shot-by-shot images visualizing the key plots in a text synopsis. Creating video storyboards, however, remains challenging: it not only requires cross-modal association between high-level text and images but also demands long-term reasoning to make transitions smooth across shots. In this paper, we propose a new task called Text synopsis to Video Storyboard (TeViS), which aims to retrieve an ordered sequence of images as the video storyboard to visualize the text synopsis. We construct the MovieNet-TeViS dataset based on the public MovieNet dataset [17]. It contains 10K text synopses, each paired with keyframes manually selected from the corresponding movies by considering both relevance and cinematic coherence. To benchmark the task, we present strong CLIP-based baselines and a novel VQ-Trans model. VQ-Trans first encodes the text synopsis and images into a joint embedding space and uses vector quantization (VQ) to improve the visual representation. It then auto-regressively generates a sequence of visual features for retrieval and ordering. Experimental results demonstrate that VQ-Trans significantly outperforms prior methods and the CLIP-based baselines. Nevertheless, there is still a large gap compared to human performance, suggesting room for promising future work. The code and data are available at: https://ruc-aimind.github.io/projects/TeViS/
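For readers unfamiliar with the vector-quantization step mentioned above, the following is a minimal, illustrative PyTorch sketch of a VQ-VAE-style quantizer [45] applied to per-shot image embeddings. It is not the authors' VQ-Trans implementation; the codebook size, embedding dimension, and names (VectorQuantizer, CODEBOOK_SIZE, EMBED_DIM) are assumptions chosen purely for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

CODEBOOK_SIZE = 512   # number of discrete visual "tokens" (assumed value, not from the paper)
EMBED_DIM = 512       # dimensionality of the joint text/image embedding space (assumed)

class VectorQuantizer(nn.Module):
    """Snap each continuous visual embedding to its nearest codebook entry (VQ-VAE style)."""
    def __init__(self, codebook_size=CODEBOOK_SIZE, dim=EMBED_DIM, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)
        self.codebook.weight.data.uniform_(-1.0 / codebook_size, 1.0 / codebook_size)
        self.beta = beta  # commitment-loss weight, a standard VQ-VAE hyperparameter

    def forward(self, z):
        # z: (batch, num_shots, dim) continuous image embeddings, e.g. from a CLIP encoder
        b, s, d = z.shape
        flat = z.reshape(-1, d)                          # (batch * num_shots, dim)
        dist = torch.cdist(flat, self.codebook.weight)   # distances to every codebook entry
        indices = dist.argmin(dim=-1).reshape(b, s)      # discrete "shot tokens"
        z_q = self.codebook(indices)                     # quantized embeddings
        # Codebook and commitment losses as in VQ-VAE; detach() blocks gradients on one side.
        vq_loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        # Straight-through estimator: gradients flow back to the encoder unchanged.
        z_q = z + (z_q - z).detach()
        return z_q, indices, vq_loss

# Toy usage: quantize the embeddings of two 8-shot storyboards. An autoregressive
# Transformer (not shown) would then predict the next shot's visual token given the
# synopsis embedding and the tokens generated so far, and the predicted features would
# be matched against a candidate image pool for retrieval and ordering.
vq = VectorQuantizer()
image_embeds = torch.randn(2, 8, EMBED_DIM)
quantized, tokens, vq_loss = vq(image_embeds)
print(quantized.shape, tokens.shape, vq_loss.item())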

Supplemental Material

MP4 File
Introductory video about the MM'23 paper "TeViS: Translating Text Synopses to Video Storyboards" presented by Xu Gu, Renmin University of China.

References

[1]
Max Bain, Arsha Nagrani, Andrew Brown, and Andrew Zisserman. 2020. Condensed Movies: Story Based Retrieval with Contextual Embeddings. In Computer Vision - ACCV 2020: 15th Asian Conference on Computer Vision, Kyoto, Japan, November 30 - December 4, 2020, Revised Selected Papers, Part V (Kyoto, Japan). Springer-Verlag, Berlin, Heidelberg, 460--479. https://doi.org/10.1007/978-3-030-69541-5_28
[2]
Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. 2021. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV. 1728--1738.
[3]
Piotr Bojanowski, Francis R. Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid, and Josef Sivic. 2013. Finding Actors and Actions in Movies. 2013 IEEE International Conference on Computer Vision (2013), 2280--2287.
[4]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. NeurIPS, Vol. 33 (2020), 1877--1901.
[5]
Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman. 2014. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, Vol. 46 (2014), 904--911.
[6]
Vasileios T Chasanis, Aristidis C Likas, and Nikolaos P Galatsanos. 2008. Scene detection in videos using shot clustering and sequence alignment. IEEE transactions on multimedia, Vol. 11, 1 (2008), 89--100.
[7]
Shizhe Chen, Bei Liu, Jianlong Fu, Ruihua Song, Qin Jin, Pingping Lin, Xiaoyu Qi, Chunting Wang, and Jin Zhou. 2019. Neural Storyboard Artist: Visualizing Stories with Coherent Image Sequences. In Proceedings of the 27th ACM International Conference on Multimedia (Nice, France) (MM '19). Association for Computing Machinery, New York, NY, USA, 2236--2244. https://doi.org/10.1145/3343031.3350571
[8]
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. 2015. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325 (2015).
[9]
Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: Universal image-text representation learning. In ECCV. Springer, 104--120.
[10]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL. 4171--4186.
[11]
Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. 2021. Cogview: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, Vol. 34 (2021), 19822--19835.
[12]
Patrick Esser, Robin Rombach, and Bjorn Ommer. 2021. Taming Transformers for High-Resolution Image Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 12873--12883.
[13]
Zongyu Guo, Zhizheng Zhang, Runsen Feng, and Zhibo Chen. 2021. Soft then hard: Rethinking the quantization in neural image compression. In International Conference on Machine Learning. PMLR, 3920--3929.
[14]
Bo Han and Weiguo Wu. 2011. Video scene segmentation using a novel boundary evaluation criterion and dynamic programming. In 2011 IEEE International conference on multimedia and expo. IEEE, 1--6.
[15]
Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. 2022. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303 (2022).
[16]
Junjie Hu, Yu Cheng, Zhe Gan, Jingjing Liu, Jianfeng Gao, and Graham Neubig. 2020. What makes a good story? designing composite rewards for visual storytelling. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 7969--7976.
[17]
Qingqiu Huang, Yu Xiong, Anyi Rao, Jiaze Wang, and Dahua Lin. 2020. MovieNet: A Holistic Dataset for Movie Understanding. In The European Conference on Computer Vision (ECCV).
[18]
Ting-Hao K. Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Aishwarya Agrawal, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. 2016. Visual Storytelling. In 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2016).
[19]
Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, and Jianlong Fu. 2021. Seeing out of the box: End-to-end pre-training for vision-language representation learning. In CVPR. 12976--12985.
[20]
Yuqi Huo, Manli Zhang, Guangzhen Liu, Haoyu Lu, Yizhao Gao, Guoxing Yang, Jingyuan Wen, Heng Zhang, Baogui Xu, Weihao Zheng, et al. 2021. WenLan: Bridging vision and language by large-scale multi-modal pre-training. arXiv preprint arXiv:2103.06561 (2021).
[21]
Mirella Lapata. 2006. Automatic Evaluation of Information Ordering: Kendall's Tau. Computational Linguistics (2006).
[22]
Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld. 2008. Learning realistic human actions from movies. In 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1--8.
[23]
Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. 2022. Autoregressive image generation using residual quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11523--11532.
[24]
Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L Berg, Mohit Bansal, and Jingjing Liu. 2021. Less is more: Clipbert for video-and-language learning via sparse sampling. In CVPR. 7331--7341.
[25]
Yitong Li, Zhe Gan, Yelong Shen, Jingjing Liu, Yu Cheng, Yuexin Wu, Lawrence Carin, David Carlson, and Jianfeng Gao. 2019. StoryGAN: A Sequential Conditional GAN for Story Visualization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[26]
Yu Lu, Feiyue Ni, Haofan Wang, Xiaofeng Guo, Linchao Zhu, Zongxin Yang, Ruihua Song, Lele Cheng, and Yi Yang. 2023. Show Me a Video: A Large-Scale Narrated Video Dataset for Coherent Story Illustration. IEEE Transactions on Multimedia (2023), 1--12. https://doi.org/10.1109/TMM.2023.3296944
[27]
Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. 2019. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV. 2630--2640.
[28]
Yingwei Pan, Zhaofan Qiu, Ting Yao, Houqiang Li, and Tao Mei. 2017. To create what you tell: Generating videos from captions. In Proceedings of the 25th ACM international conference on Multimedia. 1789--1798.
[29]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In ICML. PMLR, 8748--8763.
[30]
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 (2022).
[31]
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In International Conference on Machine Learning. PMLR, 8821--8831.
[32]
Anyi Rao, Jiaze Wang, Linning Xu, Xuekun Jiang, Qingqiu Huang, Bolei Zhou, and Dahua Lin. 2020a. A unified framework for shot type classification based on subject centric lens. In European Conference on Computer Vision. Springer, 17--34.
[33]
Anyi Rao, Linning Xu, Yu Xiong, Guodong Xu, Qingqiu Huang, Bolei Zhou, and Dahua Lin. 2020b. A Local-to-Global Approach to Multi-Modal Movie Scene Segmentation. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10143--10152.
[34]
Hareesh Ravi, Lezi Wang, Carlos Manuel Muñiz, Leonid Sigal, Dimitris N. Metaxas, and Mubbasir Kapadia. 2018. Show Me a Story: Towards Coherent Neural Story Illustration. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7613--7621.
[35]
Ali Razavi, Aaron van den Oord, and Oriol Vinyals. 2019. Generating Diverse High-Fidelity Images with VQ-VAE-2. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2019/file/5f8e2fa1718d1bbcadf1cd9c7a54fb8c-Paper.pdf
[36]
Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. 2016. Generative adversarial text to image synthesis. In International conference on machine learning. PMLR, 1060--1069.
[37]
Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, and Bernt Schiele. 2017. Movie description. International Journal of Computer Vision, Vol. 123, 1 (2017), 94--120.
[38]
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Raphael Gontijo Lopes, et al. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv preprint arXiv:2205.11487 (2022).
[39]
Gabriel S Simões, Jônatas Wehrmann, Rodrigo C Barros, and Duncan D Ruiz. 2016. Movie genre classification with convolutional neural networks. In 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 259--266.
[40]
Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. 2022. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 (2022).
[41]
Mattia Soldan, Alejandro Pardo, Juan León Alcázar, Fabian Caba, Chen Zhao, Silvio Giancola, and Bernard Ghanem. 2022. Mad: A scalable dataset for language grounding in videos from movie audio descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5026--5035.
[42]
Yidan Sun, Qin Chao, and Boyang Li. 2022a. Synopses of Movie Narratives: a Video-Language Dataset for Story Understanding. arXiv preprint arXiv:2203.05711 (2022).
[43]
Yuchong Sun, Hongwei Xue, Ruihua Song, Bei Liu, Huan Yang, and Jianlong Fu. 2022b. Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 38032--38045. https://proceedings.neurips.cc/paper_files/paper/2022/file/f8290ccc2905538be1a7f7914ccef629-Paper-Conference.pdf
[44]
Atousa Torabi, Christopher Pal, Hugo Larochelle, and Aaron Courville. 2015. Using descriptive video services to create a large data source for video annotation research. arXiv preprint arXiv:1503.01070 (2015).
[45]
Aaron Van Den Oord, Oriol Vinyals, et al. 2017. Neural discrete representation learning. Advances in neural information processing systems, Vol. 30 (2017).
[46]
Hee Lin Wang and Loong Fah Cheong. 2009. Taxonomy of Directing Semantics for Film Shot Classification. IEEE Transactions on Circuits and Systems for Video Technology, Vol. 19 (2009), 1529--1542.
[47]
Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. 2019. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4581--4591.
[48]
Chenfei Wu, Jian Liang, Xiaowei Hu, Zhe Gan, Jianfeng Wang, Lijuan Wang, Zicheng Liu, Yuejian Fang, and Nan Duan. 2022. Nuwa-infinity: Autoregressive over autoregressive generation for infinite visual synthesis. arXiv preprint arXiv:2207.09814 (2022).
[49]
Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. 2022. Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions. In CVPR.
[50]
Hongwei Xue, Yupan Huang, Bei Liu, Houwen Peng, Jianlong Fu, Houqiang Li, and Jiebo Luo. 2021. Probing inter-modality: Visual parsing with self-attention for vision-and-language pre-training. In NeurIPS, Vol. 34. 4514--4528.
[51]
Hongwei Xue, Yuchong Sun, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang Li, and Jiebo Luo. 2023. CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment. arXiv:2209.06430 [cs.CV]
[52]
Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. 2017. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE international conference on computer vision. 5907--5915.
[53]
Howard Zhou, Tucker Hermans, Asmita V Karandikar, and James M Rehg. 2010. Movie genre classification via scene categorization. In Proceedings of the 18th ACM international conference on Multimedia. 747--750.
[54]
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
[55]
Yutao Zhu, Ruihua Song, Jian-Yun Nie, Pan Du, Zhicheng Dou, and Jin Zhou. 2022. Leveraging Narrative to Generate Movie Script. ACM Transactions on Information Systems (TOIS), Vol. 40, 4 (2022), 1--32.


      Published In

      MM '23: Proceedings of the 31st ACM International Conference on Multimedia
      October 2023
      9913 pages
      ISBN: 9798400701085
      DOI: 10.1145/3581783

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 27 October 2023

      Author Tags

      1. datasets
      2. movie
      3. storyboard
      4. synopsis

      Qualifiers

      • Research-article

      Funding Sources

      • The Fundamental Research Funds for the Central Universities, and the Research Funds of Renmin University of China
      • National Natural Science Foundation of China
      • Bilibili Inc. - Research on Artificial Intelligence Assisted Storyboarding

      Conference

      MM '23
      Sponsor:
      MM '23: The 31st ACM International Conference on Multimedia
      October 29 - November 3, 2023
      Ottawa ON, Canada

      Acceptance Rates

      Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
