Make-A-Story: Visual Memory Conditioned Consistent Story Generation

Rahman, Tanzila; Lee, Hsin-Ying; Ren, Jian; Tulyakov, Sergey; Mahajan, Shweta; Sigal, Leonid

arXiv:2211.13319 (cs)

[Submitted on 23 Nov 2022 (v1), last revised 6 May 2023 (this version, v3)]

Abstract: There has been a recent explosion of impressive generative models that can produce high-quality images (or videos) conditioned on text descriptions. However, all such approaches rely on conditional sentences that contain unambiguous descriptions of scenes and the main actors in them. Employing such models for the more complex task of story visualization therefore remains a challenge: references and co-references arise naturally, and one must reason about when to maintain consistency of actors and backgrounds across frames/scenes, and when not to, based on story progression. In this work, we address these challenges and propose a novel autoregressive diffusion-based framework with a visual memory module that implicitly captures the actor and background context across the generated frames. Sentence-conditioned soft attention over the memories enables effective reference resolution and learns to maintain scene and actor consistency when needed. To validate the effectiveness of our approach, we extend the MUGEN dataset and introduce additional characters, backgrounds and referencing in multi-sentence storylines. Our experiments for story generation on the MUGEN, PororoSV and FlintstonesSV datasets show that our method not only outperforms the prior state of the art in generating frames with high visual quality that are consistent with the story, but also models appropriate correspondences between the characters and the background.
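
The sentence-conditioned soft attention over a visual memory mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain scaled dot-product attention, and the function name `soft_attention`, the toy two-dimensional vectors, and the key/value split of the memory are all hypothetical choices made for illustration. In the paper's setting the query would come from the current sentence and the memory entries from previously generated frames.

```python
import math


def soft_attention(query, keys, values):
    """Scaled dot-product soft attention of one query over a memory.

    Illustrative assumption: `query` stands for a sentence embedding,
    and `keys`/`values` stand for features of earlier frames.
    """
    if not keys:
        raise ValueError("memory is empty")
    scale = math.sqrt(len(query))
    # Similarity between the sentence query and each memory key.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Numerically stable softmax turns scores into attention weights.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The context vector is the attention-weighted sum of memory values.
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context


# Toy memory with two earlier frames (hypothetical numbers).
memory_keys = [[1.0, 0.0], [0.0, 1.0]]
memory_values = [[10.0, 0.0], [0.0, 10.0]]
# A sentence query that points toward the first frame.
sentence_query = [4.0, 0.0]

weights, context = soft_attention(sentence_query, memory_keys, memory_values)
print(weights, context)
```

Because the weights are soft rather than a hard selection, a query that matches no earlier frame strongly yields a blended context, which is one plausible reading of how such a model could decide when to carry an actor or background forward and when not to.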

Comments:	11 pages
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2211.13319 [cs.CV]
	(or arXiv:2211.13319v3 [cs.CV] for this version)
 	https://doi.org/10.48550/arXiv.2211.13319

Submission history

From: Tanzila Rahman
[v1] Wed, 23 Nov 2022 21:38:51 UTC (13,269 KB)
[v2] Sat, 31 Dec 2022 10:17:03 UTC (39,680 KB)
[v3] Sat, 6 May 2023 02:25:31 UTC (25,278 KB)
