DriveGenVLM: Real-world Video Generation for Vision Language Model based Autonomous Driving

Fu, Yongjie; Jain, Anmol; Di, Xuan; Chen, Xu; Mo, Zhaobin

arXiv:2408.16647 (cs)

[Submitted on 29 Aug 2024]

Abstract: The advancement of autonomous driving technologies necessitates increasingly sophisticated methods for understanding and predicting real-world scenarios. Vision language models (VLMs) are emerging as tools with significant potential to influence autonomous driving. In this paper, we propose the DriveGenVLM framework to generate driving videos and use VLMs to understand them. To achieve this, we employ a video generation framework grounded in denoising diffusion probabilistic models (DDPM) aimed at predicting real-world video sequences. We then explore the adequacy of our generated videos for use in VLMs by employing a pre-trained model known as Efficient In-context Learning on Egocentric Videos (EILEV). The diffusion model is trained on the Waymo Open Dataset and evaluated using the Fréchet Video Distance (FVD) score to assess the quality and realism of the generated videos. EILEV provides corresponding narrations for these generated videos, which may be beneficial in the autonomous driving domain: the narrations can enhance traffic scene understanding, aid in navigation, and improve planning capabilities. The integration of video generation with VLMs in the DriveGenVLM framework represents a step forward in leveraging advanced AI models to address complex challenges in autonomous driving.
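
To make the FVD evaluation mentioned in the abstract more concrete, the sketch below computes a Fréchet distance between two sets of feature vectors. It is an illustrative simplification, not the paper's evaluation code: FVD in practice compares features extracted from real and generated videos by a pretrained video network and uses full covariance matrices, whereas this sketch assumes diagonal covariances so that it needs only the standard library. The function name `frechet_distance_diag` and the toy feature lists are hypothetical.

```python
import math
from statistics import fmean, pvariance


def frechet_distance_diag(feats_real, feats_gen):
    """Squared Frechet distance between two Gaussians fitted to feature sets.

    Assumption (not from the paper): covariances are diagonal, so the trace
    term Tr(S1 + S2 - 2 (S1 S2)^(1/2)) reduces to sum_d (sd1_d - sd2_d)^2.
    Each argument is a list of equal-length feature vectors.
    """
    dims = len(feats_real[0])
    total = 0.0
    for d in range(dims):
        real_col = [f[d] for f in feats_real]
        gen_col = [f[d] for f in feats_gen]
        # Per-dimension Gaussian statistics (population variance).
        mu_r, mu_g = fmean(real_col), fmean(gen_col)
        sd_r = math.sqrt(pvariance(real_col))
        sd_g = math.sqrt(pvariance(gen_col))
        # Mean term plus the diagonal-covariance trace term.
        total += (mu_r - mu_g) ** 2 + (sd_r - sd_g) ** 2
    return total


# Toy features standing in for video embeddings (hypothetical values).
real = [[0.0, 0.0], [2.0, 0.0]]
generated = [[1.0, 0.0], [3.0, 0.0]]
score = frechet_distance_diag(real, generated)
```

A lower score means the two feature distributions are closer. Identical feature sets give a distance of zero, and in the toy example only the mean of the first dimension differs.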

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2408.16647 [cs.CV]
	(or arXiv:2408.16647v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2408.16647

Submission history

From: Yongjie Fu
[v1] Thu, 29 Aug 2024 15:52:56 UTC (1,623 KB)
