Visual Subtitle Feature Enhanced Video Outline Generation

Lv, Qi; Cao, Ziqiang; Xie, Wenrui; Wang, Derui; Wang, Jingwen; Hu, Zhiwei; Zhang, Tangkun; Yuan, Ba; Li, Yuanhang; Cao, Min; Li, Wenjie; Li, Sujian; Fu, Guohong

arXiv:2208.11307 (cs)

[Submitted on 24 Aug 2022 (v1), last revised 1 Sep 2022 (this version, v2)]

Abstract: With the rapidly growing number of videos, there is a great demand for techniques that help people quickly navigate to the video segments they are interested in. However, current work on video understanding mainly focuses on video content summarization, while little effort has been made to explore the structure of a video. Inspired by textual outline generation, we introduce a novel video understanding task, namely video outline generation (VOG). This task is defined to contain two sub-tasks: (1) segmenting the video according to its content structure and then (2) generating a heading for each segment. To learn and evaluate VOG, we annotate a 10k+ dataset, called DuVOG. Specifically, we use OCR tools to recognize the subtitles of videos. Annotators are then asked to divide the subtitles into chapters and title each chapter. In videos, highlighted text tends to be the headline, since it is more likely to attract attention. Therefore, we propose a Visual Subtitle feature Enhanced video outline generation model (VSENet), which takes as input the textual subtitles together with their visual font sizes and positions. We treat the VOG task as a sequence tagging problem that extracts the spans where the headings are located and then rewrites them to form the final outlines. Furthermore, based on the similarity between video outlines and textual outlines, we use a large number of articles with chapter headings to pretrain our model. Experiments on DuVOG show that our model substantially outperforms other baseline methods, achieving an F1-score of 77.1 at the video segmentation level and a ROUGE-L_F0.5 of 85.0 at the headline generation level.
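
The abstract frames VOG as sequence tagging over subtitles: tagged spans mark where headings sit, and those headings also delimit the segments. The Python sketch below illustrates one way such tags could be decoded into an outline. It is only an illustration under stated assumptions, not the VSENet implementation: the B/I/O tag scheme, the function names `extract_heading_spans` and `build_outline`, and the rule that a segment runs from one heading to the next are all hypothetical, and the rewriting step mentioned in the abstract is omitted.

```python
from typing import List, Tuple


def extract_heading_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges of heading spans.

    Assumes a B/I/O scheme (an assumption, not taken from the paper):
    "B" begins a heading, "I" continues it, "O" is ordinary subtitle text.
    """
    spans: List[Tuple[int, int]] = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:      # close a heading that was still open
                spans.append((start, i))
            start = i
        elif tag == "I":
            if start is None:          # tolerate an "I" with no preceding "B"
                start = i
        else:                          # "O" closes any open heading
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:              # heading runs to the end of the input
        spans.append((start, len(tags)))
    return spans


def build_outline(subtitles: List[str], tags: List[str]) -> List[Tuple[str, int, int]]:
    """Turn tagged subtitles into (heading, segment_start, segment_end) triples.

    Assumption: each segment starts at its heading and ends where the next
    heading starts (or at the end of the video's subtitles).
    """
    if len(subtitles) != len(tags):
        raise ValueError("subtitles and tags must have the same length")
    spans = extract_heading_spans(tags)
    outline = []
    for k, (start, end) in enumerate(spans):
        seg_end = spans[k + 1][0] if k + 1 < len(spans) else len(subtitles)
        heading = " ".join(subtitles[start:end])
        outline.append((heading, start, seg_end))
    return outline


# Hypothetical example input: six OCR'd subtitle lines and their predicted tags.
example_subtitles = [
    "How to bake bread",
    "today we start with flour",
    "and water",
    "mix until smooth",
    "Kneading the dough",
    "fold it over itself",
]
example_tags = ["B", "O", "O", "O", "B", "O"]
example_outline = build_outline(example_subtitles, example_tags)
```

In a fuller pipeline, the tags would come from a model that also sees each subtitle's font size and position, and each extracted heading would be passed to a rewriting step before being emitted.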

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2208.11307 [cs.CV]
	(or arXiv:2208.11307v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2208.11307

Submission history

From: Qi Lv
[v1] Wed, 24 Aug 2022 05:26:26 UTC (19,813 KB)
[v2] Thu, 1 Sep 2022 07:26:33 UTC (19,813 KB)
