research-article

Toward Human Perception-Centric Video Thumbnail Generation

Authors:

Changwen Chen

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

Pages 6653 - 6664

https://doi.org/10.1145/3581783.3612434

Published: 27 October 2023

Abstract

Video thumbnails play an essential role in summarizing video content into a compact, concise image that users can browse efficiently. However, automatically generating attractive and informative video thumbnails remains an open problem because human aesthetic perception is difficult to formulate and paired training data is scarce. This work proposes a novel Human Perception-Centric Video Thumbnail Generation (HPCVTG) framework to address these challenges. Specifically, our framework first generates a set of thumbnails using a principle-based system that conforms to established aesthetic and human perception principles, such as visual balance in the layout and avoiding overlapping elements. Then, rather than designing thumbnails from scratch, we ask human annotators to evaluate some of these thumbnails and select their preferred ones. A Transformer-based Variational Auto-Encoder (VAE) model is first pre-trained with Model-Agnostic Meta-Learning (MAML) and then fine-tuned on these human-selected thumbnails. Combining the MAML pre-training paradigm with human feedback can reduce human involvement and make the training process more efficient. Extensive experimental results show that our HPCVTG framework outperforms existing methods in both objective and subjective evaluations, highlighting its potential to improve the user experience when browsing videos and to inspire future research in human perception-centric content generation tasks. The code and dataset will be released via https://github.com/yangtao2019yt/HPCVTG.
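To make the principle-based stage concrete, the following is a minimal sketch of how candidate layouts could be ranked by the two principles the abstract names: avoiding overlapping elements and visual balance. It is not the authors' implementation. The `Box` representation, the penalty definitions, the equal weighting of the two penalties, and every function name are assumptions introduced purely for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Axis-aligned layout element in pixels (hypothetical representation)."""
    x: float
    y: float
    w: float
    h: float

    @property
    def area(self) -> float:
        return self.w * self.h

    @property
    def center(self) -> tuple[float, float]:
        return (self.x + self.w / 2, self.y + self.h / 2)


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes, or 0.0 if they are disjoint."""
    dx = min(a.x + a.w, b.x + b.w) - max(a.x, b.x)
    dy = min(a.y + a.h, b.y + b.h) - max(a.y, b.y)
    return dx * dy if dx > 0 and dy > 0 else 0.0


def overlap_penalty(layout: list[Box]) -> float:
    """Total pairwise overlap, normalised by the summed element area."""
    total = sum(b.area for b in layout)
    if total == 0:
        return 0.0
    return sum(overlap_area(a, b) for a, b in combinations(layout, 2)) / total


def balance_penalty(layout: list[Box], canvas_w: float, canvas_h: float) -> float:
    """Offset of the area-weighted centroid from the canvas centre.

    Each axis is normalised by half the canvas size and the two are averaged.
    This is an assumed proxy for "visual balance", not a published metric.
    """
    total = sum(b.area for b in layout)
    if total == 0:
        return 0.0
    cx = sum(b.center[0] * b.area for b in layout) / total
    cy = sum(b.center[1] * b.area for b in layout) / total
    dx = abs(cx - canvas_w / 2) / (canvas_w / 2)
    dy = abs(cy - canvas_h / 2) / (canvas_h / 2)
    return (dx + dy) / 2


def score(layout: list[Box], canvas_w: float, canvas_h: float) -> float:
    """Higher is better; the equal weighting of penalties is an assumption."""
    return -(overlap_penalty(layout) + balance_penalty(layout, canvas_w, canvas_h))


def rank_layouts(
    candidates: list[list[Box]], canvas_w: float, canvas_h: float
) -> list[list[Box]]:
    """Sort candidate layouts from best to worst under the sketch's score."""
    return sorted(candidates, key=lambda c: score(c, canvas_w, canvas_h), reverse=True)


# Two toy candidates on a 100 x 100 canvas.
balanced = [Box(10, 40, 20, 20), Box(70, 40, 20, 20)]  # symmetric, no overlap
cluttered = [Box(0, 0, 20, 20), Box(10, 10, 20, 20)]   # overlapping, off-centre
best = rank_layouts([cluttered, balanced], 100, 100)[0]
```

In a pipeline like the one described, the top-ranked candidates from such a scorer would be the ones shown to human annotators, whose selections then serve as fine-tuning data for the generative model.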

Cited By

Ali M, Im E, Kim D, Kim T. (2024). Harnessing Meta-Learning for Improving Full-Frame Video Stabilization. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 12605-12614. Online publication date: 16-Jun-2024.
https://doi.org/10.1109/CVPR52733.2024.01198

Index Terms

Toward Human Perception-Centric Video Thumbnail Generation
1. Applied computing
  1. Arts and humanities
    1. Media arts
2. Human-centered computing
  1. Visualization
    1. Visualization design and evaluation methods

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

October 2023

9913 pages

ISBN: 9798400701085

DOI: 10.1145/3581783

General Chairs:
Abdulmotaleb El Saddik (University of Ottawa, Canada & MBZUAI, UAE)
Tao Mei (HiDream.ai, China)
Rita Cucchiara (University of Modena and Reggio Emilia, Italy)

Program Chairs:
Marco Bertini (University of Florence, Italy)
Diana Patricia Tobon Vallejo (Universidad de Medellin, Colombia)
Pradeep K. Atrey (University at Albany, State University of New York, USA)
M. Shamim Hossain (King Saud University, KSA)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Tencent PCG

Conference

MM '23

Sponsor:

SIGMM

MM '23: The 31st ACM International Conference on Multimedia

October 29 - November 3, 2023

Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
