research-article

Toward Human Perception-Centric Video Thumbnail Generation

Published: 27 October 2023

Abstract

Video thumbnails play an essential role in summarizing video content into a compact and concise image that users can browse efficiently. However, automatically generating attractive and informative video thumbnails remains an open problem due to the difficulty of formulating human aesthetic perception and the scarcity of paired training data. This work proposes a novel Human Perception-Centric Video Thumbnail Generation (HPCVTG) framework to address these challenges. Specifically, our framework first generates a set of thumbnails using a principle-based system that conforms to established aesthetic and human perception principles, such as visual balance in the layout and the avoidance of overlapping elements. Then, rather than designing thumbnails from scratch, we ask human annotators to evaluate some of these thumbnails and select their preferred ones. A Transformer-based Variational Auto-Encoder (VAE) model is first pre-trained with Model-Agnostic Meta-Learning (MAML) and then fine-tuned on these human-selected thumbnails. Combining the MAML pre-training paradigm with human feedback during training can reduce human involvement and make the training process more efficient. Extensive experimental results show that our HPCVTG framework outperforms existing methods in both objective and subjective evaluations, highlighting its potential to improve the user experience when browsing videos and to inspire future research in human perception-centric content generation tasks. The code and dataset will be released via https://github.com/yangtao2019yt/HPCVTG.
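The abstract names two layout principles, visual balance and avoiding overlapping elements, without giving their formulas. As a rough illustration only, the sketch below shows one way such principles could be turned into a score for ranking candidate layouts. The `Box` type, the two penalty definitions, the weights, and every function name are hypothetical choices made for this example and are not taken from the HPCVTG framework.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Hypothetical element box in normalized canvas coordinates (0..1)."""
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height

    @property
    def area(self) -> float:
        return self.w * self.h

    @property
    def center_x(self) -> float:
        return self.x + self.w / 2


def overlap_area(a: Box, b: Box) -> float:
    """Intersection area of two boxes; 0.0 when they do not intersect."""
    dx = min(a.x + a.w, b.x + b.w) - max(a.x, b.x)
    dy = min(a.y + a.h, b.y + b.h) - max(a.y, b.y)
    return dx * dy if dx > 0 and dy > 0 else 0.0


def overlap_penalty(boxes: list[Box]) -> float:
    """Sum of pairwise intersection areas divided by total element area."""
    total = sum(b.area for b in boxes)
    if total == 0:
        return 0.0
    return sum(overlap_area(a, b) for a, b in combinations(boxes, 2)) / total


def balance_penalty(boxes: list[Box]) -> float:
    """Horizontal offset of the area-weighted centroid from the canvas midline."""
    total = sum(b.area for b in boxes)
    if total == 0:
        return 0.0
    centroid_x = sum(b.center_x * b.area for b in boxes) / total
    return abs(centroid_x - 0.5)


def layout_score(boxes: list[Box], w_overlap: float = 1.0, w_balance: float = 1.0) -> float:
    """Higher is better: the negated weighted sum of the two penalties."""
    return -(w_overlap * overlap_penalty(boxes) + w_balance * balance_penalty(boxes))


def rank_layouts(candidates: list[list[Box]]) -> list[list[Box]]:
    """Order candidate layouts from best to worst under layout_score."""
    return sorted(candidates, key=layout_score, reverse=True)


# Two toy candidates: one symmetric and non-overlapping, one crowded top-left.
balanced = [Box(0.1, 0.1, 0.3, 0.2), Box(0.6, 0.1, 0.3, 0.2)]
crowded = [Box(0.0, 0.0, 0.4, 0.4), Box(0.2, 0.2, 0.4, 0.4)]
best = rank_layouts([crowded, balanced])[0]
```

In this toy setup the symmetric layout ranks first because both of its penalties are close to zero. A pipeline like the one described would presumably pass the top-ranked candidates to human annotators, but that step is outside the scope of this sketch.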


Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN:9798400701085
DOI:10.1145/3581783
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. few-shot learning
  2. human preference
  3. variational auto-encoder
  4. video thumbnail

Qualifiers

  • Research-article

Funding Sources

  • Tencent PCG

Conference

MM '23
MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
