Deep Unsupervised Key Frame Extraction for Efficient Video Classification

Published: 25 February 2023

Abstract

Video processing and analysis have become urgent tasks, as a huge number of videos (e.g., on YouTube and Hulu) are uploaded online every day. Extracting representative key frames from videos is important in video processing and analysis, since it greatly reduces computing resources and time. Although great progress has been made recently, large-scale video classification remains an open problem, as existing methods have not balanced performance and efficiency well. To tackle this problem, this work presents an unsupervised method for extracting key frames that combines a convolutional neural network with temporal segment density peaks clustering. The proposed temporal segment density peaks clustering is a generic and powerful framework with two advantages over previous works: it determines the number of key frames automatically, and it preserves the temporal information of the video. It thereby improves the efficiency of video classification. Furthermore, a long short-term memory network is added on top of the convolutional neural network to further improve classification performance, and a weight fusion strategy for networks with different inputs is presented to boost performance further. By optimizing video classification and key frame extraction jointly, we achieve better classification performance and higher efficiency. We evaluate our method on two popular datasets (i.e., HMDB51 and UCF101), and the experimental results consistently demonstrate that our strategy achieves competitive performance and efficiency compared with state-of-the-art approaches.
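
The abstract summarizes, but does not spell out, the clustering step. As a rough illustration, the following is a minimal sketch of plain density peaks clustering (Rodriguez and Laio, 2014) applied to per-frame CNN features; the paper's temporal segment variant additionally partitions the video before clustering, which is not shown here. The function name, the Gaussian-kernel density estimate, the 2% cutoff heuristic, and the outlier rule for choosing the number of key frames are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def density_peaks_keyframes(features, d_c=None, top_k=None):
    """Pick key frames from per-frame CNN features via density peaks
    clustering: frames with both high local density (rho) and high
    separation from any denser frame (delta) become cluster centers,
    i.e., key frames. Hypothetical sketch, not the paper's TSDPC."""
    features = np.asarray(features, dtype=float)
    n = features.shape[0]
    # Pairwise Euclidean distances between frame features.
    dist = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    if d_c is None:
        # Cutoff heuristic: ~2% of all frame pairs count as "close";
        # guard against degenerate all-identical frames.
        d_c = max(float(np.percentile(dist[np.triu_indices(n, k=1)], 2)), 1e-8)
    # Local density via a Gaussian kernel (subtract the self-term).
    rho = np.exp(-(dist / d_c) ** 2).sum(axis=1) - 1.0
    # delta_i: distance to the nearest frame with higher density;
    # the densest frame gets the global maximum distance by convention.
    order = np.argsort(-rho)
    delta = np.empty(n)
    delta[order[0]] = dist.max()
    for rank in range(1, n):
        i = order[rank]
        delta[i] = dist[i, order[:rank]].min()
    # gamma = rho * delta scores candidate centers; absent a given top_k,
    # take the outliers of the gamma distribution as the key frame count.
    gamma = rho * delta
    if top_k is None:
        top_k = max(1, int((gamma > gamma.mean() + 2 * gamma.std()).sum()))
    # Return indices in temporal order so the frame sequence is preserved.
    return np.sort(np.argsort(-gamma)[:top_k])

# Example: 120 frames of 512-dim pooled CNN features.
feats = np.random.rand(120, 512)
print(density_peaks_keyframes(feats))
```

The weight fusion mentioned in the abstract can likewise be pictured as a weighted average of the classification scores produced by networks with different inputs, with the weights tuned on held-out data; the paper's exact scheme may differ.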

      Published In

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 3 (May 2023), 514 pages
      ISSN: 1551-6857
      EISSN: 1551-6865
      DOI: 10.1145/3582886
      Editor: Abdulmotaleb El Saddik

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 25 February 2023
      Online AM: 12 December 2022
      Accepted: 06 November 2022
      Revised: 17 September 2022
      Received: 18 April 2022
      Published in TOMM Volume 19, Issue 3


      Author Tags

      1. Key frame extraction
      2. density peaks clustering
      3. LSTM
      4. weight fusion
      5. unsupervised learning
      6. video classification

      Qualifiers

      • Research-article

      Funding Sources

      • PRIN project PREVUE
      • EU H2020 project AI4Media
