Video Skimming: Taxonomy and Comprehensive Survey

Published: 13 September 2019


Video skimming, also known as dynamic video summarization, generates a temporally abridged version of a given video. Skimming can be achieved by identifying significant components in either uni-modal or multi-modal features extracted from the video. Because a skim is dynamic and preserves temporal connectivity, it conveys the content of the video better than a static summary can. Owing to this advantage, and to the ready availability of the required computing resources, video skimming has recently drawn the attention of many researchers. In this article, we provide a comprehensive survey of video skimming, focusing on the substantial body of literature from the past decade. We present a taxonomy of video skimming approaches and discuss their evolution, highlighting key advances. We also study the components required to evaluate the performance of a video skimming method.

Supplementary Material

a106-k-suppl.pdf
Supplemental movie, appendix, image, and software files for Video Skimming: Taxonomy and Comprehensive Survey


{n.d.}. MuVee autoproducer. Retrieved from
{n.d.}. Power director software. Retrieved from
Jurandy Almeida, Neucimar J. Leite, and Ricardo da S. Torres. 2013. Online video summarization on compressed domain. Journal of Visual Communication and Image Representation 24, 6 (2013), 729--738.
Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International The Semantic Web and 2nd Asian Conference on Asian Semantic Web Conference (ISWC’07/ASWC’07). Springer-Verlag, Berlin, 722--735.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research 3, Jan (2003), 993--1022.
Ali Borji and Laurent Itti. 2013. State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 1 (2013), 185--207.
H. Boukadida, S. A. Berrani, and P. Gros. 2016. Automatically creating adaptive video summaries using constraint satisfaction programming: Application to sport content. IEEE Transactions on Circuits and Systems for Video Technology 27, 4 (April 2017), 920--934.
P. Bouthemy, M. Gelgon, and F. Ganansia. 1999. A unified approach to shot change detection and camera motion characterization. IEEE Transactions on Circuits and Systems for Video Technology 9, 7 (Oct 1999), 1030--1044.
Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 335--336.
Bo-Wei Chen, Jia-Ching Wang, and Jhing-Fa Wang. 2009. A novel video summarization based on mining the story-structure and semantic relations among concept entities. IEEE Transactions on Multimedia 11, 2 (2009), 295--312.
F. Chen, C. De Vleeschouwer, H. D. Barrobés, J. G. Escalada, and D. Conejero. 2010. Automatic summarization of audio-visual soccer feeds. In Proceedings of the IEEE International Conference on Multimedia and Expo. 837--842.
F. Chen, C. De Vleeschouwer, and A. Cavallaro. 2014. Resource allocation for personalized video summarization. IEEE Transactions on Multimedia 16, 2 (Feb 2014), 455--469.
Liang-Hua Chen, Chih-Wen Su, Hong-Yuan Mark Liao, and Chun-Chieh Shih. 2003. On the preview of digital movies. Journal of Visual Communication and Image Representation 14, 3 (2003), 358--368.
Kai-Yin Cheng, Sheng-Jie Luo, Bing-Yu Chen, and Hao-Hua Chu. 2009. SmartPlayer: User-centric video fast-forwarding. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 789--798.
W. S. Chu, Yale Song, and A. Jaimes. 2015. Video co-summarization: Video summarization by visual co-occurrence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3584--3592.
Cheng-Tao Chung, Hsin-Kuan Hsiung, Cheng-Kuang Wei, and Lin-shan Lee. 2014. Personalized video summarization based on multi-layered probabilistic latent semantic analysis with shared topics. In Proceedings of the 9th International Symposium on Chinese Spoken Language Processing. 173--177.
Yang Cong, Junsong Yuan, and Jiebo Luo. 2012. Towards scalable summarization of consumer videos via sparse dictionary selection. IEEE Transactions on Multimedia 14, 1 (2012), 66--75.
Adele Cutler and Leo Breiman. 1994. Archetypal analysis. Technometrics 36, 4 (1994), 338--347.
Chinh T. Dang and Hayder Radha. 2014. Heterogeneity image patch index and its application to consumer video summarization. IEEE Transactions on Image Processing 23, 6 (2014), 2704--2718.
F. Daniyal and A. Cavallaro. 2011. Multi-camera scheduling for video production. In Proceedings of the 2011 Conference for Visual Media Production. 11--20.
Kaveh Darabi and Gheorghita Ghinea. 2015. Personalized video summarization using sift. In Proceedings of the 30th Annual ACM Symposium on Applied Computing. 1252--1256.
A. G. del Molino, C. Tan, J. H. Lim, and A. H. Tan. 2017. Summarization of egocentric videos: A comprehensive survey. IEEE Transactions on Human-Machine Systems 47, 1 (Feb 2017), 65--76.
Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. 2014. Decaf: A deep convolutional activation feature for generic visual recognition. In Proceedings of the International Conference on Machine Learning. 647--655.
Pei Dong, Zhiyong Wang, Li Zhuo, and Dagan Feng. 2010. Video summarization with visual and semantic features. In Proceedings of the Advances in Multimedia Information Processing. Springer, 203--214.
Pei Dong, Yong Xia, Shanshan Wang, Li Zhuo, and David Dagan Feng. 2014. An iteratively reweighting algorithm for dynamic video summarization. Multimedia Tools and Applications 74, 21 (2014), 9449--9473.
H. Duxans, X. Anguera, and D. Conejero. 2009. Audio based soccer game summarization. In Proceedings of the IEEE International Symposium on Broadband Multimedia Systems and Broadcasting. 1--6.
E. Elhamifar and M. C. D. P. Kaluza. 2017. Online summarization via submodular and convex optimization. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1818--1826.
E. Elhamifar, G. Sapiro, and R. Vidal. 2012. See all by looking at a few: Sparse modeling for finding representative objects. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition. 1600--1607.
Georgios Evangelopoulos, Konstantinos Rapantzikos, Alexandros Potamianos, Petros Maragos, A. Zlatintsi, and Yair Avrithis. 2008. Movie summarization based on audiovisual saliency detection. In Proceedings of the 15th IEEE International Conference on Image Processing. 2528--2531.
Georgios Evangelopoulos, Athanasia Zlatintsi, Alexandros Potamianos, Petros Maragos, Konstantinos Rapantzikos, Georgios Skoumas, and Yannis Avrithis. 2013. Multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention. IEEE Transactions on Multimedia 15, 7 (2013), 1553--1568.
Georgios Evangelopoulos, Athanasia Zlatintsi, Georgios Skoumas, Konstantinos Rapantzikos, Alexandros Potamianos, Petros Maragos, and Y. Avrithis. 2009. Video event detection and summarization using audio, visual and text saliency. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. 3553--3556.
Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2010. A note on the group lasso and a sparse group lasso. arXiv preprint arXiv:1001.0736 (2010).
Simone Frintrop, Erich Rome, and Henrik I. Christensen. 2010. Computational visual attention systems and their cognitive foundations: A survey. ACM Transactions on Applied Perception 7, 1 (2010), 1--39.
Yanwei Fu, Yanwen Guo, Yanshu Zhu, Feng Liu, Chuanming Song, and Zhi-Hua Zhou. 2010. Multi-view video summarization. IEEE Transactions on Multimedia 12, 7 (2010), 717--729.
Lianli Gao, Peng Wang, Jingkuan Song, Zi Huang, Jie Shao, and Heng Tao Shen. 2017. Event video mashup: From hundreds of videos to minutes of skeleton. In Proceedings of the 31st AAAI Conference on Artificial Intelligence.
Yue Gao, Wei-Bo Wang, Jun-Hai Yong, and He-Jin Gu. 2009. Dynamic video summarization using two-level redundancy detection. Multimedia Tools and Applications 42, 2 (2009), 233--250.
Ana Garcia del Molino and Michael Gygli. 2018. PHD-GIFs: Personalized highlight detection for automatic GIF creation. In 2018 ACM Multimedia Conference on Multimedia Conference. ACM, 600--608.
Boqing Gong, Wei-Lun Chao, Kristen Grauman, and Fei Sha. 2014. Diverse sequential subset selection for supervised video summarization. In Advances in Neural Information Processing Systems, Vol. 2. 2069--2077.
Stephen R. Gulliver and Gheorghita Ghinea. 2006. Defining user perception of distributed multimedia quality. ACM Transactions on Multimedia Computing, Communications, and Applications 2, 4 (2006), 241--257.
Michael Gygli, Helmut Grabner, Hayko Riemenschneider, and Luc Van Gool. 2014. Creating summaries from user videos. In Proceedings of the European Conference on Computer Vision. Springer, 505--520.
Michael Gygli, Helmut Grabner, and Luc Van Gool. 2015. Video summarization by learning submodular mixtures of objectives. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3090--3098.
Michael Gygli, Yale Song, and Liangliang Cao. 2016. Video2gif: Automatic generation of animated gifs from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1001--1009.
Bohyung Han, Jihun Hamm, and Jack Sim. 2011. Personalized video summarization with human in the loop. In IEEE Workshop on Applications of Computer Vision. 51--57.
A. Hanjalic. 2002. Shot-boundary detection: Unraveled and resolved? IEEE Transactions on Circuits and Systems for Video Technology 12, 2 (Feb 2002), 90--105.
Hsuan-I Ho, Wei-Chen Chiu, and Yu-Chiang Frank Wang. 2018. Summarizing first-person videos from third persons’ points of view. In Proceedings of the European Conference on Computer Vision. 70--85.
Richang Hong, Jinhui Tang, Hung-Khoon Tan, Chong-Wah Ngo, Shuicheng Yan, and Tat-Seng Chua. 2011. Beyond search: Event-driven summarization for web videos. ACM Transactions on Multimedia Computing, Communications, and Applications 7, 4 (2011), 35:1--35:18.
Richang Hong, Jinhui Tang, Hung-Khoon Tan, Shuicheng Yan, Chongwah Ngo, and Tat-Seng Chua. 2009. Event driven summarization for web videos. In Proceedings of the 1st SIGMM Workshop on Social Media. 43--48.
Weiming Hu, Nianhua Xie, Li Li, Xianglin Zeng, and Stephen Maybank. 2011. A survey on visual content-based video indexing and retrieval. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 41, 6 (2011), 797--819.
Qian Huang, Zhu Liu, A. Rosenberg, D. Gibbon, and B. Shahraray. 1999. Automated generation of news content hierarchy by integrating audio, video, and text information. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing., Vol. 6. 3025--3028.
Zi Huang, Bo Hu, Hong Cheng, Heng Tao Shen, Hongyan Liu, and Xiaofang Zhou. 2010. Mining near-duplicate graph for cluster-based reranking of web video search results. ACM Transactions on Information Systems (TOIS) 28, 4 (2010), 22:1--22:27.
Peter J. Huber et al. 1964. Robust estimation of a location parameter. The Annals of Mathematical Statistics 35, 1 (1964), 73--101.
Zhong Ji, Yaru Ma, Yanwei Pang, and Xuelong Li. 2019. Query-aware sparse coding for web multi-video summarization. Information Sciences 478 (2019), 152--166.
Zhong Ji, Kailin Xiong, Yanwei Pang, and Xuelong Li. 2017. Video summarization with attention-based encoder-decoder networks. arXiv preprint arXiv:1708.09545 (2017).
Richard M. Jiang, Abdul H. Sadka, and Danny Crookes. 2009. Advances in Video Summarization and Skimming, Vol. 231. Springer, 27--50.
Yu-Gang Jiang, Chong-Wah Ngo, and Jun Yang. 2007. Towards optimal bag-of-features for object categorization and semantic video retrieval. In Proceedings of the 6th ACM International Conference on Image and Video Retrieval. 494--501.
Hideo Joho, Joemon M. Jose, Roberto Valenti, and Nicu Sebe. 2009. Exploiting facial expressions for affective video summarisation. In Proceedings of the ACM International Conference on Image and Video Retrieval. 31:1--31:8.
Narendra Jussien, Guillaume Rochart, and Xavier Lorca. 2008. Choco: An open source Java constraint programming library. In CPAIOR’08 Workshop on Open-Source Software for Integer and Constraint Programming. 1--10.
Sepandar D. Kamvar, Taher H. Haveliwala, Christopher D. Manning, and Gene H. Golub. 2003. Extrapolation methods for accelerating PageRank computations. In Proceedings of the 12th International Conference on World Wide Web. ACM, 261--270.
Rajkumar Kannan, Gheorghita Ghinea, and Sridhar Swaminathan. 2015. What do you wish to see? A summarization system for movies based on user preferences. Information Processing 8 Management 51, 3 (2015), 286--305.
E. Kasutani and A. Yamada. 2001. The MPEG-7 color layout descriptor: A compact image feature description for high-speed image/video segment retrieval. In Proceedings 2001 International Conference on Image Processing, Vol. 1. 674--677 vol.1.
Harish Katti, Karthik Yadati, Mohan Kankanhalli, and Chua Tat-Seng. 2011. Affective video summarization and story board generation using pupillary dilation and eye gaze. In Proceedings of the IEEE International Symposium on Multimedia. 319--326.
Steven M. Kay. 1998. Fundamentals of statistical signal processing: Detection theory. Prentice Hall PTR.
A. Khosla, R. Hamid, C. J. Lin, and N. Sundaresan. 2013. Large-scale video summarization using web-image priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2698--2705.
G. Kim, L. Sigal, and E. P. Xing. 2014. Joint summarization of large-scale collections of web images and videos for storyline reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4225--4232.
Irena Koprinska and Sergio Carrato. 2001. Temporal video segmentation: A survey. Signal Processing: Image Communication 16, 5 (2001), 477--500.
P. Koutras, A. Zlatintsi, E. Iosif, A. Katsamanis, P. Maragos, and A. Potamianos. 2015. Predicting audio-visual salient events based on visual, audio and text modalities for movie summarization. In Proceedings of the 2015 IEEE International Conference on Image Processing. 4361--4365.
S. K. Kuanar, K. B. Ranga, and A. S. Chowdhury. 2015. Multi-view video summarization using bipartite matching constrained optimum-path forest clustering. IEEE Transactions on Multimedia 17, 8 (Aug 2015), 1166--1173.
Alex Kulesza and Ben Taskar. 2011. Learning determinantal point processes. arXiv preprint arXiv:1202.3738 (2011).
Alex Kulesza, Ben Taskar, et al. 2012. Determinantal point processes for machine learning. Foundations and Trends® in Machine Learning 5, 2--3 (2012), 123--286.
Robert Laganière, Raphael Bacco, Arnaud Hocevar, Patrick Lambert, Grégory Païs, and Bogdan E Ionescu. 2008. Video summarization from spatio-temporal features. In Proceedings of the 2nd ACM TRECVid Video Summarization Workshop. 144--148.
X. Li, B. Zhao, and X. Lu. 2017. A general framework for edited video and raw video summarization. IEEE Transactions on Image Processing 26, 8 (Aug 2017), 3652--3664.
Ying Li, Shih-Hung Lee, Chia-Hung Yeh, and C. C. Jay Kuo. 2006. Techniques for movie content analysis and skimming: Tutorial and overview on video abstraction techniques. IEEE Signal Processing Magazine 23, 2 (2006), 79--89.
Yingbo Li, Bernard Merialdo, Mickael Rouvier, and Georges Linares. 2011. Static and dynamic video summaries. In Proceedings of the 19th ACM International Conference on Multimedia. 1573--1576.
Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop. 74--81.
Yen-Liang Lin, Vlad I. Morariu, and Winston Hsu. 2015. Summarizing while recording: Context-based highlight detection for egocentric videos. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 51--59.
C. Liu, J. Yuen, and A. Torralba. 2011. SIFT flow: Dense correspondence across scenes and its applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 5 (May 2011), 978--994.
David G. Lowe. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 2 (2004), 91--110.
Lie Lu, Hong-Jiang Zhang, and Stan Z. Li. 2003. Content-based audio classification and segmentation by using support vector machines. Multimedia Systems 8, 6 (2003), 482--492.
Z. Lu and K. Grauman. 2013. Story-driven summarization for egocentric video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2714--2721.
Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015).
Yu-Fei Ma, Lie Lu, Hong-Jiang Zhang, and Mingjing Li. 2002. A user attention model for video summarization. In Proceedings of the 10th ACM International Conference on Multimedia. 533--542.
Yu-Fei Ma and Hong-Jiang Zhang. 2002. A model of motion attention for video skimming. In Proceedings of International Conference on Image Processing, Vol. 1. I--129--I--132.
B. Mahasseni, M. Lam, and S. Todorovic. 2017. Unsupervised video summarization with adversarial LSTM networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2982--2991.
S. Marvaniya, M. Damoder, V. Gopalakrishnan, K. N. Iyer, and K. Soni. 2016. Real-time video summarization on mobile. In Proceedings of the IEEE International Conference on Image Processing. 176--180.
Brian W. Matthews. 1975. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure 405, 2 (1975), 442--451.
Irfan Mehmood, Muhammad Sajjad, Seungmin Rho, and Sung Wook Baik. 2016. Divide-and-conquer based summarization framework for extracting affective video content. Neurocomputing 174 (2016), 393--403.
Tao Mei, Lin-Xie Tang, Jinhui Tang, and Xian-Sheng Hua. 2013. Near-lossless semantic video summarization and its applications to video analysis. ACM Transactions on Multimedia Computing Communications and Applications 9, 3 (July 2013), 16:1--16:23.
J. Meng, H. Wang, J. Yuan, and Y. P. Tan. 2016. From keyframes to key objects: Video summarization by representative object proposal selection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. 1039--1048.
A. Mitra, S. Biswas, and C. Bhattacharyya. 2017. Bayesian modeling of temporal coherence in videos for entity discovery and summarization. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 3 (March 2017), 430--443.
Arthur G. Money and Harry Agius. 2008. Video summarisation: A conceptual framework and survey of the state of the art. Journal of Visual Communication and Image Representation 19, 2 (2008), 121--143.
Arthur G. Money and Harry Agius. 2010. Elvis: Entertainment-led video summaries. ACM Transactions on Multimedia Computing, Communications, and Applications 6, 3 (2010), 17:1--17:30.
Jeho Nam and Ahmed H. Tewfik. 1999. Dynamic video summarization and visualization. In Proceedings of the 7th ACM International Conference on Multimedia (Part 2). 53--56.
Apostol Natsev, John R. Smith, Jelena Tešié, Lexing Xie, and Rong Yan. 2008. IBM multimedia analysis and retrieval system. In Proceedings of the 2008 ACM International Conference on Content-based Image and Video Retrieval. 553--554.
Chong-Wah Ngo, Yu-Fei Ma, and Hong-Jiang Zhang. 2005. Video summarization and scene detection by graph modeling. IEEE Transactions on Circuits and Systems for Video Technology 15, 2 (2005), 296--305.
Chong-Wah Ngo, Ting-Chuen Pong, and Hong-Jiang Zhang. 2002. Motion-based video representation for scene change detection. Proceedings of the International Journal of Computer Vision 50, 2 (2002), 127--142.
Chong-Wah Ngo, Wan-Lei Zhao, and Yu-Gang Jiang. 2006. Fast tracking of near-duplicate keyframes in broadcast domain with transitivity propagation. In Proceedings of the 14th ACM International Conference on Multimedia. 845--854.
Payam Oskouie, Sara Alipour, and Amir-Masoud Eftekhari-Moghadam. 2014. Multimodal feature extraction and fusion for semantic mining of soccer video: A survey. Artificial Intelligence Review 42, 2 (2014), 173--210.
S. H. Ou, C. H. Lee, V. S. Somayazulu, Y. K. Chen, and S. Y. Chien. 2015. On-line multi-view video summarization for wireless video sensor network. IEEE Journal of Selected Topics in Signal Processing 9, 1 (Feb 2015), 165--179.
Paul Over, Alan F. Smeaton, and George Awad. 2008. The TRECVid 2008 BBC rushes summarization evaluation. In Proceedings of the 2nd ACM TRECVid Video Summarization Workshop. 1--20.
Paul Over, Alan F. Smeaton, and Philip Kelly. 2007. The TRECVID 2007 BBC rushes summarization evaluation pilot. In Proceedings of the International Workshop on TRECVID Video Summarization. 1--15.
Jim Owens. 2015. Television Sports Production. CRC Press.
R. Panda, A. Das, and A. K. Roy-Chowdhury. 2016. Embedded sparse coding for summarizing multi-view videos. In Proceedings of the IEEE International Conference on Image Processing. 191--195.
Rameswar Panda, Abir Das, Ziyan Wu, Jan Ernst, and Amit K. Roy-Chowdhury. 2017. Weakly supervised summarization of web videos. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, 3677--3686.
R. Panda and A. K. Roy-Chowdhury. 2017. Collaborative summarization of topic-related videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4274--4283.
R. Panda and A. K. Roy-Chowdhury. 2017. Sparse modeling for topic-oriented video summarization. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. 1388--1392.
Ted Pedersen, Siddharth Patwardhan, and Jason Michelizzi. 2004. WordNet:: Similarity: measuring the relatedness of concepts. In Demonstration Papers at HLT-NAACL 2004. 38--41.
Wei-Ting Peng, Chia-Han Chang, Wei-Ta Chu, Wei-Jia Huang, Chien-Nan Chou, Wen-Yan Chang, and Yi-Ping Hung. 2010. A real-time user interest meter and its applications in home video summarizing. In Proceedings of the IEEE International Conference on Multimedia and Expo. 849--854.
Wei-Ting Peng, Yueh-Hsuan Chiang, Wei-Ta Chu, Wei-Jia Huang, Wei-Lun Chang, Po-Chung Huang, and Yi-Ping Hung. 2008. Aesthetics-based automatic home video skimming system. In Proceedings of the Advances in Multimedia Modeling. Springer, 186--197.
Wei-Ting Peng, Wei-Ta Chu, Chia-Han Chang, Chien-Nan Chou, Wei-Jia Huang, Wen-Yan Chang, and Yi-Ping Hung. 2011. Editing by viewing: Automatic home video summarization by viewing behavior analysis. IEEE Transactions on Multimedia 13, 3 (2011), 539--550.
Wei-Ting Peng, Wei-Jia Huang, Wei-Ta Chu, Chien-Nan Chou, Wen-Yan Chang, Chia-Han Chang, and Yi-Ping Hung. 2009. A user experience model for home video summarization. In Proceedings of the Advances in Multimedia Modeling. Springer, 484--495.
Silvia Pfeiffer, Rainer Lienhart, Stephan Fischer, and Wolfgang Effelsberg. 1996. Abstracting digital movies automatically. Journal of Visual Communication and Image Representation 7, 4 (1996), 345--353.
B. A. Plummer, M. Brown, and S. Lazebnik. 2017. Enhancing video summarization via vision-language embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1052--1060.
Dulce Ponceleon, Arnon Amir, Savitha Srinivasan, Tanveer Syeda-Mahmood, and Dragutin Petkovic. 1999. CueVideo: Automated multimedia indexing and retrieval. In Proceedings of the 7th ACM International Conference on Multimedia (Part 2). 199.
Danila Potapov, Matthijs Douze, Zaid Harchaoui, and Cordelia Schmid. 2014. Category-specific video summarization. In Proceedings of the European Conference on Computer Vision. 540--555.
Z. Rasheed and M. Shah. 2003. Scene detection in Hollywood movies and TV shows. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2. II--343--II--348.
Z. Rasheed and M. Shah. 2005. Detection and representation of scenes in videos. IEEE Transactions on Multimedia 7, 6 (Dec 2005), 1097--1105.
Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 779--788.
J. Ren and J. Jiang. 2009. Hierarchical modeling and adaptive clustering for real-time summarization of rush videos. IEEE Transactions on Multimedia 11, 5 (Aug 2009), 906--917.
Reede Ren, Hemant Misra, and Joemon M. Jose. 2010. Semantic based adaptive movie summarisation. In Proceedings of the Advances in Multimedia Modeling. Springer, 389--399.
Tongwei Ren, Yan Liu, and Gangshan Wu. 2010. Video summary quality evaluation based on 4C assessment and user interaction. In Proceedings of the Multimedia Interaction and Intelligent User Interfaces. Springer, 243--269.
Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. 2010. Adapting visual category models to new domains. In Proceedings of the 11th European Conference on Computer Vision: Part IV. Springer-Verlag, Berlin, 213--226.
Helmut Schmid. 2013. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the New Methods in Language Processing. Routledge, 154.
Guy L. Scott and H. Christopher Longuet-Higgins. 1991. An algorithm for associating the features of two images. Proceedings of the Royal Society of London B: Biological Sciences 244, 1309 (1991), 21--26.
Dafna Shahaf and Carlos Guestrin. 2010. Connecting the dots between news articles. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 623--632.
Aidean Sharghi, Boqing Gong, and Mubarak Shah. 2016. Query-focused extractive video summarization. In Proceedings of European Conference on Computer Vision. 3--19.
Aidean Sharghi, Jacob S. Laurel, and Boqing Gong. 2017. Query-focused video summarization: Dataset, evaluation, and a memory network based approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2127--2136.
Ali Shokoufandeh and Sven Dickinson. 1999. Applications of bipartite matching to problems in object recognition. In Proceedings of the ICCV Workshop on Graph Algorithms and Computer Vision, Vol. 2. 1--18.
J. Sivic and A. Zisserman. 2009. Efficient visual search of videos cast as text retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 4 (April 2009), 591--606.
Alan F. Smeaton, Paul Over, and Aiden R. Doherty. 2010. Video shot boundary detection: Seven years of TRECVid activity. Computer Vision and Image Understanding 114, 4 (2010), 411--418.
Michael A. Smith and Takeo Kanade. 1998. Video skimming and characterization through the combination of image and language understanding. In Proceedings of IEEE International Workshop on Content-Based Access of Image and Video Database. 61--70.
Temple F. Smith and Michael S. Waterman. 1981. Identification of common molecular subsequences. Journal of Molecular Biology 147, 1 (1981), 195--197.
Yale Song, Jordi Vallmitjana, Amanda Stent, and Alejandro Jaimes. 2015. Tvsum: Summarizing web videos using titles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5179--5187.
Baochen Sun, Jiashi Feng, and Kate Saenko. 2016. Return of frustratingly easy domain adaptation. In Thirtieth AAAI Conference on Artificial Intelligence, Vol. 6. 2058--2065.
Liang Sun, Shuiwang Ji, and Jieping Ye. 2008. Hypergraph spectral learning for multi-label classification. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 668--676.
Min Sun, Ali Farhadi, and Steve Seitz. 2014. Ranking domain-specific highlights by analyzing edited videos. In Proceedings of the European Conference on Computer Vision. Springer, 787--802.
Anthony Tang and Sebastian Boring. 2012. #EpicPlay: Crowd-sourcing sports video highlights. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 1569--1572.
Cuneyt M. Taskiran. 2006. Evaluation of automatic video summarization systems. In Proceedings of SPIE, Vol. 6073. 178--187.
Cüneyt M. Taskiran, Zygmunt Pizlo, Arnon Amir, Dulce Ponceleon, and Edward J. Delp. 2006. Automated video program summarization using speech transcripts. IEEE Transactions on Multimedia 8, 4 (2006), 775--791.
M. Tavassolipour, M. Karimian, and S. Kasaei. 2014. Event detection and summarization in soccer videos using Bayesian network and copula. IEEE Transactions on Circuits and Systems for Video Technology 24, 2 (2014), 291--304.
Ba Tu Truong and Svetha Venkatesh. 2007. Video abstraction: A systematic review and classification. ACM Transactions on Multimedia Computing, Communications, and Applications 3, 1 (2007), 3:1--3:37.
Chia-Ming Tsai, Li-Wei Kang, Chia-Wen Lin, and Weisi Lin. 2013. Scene-based movie summarization via role-community networks. IEEE Transactions on Circuits and Systems for Video Technology 23, 11 (2013), 1927--1940.
Víctor Valdés and José M. Martínez. 2008. Binary tree based on-line video summarization. In Proceedings of the 2nd ACM TRECVid Video Summarization Workshop. 134--138.
Patrizia Varini, Giuseppe Serra, and Rita Cucchiara. 2015. Egocentric video summarization of cultural tour based on user preferences. In Proceedings of the 23rd ACM International Conference on Multimedia. ACM, 931--934.
P. Varini, G. Serra, and R. Cucchiara. 2017. Personalized egocentric video summarization of cultural tour on user preferences input. IEEE Transactions on Multimedia 19, 12 (Dec 2017), 2832--2845.
Nuno Vasconcelos and Andrew Lippman. 1998. Bayesian modeling of video editing and structure: Semantic features for video summarization and browsing. In Proceedings of International Conference on Image Processing. 153--157.
F. Wang and C. W. Ngo. 2012. Summarizing rushes videos by motion, object, and event understanding. IEEE Transactions on Multimedia 14, 1 (Feb 2012), 76--87.
L. Wang, Y. Li, and S. Lazebnik. 2016. Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5005--5013.
Meng Wang, Richang Hong, Guangda Li, Zheng-Jun Zha, Shuicheng Yan, and Tat-Seng Chua. 2012. Event driven web video summarization by tag localization and key-shot identification. IEEE Transactions on Multimedia 14, 4 (2012), 975--985.
S. Wang and Q. Ji. 2015. Video affective content analysis: A survey of state-of-the-art methods. IEEE Transactions on Affective Computing 6, 4 (Oct 2015), 410--430.
Xi Wang, Yu-Gang Jiang, Zhenhua Chai, Zichen Gu, Xinyu Du, and Dong Wang. 2014. Real-time summarization of user-generated videos based on semantic recognition. In Proceedings of the 22nd ACM International Conference on Multimedia. 849--852.
Huawei Wei, Bingbing Ni, Yichao Yan, Huanyu Yu, Xiaokang Yang, and Chen Yao. 2018. Video summarization via semantic attended networks. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence. 216--223.
Tao Xiang and Shaogang Gong. 2004. Activity based video content trajectory representation and segmentation. In Proceedings of the BMVC. 1--10.
Xiaohong Xiang and Mohan S. Kankanhalli. 2011. Affect-based adaptive presentation of home videos. In Proceedings of the 19th ACM International Conference on Multimedia. 553--562.
Baohan Xu, Xi Wang, and Yu-Gang Jiang. 2016. Fast summarization of user-generated videos: Exploiting semantic, emotional, and quality clues. IEEE MultiMedia 23, 3 (2016), 23--33.
C. Xu, J. Wang, H. Lu, and Y. Zhang. 2008. A novel framework for semantic annotation and personalized retrieval of sports video. IEEE Transactions on Multimedia 10, 3 (April 2008), 421--436.
Jia Xu, Lopamudra Mukherjee, Yin Li, Jamieson Warner, James M. Rehg, and Vikas Singh. 2015. Gaze-enabled egocentric video summarization via constrained submodular maximization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2235--2244.
X. Xu, T. M. Hospedales, and S. Gong. 2017. Discovery of shared semantic spaces for multiscene video query and summarization. IEEE Transactions on Circuits and Systems for Video Technology 27, 6 (June 2017), 1353--1367.
Huan Yang, Baoyuan Wang, Stephen Lin, David Wipf, Minyi Guo, and Baining Guo. 2015. Unsupervised extraction of video highlights via robust recurrent auto-encoders. In Proceedings of the IEEE International Conference on Computer Vision. 4633--4641.
T. Yao, T. Mei, and Y. Rui. 2016. Highlight detection with pairwise deep ranking for first-person video summarization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 982--990.
Minerva M. Yeung and Boon-Lock Yeo. 1997. Video visualization for compact presentation and fast browsing of pictorial content. IEEE Transactions on Circuits and Systems for Video Technology 7, 5 (1997), 771--785.
Serena Yeung, Alireza Fathi, and Li Fei-Fei. 2014. VideoSET: Video summary evaluation through text. arXiv preprint arXiv:1406.5824 (2014).
Atsuo Yoshitaka and Kazuaki Sawada. 2012. Personalized video summarization based on behavior of viewer. In Proceedings of the 8th International Conference on Signal Image Technology and Internet Based Systems. 661--667.
Junyong You, Guizhong Liu, Li Sun, and Hongliang Li. 2007. A multiple visual models based perceptive analysis framework for multilevel video summarization. IEEE Transactions on Circuits and Systems for Video Technology 17, 3 (2007), 273--285.
J. Yuan, H. Wang, L. Xiao, W. Zheng, J. Li, F. Lin, and B. Zhang. 2007. A formal study of shot boundary detection. IEEE Transactions on Circuits and Systems for Video Technology 17, 2 (Feb 2007), 168--186.
Ming Yuan and Yi Lin. 2006. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68, 1 (2006), 49--67.
Zheng Yuan, Taoran Lu, Dapeng Wu, Yu Huang, and Heather Yu. 2011. Video summarization with semantic concept preservation. In Proceedings of the 10th ACM International Conference on Mobile and Ubiquitous Multimedia. 109--112.
Ke Zhang, Wei-Lun Chao, Fei Sha, and Kristen Grauman. 2016. Summary transfer: Exemplar-based subset selection for video summarization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1059--1067.
Ke Zhang, Wei-Lun Chao, Fei Sha, and Kristen Grauman. 2016. Video summarization with long short-term memory. In Proceedings of the European Conference on Computer Vision. Springer, 766--782.
S. Zhang, Y. Zhu, and A. K. Roy-Chowdhury. 2016. Context-aware surveillance video summarization. IEEE Transactions on Image Processing 25, 11 (Nov 2016), 5469--5478.
Ying Zhang, Guanfeng Wang, Beomjoo Seo, and Roger Zimmermann. 2012. Multi-video summary and skim generation of sensor-rich videos in geo-space. In Proceedings of the 3rd Multimedia Systems Conference. 53--64.
Bin Zhao, Xuelong Li, and Xiaoqiang Lu. 2017. Hierarchical recurrent neural network for video summarization. In Proceedings of the 2017 ACM on Multimedia Conference. ACM, 863--871.
Bin Zhao and Eric P. Xing. 2014. Quasi real-time summarization for consumer videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2513--2520.
W. L. Zhao, C. W. Ngo, H. K. Tan, and X. Wu. 2007. Near-duplicate keyframe identification with interest point matching and pattern learning. IEEE Transactions on Multimedia 9, 5 (Aug 2007), 1037--1048.
A. Zlatintsi, E. Iosif, P. Maragos, and A. Potamianos. 2015. Audio salient event detection and summarization using audio and text modalities. In Proceedings of the 23rd European Signal Processing Conference. 2311--2315.


                              Published In

                              ACM Computing Surveys, Volume 52, Issue 5
                              September 2020
                              791 pages
                              • Editor: Sartaj Sahni
                              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


                              Association for Computing Machinery

                              New York, NY, United States

                              Publication History

                              Published: 13 September 2019
                              Accepted: 01 July 2019
                              Revised: 01 March 2019
                              Received: 01 April 2018
                              Published in CSUR Volume 52, Issue 5



                              Author Tags

                              1. Dynamic video summarization/video skimming
                              2. affective content
                              3. attention model
                              4. deep learning
                              5. machine learning
                              6. semantic concept


                              • Tutorial
                              • Research
                              • Refereed

                              Funding Sources

                              • Ministry of Human Resource Development (MHRD), Government of India


                              Article Metrics

                              • Downloads (Last 12 months): 33
                              • Downloads (Last 6 weeks): 4
                              Reflects downloads up to 13 Jan 2025.
