
Generalized Zero-Shot Video Classification via Generative Adversarial Networks

Published: 12 October 2020

Abstract

Zero-shot learning (ZSL) classifies samples into categories unseen during training by exploiting detailed attribute annotations. Generalized zero-shot learning (GZSL) additionally includes seen categories among the test samples. Since the learned classifier is inherently biased toward seen categories, GZSL is more challenging than traditional ZSL. However, no detailed attribute description dataset currently exists for video classification, so existing zero-shot video classification methods rely on generative adversarial networks, trained on seen-class features, to synthesize unseen-class features for ZSL classification. To address this problem, we propose a description text dataset built on the UCF101 action recognition dataset. To the best of our knowledge, this is the first work to add class-level descriptions to zero-shot video classification. We further propose a new loss function that combines visual features with textual features: we extract text features from the proposed text dataset and constrain the synthetic-feature generation process under the principle that videos with similar textual descriptions should have similar features. Our method thus reapplies the traditional zero-shot learning idea to video classification. Experimentally, our proposed dataset and method have a positive impact on generalized zero-shot video classification.
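To make the text-similarity constraint concrete, below is a minimal sketch (not the authors' released code) of a text-conditioned feature generator whose synthetic features are regularized so that classes with similar text embeddings yield similar features. The network shapes, the feature/text/noise dimensions (2048/300/100), and the names Generator and text_similarity_loss are illustrative assumptions.

```python
# Illustrative sketch of a text-conditioned feature generator with a
# text-similarity regularizer; all names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, TEXT_DIM, NOISE_DIM = 2048, 300, 100  # assumed dimensions

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + TEXT_DIM, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, FEAT_DIM), nn.ReLU())

    def forward(self, z, text_emb):
        # condition the synthetic video feature on the class text embedding
        return self.net(torch.cat([z, text_emb], dim=1))

def text_similarity_loss(fake_feats, text_embs):
    # push the pairwise cosine similarity of generated features toward
    # the pairwise similarity of the corresponding text embeddings
    f = F.normalize(fake_feats, dim=1)
    t = F.normalize(text_embs, dim=1)
    return F.mse_loss(f @ f.t(), t @ t.t())

gen = Generator()
text = torch.randn(8, TEXT_DIM)   # stand-in for real text features
z = torch.randn(8, NOISE_DIM)
fake = gen(z, text)
loss = text_similarity_loss(fake, text)  # added to the usual GAN losses
loss.backward()
```

In a full pipeline this term would be added to the standard adversarial and classification losses; the matrix product f @ f.t() compares every pair of generated features within the batch.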

Supplementary Material

MP4 File (3394171.3413517.mp4)
The main contributions of the paper are threefold: (i) a novel zero-shot video classification framework that leverages textual attribute descriptions of videos; (ii) a new loss function that uses both text and visual features; and (iii) a new video attribute description dataset. Previous works mainly used GANs to generate unseen-class visual features to help classify unseen classes; those features are generated from unseen class names together with seen-class videos and their names. Although class names represent the videos to some extent, they are clearly not expressive enough to describe the video features. We believe more accurate descriptions of unseen videos can improve video classification performance, which is the motivation of this paper. We therefore propose a novel framework that constructs a text description dataset to provide richer video attribute information and designs a new loss function to bridge the semantic relationships between categories.
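As an illustration of how per-class descriptions could be turned into the text features mentioned above, the following sketch applies TF-IDF to a few invented UCF101-style class descriptions. The example descriptions and the TF-IDF choice are assumptions for demonstration, not the paper's exact pipeline.

```python
# Illustrative sketch: vectorize class descriptions and inspect their
# pairwise similarity, the quantity the synthesis constraint would
# try to preserve among generated features. Descriptions are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class_descriptions = {
    "ApplyEyeMakeup": "a person applies makeup around the eyes with a brush",
    "Basketball": "players dribble and shoot a ball at a hoop on a court",
    "Biking": "a person rides a bicycle along a road or trail",
}

vectorizer = TfidfVectorizer(stop_words="english")
text_feats = vectorizer.fit_transform(list(class_descriptions.values()))

# pairwise class similarity matrix over the description vectors
print(cosine_similarity(text_feats).round(2))
```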



    Published In

    MM '20: Proceedings of the 28th ACM International Conference on Multimedia
    October 2020
    4889 pages
ISBN: 9781450379885
DOI: 10.1145/3394171


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. computer vision
    2. machine learning

    Qualifiers

    • Research-article

    Conference

    MM '20

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
