
Generalized Zero-Shot Video Classification via Generative Adversarial Networks

Published: 12 October 2020

Abstract

Zero-shot learning (ZSL) classifies samples into categories unseen during training by exploiting detailed attribute annotations. Generalized zero-shot learning (GZSL) additionally includes seen categories among the test samples. Since the learned classifier is inherently biased toward seen categories, GZSL is more challenging than traditional ZSL. However, no detailed attribute description dataset currently exists for video classification, so existing zero-shot video classification methods rely on generative adversarial networks, trained on seen-class features, to synthesize unseen-class features for ZSL classification. To address this problem, we propose a description text dataset built on the UCF101 action recognition dataset. To the best of our knowledge, this is the first work to add class-level descriptions to zero-shot video classification. We further propose a new loss function that combines visual features with textual features: we extract text features from the proposed text dataset and constrain the synthetic-feature generation process under the principle that videos with similar textual descriptions should have similar features. Our method thus reapplies the traditional zero-shot learning idea to video classification. Experimentally, our proposed dataset and method have a positive impact on generalized zero-shot video classification.
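To make the text-similarity constraint concrete, below is a minimal sketch (not the authors' released code) of a text-conditioned feature generator whose synthetic features are regularized so that classes with similar text embeddings yield similar features. The network shapes, the feature/text/noise dimensions (2048/300/100), and the names Generator and text_similarity_loss are illustrative assumptions.

```python
# Illustrative sketch of a text-conditioned feature generator with a
# text-similarity regularizer; all names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, TEXT_DIM, NOISE_DIM = 2048, 300, 100  # assumed dimensions

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + TEXT_DIM, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, FEAT_DIM), nn.ReLU())

    def forward(self, z, text_emb):
        # condition the synthetic video feature on the class text embedding
        return self.net(torch.cat([z, text_emb], dim=1))

def text_similarity_loss(fake_feats, text_embs):
    # push the pairwise cosine similarity of generated features toward
    # the pairwise similarity of the corresponding text embeddings
    f = F.normalize(fake_feats, dim=1)
    t = F.normalize(text_embs, dim=1)
    return F.mse_loss(f @ f.t(), t @ t.t())

gen = Generator()
text = torch.randn(8, TEXT_DIM)   # stand-in for real text features
z = torch.randn(8, NOISE_DIM)
fake = gen(z, text)
loss = text_similarity_loss(fake, text)  # added to the usual GAN losses
loss.backward()
```

In a full pipeline this term would be added to the standard adversarial and classification losses; the matrix product f @ f.t() compares every pair of generated features within the batch.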

Supplementary Material

MP4 File (3394171.3413517.mp4)
The main contributions of the paper are threefold: (i) a novel zero-shot video classification framework that leverages textual attribute descriptions of videos; (ii) a new loss function that uses both text and visual features; and (iii) a new video attribute description dataset. Previous works mainly used GANs to generate unseen-class visual features to help classify unseen classes; those features are generated from unseen class names together with seen-class videos and their names. Although class names represent the videos to some extent, they are clearly not expressive enough to describe the video features. We believe more accurate descriptions of unseen videos can improve video classification performance, which is the motivation of this paper. We therefore propose a novel framework that constructs a text description dataset to provide richer video attribute information and designs a new loss function to bridge the semantic relationships between categories.
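As an illustration of how per-class descriptions could be turned into the text features mentioned above, the following sketch applies TF-IDF to a few invented UCF101-style class descriptions. The example descriptions and the TF-IDF choice are assumptions for demonstration, not the paper's exact pipeline.

```python
# Illustrative sketch: vectorize class descriptions and inspect their
# pairwise similarity, the quantity the synthesis constraint would
# try to preserve among generated features. Descriptions are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class_descriptions = {
    "ApplyEyeMakeup": "a person applies makeup around the eyes with a brush",
    "Basketball": "players dribble and shoot a ball at a hoop on a court",
    "Biking": "a person rides a bicycle along a road or trail",
}

vectorizer = TfidfVectorizer(stop_words="english")
text_feats = vectorizer.fit_transform(list(class_descriptions.values()))

# pairwise class similarity matrix over the description vectors
print(cosine_similarity(text_feats).round(2))
```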



    Published In

    MM '20: Proceedings of the 28th ACM International Conference on Multimedia
    October 2020
    4889 pages
ISBN: 9781450379885
DOI: 10.1145/3394171


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. computer vision
    2. machine learning

    Qualifiers

    • Research-article

    Conference

    MM '20

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
