Improving Class Representation for Zero-Shot Action Recognition

Published: 01 January 2024


Zero-Shot Action Recognition (ZSAR) enables models to recognize new action classes using only previously seen data, without any samples of those new classes. How an action class is represented in an understandable and processable format influences ZSAR performance. Semantic representations of action classes take various forms, such as attributes, class labels, and text descriptions; in video recognition, action classes can also be represented visually in the form of images. This paper proposes a novel ZSAR method that improves class representation. On the one hand, to improve the collection and quality of text descriptions, we use ChatGPT to generate descriptions and design conversation-based text prompts that quickly yield high-quality descriptions of many actions. On the other hand, to overcome the ambiguity of single-modal class representation, we propose the Image-based Description Refinement (IDR) method to obtain a multimodal class representation. Specifically, action classes are represented by descriptions and relevant images from the web, and action videos are represented by spatio-temporal features and extracted objects. By training on the seen set to learn a mapping of the multimodal representations of classes and videos, we can infer the classes of videos in the unseen set from the similarity of the mapped representations. Experiments on two popular benchmarks and two elderly daily activity datasets show the effectiveness of our method. In particular, the improvement is significant when fewer video samples are available.
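The similarity-based inference step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the learned mapping is omitted, the fusion is assumed to be a concatenation of L2-normalized vectors, similarity is assumed to be cosine similarity, and the function names (`fuse`, `predict_unseen`) and toy vectors are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(parts):
    """Assumed fusion: concatenate the L2-normalized per-modality vectors."""
    fused = []
    for part in parts:
        fused.extend(l2_normalize(part))
    return fused


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def predict_unseen(video_parts, class_parts_by_name):
    """Return the unseen class whose fused representation is most similar."""
    video_vec = fuse(video_parts)
    scores = {
        name: cosine(video_vec, fuse(parts))
        for name, parts in class_parts_by_name.items()
    }
    return max(scores, key=scores.get), scores


# Toy data (hypothetical): each class is [description vector, image vector],
# and the video is [spatio-temporal vector, object vector].
unseen_classes = {
    "drinking": [[0.9, 0.1], [0.8, 0.2]],
    "sweeping": [[0.1, 0.9], [0.2, 0.8]],
}
video = [[0.7, 0.3], [0.9, 0.1]]

label, scores = predict_unseen(video, unseen_classes)
print(label)
```

In the method described above, the vectors compared at this step would be the outputs of the mapping learned on the seen set, rather than raw features as in this toy example.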

