DOI: 10.1145/3595916.3626453

Improving Class Representation for Zero-Shot Action Recognition

Published: 01 January 2024

Abstract

Zero-Shot Action Recognition (ZSAR) enables models to infer new action classes from previously seen data without any samples of those new classes. How an action class is represented in an understandable, machine-processable form influences ZSAR performance. Semantic representations of action classes have taken various forms, such as attributes, class labels, and text descriptions; in video recognition, action classes can also have visual representations in the form of images. This paper proposes a novel method that improves class representation for ZSAR. On the one hand, to improve the collection and quality of text descriptions, we use ChatGPT to generate descriptions and design conversation-based text prompts that quickly yield high-quality descriptions for many actions. On the other hand, to overcome the ambiguity of single-modal class representation, we propose the Image-based Description Refinement (IDR) method to obtain multimodal class representations. Specifically, action classes are represented by textual descriptions together with relevant images from the web, while action videos are represented by spatio-temporal features and extracted objects. By training on the seen set to learn a mapping between the multimodal representations of classes and videos, we can infer the classes of unseen-set videos from the similarity of the mapped representations. Experiments on two popular benchmarks and two elderly daily-activity datasets demonstrate the effectiveness of our method; in particular, it yields a significant improvement when few video samples are available.
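
To make the inference step described above concrete, below is a minimal sketch of similarity-based zero-shot classification with fused multimodal representations. It is not the authors' implementation: the concatenation-based fusion, the omission of the projection learned on seen classes, and all function and variable names are assumptions introduced purely for illustration.

```python
import torch
import torch.nn.functional as F


def fuse_class(desc_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
    # Fuse a class's text-description embedding with a web-image embedding;
    # concatenation + L2 normalization is a placeholder for the method's actual fusion.
    return F.normalize(torch.cat([desc_emb, image_emb], dim=-1), dim=-1)


def fuse_video(st_emb: torch.Tensor, obj_emb: torch.Tensor) -> torch.Tensor:
    # Fuse a video's spatio-temporal features with an embedding of detected objects.
    return F.normalize(torch.cat([st_emb, obj_emb], dim=-1), dim=-1)


def zero_shot_classify(video_vec: torch.Tensor, unseen_class_vecs: torch.Tensor) -> int:
    # Cosine similarity reduces to a dot product because all vectors are L2-normalized;
    # the prediction is the unseen class whose fused representation is most similar.
    sims = unseen_class_vecs @ video_vec
    return int(sims.argmax())


if __name__ == "__main__":
    # Toy usage: random 256-d embeddings for 5 unseen classes and one test video.
    classes = torch.stack(
        [fuse_class(torch.randn(256), torch.randn(256)) for _ in range(5)]
    )
    video = fuse_video(torch.randn(256), torch.randn(256))
    print("predicted unseen-class index:", zero_shot_classify(video, classes))
```

In the paper's setting, the fused class and video vectors would first pass through the mapping learned on the seen set; the nearest-class decision rule over mapped representations stays the same.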

Supplementary Material

Appendix (Appendix_Textual_Descriptions_of Actions_for_different_datasets.pdf)



Published In

MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in Asia
December 2023
745 pages
ISBN: 9798400702051
DOI: 10.1145/3595916
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 January 2024


Author Tags

  1. action recognition
  2. multimodal
  3. zero-shot

Qualifiers

  • Research-article
  • Research
  • Refereed limited


Conference

MMAsia '23
Sponsor: MMAsia '23: ACM Multimedia Asia
December 6-8, 2023
Tainan, Taiwan

Acceptance Rates

Overall Acceptance Rate 59 of 204 submissions, 29%

