DOI: 10.1145/3595916.3626453

Improving Class Representation for Zero-Shot Action Recognition

Published: 01 January 2024

Abstract

Zero-Shot Action Recognition (ZSAR) enables models to infer new action classes from previously seen data without any samples of those new classes. How an action class is represented in an understandable, machine-processable form influences ZSAR performance. Semantic representations of action classes have taken various forms, such as attributes, class labels, and text descriptions; in video recognition, action classes can also have visual representations in the form of images. This paper proposes a novel method that improves class representation for ZSAR. On the one hand, to improve the collection and quality of text descriptions, we use ChatGPT to generate descriptions and design conversation-based text prompts that quickly yield high-quality descriptions for many actions. On the other hand, to overcome the ambiguity of single-modal class representation, we propose the Image-based Description Refinement (IDR) method to obtain multimodal class representations. Specifically, action classes are represented by textual descriptions together with relevant images from the web, while action videos are represented by spatio-temporal features and extracted objects. By training on the seen set to learn a mapping between the multimodal representations of classes and videos, we can infer the classes of unseen-set videos from the similarity of the mapped representations. Experiments on two popular benchmarks and two elderly daily-activity datasets demonstrate the effectiveness of our method; in particular, it yields a significant improvement when few video samples are available.
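
To make the inference step described above concrete, below is a minimal sketch of similarity-based zero-shot classification with fused multimodal representations. It is not the authors' implementation: the concatenation-based fusion, the omission of the projection learned on seen classes, and all function and variable names are assumptions introduced purely for illustration.

```python
import torch
import torch.nn.functional as F


def fuse_class(desc_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
    # Fuse a class's text-description embedding with a web-image embedding;
    # concatenation + L2 normalization is a placeholder for the method's actual fusion.
    return F.normalize(torch.cat([desc_emb, image_emb], dim=-1), dim=-1)


def fuse_video(st_emb: torch.Tensor, obj_emb: torch.Tensor) -> torch.Tensor:
    # Fuse a video's spatio-temporal features with an embedding of detected objects.
    return F.normalize(torch.cat([st_emb, obj_emb], dim=-1), dim=-1)


def zero_shot_classify(video_vec: torch.Tensor, unseen_class_vecs: torch.Tensor) -> int:
    # Cosine similarity reduces to a dot product because all vectors are L2-normalized;
    # the prediction is the unseen class whose fused representation is most similar.
    sims = unseen_class_vecs @ video_vec
    return int(sims.argmax())


if __name__ == "__main__":
    # Toy usage: random 256-d embeddings for 5 unseen classes and one test video.
    classes = torch.stack(
        [fuse_class(torch.randn(256), torch.randn(256)) for _ in range(5)]
    )
    video = fuse_video(torch.randn(256), torch.randn(256))
    print("predicted unseen-class index:", zero_shot_classify(video, classes))
```

In the paper's setting, the fused class and video vectors would first pass through the mapping learned on the seen set; the nearest-class decision rule over mapped representations stays the same.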

Supplementary Material

Appendix (Appendix_Textual_Descriptions_of Actions_for_different_datasets.pdf)



Published In

MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in Asia
December 2023
745 pages
ISBN: 9798400702051
DOI: 10.1145/3595916
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 January 2024


Author Tags

  1. action recognition
  2. multimodal
  3. zero-shot

Qualifiers

  • Research-article
  • Research
  • Refereed limited


Conference

MMAsia '23
Sponsor: MMAsia '23: ACM Multimedia Asia
December 6-8, 2023
Tainan, Taiwan

Acceptance Rates

Overall Acceptance Rate 59 of 204 submissions, 29%

