
Mitigating Context Bias in Action Recognition via Skeleton-Dominated Two-Stream Network

Published: 29 October 2023

Abstract

In intelligent manufacturing and industrial upgrading, sophisticated multimedia computing technologies play a pivotal role in video action recognition. However, most studies suffer from background bias: models focus excessively on the contextual information in a video rather than on the human actions themselves, which can lead to severe misjudgments in industrial applications. In this paper, we propose the Skeleton-Dominated Two-Stream Network (SDTSN), a novel two-stream framework that fuses and ensembles the skeleton and RGB modalities for video action recognition. Experimental results on the Mimetics dataset, which is free of background bias, demonstrate the efficacy of our approach.
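The abstract describes fusing and ensembling skeleton and RGB streams with the skeleton stream dominant. The paper's exact fusion design is not reproduced here; the sketch below illustrates one common interpretation, score-level (late) fusion with a fixed weight favoring the skeleton stream. The function name, the `skel_weight` value, and the use of late fusion are illustrative assumptions, not the authors' confirmed method.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the class axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def skeleton_dominated_fusion(skel_logits, rgb_logits, skel_weight=0.7):
    """Hypothetical late fusion: convex combination of per-class scores,
    weighting the skeleton stream more heavily to suppress context bias."""
    return (skel_weight * softmax(skel_logits)
            + (1.0 - skel_weight) * softmax(rgb_logits))

# Toy example: 2 clips, 3 action classes.
skel = np.array([[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]])
rgb  = np.array([[0.4, 1.8, 0.2], [0.1, 0.3, 2.0]])
fused = skeleton_dominated_fusion(skel, rgb)
pred = fused.argmax(axis=-1)  # predicted class per clip
```

Because the fused scores are a convex combination of two probability distributions, each row still sums to one, and the skeleton stream's prediction wins whenever the RGB stream is not strongly confident in a different class.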


Cited By

  • (2024) Advancing Micro-Action Recognition with Multi-Auxiliary Heads and Hybrid Loss Optimization. In Proceedings of the 32nd ACM International Conference on Multimedia, 11313-11319. DOI: 10.1145/3664647.3688975. Online publication date: 28-Oct-2024.

    Published In

    AMC-SME '23: Proceedings of the 2023 Workshop on Advanced Multimedia Computing for Smart Manufacturing and Engineering
    October 2023
    83 pages
    ISBN:9798400702730
    DOI:10.1145/3606042
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. context bias
    2. skeleton-dominated
    3. two-stream
    4. video action recognition

    Qualifiers

    • Research-article

    Conference

    MM '23


