
Mitigating Context Bias in Action Recognition via Skeleton-Dominated Two-Stream Network

Published: 29 October 2023

Abstract

In intelligent manufacturing and industrial upgrading, sophisticated multimedia computing technologies play a pivotal role in video action recognition. However, most studies suffer from background bias: models focus excessively on the contextual information in a video rather than on the human actions themselves, which can lead to severe misjudgments in industrial applications. In this paper, we propose the Skeleton-Dominated Two-Stream Network (SDTSN), a novel two-stream framework that fuses and ensembles the skeleton and RGB modalities for video action recognition. Experimental results on the Mimetics dataset, which is free of background bias, demonstrate the efficacy of our approach.
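The abstract describes fusing and ensembling skeleton and RGB streams with the skeleton stream dominant. The paper's exact fusion design is not reproduced here; the sketch below illustrates one common interpretation, score-level (late) fusion with a fixed weight favoring the skeleton stream. The function name, the `skel_weight` value, and the use of late fusion are illustrative assumptions, not the authors' confirmed method.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the class axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def skeleton_dominated_fusion(skel_logits, rgb_logits, skel_weight=0.7):
    """Hypothetical late fusion: convex combination of per-class scores,
    weighting the skeleton stream more heavily to suppress context bias."""
    return (skel_weight * softmax(skel_logits)
            + (1.0 - skel_weight) * softmax(rgb_logits))

# Toy example: 2 clips, 3 action classes.
skel = np.array([[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]])
rgb  = np.array([[0.4, 1.8, 0.2], [0.1, 0.3, 2.0]])
fused = skeleton_dominated_fusion(skel, rgb)
pred = fused.argmax(axis=-1)  # predicted class per clip
```

Because the fused scores are a convex combination of two probability distributions, each row still sums to one, and the skeleton stream's prediction wins whenever the RGB stream is not strongly confident in a different class.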


Cited By

  • (2024) Advancing Micro-Action Recognition with Multi-Auxiliary Heads and Hybrid Loss Optimization. In Proceedings of the 32nd ACM International Conference on Multimedia, 11313-11319. DOI: 10.1145/3664647.3688975. Online publication date: 28-Oct-2024.

    Published In

    AMC-SME '23: Proceedings of the 2023 Workshop on Advanced Multimedia Computing for Smart Manufacturing and Engineering
    October 2023
    83 pages
    ISBN:9798400702730
    DOI:10.1145/3606042
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. context bias
    2. skeleton-dominated
    3. two-stream
    4. video action recognition

    Qualifiers

    • Research-article

    Conference

    MM '23


