DOI: 10.1145/3581783.3612301 · Research Article

Multi-stage Factorized Spatio-Temporal Representation for RGB-D Action and Gesture Recognition

Published: 27 October 2023

Abstract

RGB-D action and gesture recognition remains an interesting topic in human-centered scene understanding, primarily due to the multiple granularities and large variation in human motion. Although many RGB-D based action and gesture recognition approaches have demonstrated remarkable results by utilizing highly integrated spatio-temporal representations across multiple modalities (i.e., RGB and depth data), they still encounter several challenges. Firstly, vanilla 3D convolution makes it hard to capture fine-grained motion differences between local clips under different modalities. Secondly, the intricate nature of highly integrated spatio-temporal modeling can lead to optimization difficulties. Thirdly, duplicate and unnecessary information adds complexity and further entangles spatio-temporal modeling. To address these issues, we propose an innovative heuristic architecture called Multi-stage Factorized Spatio-Temporal (MFST) for RGB-D action and gesture recognition. The proposed MFST model comprises a 3D Central Difference Convolution Stem (CDC-Stem) module and multiple factorized spatio-temporal stages. The CDC-Stem enriches fine-grained temporal perception, and the multiple hierarchical spatio-temporal stages construct dimension-independent higher-order semantic primitives. Specifically, the CDC-Stem module captures bottom-level spatio-temporal features and passes them successively to the following factorized spatio-temporal stages, which capture hierarchical spatial and temporal features through the Multi-Scale Convolution and Transformer (MSC-Trans) hybrid block and the Weight-shared Multi-Scale Transformer (WMS-Trans) block. The seamless integration of these innovative designs results in a robust spatio-temporal representation that outperforms state-of-the-art approaches on RGB-D action and gesture recognition datasets.
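The abstract contrasts vanilla 3D convolution with central difference convolution (CDC): in CDC, the usual weighted sum over a local spatio-temporal patch is offset by a theta-weighted term centered on the patch's middle voxel, which makes the response sensitive to local intensity *changes* rather than raw intensities. A minimal numpy sketch of this idea, on a single-channel volume (the function name, shapes, and default theta are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def cdc3d(x, w, theta=0.7):
    """Illustrative 3D Central Difference Convolution (CDC).

    For each output position p0:
        y(p0) = sum_p w(p) * x(p0 + p)  -  theta * x(p0_center) * sum_p w(p)
    With theta = 0 this reduces to vanilla 3D convolution; with theta > 0
    a constant region yields a damped (at theta = 1, zero) response, so
    fine-grained temporal differences dominate the output.

    x: (T, H, W) single-channel volume; w: (kt, kh, kw) kernel.
    """
    kt, kh, kw = w.shape
    T, H, W = x.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    w_sum = w.sum()  # aggregate kernel weight for the central-difference term
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                patch = x[t:t + kt, i:i + kh, j:j + kw]
                center = x[t + kt // 2, i + kh // 2, j + kw // 2]
                out[t, i, j] = (patch * w).sum() - theta * center * w_sum
    return out
```

A quick sanity check of the intuition: on a constant all-ones volume with a 3x3x3 all-ones kernel, theta = 0 gives the plain convolution response of 27 everywhere, while theta = 1 cancels it to exactly zero, since there is no local change to detect.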

Supplemental Material: MP4 file (presentation video)





    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783


    Publisher

    Association for Computing Machinery, New York, NY, United States



    Author Tags

    1. multi-modal representation
    2. rgb-d action and gesture recognition
    3. spatio-temporal


    Funding Sources

    • National Key Research and Development Plan
    • Guangdong Provincial Key R&D Programme
    • Science and Technology Development Fund of Macau Project
    • External cooperation key project of Chinese Academy Sciences

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

