DOI: 10.1145/3562007.3562041
Research article

Global-aware Pyramid Network with Boundary Adjustment for Anchor-free Temporal Action Detection

Published: 12 October 2022

Abstract

Fine-grained temporal action detection aims to predict the categories and locate the boundaries of fine-grained action instances in long, untrimmed videos. Fine-grained classification poses new challenges for temporal action detection: it changes the distribution of action durations and increases the proportion of short action instances. Existing anchor-free detection methods, however, do not fully exploit both global and local information. This paper therefore proposes an anchor-free temporal action detection method with global feature enhancement and local boundary adjustment. Built on a feature pyramid, the method uses the transformer attention mechanism to model long-range temporal dependencies between features at different locations within the same pyramid level, and introduces global information from the upper level of the pyramid to generate coarse predictions. To recover local detail, interaction with low-level features is then used to adjust the boundaries of the coarse predictions. Experiments on FineAction demonstrate the effectiveness of the method.
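The pipeline the abstract describes can be illustrated with a deliberately minimal NumPy sketch. This is not the paper's implementation: the attention head, the nearest-neighbour upsampling, and the linear prediction heads (`W_coarse`, `W_refine`) are simplified stand-ins chosen to show the data flow only, i.e. (1) self-attention within one pyramid level for long-range temporal context, (2) fusing upsampled upper-level features as global information, (3) coarse anchor-free boundary regression per time step, and (4) refining those boundaries with offsets from the low-level features.

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(x):
    """Single-head scaled dot-product self-attention over time.
    x: (T, C) features of one pyramid level; projections omitted."""
    T, C = x.shape
    scores = x @ x.T / np.sqrt(C)                    # (T, T) pairwise affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over time
    return weights @ x                               # each step attends to all steps

def upsample(x, factor):
    """Nearest-neighbour temporal upsampling: (T, C) -> (T*factor, C)."""
    return np.repeat(x, factor, axis=0)

# Toy two-level pyramid: fine level (T=8) and coarser upper level (T=4).
C = 16
low = rng.standard_normal((8, C))
high = rng.standard_normal((4, C))

# 1-2) Global enhancement: attention within the level, plus upper-level context.
enhanced = self_attention(low) + upsample(high, factor=2)

# 3) Coarse anchor-free prediction: each time step regresses positive
#    distances (d_start, d_end) to the action boundaries.
W_coarse = rng.standard_normal((C, 2)) * 0.1
coarse = np.exp(enhanced @ W_coarse)                 # (8, 2), positive offsets

# 4) Boundary adjustment: small offsets predicted from the low-level
#    features refine the coarse boundaries with local detail.
W_refine = rng.standard_normal((C, 2)) * 0.01
refined = coarse + low @ W_refine

print(coarse.shape, refined.shape)                   # (8, 2) (8, 2)
```

In the actual model the projections and heads are learned, multiple pyramid levels are processed, and predictions are decoded into (start, end, class) triples; the sketch only mirrors the coarse-then-refine structure.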


    Published In

    CCRIS '22: Proceedings of the 2022 3rd International Conference on Control, Robotics and Intelligent System
    August 2022
    253 pages
    ISBN:9781450396851
    DOI:10.1145/3562007

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Anchor-free action detection
    2. Boundary adjustment
    3. Fine-grained action detection
    4. Global feature enhancement
