Continuous Human Action Recognition for Human-machine Interaction: A Review

Published: 13 July 2023

Abstract

With advances in data-driven machine learning research, a wide variety of prediction models have been proposed to capture spatio-temporal features for the analysis of video streams. Recognising actions and detecting action transitions within an input video are challenging but necessary tasks for applications that require real-time human-machine interaction. By reviewing a large body of recent related work, we thoroughly analyse, explain, and compare action segmentation methods and detail the feature extraction and learning strategies used in most state-of-the-art methods. We also examine how the performance of object detection and tracking techniques affects human action segmentation methodologies. We investigate the application of such models to real-world scenarios and discuss several limitations and key research directions towards improving interpretability, generalisation, optimisation, and deployment.
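
As a concrete illustration of the task described above, the sketch below (an assumption-laden example, not any specific paper's method) applies a stack of dilated temporal convolutions to pre-extracted per-frame features and predicts an action label for every frame, in the spirit of the TCN-based segmentation models surveyed here (e.g., MS-TCN). The feature dimension, depth, channel width, and class count are illustrative choices.

import torch
import torch.nn as nn

class DilatedResidualLayer(nn.Module):
    """One residual block of a temporal convolutional network (TCN)."""
    def __init__(self, channels, dilation):
        super().__init__()
        # Dilated 3-tap convolution over time; padding preserves length.
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.out = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):
        return x + self.out(torch.relu(self.conv(x)))  # residual connection

class TemporalSegmenter(nn.Module):
    """Single-stage frame-wise action segmenter over per-frame features."""
    def __init__(self, feat_dim, num_classes, channels=64, depth=10):
        super().__init__()
        self.proj = nn.Conv1d(feat_dim, channels, kernel_size=1)
        # Dilations 1, 2, 4, ... grow the temporal receptive field
        # exponentially, so distant frames inform each frame's label.
        self.layers = nn.ModuleList(
            DilatedResidualLayer(channels, 2 ** i) for i in range(depth))
        self.classify = nn.Conv1d(channels, num_classes, kernel_size=1)

    def forward(self, feats):            # feats: (batch, feat_dim, frames)
        x = self.proj(feats)
        for layer in self.layers:
            x = layer(x)
        return self.classify(x)          # (batch, num_classes, frames)

# Hypothetical usage: 2048-d per-frame features (e.g., from an I3D
# backbone) for a 300-frame clip, segmented into 10 action classes.
model = TemporalSegmenter(feat_dim=2048, num_classes=10)
logits = model(torch.randn(1, 2048, 300))  # per-frame action logits

Action transitions then correspond to label changes in the per-frame argmax; the multi-stage variants surveyed in this review stack further refinement stages on such initial predictions to reduce over-segmentation errors.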

Supplementary Material

3587931.supp (3587931.supp.pdf)
Supplementary material

Published In

ACM Computing Surveys, Volume 55, Issue 13s
December 2023
1367 pages
ISSN: 0360-0300
EISSN: 1557-7341
DOI: 10.1145/3606252

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 13 July 2023
Online AM: 14 March 2023
Accepted: 10 March 2023
Revised: 20 December 2022
Received: 03 March 2022
Published in CSUR Volume 55, Issue 13s

Author Tags

  1. Datasets
  2. Neural networks

Qualifiers

  • Survey
