research-article

One-shot Video Graph Generation for Explainable Action Reasoning

Published: 01 June 2022

Abstract

Human action analysis is a critical yet challenging task for understanding diverse video content. Recently, a spatio-temporal video graph structure was proposed to enable explainable reasoning of video actions by representing video state changes at the semantic level. However, it requires tedious manual annotation of every video frame, which is a serious limitation; the approach would be far more widely applicable if the video graph generation process could be automated. In this paper, a One-Shot Video Graph (OSVG) generation approach is proposed for more effective explainable action reasoning, requiring only a one-time annotation of the objects in the starting frame of the video. We first localize the predefined relevant objects across the temporal dimension with a proposed one-shot target-aware tracking strategy, which obtains the object locations and links the objects across all video frames simultaneously. Then, the scene graph of each video frame is constructed by an attribute detector and a relationship detector based on the estimated object locations. In addition, to further improve the accuracy of action reasoning, a video graph smoothing mechanism is designed with a fully-connected Conditional Random Field (CRF). By sequentially examining every state transition (covering both attributes and relationships) of the smoothed video graph, the occurring actions can be recognized using predefined rules. Experiments on the CAD-120++ dataset and a newly collected NTU RGBD++ dataset verify that the proposed OSVG outperforms state-of-the-art video action reasoning strategies in both state recognition and action recognition accuracy.
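To make the final reasoning step concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of how state transitions between consecutive frame graphs could be matched against predefined rules to name the action that occurred. The triple encoding, the rule table ACTION_RULES, and the helpers diff_states and recognize_actions are illustrative assumptions rather than anything specified in the paper.

# Hypothetical sketch: recognize actions from state transitions in a video graph.
# Each frame graph is a set of (subject, predicate, object) triples; attributes are
# encoded as e.g. ("cup", "state", "empty") and relationships as ("hand", "holding", "cup").
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# Predefined rules (illustrative): an action fires when specific triples disappear
# ("removed") and others appear ("added") between two consecutive frame graphs.
ACTION_RULES = [
    {"name": "pick_up",
     "removed": {("cup", "on", "table")},
     "added": {("hand", "holding", "cup")}},
    {"name": "put_down",
     "removed": {("hand", "holding", "cup")},
     "added": {("cup", "on", "table")}},
    {"name": "fill",
     "removed": {("cup", "state", "empty")},
     "added": {("cup", "state", "full")}},
]

def diff_states(prev: Set[Triple], curr: Set[Triple]) -> Tuple[Set[Triple], Set[Triple]]:
    """Return the triples removed from and added to the graph between two frames."""
    return prev - curr, curr - prev

def recognize_actions(frame_graphs: List[Set[Triple]]) -> List[Tuple[int, str]]:
    """Scan consecutive (smoothed) frame graphs and report (frame index, action name)."""
    detected = []
    for t in range(1, len(frame_graphs)):
        removed, added = diff_states(frame_graphs[t - 1], frame_graphs[t])
        for rule in ACTION_RULES:
            # A rule matches when all of its required changes are observed.
            if rule["removed"] <= removed and rule["added"] <= added:
                detected.append((t, rule["name"]))
    return detected

if __name__ == "__main__":
    graphs = [
        {("cup", "on", "table"), ("cup", "state", "empty")},
        {("hand", "holding", "cup"), ("cup", "state", "empty")},
        {("hand", "holding", "cup"), ("cup", "state", "full")},
    ]
    print(recognize_actions(graphs))  # -> [(1, 'pick_up'), (2, 'fill')]

Because every detected action is tied to the exact attribute and relationship triples that appeared or disappeared, each prediction carries a human-readable justification, which is what makes this style of reasoning explainable.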

Published In

Neurocomputing, Volume 488, Issue C, June 2022, 707 pages

Publisher

Elsevier Science Publishers B.V., Netherlands

Author Tags

1. Explainable action reasoning
2. Spatial-temporal scene graphs
3. State transition
