research-article

One-shot Video Graph Generation for Explainable Action Reasoning

Published: 01 June 2022

Abstract

Human action analysis is a critical yet challenging task for understanding diverse video content. Recently, a spatio-temporal video graph structure was proposed to enable explainable reasoning of video actions by representing video state changes at the semantic level. However, it requires tedious manual annotation of every video frame, which is a serious limitation; the approach would be far more widely applicable if the video graph generation process could be automated. In this paper, a One-Shot Video Graph (OSVG) generation approach is proposed for more effective explainable action reasoning, requiring only a one-time annotation of the objects in the starting frame of the video. We first localize the predefined relevant objects across the temporal dimension with a proposed one-shot target-aware tracking strategy, which obtains the object locations and links the objects across all video frames simultaneously. Then, the scene graph of each video frame is constructed by an attribute detector and a relationship detector based on the estimated object locations. In addition, to further improve the accuracy of action reasoning, a video graph smoothing mechanism is designed with a fully-connected Conditional Random Field (CRF). By sequentially examining every state transition (covering both attributes and relationships) of the smoothed video graph, the occurring actions can be recognized using predefined rules. Experiments on the CAD-120++ dataset and a newly collected NTU RGBD++ dataset verify that the proposed OSVG outperforms state-of-the-art video action reasoning strategies in both state recognition and action recognition accuracy.
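To make the final reasoning step concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of how state transitions between consecutive frame graphs could be matched against predefined rules to name the action that occurred. The triple encoding, the rule table ACTION_RULES, and the helpers diff_states and recognize_actions are illustrative assumptions rather than anything specified in the paper.

# Hypothetical sketch: recognize actions from state transitions in a video graph.
# Each frame graph is a set of (subject, predicate, object) triples; attributes are
# encoded as e.g. ("cup", "state", "empty") and relationships as ("hand", "holding", "cup").
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# Predefined rules (illustrative): an action fires when specific triples disappear
# ("removed") and others appear ("added") between two consecutive frame graphs.
ACTION_RULES = [
    {"name": "pick_up",
     "removed": {("cup", "on", "table")},
     "added": {("hand", "holding", "cup")}},
    {"name": "put_down",
     "removed": {("hand", "holding", "cup")},
     "added": {("cup", "on", "table")}},
    {"name": "fill",
     "removed": {("cup", "state", "empty")},
     "added": {("cup", "state", "full")}},
]

def diff_states(prev: Set[Triple], curr: Set[Triple]) -> Tuple[Set[Triple], Set[Triple]]:
    """Return the triples removed from and added to the graph between two frames."""
    return prev - curr, curr - prev

def recognize_actions(frame_graphs: List[Set[Triple]]) -> List[Tuple[int, str]]:
    """Scan consecutive (smoothed) frame graphs and report (frame index, action name)."""
    detected = []
    for t in range(1, len(frame_graphs)):
        removed, added = diff_states(frame_graphs[t - 1], frame_graphs[t])
        for rule in ACTION_RULES:
            # A rule matches when all of its required changes are observed.
            if rule["removed"] <= removed and rule["added"] <= added:
                detected.append((t, rule["name"]))
    return detected

if __name__ == "__main__":
    graphs = [
        {("cup", "on", "table"), ("cup", "state", "empty")},
        {("hand", "holding", "cup"), ("cup", "state", "empty")},
        {("hand", "holding", "cup"), ("cup", "state", "full")},
    ]
    print(recognize_actions(graphs))  # -> [(1, 'pick_up'), (2, 'fill')]

Because every detected action is tied to the exact attribute and relationship triples that appeared or disappeared, each prediction carries a human-readable justification, which is what makes this style of reasoning explainable.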

Published In

Neurocomputing, Volume 488, Issue C, June 2022, 707 pages

Publisher

Elsevier Science Publishers B.V., Netherlands

Author Tags

1. Explainable action reasoning
2. Spatial-temporal scene graphs
3. State transition
