Sparse Coding Guided Spatiotemporal Feature Learning for Abnormal Event Detection in Large Videos

Published: 01 January 2019

Abstract

Abnormal event detection in large videos is an important task in both research and industrial applications and has attracted considerable attention in recent years. Existing methods usually solve this problem by extracting local features and then learning an outlier detection model on training videos. However, most previous approaches rely on hand-crafted visual features, whose limited representation capacity is a clear disadvantage. In this paper, we present a novel unsupervised deep feature learning algorithm for the abnormal event detection problem. To exploit the spatiotemporal information of the inputs, we use a deep three-dimensional convolutional network (C3D) for feature extraction. The key problem is then how to train the C3D network without any category labels. We employ the sparse coding results of hand-crafted features computed from the inputs to guide the unsupervised feature learning. Specifically, we define a multilevel similarity relationship between inputs according to the statistics of their shared dictionary atoms. We then introduce the quadruplet concept to model this multilevel similarity structure, from which a generalized triplet loss is constructed for training the C3D network. The C3D network can in turn generate features for sparse coding again, and this pipeline can be iterated several times. By jointly optimizing sparse coding and unsupervised feature learning, we obtain robust and rich feature representations. Based on the learned representations, the sparse reconstruction error is applied to predict the anomaly score of each testing input. Experiments on several publicly available video surveillance datasets, in comparison with a number of existing works, demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
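The two computational ideas in the abstract can be illustrated with short sketches. First, a minimal PyTorch-style sketch of a two-margin loss over quadruplets (anchor, strongly similar, weakly similar, dissimilar), in the spirit of the generalized triplet loss described above; the margin values, the Euclidean distance, and the function name are illustrative assumptions, since the abstract does not give the exact formulation.

```python
import torch
import torch.nn.functional as F

def generalized_triplet_loss(anchor, pos, weak, neg, m1=0.2, m2=0.4):
    """Two-level ranking loss over quadruplets (anchor, pos, weak, neg):
    inputs sharing many dictionary atoms with the anchor (pos) should lie
    closer than those sharing only a few (weak), which in turn should lie
    closer than those sharing none (neg). m1 and m2 are assumed margins."""
    d_pos = F.pairwise_distance(anchor, pos)    # shape: (batch,)
    d_weak = F.pairwise_distance(anchor, weak)
    d_neg = F.pairwise_distance(anchor, neg)
    # hinge penalties on both levels of the similarity ordering
    loss = F.relu(d_pos - d_weak + m1) + F.relu(d_weak - d_neg + m2)
    return loss.mean()
```

Second, a hedged sketch of anomaly scoring by sparse reconstruction error, using scikit-learn's SparseCoder with OMP as one possible sparse solver. The dictionary D (rows assumed to be l2-normalized atoms learned on normal training features) and the sparsity level n_nonzero are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import SparseCoder

def anomaly_scores(X, D, n_nonzero=10):
    """Score each row of X (test features) by its sparse reconstruction
    error under a dictionary D learned on normal data; a higher score
    indicates a more anomalous input."""
    coder = SparseCoder(dictionary=D,
                        transform_algorithm="omp",
                        transform_n_nonzero_coefs=n_nonzero)
    codes = coder.transform(X)          # sparse codes, shape (n, n_atoms)
    residual = X - codes @ D            # reconstruction: X approx codes @ D
    return (residual ** 2).sum(axis=1)  # squared reconstruction error
```

In the iterative pipeline the abstract describes, the learned C3D features would take the place of X, and the dictionary would be re-learned (e.g., with K-SVD or another dictionary learner) at each round of the joint optimization.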



Published In

IEEE Transactions on Multimedia, Volume 21, Issue 1, Jan. 2019, 268 pages
Publisher: IEEE Press
Published: 01 January 2019
Qualifiers: Research-article
