DOI:10.1145/3206025.3206052
research-article

Multi-Scale Spatiotemporal Conv-LSTM Network for Video Saliency Detection

Published: 05 June 2018

Abstract

Deep neural networks have recently become crucial techniques for image saliency detection. However, two difficulties hinder the development of deep learning for video saliency detection. First, traditional static networks cannot perform robust motion estimation in videos. Second, data-driven deep learning lacks sufficient manually annotated pixel-wise ground truth for training video saliency networks. In this paper, we propose a multi-scale spatiotemporal convolutional LSTM network (MSST-ConvLSTM) that incorporates spatial and temporal cues for video salient object detection. Furthermore, because manual pixel-wise labeling is very time-consuming, we assign a large number of coarse labels, which are mixed with fine labels to train a robust saliency prediction model. Experiments on widely used, challenging benchmark datasets (e.g., FBMS and DAVIS) demonstrate that the proposed approach performs competitively with state-of-the-art saliency models on video saliency detection.
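The abstract names a convolutional LSTM as the building block but gives no architectural details. As a rough illustration only, the following sketch shows one generic ConvLSTM cell in the commonly used simplified form (no peephole connections), applied frame by frame to a short clip. It is not the authors' MSST-ConvLSTM: the class and function names (`ConvLSTMCell`, `conv2d_same`, `run_clip`), the kernel size, the random weights, and the single-scale design are all assumptions made for this example.

```python
import numpy as np

def conv2d_same(x, w):
    """Cross-correlate x (C_in, H, W) with w (C_out, C_in, k, k); zero-pad to keep H x W.

    Assumes an odd kernel size k.
    """
    c_out, _, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, wd = x.shape[1:]
    out = np.zeros((c_out, h, wd))
    for dy in range(k):
        for dx in range(k):
            patch = xp[:, dy:dy + h, dx:dx + wd]  # shifted view, (C_in, H, W)
            # Mix input channels into output channels for this kernel offset.
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], patch)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ConvLSTMCell:
    """Hypothetical single-scale ConvLSTM cell (illustrative, untrained)."""

    def __init__(self, in_ch, hid_ch, k=3, seed=0):
        rng = np.random.default_rng(seed)
        # One kernel bank yields all four gates from the concatenated [x, h].
        self.w = rng.normal(0.0, 0.1, size=(4 * hid_ch, in_ch + hid_ch, k, k))
        self.b = np.zeros((4 * hid_ch, 1, 1))
        self.hid_ch = hid_ch

    def step(self, x, h, c):
        z = conv2d_same(np.concatenate([x, h], axis=0), self.w) + self.b
        i, f, o, g = np.split(z, 4, axis=0)  # input, forget, output, candidate
        c_next = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h_next = sigmoid(o) * np.tanh(c_next)
        return h_next, c_next

def run_clip(cell, frames):
    """Run the cell over frames of shape (T, C, H, W); return hidden states (T, hid_ch, H, W)."""
    _, _, h, w = frames.shape
    hid = np.zeros((cell.hid_ch, h, w))
    mem = np.zeros_like(hid)
    outs = []
    for x in frames:
        hid, mem = cell.step(x, hid, mem)
        outs.append(hid)
    return np.stack(outs)
```

Because the gates are computed by convolution rather than by fully connected layers, the hidden state keeps the spatial layout of the frame, which is what lets a recurrent unit of this kind carry temporal cues while still producing a per-pixel map. A multi-scale variant would presumably run such cells on feature maps at several resolutions, but how the paper does so is not stated on this page.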

Published In

ICMR '18: Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval
June 2018
550 pages
ISBN:9781450350464
DOI:10.1145/3206025
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. deep learning
  2. spatiotemporal fusion
  3. video saliency

Qualifiers

  • Research-article

Conference

ICMR '18

Acceptance Rates

ICMR '18 Paper Acceptance Rate 44 of 136 submissions, 32%;
Overall Acceptance Rate 254 of 830 submissions, 31%
