DOI: 10.1007/978-3-030-58558-7_25

Unified Image and Video Saliency Modeling

Published: 23 August 2020

Abstract

Visual saliency modeling for images and videos is treated as two independent tasks in recent computer vision literature. While image saliency modeling is a well-studied problem and progress on benchmarks such as SALICON and MIT300 is slowing, video saliency models have shown rapid gains on the recent DHF1K benchmark. Here, we take a step back and ask: Can image and video saliency modeling be approached via a unified model, with mutual benefit? We identify different sources of domain shift between image and video saliency data, and between different video saliency datasets, as a key challenge for effective joint modeling. To address this, we propose four novel domain adaptation techniques (Domain-Adaptive Priors, Domain-Adaptive Fusion, Domain-Adaptive Smoothing, and Bypass-RNN) in addition to an improved formulation of learned Gaussian priors. We integrate these techniques into a simple and lightweight encoder-RNN-decoder-style network, UNISAL, and train it jointly on image and video saliency data. We evaluate our method on the video saliency datasets DHF1K, Hollywood-2, and UCF-Sports, and on the image saliency datasets SALICON and MIT300. With a single set of parameters, UNISAL achieves state-of-the-art performance on all video saliency datasets and is on par with the state of the art on the image saliency datasets, despite a faster runtime and a 5- to 20-fold smaller model size than all competing deep methods. We provide retrospective analyses and ablation studies that confirm the importance of modeling the domain shift. The code is available at https://github.com/rdroste/unisal.
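To make two of the domain-adaptation ideas named above concrete, the following is a minimal PyTorch-style sketch: a Bypass-RNN, which skips the recurrent unit for static image batches, and domain-adaptive learned Gaussian priors, with one set of prior parameters per dataset (domain). The module names, tensor shapes, and the exact parameterization are illustrative assumptions for exposition, not the authors' implementation; see the linked repository for the actual code.

# Illustrative sketch only; names and parameterization are assumptions,
# not the UNISAL implementation (see https://github.com/rdroste/unisal).
import torch
import torch.nn as nn


class BypassRNN(nn.Module):
    """Recurrent block that is bypassed entirely for static image batches."""

    def __init__(self, channels):
        super().__init__()
        self.rnn = nn.GRU(channels, channels, batch_first=True)

    def forward(self, x, is_video):
        # x: (batch, time, channels); time == 1 for image datasets.
        if not is_video:
            return x  # images carry no temporal information, so skip the RNN
        out, _ = self.rnn(x)
        return out


class DomainAdaptivePrior(nn.Module):
    """One set of learned 2D Gaussian prior maps per dataset (domain)."""

    def __init__(self, num_domains, num_gaussians=4):
        super().__init__()
        # Per-domain Gaussian centers and log standard deviations in [0, 1] coords.
        self.mu = nn.Parameter(torch.rand(num_domains, num_gaussians, 2))
        self.log_sigma = nn.Parameter(torch.zeros(num_domains, num_gaussians, 2))

    def forward(self, feat, domain_idx):
        # feat: (batch, channels, H, W); prior maps are appended as extra channels.
        b, _, h, w = feat.shape
        ys = torch.linspace(0, 1, h, device=feat.device)
        xs = torch.linspace(0, 1, w, device=feat.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack([gy, gx], dim=-1)                    # (H, W, 2)
        mu = self.mu[domain_idx]                                # (G, 2)
        sigma = self.log_sigma[domain_idx].exp()                # (G, 2)
        diff = grid[None] - mu[:, None, None]                   # (G, H, W, 2)
        priors = torch.exp(-0.5 * (diff / sigma[:, None, None]).pow(2).sum(-1))
        return torch.cat([feat, priors[None].expand(b, -1, -1, -1)], dim=1)

In a joint training setup of this kind, each mini-batch would be drawn from a single dataset so that the matching domain index and is_video flag can be passed through; domain-adaptive fusion or smoothing layers could be added analogously by indexing per-domain convolution weights.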



Published In

Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V
Aug 2020
843 pages
ISBN: 978-3-030-58557-0
DOI: 10.1007/978-3-030-58558-7

Publisher

Springer-Verlag, Berlin, Heidelberg

Publication History

Published: 23 August 2020

Author Tags

  1. Visual saliency
  2. Video saliency
  3. Domain adaptation

Qualifiers

  • Article
