DOI: 10.1609/aaai.v38i9.28883

TransGOP: transformer-based gaze object prediction

Published: 20 February 2024

Abstract

Gaze object prediction (GOP) aims to predict the location and category of the object that a human is looking at. Previous gaze object prediction works use CNN-based object detectors to predict the object's location. However, we find that Transformer-based object detectors can predict more accurate object locations for dense objects in retail scenarios. Moreover, the long-distance modeling capability of the Transformer can help to build relationships between the human head and the gaze object, which is important for the GOP task. To this end, this paper introduces the Transformer into the field of gaze object prediction and proposes an end-to-end Transformer-based gaze object prediction method named TransGOP. Specifically, TransGOP uses an off-the-shelf Transformer-based object detector to detect the location of objects and designs a Transformer-based gaze autoencoder in the gaze regressor to establish long-distance gaze relationships. Moreover, to improve gaze heatmap regression, we propose an object-to-gaze cross-attention mechanism to let the queries of the gaze autoencoder learn global-memory position knowledge from the object detector. Finally, to make the whole framework trainable end-to-end, we propose a Gaze Box loss to jointly optimize the object detector and gaze regressor by enhancing the gaze heatmap energy inside the box of the gaze object. Extensive experiments on the GOO-Synth and GOO-Real datasets demonstrate that our TransGOP achieves state-of-the-art performance on all tracks, i.e., object detection, gaze estimation, and gaze object prediction. Our code will be available at https://github.com/chenxi-Guo/TransGOP.git.
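To make the two mechanisms named in the abstract concrete, below is a minimal PyTorch sketch, under assumed tensor shapes and module names that are not from the paper, of (a) an object-to-gaze cross-attention step in which gaze-autoencoder queries attend to the object detector's encoder memory and (b) a Gaze Box-style loss that rewards concentrating gaze-heatmap energy inside the ground-truth gaze object's box. It illustrates the idea as described here, not the authors' implementation.

```python
# Hypothetical sketch of the abstract's two ideas; all names and shapes are assumptions.
import torch
import torch.nn as nn


class ObjectToGazeCrossAttention(nn.Module):
    """Gaze queries (from the gaze autoencoder) attend to the detector's memory."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, gaze_queries: torch.Tensor, detector_memory: torch.Tensor) -> torch.Tensor:
        # gaze_queries:    (B, Nq, dim) queries of the gaze autoencoder
        # detector_memory: (B, Nm, dim) encoder memory of the object detector
        out, _ = self.attn(gaze_queries, detector_memory, detector_memory)
        return gaze_queries + out  # residual connection


def gaze_box_loss(heatmap: torch.Tensor, box: torch.Tensor) -> torch.Tensor:
    """Penalize gaze-heatmap energy falling outside the gaze object's box.

    heatmap: (B, H, W) predicted gaze heatmap, values in [0, 1]
    box:     (B, 4) ground-truth gaze object box (x1, y1, x2, y2), normalized to [0, 1]
    """
    B, H, W = heatmap.shape
    ys = torch.linspace(0, 1, H, device=heatmap.device).view(1, H, 1)
    xs = torch.linspace(0, 1, W, device=heatmap.device).view(1, 1, W)
    x1, y1, x2, y2 = box[:, 0], box[:, 1], box[:, 2], box[:, 3]
    # Binary mask of heatmap cells that lie inside each sample's gaze box.
    inside = ((xs >= x1.view(-1, 1, 1)) & (xs <= x2.view(-1, 1, 1)) &
              (ys >= y1.view(-1, 1, 1)) & (ys <= y2.view(-1, 1, 1))).float()
    energy_in = (heatmap * inside).flatten(1).sum(dim=1)
    energy_total = heatmap.flatten(1).sum(dim=1).clamp_min(1e-6)
    # Loss is small when most heatmap energy is concentrated inside the box.
    return (1.0 - energy_in / energy_total).mean()
```

Because the loss is differentiable with respect to both the heatmap and (in principle) the predicted box coordinates, it can serve as the joint term that couples the gaze regressor and the object detector during end-to-end training, which is the role the abstract assigns to the Gaze Box loss.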

Published In

AAAI'24/IAAI'24/EAAI'24: Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence
February 2024
23861 pages
ISBN: 978-1-57735-887-9

Sponsors

  • Association for the Advancement of Artificial Intelligence

Publisher

AAAI Press
