DOI: 10.1609/aaai.v38i9.28883

TransGOP: transformer-based gaze object prediction

Published: 20 February 2024

Abstract

Gaze object prediction (GOP) aims to predict the location and category of the object that a human is looking at. Previous gaze object prediction works use CNN-based object detectors to predict the object's location. However, we find that Transformer-based object detectors can predict more accurate object locations for dense objects in retail scenarios. Moreover, the long-distance modeling capability of the Transformer can help to build relationships between the human head and the gaze object, which is important for the GOP task. To this end, this paper introduces the Transformer into the field of gaze object prediction and proposes an end-to-end Transformer-based gaze object prediction method named TransGOP. Specifically, TransGOP uses an off-the-shelf Transformer-based object detector to detect the location of objects and designs a Transformer-based gaze autoencoder in the gaze regressor to establish long-distance gaze relationships. Moreover, to improve gaze heatmap regression, we propose an object-to-gaze cross-attention mechanism to let the queries of the gaze autoencoder learn global-memory position knowledge from the object detector. Finally, to make the whole framework trainable end-to-end, we propose a Gaze Box loss to jointly optimize the object detector and gaze regressor by enhancing the gaze heatmap energy inside the box of the gaze object. Extensive experiments on the GOO-Synth and GOO-Real datasets demonstrate that our TransGOP achieves state-of-the-art performance on all tracks, i.e., object detection, gaze estimation, and gaze object prediction. Our code will be available at https://github.com/chenxi-Guo/TransGOP.git.
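To make the two mechanisms named in the abstract concrete, below is a minimal PyTorch sketch, under assumed tensor shapes and module names that are not from the paper, of (a) an object-to-gaze cross-attention step in which gaze-autoencoder queries attend to the object detector's encoder memory and (b) a Gaze Box-style loss that rewards concentrating gaze-heatmap energy inside the ground-truth gaze object's box. It illustrates the idea as described here, not the authors' implementation.

```python
# Hypothetical sketch of the abstract's two ideas; all names and shapes are assumptions.
import torch
import torch.nn as nn


class ObjectToGazeCrossAttention(nn.Module):
    """Gaze queries (from the gaze autoencoder) attend to the detector's memory."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, gaze_queries: torch.Tensor, detector_memory: torch.Tensor) -> torch.Tensor:
        # gaze_queries:    (B, Nq, dim) queries of the gaze autoencoder
        # detector_memory: (B, Nm, dim) encoder memory of the object detector
        out, _ = self.attn(gaze_queries, detector_memory, detector_memory)
        return gaze_queries + out  # residual connection


def gaze_box_loss(heatmap: torch.Tensor, box: torch.Tensor) -> torch.Tensor:
    """Penalize gaze-heatmap energy falling outside the gaze object's box.

    heatmap: (B, H, W) predicted gaze heatmap, values in [0, 1]
    box:     (B, 4) ground-truth gaze object box (x1, y1, x2, y2), normalized to [0, 1]
    """
    B, H, W = heatmap.shape
    ys = torch.linspace(0, 1, H, device=heatmap.device).view(1, H, 1)
    xs = torch.linspace(0, 1, W, device=heatmap.device).view(1, 1, W)
    x1, y1, x2, y2 = box[:, 0], box[:, 1], box[:, 2], box[:, 3]
    # Binary mask of heatmap cells that lie inside each sample's gaze box.
    inside = ((xs >= x1.view(-1, 1, 1)) & (xs <= x2.view(-1, 1, 1)) &
              (ys >= y1.view(-1, 1, 1)) & (ys <= y2.view(-1, 1, 1))).float()
    energy_in = (heatmap * inside).flatten(1).sum(dim=1)
    energy_total = heatmap.flatten(1).sum(dim=1).clamp_min(1e-6)
    # Loss is small when most heatmap energy is concentrated inside the box.
    return (1.0 - energy_in / energy_total).mean()
```

Because the loss is differentiable with respect to both the heatmap and (in principle) the predicted box coordinates, it can serve as the joint term that couples the gaze regressor and the object detector during end-to-end training, which is the role the abstract assigns to the Gaze Box loss.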

Published In

AAAI'24/IAAI'24/EAAI'24: Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence
February 2024
23861 pages
ISBN: 978-1-57735-887-9

Sponsors

  • Association for the Advancement of Artificial Intelligence

Publisher

AAAI Press
