
An Attentive Survey of Attention Models

Published: 22 October 2021

Abstract

The attention model has become an important concept in neural networks and has been studied across diverse application domains. This survey provides a structured and comprehensive overview of developments in modeling attention. In particular, we propose a taxonomy that groups existing techniques into coherent categories. We review salient neural architectures in which attention has been incorporated and discuss applications in which modeling attention has had a significant impact. We also describe how attention has been used to improve the interpretability of neural networks. Finally, we discuss some future research directions for attention. We hope this survey provides a succinct introduction to attention models and guides practitioners in developing approaches for their applications.
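As a concrete illustration of the computation that most attention models build on, the following minimal sketch (Python with NumPy; the function name and toy dimensions are illustrative assumptions, not code from any surveyed model) implements scaled dot-product attention: each query is scored against a set of keys, the scores are normalized into attention weights with a softmax, and the output is a weighted average of the value vectors.

import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = queries.shape[-1]
    # Alignment scores between every query and every key.
    scores = queries @ keys.T / np.sqrt(d_k)
    # Numerically stable softmax: weights in each row sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output vector is an attention-weighted average of the values.
    return weights @ values, weights

# Toy usage: 2 queries attending over 4 key/value pairs of dimension 8.
rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
context, attn_weights = scaled_dot_product_attention(Q, K, V)
print(context.shape, attn_weights.shape)  # (2, 8) and (2, 4)

Many of the variants discussed in the survey, such as multi-head, self-, co-, and sparse attention, can be read as changes to how these scores are computed, normalized, or aggregated.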




Published In

ACM Transactions on Intelligent Systems and Technology, Volume 12, Issue 5
October 2021
383 pages
ISSN:2157-6904
EISSN:2157-6912
DOI:10.1145/3484925
Editor: Huan Liu

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 October 2021
Accepted: 01 May 2021
Revised: 01 April 2021
Received: 01 November 2020
Published in TIST Volume 12, Issue 5


Author Tags

  1. Attention
  2. attention models
  3. neural networks

Qualifiers

  • Research-article
  • Refereed
