JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention

Tian, Yuandong; Wang, Yiping; Zhang, Zhenyu; Chen, Beidi; Du, Simon

arXiv:2310.00535 (cs)

[Submitted on 1 Oct 2023 (v1), last revised 15 Mar 2024 (this version, v3)]

Abstract: We propose Joint MLP/Attention (JoMA) dynamics, a novel mathematical framework to understand the training procedure of multilayer Transformer architectures. This is achieved by integrating out the self-attention layer in Transformers, producing a modified dynamics of MLP layers only. JoMA removes unrealistic assumptions made in previous analyses (e.g., lack of residual connection) and predicts that, in the presence of nonlinear activations, the attention first becomes sparse (to learn salient tokens), then dense (to learn less salient tokens). In the linear case, it is consistent with existing works that show attention becomes sparse over time. We leverage JoMA to qualitatively explain how tokens are combined to form hierarchies in multilayer Transformers when the input tokens are generated by a latent hierarchical generative model. Experiments on models trained from real-world datasets (Wikitext2/Wikitext103) and on various pre-trained models (OPT, Pythia) verify our theoretical findings. Code can be found in this https URL.
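
The "sparse, then dense" prediction refers to how concentrated an attention distribution is. A minimal sketch of one way to quantify that, assuming Shannon entropy of the softmax weights as the sparsity measure (the choice of measure, the function names, and the toy scores below are illustrative assumptions, not taken from the paper's code):

```python
import math

def softmax(scores):
    # Subtract the max score for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_entropy(scores):
    # Shannon entropy (in nats) of the attention weights for one query.
    # Low entropy: weight concentrated on few tokens (sparse attention).
    # High entropy: weight spread over many tokens (dense attention).
    probs = softmax(scores)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical pre-softmax attention scores over four context tokens.
uniform_scores = [0.0, 0.0, 0.0, 0.0]  # no token preferred
peaked_scores = [8.0, 0.0, 0.0, 0.0]   # one salient token dominates

dense_entropy = attention_entropy(uniform_scores)
sparse_entropy = attention_entropy(peaked_scores)
print(dense_entropy > sparse_entropy)
```

Tracking such a quantity per head over training steps would show entropy falling while attention sparsifies and rising again if it later becomes dense.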

Comments:	ICLR'24 camera ready. Improve theorem 3 and theorem 4. Polish writing and add code link
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2310.00535 [cs.LG]
	(or arXiv:2310.00535v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2310.00535

Submission history

From: Yuandong Tian
[v1] Sun, 1 Oct 2023 01:21:35 UTC (2,879 KB)
[v2] Tue, 3 Oct 2023 04:23:26 UTC (2,875 KB)
[v3] Fri, 15 Mar 2024 02:03:21 UTC (3,568 KB)
