Upcycling Large Language Models into Mixture of Experts

He, Ethan; Khattar, Abhinav; Prenger, Ryan; Korthikanti, Vijay; Yan, Zijie; Liu, Tong; Fan, Shiqing; Aithal, Ashwath; Shoeybi, Mohammad; Catanzaro, Bryan

arXiv:2410.07524 (cs)

[Submitted on 10 Oct 2024]

Abstract: Upcycling pre-trained dense language models into sparse mixture-of-experts (MoE) models is an efficient approach to increase the model capacity of already trained models. However, optimal techniques for upcycling at scale remain unclear. In this work, we conduct an extensive study of upcycling methods and hyperparameters for billion-parameter scale language models. We propose a novel "virtual group" initialization scheme and weight scaling approach to enable upcycling into fine-grained MoE architectures. Through ablations, we find that upcycling outperforms continued dense model training. In addition, we show that softmax-then-topK expert routing improves over the topK-then-softmax approach and that higher-granularity MoEs can help improve accuracy. Finally, we upcycled Nemotron-4 15B on 1T tokens and compared it to a continuously trained version of the same model on the same 1T tokens: the continuously trained model achieved 65.3% MMLU, whereas the upcycled model achieved 67.6%. Our results offer insights and best practices to effectively leverage upcycling for building MoE language models.
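
The abstract contrasts two orderings of the router's normalization and selection steps. The sketch below is an illustration only, not the authors' implementation: the function names, the plain-Python lists, and the example logits are all assumptions, and it shows just the ordering difference for a single token's expert logits.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_indices(xs, k):
    """Indices of the k largest values, largest first."""
    return sorted(range(len(xs)), key=lambda i: xs[i], reverse=True)[:k]


def softmax_then_topk(logits, k):
    """Normalize over ALL experts first, then keep the k largest probabilities.

    The kept weights are not renormalized, so they generally sum to less than 1.
    """
    probs = softmax(logits)
    idx = top_k_indices(probs, k)
    return {i: probs[i] for i in idx}


def topk_then_softmax(logits, k):
    """Select the k largest logits first, then normalize over those k only.

    The kept weights always sum to 1.
    """
    idx = top_k_indices(logits, k)
    weights = softmax([logits[i] for i in idx])
    return dict(zip(idx, weights))


# Hypothetical router logits for one token over four experts.
example_logits = [2.0, 1.0, 0.5, -1.0]
print(softmax_then_topk(example_logits, k=2))
print(topk_then_softmax(example_logits, k=2))
```

Both orderings select the same experts and preserve the ratio between their weights; they differ in scale. With topK-then-softmax the weights are renormalized to sum to 1 (so with k = 1 the single weight is always 1), whereas softmax-then-topK keeps each selected expert's probability relative to all experts.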

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2410.07524 [cs.CL]
	(or arXiv:2410.07524v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2410.07524

Submission history

From: Ethan He
[v1] Thu, 10 Oct 2024 01:36:03 UTC (3,403 KB)
