D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models

Que, Haoran; Liu, Jiaheng; Zhang, Ge; Zhang, Chenchen; Qu, Xingwei; Ma, Yinghao; Duan, Feiyu; Bai, Zhiqi; Wang, Jiakai; Zhang, Yuanxing; Tan, Xu; Fu, Jie; Su, Wenbo; Wang, Jiamang; Qu, Lin; Zheng, Bo

arXiv:2406.01375 (cs)

[Submitted on 3 Jun 2024]

Abstract: Continual Pre-Training (CPT) of Large Language Models (LLMs) is widely used to strengthen a model's fundamental understanding of specific downstream domains (e.g., math and code). For CPT of domain-specific LLMs, one important question is how to choose the optimal mixture ratio between the general corpus (e.g., Dolma, SlimPajama) and the downstream domain corpus. Existing methods usually rely on laborious grid searches over a set of mixture ratios, which incur high GPU training costs, and even then the selected ratio is not guaranteed to be optimal for the specific domain. To address these limitations, and inspired by the use of Scaling Laws for performance prediction, we investigate the Scaling Law of Domain-specific Continual Pre-Training (D-CPT Law) to determine the optimal mixture ratio at acceptable training cost for LLMs of different sizes. Specifically, by fitting the D-CPT Law on a limited number of small-scale experiments, we can predict general and downstream performance for arbitrary mixture ratios, model sizes, and dataset sizes. Moreover, we extend the standard D-CPT Law to cross-domain settings and propose the Cross-Domain D-CPT Law, which predicts the D-CPT Law of a target domain using very small training costs (about 1% of the normal training costs) on that domain. Comprehensive experimental results on six downstream domains demonstrate the effectiveness and generalizability of the proposed D-CPT Law and Cross-Domain D-CPT Law.
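The general workflow the abstract describes (fit a parametric law on a few small runs, then predict loss at unseen settings instead of grid-searching) can be illustrated with a minimal sketch. The abstract does not state the D-CPT Law's functional form, so the log-linear form below, log(loss) = a + b*log(D) + c*log(r), is a hypothetical stand-in and not the paper's parameterization. The names `fit_log_linear` and `predict_loss`, the coefficients, and the data points are all assumptions made up for illustration; the "runs" are synthetic, not results from the paper.

```python
import math

def fit_log_linear(samples):
    """Least-squares fit of log(loss) = a + b*log(D) + c*log(r).

    samples: list of (dataset_size, mixture_ratio, loss) tuples.
    Returns the fitted coefficients (a, b, c).
    NOTE: this functional form is a hypothetical stand-in, not the D-CPT Law.
    """
    rows = [(1.0, math.log(d), math.log(r)) for d, r, _ in samples]
    ys = [math.log(loss) for _, _, loss in samples]
    n = 3
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(n)] for i in range(n)]
    xty = [sum(row[i] * y for row, y in zip(rows, ys)) for i in range(n)]
    # Solve the 3x3 system by Gauss-Jordan elimination with partial pivoting.
    m = [xtx[i] + [xty[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(m[k][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for k in range(n):
            if k != col:
                factor = m[k][col] / m[col][col]
                m[k] = [x - factor * y for x, y in zip(m[k], m[col])]
    return tuple(m[i][n] / m[i][i] for i in range(n))

def predict_loss(params, dataset_size, mixture_ratio):
    """Evaluate the fitted (hypothetical) law at an arbitrary setting."""
    a, b, c = params
    return math.exp(a + b * math.log(dataset_size) + c * math.log(mixture_ratio))

# Synthetic "small-scale runs" generated from known coefficients (illustrative only).
TRUE_A, TRUE_B, TRUE_C = math.log(5.0), -0.10, -0.05
samples = [
    (d, r, math.exp(TRUE_A + TRUE_B * math.log(d) + TRUE_C * math.log(r)))
    for d in (1e8, 1e9, 1e10)   # dataset sizes in tokens (made up)
    for r in (0.2, 0.5, 0.8)    # domain-corpus mixture ratios (made up)
]

params = fit_log_linear(samples)
# Predict at a dataset size and mixture ratio that were never trained on.
predicted = predict_loss(params, 1e11, 0.33)
```

Because the synthetic data are noise-free and follow the assumed form exactly, the fit recovers the generating coefficients; real training losses would be noisy and would call for the law and fitting procedure given in the paper itself.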

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2406.01375 [cs.CL]
	(or arXiv:2406.01375v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2406.01375

Submission history

From: Jiaheng Liu
[v1] Mon, 3 Jun 2024 14:40:31 UTC (1,211 KB)
