Accelerating Production LLMs with Combined Token/Embedding Speculators

Wertheimer, Davis; Rosenkranz, Joshua; Parnell, Thomas; Suneja, Sahil; Ranganathan, Pavithra; Ganti, Raghu; Srivatsa, Mudhakar


arXiv:2404.19124 (cs)

[Submitted on 29 Apr 2024 (v1), last revised 6 Jun 2024 (this version, v2)]


Abstract: This technical report describes the design and training of novel speculative decoding draft models for accelerating the inference speeds of large language models in a production environment. By conditioning draft predictions on both context vectors and sampled tokens, we can train our speculators to efficiently predict high-quality n-grams, which the base model then accepts or rejects. This allows us to effectively predict multiple tokens per inference forward pass, accelerating wall-clock inference speeds of highly optimized base model implementations by 2-3x. We explore these initial results and describe next steps for further improvements.
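
The accept-or-reject step mentioned in the abstract can be illustrated with a minimal sketch of generic greedy speculative-decoding verification. This is not the report's speculator architecture or training procedure. The names `verify_draft`, `base_next_token`, and `toy_base` are hypothetical, and the base model is a toy function standing in for a real LLM. A production system would score all draft positions in a single base-model forward pass, whereas this sketch calls the base model once per position for clarity.

```python
from typing import Callable, List, Sequence


def verify_draft(
    base_next_token: Callable[[Sequence[int]], int],
    context: List[int],
    draft: Sequence[int],
) -> List[int]:
    """Return the tokens emitted by one verification step (greedy acceptance).

    Draft tokens are accepted while they match the base model's own greedy
    choice. At the first mismatch the base model's token is emitted instead
    and the rest of the draft is discarded.
    """
    emitted: List[int] = []
    prefix = list(context)
    for tok in draft:
        target = base_next_token(prefix)
        if tok != target:
            # First mismatch: keep the base model's token, drop the remainder.
            emitted.append(target)
            return emitted
        emitted.append(tok)
        prefix.append(tok)
    # Every draft token was accepted, so the base model adds one more token.
    emitted.append(base_next_token(prefix))
    return emitted


def toy_base(prefix: Sequence[int]) -> int:
    """Hypothetical stand-in for a base model: predicts (last token + 1) mod 10."""
    return (prefix[-1] + 1) % 10


if __name__ == "__main__":
    # The first two draft tokens match the toy base model, the third does not.
    print(verify_draft(toy_base, [1, 2], [3, 4, 9]))
```

Under these assumptions each verification step emits at least one token and at most `len(draft) + 1`, which is how a well-matched draft yields several tokens per base-model step.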

Comments:	Original upload 4/29/24, updated 6/6/24 with additional references to concurrent work
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2404.19124 [cs.CL]
	(or arXiv:2404.19124v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2404.19124

Submission history

From: Davis Wertheimer
[v1] Mon, 29 Apr 2024 21:59:07 UTC (2,397 KB)
[v2] Thu, 6 Jun 2024 18:38:34 UTC (2,398 KB)
