Accelerating Large Language Model Decoding with Speculative Sampling
TL;DR Highlight
A technique where a small draft model proposes tokens and a large target model verifies them in parallel — speeding up LLM decoding by 2-2.5x with no loss in output quality.
Who Should Read
ML engineers and infrastructure developers who need to reduce LLM serving costs and response latency, especially teams running 70B+ models in production.
Core Mechanics
- Core idea: a small draft model generates K tokens ahead, then the large target model scores them all in parallel in a single forward pass — confirming multiple tokens at once
- Modified Rejection Sampling accepts/rejects draft tokens while mathematically preserving the target model's distribution — zero output quality loss
- Chinchilla 70B + 4B draft model combo achieves up to 2.46x speedup on HumanEval (code gen) and up to 2x on XSum (summarization)
- Especially effective for code generation — small models accurately predict repetitive patterns like 'for i in range(len(arr)):' so acceptance rate is high
- Applicable without modifying target model architecture, and can be combined with other optimizations like quantization or multi-query attention
- Too large a K (draft length) can backfire — K=3 was optimal for XSum, and larger K increases latency variance
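The modified rejection sampling rule above can be sketched in a few lines of plain Python. This is a minimal sketch of the accept/resample step for a single draft token, assuming `p` and `q` are the full vocabulary distributions from the draft and target models respectively:

```python
import random

def accept_or_resample(draft_token, p, q):
    """p: draft-model probs, q: target-model probs over the vocabulary.
    Returns (token, accepted). Marginally, the returned token is
    distributed exactly according to q — hence zero quality loss."""
    # Accept the draft token with probability min(1, q/p)
    if random.random() < min(1.0, q[draft_token] / p[draft_token]):
        return draft_token, True
    # On rejection, resample from the residual max(0, q - p), renormalized
    residual = [max(qi - pi, 0.0) for pi, qi in zip(p, q)]
    return random.choices(range(len(q)), weights=residual)[0], False
```

The key property: P(token = t) = min(p_t, q_t) + max(0, q_t - p_t) = q_t, so the output distribution matches the target model exactly regardless of how bad the draft model is — only speed, not quality, depends on draft accuracy.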
Evidence
- Chinchilla 70B on HumanEval: 14.1ms/token → 5.73ms/token (2.46x speedup); quality metrics (pass@100 on HumanEval, ROUGE on XSum) unchanged
- XSum summarization: nucleus sampling 14.1ms/token → 7.52ms/token (1.92x), greedy → 7.00ms/token (2.01x)
- HumanEval and greedy XSum speeds exceed the theoretical upper bound that hardware memory bandwidth imposes on plain auto-regressive sampling — possible only because each target forward pass now emits multiple tokens
- Draft model (4B) at 1.8ms/token, target model (Chinchilla 70B) at 14.1ms/token — ~8x speed difference sufficient to offset drafting overhead
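Plugging the paper's per-token latencies into a simple cost model makes the trade-off concrete. This is a back-of-the-envelope sketch, not the paper's analysis: it assumes each draft token is accepted independently with probability alpha and ignores the overhead of parallel scoring:

```python
# Latencies from the paper's measurements
T_DRAFT = 1.8    # ms per draft-model token (4B model)
T_TARGET = 14.1  # ms per target-model token (Chinchilla 70B)

def expected_speedup(alpha, k):
    """Speedup over plain auto-regressive decoding, under the simplifying
    assumption of i.i.d. acceptance with probability alpha."""
    # Expected tokens per cycle: accepted prefix plus one more token
    # (the resample on rejection, or the bonus token on full acceptance)
    # = 1 + alpha + alpha^2 + ... + alpha^k
    tokens = sum(alpha ** i for i in range(k + 1))
    cycle_ms = k * T_DRAFT + T_TARGET  # k draft steps + 1 target pass
    return tokens * T_TARGET / cycle_ms

for alpha in (0.6, 0.8, 0.9):
    for k in (3, 4, 5):
        print(f"alpha={alpha}, K={k}: {expected_speedup(alpha, k):.2f}x")
```

With alpha around 0.8 and K=4 this model predicts roughly 2.2x, in the same range as the measured 1.9-2.5x — and it shows why a larger K only pays off when acceptance rates are high.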
How to Apply
- When enabling speculative decoding in serving frameworks like vLLM or TGI, choose a draft model that is a smaller version of the target (typically 1/10 to 1/20 the size) and shares its tokenizer
- Apply it first to tasks with repetitive patterns, such as code autocomplete or SQL generation, for maximum effect — acceptance rates run much higher than in natural-language summarization, so K=4-5 works well there
- K value needs per-domain tuning — for natural language tasks try K=3-4, for code tasks try K=4-6, and measure actual latency to decide
Code Example
# Example of using Speculative Decoding with HuggingFace Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Target model (large model); in practice a 70B checkpoint needs a reduced
# dtype and device_map="auto" (requires the accelerate package) to fit
target_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
# Draft model (small model sharing the same tokenizer)
draft_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16, device_map="auto"
)
inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
# Enable speculative decoding: pass the draft model to the assistant_model parameter
outputs = target_model.generate(
    **inputs,
    assistant_model=draft_model,  # key parameter for speculative (assisted) decoding
    max_new_tokens=200,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# When serving with vLLM (flag names may vary across vLLM versions)
# vllm serve meta-llama/Llama-2-70b-hf \
# --speculative-model meta-llama/Llama-2-7b-hf \
#   --num-speculative-tokens 4
Original Abstract
We present speculative sampling, an algorithm for accelerating transformer decoding by enabling the generation of multiple tokens from each transformer call. Our algorithm relies on the observation that the latency of parallel scoring of short continuations, generated by a faster but less powerful draft model, is comparable to that of sampling a single token from the larger target model. This is combined with a novel modified rejection sampling scheme which preserves the distribution of the target model within hardware numerics. We benchmark speculative sampling with Chinchilla, a 70 billion parameter language model, achieving a 2-2.5x decoding speedup in a distributed setup, without compromising the sample quality or making modifications to the model itself.