Accelerating Large Language Model Decoding with Speculative Sampling
TL;DR Highlight
A technique where a small draft model proposes tokens and a large target model verifies them in parallel — speeding up LLM decoding by 2-2.5x with no loss in output quality.
Who Should Read
ML engineers and infrastructure developers who need to reduce LLM serving costs and response latency, especially teams running 70B+ models in production.
Core Mechanics
- Core idea: a small draft model generates K tokens ahead, then the large target model scores them all in parallel in a single forward pass — confirming multiple tokens at once
- Modified Rejection Sampling accepts/rejects draft tokens while mathematically preserving the target model's distribution — zero output quality loss
- Chinchilla 70B + 4B draft model combo achieves up to 2.46x speedup on HumanEval (code gen) and up to 2x on XSum (summarization)
- Especially effective for code generation — small models accurately predict repetitive patterns like 'for i in range(len(arr)):' so acceptance rate is high
- Applicable without modifying target model architecture, and can be combined with other optimizations like quantization or multi-query attention
- Too large a K (draft length) can backfire — K=3 was optimal for XSum, and larger K increases latency variance
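The modified rejection sampling rule above can be sketched in a few lines of plain Python. This is a minimal sketch of the accept/resample step for a single draft token, assuming `p` and `q` are the full vocabulary distributions from the draft and target models respectively:

```python
import random

def accept_or_resample(draft_token, p, q):
    """p: draft-model probs, q: target-model probs over the vocabulary.
    Returns (token, accepted). Marginally, the returned token is
    distributed exactly according to q — hence zero quality loss."""
    # Accept the draft token with probability min(1, q/p)
    if random.random() < min(1.0, q[draft_token] / p[draft_token]):
        return draft_token, True
    # On rejection, resample from the residual max(0, q - p), renormalized
    residual = [max(qi - pi, 0.0) for pi, qi in zip(p, q)]
    return random.choices(range(len(q)), weights=residual)[0], False
```

The key property: P(token = t) = min(p_t, q_t) + max(0, q_t - p_t) = q_t, so the output distribution matches the target model exactly regardless of how bad the draft model is — only speed, not quality, depends on draft accuracy.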
Evidence
- Chinchilla 70B on HumanEval: 14.1ms/token → 5.73ms/token (2.46x speedup); quality metrics (pass@100 on HumanEval, ROUGE on XSum) unchanged
- XSum summarization: nucleus sampling 14.1ms/token → 7.52ms/token (1.92x), greedy → 7.00ms/token (2.01x)
- HumanEval and greedy XSum speeds exceed the theoretical upper bound that hardware memory bandwidth imposes on plain auto-regressive sampling — possible only because each target forward pass now emits multiple tokens
- Draft model (4B) at 1.8ms/token, target model (Chinchilla 70B) at 14.1ms/token — ~8x speed difference sufficient to offset drafting overhead
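Plugging the paper's per-token latencies into a simple cost model makes the trade-off concrete. This is a back-of-the-envelope sketch, not the paper's analysis: it assumes each draft token is accepted independently with probability alpha and ignores the overhead of parallel scoring:

```python
# Latencies from the paper's measurements
T_DRAFT = 1.8    # ms per draft-model token (4B model)
T_TARGET = 14.1  # ms per target-model token (Chinchilla 70B)

def expected_speedup(alpha, k):
    """Speedup over plain auto-regressive decoding, under the simplifying
    assumption of i.i.d. acceptance with probability alpha."""
    # Expected tokens per cycle: accepted prefix plus one more token
    # (the resample on rejection, or the bonus token on full acceptance)
    # = 1 + alpha + alpha^2 + ... + alpha^k
    tokens = sum(alpha ** i for i in range(k + 1))
    cycle_ms = k * T_DRAFT + T_TARGET  # k draft steps + 1 target pass
    return tokens * T_TARGET / cycle_ms

for alpha in (0.6, 0.8, 0.9):
    for k in (3, 4, 5):
        print(f"alpha={alpha}, K={k}: {expected_speedup(alpha, k):.2f}x")
```

With alpha around 0.8 and K=4 this model predicts roughly 2.2x, in the same range as the measured 1.9-2.5x — and it shows why a larger K only pays off when acceptance rates are high.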
How to Apply
- When enabling speculative decoding in serving frameworks like vLLM or TGI, choose a draft model that is a smaller version of the target (typically 1/10 to 1/20 the size) and shares its tokenizer
- Apply it first to tasks with repetitive patterns, such as code autocomplete or SQL generation, for maximum effect — acceptance rates run much higher than in natural-language summarization, so K=4-5 works well there
- K value needs per-domain tuning — for natural language tasks try K=3-4, for code tasks try K=4-6, and measure actual latency to decide
Code Example
# Example of using Speculative Decoding with HuggingFace Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Target model (large model); in practice a 70B checkpoint needs a reduced
# dtype and device_map="auto" (requires the accelerate package) to fit
target_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
# Draft model (small model sharing the same tokenizer)
draft_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16, device_map="auto"
)
inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
# Enable speculative decoding: pass the draft model to the assistant_model parameter
outputs = target_model.generate(
    **inputs,
    assistant_model=draft_model,  # key parameter for speculative (assisted) decoding
    max_new_tokens=200,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# When serving with vLLM (flag names may vary across vLLM versions)
# vllm serve meta-llama/Llama-2-70b-hf \
# --speculative-model meta-llama/Llama-2-7b-hf \
#   --num-speculative-tokens 4
Original Abstract
We present speculative sampling, an algorithm for accelerating transformer decoding by enabling the generation of multiple tokens from each transformer call. Our algorithm relies on the observation that the latency of parallel scoring of short continuations, generated by a faster but less powerful draft model, is comparable to that of sampling a single token from the larger target model. This is combined with a novel modified rejection sampling scheme which preserves the distribution of the target model within hardware numerics. We benchmark speculative sampling with Chinchilla, a 70 billion parameter language model, achieving a 2-2.5x decoding speedup in a distributed setup, without compromising the sample quality or making modifications to the model itself.