SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification
TL;DR Highlight
A system where small assistant models predict token candidates organized into a tree structure, then a large LLM verifies them all in parallel — achieving up to 3.5x speedup.
Who Should Read
ML infrastructure engineers or MLOps professionals who need to reduce LLM serving latency, especially teams running vLLM or TGI inference servers and hitting latency bottlenecks at small batch sizes.
Core Mechanics
- Existing LLM inference generates tokens one by one (autoregressive decoding), which is slow — SpecInfer organizes multiple candidate tokens predicted by small speculative models (SSMs) into a token tree and has the large LLM verify them all in parallel
- Two methods to build the token tree: expanding top-k tokens from one SSM, or merging outputs from multiple SSMs — tree width 5 raises stochastic decoding verification success rate from 52-57% to 96-97%
- Tree Attention + Topology-aware Causal Mask processes all tokens in the tree in parallel with a single GPU kernel — drastically reduces kernel launch overhead vs running separate kernels per sequence
- Multi-step Speculative Sampling (MSS) mathematically guarantees the same output distribution as the original LLM in stochastic decoding while improving average verified tokens per step by 1.27-1.28x over naive sampling
- For distributed LLM inference (multi-GPU): 1.5-2.8x faster than vLLM/HuggingFace TGI/FasterTransformer; for offloading-based inference (vs FlexGen): 2.6-3.5x faster — same output quality
- Effect maximized at small batch sizes (BS=1-2); as batch size grows, GPU idle resources decrease and the effect diminishes — optimized for real-time low-latency serving
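The topology-aware causal mask behind Tree Attention can be illustrated with a minimal sketch (not SpecInfer's actual GPU kernel): each tree token may attend only to itself and its ancestors, so every branch of the tree shares a single attention pass.

```python
# Sketch of a topology-aware causal mask for a token tree.
# The tree is given as parent indices; parents[i] is the index of
# token i's parent, or -1 if token i attaches to the committed prefix.

def tree_causal_mask(parents):
    """mask[i][j] is True iff token i may attend to token j."""
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root, marking ancestors
            mask[i][j] = True   # token i sees itself and each ancestor j
            j = parents[j]
    return mask

# Example tree: token 0 branches into two candidate continuations,
#   0 -> 1 -> 2   and   0 -> 3
mask = tree_causal_mask([-1, 0, 1, 0])
# Token 2 sees {0, 1, 2}; token 3 sees only {0, 3}: the two branches
# never attend to each other, yet run in one kernel.
```

Running the whole mask through one attention kernel is what removes the per-sequence kernel launch overhead the bullet above describes.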
Evidence
- LLaMA-65B multi-node (8× A10 GPU) distributed inference: 2.4-2.8x per-token latency reduction vs FasterTransformer (Figure 7)
- OPT-30B offloading inference: 3.5x (BS=1) to 2.7x (BS=16) per-token latency reduction vs FlexGen (Figure 8)
- Stochastic decoding: tree width=1 (sequence-based) vs width=5 — verification success rate 52-57% → 96-97% (Table 1)
- MSS vs Naive Sampling: 1.27-1.28x more verified tokens per step, Alpaca baseline 1.87 → 2.38 tokens/step (Table 3)
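The acceptance rule underlying MSS can be illustrated for a single candidate branch (a simplified sketch with illustrative names, not the paper's full multi-branch algorithm): a speculated token is accepted with probability min(1, p_LLM/p_SSM), which is what preserves the LLM's output distribution.

```python
import random

# Simplified speculative-sampling acceptance for ONE candidate branch.
# MSS generalizes this across all tree branches; p_llm and p_ssm map
# token -> probability under the large LLM and the small SSM.

def verify_branch(branch, p_llm, p_ssm):
    """Return how many speculated tokens are accepted along `branch`."""
    accepted = 0
    for tok in branch:
        # accept tok with probability min(1, p_llm(tok) / p_ssm(tok))
        if random.random() < min(1.0, p_llm[tok] / p_ssm[tok]):
            accepted += 1
        else:
            break  # on rejection, resample from the residual distribution
    return accepted

# If the LLM agrees with the SSM (ratio >= 1), every token is accepted:
n = verify_branch(["a", "b"], {"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5})
# If the LLM assigns zero probability, the token is always rejected:
m = verify_branch(["a"], {"a": 0.0}, {"a": 0.5})
```

The 1.87 → 2.38 tokens/step gain in Table 3 comes from running this check over many branches at once and committing the longest accepted path.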
How to Apply
- If serving LLaMA/OPT models with vLLM or TGI, attach a lightweight model from the same family (e.g., LLaMA-68M for LLaMA-7B serving) as the SSM and switch to SpecInfer for immediate latency improvement in the batch size 1-4 range
- If using CPU offloading like FlexGen due to GPU memory constraints, SpecInfer's offloading mode is especially effective — reduces the number of CPU↔GPU data transfers, expect 3x+ speedup on OPT-13B/30B
- The open-source implementation (FlexFlow-based) supports direct import of HuggingFace models — clone from https://github.com/flexflow/FlexFlow and tune expansion config (e.g., <1,1,3,1,1,1,1,1>) to find optimal tree width for your model/dataset
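Under one plausible reading of the expansion config (each entry is the branching factor applied at that speculation step), a small helper shows how a config like <1,1,3,1,1,1,1,1> translates into candidate sequences and tree tokens to verify; `tree_stats` is illustrative, not part of the FlexFlow API.

```python
# Sketch: count candidate sequences (leaves) and total speculated tokens
# for an expansion config, assuming entry k is the branching factor at
# speculation step k. Illustrative helper, not a FlexFlow function.

def tree_stats(expansion):
    leaves, total = 1, 0
    for width in expansion:
        leaves *= width   # each current leaf branches `width` ways
        total += leaves   # tokens added at this depth of the tree
    return leaves, total

# <1,1,3,1,1,1,1,1>: branch 3-ways at the 3rd step, linear elsewhere
leaves, total = tree_stats([1, 1, 3, 1, 1, 1, 1, 1])
# -> 3 candidate sequences, 20 speculated tokens per verification pass
```

Wider trees raise the verification success rate (Table 1) but cost more verification compute, so sweeping a few configs like this is how you find the sweet spot for your model/dataset.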
Code Example
# SpecInfer basic usage example (FlexFlow-based)
# Installation:
#   git clone --recursive https://github.com/goliaro/specinfer-ae.git
#   ./download_models.sh          # download models
#   ./server_gpu_experiments.sh   # run server GPU experiments
# Python API example (FlexFlow inference mode)
import flexflow.serve as ff

# Initialize the FlexFlow runtime first (tune GPU count/memory to your machine)
ff.init(num_gpus=1, memory_per_gpu=14000, zero_copy_memory_per_node=30000,
        tensor_parallelism_degree=1, pipeline_parallelism_degree=1)

# Configure LLM (main model) + SSM (small speculative model)
llm = ff.LLM("huggyllama/llama-7b")
ssm = ff.SSM("JackFram/llama-68m")

# Greedy decoding; the token tree expansion config (e.g. <1,1,3,1,1,1,1,1>,
# i.e. branch with width=3 at the 3rd step) is tuned separately; check the
# FlexFlow docs for the exact knob your installed version exposes
generation_config = ff.GenerationConfig(do_sample=False)

# Compile with the SSM attached, then start serving
llm.compile(generation_config, ssms=[ssm])
llm.start_server()
result = llm.generate("Machine learning is")
print(result.output_text)
llm.stop_server()
Original Abstract
This paper introduces SpecInfer, a system that accelerates generative large language model (LLM) serving with tree-based speculative inference and verification. The key idea behind SpecInfer is leveraging small speculative models to predict the LLM's outputs; the predictions are organized as a token tree, whose nodes each represent a candidate token sequence. The correctness of all candidate token sequences represented by a token tree is verified against the LLM in parallel using a novel tree-based parallel decoding mechanism. SpecInfer uses an LLM as a token tree verifier instead of an incremental decoder, which significantly reduces the end-to-end latency and computational requirement for serving generative LLMs while provably preserving model quality. Our evaluation shows that SpecInfer outperforms existing LLM serving systems by 1.5-2.8× for distributed LLM inference and by 2.6-3.5× for offloading-based LLM inference, while preserving the same generative performance. SpecInfer is publicly available at https://github.com/flexflow/FlexFlow/