SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification
TL;DR Highlight
A system where small assistant models predict token candidates organized into a tree structure, then a large LLM verifies them all in parallel — achieving up to 3.5x speedup.
Who Should Read
ML infrastructure engineers or MLOps professionals who need to reduce LLM serving latency, especially teams running vLLM or TGI inference servers and hitting latency bottlenecks at small batch sizes.
Core Mechanics
- Existing LLM inference generates tokens one by one (autoregressive decoding), which is slow — SpecInfer organizes multiple candidate tokens predicted by small speculative models (SSMs) into a token tree and has the large LLM verify them all in parallel
- Two methods to build the token tree: expanding top-k tokens from one SSM, or merging outputs from multiple SSMs — tree width 5 raises stochastic decoding verification success rate from 52-57% to 96-97%
- Tree Attention + Topology-aware Causal Mask processes all tokens in the tree in parallel with a single GPU kernel — drastically reduces kernel launch overhead vs running separate kernels per sequence
- Multi-step Speculative Sampling (MSS) mathematically guarantees the same output distribution as the original LLM in stochastic decoding while improving average verified tokens per step by 1.27-1.28x over naive sampling
- For distributed LLM inference (multi-GPU): 1.5-2.8x faster than vLLM/HuggingFace TGI/FasterTransformer; for offloading-based inference (vs FlexGen): 2.6-3.5x faster — same output quality
- Effect maximized at small batch sizes (BS=1-2); as batch size grows, GPU idle resources decrease and the effect diminishes — optimized for real-time low-latency serving
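The topology-aware causal mask behind Tree Attention can be illustrated with a minimal sketch (not SpecInfer's actual GPU kernel): each tree token may attend only to itself and its ancestors, so every branch of the tree shares a single attention pass.

```python
# Sketch of a topology-aware causal mask for a token tree.
# The tree is given as parent indices; parents[i] is the index of
# token i's parent, or -1 if token i attaches to the committed prefix.

def tree_causal_mask(parents):
    """mask[i][j] is True iff token i may attend to token j."""
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root, marking ancestors
            mask[i][j] = True   # token i sees itself and each ancestor j
            j = parents[j]
    return mask

# Example tree: token 0 branches into two candidate continuations,
#   0 -> 1 -> 2   and   0 -> 3
mask = tree_causal_mask([-1, 0, 1, 0])
# Token 2 sees {0, 1, 2}; token 3 sees only {0, 3}: the two branches
# never attend to each other, yet run in one kernel.
```

Running the whole mask through one attention kernel is what removes the per-sequence kernel launch overhead the bullet above describes.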
Evidence
- LLaMA-65B multi-node (8× A10 GPU) distributed inference: 2.4-2.8x per-token latency reduction vs FasterTransformer (Figure 7)
- OPT-30B offloading inference: 3.5x (BS=1) to 2.7x (BS=16) per-token latency reduction vs FlexGen (Figure 8)
- Stochastic decoding: tree width=1 (sequence-based) vs width=5 — verification success rate 52-57% → 96-97% (Table 1)
- MSS vs Naive Sampling: 1.27-1.28x more verified tokens per step, Alpaca baseline 1.87 → 2.38 tokens/step (Table 3)
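The acceptance rule underlying MSS can be illustrated for a single candidate branch (a simplified sketch with illustrative names, not the paper's full multi-branch algorithm): a speculated token is accepted with probability min(1, p_LLM/p_SSM), which is what preserves the LLM's output distribution.

```python
import random

# Simplified speculative-sampling acceptance for ONE candidate branch.
# MSS generalizes this across all tree branches; p_llm and p_ssm map
# token -> probability under the large LLM and the small SSM.

def verify_branch(branch, p_llm, p_ssm):
    """Return how many speculated tokens are accepted along `branch`."""
    accepted = 0
    for tok in branch:
        # accept tok with probability min(1, p_llm(tok) / p_ssm(tok))
        if random.random() < min(1.0, p_llm[tok] / p_ssm[tok]):
            accepted += 1
        else:
            break  # on rejection, resample from the residual distribution
    return accepted

# If the LLM agrees with the SSM (ratio >= 1), every token is accepted:
n = verify_branch(["a", "b"], {"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5})
# If the LLM assigns zero probability, the token is always rejected:
m = verify_branch(["a"], {"a": 0.0}, {"a": 0.5})
```

The 1.87 → 2.38 tokens/step gain in Table 3 comes from running this check over many branches at once and committing the longest accepted path.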
How to Apply
- If serving LLaMA/OPT models with vLLM or TGI, attach a lightweight model from the same family (e.g., LLaMA-68M for LLaMA-7B serving) as the SSM and switch to SpecInfer for immediate latency improvement in the batch size 1-4 range
- If using CPU offloading like FlexGen due to GPU memory constraints, SpecInfer's offloading mode is especially effective — reduces the number of CPU↔GPU data transfers, expect 3x+ speedup on OPT-13B/30B
- The open-source implementation (FlexFlow-based) supports direct import of HuggingFace models — clone from https://github.com/flexflow/FlexFlow and tune expansion config (e.g., <1,1,3,1,1,1,1,1>) to find optimal tree width for your model/dataset
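Under one plausible reading of the expansion config (each entry is the branching factor applied at that speculation step), a small helper shows how a config like <1,1,3,1,1,1,1,1> translates into candidate sequences and tree tokens to verify; `tree_stats` is illustrative, not part of the FlexFlow API.

```python
# Sketch: count candidate sequences (leaves) and total speculated tokens
# for an expansion config, assuming entry k is the branching factor at
# speculation step k. Illustrative helper, not a FlexFlow function.

def tree_stats(expansion):
    leaves, total = 1, 0
    for width in expansion:
        leaves *= width   # each current leaf branches `width` ways
        total += leaves   # tokens added at this depth of the tree
    return leaves, total

# <1,1,3,1,1,1,1,1>: branch 3-ways at the 3rd step, linear elsewhere
leaves, total = tree_stats([1, 1, 3, 1, 1, 1, 1, 1])
# -> 3 candidate sequences, 20 speculated tokens per verification pass
```

Wider trees raise the verification success rate (Table 1) but cost more verification compute, so sweeping a few configs like this is how you find the sweet spot for your model/dataset.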
Code Example
# SpecInfer basic usage example (FlexFlow-based)
# Installation:
#   git clone --recursive https://github.com/goliaro/specinfer-ae.git
#   ./download_models.sh          # download models
#   ./server_gpu_experiments.sh   # run server GPU experiments
# Python API example (FlexFlow inference mode)
import flexflow.serve as ff

# Initialize the FlexFlow runtime first (tune GPU count/memory to your machine)
ff.init(num_gpus=1, memory_per_gpu=14000, zero_copy_memory_per_node=30000,
        tensor_parallelism_degree=1, pipeline_parallelism_degree=1)

# Configure LLM (main model) + SSM (small speculative model)
llm = ff.LLM("huggyllama/llama-7b")
ssm = ff.SSM("JackFram/llama-68m")

# Greedy decoding; the token tree expansion config (e.g. <1,1,3,1,1,1,1,1>,
# i.e. branch with width=3 at the 3rd step) is tuned separately; check the
# FlexFlow docs for the exact knob your installed version exposes
generation_config = ff.GenerationConfig(do_sample=False)

# Compile with the SSM attached, then start serving
llm.compile(generation_config, ssms=[ssm])
llm.start_server()
result = llm.generate("Machine learning is")
print(result.output_text)
llm.stop_server()
Original Abstract
This paper introduces SpecInfer, a system that accelerates generative large language model (LLM) serving with tree-based speculative inference and verification. The key idea behind SpecInfer is leveraging small speculative models to predict the LLM's outputs; the predictions are organized as a token tree, whose nodes each represent a candidate token sequence. The correctness of all candidate token sequences represented by a token tree is verified against the LLM in parallel using a novel tree-based parallel decoding mechanism. SpecInfer uses an LLM as a token tree verifier instead of an incremental decoder, which significantly reduces the end-to-end latency and computational requirement for serving generative LLMs while provably preserving model quality. Our evaluation shows that SpecInfer outperforms existing LLM serving systems by 1.5-2.8× for distributed LLM inference and by 2.6-3.5× for offloading-based LLM inference, while preserving the same generative performance. SpecInfer is publicly available at https://github.com/flexflow/FlexFlow/