EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test

Mar 3, 2025 • Yuhui Li, Fangyun Wei, Chao Zhang +1

TL;DR Highlight

A speculative decoding technique that improves the draft model architecture to speed up LLM inference by up to 6.5x.

Who Should Read

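ML engineers and backend developers who want to cut LLM serving cost and latency. Especially useful if you are deploying a reasoning model such as DeepSeek-R1 to production and weighing speed optimizations.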

Core Mechanics

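EAGLE-2's feature-prediction constraint turns out to be the bottleneck that kept performance from improving as training data grew. EAGLE-3 removes the constraint and predicts tokens directly.
Training-time test: during training, the draft model's own outputs are fed back in as the next step's input, simulating test-time conditions and addressing the distribution shift the draft model otherwise faces at inference.
Instead of using only the target model's top-layer features, EAGLE-3 fuses low-, mid-, and high-layer features (concatenation followed by an FC layer) to draw on richer information.
A data scaling trend: the speedup keeps rising as the training data is scaled up to 8x. This kind of scaling curve is reported as a first for the EAGLE family.
On LLaMA-Instruct 3.1 8B, EAGLE-3 is about 1.4x faster than EAGLE-2. The peak speedup of about 6.5x comes from the HumanEval code generation task (6.47x on Vicuna 13B, per the Evidence figures).
In the SGLang framework, throughput still improves by 38% at batch size 64, whereas the earlier EAGLE degrades throughput from batch size 24 onward.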
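
The training-time test idea above can be illustrated with a toy unroll. This is a minimal sketch, not the authors' implementation: `unroll_with_own_outputs`, `toy_draft_step`, and the use of plain floats in place of feature vectors are all assumptions made for illustration. It only shows which inputs the draft model sees at each simulated step when its own predictions are appended to the input.

```python
from typing import Callable, List

# Hypothetical draft step: takes the inputs seen so far, returns one prediction.
DraftStep = Callable[[List[float]], float]


def unroll_with_own_outputs(
    target_inputs: List[float], draft_step: DraftStep, num_steps: int
) -> List[List[float]]:
    """Return the inputs the draft model sees at each simulated step."""
    inputs = list(target_inputs)  # step 0 starts from target-model-derived inputs
    seen_per_step: List[List[float]] = []
    for _ in range(num_steps):
        seen_per_step.append(list(inputs))
        prediction = draft_step(inputs)
        inputs.append(prediction)  # feed the draft's own output back in
    return seen_per_step


def toy_draft_step(inputs: List[float]) -> float:
    # Stand-in for a real draft model: "predict" the last value plus one.
    return inputs[-1] + 1.0


steps = unroll_with_own_outputs([1.0, 2.0], toy_draft_step, num_steps=3)
```

In an actual training loop a loss would presumably be computed at every simulated step, so that the draft model learns to cope with inputs that contain its own earlier predictions.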

Evidence

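On Vicuna 13B, EAGLE-3 reaches a 5.58x speedup on MT-bench and 6.47x on HumanEval, which is 1.31x and 1.30x over EAGLE-2, respectively.
On SGLang with an H100 at batch size 1, throughput goes from 158 tokens/s (SGLang baseline) to 244 tokens/s (EAGLE-2) to 373 tokens/s (EAGLE-3).
Ablation on LLaMA-Instruct 3.1 8B: removing the feature constraint alone raises the speedup from 3.16x to 3.82x, and adding feature fusion brings it to 4.40x.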
With 8x more training data, EAGLE-3's acceptance length rises above 6.0, while EAGLE-2 stays around 4.0 and barely changes.
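
The relative gains implied by the figures above can be checked with a few divisions. The sketch below only reuses numbers quoted in this section; the helper name `relative_gain` is an illustrative assumption.

```python
def relative_gain(new: float, old: float) -> float:
    """Ratio of a new measurement to an old one."""
    return new / old


# Throughput at batch size 1 (tokens/s), as quoted above.
baseline_tps, eagle2_tps, eagle3_tps = 158.0, 244.0, 373.0
eagle3_vs_baseline = relative_gain(eagle3_tps, baseline_tps)  # about 2.36
eagle3_vs_eagle2 = relative_gain(eagle3_tps, eagle2_tps)      # about 1.53

# Ablation speedups on LLaMA-Instruct 3.1 8B, as quoted above.
full_vs_eagle2_speedup = relative_gain(4.40, 3.16)            # about 1.39
```

The last ratio, about 1.39, matches the "about 1.4x over EAGLE-2" figure quoted for LLaMA-Instruct 3.1 8B.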

How to Apply

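If you already serve with vLLM or SGLang, attaching an EAGLE-3 draft model can speed up inference without further code changes. The released weights and code are available on GitHub.
If latency is the problem when serving a reasoning model such as DeepSeek-R1, add math-focused data (for example OpenThoughts-114k-math) when training the EAGLE-3 draft model to tailor the acceleration to that domain.
Consider EAGLE-3 when you also need throughput gains in batched serving (batch sizes 16 to 64). Earlier speculative decoding methods backfired at large batch sizes, but EAGLE-3 keeps its benefit up to batch size 64.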
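
Before rolling out a draft model it is worth measuring the gain on your own prompts. The helper below is a minimal sketch under stated assumptions: `measure_throughput` is a hypothetical name, and `generate` stands for whatever callable you wrap around your serving endpoint to return the generated token IDs for one prompt. It does not call any real vLLM or SGLang API.

```python
import time
from typing import Callable, List


def measure_throughput(
    generate: Callable[[str], List[int]], prompts: List[str]
) -> float:
    """Return generated tokens per second across all prompts."""
    start = time.perf_counter()
    total_tokens = sum(len(generate(prompt)) for prompt in prompts)
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very fast fake backends.
    return total_tokens / elapsed if elapsed > 0 else float("inf")
```

Running it once against a baseline deployment and once against an EAGLE-3 deployment, with the same prompts and sampling settings, gives a comparable tokens/s figure for each.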

Code Example

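# Example: using EAGLE-3 with SGLang (shell)
# 1. Clone the repo
# git clone https://github.com/SafeAILab/EAGLE

# 2. Point the SGLang server at an EAGLE-3 draft model at launch.
#    Recent SGLang releases select EAGLE-3 with the value EAGLE3 (EAGLE selects
#    the earlier variants); check the flags supported by your installed version.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path [EAGLE-3 draft model path] \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 8 \
  --speculative-num-draft-tokens 16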

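# 3. Using vLLM (Python)
# Assumes a recent vLLM release that takes speculative settings through
# speculative_config; older releases used different arguments, so check the
# documentation for your installed version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle3",
        "model": "[EAGLE-3 draft model path]",
        "num_speculative_tokens": 3,
    },
)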

sampling_params = SamplingParams(temperature=0, max_tokens=256)
outputs = llm.generate(["Tell me about speculative decoding"], sampling_params)

Terminology

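Speculative Decoding: A small auxiliary model (the draft model) predicts several tokens ahead, and the large model verifies them in one pass. It is like drafting a document quickly and proofreading it all at once afterwards.

Draft Model: A model much smaller than the main LLM that quickly generates candidate tokens. The large model verifies the candidates and keeps the ones it agrees with.

Acceptance Rate: The fraction of draft-model tokens that the target model accepts. The higher it is, the larger the speedup.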
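
The three terms above fit together as a draft-then-verify loop. The following is a toy greedy sketch, not EAGLE-3 itself: the function names and the integer "models" are illustrative assumptions, and the verification runs as a Python loop for clarity, whereas a real target model checks all drafted tokens in one forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy next-token function


def speculative_step(
    prefix: List[int], draft_next: NextToken, target_next: NextToken, k: int
) -> Tuple[List[int], int]:
    """Return (tokens emitted this step, number of draft tokens accepted)."""
    # 1. The draft model proposes k tokens autoregressively.
    context = list(prefix)
    drafted: List[int] = []
    for _ in range(k):
        token = draft_next(context)
        drafted.append(token)
        context.append(token)

    # 2. The target model accepts the longest prefix matching its own choices.
    context = list(prefix)
    emitted: List[int] = []
    for token in drafted:
        expected = target_next(context)
        if token != expected:
            emitted.append(expected)  # the target's token replaces the mismatch
            return emitted, len(emitted) - 1
        emitted.append(token)
        context.append(token)
    emitted.append(target_next(context))  # bonus token when every draft is accepted
    return emitted, len(drafted)


def toy_target(context: List[int]) -> int:
    return len(context) % 5


def toy_draft(context: List[int]) -> int:
    # Agrees with the target except at one position.
    return 9 if len(context) == 3 else len(context) % 5


tokens, accepted = speculative_step([0], toy_draft, toy_target, k=4)
acceptance_rate = accepted / 4
```

Here the draft's third proposal is rejected, so two of four drafted tokens are accepted and the step still emits three tokens for one round of target verification.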

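Feature Fusion: Combining the vectors produced by several layers inside the model. Earlier layers tend to carry syntactic and structural information and later layers semantic and contextual information, so combining them gives a richer representation.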
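
As a rough illustration of "concatenate, then apply an FC layer", the sketch below fuses three small vectors in plain Python. The function name, dimensions, and weights are made up for the example; the real layer operates on the target model's hidden states with learned weights.

```python
from typing import List


def fuse_features(
    low: List[float],
    mid: List[float],
    high: List[float],
    weight: List[List[float]],
    bias: List[float],
) -> List[float]:
    """Concatenate three layer features and apply one linear (FC) layer."""
    concat = low + mid + high  # length is the sum of the three feature sizes
    return [
        sum(w * x for w, x in zip(row, concat)) + b
        for row, b in zip(weight, bias)
    ]


# Toy example: three 2-dimensional features projected back to 2 dimensions.
weight = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
]
fused = fuse_features([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], weight, [0.5, 0.0])
```

The projection keeps the fused vector at the size the draft model expects even though three layers' worth of features went in.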

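Training-time Test: A technique that simulates the real inference situation during training, where the draft model's own output becomes its next input. It is like a school mock exam: practicing under the same conditions as the real thing.

Autoregressive Generation: The way an LLM generates tokens one at a time, in order. Each token must be produced before the next can be generated, so the process is inherently sequential and slow.

Throughput: The number of tokens processed per unit of time. A key indicator of cost efficiency in batched serving.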

Related Resources

https://github.com/SafeAILab/EAGLE

Original Abstract

The sequential nature of modern LLMs makes them expensive and slow, and speculative sampling has proven to be an effective solution to this problem. Methods like EAGLE perform autoregression at the feature level, reusing top-layer features from the target model to achieve better results than vanilla speculative sampling. A growing trend in the LLM community is scaling up training data to improve model intelligence without increasing inference costs. However, we observe that scaling up data provides limited improvements for EAGLE. We identify that this limitation arises from EAGLE's feature prediction constraints. In this paper, we introduce EAGLE-3, which abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion via a technique named training-time test. These improvements significantly enhance performance and enable the draft model to fully benefit from scaling up training data. Our experiments include both chat models and reasoning models, evaluated on five tasks. The results show that EAGLE-3 achieves a speedup ratio up to 6.5x, with about 1.4x improvement over EAGLE-2. In the SGLang framework, EAGLE-3 achieves a 1.38x throughput improvement at a batch size of 64. The code is available at https://github.com/SafeAILab/EAGLE.