DistServe: Optimizing LLM Serving Goodput by Separating Prefill and Decoding

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving

Jan 18, 2024 • Yinmin Zhong, Shengyu Liu, Junda Chen +5

TL;DR Highlight

A serving architecture that splits the prefill and decoding phases of LLM inference onto separate GPUs and serves up to 7.4x more requests than vLLM.

Who Should Read

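MLOps and infrastructure engineers who want to improve LLM response latency (TTFT/TPOT) and GPU cost efficiency at the same time. Especially useful if you serve LLMs in production with vLLM or DeepSpeed and are over-provisioning GPUs to meet your SLOs.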

Core Mechanics

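Existing systems such as vLLM batch prefill (prompt processing) and decoding (token generation) on the same GPUs. The two phases interfere with each other, which degrades both TTFT (time to first token) and TPOT (time per output token).
DistServe assigns prefill and decoding to different GPUs, which removes the interference, and applies a parallelism strategy (intra-op vs. inter-op) to each phase independently.
Prefill is compute-bound, so intra-op parallelism (tensor parallelism) is the better fit; decoding is memory-bandwidth-bound, so scaling throughput linearly with inter-op parallelism (pipeline parallelism) is more effective.
Transferring the KV cache from the prefill GPUs to the decoding GPUs sounds expensive, but in the reported measurements it accounts for less than 0.1% of total latency. With NVLINK it is effectively negligible.
An automatic placement algorithm takes the TTFT/TPOT SLOs and the cluster bandwidth into account and optimizes the prefill:decoding GPU ratio and the parallelism of each phase (search time up to 1.3 minutes).
A periodic check triggers re-planning when the workload pattern changes, so the placement is re-optimized for the new conditions.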
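
The choice between the two parallelism styles for prefill can be illustrated with a small queueing model. The sketch below treats a prefill instance as an M/D/1 queue (Poisson arrivals, fixed service time), which is a textbook simplification. The execution time `D`, the speedup `K`, and the arrival rates are illustrative assumptions, not measurements.

```python
def avg_ttft_single(rate: float, d: float) -> float:
    """Mean TTFT of an M/D/1 queue: service time d seconds, `rate` requests/s."""
    utilization = rate * d
    if utilization >= 1.0:
        raise ValueError("queue is unstable: rate * d must be below 1")
    return d + rate * d**2 / (2.0 * (1.0 - utilization))


def avg_ttft_intra_op(rate: float, d: float, k: float) -> float:
    """Intra-op across 2 GPUs: each request runs k times faster (assumed 1 < k < 2)."""
    return avg_ttft_single(rate, d / k)


def avg_ttft_inter_op(rate: float, d: float) -> float:
    """2-stage pipeline: a request still takes about d end to end, but a new one
    can start every d/2 seconds, so queueing follows the d/2 stage time."""
    stage = d / 2.0
    queueing_delay = avg_ttft_single(rate, stage) - stage
    return d + queueing_delay


D, K = 0.2, 1.6  # assumed: 200 ms prefill on one GPU, 1.6x speedup on two GPUs
for rate in (1.0, 7.0):
    print(rate, avg_ttft_intra_op(rate, D, K), avg_ttft_inter_op(rate, D))
```

With these assumed values, intra-op gives the lower mean TTFT at 1 req/s, while the pipeline is lower at 7 req/s, where queueing delay dominates. The preferred strategy for prefill therefore depends on the load as well as on the TTFT target.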

Evidence

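Serves up to 7.4x more requests than vLLM, or meets a 12.6x tighter SLO (measured at 90% SLO attainment).
On the OPT-175B ShareGPT workload, KV cache transfer overhead is under 0.1% of total latency, and 95% of requests see a transfer delay of 30 ms or less.
On the summarization task (OPT-66B), 4.3x higher request rate and a 12.6x stricter SLO than vLLM.
Simulator accuracy: SLO attainment differs from the real system by 2% or less, which supports the reliability of the placement algorithm.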
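
The "90% SLO attainment" criterion can be made concrete with a few lines of Python. This is a minimal sketch: the `RequestLatency` record, the function names, and the sample numbers are hypothetical and only show how attainment and per-GPU goodput relate.

```python
from dataclasses import dataclass


@dataclass
class RequestLatency:
    ttft: float  # seconds until the first token
    tpot: float  # average seconds per subsequent token


def slo_attainment(latencies, ttft_slo: float, tpot_slo: float) -> float:
    """Fraction of requests that meet BOTH the TTFT and the TPOT target."""
    if not latencies:
        return 0.0
    ok = sum(1 for r in latencies if r.ttft <= ttft_slo and r.tpot <= tpot_slo)
    return ok / len(latencies)


def per_gpu_goodput(attainment_by_rate: dict, num_gpus: int, target: float = 0.9) -> float:
    """Highest tested request rate whose attainment reaches the target, per GPU."""
    feasible = [rate for rate, att in attainment_by_rate.items() if att >= target]
    return max(feasible, default=0.0) / num_gpus


# Hypothetical measurements: 2 of 4 requests meet both targets.
sample = [
    RequestLatency(1.0, 0.10),
    RequestLatency(2.0, 0.12),
    RequestLatency(3.0, 0.10),  # misses TTFT
    RequestLatency(1.5, 0.20),  # misses TPOT
]
print(slo_attainment(sample, ttft_slo=2.5, tpot_slo=0.15))
print(per_gpu_goodput({2.0: 0.99, 4.0: 0.93, 6.0: 0.71}, num_gpus=8))
```

A request that is fast on one metric but misses the other counts as a miss, which is why a system tuned for only TTFT or only TPOT can show high throughput and still low goodput.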

How to Apply

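If you run vLLM and TPOT misses its SLO: deploy prefill-only and decoding-only instances separately, and try tensor parallelism for prefill and pipeline parallelism for decoding.
If you serve apps with different SLO requirements at the same time, such as a chatbot (low TTFT matters) and document summarization (low TPOT matters): set TTFT/TPOT targets per app and let DistServe's placement algorithm derive the GPU allocation ratio, so the SLOs can be met without over-provisioning.
If the cluster has intra-node NVLINK but limited cross-node bandwidth (e.g., 25 Gbps): use the Low Node-Affinity algorithm to place the prefill and decoding instances on the same node and transfer the KV cache over NVLINK, which minimizes the overhead.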
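
A back-of-the-envelope estimate shows why the link used for the KV cache transfer matters. The sketch assumes the public OPT-66B configuration (64 layers, hidden size 9216), FP16 values, a 512-token prompt, and peak link bandwidths of 25 Gbps and 600 GB/s. All of these are assumptions chosen for illustration, and real links deliver less than their peak.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, hidden_size: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache size: one key and one value vector per layer per token."""
    return 2 * num_layers * hidden_size * bytes_per_value * num_tokens


def transfer_seconds(num_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Ideal transfer time at the given bandwidth (ignores protocol overhead)."""
    return num_bytes / bandwidth_bytes_per_s


# Assumed model shape and prompt length (see the lead-in above).
size = kv_cache_bytes(num_tokens=512, num_layers=64, hidden_size=9216)

cross_node = transfer_seconds(size, 25e9 / 8)  # 25 Gbps network, bits -> bytes
nvlink = transfer_seconds(size, 600e9)         # assumed 600 GB/s NVLINK peak

print(f"KV cache: {size / 2**30:.3f} GiB")
print(f"cross-node: {cross_node * 1e3:.0f} ms, NVLINK: {nvlink * 1e3:.1f} ms")
```

Under these assumptions a single request carries roughly 1.1 GiB of KV cache, which takes a few hundred milliseconds over a 25 Gbps link and a few milliseconds over NVLINK. That gap is the motivation for keeping paired prefill and decoding instances on the same node when cross-node bandwidth is limited.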

Code Example

# DistServe GitHub: https://github.com/LLMServe/DistServe

# DistServe deployment example (conceptual flow, not the project's actual API)
# 1. Profile the workload characteristics
workload = {
    'avg_input_length': 755,   # tokens, ShareGPT
    'avg_output_length': 200,  # tokens
    'arrival_rate': 5.0,        # req/s
    'ttft_slo': 2.5,            # seconds (OPT-66B chatbot)
    'tpot_slo': 0.15            # seconds
}

# 2. Search for the best GPU placement with the placement algorithm
# DistServe finds a configuration like the one below automatically
# Example result for OPT-66B on ShareGPT:
optimal_placement = {
    'prefill_instance': {
        'tensor_parallelism': 4,   # intra-op: to reduce TTFT
        'pipeline_parallelism': 1,
        'num_gpus': 4
    },
    'decoding_instance': {
        'tensor_parallelism': 2,
        'pipeline_parallelism': 2,  # inter-op: scale throughput linearly
        'num_gpus': 4
    },
    'prefill_to_decoding_ratio': '1:1'  # 2:1 is also possible, depending on the workload
}

# 3. KV cache transfer: after prefill completes, the decoding instance fetches the cache ('pull')
# (pull is used instead of push to avoid memory overload)

# Client request through an OpenAI API-compatible interface
# (the endpoint URL and model name below are placeholders)
import openai
client = openai.OpenAI(
    base_url='http://distserve-endpoint:8000/v1',
    api_key='dummy'
)
response = client.chat.completions.create(
    model='opt-66b',
    messages=[{'role': 'user', 'content': 'Summarize this article...'}],
    max_tokens=200
)
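
Step 2 above can be sketched as a small search. This is a simplified illustration rather than DistServe's code: it picks, for each phase, the parallelism configuration with the highest goodput per GPU and then replicates it until the target rate is covered. The `Config` type, the function names, and every goodput number are hypothetical placeholders; in practice the per-configuration goodput would come from a simulator or a profiling run.

```python
import math
from typing import NamedTuple


class Config(NamedTuple):
    tp: int  # tensor (intra-op) parallel degree
    pp: int  # pipeline (inter-op) parallel degree


def num_gpus(cfg: Config) -> int:
    return cfg.tp * cfg.pp


def pick_best(goodput_by_config: dict) -> Config:
    """Configuration with the highest goodput per GPU."""
    return max(goodput_by_config, key=lambda c: goodput_by_config[c] / num_gpus(c))


def plan(prefill: dict, decoding: dict, target_rate: float) -> dict:
    """Choose one config per phase, then replicate it to cover target_rate (req/s)."""
    p_cfg, d_cfg = pick_best(prefill), pick_best(decoding)
    n_prefill = math.ceil(target_rate / prefill[p_cfg])
    n_decoding = math.ceil(target_rate / decoding[d_cfg])
    return {
        'prefill': (p_cfg, n_prefill),
        'decoding': (d_cfg, n_decoding),
        'total_gpus': n_prefill * num_gpus(p_cfg) + n_decoding * num_gpus(d_cfg),
    }


# Hypothetical per-instance goodput (req/s within SLO) for each candidate config.
prefill_goodput = {Config(2, 1): 1.0, Config(4, 1): 3.0, Config(2, 2): 2.2}
decoding_goodput = {Config(4, 1): 2.4, Config(2, 2): 2.8, Config(1, 4): 2.0}

result = plan(prefill_goodput, decoding_goodput, target_rate=5.0)
print(result)
```

Because each phase is scored and replicated independently, the prefill:decoding instance ratio falls out of the search instead of being fixed up front.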

Terminology

TTFT: Time To First Token. The time from when the user sends a request until the first token appears. The most important metric for a chatbot to feel responsive.

TPOT: Time Per Output Token. The average time to generate each token from the second token onward. If it is faster than human reading speed (about 250 words per minute), it feels fast enough.

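Goodput: The maximum request rate each GPU can serve while meeting the SLO (latency target). Unlike raw throughput, it measures throughput with quality guaranteed, which makes it a better metric for cost efficiency.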

SLO: Service Level Objective. A service quality target expressed as a number, such as "95% of requests must return their first token within 0.5 seconds".

KV Cache: Intermediate results (keys and values) that the LLM stores in GPU memory so it does not have to recompute earlier tokens. Essential for the decoding phase, but it consumes a lot of memory.

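intra-op parallelism: Splitting a single operation, such as a matrix multiplication, across multiple GPUs that compute in parallel (tensor parallelism). It requires a lot of inter-GPU communication but can reduce the latency of a single request.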

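inter-op parallelism: Partitioning the LLM's layers across multiple GPUs as pipeline stages (pipeline parallelism). It needs little communication and can increase throughput in proportion to the number of GPUs.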

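Continuous Batching: A technique that inserts newly arrived requests into the batch currently being processed to raise GPU utilization. vLLM adopts it; it is more efficient than static batching but suffers from prefill/decoding interference.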
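
The two latency metrics defined above can be computed from per-token timestamps. The helper below is a minimal sketch; the function name and the sample timestamps are hypothetical.

```python
def ttft_and_tpot(arrival: float, token_times: list) -> tuple:
    """Return (TTFT, TPOT) in seconds for one request.

    arrival:     time the request was received
    token_times: emission time of each output token, in order
    """
    if not token_times:
        raise ValueError("request produced no tokens")
    ttft = token_times[0] - arrival
    if len(token_times) == 1:
        return ttft, 0.0  # TPOT is undefined with a single token; report 0
    # Average gap between consecutive tokens, from the second token onward.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Hypothetical request: arrives at t=10.0 s, first token 0.4 s later,
# then one token every 0.1 s.
ttft, tpot = ttft_and_tpot(10.0, [10.4, 10.5, 10.6, 10.7, 10.8])
print(f"TTFT={ttft:.2f}s TPOT={tpot:.2f}s")
```

TTFT reflects queueing plus prefill time, while TPOT reflects only the decoding steps, which is why the two can be tuned separately once the phases run on different GPUs.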

Related Resources

DistServe GitHub Repository: https://github.com/LLMServe/DistServe

Original Abstract

DistServe improves the performance of large language models (LLMs) serving by disaggregating the prefill and decoding computation. Existing LLM serving systems colocate the two phases and batch the computation of prefill and decoding across all users and requests. We find that this strategy not only leads to strong prefill-decoding interferences but also couples the resource allocation and parallelism plans for both phases. LLM applications often emphasize individual latency for each phase: time to first token (TTFT) for the prefill phase and time per output token (TPOT) of each request for the decoding phase. In the presence of stringent latency requirements, existing systems have to prioritize one latency over the other, or over-provision compute resources to meet both. DistServe assigns prefill and decoding computation to different GPUs, hence eliminating prefill-decoding interferences. Given the application's TTFT and TPOT requirements, DistServe co-optimizes the resource allocation and parallelism strategy tailored for each phase. DistServe also places the two phases according to the serving cluster's bandwidth to minimize the communication caused by disaggregation. As a result, DistServe significantly improves LLM serving performance in terms of the maximum rate that can be served within both TTFT and TPOT constraints on each GPU. Our evaluations show that on various popular LLMs, applications, and latency requirements, DistServe can serve 7.4x more requests or 12.6x tighter SLO, compared to state-of-the-art systems, while staying within latency constraints for > 90% of requests.