Efficient Memory Management for Large Language Model Serving with PagedAttention

Sep 12, 2023 • Woosuk Kwon, Zhuohan Li, Siyuan Zhuang +6

TL;DR Highlight

The vLLM paper: it applies OS virtual memory techniques to LLM serving, nearly eliminating KV cache memory waste and raising throughput 2-4×.

Who Should Read

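ML engineers and backend developers who operate an LLM API server themselves or are considering self-hosting, especially those whose throughput is capped because a shortage of GPU memory keeps the batch size small.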

Core Mechanics

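Existing systems preallocate the KV cache (the memory where a transformer stores key/value state for earlier tokens) as one contiguous region sized for the maximum length, so only 20-38% of it holds actual token state.
PagedAttention splits the KV cache into fixed-size blocks stored in non-contiguous memory, like OS paging - this removes external fragmentation and limits internal fragmentation to the last block of each sequence.
Requests that share a prompt (parallel sampling, beam search) physically share KV cache blocks and diverge through copy-on-write - up to 55% memory savings with beam search.
vLLM achieves up to 22× higher throughput than FasterTransformer and 2-4× higher than Orca (at the same latency).
When GPU memory runs out, vLLM supports two preemption strategies: swapping the KV cache out to CPU RAM, or discarding it and recomputing it later.
Supports major models such as GPT, OPT, and LLaMA and provides an OpenAI API-compatible interface - usable as a drop-in replacement.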
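
The block mapping and copy-on-write behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not vLLM's implementation: `BlockAllocator`, `Sequence`, and the block size of 16 are hypothetical names and values chosen for the example, and only block ids are tracked (no actual key/value tensors).

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value)


class BlockAllocator:
    """Hands out physical block ids and tracks how many sequences use each."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))
        self.ref_count = {}

    def allocate(self) -> int:
        block = self.free.pop()  # raises IndexError when memory is exhausted
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> int:
        self.ref_count[block] += 1
        return block

    def release(self, block: int) -> None:
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            del self.ref_count[block]
            self.free.append(block)


@dataclass
class Sequence:
    allocator: BlockAllocator
    block_table: list = field(default_factory=list)  # logical index -> physical block id
    num_tokens: int = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:
            # No block yet, or the last block is full: map a new physical block.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.ref_count[last] > 1:
                # Copy-on-write: the block is shared, so switch to a private copy.
                # A real system would also copy the stored keys/values here.
                new_block = self.allocator.allocate()
                self.allocator.release(last)
                self.block_table[-1] = new_block
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        # The child shares every physical block with the parent.
        shared = [self.allocator.share(b) for b in self.block_table]
        return Sequence(self.allocator, shared, self.num_tokens)


allocator = BlockAllocator(num_blocks=8)
parent = Sequence(allocator)
for _ in range(20):          # 20 tokens -> one full block plus a partial one
    parent.append_token()
child = parent.fork()        # e.g. a second sample from the same prompt
child.append_token()         # writes into the shared partial block -> copy-on-write
print(parent.block_table, child.block_table)
```

After the fork, the full first block stays shared and only the partially filled last block is duplicated, which is why sharing helps most when the common prompt is long.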

Evidence

On the ShareGPT dataset, vLLM sustains 1.7-2.7× higher request rates than Orca (Oracle) and 2.7-8× higher than Orca (Max).
With OPT-13B, vLLM batches 2.2× more requests at once than Orca (Oracle) and 4.3× more than Orca (Max) (average batch size: about 7 for Orca (Max) vs. 30.42 for vLLM).
With beam search (width=6), sharing KV cache blocks saves 37.6-55.2% of memory; with parallel sampling it saves 6.1-9.8%.
The PagedAttention kernel's attention latency is 20-26% higher than FasterTransformer's, but vLLM still clearly outperforms it end to end.
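
The batch-size gap follows from how many requests fit in the KV cache. A rough back-of-the-envelope sketch: the per-token formula (key and value vectors × hidden size × layers × bytes per value) and the OPT-13B dimensions are standard, while the 12 GiB KV cache budget, the 500-token request length, and the function names are assumptions made for illustration. The outputs are illustrative, not a reproduction of the measured batch sizes.

```python
def kv_bytes_per_token(hidden_size: int, num_layers: int, bytes_per_value: int = 2) -> int:
    # Factor 2: one key vector and one value vector per layer.
    return 2 * hidden_size * num_layers * bytes_per_value


def max_batch_contiguous(budget_bytes: int, per_token: int, max_seq_len: int) -> int:
    # Every request reserves room for max_seq_len tokens up front.
    return budget_bytes // (per_token * max_seq_len)


def max_batch_paged(budget_bytes: int, per_token: int, actual_len: int, block_size: int = 16) -> int:
    # Each request only holds the blocks it has filled (last block may be partial).
    blocks = -(-actual_len // block_size)  # ceiling division
    return budget_bytes // (per_token * blocks * block_size)


per_token = kv_bytes_per_token(hidden_size=5120, num_layers=40)  # OPT-13B, FP16
budget = 12 * 2**30  # assumed KV cache budget: 12 GiB

print(per_token)                                               # 819200 bytes (800 KiB)
print(max_batch_contiguous(budget, per_token, max_seq_len=2048))
print(max_batch_paged(budget, per_token, actual_len=500))      # assumed 500-token requests
```

Under these assumptions the paged layout fits roughly four times as many requests, because memory is spent on tokens that exist rather than on the maximum length a request might reach.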

How to Apply

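After pip install vllm, you can launch an OpenAI API-compatible server right away - existing OpenAI client code keeps working once the endpoint (base URL) and model name point at the vLLM server.
When many requests share the same prefix, such as a system prompt, vLLM's shared prefix feature reuses the prefix KV cache without recomputing it - up to 3.58× higher throughput than Orca (Oracle) when serving few-shot prompts.
With parallel sampling (n>1) or beam search, vLLM's advantage over existing systems grows - especially useful for services that generate several candidates at once, such as code assistants.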
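
To get a feel for how much memory a shared prefix can save, the sketch below counts KV blocks with and without sharing. It is a simplified model, not vLLM code: the function names, the block size of 16, and the token counts are hypothetical, and it assumes only completely filled prefix blocks are shared. It estimates memory only; throughput gains depend on the workload.

```python
import math


def blocks_needed(num_tokens: int, block_size: int = 16) -> int:
    return math.ceil(num_tokens / block_size)


def blocks_without_sharing(prefix_len: int, suffix_lens: list, block_size: int = 16) -> int:
    # Every request stores its own copy of the prefix.
    return sum(blocks_needed(prefix_len + s, block_size) for s in suffix_lens)


def blocks_with_sharing(prefix_len: int, suffix_lens: list, block_size: int = 16) -> int:
    # Assumption: only completely filled prefix blocks are shared;
    # leftover prefix tokens are stored privately with each request.
    shared_blocks = prefix_len // block_size
    leftover = prefix_len - shared_blocks * block_size
    private = sum(blocks_needed(leftover + s, block_size) for s in suffix_lens)
    return shared_blocks + private


suffixes = [40, 25, 60, 10]  # per-request tokens after the prefix (made-up values)
print(blocks_without_sharing(320, suffixes))  # 90
print(blocks_with_sharing(320, suffixes))     # 30
```

The longer the prefix is relative to the per-request text, the larger the saving, which matches the intuition that few-shot and system-prompt-heavy workloads benefit most.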

Code Example


# Shell: install vLLM and run the OpenAI-compatible server
pip install vllm

# Shell: start the server (existing OpenAI clients can connect to it)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-13b-chat-hf \
    --tensor-parallel-size 1

# Python: client code (point base_url at the local server)
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

# Parallel sampling - the candidates share the prompt's KV cache
response = client.chat.completions.create(
    model="meta-llama/Llama-2-13b-chat-hf",
    messages=[{"role": "user", "content": "Implement the Fibonacci sequence in Python"}],
    n=4,          # generate 4 candidates at once, sharing the prompt KV cache
    temperature=0.8,
    max_tokens=512,
)

for i, choice in enumerate(response.choices):
    print(f"--- Candidate {i+1} ---")
    print(choice.message.content)

# Python: use the vLLM Python API directly (offline batch inference)
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-13b")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

prompts = [
    "Translate to Korean: Hello, how are you?",
    "Translate to Korean: What is your name?",
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)

Terminology

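KV cache: Memory that stores the key/value tensors a transformer has already computed while generating tokens one at a time. It exists to avoid recomputing them at every step, but it grows very large as sequences get longer.

PagedAttention: An attention algorithm that splits the KV cache into fixed-size blocks and stores them in non-contiguous memory, much as an OS manages process memory in pages. This is the core idea behind removing the memory waste.

Internal fragmentation: Waste from reserving a large region of memory in advance and not using all of it. Like booking a 10-room hotel for yourself and using only one room.

External fragmentation: A state where free memory is scattered in small gaps, so no contiguous region is large enough for a new request. Like a parking lot with many empty spaces but no run of adjacent spaces long enough for a bus.

beam search: A decoding algorithm in which the LLM tracks the top k candidate sequences in parallel to find the most likely output. Used to improve quality in tasks such as translation and summarization.

copy-on-write: A technique in which several processes share the same memory and a copy is made only when one of them writes to it. The cost of copying is deferred until the data actually diverges.

preemption: A scheduling technique that pauses an in-flight request when GPU memory runs short and gives its memory to other requests.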
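
The two preemption strategies (swap to CPU RAM vs. drop and recompute) can be sketched as follows. This is a simplified model under stated assumptions, not vLLM's scheduler: `Request`, `preempt`, and the location labels are hypothetical, and the sketch assumes the most recently arrived requests are evicted first so that earlier arrivals keep running.

```python
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    arrival: int            # smaller value = arrived earlier
    num_blocks: int         # KV blocks the request currently holds
    location: str = "gpu"   # "gpu", "cpu" (swapped out), or "waiting" (to be recomputed)


def preempt(running: list, blocks_required: int, free_gpu_blocks: int, mode: str = "swap"):
    """Evict latest-arrived requests until blocks_required GPU blocks are free."""
    if mode not in ("swap", "recompute"):
        raise ValueError(f"unknown mode: {mode}")
    preempted = []
    for victim in sorted(running, key=lambda r: r.arrival, reverse=True):
        if free_gpu_blocks >= blocks_required:
            break
        free_gpu_blocks += victim.num_blocks
        if mode == "swap":
            # Blocks are copied to CPU RAM and copied back when the request resumes.
            victim.location = "cpu"
        else:
            # Blocks are discarded; the KV cache is rebuilt from the tokens on resume.
            victim.location = "waiting"
            victim.num_blocks = 0
        preempted.append(victim)
    return preempted, free_gpu_blocks


running = [Request("a", 0, 10), Request("b", 1, 6), Request("c", 2, 4)]
victims, free_blocks = preempt(running, blocks_required=8, free_gpu_blocks=1)
print([v.request_id for v in victims], free_blocks)  # ['c', 'b'] 11
```

Swapping trades PCIe transfer time for compute, while recomputation trades compute for zero CPU memory use; which one is cheaper depends on the hardware and sequence lengths.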

Related Resources

https://github.com/vllm-project/vllm

Original Abstract

High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage. Our evaluations show that vLLM improves the throughput of popular LLMs by 2-4× with the same level of latency compared to the state-of-the-art systems, such as FasterTransformer and Orca. The improvement is more pronounced with longer sequences, larger models, and more complex decoding algorithms. vLLM's source code is publicly available at https://github.com/vllm-project/vllm.