KVLink: Accelerating LLM Inference by Reusing the KV Cache

KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse

Feb 21, 2025 • Jingbo Yang, Bairu Hou, Wei Wei +2

TL;DR Highlight

Removes the waste of re-encoding the same documents for every query in RAG: each document's KV cache is precomputed once and reused, while accuracy loss is kept to a minimum.

Who Should Read

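Backend and ML engineers who want to cut inference cost and latency in RAG pipelines where the same documents recur across many queries, especially developers who are optimizing LLM serving infrastructure or need to improve Time-to-First-Token.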

Core Mechanics

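Each document's KV cache is precomputed independently and stored, then loaded and concatenated at query time, which eliminates redundant encoding.
The positional mismatch caused by independent encoding is resolved by reapplying RoPE (Rotary Position Embedding): positions are stripped when the cache is stored and recomputed at inference to match each token's actual position.
Trainable special tokens called 'Link Tokens' are inserted between documents to restore the attention that is cut off between independently encoded documents.
On all three models tested (Llama-3.2-1B, Llama-3.2-3B, Llama-3.1-8B), KVLink beats the previous best method (BlockAttention) by 4% or more on average.
Even when KV caches are stored on the CPU and loaded onto the GPU, TTFT drops by up to 96% compared with standard decoding.
Storage overhead can be reduced further by combining KVLink with compression techniques such as LLMLingua or AnLLMs.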
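
The deferred-position step described above (strip RoPE at storage time, reapply it at inference) can be checked numerically. The following is a minimal NumPy sketch, assuming the common "rotate-half" RoPE formulation with base 10000; the function name `rope` and the toy shapes are illustrative and are not taken from the KVLink codebase. It covers only the rotation step: it does not model the cross-document attention that independent encoding loses.

```python
import numpy as np

def rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding (rotate-half form) to x of shape (seq, dim)."""
    half = x.shape[-1] // 2
    inv_freq = base ** (-np.arange(half) / half)       # one frequency per dimension pair
    angles = positions[:, None] * inv_freq[None, :]    # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
doc_a = rng.standard_normal((4, 8))   # toy position-free key vectors for document A
doc_b = rng.standard_normal((3, 8))   # toy position-free key vectors for document B

# Online step: document B follows document A, so its global positions start at len(doc_a).
offset_b = len(doc_a)
k_b_global = rope(doc_b, np.arange(len(doc_b)) + offset_b)

# Reference: rotate the concatenated sequence in one pass.
k_full = rope(np.concatenate([doc_a, doc_b]), np.arange(len(doc_a) + len(doc_b)))
assert np.allclose(k_full[offset_b:], k_b_global)

# Rotations compose: a key cached at local positions can also be shifted by the offset.
k_b_local = rope(doc_b, np.arange(len(doc_b)))
k_b_shifted = rope(k_b_local, np.full(len(doc_b), offset_b))
assert np.allclose(k_b_shifted, k_b_global)
```

Because rotations compose, a cache could in principle be stored either position-free or at local positions and still be moved to its global slot with a single extra rotation; which variant a given implementation uses is a design choice this sketch does not settle.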

Evidence

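On Llama-3.2-1B, a 6.6% improvement over BlockAttention on NaturalQuestions and a 7.3% improvement on HotpotQA.
With a 5,000-token context, TTFT falls from 0.885 s with standard decoding to 0.027 s with KVLink, a 96% reduction.
Prior methods (PROMPTCACHE, CacheBlend) are reported to lose up to 35% accuracy under independent encoding, whereas KVLINK5 stays within 1 to 2 percentage points, on average, of the full-encoding upper bound.
GPU cost comparison for Llama-3.1-8B on an A100: 440 USD per million requests with standard decoding versus 16 USD with KVLink (about a 27x saving).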
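
The two headline ratios can be reproduced from the raw figures quoted above. This short check uses only those numbers; the helper name `percent_reduction` is illustrative.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0

# Figures quoted in the Evidence list above.
ttft_reduction = percent_reduction(before=0.885, after=0.027)  # seconds
cost_ratio = 440 / 16                                          # USD per million requests

print(f"TTFT reduction: {ttft_reduction:.1f}%")  # TTFT reduction: 96.9%
print(f"Cost ratio: {cost_ratio:.1f}x")          # Cost ratio: 27.5x
```

Both results agree with the rounded values reported above (96% and about 27x).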

How to Apply

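In a RAG system, encode each knowledge-base document independently with the LLM ahead of time and store its KV cache; when a query arrives, load only the caches of the retrieved documents and concatenate them.
Choose the number of Link Tokens (0, 1, or 5) according to the accuracy-speed trade-off: KVLINK0 when speed matters most, KVLINK5 when accuracy matters most.
If storage is a concern, compress documents first with LLMLingua (by 50 to 75%) and store only the KV cache of the compressed text, or combine this with an LRU/LFU policy that caches only frequently used documents.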
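
The storage concern and the LRU option in the last item can be sketched as follows. This is a minimal illustration, not part of KVLink: `kv_bytes_per_token` and `LRUKVStore` are hypothetical names, and the model shape used in the example (32 layers, 8 KV heads, head dimension 128, 2-byte values) is an assumption that should be checked against the actual model config.

```python
from collections import OrderedDict

def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes needed to hold one token's keys and values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

class LRUKVStore:
    """Keeps per-document KV caches under a byte budget, evicting the least recently used."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # doc_id -> (kv_cache, size_bytes)

    def get(self, doc_id):
        if doc_id not in self._entries:
            return None  # cache miss: the caller would re-encode the document
        self._entries.move_to_end(doc_id)  # mark as most recently used
        return self._entries[doc_id][0]

    def put(self, doc_id, kv_cache, size_bytes: int) -> None:
        if doc_id in self._entries:
            self.used_bytes -= self._entries.pop(doc_id)[1]
        self._entries[doc_id] = (kv_cache, size_bytes)
        self.used_bytes += size_bytes
        # Evict oldest entries until the budget is met (always keep the newest one).
        while self.used_bytes > self.max_bytes and len(self._entries) > 1:
            _, (_, evicted_size) = self._entries.popitem(last=False)
            self.used_bytes -= evicted_size

# Assumed model shape; strings stand in for real KV tensors.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
per_doc = per_token * 5000  # a 5,000-token document

store = LRUKVStore(max_bytes=3 * per_doc)
for doc_id in ("a", "b", "c"):
    store.put(doc_id, kv_cache=f"<kv for {doc_id}>", size_bytes=per_doc)
store.get("a")                                   # "a" becomes most recently used
store.put("d", kv_cache="<kv for d>", size_bytes=per_doc)
print(store.get("b"))                            # None: "b" was the least recently used
```

Under the assumed shape, one token costs 131,072 bytes, so a 5,000-token document needs roughly 0.6 GiB uncompressed, which is why compression or selective caching matters at knowledge-base scale.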

Code Example

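# KVLink application flow (conceptual pseudocode).
# model.encode_document, model.compute_link_tokens, reapply_rope, concat,
# knowledge_base, and cache_store are placeholder names, not a real API.

# 1. Offline: precompute a KV cache for each document
for doc in knowledge_base:
    # Encode the document on its own (store with positional embedding removed)
    kv_cache = model.encode_document(doc, remove_position_embedding=True)
    cache_store[doc.id] = kv_cache  # keep in CPU memory or on disk

# 2. Online inference: load caches -> reapply positions -> compute link tokens
def kvlink_inference(query, retrieved_doc_ids):
    # Load the precomputed KV caches
    cached_kvs = [cache_store[doc_id] for doc_id in retrieved_doc_ids]

    # Reapply positional embeddings at each document's actual global position
    repositioned_kvs = reapply_rope(cached_kvs)

    # Compute link-token KV (a small amount of runtime computation)
    link_token_kvs = model.compute_link_tokens(repositioned_kvs, num_link_tokens=5)

    # Final KV = [repositioned document caches] + [link-token KV]
    full_kv = concat(repositioned_kvs, link_token_kvs)

    # Only the query is newly encoded
    return model.generate(query, kv_cache=full_kv)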
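
The link-token step in the flow above amounts to a particular attention pattern, which can be drawn as a boolean mask. This is a sketch under stated assumptions rather than the authors' implementation: it assumes document tokens attend causally only within their own document, that a fixed number of link tokens follows each document and attends causally to everything before it, and that the query comes last and sees the whole context. The function name `kvlink_style_mask` is hypothetical.

```python
import numpy as np

def kvlink_style_mask(doc_lens, num_link_tokens, query_len):
    """Return mask[i, j] == True where token i may attend to token j."""
    segments = []  # (kind, start, end) with end exclusive
    pos = 0
    for n in doc_lens:
        segments.append(("doc", pos, pos + n))
        pos += n
        if num_link_tokens:
            segments.append(("link", pos, pos + num_link_tokens))
            pos += num_link_tokens
    segments.append(("query", pos, pos + query_len))
    total = pos + query_len

    causal = np.tril(np.ones((total, total), dtype=bool))
    mask = np.zeros((total, total), dtype=bool)
    for kind, start, end in segments:
        if kind == "doc":
            # Independently encoded: causal attention inside the document only.
            mask[start:end, start:end] = causal[start:end, start:end]
        else:
            # Link and query tokens: causal attention over the entire prefix.
            mask[start:end, :end] = causal[start:end, :end]
    return mask

# Layout: doc0 = tokens 0-2, link = 3, doc1 = tokens 4-5, link = 6, query = 7-8.
mask = kvlink_style_mask(doc_lens=[3, 2], num_link_tokens=1, query_len=2)
print(mask.astype(int))
```

Setting `num_link_tokens=0` gives plain concatenation of independently encoded documents, where only the query tokens connect the documents to one another.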

Terminology

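KV Cache: Storage for the results (Key and Value matrices) that a transformer computed for earlier tokens. Reusing it avoids recomputing the same content, much like saving an intermediate result in a calculator's memory and recalling it later.

RoPE: Short for Rotary Position Embedding. It encodes each token's position in the sequence with a rotation matrix. KVLink's trick is to leave this positional information out when storing the cache and reattach it later.

TTFT: Time-to-First-Token. The time from the user sending input to the LLM emitting its first token; it directly determines perceived responsiveness.

Prefilling: The stage in which the LLM processes the entire input prompt at once before generating a response. The longer the documents, the slower this stage becomes.

RAG: Retrieval-Augmented Generation. Relevant documents are retrieved and passed to the LLM along with the question, supplementing the model with recent information or domain-specific knowledge it lacks.

BlockAttention: A prior method that directly concatenates independently encoded document KV caches. It relies on fine-tuning alone, without Link Tokens, and is the strongest baseline KVLink is compared against.

Link Token: A trainable special token introduced by KVLink. Inserted between independently encoded documents, it attends to all preceding documents at runtime and so acts as a bridge between documents.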

Related Resources

https://github.com/UCSB-NLP-Chang/KVLink

Original Abstract

We describe KVLink, an approach for efficient key-value (KV) cache reuse in large language models (LLMs). In many LLM applications, different inputs can share overlapping context, such as the same retrieved document appearing in multiple queries. However, the LLMs still need to encode the entire context for each query, leading to redundant computation. In this paper, we investigate a new strategy to eliminate such inefficiency, where the KV cache of each document is precomputed independently. During inference, the KV caches of retrieved documents are concatenated, allowing the model to reuse cached representations instead of recomputing them. To mitigate the performance degradation when using KV caches computed independently for each document, KVLink introduces two key techniques: adjusting positional embeddings of the KV cache at inference to match the global position after concatenation, and using trainable special tokens to restore self-attention across independently encoded documents. Experiments across 7 datasets demonstrate that KVLink improves question answering accuracy by an average of 4% over state-of-the-art methods. Furthermore, by leveraging precomputed KV caches, our approach reduces time-to-first-token by up to 96% compared to standard LLM inference, making it a scalable and efficient solution for context reuse. Additionally, KVLink can be combined with KV cache compression to further save cache loading and storage overhead while outperforming the baselines.