KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference

TL;DR Highlight

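Using the KV cache as an accumulator across chunks, with no changes to the model, retrieves information with 100% exact-match accuracy at contexts up to 128K tokens in needle-in-a-haystack tests.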

Who Should Read

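Backend and MLOps developers who have hit context limits when processing long documents or analyzing large codebases, and anyone who wants to implement long-context inference without retraining the model.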

Core Mechanics

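A long sequence is split into chunks, and each chunk attends to the KV cache of all earlier chunks as a prefix. The structure is the same as foldl in functional programming: the cache is the accumulator and each chunk is one fold step.
It works purely as an inference protocol, with no changes to model parameters, no additional training, and no special memory tokens.
One might expect error to pile up at every chunk boundary, but drift appears only in the first few steps and then settles into a flat plateau, so error does not accumulate.
Changing numerical precision by a factor of 10,000 (bf16 to fp32) reduces drift by only 2.8%, which indicates that the plateau is a structural property rather than a numerical-error effect.
The same plateau pattern appears on three different architectures (Llama-3.1-8B, Qwen2.5-7B-Instruct, and OLMoE-1B-7B), so it is not specific to one model.
StreamingLLM saves memory but cannot retrieve tokens that have left its cache. KV-Fold uses more memory and in exchange can retrieve information exactly at arbitrary distance.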
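The fold structure and the contrast with a streaming cache can be sketched with a toy model that tracks only which token positions remain in the cache. This is an illustration under loud assumptions: there is no real attention here, `append_chunk`, `fold_step`, and `streaming_step` are hypothetical names, and the `n_sink` and `window` values are arbitrary. The streaming policy shown is a simplified stand-in for StreamingLLM's retention rule, not its implementation.

```python
from functools import reduce


def append_chunk(cache, chunk):
    """Append a chunk's entries to the cache as (absolute_position, token) pairs."""
    start = cache[-1][0] + 1 if cache else 0
    return cache + [(start + i, tok) for i, tok in enumerate(chunk)]


def fold_step(cache, chunk):
    """KV-Fold style step: the whole cache is kept and simply grows."""
    return append_chunk(cache, chunk)


def streaming_step(cache, chunk, n_sink=2, window=4):
    """Toy streaming step: keep the first n_sink entries plus the most recent `window`."""
    grown = append_chunk(cache, chunk)
    if len(grown) <= n_sink + window:
        return grown
    return grown[:n_sink] + grown[-window:]


tokens = [f"t{i}" for i in range(12)]
chunks = [tokens[i:i + 4] for i in range(0, len(tokens), 4)]

# foldl: the same one-step update applied left to right, starting from an empty cache
fold_cache = reduce(fold_step, chunks, [])
stream_cache = reduce(streaming_step, chunks, [])

fold_positions = [pos for pos, _ in fold_cache]
stream_positions = [pos for pos, _ in stream_cache]

print(fold_positions)    # every position is still in the cache
print(stream_positions)  # middle positions have been evicted
```

In this toy run the folded cache still holds position 5, while the streaming cache has dropped it, which is the retention difference the last point above describes.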

Evidence

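On a needle-in-a-haystack benchmark covering 16K to 128K tokens and chain depths up to 511, all 152 trials achieved exact-match retrieval, i.e. 100% (Llama-3.1-8B-Instruct).
At 128K tokens, a single full-attention forward pass would need about 1 TB for the attention score matrix alone, which is impossible on an A100 40GB. KV-Fold completes in 171 seconds with a peak of 35.57 GB.
At T=128K, KV-Fold reaches NLL 2.46 versus 2.66 for StreamingLLM. StreamingLLM drops to 0% retrieval at d=31 and beyond, while KV-Fold holds up to d=511.
On Qwen2.5-7B, the total change in drift over depths 15 to 60 is -0.0003 nats, effectively zero, so error does not accumulate even in deep chains.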
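The summary does not say how the figure of about 1 TB was derived. One set of assumptions that lands on that order of magnitude: the full score matrix is materialized for every head of a single layer, with 32 query heads, bf16 scores (2 bytes each), and T = 128 x 1024 tokens. The function name below is hypothetical, and fused attention kernels that never materialize the matrix are deliberately ignored.

```python
def attention_score_bytes(seq_len, n_heads, bytes_per_elem):
    """Bytes needed to materialize a seq_len x seq_len score matrix for every head of one layer."""
    return seq_len * seq_len * n_heads * bytes_per_elem


T = 128 * 1024  # assumed meaning of "128K tokens"
size_bytes = attention_score_bytes(T, n_heads=32, bytes_per_elem=2)
size_tib = size_bytes / 2**40

print(f"{size_bytes} bytes = {size_tib} TiB")
```

Under these assumptions the score matrix alone is far beyond a 40 GB card, which is why the comparison is against a sequence of smaller forward passes rather than one monolithic one.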

How to Apply

Split the long document into chunks of 256 to 1024 tokens and, on each chunk's forward pass, pass the KV cache of the earlier chunks as a prefix. Only the inference loop changes; the model code stays untouched.
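When analyzing codebases of 128K tokens or more, or processing long log files, using KV-Fold instead of StreamingLLM lets the model retrieve information from old positions accurately. Budget for the KV cache growing linearly with context length, at roughly 0.13 MB per token.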
Combined with int8 quantization, retrieval stays at 93% at 128K tokens and depth 511. When GPU memory is tight, storing the KV cache in int8 is a workable trade-off.
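The linear memory growth and the int8 trade-off above can be estimated with a short calculation. This is a back-of-the-envelope sketch, assuming a Llama-3.1-8B style configuration (32 layers, 8 key-value heads, head dimension 128) and ignoring any per-block scale factors that an int8 scheme would add. The function names are hypothetical.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """KV cache bytes added per token; the factor 2 covers one key and one value vector."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_cache_gib(n_tokens, per_token_bytes):
    """Total KV cache size in GiB for a given context length."""
    return n_tokens * per_token_bytes / 2**30


n_tokens = 128 * 1024

bf16_per_token = kv_bytes_per_token(32, 8, 128, bytes_per_elem=2)
int8_per_token = kv_bytes_per_token(32, 8, 128, bytes_per_elem=1)

bf16_total_gib = kv_cache_gib(n_tokens, bf16_per_token)
int8_total_gib = kv_cache_gib(n_tokens, int8_per_token)

print(bf16_per_token, bf16_total_gib)
print(int8_per_token, int8_total_gib)
```

Because the cost is linear in tokens, halving the bytes per element halves the cache at every context length, which is the headroom int8 storage buys at some cost in retrieval accuracy.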

Code Example

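# KV-Fold core loop (sketch in HuggingFace transformers style)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def kv_fold_inference(model, token_chunks, device):
    """
    Fold the KV cache over a list of chunks.

    token_chunks: List[Tensor], each chunk has shape [1, chunk_size]
    Returns the logits of the last chunk and the accumulated KV cache.
    """
    if not token_chunks:
        raise ValueError("token_chunks must contain at least one chunk")

    past_key_values = None  # accumulator
    logits = None

    for chunk in token_chunks:
        chunk = chunk.to(device)

        with torch.no_grad():
            outputs = model(
                input_ids=chunk,
                past_key_values=past_key_values,  # KV cache of earlier chunks as prefix
                use_cache=True,
                return_dict=True,
            )

        # Carry the enlarged KV cache forward to the next chunk
        past_key_values = outputs.past_key_values
        # Overwritten on every step, so only the last chunk's logits are returned
        logits = outputs.logits

    return logits, past_key_values


# Usage example
# The checkpoint name is an assumption; any causal LM that accepts past_key_values should work.
model_name = "meta-llama/Llama-3.1-8B-Instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to(device).eval()

# Tokenize the long sequence and split it into chunks of chunk_size=256
chunk_size = 256
long_text = "...a very long document..."
tokens = tokenizer(long_text, return_tensors="pt")["input_ids"][0]
chunks = [tokens[i:i + chunk_size].unsqueeze(0) for i in range(0, len(tokens), chunk_size)]

logits, final_kv = kv_fold_inference(model, chunks, device=device)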

Terminology

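KV Cache: Memory in which a Transformer stores the key-value pairs of tokens it has already processed. Reusing them instead of recomputing speeds up generation.

KV-Fold: A method that keeps appending to the KV cache chunk by chunk, like foldl in functional programming (accumulating from the left). The accumulated cache acts as the accumulator.

Needle-in-a-Haystack: A benchmark that hides a specific fact (e.g. 'the magic number is 12345') in the middle of a long document and tests whether the model can later retrieve it exactly.

Drift: The difference between the predictions of KV-Fold and those of full attention over the whole sequence. Larger values mean more information is lost to chunking.

Plateau: A region where error or drift stops growing and stays flat. In KV-Fold, drift reaches a plateau after the first few steps and stabilizes.

StreamingLLM: A streaming method that keeps only a window of recent tokens plus a few attention sink tokens. It uses little memory but cannot retrieve older content that has left the cache.

RoPE: Rotary Position Embedding. Encodes token position information with rotation matrices. Important for keeping position information aligned in long sequences.

NLL: Negative Log-Likelihood. A loss value indicating how well the model predicts the next token. Lower is better.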
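To make the needle-in-a-haystack and exact-match terms concrete, here is a minimal sketch of how such a test document and check could be built. It is a hypothetical harness, not the benchmark used in the paper: `build_haystack`, `exact_match`, the filler sentence, and the `position_frac` parameter are all assumptions for illustration. Note that `position_frac` is where the needle sits in the text, which is different from the chain depth d reported in the evidence.

```python
def build_haystack(filler, needle, n_sentences, position_frac):
    """Place `needle` among `n_sentences` filler sentences.

    position_frac: 0.0 puts the needle at the start, 1.0 at the end.
    """
    if not 0.0 <= position_frac <= 1.0:
        raise ValueError("position_frac must be between 0.0 and 1.0")
    sentences = [filler] * n_sentences
    sentences.insert(round(position_frac * n_sentences), needle)
    return " ".join(sentences)


def exact_match(model_answer, expected):
    """Exact-match scoring: the answer must equal the expected string after trimming whitespace."""
    return model_answer.strip() == expected


needle = "The magic number is 12345."
document = build_haystack(
    filler="The sky was clear that day.",
    needle=needle,
    n_sentences=100,
    position_frac=0.5,
)

# In a real run, the answer would come from the model after it has read `document`.
print(exact_match(" 12345\n", "12345"))
print(exact_match("12346", "12345"))
```

Exact match is a strict criterion: a single wrong digit counts as a failure, which is what makes a 100% score over many trials informative.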

Original Abstract

We introduce KV-Fold, a simple, training-free long-context inference protocol that treats the key-value (KV) cache as the accumulator in a left fold over sequence chunks. At each step, the model processes the next chunk conditioned on the accumulated cache, appends the newly produced keys and values, and passes the enlarged cache forward; the same one-step update is applied repeatedly, analogous to foldl in functional programming. Building on the KV cache concatenation primitive introduced for latent multi-agent communication, we repurpose it as a chunk-to-chunk recurrence for long-context inference. When processing chunk t, the model attends to the KV cache carried from earlier chunks as a prefix, reusing its internal state across segments without modifying or retraining the model. Despite its simplicity, the induced recurrence is stable: per-step drift rises briefly and then saturates into a flat plateau that persists across deep chains. This plateau is insensitive to a 10,000x change in numerical precision, robust across chunk sizes, and consistent across model families. At the task level, KV-Fold preserves exact information over long distances. On a needle-in-a-haystack benchmark, it achieves 100% exact-match retrieval across 152 trials spanning contexts from 16K to 128K tokens and chain depths up to 511 on Llama-3.1-8B, while remaining within the memory limits of a single 40GB GPU. Compared to streaming methods, which trade fidelity for bounded memory, KV-Fold maintains long-range retrieval while operating as a sequence of tractable forward passes. Overall, our results show that frozen pretrained transformers already support a stable form of KV-cache recurrence, providing a practical route to long-context inference without architectural changes or training.