Temporal Guidance: Improving LLM Decoding Along the Temporal Axis

Temporal Guidance for Large Language Models

Jan 29, 2026 • Hong-Kai Zheng, Piji Li

TL;DR Highlight

A decoding technique that uses the LLM's own "past predictions", with no external model, to reduce repetition and hallucination and improve reasoning quality.

Who Should Read

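ML engineers who want to improve LLM inference quality (repetition, hallucination, code generation accuracy), and backend developers responsible for optimizing LLM serving. Especially useful if you serve open-source models such as Qwen3 or Llama yourself.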

Core Mechanics

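LLMs depend heavily on the most recent tokens (locality bias). TeGu turns this property around and proposes a Contrastive Decoding strategy that uses the model's prediction from an earlier time step as the weak amateur.
The auxiliary head from MTP (Multi-Token Prediction, a technique that predicts several tokens at once) is reused as the amateur, so no additional model is needed.
For ordinary LLMs without MTP, the authors design a lightweight adapter, cMTPP (Conditional MTP Projector). The backbone stays frozen and only the adapter is trained (3,000 steps, a few hours on 8 RTX A5000 GPUs).
DoLa (a layer-contrast method) collapses on small models, whereas TeGu stays stable even on small models (Qwen3-1.7B, Llama-3.2-3B).
Models with a native MTP head, such as MiMo-7B, can use TeGu directly with no training (training-free).
Repetition (Rep-4 Rate) falls by about 43% relative to the greedy baseline and by 32.7% relative to DoLa.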
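
The scoring rule behind these points can be written compactly. The block below is a sketch that restates the formula given in the Code Example section of this summary; the symbols $p_{\text{exp}}$, $p_{\text{amt}}$, $s_t$, and the conditioning notation are notational assumptions introduced here, not necessarily the paper's own.

```latex
\[
s_t(x) =
\begin{cases}
(1+\alpha)\,\log p_{\text{exp}}(x \mid x_{<t}) \;-\; \alpha\,\log p_{\text{amt}}(x \mid x_{<t-k})
  & \text{if } p_{\text{exp}}(x \mid x_{<t}) \ge \tau \,\max_{x'} p_{\text{exp}}(x' \mid x_{<t}), \\[4pt]
-\infty & \text{otherwise.}
\end{cases}
\]
```

Here $p_{\text{exp}}$ is the model's ordinary next-token distribution, $p_{\text{amt}}$ is the MTP (or cMTPP) prediction of the same token made from a context that is $k$ steps shorter, $\alpha$ is the guidance strength, and $\tau$ is the Adaptive Plausibility Constraint threshold. Setting $\alpha = 0$ recovers plain decoding restricted to the plausible set.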

Evidence

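On Qwen3-1.7B, GSM8K rises from 72.48% to 75.51% (+3.03 points) and IFEval from 15.16% to 26.99% (+11.83 points).
On Qwen3-8B, TeGu reaches 24.20% on Math500 versus 21.40% for the best CD baseline, and improves IFEval from 29.57% to 34.20%.
Overhead: standard CD raises VRAM usage by about 30% (17.72 → 23.11 GB), whereas TeGu's reported cost is a 2~15% latency increase over the base model.
Rep-4 Rate on Wikitext-2: Greedy 35.84% → DoLa 30.35% → TeGu (α=0.3) 20.43%.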
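
To make the repetition metric concrete, here is a minimal sketch of a Rep-n computation together with the relative reductions implied by the Wikitext-2 figures above. The definition used (one minus the ratio of unique n-grams to total n-grams) is a common convention and is an assumption here; this summary does not state the exact formula. The helper names `rep_n_rate` and `relative_reduction` are hypothetical.

```python
def rep_n_rate(tokens, n=4):
    """Share of n-grams that are repeats: 1 - unique / total (assumed definition)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # sequence shorter than n: nothing to repeat
    return 1.0 - len(set(ngrams)) / len(ngrams)


def relative_reduction(before, after):
    """Relative drop from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


looping = "the cat sat on the cat sat on the cat sat on".split()
varied = "the cat sat on a mat near the old red door".split()

print(rep_n_rate(looping))  # 9 four-grams, only 4 distinct
print(rep_n_rate(varied))   # every four-gram is distinct

# Relative reductions implied by the reported Rep-4 rates
print(round(relative_reduction(35.84, 20.43), 1))  # Greedy -> TeGu
print(round(relative_reduction(30.35, 20.43), 1))  # DoLa -> TeGu
```

The last two lines reproduce the 43% and 32.7% figures quoted under Core Mechanics, which confirms they are relative reductions rather than percentage-point differences.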

How to Apply

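If the model ships with a native MTP head (MiMo-7B, for example; for Qwen3, verify that the specific checkpoint exposes one), TeGu decoding can be applied directly with no cMTPP training. Only the logits-manipulation logic needs to be added to a HuggingFace custom generate.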
For ordinary models without MTP (e.g., Llama-3.2), train cMTPP on fineweb-edu for 3,000 steps with the backbone frozen, then apply TeGu. For small models, set α conservatively, around 0.1~0.2.
TeGu is effective on math, coding, and instruction-following tasks, but if the goal is better factuality (TruthfulQA), DoLa is the better fit. Choose according to the task.

Code Example

# TeGu core logic (sketch, HuggingFace-based). `cmtpp` and its (hidden, k=...) signature are assumed.
import torch
import torch.nn.functional as F

def tegu_next_token(
    model,
    input_ids,      # (batch, seq_len), with seq_len > k
    cmtpp,          # Conditional MTP Projector: maps a past hidden state to logits
    alpha=0.2,
    tau=0.1,        # Adaptive Plausibility Constraint threshold
    k=1             # bi-step: use the hidden state from the immediately preceding step
):
    with torch.no_grad():
        # Expert: predict from the full current context
        out = model(input_ids, output_hidden_states=True)
        hidden = out.hidden_states[-1]               # (batch, seq_len, dim)
        logits_exp = out.logits[:, -1, :]            # expert logits

        # Amateur: MTP prediction from the hidden state k steps earlier.
        # Under causal attention this equals the state computed k steps ago,
        # so in practice it can be read from a cache instead of recomputed.
        h_past = hidden[:, -1 - k, :]
        logits_amt = cmtpp(h_past, k=k)              # amateur logits from cMTPP

        # Adaptive Plausibility Constraint: mask tokens the expert finds implausible
        probs_exp = F.softmax(logits_exp, dim=-1)
        max_prob = probs_exp.max(dim=-1, keepdim=True).values  # per-sequence maximum
        mask = probs_exp < tau * max_prob

        # TeGu formula: (1 + alpha) * log_exp - alpha * log_amt
        log_exp = F.log_softmax(logits_exp, dim=-1)
        log_amt = F.log_softmax(logits_amt, dim=-1)
        guided = log_exp + alpha * (log_exp - log_amt)
        guided = guided.masked_fill(mask, float('-inf'))  # apply APC

        return guided.argmax(dim=-1)
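
As a dependency-free complement to the sketch above, the following toy example runs one guided step on a four-token vocabulary using only the standard library. The logits are made-up numbers chosen for illustration, and `tegu_pick` is a hypothetical helper name. It only aims to show how subtracting a stale-context amateur can move the choice away from the token that greedy decoding would pick.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a plain list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def tegu_pick(expert_logits, amateur_logits, alpha=0.2, tau=0.1):
    """Return the index maximizing (1 + alpha) * log_exp - alpha * log_amt
    among tokens that pass the plausibility cutoff (tau must be > 0)."""
    log_exp = log_softmax(expert_logits)
    log_amt = log_softmax(amateur_logits)
    cutoff = math.log(tau) + max(log_exp)  # p_exp >= tau * max p_exp, in log space
    best, best_score = None, -math.inf
    for i, (le, la) in enumerate(zip(log_exp, log_amt)):
        if le < cutoff:
            continue  # implausible under the expert: never selected
        score = (1 + alpha) * le - alpha * la
        if score > best_score:
            best, best_score = i, score
    return best


vocab = ["the", "cat", "mat", "zzz"]
expert = [2.0, 1.9, 0.5, -6.0]   # full context: "the" and "cat" nearly tied
amateur = [3.0, 0.0, 0.0, -9.0]  # shorter context: strongly favors "the"

greedy = max(range(len(vocab)), key=lambda i: expert[i])
guided = tegu_pick(expert, amateur, alpha=0.2, tau=0.1)
print(vocab[greedy], vocab[guided])  # the cat
```

With `alpha=0.0` the function returns the greedy choice. The plausibility cutoff matters when the amateur assigns a token a vanishingly small probability: without it, the `- alpha * log_amt` term could promote a token that the expert itself considers very unlikely.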

Terminology

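Contrastive Decoding: A decoding strategy that amplifies the difference between the predictions of an expert (large model) and an amateur (small model) to pick better outputs. The larger the gap between the two scores, the more the method emphasizes "information only the expert knows".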

MTP (Multi-Token Prediction): A technique that predicts several upcoming tokens at once instead of only the next one. Used by recent models such as DeepSeek-V3 (the base of DeepSeek-R1) and MiMo-7B.

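DoLa: A technique that reduces hallucination without a separate model by contrasting the same model's shallow layers (amateur) with its deep layers (expert). Its drawback is instability on small models.

KV Cache: Memory in which a transformer stores previously computed Key/Value vectors. Caching them avoids recomputation at every step, which speeds up inference.

Knowledge Distillation: A training method in which a small model (student) learns to imitate the output distribution of a large model (teacher). Rather than only matching the correct answer, the student also learns the teacher's "confidence distribution".

Locality Bias: The tendency of LLMs to rely far more on the immediately preceding tokens than on distant context. Because of this, a prediction made from an earlier time step naturally behaves like a "context-deprived model".

Greedy Decoding: The simplest generation method, which picks the highest-probability token at every step. It is prone to falling into repetition loops.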

Related Resources

https://arxiv.org/abs/2601.21744

Original Abstract

Contrastive Decoding (CD) enhances the generation quality of large language models (LLMs) but incurs significant additional computational overhead due to the need for an auxiliary model. Existing internal self-contrastive decoding methods, such as Decoding by Contrasting Layers (DoLa), focus on discrepancies across different layers, which are notably unstable on small-scale models. In this work, based on the observation that LLMs exhibit local preferences, we propose a novel contrastive guidance strategy along the temporal dimension, namely Temporal Guidance (TeGu). Our method ingeniously leverages Multi-Token Prediction (MTP) to construct weaker amateur predictions for model self-contrast. To standardize the implementation of this mechanism, we further introduce a lightweight Conditional MTP Projector (cMTPP), which avoids maintaining multiple independent networks as required by other MTP modules. Across various model series and benchmarks, TeGu achieves significant performance improvements while maintaining low additional memory consumption and computational overhead.