Temporal Guidance for Large Language Models
TL;DR Highlight
A decoding technique that reduces repetition and hallucination while improving reasoning quality using the LLM's own 'past predictions' — no external model needed
Who Should Read
ML engineers improving LLM inference quality (repetition, hallucination, code generation accuracy) or backend developers handling LLM serving optimization. Particularly useful when self-hosting open-source models like Qwen3 or Llama.
Core Mechanics
- LLMs strongly depend on the immediately preceding token (locality bias) — this paper exploits this by using 'past-step predictions' as a weak amateur in Contrastive Decoding
- Reuses MTP (Multi-Token Prediction) auxiliary heads as amateurs — no additional model needed
- Designs a lightweight adapter, cMTPP (Conditional MTP Projector), for regular LLMs without MTP heads. The backbone stays frozen; only the adapter trains (3,000 steps, a few hours on 8× RTX A5000s)
- DoLa (layer-contrastive decoding) collapses on small models, while TeGu stays stable even at small scale (Qwen3-1.7B, Llama-3.2-3B)
- For models with native MTP heads like MiMo-7B, applies directly without any training (training-free)
- Reduces repetition (Rep-4 Rate) by 43% vs baseline, 32.7% vs DoLa
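The contrastive step described above can be demonstrated on toy logits (the numbers below are illustrative, not from the paper): the stale amateur is confident about the locally repeated token, so subtracting its log-probabilities shifts the choice to the expert's genuinely informative candidate.

```python
import torch
import torch.nn.functional as F

# Toy 5-token vocabulary; token 0 plays the "repetition" the amateur over-predicts.
logits_exp = torch.tensor([2.0, 1.9, 0.5, -1.0, -2.0])  # expert (current step)
logits_amt = torch.tensor([2.5, 0.2, 0.1, -1.0, -2.0])  # amateur (past-step MTP)

alpha, tau = 0.3, 0.1
log_exp = F.log_softmax(logits_exp, dim=-1)
log_amt = F.log_softmax(logits_amt, dim=-1)

# TeGu: amplify the expert, penalize what the stale amateur also predicts
guided = (1 + alpha) * log_exp - alpha * log_amt

# Adaptive Plausibility Constraint: keep only tokens the expert deems plausible
probs = log_exp.exp()
guided[probs < tau * probs.max()] = float('-inf')

print(log_exp.argmax().item())  # 0: greedy on the expert alone picks the repeated token
print(guided.argmax().item())   # 1: TeGu picks the contrastively preferred token
```

The APC mask matters here: without it, a token the expert considers implausible could win purely because the amateur dislikes it even more.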
Evidence
- Qwen3-1.7B: GSM8K 72.48% → 75.51% (+3.03%), IFEval 15.16% → 26.99% (+11.83%)
- Qwen3-8B Math500: best CD (21.40%) vs TeGu 24.20%, IFEval 29.57% → 34.20%
- Memory overhead: standard CD increases VRAM 30% (17.72→23.11GB) while TeGu adds only 2-15% latency over base
- Wikitext-2 Rep-4 Rate: Greedy 35.84% → DoLa 30.35% → TeGu (α=0.3) 20.43%
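As a quick sanity check, the Wikitext-2 Rep-4 rates above reproduce the relative reductions quoted under Core Mechanics:

```python
# Wikitext-2 Rep-4 rates (%) from the evidence list above
greedy, dola, tegu = 35.84, 30.35, 20.43

rel_reduction = lambda base, new: (base - new) / base * 100
print(round(rel_reduction(greedy, tegu), 1))  # 43.0 -> "43% vs baseline"
print(round(rel_reduction(dola, tegu), 1))    # 32.7 -> "32.7% vs DoLa"
```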
How to Apply
- For models with native MTP heads (e.g., MiMo-7B), apply TeGu decoding directly without cMTPP training — just add the logits-manipulation logic to a custom HuggingFace generate loop
- For regular models without MTP heads (e.g., Llama-3.2), fine-tune cMTPP on fineweb-edu for 3,000 steps, then apply TeGu. For small models, set α conservatively (0.1-0.2)
- Effective for math/coding/instruction-following tasks, but DoLa is better for factuality (TruthfulQA) — choose based on task purpose
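One way to package the logits manipulation for HuggingFace generation is a processor following the `LogitsProcessor` call convention (`__call__(input_ids, scores)`). This is a minimal sketch under assumptions: the class name and the `amateur_fn` callable (which must return past-step MTP/cMTPP logits) are hypothetical wiring, not the paper's reference code.

```python
import torch

class TemporalGuidanceProcessor:
    """Sketch of TeGu as a HuggingFace-style logits processor.

    Follows the LogitsProcessor __call__(input_ids, scores) interface so it
    can be dropped into a LogitsProcessorList. `amateur_fn` is a hypothetical
    callable that returns amateur logits for the current step, e.g. an MTP
    head / cMTPP applied to the hidden state cached from the previous step.
    """

    def __init__(self, amateur_fn, alpha=0.2, tau=0.1):
        self.amateur_fn = amateur_fn
        self.alpha = alpha  # contrast strength
        self.tau = tau      # Adaptive Plausibility Constraint threshold

    def __call__(self, input_ids, scores):
        log_exp = torch.log_softmax(scores, dim=-1)
        log_amt = torch.log_softmax(self.amateur_fn(input_ids), dim=-1)
        # TeGu: (1 + alpha) * log_exp - alpha * log_amt
        guided = log_exp + self.alpha * (log_exp - log_amt)
        # APC: drop tokens below tau * max expert probability
        probs = log_exp.exp()
        mask = probs < self.tau * probs.max(dim=-1, keepdim=True).values
        return guided.masked_fill(mask, float('-inf'))
```

An instance could then be passed to `model.generate` via transformers' `LogitsProcessorList`; the remaining engineering work is caching the previous step's hidden state so `amateur_fn` can feed it to the MTP head.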
Code Example
# TeGu Core Logic (pseudo-code, HuggingFace-based)
import torch
import torch.nn.functional as F

def tegu_next_token(
    model,
    input_ids,
    cmtpp,        # Conditional MTP Projector
    h_past,       # hidden state cached by the caller from k steps ago
    alpha=0.2,
    tau=0.1,      # Adaptive Plausibility Constraint threshold
    k=1,          # bi-step: use the hidden state from the immediately previous step
):
    with torch.no_grad():
        # Expert: predict with the current context
        out = model(input_ids, output_hidden_states=True)
        h_current = out.hidden_states[-1][:, -1, :]  # last-layer hidden state at the final position
        logits_exp = model.lm_head(h_current)        # expert logits

        # Amateur: MTP prediction from the hidden state of k steps ago
        logits_amt = cmtpp(h_past, k=k)              # amateur logits via cMTPP

        # Adaptive Plausibility Constraint: mask tokens whose expert
        # probability falls below tau * (max expert probability)
        probs_exp = F.softmax(logits_exp, dim=-1)
        mask = probs_exp < tau * probs_exp.max(dim=-1, keepdim=True).values

        # TeGu formula: (1 + alpha) * log_exp - alpha * log_amt
        log_exp = F.log_softmax(logits_exp, dim=-1)
        log_amt = F.log_softmax(logits_amt, dim=-1)
        guided = log_exp + alpha * (log_exp - log_amt)
        guided[mask] = float('-inf')                 # apply APC
        # Return the next token; cache h_current for the next step's amateur
        return guided.argmax(dim=-1), h_current
Original Abstract
Contrastive Decoding (CD) enhances the generation quality of large language models (LLMs) but incurs significant additional computational overhead due to the need for an auxiliary model. Existing internal self-contrastive decoding methods, such as Decoding by Contrasting Layers (DoLa), focus on discrepancies across different layers, which are notably unstable on small-scale models. In this work, based on the observation that LLMs exhibit local preferences, we propose a novel contrastive guidance strategy along the temporal dimension, namely Temporal Guidance (TeGu). Our method ingeniously leverages Multi-Token Prediction (MTP) to construct weaker amateur predictions for model self-contrast. To standardize the implementation of this mechanism, we further introduce a lightweight Conditional MTP Projector (cMTPP), which avoids maintaining multiple independent networks as required by other MTP modules. Across various model series and benchmarks, TeGu achieves significant performance improvements while maintaining low additional memory consumption and computational overhead.