Temporal Guidance for Large Language Models
TL;DR Highlight
A decoding technique that reduces repetition and hallucination while improving reasoning quality using the LLM's own 'past predictions' — no external model needed
Who Should Read
ML engineers improving LLM inference quality (repetition, hallucination, code generation accuracy) or backend developers handling LLM serving optimization. Particularly useful when self-hosting open-source models like Qwen3 or Llama.
Core Mechanics
- LLMs strongly depend on the immediately preceding token (locality bias) — this paper exploits this by using 'past-step predictions' as a weak amateur in Contrastive Decoding
- Reuses MTP (Multi-Token Prediction) auxiliary heads as amateurs — no additional model needed
- Designs a lightweight adapter, cMTPP (Conditional MTP Projector), for regular LLMs without MTP heads. The backbone stays frozen; only the adapter trains (3,000 steps, a few hours on 8× RTX A5000s)
- DoLa (layer-contrastive decoding) collapses on small models, while TeGu stays stable even at small scale (Qwen3-1.7B, Llama-3.2-3B)
- For models with native MTP heads like MiMo-7B, applies directly without any training (training-free)
- Reduces repetition (Rep-4 Rate) by 43% vs baseline, 32.7% vs DoLa
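The contrastive step described above can be demonstrated on toy logits (the numbers below are illustrative, not from the paper): the stale amateur is confident about the locally repeated token, so subtracting its log-probabilities shifts the choice to the expert's genuinely informative candidate.

```python
import torch
import torch.nn.functional as F

# Toy 5-token vocabulary; token 0 plays the "repetition" the amateur over-predicts.
logits_exp = torch.tensor([2.0, 1.9, 0.5, -1.0, -2.0])  # expert (current step)
logits_amt = torch.tensor([2.5, 0.2, 0.1, -1.0, -2.0])  # amateur (past-step MTP)

alpha, tau = 0.3, 0.1
log_exp = F.log_softmax(logits_exp, dim=-1)
log_amt = F.log_softmax(logits_amt, dim=-1)

# TeGu: amplify the expert, penalize what the stale amateur also predicts
guided = (1 + alpha) * log_exp - alpha * log_amt

# Adaptive Plausibility Constraint: keep only tokens the expert deems plausible
probs = log_exp.exp()
guided[probs < tau * probs.max()] = float('-inf')

print(log_exp.argmax().item())  # 0: greedy on the expert alone picks the repeated token
print(guided.argmax().item())   # 1: TeGu picks the contrastively preferred token
```

The APC mask matters here: without it, a token the expert considers implausible could win purely because the amateur dislikes it even more.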
Evidence
- Qwen3-1.7B: GSM8K 72.48% → 75.51% (+3.03%), IFEval 15.16% → 26.99% (+11.83%)
- Qwen3-8B Math500: best CD (21.40%) vs TeGu 24.20%, IFEval 29.57% → 34.20%
- Memory overhead: standard CD increases VRAM 30% (17.72→23.11GB) while TeGu adds only 2-15% latency over base
- Wikitext-2 Rep-4 Rate: Greedy 35.84% → DoLa 30.35% → TeGu (α=0.3) 20.43%
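As a quick sanity check, the Wikitext-2 Rep-4 rates above reproduce the relative reductions quoted under Core Mechanics:

```python
# Wikitext-2 Rep-4 rates (%) from the evidence list above
greedy, dola, tegu = 35.84, 30.35, 20.43

rel_reduction = lambda base, new: (base - new) / base * 100
print(round(rel_reduction(greedy, tegu), 1))  # 43.0 -> "43% vs baseline"
print(round(rel_reduction(dola, tegu), 1))    # 32.7 -> "32.7% vs DoLa"
```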
How to Apply
- For models with native MTP heads (e.g., MiMo-7B), apply TeGu decoding directly without cMTPP training — just add the logits-manipulation logic to a custom HuggingFace generate loop
- For regular models without MTP heads (e.g., Llama-3.2), fine-tune cMTPP on fineweb-edu for 3,000 steps, then apply TeGu. For small models, set α conservatively (0.1-0.2)
- Effective for math/coding/instruction-following tasks, but DoLa is better for factuality (TruthfulQA) — choose based on task purpose
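One way to package the logits manipulation for HuggingFace generation is a processor following the `LogitsProcessor` call convention (`__call__(input_ids, scores)`). This is a minimal sketch under assumptions: the class name and the `amateur_fn` callable (which must return past-step MTP/cMTPP logits) are hypothetical wiring, not the paper's reference code.

```python
import torch

class TemporalGuidanceProcessor:
    """Sketch of TeGu as a HuggingFace-style logits processor.

    Follows the LogitsProcessor __call__(input_ids, scores) interface so it
    can be dropped into a LogitsProcessorList. `amateur_fn` is a hypothetical
    callable that returns amateur logits for the current step, e.g. an MTP
    head / cMTPP applied to the hidden state cached from the previous step.
    """

    def __init__(self, amateur_fn, alpha=0.2, tau=0.1):
        self.amateur_fn = amateur_fn
        self.alpha = alpha  # contrast strength
        self.tau = tau      # Adaptive Plausibility Constraint threshold

    def __call__(self, input_ids, scores):
        log_exp = torch.log_softmax(scores, dim=-1)
        log_amt = torch.log_softmax(self.amateur_fn(input_ids), dim=-1)
        # TeGu: (1 + alpha) * log_exp - alpha * log_amt
        guided = log_exp + self.alpha * (log_exp - log_amt)
        # APC: drop tokens below tau * max expert probability
        probs = log_exp.exp()
        mask = probs < self.tau * probs.max(dim=-1, keepdim=True).values
        return guided.masked_fill(mask, float('-inf'))
```

An instance could then be passed to `model.generate` via transformers' `LogitsProcessorList`; the remaining engineering work is caching the previous step's hidden state so `amateur_fn` can feed it to the MTP head.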
Code Example
# TeGu Core Logic (pseudo-code, HuggingFace-based)
import torch
import torch.nn.functional as F

def tegu_next_token(
    model,
    input_ids,
    cmtpp,        # Conditional MTP Projector
    h_past,       # hidden state cached by the caller from k steps ago
    alpha=0.2,
    tau=0.1,      # Adaptive Plausibility Constraint threshold
    k=1,          # bi-step: use the hidden state from the immediately previous step
):
    with torch.no_grad():
        # Expert: predict with the current context
        out = model(input_ids, output_hidden_states=True)
        h_current = out.hidden_states[-1][:, -1, :]  # last-layer hidden state at the final position
        logits_exp = model.lm_head(h_current)        # expert logits

        # Amateur: MTP prediction from the hidden state of k steps ago
        logits_amt = cmtpp(h_past, k=k)              # amateur logits via cMTPP

        # Adaptive Plausibility Constraint: mask tokens whose expert
        # probability falls below tau * (max expert probability)
        probs_exp = F.softmax(logits_exp, dim=-1)
        mask = probs_exp < tau * probs_exp.max(dim=-1, keepdim=True).values

        # TeGu formula: (1 + alpha) * log_exp - alpha * log_amt
        log_exp = F.log_softmax(logits_exp, dim=-1)
        log_amt = F.log_softmax(logits_amt, dim=-1)
        guided = log_exp + alpha * (log_exp - log_amt)
        guided[mask] = float('-inf')                 # apply APC
        # Return the next token; cache h_current for the next step's amateur
        return guided.argmax(dim=-1), h_current
Original Abstract
Contrastive Decoding (CD) enhances the generation quality of large language models (LLMs) but incurs significant additional computational overhead due to the need for an auxiliary model. Existing internal self-contrastive decoding methods, such as Decoding by Contrasting Layers (DoLa), focus on discrepancies across different layers, which are notably unstable on small-scale models. In this work, based on the observation that LLMs exhibit local preferences, we propose a novel contrastive guidance strategy along the temporal dimension, namely Temporal Guidance (TeGu). Our method ingeniously leverages Multi-Token Prediction (MTP) to construct weaker amateur predictions for model self-contrast. To standardize the implementation of this mechanism, we further introduce a lightweight Conditional MTP Projector (cMTPP), which avoids maintaining multiple independent networks as required by other MTP modules. Across various model series and benchmarks, TeGu achieves significant performance improvements while maintaining low additional memory consumption and computational overhead.