Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models
TL;DR Highlight
EBFT: a fine-tuning method that matches sequence-level feature statistics of model rollouts to the ground truth instead of doing token-level training like SFT; it matches RLVR and significantly improves over SFT on several benchmarks.
Who Should Read
Researchers working on fine-tuning techniques for LLMs, especially those looking for alternatives or improvements to standard SFT.
Core Mechanics
- Proposes energy-based fine-tuning (EBFT): instead of minimizing token-level cross-entropy under teacher forcing like SFT, it aligns sequence-level feature statistics of model rollouts with the ground truth
- EBFT operates on intermediate representations rather than token predictions, making it less sensitive to exact token-level noise in training data
- The feature-matching reward is computed in closed form from embeddings, so no task-specific verifier or learned preference model is required
- Particularly effective for tasks where exact token match is a poor proxy for quality (open-ended generation, reasoning)
- Can be combined with existing SFT or used as a replacement
- Shows strong results on Q&A coding, unstructured coding, and translation benchmarks
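The sequence-level "features" above are embeddings of each rollout. As an illustrative sketch (masked mean pooling is an assumed choice here, not necessarily the paper's exact feature extractor), token hidden states can be pooled into one feature vector per generation:

```python
# Illustrative sketch: pooling token-level hidden states into one
# sequence-level feature vector per rollout. Mean pooling is an assumed
# choice; the paper's exact feature extractor may differ.
import torch

def sequence_feature(hidden_states, attention_mask):
    """hidden_states: [batch, seq_len, d]; attention_mask: [batch, seq_len]
    with 1 for real tokens and 0 for padding. Returns [batch, d]."""
    mask = attention_mask.unsqueeze(-1).float()   # [b, t, 1]
    summed = (hidden_states * mask).sum(dim=1)    # [b, d]
    counts = mask.sum(dim=1).clamp(min=1.0)       # [b, 1]
    return summed / counts                        # [b, d]
```

Because padded positions are masked out, the pooled feature depends only on the generated tokens, which is what makes it less sensitive to token-level noise.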
Evidence
- EBFT matches RLVR and outperforms standard SFT on downstream accuracy across Q&A coding, unstructured coding, and translation
- It also achieves lower validation cross-entropy than both SFT and RLVR
- Improvements are most pronounced on tasks with open-ended outputs where exact token match is a weak signal
- Computational overhead relative to SFT is modest: strided block-parallel sampling generates multiple rollouts from nested prefixes concurrently, amortizing the cost of generation
- Combines well with other training techniques (DPO, RLHF) as a complementary stage
How to Apply
- Use EBFT as a drop-in replacement or complement to SFT in your fine-tuning pipeline
- Most beneficial for tasks with open-ended outputs or where your training data has noisy token-level labels
- The feature alignment is computed on intermediate transformer layers — target the layers most relevant to your task
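One common pattern for targeting a specific intermediate layer is a forward hook that captures that layer's activations. A minimal sketch, with a toy two-layer network standing in for a transformer (the layer index and module structure are illustrative assumptions):

```python
# Illustrative sketch: capturing an intermediate layer's activations with a
# forward hook, so feature alignment can target a chosen layer. The toy
# network here stands in for a transformer; layer choice is task-dependent.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
captured = {}

def save_activation(module, inputs, output):
    captured["feat"] = output.detach()

# Hook the first linear layer (index 0) as the "feature" layer
handle = model[0].register_forward_hook(save_activation)
x = torch.randn(4, 16)
_ = model(x)
handle.remove()
print(captured["feat"].shape)  # torch.Size([4, 32])
```

The same pattern applies to a real transformer by hooking the block whose representations matter most for the task.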
Code Example
# EBFT core reward computation (based on Algorithm 1)
import torch
import torch.nn.functional as F

def compute_ebft_rewards(rollout_features, gt_feature, n_samples):
    """
    rollout_features: [n_samples, feature_dim] - features of model generations
    gt_feature: [feature_dim] - feature of the ground truth
    n_samples: number of samples per prompt
    Returns RLOO advantages of shape [n_samples].
    """
    # Whitening: compute the second-moment matrix of the rollout features
    Sigma = rollout_features.T @ rollout_features / n_samples  # [d, d]
    # Square root of the pseudo-inverse via SVD (robust to near-singular Sigma)
    U, S, Vh = torch.linalg.svd(Sigma)
    S_inv_sqrt = torch.where(S > 1e-6, 1.0 / S.sqrt(), torch.zeros_like(S))
    whitening_mat = U @ torch.diag(S_inv_sqrt) @ Vh  # [d, d]
    # Whitened features
    w_rollouts = rollout_features @ whitening_mat.T  # [n, d]
    w_gt = gt_feature @ whitening_mat.T              # [d]
    # Normalized alignment term (Variant i): cosine similarity to ground truth
    w_rollouts_norm = F.normalize(w_rollouts, dim=-1)
    w_gt_norm = F.normalize(w_gt.unsqueeze(0), dim=-1)
    alignment = 2 * (w_rollouts_norm * w_gt_norm).sum(dim=-1)  # [n]
    # Diversity term: penalize similarity to the other rollouts
    rewards = []
    for j in range(n_samples):
        other_idx = [k for k in range(n_samples) if k != j]
        diversity = 2 / (n_samples - 1) * sum(
            (w_rollouts[j] * w_rollouts[k]).sum()
            for k in other_idx
        )
        rewards.append(alignment[j] - diversity)
    rewards = torch.stack(rewards)  # [n]
    # RLOO baseline: mean reward of the other samples
    baseline = torch.stack([
        rewards[torch.arange(n_samples) != j].mean()
        for j in range(n_samples)
    ])
    return rewards - baseline  # advantage
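The advantages feed an on-policy policy-gradient update, as the abstract describes. A minimal REINFORCE-style sketch (the exact loss form used in EBFT is an assumption here):

```python
# Illustrative sketch: plugging leave-one-out advantages into a
# REINFORCE-style policy-gradient loss. EBFT's exact update may differ.
import torch

def policy_gradient_loss(seq_logprobs, advantages):
    """seq_logprobs: [n] summed token log-probs per rollout (requires grad);
    advantages: [n] baseline-subtracted rewards, treated as constants."""
    return -(advantages.detach() * seq_logprobs).mean()

seq_logprobs = torch.randn(4, requires_grad=True)
advantages = torch.tensor([0.5, -0.1, -0.2, -0.2])
loss = policy_gradient_loss(seq_logprobs, advantages)
loss.backward()  # grad pushes up log-probs of above-baseline rollouts
```

Detaching the advantages keeps the gradient flowing only through the policy's log-probabilities, which is what makes this a policy-gradient estimator.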
Original Abstract
Cross-entropy (CE) training provides dense and scalable supervision for language models, but it optimizes next-token prediction under teacher forcing rather than sequence-level behavior under model rollouts. We introduce a feature-matching objective for language-model fine-tuning that targets sequence-level statistics of the completion distribution, providing dense semantic feedback without requiring a task-specific verifier or preference model. To optimize this objective efficiently, we propose energy-based fine-tuning (EBFT), which uses strided block-parallel sampling to generate multiple rollouts from nested prefixes concurrently, batches feature extraction over these rollouts, and uses the resulting embeddings to perform an on-policy policy-gradient update. We present a theoretical perspective connecting EBFT to KL-regularized feature-matching and energy-based modeling. Empirically, across Q&A coding, unstructured coding, and translation, EBFT matches RLVR and outperforms SFT on downstream accuracy while achieving a lower validation cross-entropy than both methods.