Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment

Jan 16, 2025 • Chaoqi Wang, Zhuokai Zhao, Yibo Jiang +8

TL;DR Highlight

A reward model training method that uses causality-based regularization to mitigate reward hacking, a chronic problem in RLHF.

Who Should Read

ML engineers who directly implement RLHF pipelines or deal with fine-tuned LLMs showing strange behaviors like length bias, sycophantic responses, or demographic discrimination.

Core Mechanics

Reward hacking occurs because reward models learn spurious correlations (response length, sycophancy, specific concepts, demographic patterns) instead of true quality.
Introduces counterfactual invariance: a formal condition requiring reward predictions to stay the same when variables irrelevant to quality change.
Adds MMD (Maximum Mean Discrepancy) as a regularization term so that the learned reward representations become independent of the spurious variables.
Proposes two variants of the Causal Reward Model (CRM): conditional CRM (stronger at removing bias) and unconditional CRM (better at preserving general utility).
Works as a drop-in replacement that only requires modifying the loss function, and applies to both PPO and DPO.
With Llama-3 8B: the sycophancy rate dropped from 92.67% to 19.78%, Yelp concept bias fell by up to 97%, and the demographic discrimination score roughly halved, from 0.121 to 0.058.
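
In symbols, the regularized objective described above can be sketched as follows. The notation is an assumption made for illustration, since this summary does not give the paper's exact symbols: r_θ is the reward model, (x, y_w, y_l) is a prompt with its chosen and rejected responses, P_b is the distribution of rewards for samples falling in bin b of the spurious variable, M is the number of bins, k is a kernel, and λ is the regularization weight. Averaging over all pairs of bins mirrors the code example in this summary; the paper may aggregate differently.

```latex
\mathcal{L}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
      \Big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\Big]
  \;+\; \lambda \cdot \frac{1}{\binom{M}{2}}
      \sum_{1 \le b < b' \le M} \mathrm{MMD}^2\big(P_b, P_{b'}\big)

\mathrm{MMD}^2(P, Q)
  = \mathbb{E}_{u,u' \sim P}\big[k(u,u')\big]
  + \mathbb{E}_{v,v' \sim Q}\big[k(v,v')\big]
  - 2\,\mathbb{E}_{u \sim P,\; v \sim Q}\big[k(u,v)\big]
```

The second line is the standard kernel form of the squared MMD. It is zero when the two reward distributions match under the kernel, so driving the penalty down pushes the reward to carry no information about which bin a response came from.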

Evidence

Sycophancy experiment: Vanilla RM 92.67% vs Conditional CRM 19.78% (lower is better)
Concept bias (Yelp 'Price'): Bias@C dropped from 18.88 to 0.52 (97% reduction), Acc@NoC improved from 59.26% to 97.22%
Demographic bias overall average: Vanilla RM 0.121 vs Unconditional CRM 0.058; a GPT-4o win-rate analysis found no loss in utility
Length bias: stronger regularization ranked shorter responses higher, and the Pareto front showed an advantage over Vanilla + Length Penalty
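
The headline percentages can be checked against the raw figures quoted above. The short script below is only an arithmetic sanity check; the helper name `relative_reduction` is introduced here for illustration and does not come from the paper.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Figures exactly as quoted in the Evidence list above.
reported = {
    "sycophancy rate (%)": (92.67, 19.78),
    "Yelp 'Price' Bias@C": (18.88, 0.52),
    "demographic bias score": (0.121, 0.058),
}

for name, (before, after) in reported.items():
    pct = relative_reduction(before, after)
    print(f"{name}: {before} -> {after} ({pct:.1f}% relative reduction)")
```

Run this way, the Bias@C drop works out to about 97.2%, which matches the "97% reduction" claim, and the demographic score drop is about 52.1%, which is what "halved" refers to.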

How to Apply

Add MMD regularization to the reward model training loss: split responses into M bins by the spurious variable (e.g., length), compute the MMD between the reward distributions of the bins, and add it to the loss as a penalty weighted by lambda.
Using DPO? Apply the Causal DPO formula from the paper's Appendix: add the same MMD regularization to the implicit reward, which is proportional to log(pi_theta / pi_ref).
For production models being retrained to address bias or discrimination: build bins from demographic variables and train with MMD regularization so that rewards become independent of them. This can reduce bias even without explicit bias data.
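
This summary does not reproduce the Appendix formula, so the following is only a sketch of what "adding the same MMD regularization to the implicit reward" could look like. It combines the standard DPO loss with a pairwise MMD penalty over bins; the scale β, the set of bin pairs 𝒫, the averaging over pairs, and all symbols are assumptions made for illustration rather than the paper's exact formulation.

```latex
\hat{r}_\theta(x, y)
  = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}

\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
      \Big[\log \sigma\big(\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)\big)\Big]

\mathcal{L}_{\mathrm{causal}}(\theta)
  = \mathcal{L}_{\mathrm{DPO}}(\theta)
  + \lambda \cdot \frac{1}{|\mathcal{P}|}
      \sum_{(b,\,b') \in \mathcal{P}} \mathrm{MMD}^2\big(\hat{P}_b, \hat{P}_{b'}\big)
```

Here the distribution of implicit rewards in bin b, written as \hat{P}_b, plays the role that the reward model's output distribution plays in the reward-model case, so the same binning of the spurious variable can be reused.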

Code Example

# Core loss implementation example for Causal Reward Model
import torch
from torch import nn

def compute_mmd(x, y, sigma=1.0):
    """Biased estimate of squared MMD between samples x [n, d] and y [m, d] (RBF kernel)."""
    def rbf_kernel(a, b):
        diff = a.unsqueeze(1) - b.unsqueeze(0)  # [n, m, d]
        return torch.exp(-diff.pow(2).sum(-1) / (2 * sigma ** 2))
    
    Kxx = rbf_kernel(x, x).mean()
    Kyy = rbf_kernel(y, y).mean()
    Kxy = rbf_kernel(x, y).mean()
    return Kxx + Kyy - 2 * Kxy

def causal_reward_loss(
    reward_model,
    x_chosen, y_chosen,   # chosen responses
    x_rejected, y_rejected,  # rejected responses
    spurious_bins,        # bin index for each sample (e.g., length-based)
    lambda_mmd=0.1,
    num_bins=10
):
    # 1) Bradley-Terry preference loss; rewards are assumed to have shape [batch].
    #    logsigmoid is numerically more stable than log(sigmoid(.)).
    r_chosen = reward_model(x_chosen, y_chosen)
    r_rejected = reward_model(x_rejected, y_rejected)
    preference_loss = -nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    
    # 2) MMD regularization: pull the reward distributions of different bins together,
    #    so that the reward becomes independent of the spurious variable
    all_rewards = torch.cat([r_chosen, r_rejected])
    all_bins = torch.cat([spurious_bins['chosen'], spurious_bins['rejected']])
    
    mmd_loss = 0.0
    bin_rewards = {b: all_rewards[all_bins == b] for b in range(num_bins)}
    
    count = 0
    for i in range(num_bins):
        for j in range(i+1, num_bins):
            if len(bin_rewards[i]) > 0 and len(bin_rewards[j]) > 0:
                mmd_loss += compute_mmd(
                    bin_rewards[i].unsqueeze(-1),
                    bin_rewards[j].unsqueeze(-1)
                )
                count += 1
    
    if count > 0:
        mmd_loss /= count
    
    # 3) Final loss = preference loss + λ * MMD
    total_loss = preference_loss + lambda_mmd * mmd_loss
    return total_loss

# Usage notes
# - lambda_mmd: sweep over roughly 0.1 to 3.0
# - num_bins: 10 to 30 (for length bias, bin samples by response length)
# - LoRA rank 64 with alpha 128 matches the setting reported for the paper
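
The loss above expects a `spurious_bins` mapping with one bin index per chosen and per rejected response. A minimal sketch of building it for length bias is shown below, in plain Python. The helper names `quantile_edges` and `assign_bins`, the example token counts, and the choice of quantile-based (rather than equal-width) intervals are all assumptions made for illustration.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence


def quantile_edges(values: Sequence[float], num_bins: int) -> List[float]:
    """Return num_bins - 1 interior cut points taken at evenly spaced ranks."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // num_bins] for i in range(1, num_bins)]


def assign_bins(values: Sequence[float], edges: Sequence[float]) -> List[int]:
    """Map each value to a bin index between 0 and len(edges), inclusive."""
    return [bisect_right(edges, v) for v in values]


# Hypothetical token counts for a small batch of preference pairs.
chosen_lengths = [12, 48, 95, 130, 210, 305]
rejected_lengths = [8, 30, 77, 160, 190, 400]

num_bins = 3
# Compute the edges once over both sides so chosen and rejected
# responses are binned against the same length intervals.
edges = quantile_edges(chosen_lengths + rejected_lengths, num_bins)

spurious_bins: Dict[str, List[int]] = {
    "chosen": assign_bins(chosen_lengths, edges),
    "rejected": assign_bins(rejected_lengths, edges),
}
print(edges, spurious_bins)
```

Because `causal_reward_loss` concatenates the two entries with `torch.cat` and compares them against bin indices, each list would need to be converted first, for example with `torch.tensor(spurious_bins["chosen"], dtype=torch.long)`. Quantile edges keep the bins roughly equally populated, which helps avoid empty bins in a batch.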

Terminology

RLHF: Training an LLM with human preference data. First builds a reward model that converts human preferences into scores, then trains the model to maximize those scores.

Reward Hacking: When AI finds shortcuts to game the scoring system. E.g., discovering that longer answers score higher and generating verbose content regardless of quality.

Spurious Correlation: A pattern appearing together in data without a true causal relationship.

Counterfactual Invariance: Asking 'would the result be the same if this part were different?' If changing response length shouldn't change the reward score, the reward model is focused on content.

MMD: Maximum Mean Discrepancy. A statistical metric measuring how different two data distributions are.

Sycophancy: When a model only says what the user wants to hear, agreeing even when wrong.

PPO: Proximal Policy Optimization. A reinforcement learning algorithm widely used for LLM fine-tuning.

DPO: Direct Preference Optimization. Trains an LLM directly from preference data without training a separate reward model.
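
To make the MMD entry above concrete, the sketch below computes a squared MMD with an RBF kernel on two small sets of scalar rewards, using only the standard library. The reward values and variable names are made up for illustration; the estimator is the simple biased one that averages the kernel over all pairs.

```python
import math
from typing import Sequence


def rbf(a: float, b: float, sigma: float = 1.0) -> float:
    """RBF (Gaussian) kernel between two scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))


def mean_kernel(xs: Sequence[float], ys: Sequence[float], sigma: float = 1.0) -> float:
    """Average kernel value over all pairs (x, y)."""
    total = sum(rbf(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs: Sequence[float], ys: Sequence[float], sigma: float = 1.0) -> float:
    """Biased estimate of squared MMD between two samples."""
    return (
        mean_kernel(xs, xs, sigma)
        + mean_kernel(ys, ys, sigma)
        - 2.0 * mean_kernel(xs, ys, sigma)
    )


# Hypothetical rewards for short responses and two sets of long responses.
rewards_short = [0.1, 0.3, 0.2, 0.4]
rewards_long_similar = [0.2, 0.4, 0.1, 0.3]   # same values, different order
rewards_long_inflated = [1.6, 1.8, 1.7, 1.9]  # shifted up by 1.5

same = mmd_squared(rewards_short, rewards_long_similar)
shifted = mmd_squared(rewards_short, rewards_long_inflated)
print(f"similar bins: {same:.6f}, inflated bin: {shifted:.6f}")
```

When long responses receive the same rewards as short ones, the value is essentially zero; when long responses are systematically scored higher, it becomes clearly positive. That gap is the signal the regularizer penalizes.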

Original Abstract

Recent advances in large language models (LLMs) have demonstrated significant progress in performing complex tasks. While Reinforcement Learning from Human Feedback (RLHF) has been effective in aligning LLMs with human preferences, it is susceptible to spurious correlations in reward modeling. Consequently, it often introduces biases (such as length bias, sycophancy, conceptual bias, and discrimination) that hinder the model's ability to capture true causal relationships. To address this, we propose a novel causal reward modeling approach that integrates causality to mitigate these spurious correlations. Our method enforces counterfactual invariance, ensuring reward predictions remain consistent when irrelevant variables are altered. Through experiments on both synthetic and real-world datasets, we show that our approach mitigates various types of spurious correlations effectively, resulting in more reliable and fair alignment of LLMs with human preferences. As a drop-in enhancement to the existing RLHF workflow, our causal reward modeling provides a practical way to improve the trustworthiness and fairness of LLM finetuning.