Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment
TL;DR Highlight
A reward model training method that uses causality-based regularization to mitigate reward hacking, a chronic problem in RLHF.
Who Should Read
ML engineers who directly implement RLHF pipelines or deal with fine-tuned LLMs showing strange behaviors like length bias, sycophantic responses, or demographic discrimination.
Core Mechanics
- Reward hacking occurs because reward models learn spurious correlations like length, sycophancy, specific concepts, and demographic patterns instead of true quality
- Introduces counterfactual invariance, a formal condition requiring reward predictions to stay consistent when irrelevant variables change
- Adds MMD (Maximum Mean Discrepancy) as a regularization term so that reward representations are trained to be independent of spurious variables
- Proposes two variants of the Causal Reward Model (CRM): conditional CRM (stronger at removing bias) and unconditional CRM (better at preserving general utility)
- Works as a drop-in replacement that only requires modifying the loss function, and applies to both PPO and DPO
- With Llama-3 8B: sycophancy rate dropped from 92.67% to 19.78%, Yelp concept bias reduced by up to 97%, demographic discrimination score halved from 0.121 to 0.058
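The regularized objective described in the bullets above can be written out roughly as follows. This is a sketch that mirrors the Code Example in this summary, not the paper's exact notation: the bin index $m$, the per-bin reward distribution $P_m$, the set $\mathcal{P}$ of non-empty bin pairs, and the uniform average over pairs are all assumptions. Note that the kernel expression is the squared MMD.

```latex
% Bradley-Terry preference loss over (prompt, chosen, rejected) triples
\mathcal{L}_{\mathrm{BT}}
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
    \left[ \log \sigma\big( r_\theta(x, y_w) - r_\theta(x, y_l) \big) \right]

% Squared MMD between two distributions P and Q under a kernel k
\mathrm{MMD}^2(P, Q)
  = \mathbb{E}_{a, a' \sim P}\,[k(a, a')]
  + \mathbb{E}_{b, b' \sim Q}\,[k(b, b')]
  - 2\,\mathbb{E}_{a \sim P,\; b \sim Q}\,[k(a, b)]

% Total loss: preference term plus the average pairwise MMD across bins
\mathcal{L}
  = \mathcal{L}_{\mathrm{BT}}
  + \lambda \cdot \frac{1}{|\mathcal{P}|}
    \sum_{(i, j) \in \mathcal{P}} \mathrm{MMD}^2(P_i, P_j)
```

When the reward distributions $P_i$ and $P_j$ match for every pair of bins, the penalty vanishes, which is the sense in which the reward becomes independent of the binned spurious variable.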
Evidence
- Sycophancy experiment: Vanilla RM 92.67% vs Conditional CRM 19.78% (lower is better)
- Concept bias (Yelp 'Price'): Bias@C dropped from 18.88 to 0.52 (97% reduction), Acc@NoC improved from 59.26% to 97.22%
- Demographic bias overall average: Vanilla RM 0.121 vs Unconditional CRM 0.058; GPT-4o win rate analysis confirmed no utility loss
- Length bias: stronger regularization ranked shorter responses higher; the Pareto front showed an advantage over Vanilla + Length Penalty
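As a quick sanity check on the figures above, the relative reductions can be recomputed from the reported before/after values. The helper name `relative_reduction` is just for illustration; the numbers are the ones quoted in this summary.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


# Values quoted in the Evidence list above
concept_bias = relative_reduction(18.88, 0.52)      # Yelp 'Price' Bias@C
demographic = relative_reduction(0.121, 0.058)      # overall discrimination score
sycophancy_points = 92.67 - 19.78                   # absolute drop, percentage points
sycophancy_relative = relative_reduction(92.67, 19.78)

print(f"Concept bias reduction:     {concept_bias:.1%}")
print(f"Demographic bias reduction: {demographic:.1%}")
print(f"Sycophancy drop:            {sycophancy_points:.2f} points ({sycophancy_relative:.1%} relative)")
```

The concept-bias figure works out to about 97%, and the demographic score drops by a little over half, consistent with the "97% reduction" and "halved" descriptions.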
How to Apply
- Add MMD regularization to the reward model training loss: split responses into M bins by the spurious variable (e.g., length), compute the MMD between the reward distributions of the bins, and add the result as a penalty weighted by lambda.
- Using DPO? Apply the Causal DPO formula from the paper's Appendix: add the same MMD regularization to the implicit reward (beta * log(pi_theta / pi_ref)).
- For retraining a production model with bias or discrimination issues: build bins by demographic variables and train with MMD regularization so that rewards become independent of them, which reduces bias even without explicit bias data.
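The DPO variant mentioned above could look roughly like the sketch below. It is not the paper's Appendix formula, which is not reproduced in this summary; it simply applies the same pairwise-bin MMD penalty to the implicit DPO reward. The names `rbf_mmd2` and `causal_dpo_loss`, the RBF kernel, and all default values are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def rbf_mmd2(a: torch.Tensor, b: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased squared-MMD estimate between two 1-D samples (RBF kernel)."""
    def k(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        return torch.exp(-(u[:, None] - v[None, :]).pow(2) / (2 * sigma ** 2))
    return k(a, a).mean() + k(b, b).mean() - 2 * k(a, b).mean()


def causal_dpo_loss(
    policy_logp_chosen: torch.Tensor,    # log pi_theta(y_w | x), shape [B]
    policy_logp_rejected: torch.Tensor,  # log pi_theta(y_l | x), shape [B]
    ref_logp_chosen: torch.Tensor,       # log pi_ref(y_w | x), shape [B]
    ref_logp_rejected: torch.Tensor,     # log pi_ref(y_l | x), shape [B]
    bins_chosen: torch.Tensor,           # integer bin ids, shape [B]
    bins_rejected: torch.Tensor,         # integer bin ids, shape [B]
    beta: float = 0.1,                   # assumed default
    lambda_mmd: float = 0.1,             # assumed default
    num_bins: int = 10,
) -> torch.Tensor:
    # Implicit DPO rewards: beta * log(pi_theta / pi_ref)
    r_chosen = beta * (policy_logp_chosen - ref_logp_chosen)
    r_rejected = beta * (policy_logp_rejected - ref_logp_rejected)

    # Standard DPO preference term
    dpo = -F.logsigmoid(r_chosen - r_rejected).mean()

    # Group implicit rewards by spurious-variable bin, skipping empty bins
    rewards = torch.cat([r_chosen, r_rejected])
    bins = torch.cat([bins_chosen, bins_rejected])
    groups = [rewards[bins == b] for b in range(num_bins)]
    groups = [g for g in groups if g.numel() > 0]

    # Average squared MMD over all pairs of non-empty bins
    penalty = rewards.new_zeros(())
    pairs = 0
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            penalty = penalty + rbf_mmd2(groups[i], groups[j])
            pairs += 1
    if pairs > 0:
        penalty = penalty / pairs

    return dpo + lambda_mmd * penalty


# Toy usage with random log-probabilities and three bins
torch.manual_seed(0)
B = 8
pol_c = torch.randn(B, requires_grad=True)
pol_r = torch.randn(B, requires_grad=True)
ref_c, ref_r = torch.randn(B), torch.randn(B)
bins_c = torch.randint(0, 3, (B,))
bins_r = torch.randint(0, 3, (B,))

loss = causal_dpo_loss(pol_c, pol_r, ref_c, ref_r, bins_c, bins_r, num_bins=3)
loss.backward()  # gradients flow through the DPO term and the MMD penalty
```

In a real pipeline the log-probabilities would be summed token log-probs from the policy and the frozen reference model, and the bin ids would come from whatever spurious variable is being controlled for.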
Code Example
# Core loss implementation example for a Causal Reward Model (illustrative sketch)
import torch
import torch.nn.functional as F


def compute_mmd(x, y, sigma=1.0):
    """Biased estimate of the squared MMD between two samples (RBF kernel)."""
    def rbf_kernel(a, b):
        diff = a.unsqueeze(1) - b.unsqueeze(0)  # [n, m, d]
        return torch.exp(-diff.pow(2).sum(-1) / (2 * sigma ** 2))

    Kxx = rbf_kernel(x, x).mean()
    Kyy = rbf_kernel(y, y).mean()
    Kxy = rbf_kernel(x, y).mean()
    return Kxx + Kyy - 2 * Kxy


def causal_reward_loss(
    reward_model,
    x_chosen, y_chosen,      # prompts and chosen responses
    x_rejected, y_rejected,  # prompts and rejected responses
    spurious_bins,           # dict of bin-index tensors under 'chosen' / 'rejected' (e.g., length-based)
    lambda_mmd=0.1,
    num_bins=10,
):
    # 1) Bradley-Terry preference loss (reward_model is assumed to return one scalar per pair)
    r_chosen = reward_model(x_chosen, y_chosen).reshape(-1)
    r_rejected = reward_model(x_rejected, y_rejected).reshape(-1)
    # logsigmoid is numerically more stable than log(sigmoid(.))
    preference_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

    # 2) MMD regularization: match reward distributions across bins so the
    #    reward does not depend on the spurious variable
    all_rewards = torch.cat([r_chosen, r_rejected])
    all_bins = torch.cat([spurious_bins['chosen'], spurious_bins['rejected']])
    bin_rewards = {b: all_rewards[all_bins == b] for b in range(num_bins)}

    mmd_loss = all_rewards.new_zeros(())  # scalar tensor on the right device/dtype
    count = 0
    for i in range(num_bins):
        for j in range(i + 1, num_bins):
            # Skip pairs where either bin is empty in this batch
            if len(bin_rewards[i]) > 0 and len(bin_rewards[j]) > 0:
                mmd_loss = mmd_loss + compute_mmd(
                    bin_rewards[i].unsqueeze(-1),
                    bin_rewards[j].unsqueeze(-1),
                )
                count += 1
    if count > 0:
        mmd_loss = mmd_loss / count

    # 3) Final loss = preference loss + λ * MMD
    total_loss = preference_loss + lambda_mmd * mmd_loss
    return total_loss
# Usage notes
# - lambda_mmd: recommended to sweep over the range 0.1 to 3.0
# - num_bins: 10 to 30 (for length bias, split into intervals by response length)
# - Training with LoRA rank 64, alpha 128 replicates the paper's exact setting
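The loss above expects a bin index for every sample but leaves open how to produce one. One possible way, sketched here for length bias, is equal-population binning with quantile edges. The helper `quantile_bins` and the toy token counts are hypothetical and not taken from the paper.

```python
import torch


def quantile_bins(values: torch.Tensor, num_bins: int = 10) -> torch.Tensor:
    """Assign each value to one of `num_bins` roughly equal-population bins."""
    values = values.float()
    # Interior quantile edges, e.g. 9 edges for 10 bins
    qs = torch.linspace(0, 1, num_bins + 1)[1:-1]
    edges = torch.quantile(values, qs)
    # bucketize maps each value to an index in [0, num_bins - 1]
    return torch.bucketize(values, edges)


# Toy response lengths (in tokens) for 8 chosen and 8 rejected responses
chosen_lengths = torch.tensor([12, 340, 55, 87, 410, 23, 150, 200])
rejected_lengths = torch.tensor([30, 95, 500, 61, 44, 280, 18, 120])

# Bin chosen and rejected together so both share the same edges
all_lengths = torch.cat([chosen_lengths, rejected_lengths])
all_bins = quantile_bins(all_lengths, num_bins=4)

spurious_bins = {
    "chosen": all_bins[: len(chosen_lengths)],
    "rejected": all_bins[len(chosen_lengths):],
}
```

Computing edges over the whole training set once, rather than per batch, keeps bin boundaries stable across steps; per-batch quantiles are used here only to keep the sketch short.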
Original Abstract
Recent advances in large language models (LLMs) have demonstrated significant progress in performing complex tasks. While Reinforcement Learning from Human Feedback (RLHF) has been effective in aligning LLMs with human preferences, it is susceptible to spurious correlations in reward modeling. Consequently, it often introduces biases, such as length bias, sycophancy, conceptual bias, and discrimination, that hinder the model's ability to capture true causal relationships. To address this, we propose a novel causal reward modeling approach that integrates causality to mitigate these spurious correlations. Our method enforces counterfactual invariance, ensuring reward predictions remain consistent when irrelevant variables are altered. Through experiments on both synthetic and real-world datasets, we show that our approach mitigates various types of spurious correlations effectively, resulting in more reliable and fair alignment of LLMs with human preferences. As a drop-in enhancement to the existing RLHF workflow, our causal reward modeling provides a practical way to improve the trustworthiness and fairness of LLM finetuning.