Teaching Language Models to Critique via Reinforcement Learning
TL;DR Highlight
A dedicated RL-trained Critic model gives better code-revision feedback than GPT-4o used as a critic.
Who Should Read
ML engineers or AI product developers building automatic feedback and revision loops for LLM-based code generation pipelines. Especially those thinking about test-time compute scaling to boost performance.
Core Mechanics
- Separated Generator and Critic: only the Critic is trained, with RL. A Qwen2.5-Coder-32B-based Critic improves performance even when attached to stronger generators like GPT-4o (weak-to-strong generalization)
- Two-stage training: first SFT using sandboxed code-execution results as hints, then RL with GRPO; PPO was abandoned because its value network trained unstably
- Self-critique is nearly ineffective (7.88% to 8.36% Pass@1); CTRL reaches 11.76% while keeping the wrong-answer rate low at 0.85%
- Iterative critique-revision enables test-time scaling — 5 iterations achieve 16.24% Pass@1 on CodeContests (+106.1% vs zero-shot 7.88%)
- A CTRL Critic trained only on coding problems scores 64.3% overall on JudgeBench (which also covers general knowledge, math, and reasoning), matching Claude-3.5-Sonnet
- Self-critique tends to make minimal changes (average similarity 0.482); CTRL makes structural changes (0.313) — this drives the performance difference
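The similarity numbers above can be reproduced with any standard edit-similarity measure; a minimal sketch using Python's difflib (the paper's exact metric may differ):

```python
from difflib import SequenceMatcher

def revision_similarity(original: str, revised: str) -> float:
    """Edit similarity in [0, 1]; 1.0 means the revision changed nothing."""
    return SequenceMatcher(None, original, revised).ratio()

buggy = "def add(a, b):\n    return a - b\n"
minimal_fix = "def add(a, b):\n    return a + b\n"  # one-character change
rewrite = "def total(nums):\n    acc = 0\n    for n in nums:\n        acc += n\n    return acc\n"

# A self-critique-style minimal edit stays highly similar;
# a CTRL-style structural rewrite scores much lower.
print(revision_similarity(buggy, minimal_fix) > revision_similarity(buggy, rewrite))
```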
Evidence
- 5 iterations on CodeContests: Pass@1 16.24% — 106.1% relative improvement vs zero-shot 7.88%
- With GPT-4o as Generator: CTRL Critic achieves higher Pass@1 than GPT-4o Critic (25.45% vs 20.61% at 5 iterations)
- JudgeBench coding accuracy: CTRL 65.7%, Claude-3.5-Sonnet 64.3%, GPT-4o 56.6%
- Hard problems at 6 iterations: +233.3% pass rate improvement (Easy +73.2%, Medium +161.1%)
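The relative gains quoted above follow directly from the absolute pass rates; a quick sanity check:

```python
def relative_improvement(baseline_pct: float, improved_pct: float) -> float:
    """Relative improvement, in percent, of improved over baseline."""
    return (improved_pct - baseline_pct) / baseline_pct * 100

# CodeContests Pass@1: zero-shot 7.88% -> 16.24% after 5 critique-revision rounds
gain = relative_improvement(7.88, 16.24)
print(f"{gain:.1f}%")  # 106.1%
```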
How to Apply
- In automated code review pipelines, have a separate Critic model generate three-part feedback (analysis > improvement suggestions > correctness judgment); this is more actionable than raw error messages.
- If you have a sandbox, auto-generate SFT data using execution results as hints, then RL-train with GRPO; this builds a Critic model without human labels.
- At inference time, run the Critic several times and aggregate its Correct/Incorrect judgments by majority vote to use it as a reward model.
Code Example
# CTRL-style Critique prompt template (based on paper Appendix C.2)
CRITIQUE_PROMPT = """
You are tasked with analyzing an answer to a problem and providing constructive feedback.
Do NOT provide direct solutions.
Problem description:
<problem>
{problem}
</problem>
Answer:
<answer>
{answer}
</answer>
Structure your response using the following format:
Analysis:
{{Analysis of strengths and weaknesses}}
Improvement suggestions:
{{Actionable suggestions for improvement}}
Overall judgment: {{Correct/Incorrect}}
"""
# Version with a hint added when execution feedback is available
# (e.g., input/expected/actual of failed test cases)
HINTED_CRITIQUE_PROMPT = """
You are tasked with analyzing an answer to a problem and providing constructive feedback.
Do NOT provide direct solutions.
Please carefully reason about the hint to guide the user.
**Important: Do NOT mention 'the hint' in your feedback.**
Problem description:
<problem>
{problem}
</problem>
Answer:
<answer>
{solution}
</answer>
Hint:
<hint>
{hint}
</hint>
Analysis:
...
Improvement suggestions:
...
Overall judgment: Correct/Incorrect
"""
# Aggregating critic judgments (majority voting)
from collections import Counter

def aggregate_judgments(critiques: list[str]) -> str:
    judgments = []
    for c in critiques:
        if 'Overall judgment: Correct' in c:
            judgments.append('Correct')
        elif 'Overall judgment: Incorrect' in c:
            judgments.append('Incorrect')
    if not judgments:
        return 'Incorrect'  # no parseable judgment: treat as a failure
    return Counter(judgments).most_common(1)[0][0]
Terminology
Related Resources
Original Abstract
Teaching large language models (LLMs) to critique and refine their outputs is crucial for building systems that can iteratively improve, yet it is fundamentally limited by the ability to provide accurate judgments and actionable suggestions. In this work, we study LLM critics for code generation and propose $\texttt{CTRL}$, a framework for $\texttt{C}$ritic $\texttt{T}$raining via $\texttt{R}$einforcement $\texttt{L}$earning, which trains a critic model to generate feedback that maximizes correction performance for a fixed generator model without human supervision. Our results demonstrate that critics trained with $\texttt{CTRL}$ significantly enhance pass rates and mitigate compounding errors across both base and stronger generator models. Furthermore, we show that these critic models act as accurate generative reward models and enable test-time scaling through iterative critique-revision, achieving up to 106.1% relative improvements across challenging code generation benchmarks.