Teaching Language Models to Critique via Reinforcement Learning
TL;DR Highlight
A dedicated RL-trained Critic model gives better code-revision feedback than GPT-4o used as a critic.
Who Should Read
ML engineers or AI product developers building automatic feedback and revision loops for LLM-based code generation pipelines. Especially those thinking about test-time compute scaling to boost performance.
Core Mechanics
- Separated Generator and Critic: only the Critic is trained, with RL. A Qwen2.5-Coder-32B-based Critic improves performance even when attached to stronger generators like GPT-4o (weak-to-strong generalization)
- Two-stage training: first SFT using sandboxed code-execution results as hints, then RL with GRPO; PPO was abandoned because its value network trained unstably
- Self-critique is nearly ineffective (7.88% to 8.36% Pass@1); CTRL reaches 11.76% while keeping the wrong-answer rate low at 0.85%
- Iterative critique-revision enables test-time scaling — 5 iterations achieve 16.24% Pass@1 on CodeContests (+106.1% vs zero-shot 7.88%)
- A CTRL Critic trained only on coding problems scores 64.3% overall on JudgeBench (which also covers general knowledge, math, and reasoning), matching Claude-3.5-Sonnet
- Self-critique tends to make minimal changes (average similarity 0.482); CTRL makes structural changes (0.313) — this drives the performance difference
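The similarity numbers above can be reproduced with any standard edit-similarity measure; a minimal sketch using Python's difflib (the paper's exact metric may differ):

```python
from difflib import SequenceMatcher

def revision_similarity(original: str, revised: str) -> float:
    """Edit similarity in [0, 1]; 1.0 means the revision changed nothing."""
    return SequenceMatcher(None, original, revised).ratio()

buggy = "def add(a, b):\n    return a - b\n"
minimal_fix = "def add(a, b):\n    return a + b\n"  # one-character change
rewrite = "def total(nums):\n    acc = 0\n    for n in nums:\n        acc += n\n    return acc\n"

# A self-critique-style minimal edit stays highly similar;
# a CTRL-style structural rewrite scores much lower.
print(revision_similarity(buggy, minimal_fix) > revision_similarity(buggy, rewrite))
```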
Evidence
- 5 iterations on CodeContests: Pass@1 16.24% — 106.1% relative improvement vs zero-shot 7.88%
- With GPT-4o as Generator: CTRL Critic achieves higher Pass@1 than GPT-4o Critic (25.45% vs 20.61% at 5 iterations)
- JudgeBench coding accuracy: CTRL 65.7%, Claude-3.5-Sonnet 64.3%, GPT-4o 56.6%
- Hard problems at 6 iterations: +233.3% pass rate improvement (Easy +73.2%, Medium +161.1%)
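The relative gains quoted above follow directly from the absolute pass rates; a quick sanity check:

```python
def relative_improvement(baseline_pct: float, improved_pct: float) -> float:
    """Relative improvement, in percent, of improved over baseline."""
    return (improved_pct - baseline_pct) / baseline_pct * 100

# CodeContests Pass@1: zero-shot 7.88% -> 16.24% after 5 critique-revision rounds
gain = relative_improvement(7.88, 16.24)
print(f"{gain:.1f}%")  # 106.1%
```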
How to Apply
- In automated code review pipelines, have a separate Critic model generate three-part feedback (analysis > improvement suggestions > correctness judgment); this is more actionable than raw error messages.
- If you have a sandbox, auto-generate SFT data using execution results as hints, then RL-train with GRPO; this builds a Critic model without human labels.
- At inference time, run the Critic several times and aggregate its Correct/Incorrect judgments by majority vote to use it as a reward model.
Code Example
# CTRL-style Critique prompt template (based on paper Appendix C.2)
CRITIQUE_PROMPT = """
You are tasked with analyzing an answer to a problem and providing constructive feedback.
Do NOT provide direct solutions.
Problem description:
<problem>
{problem}
</problem>
Answer:
<answer>
{answer}
</answer>
Structure your response using the following format:
Analysis:
{{Analysis of strengths and weaknesses}}
Improvement suggestions:
{{Actionable suggestions for improvement}}
Overall judgment: {{Correct/Incorrect}}
"""
# Version with a hint added when execution feedback is available
# (e.g., input/expected/actual of failed test cases)
HINTED_CRITIQUE_PROMPT = """
You are tasked with analyzing an answer to a problem and providing constructive feedback.
Do NOT provide direct solutions.
Please carefully reason about the hint to guide the user.
**Important: Do NOT mention 'the hint' in your feedback.**
Problem description:
<problem>
{problem}
</problem>
Answer:
<answer>
{solution}
</answer>
Hint:
<hint>
{hint}
</hint>
Analysis:
...
Improvement suggestions:
...
Overall judgment: Correct/Incorrect
"""
# Aggregating critic judgments (majority voting)
from collections import Counter

def aggregate_judgments(critiques: list[str]) -> str:
    judgments = []
    for c in critiques:
        if 'Overall judgment: Correct' in c:
            judgments.append('Correct')
        elif 'Overall judgment: Incorrect' in c:
            judgments.append('Incorrect')
    if not judgments:
        return 'Incorrect'  # no parseable judgment: treat as a failure
    return Counter(judgments).most_common(1)[0][0]
Terminology
Related Resources
Original Abstract
Teaching large language models (LLMs) to critique and refine their outputs is crucial for building systems that can iteratively improve, yet it is fundamentally limited by the ability to provide accurate judgments and actionable suggestions. In this work, we study LLM critics for code generation and propose $\texttt{CTRL}$, a framework for $\texttt{C}$ritic $\texttt{T}$raining via $\texttt{R}$einforcement $\texttt{L}$earning, which trains a critic model to generate feedback that maximizes correction performance for a fixed generator model without human supervision. Our results demonstrate that critics trained with $\texttt{CTRL}$ significantly enhance pass rates and mitigate compounding errors across both base and stronger generator models. Furthermore, we show that these critic models act as accurate generative reward models and enable test-time scaling through iterative critique-revision, achieving up to 106.1% relative improvements across challenging code generation benchmarks.