Self-Challenging Language Model Agents
TL;DR Highlight
No human training data needed — an LLM agent creates its own problems, solves them, and doubles its tool-use success rate through self-improvement.
Who Should Read
ML engineers looking to train tool-use agents with RL, or AI researchers wanting to boost small model agent performance without labeling costs.
Core Mechanics
- Agent plays 'challenger' role to explore the environment and generate training tasks, then 'executor' role to solve them with RL — fully automated with no human involvement
- Code-as-Task (CaT): tasks defined as instruction + verification function + example solution + failure cases — quality auto-filtered by code executor
- Self-improvement: Llama-3.1-8B average Pass@1 doubles from 12.0% to 23.5%
- Also works for distillation: large model generates tasks, small model learns from them
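The challenger/executor cycle above can be sketched as a toy loop. This is an illustrative stand-in, not the paper's implementation: the real framework uses an LLM in both roles and RL updates, whereas here the "policy" is a lookup table and tasks are trivial, purely to show how the reward from the verification function drives learning.

```python
import random

# Toy self-challenging loop (illustrative names, not the paper's API).

def challenger_propose(env_keys):
    """Challenger role: after exploring the environment, emit a task as
    (instruction, verification function)."""
    target = random.choice(env_keys)
    return (f"set {target} to 1", lambda state: state.get(target) == 1)

def executor_solve(instruction, policy):
    """Executor role: produce an action; here the policy is a lookup table."""
    return policy.get(instruction, None)

def rl_round(policy, env_keys, n_tasks=20):
    """One round: generate tasks, attempt them, and reinforce only the
    behavior that earns reward (a crude stand-in for rejection fine-tuning)."""
    for _ in range(n_tasks):
        instruction, verify = challenger_propose(env_keys)
        action = executor_solve(instruction, policy) or random.choice(env_keys)
        state = {action: 1}                  # execute the chosen action
        if verify(state):                    # evaluate() acts as a binary reward
            policy[instruction] = action     # keep only rewarded behavior
    return policy

policy = rl_round({}, ["a", "b", "c"], n_tasks=50)
```

The key structural point is that no human supplies tasks or rewards: the challenger generates both the instruction and the verification function, and the executor improves against them.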
Evidence
- Self-improvement: Llama-3.1-8B average Pass@1 12.0% → 23.5% (roughly 2x), outperforming PAE by +10.6pp
- Distillation: +20.2pp absolute Pass@1 improvement without any human tasks (4-environment average)
- PPO on Calculation environment: 43.2% Pass@1 (vs. 20.3% zero-shot, more than 2x improvement)
How to Apply
- If you want RL for a new tool-use environment but have no training tasks: deploy the agent in challenger mode with the API docs, auto-generate tasks in CaT format (instruction + verification function + example solution + failure cases) → filter with a code executor → run executor RL training.
- For distilling large models to small ones: let the large model generate CaT tasks, then train the small model with rejection fine-tuning or PPO on those tasks — no human labeling needed.
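The rejection fine-tuning step in the distillation recipe can be sketched as a data filter: sample attempts from the student and keep only those the task's verification function accepts. All names here (`build_rft_dataset`, `sample_trajectory`, the task dict layout) are illustrative assumptions, not the paper's code.

```python
# Rejection fine-tuning data selection sketch (hypothetical API).

def build_rft_dataset(tasks, sample_trajectory, k=4):
    """For each (teacher-generated) CaT task, sample k student attempts and
    keep only those that the task's evaluate() function accepts."""
    dataset = []
    for task in tasks:
        for _ in range(k):
            traj = sample_trajectory(task["instruction"])
            if task["evaluate"](traj):       # binary reward from evaluate()
                dataset.append((task["instruction"], traj))
    return dataset

# Toy usage: the "student" just echoes the instruction's last word
tasks = [{"instruction": "return item X", "evaluate": lambda t: t == "X"}]
data = build_rft_dataset(tasks, lambda ins: ins.split()[-1], k=2)
```

The surviving (instruction, trajectory) pairs are then used as supervised fine-tuning data for the small model; PPO instead uses the same binary signal as an online reward.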
Code Example
# Code-as-Task (CaT) structure example
# After the challenger agent explores the environment, it outputs tasks in the following format
"""
<instruction>
Your name is Olivia Nguyen and your email is olivia4794@example.com.
For #W112, return the Skateboard via paypal_77.
</instruction>
<evaluation_function>
def evaluate():
    success = True
    order = get_order_details("#W112")
    success = success and order["return_items"][0] == "6843647669"
    success = success and order["return_payment_method_id"] == "paypal_77"
    return success
</evaluation_function>
<solution>
return_delivered_order_items(
    order_id="#W112",
    item_ids=["6843647669"],
    payment_method_id="paypal_77"
)
</solution>
<failure_case>
# Wrong payment method
return_delivered_order_items(
    order_id="#W112",
    item_ids=["6843647669"],
    payment_method_id="credit_card_77"  # Incorrect value
)
</failure_case>
"""
# Automatic filtering conditions (all must pass for a valid task)
# 1) evaluation_function code is executable
# 2) example solution → evaluate() == True
# 3) all failure_cases → evaluate() == False
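The three filtering conditions can be made concrete in a small runnable sketch. This is an illustrative toy, not the paper's implementation: the environment is a plain dict, actions are Python functions that mutate it, and `validate_cat_task` is a hypothetical name.

```python
import copy

# Minimal CaT filter sketch: a task is kept only if its example solution
# passes the verification function and every failure case fails it.

def validate_cat_task(evaluate, solution, failure_cases, initial_env):
    def outcome(action):
        env = copy.deepcopy(initial_env)   # fresh environment per rollout
        try:
            action(env)                    # run the candidate tool calls
            return bool(evaluate(env))     # apply the verification function
        except Exception:
            return False                   # 1) non-executable code never passes
    if not outcome(solution):              # 2) example solution must succeed
        return False
    return all(not outcome(f) for f in failure_cases)  # 3) failures must fail

# Toy environment mirroring the retail example above
env = {"#W112": {"return_items": [], "return_payment_method_id": None}}

def evaluate(env):
    order = env["#W112"]
    return (order["return_items"][:1] == ["6843647669"]
            and order["return_payment_method_id"] == "paypal_77")

def solution(env):
    env["#W112"]["return_items"].append("6843647669")
    env["#W112"]["return_payment_method_id"] = "paypal_77"

def failure(env):  # wrong payment method
    env["#W112"]["return_items"].append("6843647669")
    env["#W112"]["return_payment_method_id"] = "credit_card_77"

print(validate_cat_task(evaluate, solution, [failure], env))  # → True
```

Because the solution and failure cases serve as unit tests for the verification function, any task whose instruction, checker, and examples disagree is rejected automatically, with no human review.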
Original Abstract
Large language models are quickly becoming the foundation for intelligent agents that are capable of using tools. However, training such agents is challenging because it requires human creation and annotation of a diverse set of tasks, tools, and evaluation criteria. In this paper, we propose the Self-Challenging framework for training an agent on high-quality tasks that are generated by itself. The agent first plays the role of challenger and generates a task after interacting with the given tools. The tasks take the form of a novel general class of problems termed Code-as-Task, which are defined by an instruction, a verification function and solution and failure cases which serve as tests, allowing to filter only for high-quality tasks. The agent then takes an executor role and trains on those tasks with reinforcement learning using the evaluation feedback as a reward. Evaluation on two existing multi-turn tool-use agent benchmarks, M3ToolEval and TauBench, shows the Self-Challenging framework achieves over a two-fold improvement in Llama-3.1-8B-Instruct, despite using only self-generated training data.