Self-Challenging Language Model Agents
TL;DR Highlight
No human training data needed — an LLM agent creates its own problems, solves them, and doubles its tool-use success rate through self-improvement.
Who Should Read
ML engineers looking to train tool-use agents with RL, or AI researchers wanting to boost small model agent performance without labeling costs.
Core Mechanics
- Agent plays 'challenger' role to explore the environment and generate training tasks, then 'executor' role to solve them with RL — fully automated with no human involvement
- Code-as-Task (CaT): tasks defined as instruction + verification function + example solution + failure cases — quality auto-filtered by code executor
- Self-improvement: Llama-3.1-8B average Pass@1 doubles from 12.0% to 23.5%
- Also works for distillation: large model generates tasks, small model learns from them
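The challenger/executor cycle above can be sketched as a toy loop. This is an illustrative stand-in, not the paper's implementation: the real framework uses an LLM in both roles and RL updates, whereas here the "policy" is a lookup table and tasks are trivial, purely to show how the reward from the verification function drives learning.

```python
import random

# Toy self-challenging loop (illustrative names, not the paper's API).

def challenger_propose(env_keys):
    """Challenger role: after exploring the environment, emit a task as
    (instruction, verification function)."""
    target = random.choice(env_keys)
    return (f"set {target} to 1", lambda state: state.get(target) == 1)

def executor_solve(instruction, policy):
    """Executor role: produce an action; here the policy is a lookup table."""
    return policy.get(instruction, None)

def rl_round(policy, env_keys, n_tasks=20):
    """One round: generate tasks, attempt them, and reinforce only the
    behavior that earns reward (a crude stand-in for rejection fine-tuning)."""
    for _ in range(n_tasks):
        instruction, verify = challenger_propose(env_keys)
        action = executor_solve(instruction, policy) or random.choice(env_keys)
        state = {action: 1}                  # execute the chosen action
        if verify(state):                    # evaluate() acts as a binary reward
            policy[instruction] = action     # keep only rewarded behavior
    return policy

policy = rl_round({}, ["a", "b", "c"], n_tasks=50)
```

The key structural point is that no human supplies tasks or rewards: the challenger generates both the instruction and the verification function, and the executor improves against them.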
Evidence
- Self-improvement: Llama-3.1-8B average Pass@1 12.0% → 23.5% (roughly 2x), outperforming PAE by +10.6pp
- Distillation: +20.2pp absolute Pass@1 improvement without any human tasks (4-environment average)
- PPO on Calculation environment: 43.2% Pass@1 (vs. 20.3% zero-shot, more than 2x improvement)
How to Apply
- If you want RL for a new tool-use environment but have no training tasks: deploy the agent in challenger mode with the API docs, auto-generate tasks in CaT format (instruction + verification function + example solution + failure cases) → filter with a code executor → run executor RL training.
- For distilling large models to small ones: let the large model generate CaT tasks, then train the small model with rejection fine-tuning or PPO on those tasks — no human labeling needed.
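The rejection fine-tuning step in the distillation recipe can be sketched as a data filter: sample attempts from the student and keep only those the task's verification function accepts. All names here (`build_rft_dataset`, `sample_trajectory`, the task dict layout) are illustrative assumptions, not the paper's code.

```python
# Rejection fine-tuning data selection sketch (hypothetical API).

def build_rft_dataset(tasks, sample_trajectory, k=4):
    """For each (teacher-generated) CaT task, sample k student attempts and
    keep only those that the task's evaluate() function accepts."""
    dataset = []
    for task in tasks:
        for _ in range(k):
            traj = sample_trajectory(task["instruction"])
            if task["evaluate"](traj):       # binary reward from evaluate()
                dataset.append((task["instruction"], traj))
    return dataset

# Toy usage: the "student" just echoes the instruction's last word
tasks = [{"instruction": "return item X", "evaluate": lambda t: t == "X"}]
data = build_rft_dataset(tasks, lambda ins: ins.split()[-1], k=2)
```

The surviving (instruction, trajectory) pairs are then used as supervised fine-tuning data for the small model; PPO instead uses the same binary signal as an online reward.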
Code Example
# Code-as-Task (CaT) structure example
# After the challenger agent explores the environment, it outputs tasks in the following format
"""
<instruction>
Your name is Olivia Nguyen and your email is olivia4794@example.com.
For #W112, return the Skateboard via paypal_77.
</instruction>
<evaluation_function>
def evaluate():
    success = True
    order = get_order_details("#W112")
    success = success and order["return_items"][0] == "6843647669"
    success = success and order["return_payment_method_id"] == "paypal_77"
    return success
</evaluation_function>
<solution>
return_delivered_order_items(
    order_id="#W112",
    item_ids=["6843647669"],
    payment_method_id="paypal_77"
)
</solution>
<failure_case>
# Wrong payment method
return_delivered_order_items(
    order_id="#W112",
    item_ids=["6843647669"],
    payment_method_id="credit_card_77"  # Incorrect value
)
</failure_case>
"""
# Automatic filtering conditions (all must pass for a valid task)
# 1) evaluation_function code is executable
# 2) example solution → evaluate() == True
# 3) all failure_cases → evaluate() == False
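The three filtering conditions can be made concrete in a small runnable sketch. This is an illustrative toy, not the paper's implementation: the environment is a plain dict, actions are Python functions that mutate it, and `validate_cat_task` is a hypothetical name.

```python
import copy

# Minimal CaT filter sketch: a task is kept only if its example solution
# passes the verification function and every failure case fails it.

def validate_cat_task(evaluate, solution, failure_cases, initial_env):
    def outcome(action):
        env = copy.deepcopy(initial_env)   # fresh environment per rollout
        try:
            action(env)                    # run the candidate tool calls
            return bool(evaluate(env))     # apply the verification function
        except Exception:
            return False                   # 1) non-executable code never passes
    if not outcome(solution):              # 2) example solution must succeed
        return False
    return all(not outcome(f) for f in failure_cases)  # 3) failures must fail

# Toy environment mirroring the retail example above
env = {"#W112": {"return_items": [], "return_payment_method_id": None}}

def evaluate(env):
    order = env["#W112"]
    return (order["return_items"][:1] == ["6843647669"]
            and order["return_payment_method_id"] == "paypal_77")

def solution(env):
    env["#W112"]["return_items"].append("6843647669")
    env["#W112"]["return_payment_method_id"] = "paypal_77"

def failure(env):  # wrong payment method
    env["#W112"]["return_items"].append("6843647669")
    env["#W112"]["return_payment_method_id"] = "credit_card_77"

print(validate_cat_task(evaluate, solution, [failure], env))  # → True
```

Because the solution and failure cases serve as unit tests for the verification function, any task whose instruction, checker, and examples disagree is rejected automatically, with no human review.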
Original Abstract
Large language models are quickly becoming the foundation for intelligent agents that are capable of using tools. However, training such agents is challenging because it requires human creation and annotation of a diverse set of tasks, tools, and evaluation criteria. In this paper, we propose the Self-Challenging framework for training an agent on high-quality tasks that are generated by itself. The agent first plays the role of challenger and generates a task after interacting with the given tools. The tasks take the form of a novel general class of problems termed Code-as-Task, which are defined by an instruction, a verification function and solution and failure cases which serve as tests, allowing to filter only for high-quality tasks. The agent then takes an executor role and trains on those tasks with reinforcement learning using the evaluation feedback as a reward. Evaluation on two existing multi-turn tool-use agent benchmarks, M3ToolEval and TauBench, shows the Self-Challenging framework achieves over a two-fold improvement in Llama-3.1-8B-Instruct, despite using only self-generated training data.