Self-Challenging: A Framework in Which LLM Agents Improve Themselves by Creating Their Own Training Tasks

Self-Challenging Language Model Agents

Jun 2, 2025 • Yifei Zhou, Sergey Levine, J. Weston +2

TL;DR Highlight

A self-improvement method in which an LLM agent writes its own problems and then solves them, doubling its tool-use success rate without any human-written training data.

Who Should Read

ML engineers who want to train tool-use agents with RL, and AI researchers who want to raise a small model's agent performance without labeling costs.

Core Mechanics

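In the 'challenger' role the agent explores the environment and generates training tasks; in the 'executor' role it solves those tasks and is trained with RL. The whole process runs without human intervention.
Code-as-Task (CaT): each task is defined by four elements (instruction, verification function, example solution, failure cases), so a code executor can filter task quality automatically. This almost entirely eliminates false positives (spurious rewards).
Unlike PAE (the prior SOTA), the challenger explores the environment before writing a task. It therefore also works in partially observable environments (Retail, Airline, etc.), where PAE, which derives tasks from the initial observation alone, breaks down.
Self-improvement setting: Llama-3.1-8B is trained with RL only on tasks it generated itself, which doubles its average success rate.
Distillation setting: the 8B challenger generates tasks, a 70B model generates trajectories, and the 8B model is fine-tuned on them with SFT. The average absolute gain is 20.2 percentage points with no human-written tasks.
Scaling the number of tasks improves out-of-distribution performance far more than scaling the number of rollouts per task. Diversity is the key.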
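The challenger/executor loop described above can be sketched as a single function. This is a minimal illustration and not the authors' implementation: every name here (`self_challenging_round`, `propose_task`, `is_valid`, `rollout`, `reward`, `update`) is hypothetical, and each role is passed in as a plain callable so the control flow is visible without any model or RL library.

```python
import itertools
from typing import Callable, List, Tuple, TypeVar

Task = TypeVar("Task")
Trajectory = TypeVar("Trajectory")


def self_challenging_round(
    propose_task: Callable[[], Task],              # challenger role (hypothetical)
    is_valid: Callable[[Task], bool],              # Code-as-Task filter (hypothetical)
    rollout: Callable[[Task], Trajectory],         # executor role (hypothetical)
    reward: Callable[[Task, Trajectory], float],   # runs the task's verification function
    update: Callable[[List[Tuple[Task, Trajectory, float]]], None],  # RL step
    num_proposals: int,
    rollouts_per_task: int,
) -> int:
    """Run one round and return how many proposed tasks survived the filter."""
    # 1) Challenger proposes tasks; only tasks that pass the filter are kept.
    tasks = [t for t in (propose_task() for _ in range(num_proposals)) if is_valid(t)]
    # 2) Executor attempts every kept task; the verifier output is the reward.
    batch = []
    for task in tasks:
        for _ in range(rollouts_per_task):
            trajectory = rollout(task)
            batch.append((task, trajectory, reward(task, trajectory)))
    # 3) The policy is updated on the collected (task, trajectory, reward) triples.
    update(batch)
    return len(tasks)


# Toy usage: tasks are integers, "valid" tasks are even, the executor echoes the task.
counter = itertools.count()
seen = []
kept = self_challenging_round(
    propose_task=lambda: next(counter),
    is_valid=lambda t: t % 2 == 0,
    rollout=lambda t: t,
    reward=lambda t, traj: 1.0 if traj == t else 0.0,
    update=seen.extend,
    num_proposals=6,
    rollouts_per_task=2,
)
print(kept, len(seen))  # 3 6
```

Raising `num_proposals` rather than `rollouts_per_task` is the knob that corresponds to the task-diversity observation in the last point above.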

Evidence

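Self-improvement: Llama-3.1-8B average Pass@1 rises from 12.0% to 23.5% (roughly 2x), which is 10.6 percentage points above PAE.
Distillation: average Pass@1 improves by 20.2 percentage points (absolute) with no human-written tasks, averaged over 4 environments.
With PPO, the Calculation environment reaches 43.2% Pass@1, more than double the zero-shot 20.3%.
Only 5.2% of generated tasks pass the full CaT filter, but in return the false-positive rate converges to 0% (PAE tasks contain many false positives).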
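The "roughly 2x" and "more than double" claims above can be checked directly from the quoted numbers. The helper name `summarize` is only for illustration.

```python
def summarize(before: float, after: float):
    """Return (absolute gain in percentage points, relative improvement factor)."""
    return round(after - before, 1), round(after / before, 2)


print(summarize(12.0, 23.5))  # self-improvement, average Pass@1 -> (11.5, 1.96)
print(summarize(20.3, 43.2))  # PPO on Calculation vs. zero-shot -> (22.9, 2.13)
```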

How to Apply

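If you want to bring RL to a new tool-use environment but have no training tasks: let the agent explore the environment in challenger mode with the API documentation, have it generate tasks in the four-element CaT format (instruction, verification function, example solution, failure cases), filter them with a code executor, and then train the executor with RL.
If you want to distill a large model into a small one but have no domain tasks: generate CaT tasks with the small model, have the large model produce trajectories, and run SFT on them. Including failed trajectories makes this more effective.
If you doubt the quality of your own task-generation pipeline: do not generate instructions alone as PAE does. Write the verification function as code and validate it automatically against a solution and failure cases, as in CaT, so that noisy training data is filtered out.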

Code Example

# Example of the Code-as-Task (CaT) structure
# After exploring the environment, the challenger agent emits a task in the format below

"""
<instruction>
Your name is Olivia Nguyen and your email is olivia4794@example.com.
For #W112, return the Skateboard via paypal_77.
</instruction>

<evaluation_function>
def evaluate():
    success = True
    order = get_order_details("#W112")
    success = success and order["return_items"][0] == "6843647669"
    success = success and order["return_payment_method_id"] == "paypal_77"
    return success
</evaluation_function>

<solution>
return_delivered_order_items(
    order_id="#W112",
    item_ids=["6843647669"],
    payment_method_id="paypal_77"
)
</solution>

<failure_case>
# Wrong payment method
return_delivered_order_items(
    order_id="#W112",
    item_ids=["6843647669"],
    payment_method_id="credit_card_77"  # wrong value
)
</failure_case>
"""

# Automatic filtering conditions (a task is valid only if all of them hold)
# 1) the evaluation_function code executes
# 2) example solution → evaluate() == True
# 3) every failure_case → evaluate() == False
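The three filtering conditions can be turned into a runnable check. The sketch below makes several assumptions: `RetailEnv` is a hypothetical in-memory stand-in for the real environment, `CodeAsTask` and `passes_filter` are illustrative names, and the task parts are Python callables rather than generated code strings run in a sandbox. Only the tool names `get_order_details` and `return_delivered_order_items` and the IDs come from the example above.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical initial database: one delivered order with nothing returned yet.
INITIAL_ORDERS = {"#W112": {"return_items": [], "return_payment_method_id": None}}


class RetailEnv:
    """Tiny stand-in for the retail environment; each instance starts fresh."""

    def __init__(self):
        self.orders = copy.deepcopy(INITIAL_ORDERS)

    def get_order_details(self, order_id):
        return self.orders[order_id]

    def return_delivered_order_items(self, order_id, item_ids, payment_method_id):
        order = self.orders[order_id]
        order["return_items"] = list(item_ids)
        order["return_payment_method_id"] = payment_method_id


@dataclass
class CodeAsTask:
    instruction: str
    evaluate: Callable[[RetailEnv], bool]
    solution: Callable[[RetailEnv], None]
    failure_cases: List[Callable[[RetailEnv], None]] = field(default_factory=list)


def passes_filter(task: CodeAsTask) -> bool:
    """Apply the three conditions; any exception counts as 'not executable'."""
    try:
        env = RetailEnv()
        task.solution(env)
        if not task.evaluate(env):        # condition 2: solution must pass
            return False
        for failure in task.failure_cases:
            env = RetailEnv()             # every case runs on a fresh environment
            failure(env)
            if task.evaluate(env):        # condition 3: failure case must fail
                return False
    except Exception:                     # condition 1: code must execute
        return False
    return True


def evaluate(env):
    order = env.get_order_details("#W112")
    return (order["return_items"] == ["6843647669"]
            and order["return_payment_method_id"] == "paypal_77")


good_task = CodeAsTask(
    instruction="For #W112, return the Skateboard via paypal_77.",
    evaluate=evaluate,
    solution=lambda env: env.return_delivered_order_items(
        "#W112", ["6843647669"], "paypal_77"),
    failure_cases=[lambda env: env.return_delivered_order_items(
        "#W112", ["6843647669"], "credit_card_77")],
)
print(passes_filter(good_task))  # True
```

A task whose verification function accepts everything is rejected by condition 3, which is how the failure cases guard against false positives.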

Terminology

RL (Reinforcement Learning): a learning method in which an agent learns by trial and error to maximize reward, much like learning to raise your score in a game by dying and retrying.

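SFT (Supervised Fine-Tuning): a learning method that shows the model reference answers and has it imitate them, similar to studying worked examples at school and then solving problems the same way.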

Rejection Fine-Tuning: SFT on only the successful trajectories selected from multiple attempts. Failed attempts are discarded and only the good ones are used for training.
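The selection step of rejection fine-tuning is simple enough to show directly. The record layout (`task_id`, `messages`, `reward`) and the function name `rejection_filter` are assumptions made for this sketch; any format that carries a per-trajectory reward works the same way.

```python
from typing import Dict, List


def rejection_filter(trajectories: List[Dict], threshold: float = 1.0) -> List[Dict]:
    """Keep only trajectories whose verifier reward reaches the threshold."""
    return [t for t in trajectories if t["reward"] >= threshold]


# Hypothetical sampled attempts: two tasks, two attempts each.
attempts = [
    {"task_id": "a", "messages": ["..."], "reward": 1.0},
    {"task_id": "a", "messages": ["..."], "reward": 0.0},
    {"task_id": "b", "messages": ["..."], "reward": 0.0},
    {"task_id": "b", "messages": ["..."], "reward": 1.0},
]
sft_data = rejection_filter(attempts)
print(len(sft_data))  # 2
```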

PPO (Proximal Policy Optimization): a stable RL algorithm that limits how much the model can change in each update so that it does not shift too abruptly.
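The "limit the change" idea is PPO's clipped surrogate objective. Below is a per-sample sketch in plain Python: `ratio` is the new-policy probability divided by the old-policy probability for the sampled action, and `clip_eps = 0.2` is a commonly used default, not a value taken from the paper.

```python
def ppo_clipped_objective(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# A good action (A > 0) gets no extra credit once the ratio exceeds 1 + eps.
print(ppo_clipped_objective(1.5, 1.0))
# A bad action (A < 0) is still penalized in full when the ratio grows.
print(ppo_clipped_objective(1.5, -1.0))
```

Taking the minimum makes the objective pessimistic: clipping only ever removes incentive to move further from the old policy, never adds it.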

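GRPO (Group Relative Policy Optimization): a simplified variant of PPO, proposed in DeepSeekMath. It needs lighter infrastructure but is sensitive to hyperparameters and can be unstable.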
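Much of GRPO's simplification comes from replacing a learned value function with a group baseline: several rollouts are sampled for the same task and each reward is normalized against the group. The sketch below shows that normalization only. Using the population standard deviation and an `eps` of 1e-8 are choices made here for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and std of its own rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of one task with binary verifier rewards: one success, three failures.
print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))
# If every rollout gets the same reward, all advantages are zero and the task
# contributes no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```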

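Pass@k: the probability that at least one of k attempts succeeds. Pass@1 is the success rate of a single attempt; Pass@4 is the rate at which at least one of 4 attempts succeeds.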
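When n attempts are sampled per task and c of them succeed, Pass@k is commonly estimated with the unbiased estimator 1 - C(n-c, k) / C(n, k), introduced alongside the HumanEval benchmark. Whether the paper computes Pass@k this way is not stated in this summary, so treat the code as a general-purpose sketch.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n sampled attempts with c successes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=4, c=1, k=1))  # 0.25
print(pass_at_k(n=4, c=1, k=4))  # 1.0
```

For k = 1 the estimator reduces to c / n, the plain success rate.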

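False Positive / False Negative: FP means the attempt actually failed but the verifier judged it a success (a spurious reward). FN means the attempt actually succeeded, or the task itself is infeasible, yet it was judged a failure (an undeserved penalty).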

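Distillation: a training method that transfers the knowledge of a large model (the teacher) to a small model (the student). Here the cheap 8B model is trained to imitate the behavior of the expensive 70B model.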

Original Abstract

Large language models are quickly becoming the foundation for intelligent agents that are capable of using tools. However, training such agents is challenging because it requires human creation and annotation of a diverse set of tasks, tools, and evaluation criteria. In this paper, we propose the Self-Challenging framework for training an agent on high-quality tasks that are generated by itself. The agent first plays the role of challenger and generates a task after interacting with the given tools. The tasks take the form of a novel general class of problems termed Code-as-Task, which are defined by an instruction, a verification function and solution and failure cases which serve as tests, allowing to filter only for high-quality tasks. The agent then takes an executor role and trains on those tasks with reinforcement learning using the evaluation feedback as a reward. Evaluation on two existing multi-turn tool-use agent benchmarks, M3ToolEval and TauBench, shows the Self-Challenging framework achieves over a two-fold improvement in Llama-3.1-8B-Instruct, despite using only self-generated training data.