WebAnchor: Stabilizing Long-Horizon Web Reasoning by Anchoring the First Planning Step

WebAnchor: Anchoring Agent Planning to Stabilize Long-Horizon Web Reasoning

Jan 6, 2026 • Xinmiao Yu, Liwen Zhang, Xiaocheng Feng +4

TL;DR Highlight

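A framework built on the finding that a web-search AI agent's overall task success rises sharply when its first plan is good, and that optimizes this first plan with reinforcement learning.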

Who Should Read

ML engineers building LLM-based web-search agents or Deep Research pipelines, especially developers looking to improve plan quality in multi-step agents or weighing RL fine-tuning strategies.

Core Mechanics

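Identifies the 'plan anchor' phenomenon, in which the first planning step has a disproportionate effect on overall task success: when the first step is wrong, Pass@8 on BrowseComp falls by about 30.9% on average.
Standard GRPO (a reinforcement learning algorithm) spreads the reward uniformly across the whole trajectory, so it fails to optimize the first planning step properly.
Anchor-GRPO trains planning (Stage 1) and execution (Stage 2) separately. Stage 1 optimizes only the first step, using 'Plan Rubrics' (evaluation criteria) extracted from self-play experience.
Plan Rubrics score plan quality from 0 to 5 on each of six dimensions, including Goal Alignment, Subgoal Coverage, and Tool Appropriateness, and the scores serve as a dense reward (a fine-grained reward signal).
Stage 2 optimizes execution with a sparse reward that checks only whether the final answer matches, which keeps execution consistent with the plan.
Performance improves consistently as both model size (3B→30B) and context length (32k→64k) increase, indicating strong scalability.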
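The group-relative normalization that GRPO applies to rewards can be sketched as below. This is a minimal illustration rather than the paper's implementation: the function name, the use of the population standard deviation, the `eps` value, and all reward numbers are assumptions made for the example. The same normalization is shown once with rubric-style dense rewards (Stage 1) and once with 0/1 terminal rewards (Stage 2).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against the group sampled for the same query.

    Assumption: population standard deviation and eps=1e-6; real implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Stage 1 (hypothetical numbers): rubric totals out of 30 for four sampled plans.
plan_scores = [24, 18, 27, 15]
stage1_adv = group_relative_advantages([s / 30 for s in plan_scores])

# Stage 2 (hypothetical numbers): 0/1 final-answer correctness for four rollouts.
stage2_adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

print(stage1_adv)
print(stage2_adv)
```

With dense rubric scores, every plan in the group gets a distinct advantage, whereas the 0/1 reward can only separate correct rollouts from incorrect ones.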

Evidence

WebAnchor-30B reaches 76.4% Pass@1 on the GAIA benchmark, exceeding both OpenAI-o3 (70.5%) and WebSailor-32B (53.2%).
WebAnchor-30B reaches 46.0% Pass@1 on BrowseComp, an improvement over baseline GRPO (44.0%) and First-step GRPO (43.3%).
Average Pass@8 drop when the first step is wrong: BC-ZH 28.76%, BC-EN 30.89%, GAIA 23.63%.
Planner Dense Reward (rubric-based) reaches 46.0% Pass@1, versus 44.2% for Naive Plan Reward and 43.3% for 0-1 Terminal Reward; the structured dense reward comes out ahead by 1.8 to 2.7 points.

How to Apply

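When training a multi-step agent with RL, try masked credit assignment that lets gradients flow only through the first reasoning step and blocks them for all remaining steps.
Collect successful and failed agent trajectories, have an LLM generate a plan-quality rubric from them automatically, and use that rubric as the reward model; the same pipeline can be applied to RAG agents and code agents.
The two-stage RL structure (plan optimization, then execution optimization) is a directly usable reference for improving stability when fine-tuning long-horizon research agents such as OpenAI Deep Research.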
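The masked credit assignment idea from the first suggestion above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names, the per-token step ids, and the log-probabilities are all hypothetical, and a real training loop would apply the same mask to tensors inside an autograd framework so that masked tokens receive no gradient.

```python
def first_step_mask(step_ids):
    """Return 1.0 for tokens that belong to the first reasoning step (id 0), else 0.0."""
    return [1.0 if s == 0 else 0.0 for s in step_ids]


def masked_pg_loss(logprobs, advantage, mask):
    """Average -advantage * logprob over unmasked tokens only.

    Tokens with mask 0.0 contribute nothing, so they would receive no gradient.
    """
    kept = sum(mask)
    if kept == 0:
        return 0.0
    total = sum(-advantage * lp * m for lp, m in zip(logprobs, mask))
    return total / kept


# Hypothetical trajectory: three plan tokens (step 0) followed by execution tokens.
step_ids = [0, 0, 0, 1, 1, 2]
logprobs = [-0.5, -1.0, -1.5, -2.0, -0.1, -0.3]

mask = first_step_mask(step_ids)
loss = masked_pg_loss(logprobs, advantage=1.0, mask=mask)
print(mask, loss)
```

Changing the log-probabilities of the later steps leaves `loss` unchanged, which is the property the mask is meant to guarantee.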

Code Example

# Example Plan Rubrics evaluation prompt (based on Appendix A.2.2 of the paper)
prompt = """
You are tasked with evaluating the following plan for a web information seeking task.
Score each dimension from 0 to 5:

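- Goal Alignment: Does the plan focus on what the user actually wants?
- Subgoal Coverage: How thoroughly does it break the problem into subtasks?
- Tool Appropriateness: Does it choose the right sources and tools?
- Logical Ordering: Is the reasoning flow natural and efficient?
- Actionability: Are the instructions concrete enough to execute?
- Clarity and Conciseness: Is it easy to read and follow?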

Query: {query}
Plan: {plan}

Output JSON:
{{
  "Goal Alignment": {{"score": 0-5, "comment": "..."}},
  "Subgoal Coverage": {{"score": 0-5, "comment": "..."}},
  "Tool Appropriateness": {{"score": 0-5, "comment": "..."}},
  "Logical Ordering": {{"score": 0-5, "comment": "..."}},
  "Actionability": {{"score": 0-5, "comment": "..."}},
  "Clarity and Conciseness": {{"score": 0-5, "comment": "..."}},
  "total_score": 0-30,
  "overall_comment": "..."
}}
"""

# Usage example
from anthropic import Anthropic
client = Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": prompt.format(
            query="Who was the first person to climb Everest without oxygen?",
            plan="1. Search for Everest no-oxygen summit records. 2. Check Wikipedia and mountaineering databases. 3. Verify date and nationality."
        )
    }]
)
print(response.content[0].text)
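To use the judge's output as a reward, the JSON text has to be parsed and turned into a number. The sketch below is one possible way to do that, not the paper's procedure: the function name `rubric_reward`, the choice to normalize to [0, 1], and the sample JSON string are assumptions for illustration. It recomputes the total from the six per-dimension scores instead of trusting the model-reported `total_score`.

```python
import json

DIMENSIONS = [
    "Goal Alignment",
    "Subgoal Coverage",
    "Tool Appropriateness",
    "Logical Ordering",
    "Actionability",
    "Clarity and Conciseness",
]


def rubric_reward(raw_text, max_per_dim=5):
    """Parse the judge's JSON and return a reward in [0, 1], or None if unusable."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    total = 0
    for name in DIMENSIONS:
        entry = data.get(name)
        score = entry.get("score") if isinstance(entry, dict) else None
        # Reject missing, non-numeric, boolean, or out-of-range scores.
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            return None
        if not 0 <= score <= max_per_dim:
            return None
        total += score
    return total / (max_per_dim * len(DIMENSIONS))


# Hypothetical judge output: scores 5, 4, 4, 3, 4, 4 sum to 24 out of 30.
sample_scores = [5, 4, 4, 3, 4, 4]
sample = json.dumps(
    {name: {"score": s, "comment": ""} for name, s in zip(DIMENSIONS, sample_scores)}
)
reward = rubric_reward(sample)
print(reward)
```

Returning `None` for malformed output lets the caller decide whether to retry the judge call or drop the sample, rather than silently assigning a zero reward.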

Terminology

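GRPO: Group Relative Policy Optimization. A reinforcement learning technique that samples several answers at once and trains the model by comparing the better ones against the worse ones.

Long-Horizon: A long task that takes many steps to complete. 'What should I have for lunch today' is short-horizon; 'found a company and take it public' is long-horizon.

Plan Anchor: The phenomenon identified in this paper. The first planning step acts like an anchor that fixes the direction of the whole task; as with doing up the first button wrong, later steps cannot fully make up for it.

Sparse Reward: A scheme that ignores the intermediate process and rewards only the final outcome. In Go terms, no points for individual moves, just +1 for winning the game.

Dense Reward: A scheme that gives fine-grained feedback at every step. Intermediate plan quality is also converted into a score and used as a training signal.

ReAct: A blend of Reasoning + Acting. An agent pattern in which the LLM solves a problem by looping through Thought → Action → Observation.

Pass@1: The probability of getting the correct answer in a single attempt. Pass@3 is the probability that at least one of three attempts is correct.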
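Pass@k can be estimated from repeated samples as sketched below. This uses the commonly cited combinatorial estimator 1 - C(n-c, k) / C(n, k); the source does not say which estimator the paper uses, so treat the function and the sample counts as assumptions for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n sampled attempts of which c were correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that k attempts drawn without replacement are all incorrect.
    all_wrong = comb(n - c, k) / comb(n, k)
    return 1.0 - all_wrong


# Hypothetical counts: 8 attempts on one question, 2 of them correct.
p1 = pass_at_k(n=8, c=2, k=1)
p3 = pass_at_k(n=8, c=2, k=3)
print(p1, p3)
```

Averaging this value over all questions in a benchmark gives the benchmark-level score.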

Original Abstract

Large Language Model (LLM)-based agents have shown strong capabilities in web information seeking, with reinforcement learning (RL) becoming a key optimization paradigm. However, planning remains a bottleneck, as existing methods struggle with long-horizon strategies. Our analysis reveals a critical phenomenon, plan anchor, where the first reasoning step disproportionately impacts downstream behavior in long-horizon web reasoning tasks. Current RL algorithms fail to account for this by uniformly distributing rewards across the trajectory. To address this, we propose Anchor-GRPO, a two-stage RL framework that decouples planning and execution. In Stage 1, the agent optimizes its first-step planning using fine-grained rubrics derived from self-play experiences and human calibration. In Stage 2, execution is aligned with the initial plan through sparse rewards, ensuring stable and efficient tool usage. We evaluate Anchor-GRPO on four benchmarks: BrowseComp, BrowseComp-Zh, GAIA, and XBench-DeepSearch. Across models from 3B to 30B, Anchor-GRPO outperforms baseline GRPO and First-step GRPO, improving task success and tool efficiency. Notably, WebAnchor-30B achieves 46.0% pass@1 on BrowseComp and 76.4% on GAIA. Anchor-GRPO also demonstrates strong scalability, getting higher accuracy as model size and context length increase.