O-Researcher: An Open-Ended Deep Research Model Built with Multi-Agent Distillation and Agentic RL

O-Researcher: An Open Ended Deep Research Model via Multi-Agent Distillation and Agentic RL

Jan 7, 2026 • Yi Yao, He Zhu, Piaohong Wang +12

TL;DR Highlight

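A multi-agent workflow automatically generates high-quality training data, and RL refines the resulting model, yielding a deep research system built on an open-source model that outperforms GPT-5 and OpenAI O3.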

Who Should Read

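ML engineers who fine-tune open-source LLMs into research agents or design complex multi-step reasoning pipelines that include web search and crawling. Researchers interested in pipelines that automatically generate high-quality synthetic training data.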

Core Mechanics

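Decomposing a query into independent subtasks and processing them in parallel raises GPT-5's overall score from 42.92 to 49.60 compared with sequential execution, and Comprehensiveness from 40.59 to 49.61.
Starting from Qwen-2.5-72B-Instruct, two-stage training (SFT followed by GRPO, Group Relative Policy Optimization) alone reaches 48.48, surpassing GPT-5 (46.77) and OpenAI O3 (43.71).
From 5,000 seed queries, the multi-agent workflow generates candidates that pass through rule-based hard filtering → LLM-as-a-Judge semantic filtering → human spot checks, leaving 3,500+ high-quality SFT examples.
The RL reward combines quality (Rbase, weight 0.6), tool-use appropriateness (Rtool, 0.2), and format (Rformat, 0.2). It brings citation accuracy, which fell from 44.27% to 29.13% during the SFT stage, back up to 31.99%.
Extending the context length from 32k to 64k improves performance substantially, while going from 64k to 128k shows diminishing returns. This is a practical guideline for choosing training-data length.
Ten reasoning steps is the sweet spot: performance improves over five steps (48.80 → 49.61), and cost is lower than with 20 steps while the performance difference is negligible.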
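The filtering funnel described above (rule-based hard filter, then LLM-as-a-Judge, then human spot check) could be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `Candidate` type, the required-tag rule, the `judge` callable, and the 0.7 threshold are all hypothetical stand-ins, since the summary does not state the actual rules or cutoffs.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    query: str
    trace: str  # full serialized trajectory, tags included

# Assumption: a trace is well-formed if these tags open and close.
REQUIRED_TAGS = ("subtask_list", "think", "suggested_answer")

def passes_hard_rules(c: Candidate) -> bool:
    """Stage 1: cheap rule-based check on tag presence."""
    return all(f"<{t}>" in c.trace and f"</{t}>" in c.trace
               for t in REQUIRED_TAGS)

def filter_candidates(candidates: List[Candidate],
                      judge: Callable[[Candidate], float],
                      threshold: float = 0.7) -> List[Candidate]:
    """Stage 1 then stage 2: `judge` stands in for an LLM-as-a-Judge
    call that returns a quality score (hypothetical interface)."""
    kept = [c for c in candidates if passes_hard_rules(c)]
    return [c for c in kept if judge(c) >= threshold]

def spot_check_sample(kept: List[Candidate], k: int,
                      seed: int = 0) -> List[Candidate]:
    """Stage 3: draw a small random sample for human review."""
    rng = random.Random(seed)
    return rng.sample(kept, min(k, len(kept)))
```

Running the cheap rule check first keeps the number of judge calls down, which matters when the judge is itself a large model.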

Evidence

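O-Researcher-RL scores 48.48 on RACE, the state of the art among open-source deep research models, ahead of GPT-5 (46.77), OpenAI O3 (43.71), Tongyi-Deep Research (45.66), and MiroThinker (41.79).
On DeepResearchGym-Commercial-100, O-Researcher-72B scores Clarity 100.00 (a perfect score), Insight 99.3, and Citation Precision 51.45, the highest citation precision across all categories.
With the parallel execution workflow applied, GPT-5's overall score approaches the 48.88 level of Gemini-2.5-Pro Deep Research; without it the score is 42.92, a gap of more than 6 points.
Effective Citations rise from 8.96 for the base model (Qwen-2.5-72B-Instruct) to 26.01 for O-Researcher-RL, roughly a threefold increase, and overall RACE improves from 33.38 to 48.48 (+15.10 points).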

How to Apply

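For complex research queries, adopt the pattern "planner decomposes the query into subtasks → an independent agent runs a Think-Search-Observe loop on each subtask in parallel → summarizer merges the results". Compared with prompting a single LLM, this substantially improves Comprehensiveness and Insight.
When building agent training data, do not collect only the final answers. Serialize the full trace (<subtask_list> → <think> → <plan> → <web_search> → <observation> → <subtask_answer> → <suggested_answer>) with XML tags and use it as SFT data.
When designing the RL reward, use the "quality 60% + tool-use appropriateness 20% + format 20%" weighting as a reference point. Putting a lower bound (0 points for fewer than 2 calls) and an upper bound (-1 point for more than 8 calls) on the number of tool calls curbs both excessive and insufficient searching.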
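The reward weighting in the last item can be illustrated with a small sketch. Only the 0.6 / 0.2 / 0.2 weights and the two tool-call bounds come from the summary above. Everything else is an assumption made for illustration: the in-range tool reward of 1.0, the particular format check, the tag names counted as tool calls, and the idea that the quality score `r_base` arrives as a number in [0, 1] from an external judge.

```python
import re

W_BASE, W_TOOL, W_FORMAT = 0.6, 0.2, 0.2   # weights from the summary
MIN_CALLS, MAX_CALLS = 2, 8                # bounds from the summary

# Assumption: opening <web_search> / <crawl_page> tags mark tool calls.
TOOL_CALL = re.compile(r"<(web_search|crawl_page)>")

def tool_reward(trace: str) -> float:
    n = len(TOOL_CALL.findall(trace))
    if n < MIN_CALLS:
        return 0.0    # too little searching
    if n > MAX_CALLS:
        return -1.0   # excessive searching is penalized
    return 1.0        # assumption: full credit inside the band

def format_reward(trace: str) -> float:
    """Hypothetical format check: a <think> precedes the first tool
    call and a complete <suggested_answer> block exists."""
    first_think = trace.find("<think>")
    first_tool = TOOL_CALL.search(trace)
    think_first = first_think != -1 and (
        first_tool is None or first_think < first_tool.start())
    has_answer = ("<suggested_answer>" in trace
                  and "</suggested_answer>" in trace)
    return 1.0 if (think_first and has_answer) else 0.0

def total_reward(r_base: float, trace: str) -> float:
    """Weighted sum; r_base is the externally judged quality score."""
    return (W_BASE * r_base
            + W_TOOL * tool_reward(trace)
            + W_FORMAT * format_reward(trace))
```

Because the tool term can go negative, a trajectory that over-searches is scored below one that simply skips the tool component, which is what discourages runaway search loops.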

Code Example

# O-Researcher-style deep research prompt template

SYSTEM_PROMPT = """
You are a deep research assistant. Use the following tools to answer questions.

Available Tools:
- <web_search>query1 | query2&serp_num=10</web_search>
- <crawl_page>https://example.com</crawl_page>

Workflow:
1. Start with <subtask_list> to decompose the main query into orthogonal sub-problems
2. For each subtask, follow: <think> → <plan> → tool calls → <observation> → <subtask_answer>
3. After all subtasks, synthesize into <suggested_answer>

Rules:
- <think> must appear before any plan or tool call
- Minimum 5 tool invocations, maximum 8 per subtask
- Final answer must include Introduction, Body, Conclusion, References
- Every key fact must include a citation like [1]
"""

# Example trace structure
example_trace = """
<subtask_list>
1. Analyze the historical background of [topic]
2. Examine current state-of-the-art approaches
3. Compare performance metrics across methods
</subtask_list>

<subtask>
Analyze the historical background of [topic]
</subtask>
<think>
I need to first understand the foundational work. Let me search for seminal papers.
</think>
<plan>
1. Search for early papers on [topic]
2. Crawl key reference pages
3. Synthesize timeline
</plan>
<web_search>history of [topic] seminal papers | [topic] survey 2024&serp_num=10</web_search>
<observation>
[search results here]
</observation>
<think>
Based on results, I should dig deeper into [specific aspect].
</think>
<crawl_page>https://relevant-paper-url.com</crawl_page>
<observation>
[page content]
</observation>
<subtask_answer>
[Synthesized answer for this subtask with citations [1][2]]
</subtask_answer>

<suggested_answer>
## Introduction
...
## Body
...
## Conclusion
...
## References
[1]. https://url - Paper Title
</suggested_answer>
"""
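The three workflow steps in the prompt above (decompose, solve each subtask, synthesize) could be driven by a small orchestrator along these lines. This is a sketch under stated assumptions: `deep_research` and the three stub coroutines are hypothetical names, and the stubs stand in for real model and search calls that the summary does not specify.

```python
import asyncio
from typing import Awaitable, Callable, List

async def deep_research(
    query: str,
    plan: Callable[[str], Awaitable[List[str]]],
    solve_subtask: Callable[[str], Awaitable[str]],
    summarize: Callable[[str, List[str]], Awaitable[str]],
) -> str:
    """Planner -> parallel subtask agents -> summarizer."""
    subtasks = await plan(query)
    # Subtasks are independent, so they run concurrently;
    # asyncio.gather returns results in the order of its inputs.
    answers = await asyncio.gather(*(solve_subtask(s) for s in subtasks))
    return await summarize(query, list(answers))

# --- Stubs (assumptions): replace with real LLM / search calls. ---
async def stub_plan(query: str) -> List[str]:
    return [f"{query}: background", f"{query}: methods"]

async def stub_solve(subtask: str) -> str:
    await asyncio.sleep(0)  # placeholder for a Think-Search-Observe loop
    return f"<subtask_answer>{subtask}</subtask_answer>"

async def stub_summarize(query: str, answers: List[str]) -> str:
    body = "\n".join(answers)
    return f"<suggested_answer>\n{body}\n</suggested_answer>"

def run_demo(query: str) -> str:
    return asyncio.run(
        deep_research(query, stub_plan, stub_solve, stub_summarize))
```

Passing the planner, worker, and summarizer in as callables keeps the orchestration logic separate from any particular model or search backend.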

Terminology

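GRPO: Group Relative Policy Optimization. An RL method that samples several responses to the same prompt as a group and scores each one relative to the others. Each response's advantage is computed against the group's own reward statistics rather than an absolute value estimate, which helps keep training stable.

SFT: Supervised Fine-Tuning. Training by showing the model reference answers and having it imitate them, much like working through solved examples in school. In this paper, the full reasoning traces generated by the multi-agent workflow serve as the training data.

RLAIF: Reinforcement Learning from AI Feedback. An AI, rather than a person, judges which response is better, and that judgment becomes the RL training signal. It greatly reduces human annotation cost but is sensitive to the quality of the evaluating model.

Rejective Sampling: A data-cleaning approach that generates multiple candidate outputs, discards those that fall below the bar, and keeps only the good ones. In this paper it is implemented as three filtering stages: rule-based → LLM judge → human review.

LLM-as-a-Judge: An approach in which one LLM grades answers produced by another LLM. It replaces human evaluators for large-scale automated quality assessment; in this paper a Qwen3-based model acts as the judge.

Deep Research Agent (DRA): An AI agent that goes beyond simple Q&A and autonomously produces long research reports through web search, crawling, and multi-step reasoning. OpenAI Deep Research and Perplexity Deep Research are representative examples.

KL Divergence: A measure of how far the policy model has drifted from the reference model. Used as a penalty during RL training, it keeps the policy from straying too far and, like a seat belt, preserves training stability.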
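The GRPO and KL Divergence entries above can be made concrete with a short sketch. It follows the commonly published GRPO formulation (group-normalized advantages plus a per-token KL estimate); whether O-Researcher uses exactly these forms is not stated in this summary, and the function names, the use of the population standard deviation, and the `eps` value are assumptions.

```python
import math
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-8) -> List[float]:
    """Advantage of each sampled response relative to its own group:
    (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std
    return [(r - mu) / (sigma + eps) for r in rewards]

def kl_estimate(logp_policy: float, logp_ref: float) -> float:
    """Per-token KL estimate exp(d) - d - 1 with d = logp_ref - logp_policy.
    It is zero when the two models agree and positive otherwise."""
    d = logp_ref - logp_policy
    return math.exp(d) - d - 1.0
```

For example, rewards of `[1.0, 2.0, 3.0]` give a negative advantage to the first response, zero to the second, and a positive one to the third, regardless of the absolute reward scale.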

Original Abstract

The performance gap between closed-source and open-source large language models (LLMs) is largely attributed to disparities in access to high-quality training data. To bridge this gap, we introduce a novel framework for the automated synthesis of sophisticated, research-grade instructional data. Our approach centers on a multi-agent workflow where collaborative AI agents simulate complex tool-integrated reasoning to generate diverse and high-fidelity data end-to-end. Leveraging this synthesized data, we develop a two-stage training strategy that integrates supervised fine-tuning with a novel reinforcement learning method, designed to maximize model alignment and capability. Extensive experiments demonstrate that our framework empowers open-source models across multiple scales, enabling them to achieve new state-of-the-art performance on the major deep research benchmark. This work provides a scalable and effective pathway for advancing open-source LLMs without relying on proprietary data or models.