텍스트에서 Tool-Use Trajectory 합성하기: GEM 파이프라인

Unlocking Implicit Experience: Synthesizing Tool-Use Trajectories from Text

Jan 15, 2026•Zhihao Xu, Rumei Li, Jiahuan Li +4•View PDF

TL;DR Highlight

API 명세 없이 위키·블로그 같은 일반 텍스트에서 LLM 에이전트 학습용 멀티턴 tool-use 대화 데이터를 자동 생성하는 파이프라인.

Who Should Read

LLM 기반 에이전트를 파인튜닝하려는데 고품질 멀티턴 tool-use 학습 데이터 부족으로 막힌 ML 엔지니어나 AI 연구자. 특히 특정 도메인 에이전트를 구축하면서 데이터 수집 비용이 부담스러운 팀.

Core Mechanics

기존 방식은 API 세트를 미리 정의해야 했지만, 이 논문은 위키·블로그 같은 일반 텍스트에서 바로 멀티턴 tool-use 데이터를 추출하는 새로운 패러다임 제시
약 14%의 텍스트 세그먼트가 멀티스텝 워크플로우를 포함하고 있어, 대규모 텍스트 코퍼스가 충분한 데이터 소스로 활용 가능
GEM 파이프라인 4단계: 텍스트 필터링 → 워크플로우·툴 정의 추출 → GLM-4.6으로 대화 생성 → 복잡도 정제(Refinement)
Refinement 단계가 핵심: 평균 메시지 수 30→46, 툴 종류 5→8.6, 툴 호출 횟수 7.8→16.3으로 데이터 복잡도 대폭 향상
Qwen3-32B-GEM이 GPT-4.1(38.88%), DeepSeek-V3.2-Exp(37.38%)를 BFCL V3에서 44.88%로 초과 달성
GEM 파이프라인 자체를 Qwen3-8B에 증류한 Trajectory Synthesizer를 별도 학습해 저비용 대량 데이터 생성 가능

Evidence

Qwen3-32B-GEM BFCL V3 Overall 44.88% — GPT-4.1(38.88%), DeepSeek-V3.2(37.38%) 대비 아웃도메인 데이터만으로 독점 모델 초과
τ2-bench Retail Pass@4: Qwen3-32B-GEM 86.84% vs 인도메인 학습 데이터인 MUA 80.70% 초과
Refinement 유무 비교: Qwen3-32B Overall 32.50%(미적용) → 44.88%(적용), +12.38%p 향상
Trajectory Synthesizer(Qwen3-8B 기반)가 GLM-4.6 풀 파이프라인 대비 BFCL 28.38% vs 30.25%로 유사 품질 유지하며 비용 대폭 절감

How to Apply

WikiHow, Ultra-FineWeb 같은 공개 텍스트 코퍼스를 확보하고, 멀티스텝 절차를 담은 문서만 필터링(논문 기준 약 14% 해당)해 에이전트 학습 데이터 소스로 활용
강력한 모델(GPT-4o, Claude 등)로 4단계 파이프라인(필터링→툴 추출→대화 생성→Refinement)을 구현해 커스텀 도메인 tool-use SFT 데이터 생성 — 특히 Refinement 생략 시 성능이 12%p 이상 떨어지므로 필수
10K 수준의 고품질 데이터를 먼저 생성한 후, 소형 모델(8B)을 Trajectory Synthesizer로 SFT 학습시켜 이후 저비용 대량 생산에 활용

Code Example

snippet

# GEM 파이프라인 Stage 1: 멀티스텝 워크플로우 포함 여부 판단 프롬프트
prompt_filter = """
Determine whether the following text contains multi-step operations involving
the use of an APP, website, computer, or other machine.
If it contains, generate one sentence summary and identify:
- platform: operator / computer / phone / machine / other
- domain: computers_and_electronics / health / shopping / ...
- task_category: customer_support / developer_tools / databases / ...

Output:
<multi_step>False</multi_step>
or
<multi_step>True</multi_step>
<summary>...</summary>
<domain>...</domain>
<platform>...</platform>
<task>...</task>

Text: {text}
"""

# Stage 2: 워크플로우 & 툴 추출 (OpenAI schema 포맷)
prompt_tool_extract = """
You are a program design expert.
Given a workflow description, design functions to translate it into a program.

1. Extract all intermediate steps
2. Convert every step to a function and represent as execution graph
   e.g., (login)->(search_query)->(update_item)
3. Generate API tool definitions in OpenAI JSON schema format
   - Each tool: single, coherent capability
   - Parameters: self-explanatory names, explicit types
   - Include both read and write tools (get_*, update_*)

Workflow Description: {text}

Output format:
<workflow>
  <steps>Step1: ...\nStep2: ...</steps>
  <execution_graph>(api1)->(api2, api3)->...</execution_graph>
  <tools>[{"name": "api_name", "description": "", "inputSchema": {...}}]</tools>
</workflow>
"""

Terminology

SFT모범답안 데이터를 보여주고 따라 학습하게 하는 지도 학습 파인튜닝(Supervised Fine-Tuning). 학교에서 예제 풀이 보고 따라 푸는 것과 비슷.

TrajectoryAI 에이전트가 사용자와 주고받은 대화 + 툴 호출 + 툴 응답의 전체 흐름. 에이전트가 '어떤 순서로 무엇을 했는지'의 기록.

BFCLBerkeley Function Calling Leaderboard. LLM이 함수(API)를 얼마나 정확하게 호출하는지 평가하는 공개 벤치마크.

Multi-turn사용자와 AI가 여러 번 주고받는 대화. 한 번에 끝나는 게 아니라 맥락을 이어가며 여러 단계에 걸쳐 작업을 수행하는 방식.

OpenAI schema함수/툴을 JSON으로 정의하는 표준 포맷. 이름, 파라미터 타입, 설명을 명시해 LLM이 언제 어떻게 툴을 호출할지 이해하게 함.

τ-bench / τ2-bench항공사·쇼핑몰 같은 실제 도메인에서 사용자-에이전트 상호작용을 시뮬레이션해 에이전트 능력을 종합 평가하는 벤치마크.

Distillation큰 모델(teacher)의 출력 결과를 학습 데이터로 삼아 작은 모델(student)을 훈련시키는 기법. 큰 모델의 능력을 저비용으로 작은 모델에 이전.

Original Abstract (Expand)

Enabling Large Language Models (LLMs) to effectively utilize tools in multi-turn interactions is essential for building capable autonomous agents. However, acquiring diverse and realistic multi-turn tool-use data remains a significant challenge. In this work, we propose a novel text-based paradigm. We observe that textual corpora naturally contain rich, multi-step problem-solving experiences, which can serve as an untapped, scalable, and authentic data source for multi-turn tool-use tasks. Based on this insight, we introduce GEM, a data synthesis pipeline that enables the generation and extraction of multi-turn tool-use trajectories from text corpora through a four-stage process: relevance filtering, workflow&tool extraction, trajectory grounding, and complexity refinement. To reduce the computational cost, we further train a specialized Trajectory Synthesizer via supervised fine-tuning. This model distills the complex generation pipeline into an efficient, end-to-end trajectory generator. Experiments demonstrate that our GEM-32B achieve a 16.5% improvement on the BFCL V3 Multi-turn benchmark. Our models partially surpass the performance of models trained on {\tau} - bench (Airline and Retail) in-domain data, highlighting the superior generalization capability derived from our text-based synthesis paradigm. Notably, our Trajectory Synthesizer matches the quality of the full pipeline while significantly reducing inference latency and costs.