XSkill: Experience- and Skill-Based Continual Learning for Multimodal Agents

XSkill: Continual Learning from Experience and Skills in Multimodal Agents

Mar 12, 2026 • Guanyu Jiang, Zhaochen Su, Xiaoye Qu +2

TL;DR Highlight

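A training-free framework in which a multimodal agent keeps improving on its own, without parameter updates, by accumulating two kinds of knowledge: past experiences (action-level) and skills (task-level).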

Who Should Read

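AI engineers building multimodal agent systems who are asking how to reuse what the agent learned from earlier tasks. Especially relevant for developers who want a tool-use agent to learn from its failure patterns automatically.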

Core Mechanics

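Knowledge is stored in two separate streams, Experience (action-level) and Skill (task-level): experiences are kept as JSON, skills as Markdown documents.
A training-free continual learning loop extracts knowledge from past trajectories and reuses it on new tasks.
Cross-rollout critique compares multiple rollouts of the same task and automatically turns the causes of success and failure into experiences.
At inference time, the query is decomposed into 2-3 subtasks so experiences can be retrieved more precisely; the retrieved experiences are then rewritten to fit the current image context before being injected.
Cross-model transfer is confirmed: knowledge accumulated with Gemini-3-Flash can be transplanted as-is into GPT-5-mini and o4-mini.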
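Skills improve tool-execution stability, cutting the overall tool-execution error rate from 29.9% to 15.3% (syntax errors fall from 114 to 71), while experiences add flexibility in choosing a more suitable tool.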
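
The dual-stream storage described above could be sketched as follows. This is a minimal illustration, not the paper's implementation: the class names `Experience`, `Skill`, and `KnowledgeStore` and all field names are hypothetical; only the JSON/Markdown split comes from the summary.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class Experience:
    """Action-level guidance (hypothetical schema)."""
    id: str
    condition: str
    action: str


@dataclass
class Skill:
    """Task-level guidance kept as a Markdown document (hypothetical schema)."""
    name: str
    markdown: str


@dataclass
class KnowledgeStore:
    """Holds the two streams separately: a list of experiences and named skills."""
    experiences: list = field(default_factory=list)
    skills: dict = field(default_factory=dict)

    def add_experience(self, exp: Experience) -> None:
        self.experiences.append(exp)

    def add_skill(self, skill: Skill) -> None:
        # Overwrite on name collision so a skill document can be revised in place.
        self.skills[skill.name] = skill

    def experiences_json(self) -> str:
        # Experiences serialize to a JSON array of objects.
        return json.dumps([asdict(e) for e in self.experiences],
                          ensure_ascii=False, indent=2)

    def skill_markdown(self, name: str) -> str:
        # Skills are returned verbatim as Markdown text.
        return self.skills[name].markdown


store = KnowledgeStore()
store.add_experience(Experience(
    "E1", "When image is upside-down", "Rotate image 180 degrees before analysis"))
store.add_skill(Skill(
    "image-preprocessing",
    "# Image preprocessing\n\n1. Check orientation.\n2. Crop small regions before searching.\n"))

print(store.experiences_json())
print(store.skill_markdown("image-preprocessing"))
```

Keeping the two streams in different formats means short, retrievable rules stay machine-filterable while longer procedural guidance stays readable as a document.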

Evidence

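With Gemini-2.5-Pro, Average@4 improves by up to +6.71 points over the tool-only baseline; on TIR-Bench, XSkill beats the strongest baseline, Agent-KB, by +11.13 points.
On VisualToolBench, the overall tool-execution error rate drops from 29.9% (168 cases) to 15.3% (95 cases), and syntax errors fall from 114 to 71.
As the number of rollouts grows from 1 to 4, Average@4 rises from 25.35% to 30.49% and Pass@4 from 41.12% to 46.73%, a consistent improvement (Gemini-2.5-Pro).
In zero-shot cross-task transfer, XSkill leads every baseline by about 2-3 points on average and improves consistently on out-of-distribution tasks as well.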
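
The summary does not define Average@4 and Pass@4. The sketch below assumes the common reading: Average@k is the per-rollout success rate averaged over tasks, and Pass@k is the fraction of tasks solved by at least one of k rollouts. The function names and toy data are illustrative only.

```python
def average_at_k(results):
    """Mean success rate over each task's rollouts, averaged across tasks."""
    per_task = [sum(rollouts) / len(rollouts) for rollouts in results]
    return sum(per_task) / len(per_task)


def pass_at_k(results):
    """Fraction of tasks where at least one rollout succeeded."""
    return sum(1 for rollouts in results if any(rollouts)) / len(results)


# Toy data: 3 tasks x 4 rollouts each, True = correct final answer.
results = [
    [True, False, False, True],
    [False, False, False, False],
    [True, True, True, False],
]

print(f"Average@4 = {average_at_k(results):.4f}")
print(f"Pass@4    = {pass_at_k(results):.4f}")
```

Under these definitions Pass@k can never be lower than Average@k, which matches the ordering of the reported numbers.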

How to Apply

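Collect N trajectories per task (4 recommended), then add a pipeline that compares successful and failed rollouts and automatically extracts experiences in 'condition → action' form (64 words or fewer) as JSON.
Organize frequently used tool patterns into Markdown-based Skill documents. At inference time, decompose the query into subtasks, retrieve the top-k experiences by cosine similarity, rewrite them to fit the current image context, and inject them into the system prompt; this can be applied right away.
Accumulate knowledge with one model (e.g., Gemini-3-Flash) and transplant it into cheaper models (GPT-5-mini, o4-mini) to raise performance without relying on an expensive model.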
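
The extraction and transfer steps above might look like the following sketch. It makes several assumptions: the prompt wording, the function names, and the rollout fields (`success`, `summary`) are hypothetical; the critique model call is replaced by a hard-coded reply; and the 64-word budget is applied to condition and action combined.

```python
import json
import os
import tempfile

MAX_WORDS = 64  # word budget per experience, as stated in the summary


def build_critique_prompt(task, rollouts):
    """Assemble a cross-rollout critique prompt (hypothetical wording)."""
    lines = [
        f"Task: {task}",
        "Compare the rollouts below and explain why some succeeded and others failed.",
    ]
    for i, r in enumerate(rollouts, 1):
        status = "SUCCESS" if r["success"] else "FAILURE"
        lines.append(f"Rollout {i} [{status}]: {r['summary']}")
    lines.append(
        f'Return a JSON array of {{"condition": ..., "action": ...}} objects, '
        f"each at most {MAX_WORDS} words."
    )
    return "\n".join(lines)


def is_valid_experience(exp):
    """Keep only non-empty condition/action pairs within the word budget."""
    if not exp.get("condition") or not exp.get("action"):
        return False
    n_words = len(f"{exp['condition']} {exp['action']}".split())
    return n_words <= MAX_WORDS


def save_bank(bank, path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(bank, f, ensure_ascii=False, indent=2)


def load_bank(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)


rollouts = [
    {"success": True, "summary": "Rotated the image 180 degrees, then read the label."},
    {"success": False, "summary": "Ran OCR on the upside-down image and got garbage."},
]
prompt = build_critique_prompt("Read the label in the picture", rollouts)

# Stand-in for the critique model's reply; a real pipeline would send `prompt` to an MLLM.
candidates = [
    {"condition": "When the image is upside-down",
     "action": "Rotate it 180 degrees before running OCR"},
    {"condition": "", "action": "Try harder"},  # rejected: empty condition
]
bank = [c for c in candidates if is_valid_experience(c)]

# The bank is plain JSON, so an agent built on a different backbone can load the same file.
path = os.path.join(tempfile.mkdtemp(), "experience_bank.json")
save_bank(bank, path)
reloaded = load_bank(path)
print(reloaded)
```

Because the bank is model-agnostic text, transfer to another backbone is just a matter of loading the same file into that agent's retrieval step.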

Code Example

# Example: retrieving previously extracted experiences (sketch of the core logic)
import json
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(text: str) -> list[float]:
    res = client.embeddings.create(model="text-embedding-3-small", input=text)
    return res.data[0].embedding

def cosine_sim(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Experience Bank (JSON)
experience_bank = [
    {"id": "E1", "condition": "When image is upside-down",
     "action": "Rotate image before analysis using img.rotate(180)",
     "embedding": embed("When image is upside-down rotate image before analysis")},
    {"id": "E2", "condition": "When object is too small to identify",
     "action": "Crop and zoom in with code interpreter before searching",
     "embedding": embed("When object is too small crop and zoom in")},
]

def retrieve_experiences(query: str, top_k: int = 3, threshold: float = 0.0):
    q_emb = embed(query)
    scored = [(e, cosine_sim(q_emb, e["embedding"])) for e in experience_bank]
    scored = [(e, s) for e, s in scored if s > threshold]
    scored.sort(key=lambda x: -x[1])
    return [e for e, _ in scored[:top_k]]

def decompose_task(task_description: str, image_context: str) -> list[str]:
    """Decompose the task into 2-3 subtask queries."""
    prompt = f"""Decompose this visual task into 2-3 abstract subtask queries for experience retrieval.
Task: {task_description}
Image context: {image_context}
Output JSON array of query strings only."""
    res = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3
    )
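    raw = res.choices[0].message.content
    # The model may wrap the array in prose or a Markdown fence; keep only the JSON array.
    return json.loads(raw[raw.index("["): raw.rindex("]") + 1])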

# Usage example
task = "What is the prototype of the two mascots in the corner of the picture?"
image_ctx = "Image appears to be rotated 180 degrees, small mascots in corner"

subtask_queries = decompose_task(task, image_ctx)
all_experiences = []
for q in subtask_queries:
    exps = retrieve_experiences(q, top_k=3)
    all_experiences.extend(exps)

# Deduplicate by id, then inject into the system prompt
unique_exps = {e['id']: e for e in all_experiences}.values()
injection = "\n".join([f"[{e['id']}] {e['condition']}: {e['action']}" for e in unique_exps])
print("Injected experiences:\n", injection)
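
The sketch above stops at retrieval. The step of rewriting retrieved experiences to fit the current image context could be approached as below. The prompt wording and the function names `build_rewrite_prompt` and `build_system_prompt` are hypothetical, and the model reply is hard-coded so the example runs offline.

```python
def build_rewrite_prompt(experiences, image_context):
    """Ask a model to adapt generic experiences to the current image (hypothetical wording)."""
    listed = "\n".join(
        f"[{e['id']}] {e['condition']}: {e['action']}" for e in experiences
    )
    return (
        "Rewrite each experience so it applies to the current image. "
        "Drop any that are irrelevant. Keep the [id] prefix.\n"
        f"Image context: {image_context}\n"
        f"Experiences:\n{listed}"
    )


def build_system_prompt(base_prompt, adapted_experiences):
    """Append adapted experiences to the agent's system prompt."""
    if not adapted_experiences:
        return base_prompt
    bullet_list = "\n".join(f"- {line}" for line in adapted_experiences)
    return f"{base_prompt}\n\nRelevant past experiences:\n{bullet_list}"


retrieved = [
    {"id": "E1", "condition": "When image is upside-down",
     "action": "Rotate image before analysis using img.rotate(180)"},
]
rewrite_prompt = build_rewrite_prompt(retrieved, "Image appears to be rotated 180 degrees")

# Stand-in for the model's reply to `rewrite_prompt`.
adapted = [
    "[E1] The picture is rotated 180 degrees: call img.rotate(180) before inspecting the mascots."
]
system_prompt = build_system_prompt("You are a tool-using visual agent.", adapted)
print(system_prompt)
```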

Terminology

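Continual Learning: A learning approach in which an agent gets better as it encounters more tasks, much like a person accumulating experience to avoid repeating mistakes.

Trajectory: The full record of the action-result sequence an agent goes through while performing a task. Like a recording of someone solving a problem, it captures which tools were used and what they returned.

Rollout: One end-to-end execution of a task by the agent. Running several rollouts of the same task yields a mix of successes and failures that can be compared.

MLLM: Multimodal Large Language Model. A large language model that understands and processes images as well as text; models such as GPT-4o and Gemini fall into this category.

POMDP: Partially Observable Markov Decision Process. A mathematical model of decision making when the agent cannot observe the full state of the environment. It captures real-world situations such as having to infer the whole context from a single image.

Cosine Similarity: A measure of how closely two vectors (here, text embeddings) point in the same direction. It ranges from -1 to 1: values near 1 indicate nearly identical meaning, and values near 0 indicate unrelated content.

Cross-Rollout Critique: The process of comparing several runs against each other to extract the common reasons why some succeeded and others failed. Similar to taking a test several times and then compiling a notebook of corrected mistakes.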
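
To make the Cosine Similarity entry concrete, here is a small dependency-free sketch with toy 2-D vectors standing in for embeddings. The function name and the vectors are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])        # unrelated directions
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])        # opposite directions

print(same_direction, orthogonal, opposite)
```

The three cases land at roughly 1, 0, and -1. Vector length does not matter, only direction, which is why the measure suits comparing embeddings of texts with different lengths.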

Original Abstract

Multimodal agents can now tackle complex reasoning tasks with diverse tools, yet they still suffer from inefficient tool use and inflexible orchestration in open-ended settings. A central challenge is enabling such agents to continually improve without parameter updates by learning from past trajectories. We identify two complementary forms of reusable knowledge essential for this goal: experiences, providing concise action-level guidance for tool selection and decision making, and skills, providing structured task-level guidance for planning and tool use. To this end, we propose XSkill, a dual-stream framework for continual learning from experience and skills in multimodal agents. XSkill grounds both knowledge extraction and retrieval in visual observations. During accumulation, XSkill distills and consolidates experiences and skills from multi-path rollouts via visually grounded summarization and cross-rollout critique. During inference, it retrieves and adapts this knowledge to the current visual context and feeds usage history back into accumulation to form a continual learning loop. Evaluated on five benchmarks across diverse domains with four backbone models, XSkill consistently and substantially outperforms both tool-only and learning-based baselines. Further analysis reveals that the two knowledge streams play complementary roles in influencing the reasoning behaviors of agents and show superior zero-shot generalization.