Memento-Skills: A System Where Agents Design Agents Without Updating LLM Parameters

Memento-Skills: Let Agents Design Agents

Mar 19, 2026 • Huichi Zhou, Siyuan Guo, Anjie Liu +14

TL;DR Highlight

A system in which an agent evolves on its own by accumulating executable 'Skill' files as external memory, without touching the LLM parameters.

Who Should Read

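ML engineers and agent-system developers who deploy LLM agents to production and are working on continual learning or per-task performance improvements. Especially useful for teams that want to keep improving agent capability without fine-tuning.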

Core Mechanics

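The agent improves through experience by using 'Skills' stored as markdown files as external memory, without modifying the LLM weights (parameters) at all.
Continual learning is implemented as a loop: Read (retrieve relevant Skills) → Act (execute) → Feedback (evaluate the result) → Write (update Skills).
Instead of retrieval based purely on semantic similarity (cosine similarity between embeddings), the system uses the Memento-Qwen router, which is trained on whether execution actually succeeded; its Recall@1 is 87.5% higher than BM25 in relative terms.
On failure, the LLM automatically identifies which Skill caused the failure (failure attribution), then either edits that Skill file directly or creates a new Skill.
After 3 rounds on the GAIA benchmark, overall accuracy improves by 26.2% in relative terms; on Humanity's Last Exam (HLE), the relative improvement is 116.2%.
Skill transfer within the same domain is the key: the more clearly defined the subject categories, as in HLE, the larger the cross-task transfer effect (Skills learned on Biology problems are reused on unseen Biology problems).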
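
The Read → Act → Feedback → Write loop above can be sketched roughly as follows. This is a minimal in-memory illustration, not the Memento-Skills implementation: `Skill`, `SkillLibrary`, and `run_episode` are hypothetical names, the word-overlap lookup is only a stand-in for the trained router, and `act`, `judge`, and `revise` are placeholders for the LLM calls that execute a task, evaluate the result, and rewrite a Skill.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Skill:
    name: str
    body: str  # stands in for the text of a SKILL.md file


@dataclass
class SkillLibrary:
    skills: Dict[str, Skill] = field(default_factory=dict)

    def read(self, task: str) -> Optional[Skill]:
        """Read: return the skill sharing the most words with the task.

        Placeholder for a learned router; returns None if nothing overlaps.
        """
        task_words = set(task.lower().split())
        best, best_overlap = None, 0
        for skill in self.skills.values():
            overlap = len(task_words & set(skill.body.lower().split()))
            if overlap > best_overlap:
                best, best_overlap = skill, overlap
        return best

    def write(self, skill: Skill) -> None:
        """Write: add a new skill or overwrite the one with the same name."""
        self.skills[skill.name] = skill


def run_episode(
    library: SkillLibrary,
    task: str,
    act: Callable[[str, Optional[Skill]], str],
    judge: Callable[[str, str], bool],
    revise: Callable[[str, Optional[Skill], str], Skill],
) -> bool:
    """One Read -> Act -> Feedback -> Write pass over a single task."""
    skill = library.read(task)        # Read
    output = act(task, skill)         # Act
    success = judge(task, output)     # Feedback
    if not success or skill is None:  # Write: fix a failing skill or add a missing one
        library.write(revise(task, skill, output))
    return success
```

In this sketch the library is only rewritten when a run fails or no Skill matched, so successful Skills are left untouched; the model weights never appear anywhere in the loop.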

Evidence

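GAIA test set: Memento-Skills 66.0% vs. Read-Write baseline 52.3%, a gap of 13.7 percentage points.
HLE test set: Memento-Skills 38.7% vs. Read-Write baseline 17.9%, more than a 2x difference.
Memento-Qwen router: Recall@1 of 0.60 (BM25 0.32, Qwen3-Embedding 0.54), an 11% relative improvement over the strongest baseline.
After training on HLE, the Skill library grew from 5 Skills (the initial atomic skills) to 235; after training on GAIA, it contained 41.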
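
The relative improvements quoted in this summary follow directly from the absolute scores listed above. As a quick check, the arithmetic can be reproduced as below; `relative_gain` is a helper name introduced here for illustration only.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


print(round(relative_gain(66.0, 52.3), 1))  # GAIA accuracy: 26.2
print(round(relative_gain(38.7, 17.9), 1))  # HLE accuracy: 116.2
print(round(relative_gain(0.60, 0.32), 1))  # router Recall@1 vs. BM25: 87.5
print(round(relative_gain(0.60, 0.54), 1))  # router Recall@1 vs. Qwen3-Embedding: 11.1
```

Note the difference between the two ways of reporting the GAIA result: 13.7 is the gap in percentage points, while 26.2% is that gap divided by the 52.3% baseline.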

How to Apply

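Attach a 'Skill folder' (SKILL.md plus helper scripts) to an existing agent as external memory, and add a Write loop in which the LLM automatically revises the relevant Skill after each success/failure signal from execution; agent performance then improves incrementally without fine-tuning.
For work with a clearly defined domain (e.g. customer support for a specific industry, code review, medical QA), clustering and operating the Skill library by domain can maximize the cross-task transfer effect.
Do not build the router on semantic embeddings alone. Fine-tuning it with an InfoNCE loss that uses actual execution success as the label greatly improves its ability to separate hard negatives, i.e. Skills that look similar but are wrong.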
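
To make the router advice concrete, here is a minimal sketch of an InfoNCE loss for a single query. It assumes the query and each candidate Skill are already embedded as plain non-zero float vectors, that the positive is a Skill whose execution succeeded, and that the negatives are similar-looking Skills that failed. The function names and the temperature value of 0.07 are illustrative assumptions; the paper's exact training setup may differ.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(
    query: Sequence[float],
    positive: Sequence[float],
    negatives: List[Sequence[float]],
    temperature: float = 0.07,  # assumed value, not taken from the paper
) -> float:
    """InfoNCE for one query: -log softmax of the positive's similarity."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


# The loss is small when the query sits near the Skill that succeeded
# and large when it sits near a hard negative instead.
good = info_nce([1.0, 0.0], [0.9, 0.1], [[0.1, 0.9]])
bad = info_nce([1.0, 0.0], [0.1, 0.9], [[0.9, 0.1]])
print(good < bad)  # True
```

Minimizing this loss over many (query, succeeded Skill, failed Skills) triples is what pushes the embedding space to reflect execution success rather than surface similarity.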

Code Example

# Install and run Memento-Skills
git clone https://github.com/Memento-Teams/Memento-Skills.git
cd Memento-Skills
python -m venv .venv && source .venv/bin/activate
pip install -e .
memento agent

# Example config.json
{
  "llm": {
    "active_profile": "default",
    "profiles": {
      "default": {
        "model": "your-provider/your-model",
        "api_key": "your-api-key",
        "base_url": "https://your-api-url/v1"
      }
    }
  },
  "env": {
    "TAVILY_API_KEY": "your-search-api-key"
  }
}

# Prompt for generating synthetic queries to train the router (based on Appendix C)
# positive query: a scenario in which this Skill should be selected
# hard negative: a scenario in the same domain where this Skill is not the best choice
# Literal JSON braces are doubled so the template can be filled with str.format()
prompt = """
Target skill:
- name: {skill_name}
- description: {description}

Generate:
- {n_pos} positive queries: target skill SHOULD be selected
- {n_neg} hard negative queries: same domain BUT skill is NOT the best tool

Return JSON: {{"positive_queries": [...], "negative_queries": [...]}}
"""

Terminology

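SRDP: Short for Stateful Reflective Decision Process. A theoretical framework in which an LLM agent stores past experience in memory and uses it for future decisions. Roughly, a standard MDP (Markov decision process) extended with memory.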

InfoNCE: A contrastive learning loss that pulls good examples closer and pushes bad examples apart. Used here to teach the router that 'this Skill fits this query, and that one does not'.

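Skill Memory: The external memory store of an LLM agent. Code, prompts, and a declarative specification (SKILL.md) are bundled into one folder and stored as a reusable unit of capability. Skills can be retrieved and updated, much like a cache.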

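continual learning: As used here, a mode of learning in which the agent's capability keeps improving as new experience accumulates, without retraining model parameters. Similar to a student who gets better by keeping their notes up to date.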

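BM25: A classic text retrieval algorithm based on keyword frequency. It predates embedding-based retrieval and treats texts with heavy word overlap as similar, but it does not account for meaning or for whether a Skill is actually suitable to execute.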

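KL-regularised policy: A reinforcement learning technique that seeks the best action while constraining the policy not to drift too far from an existing (reference) policy. A temperature parameter (τ) controls how confidently the policy commits to its choice.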

t-SNE: A technique for visualizing high-dimensional data in 2D or 3D. It places similar Skills in the same neighborhood of a 2D map, so the clustering of the Skill library can be seen at a glance.

failure attribution: The process of automatically identifying which of several Skills caused an agent run to fail. Similar to pinpointing the faulty function when tracing a bug.
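
As a worked illustration of the KL-regularised policy entry above: maximizing expected value minus τ times the KL divergence from a reference policy has the standard closed-form solution π(a) ∝ π_ref(a) · exp(Q(a)/τ). The sketch below applies that general result to a hypothetical choice among three Skills. The value estimates, the uniform reference policy, and the function name are assumptions made for illustration and are not taken from the paper.

```python
import math
from typing import List


def kl_regularised_policy(
    q_values: List[float], reference: List[float], tau: float
) -> List[float]:
    """Closed-form maximizer of E[Q] - tau * KL(pi || reference).

    `reference` must contain strictly positive probabilities and `tau` must be > 0.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    logits = [math.log(p) + q / tau for q, p in zip(q_values, reference)]
    peak = max(logits)  # subtract the max before exponentiating for stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


q_values = [1.0, 0.5, 0.0]          # hypothetical value estimates for three Skills
reference = [1 / 3, 1 / 3, 1 / 3]   # uniform reference policy

sharp = kl_regularised_policy(q_values, reference, tau=0.05)
flat = kl_regularised_policy(q_values, reference, tau=100.0)
print(sharp[0] > 0.99)                                    # True: low tau commits to the best Skill
print(all(abs(p - r) < 0.01 for p, r in zip(flat, reference)))  # True: high tau stays near the reference
```

A low temperature makes the selection nearly greedy, while a high temperature keeps the policy close to the reference distribution.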

Original Abstract

We introduce Memento-Skills, a generalist, continually-learnable LLM agent system that functions as an agent-designing agent: it autonomously constructs, adapts, and improves task-specific agents through experience. The system is built on a memory-based reinforcement learning framework with stateful prompts, where reusable skills (stored as structured markdown files) serve as persistent, evolving memory. These skills encode both behaviour and context, enabling the agent to carry forward knowledge across interactions. Starting from simple elementary skills (like Web search and terminal operations), the agent continually improves via the Read-Write Reflective Learning mechanism introduced in Memento 2 [wang2025memento2]. In the read phase, a behaviour-trainable skill router selects the most relevant skill conditioned on the current stateful prompt; in the write phase, the agent updates and expands its skill library based on new experience. This closed-loop design enables continual learning without updating LLM parameters, as all adaptation is realised through the evolution of externalised skills and prompts. Unlike prior approaches that rely on human-designed agents, Memento-Skills enables a generalist agent to design agents end-to-end for new tasks. Through iterative skill generation and refinement, the system progressively improves its own capabilities. Experiments on the General AI Assistants benchmark and Humanity's Last Exam demonstrate sustained gains, achieving 26.2% and 116.2% relative improvements in overall accuracy, respectively. Code is available at https://github.com/Memento-Teams/Memento-Skills.