Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

Oct 6, 2025 • Qizheng Zhang, Changran Hu, Shubhangi Upasani +10

TL;DR Highlight

Don't write a prompt once and throw it away; turn it into a "living playbook" that automatically improves as experience accumulates.

Who Should Read

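Backend and AI developers who are putting LLM agents into production and wondering why the agent keeps repeating the same mistakes. Especially relevant for teams that improve system prompts by hand or want to inject domain knowledge without RAG.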

Core Mechanics

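Existing prompt optimizers suffer from brevity bias: they converge toward short, generic instructions, so the domain-specific know-how that is actually needed gets lost.
When an LLM rewrites the accumulated context wholesale, context collapse can occur: in one case an 18,282-token context shrank to 122 tokens and performance fell below the baseline.
ACE separates roles into three stages, Generator (execution) → Reflector (reflection) → Curator (curation), and appends only newly learned items as a "delta" instead of rewriting the whole context each time.
The context is managed as bullets, each carrying "helpful" and "harmful" counters, so over time good strategies are reinforced and bad ones are pruned.
Self-improvement works without labels (ground-truth answers), using only environment feedback such as whether code execution succeeded or failed, so the agent improves without supervised data.
With DeepSeek-V3.1 (a smaller open-source model), ACE matches the GPT-4.1-based IBM CUGA (a production-level agent) on the overall average and surpasses it on the harder test-challenge split.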
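
The role separation and delta update described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `generator` and `reflector` are hypothetical stand-ins for LLM calls, and the pruning rule (drop a bullet once its harmful count exceeds its helpful count by `margin`) is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Bullet:
    id: str
    content: str
    helpful: int = 0
    harmful: int = 0


@dataclass
class Playbook:
    bullets: dict = field(default_factory=dict)  # bullet id -> Bullet
    next_id: int = 1

    def apply_delta(self, new_contents, helpful_ids, harmful_ids):
        """Curator step: merge a delta instead of rewriting the whole context."""
        for bid in helpful_ids:
            if bid in self.bullets:
                self.bullets[bid].helpful += 1
        for bid in harmful_ids:
            if bid in self.bullets:
                self.bullets[bid].harmful += 1
        for content in new_contents:
            bid = f"ctx-{self.next_id:05d}"
            self.bullets[bid] = Bullet(bid, content)
            self.next_id += 1

    def prune(self, margin=2):
        """Drop bullets whose harmful count exceeds helpful by `margin` (assumed rule)."""
        self.bullets = {
            bid: b for bid, b in self.bullets.items()
            if b.harmful - b.helpful < margin
        }


def generator(task, playbook):
    """Hypothetical stand-in for the LLM that attempts the task."""
    used = list(playbook.bullets)  # pretend every bullet was consulted
    success = "paginate" not in task or any(
        "pagination" in b.content for b in playbook.bullets.values()
    )
    return {"task": task, "used": used, "success": success}


def reflector(trace):
    """Hypothetical stand-in for the LLM that extracts lessons from a trace."""
    if trace["success"]:
        return {"new": [], "helpful": trace["used"], "harmful": []}
    lesson = "Handle pagination until the API returns an empty page."
    return {"new": [lesson], "helpful": [], "harmful": []}


playbook = Playbook()
for task in ["paginate orders", "paginate orders"]:
    trace = generator(task, playbook)   # Generator: run the task
    delta = reflector(trace)            # Reflector: turn the trace into lessons
    playbook.apply_delta(delta["new"], delta["helpful"], delta["harmful"])
    playbook.prune()

for b in playbook.bullets.values():
    print(b)
```

In this toy run the first attempt fails and contributes one new bullet; the second attempt succeeds and only bumps that bullet's helpful counter. Nothing already in the playbook is rewritten at any point.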

Evidence

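On the AppWorld benchmark, average accuracy improves by +17.1% over the baseline (online adaptation, no labels).
Versus GEPA in offline adaptation: 82.3% lower latency and 75.1% fewer rollouts.
Versus Dynamic Cheatsheet in online adaptation: 91.5% lower latency and 83.6% lower token cost.
On the financial-analysis Formula benchmark: baseline 67.5% → ACE 85.5% (+18.0 percentage points; offline, with labels).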

How to Apply

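Manage the system prompt as a list of items with bullet IDs rather than a single text block. After each run, pass failure cases to a Reflector (a separate LLM call) that extracts what went wrong and what to do next time, and have the Curator append only the new bullets.
If the agent's environment produces execution feedback such as API call failures or assertion errors, this works without labels: pass the code execution results straight to the Reflector as input.
If the context grows too large, add lazy refinement that de-duplicates similar bullets using semantic embeddings, so the context window is not exceeded.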
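
As a rough sketch of the last two points, the snippet below captures execution feedback without labels and decides when lazy refinement should run. The names `run_with_feedback` and `needs_refinement` are hypothetical, the word-count token estimate is an assumption (a real tokenizer would give different numbers), and `exec` is used only for illustration; untrusted agent code should run in a sandbox.

```python
import traceback


def run_with_feedback(code: str) -> dict:
    """Execute agent-written code and return a label-free feedback record."""
    try:
        exec(code, {})  # illustration only; sandbox untrusted code in practice
        return {"success": True, "error": None}
    except Exception as exc:
        # Keep only the final traceback line, e.g. "AssertionError: ..."
        last_line = traceback.format_exception_only(type(exc), exc)[-1].strip()
        return {"success": False, "error": last_line}


def needs_refinement(bullets: list[str], budget_tokens: int) -> bool:
    """Lazy refinement trigger: de-duplicate only once the budget is exceeded.

    Tokens are approximated by whitespace-separated words (an assumption).
    """
    return sum(len(b.split()) for b in bullets) > budget_tokens


ok = run_with_feedback("total = sum([1, 2, 3])\nassert total == 6")
bad = run_with_feedback("assert 1 + 1 == 3, 'wrong total'")
print(ok)
print(bad)
```

The failure record is what would be handed to the Reflector. Checking `needs_refinement` before each de-duplication pass keeps the embedding cost off the common path, which is the point of refining lazily.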

Code Example

# Illustrative sketch of the core logic an ACE Curator uses to add new bullets to a playbook
from sentence_transformers import SentenceTransformer, util

class ACEPlaybook:
    def __init__(self):
        self.bullets = []  # each item: {id, section, content, helpful, harmful}
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self._counter = 0

    def add_bullet(self, section: str, content: str):
        """Append one new bullet and return its ID."""
        self._counter += 1
        bullet_id = f"ctx-{self._counter:05d}"
        self.bullets.append({
            "id": bullet_id,
            "section": section,
            "content": content,
            "helpful": 0,
            "harmful": 0
        })
        return bullet_id

    def update_feedback(self, bullet_ids: list[str], tag: str):
        """Apply the helpful/harmful tags assigned by the Reflector."""
        for b in self.bullets:
            if b["id"] in bullet_ids:
                if tag == "helpful":
                    b["helpful"] += 1
                elif tag == "harmful":
                    b["harmful"] += 1

    def deduplicate(self, threshold=0.92):
        """Grow-and-refine: drop near-duplicate bullets so the playbook stays compact."""
        if len(self.bullets) < 2:
            return
        contents = [b["content"] for b in self.bullets]
        embeddings = self.model.encode(contents, convert_to_tensor=True)
        sims = util.cos_sim(embeddings, embeddings)  # pairwise similarity matrix
        to_remove = set()
        for i in range(len(self.bullets)):
            if i in to_remove:
                continue  # an already-removed bullet must not eliminate others
            for j in range(i + 1, len(self.bullets)):
                if j in to_remove:
                    continue
                if sims[i][j].item() > threshold:
                    # Drop whichever bullet has the lower helpful count
                    if self.bullets[i]["helpful"] >= self.bullets[j]["helpful"]:
                        to_remove.add(j)
                    else:
                        to_remove.add(i)
                        break  # bullet i is gone; stop comparing against it
        self.bullets = [b for i, b in enumerate(self.bullets) if i not in to_remove]

    def render(self) -> str:
        """Build the playbook text that is injected into the Generator."""
        lines = []
        for b in self.bullets:
            lines.append(f"[{b['id']}] helpful={b['helpful']} harmful={b['harmful']} :: {b['content']}")
        return "\n".join(lines)

# --- Usage example ---
playbook = ACEPlaybook()
playbook.add_bullet("strategies", "Always look up relationships (roommate, etc.) in the Phone app's contacts. Do not guess by parsing transaction history.")
playbook.add_bullet("apis", "Use a while True + break pattern for pagination. Never use for i in range(10).")

# Apply Reflector feedback
playbook.update_feedback(["ctx-00001"], "helpful")
playbook.update_feedback(["ctx-00002"], "helpful")

# Add a new insight, then de-duplicate
playbook.add_bullet("strategies", "Always confirm relationships through the contacts API; do not use heuristics.")
playbook.deduplicate()  # drops a bullet only if a pair exceeds the similarity threshold

print(playbook.render())

Terminology

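context adaptation: Improving model behavior by modifying the input (prompts, system messages, memory, etc.) without changing model weights. In software terms, it is like changing a config file without redeploying code.

brevity bias: The tendency of prompt optimizers to converge toward ever shorter, more generic instructions, so important domain know-how is lost in the summarizing.

context collapse: When an LLM rewrites a long context wholesale, it discards key information and compresses the context to an extremely short form. The representative case is a context shrinking from about 18,000 tokens to 122 tokens while performance drops sharply.

ICL: In-Context Learning. Placing examples inside the input prompt so the model picks up the pattern without any training. Show a few examples and the model follows them.

GEPA: A prompt optimizer based on a Genetic-Pareto algorithm. It evolves multiple prompt candidates and selects the ones that perform well.

KV cache: A cache in which an LLM stores previously processed context so it can be reused without recomputation. It greatly cuts latency and cost when the same system prompt is used repeatedly.

ReAct: A blend of Reasoning + Acting. An agent execution pattern in which the LLM repeats "think (reason) → act (API call, etc.) → observe → think again".

delta update: Adding or modifying only the changed parts instead of rewriting everything. The same idea as git's diff/patch: ACE appends only newly learned items as bullets, preserving existing knowledge.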
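
The delta update and KV cache entries are connected: serving stacks that support prefix caching can only reuse the part of the context that is unchanged from the start. The sketch below, using made-up bullet text and a hypothetical `shared_prefix_chars` helper, compares how much of the old context survives as a prefix under an append versus a wholesale rewrite. It measures characters rather than tokens, which is a simplification.

```python
import os


def shared_prefix_chars(old: str, new: str) -> int:
    """Length of the common leading substring of two rendered contexts."""
    return len(os.path.commonprefix([old, new]))


old_bullets = [
    "[ctx-00001] Look up relationships in the contacts app.",
    "[ctx-00002] Paginate until the API returns an empty page.",
]
old_context = "\n".join(old_bullets)

# Delta update: existing bullets are untouched, the new one is appended.
appended = "\n".join(
    old_bullets + ["[ctx-00003] Check the auth token before each call."]
)

# Monolithic rewrite: a (hypothetical) summarizer replaces everything.
rewritten = "Use contacts, paginate carefully, and check auth."

print(shared_prefix_chars(old_context, appended), "of", len(old_context))
print(shared_prefix_chars(old_context, rewritten), "of", len(old_context))
```

With the append, the entire old context remains a shared prefix; with the rewrite, nothing does. One design consequence: if per-bullet counters are rendered inline, as in the playbook example above, any counter change alters earlier lines, so the prefix stays stable only when counters are kept out of the rendered text or placed after it.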

Original Abstract

Large language model (LLM) applications such as agents and domain-specific reasoning increasingly rely on context adaptation -- modifying inputs with instructions, strategies, or evidence, rather than weight updates. Prior approaches improve usability but often suffer from brevity bias, which drops domain insights for concise summaries, and from context collapse, where iterative rewriting erodes details over time. Building on the adaptive memory introduced by Dynamic Cheatsheet, we introduce ACE (Agentic Context Engineering), a framework that treats contexts as evolving playbooks that accumulate, refine, and organize strategies through a modular process of generation, reflection, and curation. ACE prevents collapse with structured, incremental updates that preserve detailed knowledge and scale with long-context models. Across agent and domain-specific benchmarks, ACE optimizes contexts both offline (e.g., system prompts) and online (e.g., agent memory), consistently outperforming strong baselines: +10.6% on agents and +8.6% on finance, while significantly reducing adaptation latency and rollout cost. Notably, ACE could adapt effectively without labeled supervision and instead by leveraging natural execution feedback. On the AppWorld leaderboard, ACE matches the top-ranked production-level agent on the overall average and surpasses it on the harder test-challenge split, despite using a smaller open-source model. These results show that comprehensive, evolving contexts enable scalable, efficient, and self-improving LLM systems with low overhead.