Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems

TL;DR Highlight

A 102-page survey that reframes code in LLM agents not as a mere output but as the execution infrastructure for reasoning, acting, and environment modeling.

Who Should Read

Backend and ML engineers who design AI agent systems such as LangChain, OpenHands, or Claude Code, or who build coding-assistant pipelines. Also developers working on multi-agent workflows or automated DevOps pipelines.

Core Mechanics

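Proposes a new frame that treats code not as "an artifact the LLM generates" but as "the execution substrate (harness) on which an agent reasons, acts, and models its environment." Code's core value lies in three properties: it is executable, inspectable, and stateful.
Organizes the harness interface into three layers: ① Code for Reasoning (externalizing reasoning as code, as in PoT, PAL, and CoC), ② Code for Acting (expressing actions as programs, as in SayCan and Voyager), ③ Code for Environment (representing environment state as code, as in SWE-bench and WorldCoder).
Classifies planning mechanisms into four types: Linear Decomposition (step-by-step breakdown), Structure-grounded (based on dependency graphs), Search-based (search methods such as MCTS), and Orchestration-based (role division across multiple agents). Each suits tasks of a different complexity.
Treats memory not as a single vector DB but as five state-management layers: Working Memory (current task state), Semantic Memory (codebase evidence), Experiential Memory (reuse of past experience), Long-term Memory, and Multi-Agent Memory. Context compaction is covered as a separate mechanism.
Presents the Plan-Execute-Verify (PEV) loop as the core control pattern. Agent behavior is constrained by sandboxed execution, permission tiers (read-only / sandboxed edit / full access), and deterministic sensors (tests, linters, static analysis).
When scaling to multiple agents, shared code artifacts (repositories, tests, execution traces) become the basis for collaboration among Manager, Planner, Coder, Reviewer, and Tester roles. The survey also catalogs patterns by workflow topology, including centralized, decentralized, and streaming.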
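As a rough illustration of the memory layers above, the sketch below keeps working, semantic, and experiential state in separate stores instead of one shared vector index. All names here (`WorkingMemory`, `SemanticMemory`, `ExperientialMemory`, `build_context`) are hypothetical, and substring matching stands in for real retrieval; the survey does not prescribe this layout.

```python
from dataclasses import dataclass, field


@dataclass
class WorkingMemory:
    """State of the task currently in progress."""
    open_files: list[str] = field(default_factory=list)
    failing_tests: list[str] = field(default_factory=list)


@dataclass
class SemanticMemory:
    """Evidence from the codebase, keyed by symbol name."""
    snippets: dict[str, str] = field(default_factory=dict)

    def search(self, query: str) -> list[str]:
        # Naive substring match on the symbol name (assumption for this sketch);
        # a real system might use AST-based chunking plus retrieval.
        return [code for name, code in self.snippets.items() if query in name]


@dataclass
class ExperientialMemory:
    """Patch patterns that worked on earlier tasks."""
    patterns: list[str] = field(default_factory=list)


def build_context(working: WorkingMemory, semantic: SemanticMemory,
                  experiential: ExperientialMemory, query: str,
                  max_items: int = 3) -> str:
    """Assemble prompt context from each layer, labelled by its source."""
    parts = [
        "## Working",
        "files: " + ", ".join(working.open_files),
        "failing: " + ", ".join(working.failing_tests),
        "## Semantic",
        *semantic.search(query)[:max_items],
        "## Experiential",
        *experiential.patterns[:max_items],
    ]
    return "\n".join(parts)
```

Keeping the layers separate means each one can have its own size cap and eviction rule, so a long test log in working memory does not crowd out codebase evidence.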

Evidence

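A comparison of SWE-agent and RepairAgent shows that, even with the same base model, repository-level bug-fixing performance varies substantially depending on Working Memory (how interaction state and execution feedback are organized).
In the CodeAdapt experiments, an LLM tightly coupled to a code-execution interface outperforms specialized models designed for reasoning only.
Voyager (Minecraft) substantially improves open-ended task completion over earlier single-pass agents by combining an automatic curriculum with an accumulating library of executable skills (as cited in the paper).
On SWE-bench, multiple systems confirm that repository-level code editing has to be judged by unit-test execution results, unlike benchmarks that check only textual accuracy, in order to measure real agent performance.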
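The gap between text-match scoring and execution-based scoring shows up even in a toy case. The sketch below is not SWE-bench's harness; the reference patch, candidate patch, and test cases are made up for illustration.

```python
# A reference solution and a candidate that differs textually but behaves the same.
reference = "def add(a, b):\n    return a + b\n"
candidate = "def add(x, y):\n    total = x + y\n    return total\n"


def text_match(candidate_src: str, reference_src: str) -> bool:
    """Score by exact string equality with the reference patch."""
    return candidate_src.strip() == reference_src.strip()


def passes_tests(candidate_src: str) -> bool:
    """Score by executing the candidate and running unit checks against it."""
    namespace: dict = {}
    exec(candidate_src, namespace)  # toy only; a real harness runs this in a sandbox
    add = namespace["add"]
    try:
        assert add(2, 3) == 5
        assert add(-1, 1) == 0
    except AssertionError:
        return False
    return True


print(text_match(candidate, reference))  # False
print(passes_tests(candidate))           # True
```

The candidate is rejected by the textual metric yet is functionally correct, which is the case execution-based evaluation is meant to capture.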

How to Apply

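When building a coding-agent pipeline, design the debugging loop as a PEV (Plan-Execute-Verify) pattern rather than a plain retry. In the planning step, write a contract that names the files to modify, the validation commands, and the rollback point; then execute in a sandbox and feed test and linter results back as deterministic sensors. This structure, bounded by a retry limit, allows stable control without infinite loops.
When building a multi-agent coding system, do not collapse shared memory into a single vector DB; split it by purpose into Working, Semantic, Experiential, and Long-term memory. For example, keep current task state in Working Memory (file list, failing-test record), repository code evidence in Semantic Memory (AST-based chunking plus retrieval), and previously successful patch patterns in Experiential Memory. Managed this way, long-running tasks can proceed without context pollution.
When building GUI automation or OS agents, attach a Skill Library that stores successfully executed action sequences as reusable code skills, following the Voyager pattern. When a new task arrives, search the skill library first; if nothing matches, generate a new skill and store it. This enables lifelong adaptation.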
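A minimal sketch of the search-then-generate flow for a skill library, assuming skills are plain Python source strings keyed by a task description. `SkillLibrary`, `solve`, and the `generate` and `verify` callbacks are hypothetical names, and word overlap stands in for whatever retrieval a Voyager-style system would actually use.

```python
from typing import Callable, Optional


class SkillLibrary:
    """Stores verified skills as source code keyed by a task description."""

    def __init__(self) -> None:
        self._skills: dict[str, str] = {}

    def add(self, description: str, source: str) -> None:
        self._skills[description] = source

    def search(self, task: str, min_overlap: int = 2) -> Optional[str]:
        """Return the skill whose description shares the most words with the task."""
        task_words = set(task.lower().split())
        best: Optional[str] = None
        best_score = 0
        for description, source in self._skills.items():
            score = len(task_words & set(description.lower().split()))
            if score > best_score:
                best, best_score = source, score
        return best if best_score >= min_overlap else None


def solve(task: str, library: SkillLibrary,
          generate: Callable[[str], str],
          verify: Callable[[str], bool]) -> str:
    """Reuse a stored skill if one matches; otherwise generate, verify, and store."""
    cached = library.search(task)
    if cached is not None:
        return cached
    source = generate(task)   # e.g. an LLM call (assumption)
    if verify(source):        # only skills that pass verification are stored
        library.add(task, source)
    return source
```

Storing a skill only after `verify` succeeds keeps failed attempts out of the library, so later tasks retrieve code that has already run successfully at least once.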

Code Example

# PEV (Plan-Execute-Verify) loop pattern example
import shlex
import subprocess
from dataclasses import dataclass
from typing import Optional

@dataclass
class PlanContract:
    task_description: str
    files_to_edit: list[str]
    validation_commands: list[str]  # e.g. ["pytest tests/", "mypy src/"]
    rollback_snapshot: Optional[str] = None

def plan_phase(task: str, repo_context: str) -> PlanContract:
    """Produce an edit plan (contract) via the LLM."""
    # LLM call: decide which files to edit and which tests validate the change.
    # The values below are hard-coded placeholders.
    return PlanContract(
        task_description=task,
        files_to_edit=["src/utils.py"],
        validation_commands=["pytest tests/test_utils.py -v", "mypy src/utils.py"],
    )

def generate_patch(contract: PlanContract) -> str:
    """Placeholder for the LLM call that turns the contract into a patch."""
    raise NotImplementedError("wire this to an LLM client")

def execute_phase(contract: PlanContract, patch_code: str) -> dict:
    """Run the contract's validation commands in the sandbox."""
    # An isolated environment such as Docker or E2B is recommended in practice.
    # NOTE: applying patch_code to the sandbox working tree is omitted in this sketch.
    results = {}
    for cmd in contract.validation_commands:
        # shlex.split keeps quoted arguments intact, unlike str.split
        result = subprocess.run(shlex.split(cmd), capture_output=True, text=True)
        results[cmd] = {
            "returncode": result.returncode,
            "stdout": result.stdout[-2000:],  # keep only the tail to limit context size
            "stderr": result.stderr[-1000:],
        }
    return results

def verify_phase(results: dict) -> tuple[bool, str]:
    """Check state with deterministic sensors (test and type-check results)."""
    all_passed = all(r["returncode"] == 0 for r in results.values())
    # pytest reports failures on stdout, so include both streams in the feedback
    feedback = "\n".join(
        f"[{'PASS' if r['returncode'] == 0 else 'FAIL'}] {cmd}\n{r['stdout']}{r['stderr']}"
        for cmd, r in results.items()
    )
    return all_passed, feedback

def pev_loop(task: str, repo_context: str, max_iterations: int = 3) -> str:
    contract = plan_phase(task, repo_context)

    for iteration in range(max_iterations):
        # Execute: produce a patch with the LLM and run validation
        patch = generate_patch(contract)  # LLM call
        results = execute_phase(contract, patch)

        # Verify: check with tests and linters
        passed, feedback = verify_phase(results)

        if passed:
            print(f"✅ Verified at iteration {iteration + 1}")
            return patch

        # Feed the failure back (a stand-in for Working Memory) for the next attempt
        contract.task_description += f"\n\nPrevious attempt failed:\n{feedback}"
        print(f"🔄 Iteration {iteration + 1} failed, retrying with feedback...")

    # Past the retry limit, escalate to a human (Human-in-the-Loop)
    raise RuntimeError("Max iterations reached. Human review required.")

Terminology

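Agent Harness: The software layer that wraps an LLM with tools, APIs, sandboxes, memory, and verifiers, turning a stateless model into an agent that does real work. In car terms, it is the steering wheel, brakes, and dashboard connected to the engine (the LLM).

PEV Loop: Short for Plan-Execute-Verify. A cycle of planning, sandboxed execution, verification by tests and linters, and feeding failures back into the next attempt. A core pattern for keeping an agent from falling into infinite loops.

Program-of-Thoughts (PoT): A technique that expresses the reasoning for math or logic problems as executable code instead of natural language. The Python interpreter does the computation and the LLM only composes the logic, which reduces calculation errors.

SWE-bench: A benchmark in which an agent resolves issues from real GitHub repositories by editing code. Its distinguishing feature is that it scores on "do the unit tests pass" rather than "does the answer match."

Sandboxed Execution: Running agent-generated code in an isolated container or virtual environment rather than directly on the host system. This prevents faulty code from damaging real servers.

Working Memory: The mechanism that keeps the state of the task in progress (files being edited, failing-test records, and so on) inside the LLM's context window. Comparable to the materials a person spreads out on the desk while working.

Skill Library: A store of action sequences or code fragments the agent has executed successfully, saved in reusable form. This is the approach Voyager used in Minecraft: when a new task arrives, the library is searched first and matching skills are reused.

HITL (Human-in-the-Loop): A gate where the agent obtains human approval before performing risky operations (deployment, file deletion, external API calls, and so on). A point of human intervention for safety, like emergency braking in a self-driving car.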
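One way to combine permission tiers with a human-approval gate is sketched below. The tier names mirror the read-only / sandboxed edit / full access split described under Core Mechanics; the `Permission` enum, the `REQUIRED` table, the action names, and the `approve` callback are all hypothetical and not taken from the survey.

```python
from enum import IntEnum
from typing import Callable


class Permission(IntEnum):
    READ_ONLY = 1
    SANDBOX_EDIT = 2
    FULL_ACCESS = 3


# Hypothetical mapping from action name to the minimum tier it needs.
REQUIRED = {
    "read_file": Permission.READ_ONLY,
    "edit_file": Permission.SANDBOX_EDIT,
    "run_tests": Permission.SANDBOX_EDIT,
    "deploy": Permission.FULL_ACCESS,
    "delete_file": Permission.FULL_ACCESS,
}


def authorize(action: str, granted: Permission,
              approve: Callable[[str], bool]) -> bool:
    """Allow an action if the granted tier covers it.

    Actions that need FULL_ACCESS additionally require human approval.
    """
    required = REQUIRED.get(action)
    if required is None:
        return False              # unknown actions are denied by default
    if granted < required:
        return False              # tier too low for this action
    if required is Permission.FULL_ACCESS:
        return approve(action)    # human-in-the-loop gate
    return True
```

In an interactive setup, `approve` could prompt an operator; in tests, a lambda returning a fixed value is enough to exercise each branch.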

Related Resources

Awesome-Code-as-Agent-Harness-Papers (GitHub)

Original Abstract

Recent large language models (LLMs) have demonstrated strong capabilities in understanding and generating code, from competitive programming to repository-level software engineering. In emerging agentic systems, code is no longer only a target output. It increasingly serves as an operational substrate for agent reasoning, acting, environment modeling, and execution-based verification. We frame this shift through the lens of agent harnesses and introduce code as agent harness: a unified view that centers code as the basis for agent infrastructure. To systematically study this perspective, we organize the survey around three connected layers. First, we study the harness interface, where code connects agents to reasoning, action, and environment modeling. Second, we examine harness mechanisms: planning, memory, and tool use for long-horizon execution, together with feedback-driven control and optimization that make harness reliable and adaptive. Third, we discuss scaling the harness from single-agent systems to multi-agent settings, where shared code artifacts support multi-agent coordination, review, and verification. Across these layers, we summarize representative methods and practical applications of code as agent harness, spanning coding assistants, GUI/OS automation, embodied agents, scientific discovery, personalization and recommendation, DevOps, and enterprise workflows. We further outline open challenges for harness engineering, including evaluation beyond final task success, verification under incomplete feedback, regression-free harness improvement, consistent shared state across multiple agents, human oversight for safety-critical actions, and extensions to multimodal environments. By centering code as the harness of agentic AI, this survey provides a unified roadmap toward executable, verifiable, and stateful AI agent systems.