Effective Strategies for Asynchronous Software Engineering Agents: CAID

Effective Strategies for Asynchronous Software Engineering Agents

Mar 23, 2026•Jiayi Geng, Graham Neubig

TL;DR Highlight

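A framework that applies Git's branch-and-merge pattern directly to multi-agent collaboration and improves accuracy over a single-agent baseline by up to 26.7 percentage points (absolute).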

Who Should Read

ML engineers bringing LLM-based coding agents into production, and developers designing multi-agent systems that split complex software development tasks across several agents.

Core Mechanics

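CAID (Centralized Asynchronous Isolated Delegation): a manager agent builds a dependency graph, each engineer agent works in its own isolated git worktree, and the results are integrated with git merge.
Doubling a single agent's iteration budget yields little or no improvement, and in some PaperBench runs it even lowers the score.
Soft isolation (instructing agents not to edit overlapping files) is not enough: on PaperBench it scored 55.5%, below the single agent's 57.2%, whereas physical isolation with git worktree reached 63.3%.
Adding more agents does not always help: on the simpy repo the pass rate was 92.1% at N=4 but fell sharply to 44.3% at N=8.
A fallback strategy that runs a single agent first and switches to multi-agent on failure is inefficient: cost and time nearly double, while performance is about the same as running multi-agent alone.
The manager's delegation ability is critical: if it leaves out a core dependency file such as autodiff.py, the overall pass rate collapses no matter how well the other agents perform.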
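The dependency-graph idea above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `plan_rounds` and `max_agents` are hypothetical names, the file names are borrowed from the autodiff.py example, and the sketch groups work into synchronous rounds for simplicity, whereas CAID itself delegates asynchronously.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def plan_rounds(deps, max_agents):
    """Group files into rounds so that every file in a round has all of
    its dependencies finished in an earlier round.

    deps maps each file to the set of files it depends on.
    max_agents caps how many files are handed out in one round.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    rounds = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # files with no unmet dependency
        # Split the ready set into chunks no larger than max_agents.
        for start in range(0, len(ready), max_agents):
            rounds.append(ready[start:start + max_agents])
        sorter.done(*ready)  # mark them finished so dependents unlock
    return rounds


# Hypothetical dependency graph: autodiff.py needs the other two files.
deps = {
    "src/autodiff.py": {"src/operators.py", "src/tensor_data.py"},
    "src/operators.py": set(),
    "src/tensor_data.py": set(),
}
rounds = plan_rounds(deps, max_agents=4)
```

Here the two independent files land in the first round and `src/autodiff.py` is held back until both are done, which is the ordering the manager has to get right.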

Evidence

PaperBench, single agent vs. CAID: Claude Sonnet 4.5 57.2% → 63.3%, MiniMax 2.5 10.4% → 36.7%, GLM 4.7 38.0% → 45.4%
Commit0-Lite, single agent vs. CAID: Claude Sonnet 4.5 53.1% → 59.1%, MiniMax 2.5 42.3% → 57.0% (+14.7pp, p=0.007)
Raising the iteration budget from 100 to 200 on PaperBench: Claude Sonnet 4.5 dropped by 3.0pp, and MiniMax 2.5 gained only 1.5pp
Pass rate on the simpy repo by number of engineers: N=2 → 0.0%, N=4 → 92.1%, N=8 → 44.3%
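For readers who want the absolute gains side by side, the short sketch below recomputes them from the single-agent and CAID scores listed above. The dictionary layout and variable names are illustrative only; the numbers are the ones reported in this section.

```python
# (benchmark, model) -> (single-agent score, CAID score), in percent
results = {
    ("PaperBench", "Claude Sonnet 4.5"): (57.2, 63.3),
    ("PaperBench", "MiniMax 2.5"): (10.4, 36.7),
    ("PaperBench", "GLM 4.7"): (38.0, 45.4),
    ("Commit0-Lite", "Claude Sonnet 4.5"): (53.1, 59.1),
    ("Commit0-Lite", "MiniMax 2.5"): (42.3, 57.0),
}

# Absolute gain in percentage points, rounded to one decimal place
# to hide floating-point noise.
gains = {
    key: round(caid - single, 1)
    for key, (single, caid) in results.items()
}

for (benchmark, model), gain in gains.items():
    print(f"{benchmark:12s} {model:18s} +{gain:.1f}pp")
```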

How to Apply

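When handing a large coding task to agents: build a file-level dependency graph first, assign only mutually independent files to agents in parallel, and have each agent work in a separate git worktree.
When choosing the number of agents: first count the modules in the repository that can be developed independently, then deploy fewer agents than that. More is not always better; on the simpy repo, 4 engineers outperformed 8.
When conflicts arise in an agent collaboration pipeline: switch from instructions that forbid overlapping file edits (soft isolation) to physical isolation with git worktree, and handle conflicts explicitly at merge time.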
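The worktree-based isolation described above could be wired up roughly like this. It is a sketch under assumptions: the helper names, the `main` base branch, the `--no-ff` merge choice, and the `../workspace_<id>` layout are illustrative rather than taken from the paper. The helpers only build `git` command lines; `run` is the one place that would actually execute them.

```python
import subprocess


def worktree_add_cmd(engineer_id, base_branch="main"):
    """Build the git command that creates an isolated worktree and a
    new branch for one engineer agent (naming scheme is an assumption)."""
    branch = engineer_id.replace("_", "-") + "-branch"
    path = f"../workspace_{engineer_id}"
    # git worktree add -b <new-branch> <path> <commit-ish>
    return ["git", "worktree", "add", "-b", branch, path, base_branch]


def merge_cmd(engineer_id):
    """Build the git command that merges an engineer's branch back."""
    branch = engineer_id.replace("_", "-") + "-branch"
    return ["git", "merge", "--no-ff", branch]


def run(cmd, repo_dir):
    """Execute a command inside repo_dir. A non-zero return code from
    `git merge` means the merge did not complete cleanly (for example,
    a conflict) and has to be handled explicitly."""
    return subprocess.run(cmd, cwd=repo_dir, capture_output=True, text=True)


setup_cmds = [worktree_add_cmd(e) for e in ("engineer_1", "engineer_2")]
```

Checking the return code of the merge step, instead of relying on agents to avoid overlapping files, is what turns conflict handling into an explicit step at integration time.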

Code Example


# Core structure of the CAID manager prompt (JSON delegation format)
# 1. Split tasks based on the dependency graph
delegation_plan = {
  "delegation_plan": {
    "first_round": {
      "num_agents": 2,  # one agent per first-round task listed below
      "reasoning": "tensor_data.py and operators.py are independent and can run in parallel; autodiff.py depends on both, so it is assigned later",
      "tasks": [
        {
          "engineer_id": "engineer_1",
          "task_id": "task-operators",
          "file_path": "src/operators.py",
          "functions_to_implement": ["add", "mul", "neg"],
          "complexity": "simple",
          "instruction": "Implement the basic operators. Does not depend on tensor_data.py."
        },
        {
          "engineer_id": "engineer_2",
          "task_id": "task-tensor-data",
          "file_path": "src/tensor_data.py",
          "functions_to_implement": ["TensorData.__init__", "shape"],
          "complexity": "medium",
          "instruction": "Implement the tensor data structure. Independent of operators.py."
        }
      ]
    },
    "remaining_tasks": [
      {
        "task_id": "task-autodiff",
        "file_path": "src/autodiff.py",
        "functions_to_implement": ["backpropagate"],
        "complexity": "complex",
        "depends_on": ["task-operators", "task-tensor-data"]
      }
    ]
  }
}

# 2. Each engineer runs in a separate git worktree
# git worktree add ../workspace_engineer_1 -b engineer-1-branch
# git worktree add ../workspace_engineer_2 -b engineer-2-branch

# 3. Merge once the work is done
# git merge engineer-1-branch  # on conflict, the engineer that owns the branch resolves it
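One way the `depends_on` field in such a plan might be consumed is shown below: after each merge, the manager checks which remaining tasks have all of their dependencies merged. `ready_tasks` is a hypothetical helper, not part of the paper's prompt, and the task list is a trimmed copy of the plan above.

```python
def ready_tasks(remaining_tasks, merged_task_ids):
    """Return the remaining tasks whose dependencies have all been merged.

    A task with no "depends_on" key is treated as ready immediately.
    """
    merged = set(merged_task_ids)
    return [
        task for task in remaining_tasks
        if set(task.get("depends_on", [])) <= merged
    ]


remaining = [
    {
        "task_id": "task-autodiff",
        "file_path": "src/autodiff.py",
        "depends_on": ["task-operators", "task-tensor-data"],
    },
]

# Only one dependency merged so far: autodiff stays blocked.
blocked = ready_tasks(remaining, ["task-operators"])
# Both dependencies merged: autodiff can now be delegated.
unblocked = ready_tasks(remaining, ["task-operators", "task-tensor-data"])
```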

Terminology

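git worktree: A git feature that checks out several branches of one repository into separate folders at the same time. Like several people working in separate rooms, it allows simultaneous work without file collisions.

branch-and-merge: A collaboration pattern in which each contributor develops on an independent branch and merges into the main branch when done. Similar to team members each writing a draft that an editor-in-chief then combines.

dependency graph: A directed graph showing which modules depend on which others. It captures relationships of the form "A must be built before B can be built."

git merge conflict: A conflict that occurs when two branches change the same part of the same file differently and git cannot combine them automatically. It is the situation where two people edit the same sentence of the same document in different ways at the same time.

asyncio: A Python library for running multiple tasks concurrently (asynchronously). Similar to cooking several dishes at once while using timers to keep track of each.

PaperBench: A benchmark that evaluates whether an AI agent can reproduce the core experiments of an academic paper in code. The agent has to read the paper and re-implement the experiments from scratch.

Commit0: A benchmark in which an agent must implement a Python library from scratch, given only skeleton code, and pass all unit tests.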
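To make the asyncio entry concrete, here is a small sketch of asynchronous delegation: engineers run concurrently and the manager handles each result as soon as it arrives instead of waiting for all of them. The `engineer` and `manager` coroutines are hypothetical stand-ins; `asyncio.sleep` takes the place of real agent work, and appending to a list takes the place of a real `git merge`.

```python
import asyncio


async def engineer(task_id, seconds):
    """Stand-in for an engineer agent; sleeping simulates the work."""
    await asyncio.sleep(seconds)
    return task_id


async def manager(tasks):
    """Launch all engineers at once and integrate each result as soon
    as that engineer finishes."""
    merged = []
    running = [asyncio.create_task(engineer(t, s)) for t, s in tasks]
    for finished in asyncio.as_completed(running):
        task_id = await finished
        merged.append(task_id)  # a real system would merge the branch here
    return merged


merged = asyncio.run(
    manager([("task-operators", 0.02), ("task-tensor-data", 0.01)])
)
```

Using `asyncio.as_completed` rather than `asyncio.gather` lets the manager react to whichever engineer finishes first, which is closer in spirit to asynchronous delegation than waiting for a whole round.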

Original Abstract

AI agents have become increasingly capable at isolated software engineering (SWE) tasks such as resolving issues on Github. Yet long-horizon tasks involving multiple interdependent subtasks still pose challenges both with respect to accuracy, and with respect to timely completion. A natural approach to solving these long-horizon tasks in a timely manner is asynchronous multi-agent collaboration, where multiple agents work on different parts of the task at the same time. But effective application of multi-agent systems has proven surprisingly difficult: concurrent edits by multiple agents interfere with each other, dependencies are difficult to synchronize, and combining partial progress into a coherent whole is challenging. On the other hand, human developers have long relied on mature collaboration infrastructure to manage these challenges in large software projects. Inspired by these collaboration primitives, we introduce Centralized Asynchronous Isolated Delegation (CAID), a structured multi-agent coordination paradigm grounded in three core SWE primitives: centralized task delegation, asynchronous execution, and isolated workspaces. CAID constructs dependency-aware task plans through a central manager, executes subtasks concurrently in isolated workspaces, and consolidates progress via structured integration with executable test-based verification. In empirical evaluation, we find that CAID improves accuracy over single-agent baselines by 26.7% absolute on paper reproduction tasks (PaperBench) and 14.3% on Python library development tasks (Commit0). Through systematic analysis, we find that branch-and-merge is a central coordination mechanism for multi-agent collaboration, and that SWE primitives such as git worktree, git commit, and git merge enable it to be realized in a reliable and executable manner.