SAFARI: Scaling Long-Horizon Agentic Fault Attribution via Active Investigation

TL;DR Highlight

A tool-based diagnostic framework that pinpoints where a fault occurred in agent execution logs running past millions of tokens.

Who Should Read

AI engineers who operate multi-agent systems and need to trace failure causes or build agent monitoring pipelines. Developers who cannot diagnose errors in long execution traces because of LLM context limits.

Core Mechanics

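Existing fault attribution methods (fault attribution is the task of finding where in an agent run the failure occurred) load the entire trajectory into the LLM context, so they stop working altogether once a trace reaches millions of tokens and exceeds the context limit.
Instead of reading the whole trace at once, SAFARI runs an 'Active Investigation' loop in which the LLM uses two tools, read(offset, limit) and search(pattern), to retrieve only the segments it needs.
It maintains a Short-Term Memory (STM) so that key hypotheses, evidence, and the investigation plan survive even when older conversation turns are pushed out of context.
Verification is two-stage: a hypothesis is decomposed into at most three atomic claims, the claims are sent in parallel to a separate Reasoning Evaluator LLM, and SAFARI concludes only if the average confidence score is 70 or higher.
The backbone LLM is Claude-Opus-4.6 (1M token). After at most 30 tool-call iterations, 'forced finalization' makes it output the best answer it can from the evidence gathered so far.
When the fault lies outside the context window (ς > 1), existing methods cannot run at all, whereas SAFARI can still explore those out-of-context segments through its tools.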
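
The verification gate and the iteration cap described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the constants 3, 70, and 30 come from the summary, while `verify_hypothesis`, `investigate`, and the `score_claim`, `step`, and `finalize` callbacks are hypothetical names standing in for the Reasoning Evaluator call, one tool-call turn, and the forced-finalization prompt.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, Optional, Sequence, Tuple

MAX_CLAIMS = 3             # at most three atomic claims per hypothesis
CONFIDENCE_THRESHOLD = 70  # mean score required to accept a conclusion
MAX_TOOL_CALLS = 30        # iteration cap before forced finalization


def verify_hypothesis(
    claims: Sequence[str],
    score_claim: Callable[[str], float],
) -> Tuple[bool, float]:
    """Score every claim in parallel; accept if the mean reaches the threshold.

    `score_claim` is an assumed stand-in for one Reasoning Evaluator call
    that returns a confidence score on a 0-100 scale.
    """
    if not claims:
        raise ValueError("at least one claim is required")
    if len(claims) > MAX_CLAIMS:
        raise ValueError(f"at most {MAX_CLAIMS} claims are allowed")
    with ThreadPoolExecutor(max_workers=len(claims)) as pool:
        scores = list(pool.map(score_claim, claims))
    average = mean(scores)
    return average >= CONFIDENCE_THRESHOLD, average


def investigate(
    step: Callable[[int], Optional[str]],
    finalize: Callable[[], str],
    max_tool_calls: int = MAX_TOOL_CALLS,
) -> str:
    """Run up to `max_tool_calls` turns, then fall back to forced finalization.

    `step(turn)` performs one tool call and returns a verified answer,
    or None if the investigation should continue.
    """
    for turn in range(max_tool_calls):
        answer = step(turn)
        if answer is not None:
            return answer
    # Budget exhausted: answer from whatever evidence has been gathered.
    return finalize()
```

In a real pipeline, `step` would call `verify_hypothesis` before returning an answer, so that only conclusions passing the gate end the loop early.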

Evidence

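On the Who&When benchmark (1M token budget), SAFARI achieves 20% higher accuracy than RAFFLES, the previous SOTA.
On the TRAIL GAIA subset under a 25K token budget, it improves precision by 7% and strict precision by 19% over RAFFLES.
In the extreme scenario where the fault sits 5x beyond the native context (ς=5), SAFARI still holds a precision of 0.58. Existing methods fail completely in this regime, with a precision of 0.
SAFARI with a 100K token budget matches RAFFLES with a 1M token budget, reaching the same result with one tenth of the tokens.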

How to Apply

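When a multi-agent system's execution log exceeds the LLM context, do not put the whole log in the prompt. Give the agent read(offset, limit) and search(pattern) tools and design it to retrieve only the segments its current hypothesis calls for.
When the agent investigates across many turns and earlier findings get pushed out of context, append an STM structure (task_goal, faults_identified_so_far, still_need, past_tool_calls) to the end of the prompt on every turn.
Before the final conclusion, split the hypothesis into three atomic questions and ask a separate LLM to verify them. This reduces hallucination in the agent's self-assessment and increases confidence in the result.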
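
As a rough sketch of the first point, the two tools can be backed by a simple line-indexed view of the trace. This assumes the trace is plain text addressed by 0-based line numbers; the `TraceTools` class and the `max_hits` cap are hypothetical additions for illustration, not part of SAFARI as described.

```python
import re
from typing import List, Tuple


class TraceTools:
    """Line-addressed access to a trace, so only small slices enter the prompt."""

    def __init__(self, trace_text: str) -> None:
        self.lines = trace_text.splitlines()

    def read(self, offset: int, limit: int) -> List[Tuple[int, str]]:
        """Return up to `limit` (line_number, text) pairs starting at `offset`."""
        if offset < 0 or limit < 0:
            raise ValueError("offset and limit must be non-negative")
        window = self.lines[offset:offset + limit]
        return [(offset + i, line) for i, line in enumerate(window)]

    def search(self, pattern: str, max_hits: int = 20) -> List[Tuple[int, str]]:
        """Return the first `max_hits` lines matching the regular expression."""
        regex = re.compile(pattern)
        hits: List[Tuple[int, str]] = []
        for number, line in enumerate(self.lines):
            if regex.search(line):
                hits.append((number, line))
                if len(hits) >= max_hits:
                    break
        return hits
```

Returning line numbers with every hit lets the model follow a `search` result with a targeted `read` around the same location. For traces too large to hold in memory, the same interface could sit on top of a file with a line-offset index.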

Code Example

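# SAFARI STM structure example - inserted into the prompt on every turn
stm_template = """
<stm>
{
  "task_goal": "[goal of the agent system under analysis]",
  "final_output": "[output of the last step and why it is wrong]",
  "faults_identified_so_far": [
    {
      "step_id": 3,
      "reason": "why this step is the first error and was never corrected afterwards",
      "evidence": "evidence quoted directly from the trace (300 characters or fewer)"
    }
  ],
  "still_need": ["read step 5 to check whether the tool call failed"]
}
</stm>
NEXT ACTION: [which tool to call next and why]
"""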

# Tool definition example (OpenAI function calling format)
tools = [
    {
        "type": "function",
        "function": {
            "name": "read",
            "description": "Read a specific range of the trace",
            "parameters": {
                "type": "object",
                "properties": {
                    "offset": {"type": "integer", "description": "Starting line number"},
                    "limit": {"type": "integer", "description": "Number of lines to read (at least 80 recommended)"}
                },
                "required": ["offset", "limit"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search",
            "description": "Search the trace for a regular expression pattern",
            "parameters": {
                "type": "object",
                "properties": {
                    "pattern": {"type": "string", "description": "Regular expression pattern to search for"}
                },
                "required": ["pattern"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "evaluate",
            "description": "Verify the hypothesis with atomic questions (required before submit)",
            "parameters": {
                "type": "object",
                "properties": {
                    "questions": {"type": "array", "items": {"type": "string"}, "maxItems": 3},
                    "reasoning": {"type": "string", "description": "Rationale, including quotes from the trace"}
                },
                "required": ["questions", "reasoning"]
            }
        }
    }
]
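
One possible way to keep the STM as structured data and render it into the `<stm>` block on every turn is sketched below. The field names follow the STM template and the How to Apply section; the `ShortTermMemory` and `Fault` classes, the `add_fault` helper, and `build_prompt` are hypothetical names introduced for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

EVIDENCE_LIMIT = 300  # the template asks for evidence of 300 characters or fewer


@dataclass
class Fault:
    step_id: int
    reason: str
    evidence: str


@dataclass
class ShortTermMemory:
    task_goal: str
    final_output: str = ""
    faults_identified_so_far: List[Fault] = field(default_factory=list)
    still_need: List[str] = field(default_factory=list)
    past_tool_calls: List[str] = field(default_factory=list)

    def add_fault(self, step_id: int, reason: str, evidence: str) -> None:
        """Record a fault, truncating the quoted evidence to the limit."""
        self.faults_identified_so_far.append(
            Fault(step_id, reason, evidence[:EVIDENCE_LIMIT])
        )

    def render(self) -> str:
        """Serialize the memory as JSON wrapped in <stm> tags."""
        body = json.dumps(asdict(self), indent=2, ensure_ascii=False)
        return f"<stm>\n{body}\n</stm>"


def build_prompt(recent_turns: str, stm: ShortTermMemory) -> str:
    """Append the rendered STM to the end of the prompt for the next turn."""
    return f"{recent_turns}\n\n{stm.render()}"
```

Because the STM is rebuilt from state on each turn, it stays intact even after the turns that produced its contents have been dropped from the conversation.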

Terminology

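Fault Attribution: The task of finding which agent, at which step, first went wrong when a multi-agent system fails. Comparable to finding the line where a bug originates when debugging code.

Trajectory: The full execution log recorded while an agent performs a task. A sequence holding the agent's input and output at each step, in order.

STM (Short-Term Memory): A notepad that SAFARI attaches to the prompt on every turn. It exists so that key findings are not lost when older conversation turns are pushed out of context.

Scaling Factor ς: The token position of the fault divided by the context window size. A value greater than 1 means the fault lies outside the context, which is the regime where existing methods cannot run.

Atomic Claim: A hypothesis broken down into a single verifiable question. Instead of 'Is this analysis correct?', it takes a concrete form such as 'Was the search tool actually called at step 3?'.

Attention Dilution: The drop in performance when an LLM is given text so long that it can no longer focus on the important information. Similar to forgetting the early chapters when reading a long book in one sitting.

Active Investigation: An active exploration style that queries only the needed parts through tools while testing hypotheses, rather than passively reading the whole document.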
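
The scaling factor definition above reduces to a single division. The helper below is a minimal sketch; the function names and the example token counts are illustrative assumptions, not values from the paper.

```python
def scaling_factor(fault_token_position: int, context_window_tokens: int) -> float:
    """Return ς: the fault's token position divided by the context window size."""
    if context_window_tokens <= 0:
        raise ValueError("context_window_tokens must be positive")
    return fault_token_position / context_window_tokens


def fault_is_outside_context(fault_token_position: int, context_window_tokens: int) -> bool:
    """True when ς > 1, i.e. full-context loading can never reach the fault."""
    return scaling_factor(fault_token_position, context_window_tokens) > 1
```

For example, a fault at token 5,000,000 with a 1,000,000-token window gives ς = 5, the extreme setting reported in the Evidence section.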

Original Abstract

As autonomous agents tackle increasingly complex multi-step, multi-agent tasks, their execution trajectories have scaled beyond the constraints of even the largest context windows. Current methods for effectively diagnosing agent failures load the full trajectory into an LLM's context window, which suffers from attention dilution and fails when agentic traces inevitably exceed context limits. To address this, we introduce SAFARI (Scaling long-horizon Agentic Fault AttRibution via active Investigation), a framework that replaces linear context loading with a tool-augmented diagnostic loop. By equipping LLMs with a specialized toolbox to read and search trajectory segments alongside a persistent Short-Term Memory (STM) for cross-turn reasoning, SAFARI effectively decouples diagnostic accuracy from architectural context limits. Our experiments demonstrate that SAFARI outperforms state-of-the-art results by 20% on the Who&When dataset within a 1M token budget, and by 19% on TRAIL GAIA subset on a 25K token budget. Most significantly, SAFARI maintains a 0.58 precision even when the target fault resides 5x beyond the model's native context window, a scenario where traditional evaluators fail entirely.