When Errors Become Narratives: A Longitudinal Taxonomy of Silent Failures in a Production LLM Agent Runtime | AI Paper Digest

TL;DR Highlight

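A paper that analyzes the 'fail-plausible' failure pattern, in which an LLM agent turns internal errors into plausible-looking fake analysis reports and delivers them to users, based on 22 real incidents over eight weeks.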

Who Should Read

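Backend and MLOps developers who run LLM agents or automation pipelines in production, especially engineers managing agent systems that connect scheduled jobs, tool calls, memory, and user push notifications.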

Core Mechanics

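The most dangerous failure of an LLM agent is not a crash but 'fail-plausible': the agent turns an internal error into fluent, plausible fake content and delivers it to the user. Example: after an HTTP 400 error page landed in a cache, the LLM generated a 'Hugging Face platform crisis' analysis report and sent it over WhatsApp.
Failures are classified by five mechanisms: (A) environment and platform quirks, (B) design-assumption mismatches, (C) error swallowing and dilution, (D) chained hallucination and fabrication (fail-plausible), (E) operational omission and forensic blind spots. Class D is a new kind of failure that exists only in LLM systems.
About 70% of the 22 incidents were found by a person reading the outputs, not by automated tests or governance audits. The incidents unfolded while all 4,286 unit tests were green.
Incident latency (13 hours to 60 days) correlates with failure mechanism, not code complexity. The longest-lived failures sat not in complex components but in the 'seams' between components.
The declarative governance layer prevented 0% of incidents ex ante but blocked 87% of regressions after the fact. Audits are regression-prevention engines, not prediction engines.
Defense matures in three stages: (1) point fix → (2) meta-rule (a named lesson) → (3) mechanism scanner (an automated check added to CI). Lessons that reached stage 3 had zero recurrences; those that stopped at stage 1 always recurred.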
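
The HTTP 400 example suggests one place to cut the chain: refuse to cache anything that is not a successful response, so an error page can never become LLM input. The following is a minimal Python sketch under that assumption. `FetchResult`, `FetchRejected`, `cache_if_valid`, and the dictionary cache are hypothetical names for illustration, not part of the paper's runtime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FetchResult:
    """Hypothetical container for one fetch attempt."""
    url: str
    status: int
    body: str


class FetchRejected(Exception):
    """Raised loudly instead of letting an error page reach the LLM."""


def cache_if_valid(cache: dict[str, str], result: FetchResult) -> None:
    """Store the body only for 2xx responses with non-empty content."""
    if not 200 <= result.status < 300:
        raise FetchRejected(f"{result.url}: HTTP {result.status}, not cached")
    if not result.body.strip():
        raise FetchRejected(f"{result.url}: empty body, not cached")
    cache[result.url] = result.body
```

Raising an exception rather than returning a flag is a deliberate choice in this sketch: a caller that forgets to check still fails loudly instead of passing stale or polluted content downstream.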

Evidence

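About 70% of the 22 incidents were discovered through direct user observation rather than automated monitoring. The ex-ante detection rate of unit tests and preflight checks was about 0%.
Retrospective audit of 15 incidents: ex-ante prevention 0/15 (0%), ex-post regression blocking 13/15 (87%), failures along an entirely unanticipated new dimension 12/15 (80%).
After the anti-fabrication guard was introduced, multi-step causal chains and 'therefore'-style claims in the targeted synthesis jobs fell by 53-92%, and use of tagged claims rose from zero to about 9 per day.
The longest-latency incident lasted 60 days (an EPERM error on an external SSD backup). The forensic tool itself was blocked by the macOS TCC sandbox and returned 'normal/empty' results, providing false reassurance.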
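
The forensic blind spot in the last item comes down to a probe that cannot tell "nothing there" from "not allowed to look". The sketch below shows one way to keep those states apart, assuming the denial surfaces in Python as a `PermissionError` (which is how `EPERM` and `EACCES` are mapped). `probe_directory` and its `lister` parameter are hypothetical; the paper does not describe its tooling at this level.

```python
import os
from typing import Callable, Iterable


def probe_directory(path: str, lister: Callable[[str], Iterable[str]] = os.listdir) -> str:
    """Return 'ok', 'empty', 'missing', or 'blocked: ...' for a directory.

    A denied read is reported as its own state, never as an empty result.
    """
    try:
        entries = list(lister(path))
    except PermissionError as exc:
        # EPERM/EACCES must surface explicitly, not as "no files found".
        return f"blocked: {exc}"
    except FileNotFoundError:
        return "missing"
    return "ok" if entries else "empty"
```

The `lister` argument exists only so the blocked path can be exercised in a test without changing real permissions.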

How to Apply

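Make sure the logging functions in your LLM pipeline do not write to stdout. Diagnostic output must go to stderr. A single line, `log() { echo ... >&2; }`, can break the entire hallucination chain.
Pin a 'user-view observation ritual' to the team calendar: 30 minutes once a week spent reading the outputs the agent sends to users. Its detection rate may be higher than that of the entire automation stack combined.
Add a 'context hygiene' layer to the context-assembly step before each LLM call that automatically strips alert messages, error logs, and summaries of unknown origin. Given a polluted context, the model works exactly as designed and still produces fake content.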
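
The stdout/stderr advice can be demonstrated in a few lines. This is a minimal Python sketch, not the paper's shell code: `produce_payload` stands in for a hypothetical job step whose stdout is captured as LLM input, and the two loggers differ only in which stream they write to.

```python
import io
import sys
from contextlib import redirect_stderr, redirect_stdout
from typing import Callable


def log_to_stdout(msg: str) -> None:
    print(f"[log] {msg}")  # the bug: diagnostics share the data channel


def log_to_stderr(msg: str) -> None:
    print(f"[log] {msg}", file=sys.stderr)  # diagnostics kept off the data channel


def produce_payload(log: Callable[[str], None]) -> None:
    """Hypothetical job step: logs a diagnostic, then emits its payload on stdout."""
    log("HTTP 400 while refreshing feed")
    print("3 new items")


def capture(log: Callable[[str], None]) -> tuple[str, str]:
    """Return (stdout_text, stderr_text); only stdout would be fed to the LLM."""
    out, err = io.StringIO(), io.StringIO()
    with redirect_stdout(out), redirect_stderr(err):
        produce_payload(log)
    return out.getvalue(), err.getvalue()
```

With `log_to_stdout`, the captured stdout contains the "HTTP 400" diagnostic alongside the payload; with `log_to_stderr`, the captured stdout holds only the payload line and the diagnostic stays on the error stream.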

Code Example

# Core patterns for preventing fail-plausible failures

# 1. Diagnostic output always goes to stderr (shell; breaks the hallucination chain)
log() {
  echo "[$(date)] $*" >&2  # never write diagnostics to stdout
}

# 2. Remove context pollution before calling the LLM (Python)
import re

def sanitize_context(messages: list[dict]) -> list[dict]:
    """Drop messages that look like alerts or error logs (pattern match only)."""
    ALERT_PATTERNS = [r'ALERT:', r'ERROR:', r'HTTP \d{3}', r'Exception:']
    cleaned = []
    for msg in messages:
        content = msg.get('content', '')
        if any(re.search(p, content) for p in ALERT_PATTERNS):
            continue  # alert traffic never enters the conversation context
        cleaned.append(msg)
    return cleaned

# 3. Anti-fabrication guard (a shared module injected into every LLM job)
ANTI_FABRICATION_PROMPT = """
Rules (non-negotiable):
1. Never invent facts, releases, or events not in the provided sources.
2. Every cross-domain claim must be tagged [strong-evidence] or [weak-association].
3. If context contains error codes or HTTP status messages, STOP and report context pollution.
4. Prohibited phrases: [paste actual fabricated strings from your own incidents here]
   e.g. 'Hugging Face platform crisis', 'v0.3.1 community release'
"""

# 4. Declared state vs. runtime state convergence check (guards against Class E)
def check_state_convergence(declared_jobs: list, actual_crontab: str) -> list[str]:
    """Check that every registered job appears in the actual crontab."""
    drifts = []
    for job in declared_jobs:
        # Substring match on the cron expression: a sketch, not a full crontab parser.
        if job['cron_expr'] not in actual_crontab:
            drifts.append(f"DRIFT: {job['name']} declared but not in crontab")
    return drifts

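# 5. Sabotage validation: confirm the guard itself is still alive
def validate_guard_fires():
    """Inject a deliberate violation and confirm the guard actually detects it."""
    # Catches dead guards that pass silently, like the 67 checks that exec'd an empty string.
    # inject_known_bad_pattern, run_guard, and revert_violation are placeholders for your own helpers.
    test_violation = inject_known_bad_pattern()
    try:
        result = run_guard(test_violation)
        assert result.fired, "Guard is DEAD - not detecting violations!"
    finally:
        revert_violation()  # always undo the injected violation, even if the assert fails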
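
Item 5 leaves its helpers undefined. As a self-contained illustration of the same idea, the sketch below runs sabotage validation against a toy guard. `context_guard`, `dead_guard`, `guard_is_alive`, and the two sample strings are hypothetical names and data, not the paper's code.

```python
import re
from typing import Callable

# Hypothetical samples; in practice use strings taken from your own incidents.
KNOWN_BAD = "HTTP 400 Bad Request"
KNOWN_GOOD = "Weekly summary: 3 new items"


def context_guard(text: str) -> bool:
    """Toy guard: fires (returns True) when the text contains an HTTP status line."""
    return re.search(r"HTTP \d{3}", text) is not None


def dead_guard(text: str) -> bool:
    """A guard that silently passes everything, like a check that runs an empty command."""
    return False


def guard_is_alive(guard: Callable[[str], bool]) -> bool:
    """A live guard must fire on the known-bad sample and stay quiet on the known-good one."""
    return guard(KNOWN_BAD) and not guard(KNOWN_GOOD)
```

Running `guard_is_alive` over every registered guard in CI would flag `dead_guard` while passing `context_guard`, which is the point of the technique: a green check proves nothing until the guard has been shown to turn red.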

Terminology

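fail-plausible: A failure mode in which the system does not hide an internal error but converts it into plausible false content and delivers it to the user. Unlike gray failure, which merely fools a detector, it actively hands the user a forged 'all is well' signal, which makes it more dangerous.

gray failure: A state in cloud systems where a component is actually broken but the failure detector reports 'healthy'. The application suffers while the monitoring tools see nothing.

silent failure: A failure whose error signal exists somewhere in the system but never reaches a human in actionable form. Every test is green, yet something is actually wrong.

sabotage validation: A technique that deliberately injects a violation to confirm that a guard (defensive check) actually works. The idea is that to know whether the fire extinguisher works, you have to light a fire.

declared state convergence: An engine that automatically checks and reconciles the 'declared state' in code or config (e.g. schedules registered in a job registry) against the actual runtime state (the real crontab). It replaces deployment steps that depend on human memory with a machine.

command substitution: A shell feature that captures a command's stdout into a variable using the `$(command)` form. In this paper, a logging function wrote to stdout, which is how error messages flowed into the LLM's input data.

trigger-amplifier-concealer: A three-layer structure for postmortems: trigger (the external event that struck the spark), amplifier (the architectural flaw that spread it), concealer (the absence that hid it). Fixing only the trigger is cosmetic; recurrence stops only when the amplifier and concealer are fixed.

TCC sandbox: macOS's Transparency, Consent, and Control security framework. Background processes such as cron lack Full Disk Access, so access to certain paths is blocked; the diagnostic tool was unaware of this and returned 'empty results'.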

Original Abstract

LLM agent systems increasingly run as long-lived autonomous runtimes: scheduling jobs, calling tools, maintaining memory, and pushing results to humans. We present a longitudinal study of silent failures in one such system: a personal-assistant agent runtime in continuous production since March 2026, with roughly 40 scheduled jobs, 8 LLM providers, a tool-governance proxy, and a knowledge-base memory plane, defended by 4,286 unit tests and 827 governance checks. Over eight weeks we documented 22 incidents with full root-cause postmortems, in which one meta-pattern -- a failure whose error signal never reaches a human in actionable form -- manifested at least 28 times. We derive a five-class, mechanism-oriented taxonomy: (A) environment and platform quirks, (B) design-assumption mismatches, (C) error swallowing and dilution, (D) chained hallucination and fabrication, (E) operational omission and forensic blind spots. Class D is unique to LLM systems and the most dangerous: the system does not merely fail to report an error -- the LLM transforms it into fluent, plausible narrative delivered to the user. We term this fail-plausible: gray failure's differential observability escalated -- the observer is not just blind, it is convincingly lied to by the failure itself. Three findings: about 70% of silent failures were caught by human user-view observation, not tests or audits; a retrospective audit of 15 incidents found 0% ex-ante prevention but 87% regression blocking -- audits are regression engines, not prediction engines; incident latency (13 hours to 60 days) tracks failure mechanism, not code complexity -- the longest-lived failures lived in the seams between components, where no test runs. We describe the resulting defense framework and distill design principles for agent systems whose failures are loud, attributable, and boring. All postmortems and artifacts are public.