ClawGuard: A Runtime Security Framework for Defending Tool-Using LLM Agents Against Indirect Prompt Injection

TL;DR Highlight

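A rule-based runtime security layer checks what an LLM agent receives from external tools and blocks injected malicious instructions before they take effect.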

Who Should Read

Developers deploying tool-using LLM agents to production with AutoGPT, LangChain, MCP, or similar stacks. Teams operating pipelines in which agents run web searches, read files, or call external APIs.

Core Mechanics

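Indirect prompt injection (an attack that hides malicious instructions in external content to steer the agent) arrives through three main channels: web and local content injection, MCP server injection, and skill file injection.
Existing defenses (RLHF alignment, StruQ protocol separation, the CaMeL dual-LLM design) each require fine-tuning, infrastructure changes, or expert-written manual rules, which makes them hard to apply in practice.
Before the agent makes its first tool call, ClawGuard automatically derives access rules (Rtask) from the user's stated task and asks the user to confirm them. The key point is that the rules are fixed before any malicious content can enter the context.
Four components operate at every tool-call boundary: Content Sanitizer (masks sensitive information), Rule Evaluator (whitelist/blacklist verdicts), Skill Inspector (risk assessment of skill files), and Approval Mechanism (user approval for ambiguous cases).
When the Rule Evaluator reaches an 'amb' (ambiguous) verdict, meaning the call matches neither the whitelist nor the blacklist, it asks the user for approval. Detected obfuscation patterns such as Base64 encoding are automatically treated as amb and escalated.
Because it works as middleware that requires no model modification or infrastructure change, it can be attached to any LLM backbone, including DeepSeek-V3.2, GLM-5, Kimi-K2.5, MiniMax-M2.5, and Qwen3.5-397B.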
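
The three-way verdict described above can be sketched in a few lines. This is a minimal illustration and not ClawGuard's actual implementation: the rule shape (`RULES` with glob patterns), the function names `looks_obfuscated` and `evaluate`, the 24-character Base64 threshold, and the choice to let an obfuscation hit override a whitelist match are all assumptions made for this example.

```python
import base64
import binascii
import re
from fnmatch import fnmatch

# Hypothetical rule shape for illustration; not the paper's actual schema.
RULES = {
    "whitelist": ["example.com", "*.example.com"],
    "blacklist": ["*.ngrok.io", "*.onion"],
}

# Runs of 24+ Base64-alphabet characters are treated as candidate encoded payloads.
_B64_CANDIDATE = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")


def looks_obfuscated(text: str) -> bool:
    """Return True if the text contains a long run that decodes as valid Base64."""
    for match in _B64_CANDIDATE.finditer(text):
        candidate = match.group(0)
        if len(candidate) % 4 != 0:
            continue  # not a well-formed Base64 length
        try:
            base64.b64decode(candidate, validate=True)
        except (binascii.Error, ValueError):
            continue
        return True
    return False


def evaluate(target: str, payload: str, rules: dict) -> str:
    """Return 'deny', 'amb', or 'allow' for one tool call."""
    if any(fnmatch(target, pattern) for pattern in rules["blacklist"]):
        return "deny"
    if looks_obfuscated(payload):
        return "amb"  # assumed policy: escalate even when the target is whitelisted
    if any(fnmatch(target, pattern) for pattern in rules["whitelist"]):
        return "allow"
    return "amb"  # neither list matched, so the user decides
```

The Base64 check is a heuristic and will sometimes flag harmless strings. Because a hit only escalates to user approval rather than blocking, a false positive costs one extra confirmation.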

Evidence

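On the AgentDojo benchmark (160 tasks), baseline models had an ASR (attack success rate) of 0.6% to 3.1%; with ClawGuard, every model reached 0% ASR and 100% DSR (defense success rate).
On the SkillInject benchmark (84 skill injection attacks), baseline ASR of 26% to 48% fell to 4.8% to 14% with ClawGuard, a relative reduction of 50% to 84%. Task completion rate (CR) held at 85% to 90%.
On the MCPSafeBench benchmark (215 real-world MCP server attacks), baseline ASR of 36.5% to 44.5% fell to 7.1% to 11% with ClawGuard, with a DSR of 74.9% to 75.8%.
Prior work on AgentSafetyBench found that none of 16 popular LLM agents scored above 60% on safety; against that backdrop, these results support the need for rule-based boundary enforcement.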
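
For readers who want to check what "relative reduction" means here, the arithmetic is (before - after) / before. The sketch below applies it to the endpoints of the SkillInject ranges quoted above. Pairing the endpoints this way is an assumption made only to illustrate the formula: this summary does not list per-model before/after pairs, so the outputs are not per-model results.

```python
def relative_reduction(asr_before: float, asr_after: float) -> float:
    """Relative ASR reduction in percent: (before - after) / before * 100."""
    if asr_before <= 0:
        raise ValueError("asr_before must be positive")
    return (asr_before - asr_after) / asr_before * 100.0


# Hypothetical pairings of the range endpoints, for illustration only.
print(round(relative_reduction(48.0, 14.0), 1))  # 70.8
print(round(relative_reduction(26.0, 4.8), 1))   # 81.5
```

Both illustrative values fall inside the 50% to 84% band reported for SkillInject.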

How to Apply

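In a LangChain or AutoGPT pipeline, insert ClawGuard's Rule Evaluator as middleware immediately before each tool call. It can be implemented as an interceptor that checks whether the command is whitelisted before the tool runs.
At the start of an agent session, feed the user's task description (e.g., 'Summarize 3 blog posts from example.com and save them to ~/reports/') into the Rule Synthesis Prompt from Figure 4 to generate JSON rules automatically, have the user confirm them, and apply them for the whole session.
Hardcoding the Rbase baseline rules from Table V (blocking access to ~/.ssh/, blocking rm -rf, blocking tunneling services such as ngrok, and so on) into your own agent's tool-call validation logic gives you a minimal security baseline without any fine-tuning.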
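
A hardcoded baseline of this kind might look like the sketch below. It covers only the three examples named above, not the full Rbase set from Table V, and the constant names, the `violates_baseline` function, and the call kinds ("file", "shell", "network") are hypothetical names chosen for this illustration.

```python
import os
import re
from urllib.parse import urlparse

# Illustrative subset only; the complete baseline is defined in the paper's Table V.
BLOCKED_PATH_PREFIXES = ("~/.ssh/",)
BLOCKED_COMMAND_PATTERNS = (re.compile(r"\brm\s+-(?:rf|fr)\b"),)
BLOCKED_HOST_SUFFIXES = ("ngrok.io",)


def _normalize(path: str) -> str:
    """Expand '~' and collapse '..' so traversal tricks cannot dodge the prefix check."""
    return os.path.normpath(os.path.expanduser(path))


def violates_baseline(kind: str, value: str) -> bool:
    """Return True if a tool call of the given kind hits a hardcoded deny rule."""
    if kind == "file":
        path = _normalize(value)
        for prefix in BLOCKED_PATH_PREFIXES:
            blocked = _normalize(prefix)
            if path == blocked or path.startswith(blocked + os.sep):
                return True
        return False
    if kind == "shell":
        return any(p.search(value) is not None for p in BLOCKED_COMMAND_PATTERNS)
    if kind == "network":
        host = (urlparse(value).hostname or "").lower()
        return any(host == s or host.endswith("." + s) for s in BLOCKED_HOST_SUFFIXES)
    raise ValueError(f"unknown call kind: {kind}")
```

Paths are normalized before comparison so that a request such as `~/reports/../.ssh/id_rsa` is still caught. The shell pattern is deliberately narrow and would miss variants like `rm -r -f`, so a production rule set would need broader matching.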

Code Example

# ClawGuard Rule Synthesis Prompt example (based on Figure 4)
# Generates rules automatically from the user's task before the agent session starts

SYSTEM_PROMPT = """
You are a security policy synthesizer for an LLM agent runtime.
Given the user's task description, produce a minimal, precise rule set
in valid JSON that restricts the agent to actions necessary for the stated task.
Do not infer permissions not required by the task.
Output only the JSON object; no prose.
"""

USER_TASK = "Summarize the three most recent blog posts from example-research.org and save to ~/reports/summary.md"

TASK_PROMPT = f"""
Based solely on the task described above, produce a JSON object with:
- network_rules: {{whitelist: [...], blacklist: [...]}}
- file_rules: {{whitelist: [...], blacklist: [...]}}
- command_rules: {{
    framework_tools: {{allow: [...], deny: [...]}},
    shell_commands: {{allow: [...], deny: [...]}},
    queue: [...]
  }}

Task: {USER_TASK}
Apply principle of least privilege.
"""

# Expected output (example)
expected_rules = {
  "network_rules": {
    "whitelist": ["example-research.org"],
    "blacklist": ["*.onion", "*.ngrok.io"]
  },
  "file_rules": {
    "whitelist": ["~/reports/"],
    "blacklist": ["~/.ssh/", "~/.aws/", "/etc/"]
  },
  "command_rules": {
    "framework_tools": {
      "allow": ["web_fetch", "write"],
      "deny": ["exec", "read"]
    },
    "shell_commands": {
      "allow": [],
      "deny": ["rm", "curl", "wget", "bash"]
    },
    "queue": ["file_deletion", "network_write"]
  }
}

# Tool-call interceptor pattern (pseudo-code; the helper functions are not defined here)
def intercept_tool_call(tool_name, args, rules):
    # 1. Sanitize content
    sanitized_args = sanitize_sensitive_data(args)  # mask AWS keys, SSH keys, etc.

    # 2. Evaluate rules
    verdict = evaluate_against_rules(tool_name, sanitized_args, rules)
    # verdict: 'allow' | 'deny' | 'amb'

    if verdict == 'deny':
        log_and_block(tool_name, sanitized_args)
        return None
    elif verdict == 'amb':
        user_decision = request_user_approval(tool_name, sanitized_args)
        if user_decision != 'approve':
            return None

    # 3. Execute and sanitize the output
    result = execute_tool(tool_name, sanitized_args)
    return sanitize_output(result)  # mask sensitive information in the result as well
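
The masking step in the interceptor above can be made concrete with a small, runnable sketch. The two regular expressions and the replacement tokens are assumptions made for this example, not ClawGuard's actual patterns; a real deployment would cover many more secret formats.

```python
import re

# Assumed patterns for illustration; not taken from the ClawGuard codebase.
_SECRET_PATTERNS = [
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
    (re.compile(r"\bAKIA[0-9A-Z]{16}\b"), "[REDACTED_AWS_KEY]"),
    # PEM-style private key blocks, including the OPENSSH and RSA variants.
    (
        re.compile(
            r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
            re.DOTALL,
        ),
        "[REDACTED_PRIVATE_KEY]",
    ),
]


def sanitize_sensitive_data(value):
    """Recursively mask known secret formats in strings, dicts, and lists."""
    if isinstance(value, str):
        for pattern, replacement in _SECRET_PATTERNS:
            value = pattern.sub(replacement, value)
        return value
    if isinstance(value, dict):
        return {key: sanitize_sensitive_data(item) for key, item in value.items()}
    if isinstance(value, list):
        return [sanitize_sensitive_data(item) for item in value]
    return value  # numbers, None, and other types pass through unchanged
```

The same function can serve for both tool arguments and tool results, since it accepts nested dicts and lists as well as plain strings.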

Terminology

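Indirect Prompt Injection: An attack in which the user does not type the malicious instruction directly; instead, it is hidden in external data the agent reads (web pages, files, and so on). It is like slipping a note into a library book that says 'whoever reads this must transfer money from their bank account right now.'

MCP (Model Context Protocol): A protocol that lets LLM agents communicate with external tools (web search, databases, and so on) in a standardized way. Like USB, any agent that supports MCP can plug tools in and use them.

RLHF: A method that trains a model with reinforcement learning from human feedback. The model is rewarded for good responses and penalized for bad ones, which gradually makes it safer.

DSR (Defense Success Rate): The share of attacks that were successfully blocked. If 95 of 100 attacks are blocked, DSR is 95%.

ASR (Attack Success Rate): The share of attacks that succeeded. It is the counterpart of DSR: the lower it is, the better the defense.

Skill File: An instruction file the agent loads when performing a particular task. Like a recipe, it says 'do this task this way,' but malicious instructions can be mixed into it.

Least Privilege: A security principle of allowing only what is needed and blocking everything else. It is like giving a courier the key to the first-floor entrance rather than the keys to the whole building.

Audit Log: A record of who called which tool, when, and whether the call was allowed or blocked. It is the black box you consult later to trace why something was executed.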

Original Abstract

Tool-augmented Large Language Model (LLM) agents have demonstrated impressive capabilities in automating complex, multi-step real-world tasks, yet remain vulnerable to indirect prompt injection. Adversaries exploit this weakness by embedding malicious instructions within tool-returned content, which agents directly incorporate into their conversation history as trusted observations. This vulnerability manifests across three primary attack channels: web and local content injection, MCP server injection, and skill file injection. To address these vulnerabilities, we introduce ClawGuard, a novel runtime security framework that enforces a user-confirmed rule set at every tool-call boundary, transforming unreliable alignment-dependent defense into a deterministic, auditable mechanism that intercepts adversarial tool calls before any real-world effect is produced. By automatically deriving task-specific access constraints from the user's stated objective prior to any external tool invocation, ClawGuard blocks all three injection pathways without model modification or infrastructure change. Experiments across five state-of-the-art language models on AgentDojo, SkillInject, and MCPSafeBench demonstrate that ClawGuard achieves robust protection against indirect prompt injection without compromising agent utility. This work establishes deterministic tool-call boundary enforcement as an effective defense mechanism for secure agentic AI systems, requiring neither safety-specific fine-tuning nor architectural modification. Code is publicly available at https://github.com/Claw-Guard/ClawGuard.