ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection
TL;DR Highlight
A runtime security layer that blocks malicious commands based on rules whenever an LLM agent receives results from external tools.
Who Should Read
Developers deploying LLM agents using tools like AutoGPT, LangChain, and MCP. Teams operating pipelines where agents perform web searches, read files, and call external APIs.
Core Mechanics
- The main pathways for Indirect Prompt Injection (attacks that manipulate agents by hiding malicious commands in external content) are web/local content insertion, MCP server insertion, and skill file insertion.
- Existing defenses (RLHF alignment, StruQ protocol separation, CaMeL dual LLM) all require fine-tuning, infrastructure changes, or manual rule creation by experts, making them difficult to apply in practice.
- ClawGuard automatically generates access rules (Rtask) from the user's task goal before the agent makes its first tool call and obtains user confirmation — creating rules before malicious content is mixed in is key.
- Four components operate at every tool call boundary: Content Sanitizer (masking sensitive information), Rule Evaluator (whitelist/blacklist judgment), Skill Inspector (skill file risk assessment), and Approval Mechanism (user approval for ambiguous cases).
- If the Rule Evaluator returns an 'amb' (ambiguous) verdict, meaning the call matches neither the whitelist nor the blacklist, it requests user approval; calls containing obfuscation patterns such as Base64 encoding are automatically escalated to 'amb'.
- As a middleware approach that doesn't require modifying the model or infrastructure, it can be attached to any LLM backbone, such as DeepSeek-V3.2, GLM-5, Kimi-K2.5, MiniMax-M2.5, and Qwen3.5-397B.
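The obfuscation escalation described above can be sketched with a simple heuristic. This is a minimal illustration, not ClawGuard's actual implementation: the regex, length threshold, and function names are assumptions.

```python
import base64
import re

# Hypothetical heuristic: long Base64-looking tokens in tool arguments are
# treated as possible obfuscation. Threshold of 24 chars is illustrative.
BASE64_TOKEN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")

def looks_obfuscated(text: str) -> bool:
    """Return True if text contains a token that validly decodes as Base64."""
    for token in BASE64_TOKEN.findall(text):
        try:
            base64.b64decode(token, validate=True)
            return True
        except Exception:
            continue  # not real Base64, keep scanning
    return False

def escalate_verdict(verdict: str, args_text: str) -> str:
    """Escalate an 'allow' verdict to 'amb' when obfuscation is detected."""
    if verdict == "allow" and looks_obfuscated(args_text):
        return "amb"
    return verdict
```

A 'deny' verdict is left untouched here, since escalation only matters for calls that would otherwise pass silently.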
Evidence
- In the AgentDojo benchmark (160 tasks), the base model's ASR (attack success rate) was 0.6~3.1%, while ClawGuard achieved 0% ASR and 100% DSR (defense success rate) across all models.
- In the SkillInject benchmark (84 skill injection attacks), the base model ASR decreased from 26~48% to 4.8~14% with ClawGuard applied, a relative reduction of 50~84%. Task completion rate (CR) remained at 85~90%.
- In the MCPSafeBench benchmark (215 real MCP server attacks), the base model ASR decreased from 36.5~44.5% to 7.1~11% with ClawGuard applied, achieving a DSR of 74.9~75.8%.
- This contrasts with the prior AgentSafetyBench finding that none of 16 popular LLM agents achieved a safety score above 60%; taken together, these results support the need for rule-based boundary enforcement.
How to Apply
- In a LangChain or AutoGPT pipeline, insert ClawGuard's Rule Evaluator as middleware immediately before tool calls — it can be implemented as an interceptor pattern that checks if the command is on the whitelist before the tool is executed.
- At the start of an agent session, insert the user's task description (e.g., 'Summarize the three most recent blog posts from example-research.org and save to ~/reports/') into the Rule Synthesis Prompt in Figure 4 to automatically generate JSON rules, obtain user confirmation, and apply them throughout the session.
- Hardcoding the Rbase basic rules from Table V (~/.ssh/ access blocking, rm -rf blocking, ngrok and other tunneling service blocking, etc.) into the tool call validation logic of your own agent can establish a minimum security baseline without fine-tuning.
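The baseline check in the last bullet can be sketched as follows. The tool names and patterns below are illustrative stand-ins for the paper's Table V rules, which are only partially quoted above.

```python
import fnmatch

# Illustrative baseline modeled on the Rbase examples quoted above
# (~/.ssh/ access, rm -rf, tunneling services); extend for real use.
RBASE_BLOCKED_PATHS = ["~/.ssh/*", "~/.aws/*", "/etc/shadow"]
RBASE_BLOCKED_COMMANDS = ["rm -rf *", "*ngrok*", "*cloudflared tunnel*"]

def violates_rbase(tool_name: str, argument: str) -> bool:
    """Check one tool call against the hardcoded security baseline."""
    if tool_name in ("read", "write", "delete"):
        return any(fnmatch.fnmatch(argument, p) for p in RBASE_BLOCKED_PATHS)
    if tool_name == "shell":
        return any(fnmatch.fnmatch(argument, p) for p in RBASE_BLOCKED_COMMANDS)
    return False
```

Calling this before every tool execution gives the fine-tuning-free minimum baseline the bullet describes; task-specific Rtask rules would layer on top.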
Code Example
# ClawGuard Rule Synthesis Prompt example (based on Figure 4)
# Automatically generate rules from the user's task before the agent session starts
SYSTEM_PROMPT = """
You are a security policy synthesizer for an LLM agent runtime.
Given the user's task description, produce a minimal, precise rule set
in valid JSON that restricts the agent to actions necessary for the stated task.
Do not infer permissions not required by the task.
Output only the JSON object; no prose.
"""

USER_TASK = "Summarize the three most recent blog posts from example-research.org and save to ~/reports/summary.md"

TASK_PROMPT = f"""
Based solely on the task described below, produce a JSON object with:
- network_rules: {{"whitelist": [...], "blacklist": [...]}}
- file_rules: {{"whitelist": [...], "blacklist": [...]}}
- command_rules: {{
    "framework_tools": {{"allow": [...], "deny": [...]}},
    "shell_commands": {{"allow": [...], "deny": [...]}},
    "queue": [...]
  }}
Task: {USER_TASK}
Apply the principle of least privilege.
"""
# Expected output example
expected_rules = {
    "network_rules": {
        "whitelist": ["example-research.org"],
        "blacklist": ["*.onion", "*.ngrok.io"]
    },
    "file_rules": {
        "whitelist": ["~/reports/"],
        "blacklist": ["~/.ssh/", "~/.aws/", "/etc/"]
    },
    "command_rules": {
        "framework_tools": {
            "allow": ["web_fetch", "write"],
            "deny": ["exec", "read"]
        },
        "shell_commands": {
            "allow": [],
            "deny": ["rm", "curl", "wget", "bash"]
        },
        "queue": ["file_deletion", "network_write"]
    }
}
# Tool call interceptor pattern (pseudo-code)
def intercept_tool_call(tool_name, args, rules):
    # 1. Content Sanitizer: mask AWS keys, SSH keys, etc.
    sanitized_args = sanitize_sensitive_data(args)
    # 2. Rule Evaluator: verdict is 'allow' | 'deny' | 'amb'
    verdict = evaluate_against_rules(tool_name, sanitized_args, rules)
    if verdict == 'deny':
        log_and_block(tool_name, sanitized_args)
        return None
    elif verdict == 'amb':
        user_decision = request_user_approval(tool_name, sanitized_args)
        if user_decision != 'approve':
            return None
    # 3. Execute, then mask sensitive information in the result as well
    result = execute_tool(tool_name, sanitized_args)
    return sanitize_output(result)
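The sanitize_sensitive_data helper referenced in the interceptor is not defined in the excerpt. A minimal sketch, assuming simple regex masking of common credential formats (the patterns and replacement labels are illustrative, not ClawGuard's actual pattern set):

```python
import re

# Illustrative patterns for secrets the Content Sanitizer might mask;
# a real deployment would need a broader, tested pattern set.
SECRET_PATTERNS = [
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[MASKED_AWS_KEY]"),
    (re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----[\s\S]*?"
                r"-----END [A-Z ]*PRIVATE KEY-----"), "[MASKED_PRIVATE_KEY]"),
    (re.compile(r"ghp_[A-Za-z0-9]{36}"), "[MASKED_GITHUB_TOKEN]"),
]

def sanitize_sensitive_data(text: str) -> str:
    """Mask known credential formats before rule evaluation and logging."""
    for pattern, replacement in SECRET_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Masking before rule evaluation matters because it keeps secrets out of both the approval prompts shown to the user and any audit logs the framework writes.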
Related Papers
Show HN: adamsreview – better multi-agent PR reviews for Claude Code
An open-source plugin for Claude Code in which up to 7 parallel sub-agents each review a PR from a different perspective and even apply automatic fixes. It claims to catch more real bugs than the built-in /review or CodeRabbit, though the community has voiced skepticism about its complexity and practical value.
How Fast Does Claude, Acting as a User Space IP Stack, Respond to Pings?
An experiment that had Claude Code parse raw IP packets and construct ICMP echo replies so that it actually responds to pings; an entertaining case that pushes the idea of "Markdown is the code and the LLM is the processor" all the way down to the network stack.
Show HN: Git for AI Agents
A version control tool that automatically tracks every tool call made by AI coding agents (such as Claude Code) and even supports blame, showing which prompt wrote which line of code.
Principles for agent-native CLIs
An article laying out principles for designing CLI tools that AI agents can use effectively; as agents invoke CLIs ever more frequently, this design approach is becoming practically important.
Agent-harness-kit scaffolding for multi-agent workflows (MCP, provider-agnostic)
A scaffolding tool that orchestrates multiple AI agents so they can divide roles and collaborate; like Vite, it lets you assemble a multi-agent pipeline quickly with zero configuration.
Show HN: Tilde.run – Agent sandbox with a transactional, versioned filesystem
A tool providing an isolated sandbox where AI agents can touch real production data and still roll back; it unifies GitHub, S3, and Google Drive into a single version-controlled filesystem.
Original Abstract
Tool-augmented Large Language Model (LLM) agents have demonstrated impressive capabilities in automating complex, multi-step real-world tasks, yet remain vulnerable to indirect prompt injection. Adversaries exploit this weakness by embedding malicious instructions within tool-returned content, which agents directly incorporate into their conversation history as trusted observations. This vulnerability manifests across three primary attack channels: web and local content injection, MCP server injection, and skill file injection. To address these vulnerabilities, we introduce ClawGuard, a novel runtime security framework that enforces a user-confirmed rule set at every tool-call boundary, transforming unreliable alignment-dependent defense into a deterministic, auditable mechanism that intercepts adversarial tool calls before any real-world effect is produced. By automatically deriving task-specific access constraints from the user's stated objective prior to any external tool invocation, ClawGuard blocks all three injection pathways without model modification or infrastructure change. Experiments across five state-of-the-art language models on AgentDojo, SkillInject, and MCPSafeBench demonstrate that ClawGuard achieves robust protection against indirect prompt injection without compromising agent utility. This work establishes deterministic tool-call boundary enforcement as an effective defense mechanism for secure agentic AI systems, requiring neither safety-specific fine-tuning nor architectural modification. Code is publicly available at https://github.com/Claw-Guard/ClawGuard.