ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection
TL;DR Highlight
A runtime security layer that blocks malicious commands based on rules whenever an LLM agent receives results from external tools.
Who Should Read
Developers deploying LLM agents using tools like AutoGPT, LangChain, and MCP. Teams operating pipelines where agents perform web searches, read files, and call external APIs.
Core Mechanics
- The main pathways for Indirect Prompt Injection (attacks that manipulate agents by hiding malicious commands in external content) are web/local content insertion, MCP server insertion, and skill file insertion.
- Existing defenses (RLHF alignment, StruQ protocol separation, CaMeL dual LLM) all require fine-tuning, infrastructure changes, or manual rule creation by experts, making them difficult to apply in practice.
- ClawGuard automatically generates access rules (Rtask) from the user's task goal before the agent makes its first tool call and obtains user confirmation — creating rules before malicious content is mixed in is key.
- Four components operate at every tool call boundary: Content Sanitizer (masking sensitive information), Rule Evaluator (whitelist/blacklist judgment), Skill Inspector (skill file risk assessment), and Approval Mechanism (user approval for ambiguous cases).
- If the Rule Evaluator returns an 'amb' (ambiguous) verdict, meaning the call matches neither the whitelist nor the blacklist, it requests user approval; calls containing obfuscation patterns such as Base64 encoding are automatically escalated to 'amb'.
- As a middleware approach that doesn't require modifying the model or infrastructure, it can be attached to any LLM backbone, such as DeepSeek-V3.2, GLM-5, Kimi-K2.5, MiniMax-M2.5, and Qwen3.5-397B.
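The obfuscation escalation described above can be sketched with a simple heuristic. This is a minimal illustration, not ClawGuard's actual implementation: the regex, length threshold, and function names are assumptions.

```python
import base64
import re

# Hypothetical heuristic: long Base64-looking tokens in tool arguments are
# treated as possible obfuscation. Threshold of 24 chars is illustrative.
BASE64_TOKEN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")

def looks_obfuscated(text: str) -> bool:
    """Return True if text contains a token that validly decodes as Base64."""
    for token in BASE64_TOKEN.findall(text):
        try:
            base64.b64decode(token, validate=True)
            return True
        except Exception:
            continue  # not real Base64, keep scanning
    return False

def escalate_verdict(verdict: str, args_text: str) -> str:
    """Escalate an 'allow' verdict to 'amb' when obfuscation is detected."""
    if verdict == "allow" and looks_obfuscated(args_text):
        return "amb"
    return verdict
```

A 'deny' verdict is left untouched here, since escalation only matters for calls that would otherwise pass silently.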
Evidence
- In the AgentDojo benchmark (160 tasks), the base model's ASR (attack success rate) was 0.6~3.1%, while ClawGuard achieved 0% ASR and 100% DSR (defense success rate) across all models.
- In the SkillInject benchmark (84 skill injection attacks), the base model ASR decreased from 26~48% to 4.8~14% with ClawGuard applied, a relative reduction of 50~84%. Task completion rate (CR) remained at 85~90%.
- In the MCPSafeBench benchmark (215 real MCP server attacks), the base model ASR decreased from 36.5~44.5% to 7.1~11% with ClawGuard applied, achieving a DSR of 74.9~75.8%.
- This contrasts with the prior AgentSafetyBench finding that none of 16 popular LLM agents achieved a safety score above 60%; taken together, these results support the need for rule-based boundary enforcement.
How to Apply
- In a LangChain or AutoGPT pipeline, insert ClawGuard's Rule Evaluator as middleware immediately before tool calls — it can be implemented as an interceptor pattern that checks if the command is on the whitelist before the tool is executed.
- At the start of an agent session, insert the user's task description (e.g., 'Summarize the three most recent blog posts from example-research.org and save to ~/reports/') into the Rule Synthesis Prompt in Figure 4 to automatically generate JSON rules, obtain user confirmation, and apply them throughout the session.
- Hardcoding the Rbase basic rules from Table V (~/.ssh/ access blocking, rm -rf blocking, ngrok and other tunneling service blocking, etc.) into the tool call validation logic of your own agent can establish a minimum security baseline without fine-tuning.
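The baseline check in the last bullet can be sketched as follows. The tool names and patterns below are illustrative stand-ins for the paper's Table V rules, which are only partially quoted above.

```python
import fnmatch

# Illustrative baseline modeled on the Rbase examples quoted above
# (~/.ssh/ access, rm -rf, tunneling services); extend for real use.
RBASE_BLOCKED_PATHS = ["~/.ssh/*", "~/.aws/*", "/etc/shadow"]
RBASE_BLOCKED_COMMANDS = ["rm -rf *", "*ngrok*", "*cloudflared tunnel*"]

def violates_rbase(tool_name: str, argument: str) -> bool:
    """Check one tool call against the hardcoded security baseline."""
    if tool_name in ("read", "write", "delete"):
        return any(fnmatch.fnmatch(argument, p) for p in RBASE_BLOCKED_PATHS)
    if tool_name == "shell":
        return any(fnmatch.fnmatch(argument, p) for p in RBASE_BLOCKED_COMMANDS)
    return False
```

Calling this before every tool execution gives the fine-tuning-free minimum baseline the bullet describes; task-specific Rtask rules would layer on top.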
Code Example
# ClawGuard Rule Synthesis Prompt example (based on Figure 4)
# Automatically generate rules from the user's task before the agent session starts
SYSTEM_PROMPT = """
You are a security policy synthesizer for an LLM agent runtime.
Given the user's task description, produce a minimal, precise rule set
in valid JSON that restricts the agent to actions necessary for the stated task.
Do not infer permissions not required by the task.
Output only the JSON object; no prose.
"""

USER_TASK = "Summarize the three most recent blog posts from example-research.org and save to ~/reports/summary.md"

TASK_PROMPT = f"""
Based solely on the task described below, produce a JSON object with:
- network_rules: {{"whitelist": [...], "blacklist": [...]}}
- file_rules: {{"whitelist": [...], "blacklist": [...]}}
- command_rules: {{
    "framework_tools": {{"allow": [...], "deny": [...]}},
    "shell_commands": {{"allow": [...], "deny": [...]}},
    "queue": [...]
  }}
Task: {USER_TASK}
Apply the principle of least privilege.
"""
# Expected output example
expected_rules = {
    "network_rules": {
        "whitelist": ["example-research.org"],
        "blacklist": ["*.onion", "*.ngrok.io"]
    },
    "file_rules": {
        "whitelist": ["~/reports/"],
        "blacklist": ["~/.ssh/", "~/.aws/", "/etc/"]
    },
    "command_rules": {
        "framework_tools": {
            "allow": ["web_fetch", "write"],
            "deny": ["exec", "read"]
        },
        "shell_commands": {
            "allow": [],
            "deny": ["rm", "curl", "wget", "bash"]
        },
        "queue": ["file_deletion", "network_write"]
    }
}
# Tool call interceptor pattern (pseudo-code)
def intercept_tool_call(tool_name, args, rules):
    # 1. Content Sanitizer: mask AWS keys, SSH keys, etc.
    sanitized_args = sanitize_sensitive_data(args)
    # 2. Rule Evaluator: verdict is 'allow' | 'deny' | 'amb'
    verdict = evaluate_against_rules(tool_name, sanitized_args, rules)
    if verdict == 'deny':
        log_and_block(tool_name, sanitized_args)
        return None
    elif verdict == 'amb':
        user_decision = request_user_approval(tool_name, sanitized_args)
        if user_decision != 'approve':
            return None
    # 3. Execute, then mask sensitive information in the result as well
    result = execute_tool(tool_name, sanitized_args)
    return sanitize_output(result)
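The sanitize_sensitive_data helper referenced in the interceptor is not defined in the excerpt. A minimal sketch, assuming simple regex masking of common credential formats (the patterns and replacement labels are illustrative, not ClawGuard's actual pattern set):

```python
import re

# Illustrative patterns for secrets the Content Sanitizer might mask;
# a real deployment would need a broader, tested pattern set.
SECRET_PATTERNS = [
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[MASKED_AWS_KEY]"),
    (re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----[\s\S]*?"
                r"-----END [A-Z ]*PRIVATE KEY-----"), "[MASKED_PRIVATE_KEY]"),
    (re.compile(r"ghp_[A-Za-z0-9]{36}"), "[MASKED_GITHUB_TOKEN]"),
]

def sanitize_sensitive_data(text: str) -> str:
    """Mask known credential formats before rule evaluation and logging."""
    for pattern, replacement in SECRET_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Masking before rule evaluation matters because it keeps secrets out of both the approval prompts shown to the user and any audit logs the framework writes.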
Related Papers
Show HN: adamsreview – better multi-agent PR reviews for Claude Code
An open-source plugin for Claude Code in which up to 7 parallel sub-agents each review a PR from a different perspective and even apply automatic fixes. It claims to catch more real bugs than the built-in /review or CodeRabbit, though the community has voiced skepticism about its complexity and practical value.
How Fast Does Claude, Acting as a User Space IP Stack, Respond to Pings?
An experiment that had Claude Code parse raw IP packets and construct ICMP echo replies so that it actually responds to pings; an entertaining case that pushes the idea of "Markdown is the code and the LLM is the processor" all the way down to the network stack.
Show HN: Git for AI Agents
A version control tool that automatically tracks every tool call made by AI coding agents (such as Claude Code) and even supports blame, showing which prompt wrote which line of code.
Principles for agent-native CLIs
An article laying out principles for designing CLI tools that AI agents can use effectively; as agents invoke CLIs ever more frequently, this design approach is becoming practically important.
Agent-harness-kit scaffolding for multi-agent workflows (MCP, provider-agnostic)
A scaffolding tool that orchestrates multiple AI agents so they can divide roles and collaborate; like Vite, it lets you assemble a multi-agent pipeline quickly with zero configuration.
Show HN: Tilde.run – Agent sandbox with a transactional, versioned filesystem
A tool providing an isolated sandbox where AI agents can touch real production data and still roll back; it unifies GitHub, S3, and Google Drive into a single version-controlled filesystem.
Original Abstract
Tool-augmented Large Language Model (LLM) agents have demonstrated impressive capabilities in automating complex, multi-step real-world tasks, yet remain vulnerable to indirect prompt injection. Adversaries exploit this weakness by embedding malicious instructions within tool-returned content, which agents directly incorporate into their conversation history as trusted observations. This vulnerability manifests across three primary attack channels: web and local content injection, MCP server injection, and skill file injection. To address these vulnerabilities, we introduce ClawGuard, a novel runtime security framework that enforces a user-confirmed rule set at every tool-call boundary, transforming unreliable alignment-dependent defense into a deterministic, auditable mechanism that intercepts adversarial tool calls before any real-world effect is produced. By automatically deriving task-specific access constraints from the user's stated objective prior to any external tool invocation, ClawGuard blocks all three injection pathways without model modification or infrastructure change. Experiments across five state-of-the-art language models on AgentDojo, SkillInject, and MCPSafeBench demonstrate that ClawGuard achieves robust protection against indirect prompt injection without compromising agent utility. This work establishes deterministic tool-call boundary enforcement as an effective defense mechanism for secure agentic AI systems, requiring neither safety-specific fine-tuning nor architectural modification. Code is publicly available at https://github.com/Claw-Guard/ClawGuard.