Towards Verifiably Safe Tool Use for LLM Agents
TL;DR Highlight
Proposes a design methodology that blocks sensitive data leaks and unsafe behaviors in LLM agent tool calls using mathematically guaranteed rules instead of probabilistic filters
Who Should Read
Backend/AI developers building agents with MCP or LangChain who worry about sensitive data leaks and unsafe tool combinations. Senior developers and architects designing agent safety architectures in enterprise environments.
Core Mechanics
- Most agent incidents stem not from individual tool bugs but from unexpected data flows when multiple tools are combined — in a GitHub MCP vulnerability case, the combination of file-read + public-commit tools leaked private repository info
- ML-based guardrails like GuardAgent, ShieldAgent, and TrustAgent only reduce risk probabilistically, so a persistent attacker can break through with a single attack tailored to the defense characteristics
- Applied STPA (a system safety analysis method from aviation and autonomous driving) to LLM agents to proactively identify risks in order: stakeholders → losses → hazardous actions → safety requirements
- Applied IFC (Information Flow Control — tracking and controlling where data flows) at MCP tool boundaries to deterministically block safety violations
- Proposed extending MCP to require mandatory tags on each tool: capabilities (read/write/execute), confidentiality (public/sensitive), and trust_level — current MCP treats this info as optional and untrusted
- A 4-tier enforcement structure — Blocklist (always block) / Mustlist (must execute) / Allowlist (auto-allow) / Confirmation (user approval) — flexibly balances safety and agent autonomy
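The 4-tier structure can be sketched as a small decision function. This is an illustrative sketch, not the paper's implementation: the `Tier` enum, `POLICY` table, and the (confidentiality, capability) lookup key are all hypothetical names, assuming each tool call carries the labels described above.

```python
from enum import Enum

class Tier(Enum):
    BLOCKLIST = "blocklist"        # always block the call
    MUSTLIST = "mustlist"          # must execute (e.g., mandatory notification)
    ALLOWLIST = "allowlist"        # auto-allow without asking
    CONFIRMATION = "confirmation"  # pause and ask the user

# Hypothetical policy table mapping a data flow to an enforcement tier
POLICY = {
    ("private", "external_write"): Tier.BLOCKLIST,
    ("unsure", "external_write"): Tier.CONFIRMATION,
    ("public", "external_write"): Tier.ALLOWLIST,
}

def decide(confidentiality: str, capability: str) -> Tier:
    # Fall back to user confirmation when no rule matches — the conservative
    # default that trades some autonomy for safety
    return POLICY.get((confidentiality, capability), Tier.CONFIRMATION)
```

Making the default tier explicit is the knob that balances safety against autonomy: a stricter deployment defaults unmatched flows to `BLOCKLIST`, a more autonomous one to `ALLOWLIST`.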
Evidence
- Formally verified the capability-enhanced MCP framework using Alloy (a first-order relational logic modeling language) — without policies, the analyzer immediately found counterexamples where private data leaks; with policies, it confirmed that all unsafe flows are blocked
- Calendar agent example: demonstrated deterministic blocking of a scenario where an STD treatment appointment title gets exposed in a schedule-change email to a colleague, using the 4-tier enforcement (list_events → send_email path via blocklist or confirmation)
- Alloy Analyzer confirmed that safe traces (event creation → schedule change → attendee notification, excluding private info) remain permitted after policy enforcement — proving safety hardening doesn't break functionality
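The property the Alloy model checks can be illustrated in miniature. The toy checker below is not the paper's Alloy code — tool names follow the calendar example and the labels are illustrative — but it captures the same invariant: a trace is unsafe exactly when private data enters the context before an external-write tool fires.

```python
# Toy re-implementation of the verified property: once 'private' data is
# read into the agent's context, any later external-write step is unsafe.
TOOLS = {
    "list_events": {"capabilities": "read"},
    "create_event": {"capabilities": "write"},
    "update_event": {"capabilities": "write"},
    "send_email": {"capabilities": "external_write"},
}

def trace_is_safe(trace):
    """trace: list of (tool_name, confidentiality_of_data_handled) steps."""
    tainted = False  # becomes True once private data enters the context
    for tool, confidentiality in trace:
        if confidentiality == "private":
            tainted = True
        if TOOLS[tool]["capabilities"] == "external_write" and tainted:
            return False  # private -> external_write flow: must be blocked
    return True

# Unsafe: private appointment title read, then emailed to a colleague
assert not trace_is_safe([("list_events", "private"), ("send_email", "private")])
# Safe: creation -> schedule change -> notification, no private data involved
assert trace_is_safe([("create_event", "public"), ("update_event", "public"),
                      ("send_email", "public")])
```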
How to Apply
- Add key-value tags like {'capabilities': 'external_write', 'confidentiality': 'private', 'trust_level': 'untrusted'} to each tool declaration in your MCP server, so an external policy engine can intercept tool calls at runtime and automatically block private → external_write flows
- When designing agent workflows, apply the 4 STPA steps: (1) identify direct/indirect stakeholders → (2) derive losses for each → (3) analyze system behaviors that cause losses → (4) define safety requirements and choose the appropriate enforcement level from Blocklist/Mustlist/Allowlist/Confirmation
- Place an interceptor middleware before external-write tools like send_email or write_file — if the input data's confidentiality label is 'private', blocklist it; if 'unsure', request user confirmation; if 'public', allowlist it
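The output of STPA step (4) can be kept machine-readable so the policy engine loads it directly. The schema below is a hypothetical sketch (the rule IDs, field names, and `enforcement_for` helper are invented for illustration), showing how each derived requirement ties a loss and hazard to one of the four enforcement tiers.

```python
# Hypothetical encoding of STPA-derived safety requirements as policy rules.
# Each rule records the loss and hazard it mitigates plus its enforcement tier.
SAFETY_REQUIREMENTS = [
    {
        "id": "SR-1",
        "loss": "Private calendar details disclosed to third parties",
        "hazard": "private data passed to an external-write tool",
        "flow": {"confidentiality": "private", "capability": "external_write"},
        "enforcement": "blocklist",
    },
    {
        "id": "SR-2",
        "loss": "Attendees unaware of a schedule change",
        "hazard": "notification step skipped after a reschedule",
        "flow": {"after": "update_event", "tool": "send_email"},
        "enforcement": "mustlist",
    },
    {
        "id": "SR-3",
        "loss": "Agent stalls on harmless routine sends",
        "hazard": "public data needlessly escalated to the user",
        "flow": {"confidentiality": "public", "capability": "external_write"},
        "enforcement": "allowlist",
    },
]

def enforcement_for(confidentiality, capability):
    # Look up the tier for a data flow; default to user confirmation
    for rule in SAFETY_REQUIREMENTS:
        flow = rule["flow"]
        if (flow.get("confidentiality") == confidentiality
                and flow.get("capability") == capability):
            return rule["enforcement"]
    return "confirmation"
```

Because unmatched flows (such as an 'unsure' label) fall through to confirmation, the table only needs entries for flows you have explicitly analyzed.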
Code Example
# MCP tool declaration example — adding capability-enhanced labels
# ("labels" is the proposed extension; current MCP treats such metadata as optional)
{
  "name": "send_email",
  "description": "Send email",
  "labels": {
    "capabilities": "external_write",
    "trust_level": "untrusted"
  },
  "inputSchema": {
    "type": "object",
    "properties": {
      "to": {"type": "string"},
      "subject": {"type": "string", "labels": {"confidentiality": "public"}},
      "body": {"type": "string", "labels": {"confidentiality": "inferred"}}
    }
  }
}
# Policy engine interceptor pseudocode
def intercept_tool_call(tool_name, inputs, context_labels):
    tool_capability = TOOL_REGISTRY[tool_name]["labels"].get("capabilities")
    needs_confirmation = False
    # Inspect every input before deciding, so a 'private' value is never
    # missed just because an earlier 'unsure' value triggered confirmation
    for key in inputs:
        data_label = context_labels.get(key, {}).get("confidentiality")
        if data_label == "private" and tool_capability == "external_write":
            # Blocklist tier: private data must never reach an external-write tool
            raise BlockedByPolicy(f"{key} is private, cannot send via {tool_name}")
        if data_label == "unsure" and tool_capability == "external_write":
            needs_confirmation = True
    if needs_confirmation:
        # Confirmation tier: defer to the user before executing
        return request_user_confirmation(tool_name, inputs)
    # Allowlist tier: no policy matched, execute normally
    return execute_tool(tool_name, inputs)
Original Abstract
Large language model (LLM)-based AI agents extend LLM capabilities by enabling access to tools such as data sources, APIs, search engines, code sandboxes, and even other agents. While this empowers agents to perform complex tasks, LLMs may invoke unintended tool interactions and introduce risks, such as leaking sensitive data or overwriting critical records, which are unacceptable in enterprise contexts. Current approaches to mitigate these risks, such as model-based safeguards, enhance agents' reliability but cannot guarantee system safety. Methods like information flow control (IFC) and temporal constraints aim to provide guarantees but often require extensive human annotation. We propose a process that starts with applying System-Theoretic Process Analysis (STPA) to identify hazards in agent workflows, derive safety requirements, and formalize them as enforceable specifications on data flows and tool sequences. To enable this, we introduce a capability-enhanced Model Context Protocol (MCP) framework that requires structured labels on capabilities, confidentiality, and trust level. Together, these contributions aim to shift LLM-based agent safety from ad hoc reliability fixes to proactive guardrails with formal guarantees, while reducing dependence on user confirmation and making autonomy a deliberate design choice.