Towards Verifiably Safe Tool Use for LLM Agents
TL;DR Highlight
Proposes a design methodology that blocks sensitive data leaks and unsafe behaviors in LLM agent tool calls using mathematically guaranteed rules instead of probabilistic filters
Who Should Read
Backend/AI developers building agents with MCP or LangChain who worry about sensitive data leaks and unsafe tool combinations. Senior developers and architects designing agent safety architectures in enterprise environments.
Core Mechanics
- Most agent incidents stem not from individual tool bugs but from unexpected data flows when multiple tools are combined — in a GitHub MCP vulnerability case, the combination of file-read + public-commit tools leaked private repository info
- ML-based guardrails like GuardAgent, ShieldAgent, and TrustAgent only reduce risk probabilistically, so a persistent attacker can break through with a single attack tailored to the defense characteristics
- Applied STPA (a system safety analysis method from aviation and autonomous driving) to LLM agents to proactively identify risks in order: stakeholders → losses → hazardous actions → safety requirements
- Applied IFC (Information Flow Control — tracking and controlling where data flows) at MCP tool boundaries to deterministically block safety violations
- Proposed extending MCP to require mandatory tags on each tool: capabilities (read/write/execute), confidentiality (public/sensitive), and trust_level — current MCP treats this info as optional and untrusted
- A 4-tier enforcement structure — Blocklist (always block) / Mustlist (must execute) / Allowlist (auto-allow) / Confirmation (user approval) — flexibly balances safety and agent autonomy
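The 4-tier structure can be sketched as a small decision function. This is an illustrative sketch, not the paper's implementation: the `Tier` enum, `POLICY` table, and the (confidentiality, capability) lookup key are all hypothetical names, assuming each tool call carries the labels described above.

```python
from enum import Enum

class Tier(Enum):
    BLOCKLIST = "blocklist"        # always block the call
    MUSTLIST = "mustlist"          # must execute (e.g., mandatory notification)
    ALLOWLIST = "allowlist"        # auto-allow without asking
    CONFIRMATION = "confirmation"  # pause and ask the user

# Hypothetical policy table mapping a data flow to an enforcement tier
POLICY = {
    ("private", "external_write"): Tier.BLOCKLIST,
    ("unsure", "external_write"): Tier.CONFIRMATION,
    ("public", "external_write"): Tier.ALLOWLIST,
}

def decide(confidentiality: str, capability: str) -> Tier:
    # Fall back to user confirmation when no rule matches — the conservative
    # default that trades some autonomy for safety
    return POLICY.get((confidentiality, capability), Tier.CONFIRMATION)
```

Making the default tier explicit is the knob that balances safety against autonomy: a stricter deployment defaults unmatched flows to `BLOCKLIST`, a more autonomous one to `ALLOWLIST`.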
Evidence
- Formally verified the capability-enhanced MCP framework using Alloy (a first-order relational logic modeling language) — without policies, the analyzer immediately found counterexamples where private data leaks; with policies, it confirmed that all unsafe flows are blocked
- Calendar agent example: demonstrated deterministic blocking of a scenario where an STD treatment appointment title gets exposed in a schedule-change email to a colleague, using the 4-tier enforcement (list_events → send_email path via blocklist or confirmation)
- Alloy Analyzer confirmed that safe traces (event creation → schedule change → attendee notification, excluding private info) remain permitted after policy enforcement — proving safety hardening doesn't break functionality
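The property the Alloy model checks can be illustrated in miniature. The toy checker below is not the paper's Alloy code — tool names follow the calendar example and the labels are illustrative — but it captures the same invariant: a trace is unsafe exactly when private data enters the context before an external-write tool fires.

```python
# Toy re-implementation of the verified property: once 'private' data is
# read into the agent's context, any later external-write step is unsafe.
TOOLS = {
    "list_events": {"capabilities": "read"},
    "create_event": {"capabilities": "write"},
    "update_event": {"capabilities": "write"},
    "send_email": {"capabilities": "external_write"},
}

def trace_is_safe(trace):
    """trace: list of (tool_name, confidentiality_of_data_handled) steps."""
    tainted = False  # becomes True once private data enters the context
    for tool, confidentiality in trace:
        if confidentiality == "private":
            tainted = True
        if TOOLS[tool]["capabilities"] == "external_write" and tainted:
            return False  # private -> external_write flow: must be blocked
    return True

# Unsafe: private appointment title read, then emailed to a colleague
assert not trace_is_safe([("list_events", "private"), ("send_email", "private")])
# Safe: creation -> schedule change -> notification, no private data involved
assert trace_is_safe([("create_event", "public"), ("update_event", "public"),
                      ("send_email", "public")])
```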
How to Apply
- Add key-value tags like {'capabilities': 'external_write', 'confidentiality': 'private', 'trust_level': 'untrusted'} to each tool declaration in your MCP server, so an external policy engine can intercept tool calls at runtime and automatically block private → external_write flows
- When designing agent workflows, apply the 4 STPA steps: (1) identify direct/indirect stakeholders → (2) derive losses for each → (3) analyze system behaviors that cause losses → (4) define safety requirements and choose the appropriate enforcement level from Blocklist/Mustlist/Allowlist/Confirmation
- Place an interceptor middleware before external-write tools like send_email or write_file — if the input data's confidentiality label is 'private', blocklist it; if 'unsure', request user confirmation; if 'public', allowlist it
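The output of STPA step (4) can be kept machine-readable so the policy engine loads it directly. The schema below is a hypothetical sketch (the rule IDs, field names, and `enforcement_for` helper are invented for illustration), showing how each derived requirement ties a loss and hazard to one of the four enforcement tiers.

```python
# Hypothetical encoding of STPA-derived safety requirements as policy rules.
# Each rule records the loss and hazard it mitigates plus its enforcement tier.
SAFETY_REQUIREMENTS = [
    {
        "id": "SR-1",
        "loss": "Private calendar details disclosed to third parties",
        "hazard": "private data passed to an external-write tool",
        "flow": {"confidentiality": "private", "capability": "external_write"},
        "enforcement": "blocklist",
    },
    {
        "id": "SR-2",
        "loss": "Attendees unaware of a schedule change",
        "hazard": "notification step skipped after a reschedule",
        "flow": {"after": "update_event", "tool": "send_email"},
        "enforcement": "mustlist",
    },
    {
        "id": "SR-3",
        "loss": "Agent stalls on harmless routine sends",
        "hazard": "public data needlessly escalated to the user",
        "flow": {"confidentiality": "public", "capability": "external_write"},
        "enforcement": "allowlist",
    },
]

def enforcement_for(confidentiality, capability):
    # Look up the tier for a data flow; default to user confirmation
    for rule in SAFETY_REQUIREMENTS:
        flow = rule["flow"]
        if (flow.get("confidentiality") == confidentiality
                and flow.get("capability") == capability):
            return rule["enforcement"]
    return "confirmation"
```

Because unmatched flows (such as an 'unsure' label) fall through to confirmation, the table only needs entries for flows you have explicitly analyzed.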
Code Example
# MCP tool declaration example — adding capability-enhanced labels
# ("labels" is the proposed extension; current MCP treats such metadata as optional)
{
  "name": "send_email",
  "description": "Send email",
  "labels": {
    "capabilities": "external_write",
    "trust_level": "untrusted"
  },
  "inputSchema": {
    "type": "object",
    "properties": {
      "to": {"type": "string"},
      "subject": {"type": "string", "labels": {"confidentiality": "public"}},
      "body": {"type": "string", "labels": {"confidentiality": "inferred"}}
    }
  }
}
# Policy engine interceptor pseudocode
def intercept_tool_call(tool_name, inputs, context_labels):
    tool_capability = TOOL_REGISTRY[tool_name]["labels"].get("capabilities")
    needs_confirmation = False
    # Inspect every input before deciding, so a 'private' value is never
    # missed just because an earlier 'unsure' value triggered confirmation
    for key in inputs:
        data_label = context_labels.get(key, {}).get("confidentiality")
        if data_label == "private" and tool_capability == "external_write":
            # Blocklist tier: private data must never reach an external-write tool
            raise BlockedByPolicy(f"{key} is private, cannot send via {tool_name}")
        if data_label == "unsure" and tool_capability == "external_write":
            needs_confirmation = True
    if needs_confirmation:
        # Confirmation tier: defer to the user before executing
        return request_user_confirmation(tool_name, inputs)
    # Allowlist tier: no policy matched, execute normally
    return execute_tool(tool_name, inputs)
Original Abstract
Large language model (LLM)-based AI agents extend LLM capabilities by enabling access to tools such as data sources, APIs, search engines, code sandboxes, and even other agents. While this empowers agents to perform complex tasks, LLMs may invoke unintended tool interactions and introduce risks, such as leaking sensitive data or overwriting critical records, which are unacceptable in enterprise contexts. Current approaches to mitigate these risks, such as model-based safeguards, enhance agents' reliability but cannot guarantee system safety. Methods like information flow control (IFC) and temporal constraints aim to provide guarantees but often require extensive human annotation. We propose a process that starts with applying System-Theoretic Process Analysis (STPA) to identify hazards in agent workflows, derive safety requirements, and formalize them as enforceable specifications on data flows and tool sequences. To enable this, we introduce a capability-enhanced Model Context Protocol (MCP) framework that requires structured labels on capabilities, confidentiality, and trust level. Together, these contributions aim to shift LLM-based agent safety from ad hoc reliability fixes to proactive guardrails with formal guarantees, while reducing dependence on user confirmation and making autonomy a deliberate design choice.