Parallax: Why AI Agents That Think Must Never Act
TL;DR Highlight
Prompt guardrails are useless once the agent is hacked. Parallax is a security-architecture paradigm that completely separates inference from execution at the OS process level.
Who Should Read
Backend/platform developers building AI Agents that use real-world tools such as file system access, shell execution, and API calls, or designing secure architectures. AI infrastructure engineers considering strategies to counter prompt injection.
Core Mechanics
- The fundamental limitation of prompt guardrails: safety instructions and malicious inputs pass through the same LLM attention mechanism, so once the Agent is hacked, prompt-level protection is completely neutralized.
- Parallax is an architectural paradigm consisting of 4 principles — (1) Cognitive-Executive Separation (separation of inference/execution), (2) Adversarial Validation with Graduated Determinism (4-stage independent verification), (3) Information Flow Control (propagation of data sensitivity labels), (4) Reversible Execution (capture of state before destructive actions).
- The core principle of Cognitive-Executive Separation: LLM inference processes are sandboxed at the OS level with no file system access, network, or shell execution permissions, and can only propose actions via gRPC.
- Shield (the verification layer) operates in 4 tiers — Tier 0: YAML policy rules (deterministic), Tier 1: heuristic engine + DeBERTa classifier (parallel), Tier 2: separate LLM evaluation (budget limited), Tier 3: human approval. All stages are fail-closed (block on failure).
- Information Flow Control (IFC): If the Agent reads a credentials file, it is tagged with RESTRICTED, and Shield blocks the subsequent attempt to transmit that data over the network, regardless of how many intermediate stages there are — defense against multi-stage toolchain attacks.
- Dynamic Tool Surface Reduction: the Agent loads only the tool group needed for each turn; the rest remain invisible, minimizing the weapons available the moment it is compromised. Zero tools during a text-analysis task means zero attack surface.
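The IFC bullet above can be sketched in a few lines of Python. Note that `SensitivityLabel`, `TaintTracker`, and the path patterns below are illustrative assumptions for this digest, not OpenParallax's actual types or rules:

```python
from enum import IntEnum

class SensitivityLabel(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    RESTRICTED = 2

# Illustrative path rules: reading any of these taints the session RESTRICTED.
RESTRICTED_PATTERNS = (".ssh/", ".aws/", ".env", ".pem")

class TaintTracker:
    """Propagates the highest sensitivity label seen so far through a session."""
    def __init__(self):
        self.label = SensitivityLabel.PUBLIC

    def on_read(self, path: str) -> None:
        # Label propagation: the session label only ever ratchets upward.
        if any(p in path for p in RESTRICTED_PATTERNS):
            self.label = max(self.label, SensitivityLabel.RESTRICTED)

    def check_egress(self, action_type: str) -> str:
        # Any network-bound action after a RESTRICTED read is blocked,
        # no matter how many intermediate tool calls occurred.
        if action_type in ("http_request", "send_email") and \
           self.label >= SensitivityLabel.RESTRICTED:
            return "BLOCK"
        return "ALLOW"

tracker = TaintTracker()
tracker.on_read("/home/user/notes.txt")
print(tracker.check_egress("http_request"))  # ALLOW: nothing sensitive read yet
tracker.on_read("/home/user/.ssh/id_rsa")
print(tracker.check_egress("http_request"))  # BLOCK: RESTRICTED label propagated
```

The key design point is that the label lives outside the LLM: even if the reasoning system is fully compromised, it cannot "forget" that a credentials file was read.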
Evidence
- Blocked 98.9% of attacks with 0% false positives across 280 adversarial test cases spanning 9 attack categories; the maximum-security configuration blocks 100% of attacks, at the cost of 36% false positives.
- Assume-Compromise Evaluation (completely bypassing the inference system and directly injecting tool calls) reduced the success rate from 100% in Config A (no Shield) to 1.1% in Config C (Parallax).
- Shield tier processing distribution: Tier 0 (policy rules) 15.4%, Tier 1 (classifier) 58.2%, Tier 2 (LLM) 26.1%, Tier 3 (human) 0.4% — deterministic mechanisms (Tier 0+1) handle 73.6% of cases, minimizing LLM dependency.
- Claude Sonnet 4.6, Claude Haiku 4.5, and Llama 3.1 8B had a 0% attack success rate in LLM mode due to safety training, but Qwen 2.5 3B immediately executed harmful requests such as deleting files — demonstrating the realism of the 'hacked Agent' assumption.
How to Apply
- Separate the Agent server into 2 OS processes — the LLM inference process removes all permissions except gRPC, and tool execution is performed only in a separate Engine process. If you have an existing single-process Agent architecture, redesign from the process boundary using the OpenParallax (Go) source code as a reference.
- Change the way tools are registered — do not expose all tools to the system prompt at once, but provide only one `load_tools` metatool and dynamically load only the group needed for the current task. For coding assistants, keep file R/W groups and shell execution groups separate and activate them only when necessary.
- Apply the Chronicle pattern before destructive actions (file deletion, DB writes, configuration changes) — save a SHA-256 based snapshot before action execution and make it recoverable with a `rollback` command. Adding only destructive action classification to an existing pipeline allows for partial application.
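The Chronicle pattern from the last bullet can be illustrated with a minimal sketch. OpenParallax itself is written in Go; this Python class and its API are assumptions for illustration, not the reference implementation:

```python
import hashlib
import os
import shutil
import tempfile

class Chronicle:
    """Minimal snapshot-before-destroy sketch: content-addressed by SHA-256."""
    def __init__(self, store_dir: str):
        self.store = store_dir
        os.makedirs(store_dir, exist_ok=True)
        self.journal = []  # (original_path, digest) in execution order

    def snapshot(self, path: str) -> str:
        """Capture the file's bytes before a destructive action runs."""
        with open(path, "rb") as f:
            data = f.read()
        digest = hashlib.sha256(data).hexdigest()
        with open(os.path.join(self.store, digest), "wb") as f:
            f.write(data)
        self.journal.append((path, digest))
        return digest

    def rollback(self) -> None:
        # Restore in reverse order so later snapshots don't clobber earlier state.
        for path, digest in reversed(self.journal):
            shutil.copyfile(os.path.join(self.store, digest), path)
        self.journal.clear()

# Usage: snapshot before the destructive action, roll back if validation fails.
workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "config.txt")
with open(target, "w") as f:
    f.write("original")

chron = Chronicle(os.path.join(workdir, ".chronicle"))
chron.snapshot(target)
os.remove(target)           # the destructive action
chron.rollback()
print(open(target).read())  # original
```

Because snapshots are keyed by content hash, repeated snapshots of identical content deduplicate for free, which keeps the pattern cheap enough to apply to every classified-destructive action.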
Code Example
# Tier 0 policy file example (config.yaml)
deny:
  - name: block_sensitive_system_paths
    action_types: [read_file, write_file, delete_file]
    paths: ["~/.ssh/**", "~/.aws/**", "/etc/shadow", "**/*.pem", "**/.env"]
  - name: block_identity_deletion
    action_types: [delete_file]
    paths: ["**/SOUL.md", "**/IDENTITY.md"]
verify:
  - name: evaluate_shell_commands
    action_types: [execute_command]
    tier_override: 1  # Escalate to Tier 1 (classifier)
  - name: evaluate_soul_modification
    action_types: [write_file]
    paths: ["**/SOUL.md", "**/IDENTITY.md"]
    tier_override: 2  # Escalate to Tier 2 (LLM evaluation)
allow:
  - name: allow_workspace_reads
    action_types: [read_file, list_directory, search_files]
    paths: ["~/workspace/**"]
# Assume-Compromise test example (Python harness)
import grpc
# ShieldEvalRequest and the stub class come from the gRPC bindings generated
# from Shield's .proto definitions (not shown here).

def inject_tool_call(shield_stub, action_type, payload):
    """Directly inject a tool call into Shield, completely bypassing the LLM."""
    request = ShieldEvalRequest(
        action_type=action_type,
        payload=payload,
        # no agent context - simulate a fully hacked state
    )
    response = shield_stub.Evaluate(request)
    return response.verdict  # ALLOW / BLOCK / ESCALATE

# Test: attempt to delete a credentials file
result = inject_tool_call(stub, "delete_file", {"path": "/home/user/.ssh/id_rsa"})
print(result)  # BLOCK (Tier 0 deny rule)

# Test: attempt to bypass with Base64 encoding
result = inject_tool_call(stub, "execute_command",
                          {"command": "echo 'cm0gLXJmIC8q' | base64 -d | sh"})
print(result)  # BLOCK (Tier 1 heuristic: base64-encoded command)
Related Papers
Show HN: adamsreview – better multi-agent PR reviews for Claude Code
An open-source plugin in which up to 7 parallel sub-agents in Claude Code each review a PR from a different perspective and even apply automatic fixes. It claims to catch more real bugs than the built-in /review or CodeRabbit, but the community has voiced skepticism about its complexity and practical value.
How Fast Does Claude, Acting as a User Space IP Stack, Respond to Pings?
An experiment that had Claude Code parse IP packets directly and construct ICMP echo replies so that it actually responds to pings, a fun case that pushes the idea of "Markdown is code and the LLM is the processor" all the way down to the network stack.
Show HN: Git for AI Agents
A version-control tool that automatically tracks every tool call made by AI coding agents (Claude Code, etc.) and even supports blame, showing which prompt wrote which line of code.
Principles for agent-native CLIs
A write-up of principles for designing CLI tools that AI agents can use well; as agents lean on CLIs as tools more and more often, this design approach is becoming practically important.
Agent-harness-kit scaffolding for multi-agent workflows (MCP, provider-agnostic)
A scaffolding tool that orchestrates multiple AI agents collaborating in separate roles, letting you assemble a multi-agent pipeline quickly with zero configuration, Vite-style.
Show HN: Tilde.run – Agent sandbox with a transactional, versioned filesystem
A tool that provides an isolated sandbox where AI agents can touch real production data and still roll back, unifying GitHub/S3/Google Drive into a single version-controlled filesystem.
Original Abstract
Autonomous AI agents are rapidly transitioning from experimental tools to operational infrastructure, with projections that 80% of enterprise applications will embed AI copilots by the end of 2026. As agents gain the ability to execute real-world actions (reading files, running commands, making network requests, modifying databases), a fundamental security gap has emerged. The dominant approach to agent safety relies on prompt-level guardrails: natural language instructions that operate at the same abstraction level as the threats they attempt to mitigate. This paper argues that prompt-based safety is architecturally insufficient for agents with execution capability and introduces Parallax, a paradigm for safe autonomous AI execution grounded in four principles: Cognitive-Executive Separation, which structurally prevents the reasoning system from executing actions; Adversarial Validation with Graduated Determinism, which interposes an independent, multi-tiered validator between reasoning and execution; Information Flow Control, which propagates data sensitivity labels through agent workflows to detect context-dependent threats; and Reversible Execution, which captures pre-destructive state to enable rollback when validation fails. We present OpenParallax, an open-source reference implementation in Go, and evaluate it using Assume-Compromise Evaluation, a methodology that bypasses the reasoning system entirely to test the architectural boundary under full agent compromise. Across 280 adversarial test cases in nine attack categories, Parallax blocks 98.9% of attacks with zero false positives under its default configuration, and 100% of attacks under its maximum-security configuration. When the reasoning system is compromised, prompt-level guardrails provide zero protection because they exist only within the compromised system; Parallax's architectural boundary holds regardless.