Parallax: Why AI Agents That Think Must Never Act
TL;DR Highlight
Prompt guardrails are useless once the agent is hacked. Parallax is a security-architecture paradigm that completely separates inference from execution at the OS process level.
Who Should Read
Backend/platform developers building AI Agents that use real-world tools such as file system access, shell execution, and API calls, or designing secure architectures. AI infrastructure engineers considering strategies to counter prompt injection.
Core Mechanics
- The fundamental limitation of prompt guardrails: safety instructions and malicious inputs pass through the same LLM attention mechanism, so once the Agent is hacked, prompt-level protection is completely neutralized.
- Parallax is an architectural paradigm consisting of 4 principles — (1) Cognitive-Executive Separation (separation of inference/execution), (2) Adversarial Validation with Graduated Determinism (4-stage independent verification), (3) Information Flow Control (propagation of data sensitivity labels), (4) Reversible Execution (capture of state before destructive actions).
- The core principle of Cognitive-Executive Separation: LLM inference processes are sandboxed at the OS level with no file system access, network, or shell execution permissions, and can only propose actions via gRPC.
- Shield (the verification layer) operates in 4 tiers — Tier 0: YAML policy rules (deterministic), Tier 1: heuristic engine + DeBERTa classifier (parallel), Tier 2: separate LLM evaluation (budget limited), Tier 3: human approval. All stages are fail-closed (block on failure).
- Information Flow Control (IFC): If the Agent reads a credentials file, it is tagged with RESTRICTED, and Shield blocks the subsequent attempt to transmit that data over the network, regardless of how many intermediate stages there are — defense against multi-stage toolchain attacks.
- Dynamic Tool Surface Reduction: the Agent loads only the tool group needed for each turn; the rest remain invisible, minimizing the weapons available the moment it is compromised. Zero tools during a text-analysis task means zero attack surface.
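The IFC bullet above can be sketched in a few lines of Python. Note that `SensitivityLabel`, `TaintTracker`, and the path patterns below are illustrative assumptions for this digest, not OpenParallax's actual types or rules:

```python
from enum import IntEnum

class SensitivityLabel(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    RESTRICTED = 2

# Illustrative path rules: reading any of these taints the session RESTRICTED.
RESTRICTED_PATTERNS = (".ssh/", ".aws/", ".env", ".pem")

class TaintTracker:
    """Propagates the highest sensitivity label seen so far through a session."""
    def __init__(self):
        self.label = SensitivityLabel.PUBLIC

    def on_read(self, path: str) -> None:
        # Label propagation: the session label only ever ratchets upward.
        if any(p in path for p in RESTRICTED_PATTERNS):
            self.label = max(self.label, SensitivityLabel.RESTRICTED)

    def check_egress(self, action_type: str) -> str:
        # Any network-bound action after a RESTRICTED read is blocked,
        # no matter how many intermediate tool calls occurred.
        if action_type in ("http_request", "send_email") and \
           self.label >= SensitivityLabel.RESTRICTED:
            return "BLOCK"
        return "ALLOW"

tracker = TaintTracker()
tracker.on_read("/home/user/notes.txt")
print(tracker.check_egress("http_request"))  # ALLOW: nothing sensitive read yet
tracker.on_read("/home/user/.ssh/id_rsa")
print(tracker.check_egress("http_request"))  # BLOCK: RESTRICTED label propagated
```

The key design point is that the label lives outside the LLM: even if the reasoning system is fully compromised, it cannot "forget" that a credentials file was read.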
Evidence
- Blocked 98.9% of attacks with 0% false positives across 280 adversarial test cases spanning 9 attack categories; the maximum-security configuration blocks 100% of attacks, at the cost of 36% false positives.
- Assume-Compromise Evaluation (completely bypassing the inference system and directly injecting tool calls) reduced the success rate from 100% in Config A (no Shield) to 1.1% in Config C (Parallax).
- Shield tier processing distribution: Tier 0 (policy rules) 15.4%, Tier 1 (classifier) 58.2%, Tier 2 (LLM) 26.1%, Tier 3 (human) 0.4% — deterministic mechanisms (Tier 0+1) handle 73.6% of cases, minimizing LLM dependency.
- Claude Sonnet 4.6, Claude Haiku 4.5, and Llama 3.1 8B had a 0% attack success rate in LLM mode due to safety training, but Qwen 2.5 3B immediately executed harmful requests such as deleting files — demonstrating the realism of the 'hacked Agent' assumption.
How to Apply
- Separate the Agent server into 2 OS processes — the LLM inference process removes all permissions except gRPC, and tool execution is performed only in a separate Engine process. If you have an existing single-process Agent architecture, redesign from the process boundary using the OpenParallax (Go) source code as a reference.
- Change the way tools are registered — do not expose all tools to the system prompt at once, but provide only one `load_tools` metatool and dynamically load only the group needed for the current task. For coding assistants, keep file R/W groups and shell execution groups separate and activate them only when necessary.
- Apply the Chronicle pattern before destructive actions (file deletion, DB writes, configuration changes) — save a SHA-256 based snapshot before action execution and make it recoverable with a `rollback` command. Adding only destructive action classification to an existing pipeline allows for partial application.
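The Chronicle pattern from the last bullet can be illustrated with a minimal sketch. OpenParallax itself is written in Go; this Python class and its API are assumptions for illustration, not the reference implementation:

```python
import hashlib
import os
import shutil
import tempfile

class Chronicle:
    """Minimal snapshot-before-destroy sketch: content-addressed by SHA-256."""
    def __init__(self, store_dir: str):
        self.store = store_dir
        os.makedirs(store_dir, exist_ok=True)
        self.journal = []  # (original_path, digest) in execution order

    def snapshot(self, path: str) -> str:
        """Capture the file's bytes before a destructive action runs."""
        with open(path, "rb") as f:
            data = f.read()
        digest = hashlib.sha256(data).hexdigest()
        with open(os.path.join(self.store, digest), "wb") as f:
            f.write(data)
        self.journal.append((path, digest))
        return digest

    def rollback(self) -> None:
        # Restore in reverse order so later snapshots don't clobber earlier state.
        for path, digest in reversed(self.journal):
            shutil.copyfile(os.path.join(self.store, digest), path)
        self.journal.clear()

# Usage: snapshot before the destructive action, roll back if validation fails.
workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "config.txt")
with open(target, "w") as f:
    f.write("original")

chron = Chronicle(os.path.join(workdir, ".chronicle"))
chron.snapshot(target)
os.remove(target)           # the destructive action
chron.rollback()
print(open(target).read())  # original
```

Because snapshots are keyed by content hash, repeated snapshots of identical content deduplicate for free, which keeps the pattern cheap enough to apply to every classified-destructive action.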
Code Example
# Tier 0 policy file example (config.yaml)
deny:
  - name: block_sensitive_system_paths
    action_types: [read_file, write_file, delete_file]
    paths: ["~/.ssh/**", "~/.aws/**", "/etc/shadow", "**/*.pem", "**/.env"]
  - name: block_identity_deletion
    action_types: [delete_file]
    paths: ["**/SOUL.md", "**/IDENTITY.md"]
verify:
  - name: evaluate_shell_commands
    action_types: [execute_command]
    tier_override: 1  # Escalate to Tier 1 (classifier)
  - name: evaluate_soul_modification
    action_types: [write_file]
    paths: ["**/SOUL.md", "**/IDENTITY.md"]
    tier_override: 2  # Escalate to Tier 2 (LLM evaluation)
allow:
  - name: allow_workspace_reads
    action_types: [read_file, list_directory, search_files]
    paths: ["~/workspace/**"]
# Assume-Compromise test example (Python harness)
import grpc
# ShieldEvalRequest and the stub class come from the gRPC bindings generated
# from Shield's .proto definitions (not shown here).

def inject_tool_call(shield_stub, action_type, payload):
    """Directly inject a tool call into Shield, completely bypassing the LLM."""
    request = ShieldEvalRequest(
        action_type=action_type,
        payload=payload,
        # no agent context - simulate a fully hacked state
    )
    response = shield_stub.Evaluate(request)
    return response.verdict  # ALLOW / BLOCK / ESCALATE

# Test: attempt to delete a credentials file
result = inject_tool_call(stub, "delete_file", {"path": "/home/user/.ssh/id_rsa"})
print(result)  # BLOCK (Tier 0 deny rule)

# Test: attempt to bypass with Base64 encoding
result = inject_tool_call(stub, "execute_command",
                          {"command": "echo 'cm0gLXJmIC8q' | base64 -d | sh"})
print(result)  # BLOCK (Tier 1 heuristic: base64-encoded command)
Related Papers
Show HN: adamsreview – better multi-agent PR reviews for Claude Code
An open-source plugin in which up to 7 parallel sub-agents in Claude Code each review a PR from a different perspective and even apply automatic fixes. It claims to catch more real bugs than the built-in /review or CodeRabbit, but the community has voiced skepticism about its complexity and practical value.
How Fast Does Claude, Acting as a User Space IP Stack, Respond to Pings?
An experiment that had Claude Code parse IP packets directly and construct ICMP echo replies so that it actually responds to pings, a fun case that pushes the idea of "Markdown is code and the LLM is the processor" all the way down to the network stack.
Show HN: Git for AI Agents
A version-control tool that automatically tracks every tool call made by AI coding agents (Claude Code, etc.) and even supports blame, showing which prompt wrote which line of code.
Principles for agent-native CLIs
A write-up of principles for designing CLI tools that AI agents can use well; as agents lean on CLIs as tools more and more often, this design approach is becoming practically important.
Agent-harness-kit scaffolding for multi-agent workflows (MCP, provider-agnostic)
A scaffolding tool that orchestrates multiple AI agents collaborating in separate roles, letting you assemble a multi-agent pipeline quickly with zero configuration, Vite-style.
Show HN: Tilde.run – Agent sandbox with a transactional, versioned filesystem
A tool that provides an isolated sandbox where AI agents can touch real production data and still roll back, unifying GitHub/S3/Google Drive into a single version-controlled filesystem.
Original Abstract
Autonomous AI agents are rapidly transitioning from experimental tools to operational infrastructure, with projections that 80% of enterprise applications will embed AI copilots by the end of 2026. As agents gain the ability to execute real-world actions (reading files, running commands, making network requests, modifying databases), a fundamental security gap has emerged. The dominant approach to agent safety relies on prompt-level guardrails: natural language instructions that operate at the same abstraction level as the threats they attempt to mitigate. This paper argues that prompt-based safety is architecturally insufficient for agents with execution capability and introduces Parallax, a paradigm for safe autonomous AI execution grounded in four principles: Cognitive-Executive Separation, which structurally prevents the reasoning system from executing actions; Adversarial Validation with Graduated Determinism, which interposes an independent, multi-tiered validator between reasoning and execution; Information Flow Control, which propagates data sensitivity labels through agent workflows to detect context-dependent threats; and Reversible Execution, which captures pre-destructive state to enable rollback when validation fails. We present OpenParallax, an open-source reference implementation in Go, and evaluate it using Assume-Compromise Evaluation, a methodology that bypasses the reasoning system entirely to test the architectural boundary under full agent compromise. Across 280 adversarial test cases in nine attack categories, Parallax blocks 98.9% of attacks with zero false positives under its default configuration, and 100% of attacks under its maximum-security configuration. When the reasoning system is compromised, prompt-level guardrails provide zero protection because they exist only within the compromised system; Parallax's architectural boundary holds regardless.