AgentWard: A Lifecycle Security Architecture for Autonomous AI Agents | AI Paper Digest

TL;DR Highlight

AI Defenses systematically designs security layers across the AI lifecycle to mitigate risks.

Who Should Read

Backend/infrastructure developers deploying LLM-powered autonomous agents to production. AI system designers actively considering agent security threats like Prompt Injection, memory corruption, and malicious plugins.

Core Mechanics

Agent security threats propagate sequentially—initialization → input → memory → decision-making → execution—and aren’t solved by a single point of defense like input filtering.
Five protection layers comprise the architecture: Foundation Scan (supply chain), Input Sanitization, Cognition Protection (memory), Decision Alignment, and Execution Control. Each layer operates on a different security principle to prevent common bypass patterns.
A zero-trust principle is applied—even if an upstream layer ‘allows’ access, downstream layers independently re-verify. The architecture assumes upstream components are already compromised.
Cross-layer coordination transmits ‘ambiguous’ signals from one layer to the next for cumulative risk assessment. Weak signals accumulate to automatically trigger stricter execution policies.
A malicious skill scenario: Foundation Scan detects a mismatch between skill description and code → Decision Alignment detects an unauthorized plan → Execution Control blocks file access. This illustrates the interplay of three layers.
An Indirect Prompt Injection → memory backdoor scenario: Cognition Protection blocks a malicious command injected via a webpage from being stored in MEMORY.md, preventing the memory from becoming a relay point for future attacks.

Evidence

"The architecture was demonstrated by implementing a plugin-native prototype on top of the OpenClaw agent, successfully blocking attacks in two multi-stage attack chains (malicious skill → data exfiltration, Indirect Prompt Injection → persistent backdoor + DoS) through inter-layer cooperation."

How to Apply

Classify runtime events in your agent system into five stages (initialization/input/memory/decision/execution) and add independent validation hooks to each stage. Start by inserting a command pattern check layer immediately before tool calls.
If your agent stores external documents or web search results in memory (files/DB), add a Cognition Protection layer before storage to inspect for Prompt Injection patterns and content anomalies, preventing persistent backdoors.
Maintain security assessment results in a shared security state and pass them to subsequent layers. Implement a cumulative escalation pattern where a ‘suspicious but not blockable’ assessment in one layer triggers stricter policies for high-risk actions.

Code Example

snippet

# OpenClaw plugin style - AgentWard layer hook attachment example

class AgentWardPlugin:
    def __init__(self):
        self.session_risk_state = {"risk_score": 0, "warnings": []}

    # Foundation Scan: Check before skill loading
    def before_prompt_build(self, context):
        for skill in context.loaded_skills:
            if self._detect_skill_mismatch(skill):
                self.session_risk_state["warnings"].append({
                    "layer": "foundation_scan",
                    "skill": skill.name,
                    "finding": "description_code_mismatch"
                })
                self.session_risk_state["risk_score"] += 30

    # Input Sanitization: Check when external content is input
    def before_message_write(self, message):
        if message.role == "tool":
            if self._detect_prompt_injection(message.content):
                message.content = self._sanitize(message.content)
                self.session_risk_state["risk_score"] += 20
                self.session_risk_state["warnings"].append({
                    "layer": "input_sanitization",
                    "action": "sanitized"
                })

    # Cognition Protection: Check when modifying memory files
    # Execution Control: Monitor all tool calls
    def before_tool_call(self, tool_name, params, is_memory_write=False):
        if is_memory_write:
            # Cognition Protection
            if self._detect_malicious_memory_pattern(params):
                return {"block": True, "reason": "suspicious_memory_mutation"}
        
        # Execution Control: Strengthen policy based on cumulative risk
        if self.session_risk_state["risk_score"] > 40:
            if self._is_high_risk_command(tool_name, params):
                return {"block": True, "reason": "high_risk_under_elevated_session_risk"}
        
        return {"block": False}

    def _detect_skill_mismatch(self, skill): ...
    def _detect_prompt_injection(self, content): ...
    def _sanitize(self, content): ...
    def _detect_malicious_memory_pattern(self, params): ...
    def _is_high_risk_command(self, tool_name, params): ...

Terminology

Lifecycle SecurityAn approach that addresses all phases of software—from inception to retirement—from a security perspective. Similar to getting a health checkup at every stage of life.

Defense-in-DepthA multi-layered security strategy where one layer of defense failing is compensated for by the next. Like having a gate, a moat, and a tower.

Indirect Prompt InjectionAn attack where an attacker hides malicious commands within documents or webpages that an AI reads. The AI encounters a hidden instruction like ‘now follow my orders’ while browsing the web.

Zero-trustA security principle that assumes no one is inherently trustworthy. Like verifying identification every time, even for those already inside the network.

eBPFA lightweight program that runs within the Linux kernel, allowing real-time monitoring of system calls or network events. Similar to installing CCTV cameras without modifying the kernel.

Supply-chain AttackAn attack that compromises external dependencies—libraries or plugins—by injecting malicious code. It’s like poisoning the ingredients before they’re used in a dish, rather than the dish itself.

Memory PoisoningAn attack that injects malicious instructions into an AI agent’s long-term memory (memory files, etc.) to cause it to behave erratically when recalled.

Least-privilegeA security principle that grants only the minimum necessary permissions for a task. Like giving a delivery driver only the key to the front door, not the entire house.

Related Papers

Related Resources

Original Abstract (Expand)

Autonomous AI agents extend large language models into full runtime systems that load skills, ingest external content, maintain memory, plan multi-step actions, and invoke privileged tools. In such systems, security failures rarely remain confined to a single interface; instead, they can propagate across initialization, input processing, memory, decision-making, and execution, often becoming apparent only when harmful effects materialize in the environment. This paper presents AgentWard, a lifecycle-oriented, defense-in-depth architecture that systematically organizes protection across these five stages. AgentWard integrates stage-specific, heterogeneous controls with cross-layer coordination, enabling threats to be intercepted along their propagation paths while safeguarding critical assets. We detail the design rationale and architecture of five coordinated protection layers, and implement a plugin-native prototype on OpenClaw to demonstrate practical feasibility. This perspective provides a concrete blueprint for structuring runtime security controls, managing trust propagation, and enforcing execution containment in autonomous AI agents. Our code is available at https://github.com/FIND-Lab/AgentWard .