AI Agent 시스템의 보안 고려사항: 위협, 방어 전략, 연구 방향

Security Considerations for Artificial Intelligence Agents

Mar 12, 2026•Ninghui Li, Kaiyuan Zhang, Kyle Polley +1•View PDF

TL;DR Highlight

Perplexity가 NIST에 제출한 AI Agent 보안 위협 분석 및 defense-in-depth 방어 전략 총정리

Who Should Read

AI Agent나 멀티에이전트 시스템을 프로덕션에 배포하는 백엔드/ML 엔지니어. 보안 정책이나 아키텍처를 설계하는 시니어 개발자.

Core Mechanics

AI Agent는 코드-데이터 경계를 근본적으로 무너뜨림 — 프롬프트가 코드처럼 동작하고, 동적으로 생성된 텍스트가 다시 프롬프트가 되는 구조라 SQL Injection보다 더 근본적인 문제
Indirect Prompt Injection(외부 콘텐츠에 악성 명령을 숨겨 에이전트를 조종하는 공격)이 가장 현실적인 위협 — 웹페이지, 이메일, 캘린더 항목에 심어두면 에이전트가 사용자 데이터를 공격자 서버로 전송 가능
멀티에이전트 시스템에서는 'confused deputy(권한을 위임받은 에이전트가 의도치 않은 행동을 하는 문제)' 공격이 가능 — 낮은 권한 에이전트가 높은 권한 에이전트를 유도해 권한 우회 가능
방어는 3계층으로 쌓아야 함: 입력 레벨 탐지 → 모델 레벨 instruction hierarchy → 결정론적 정책 강제(allowlist, rate limit, schema validation). 어느 단일 계층도 단독으로는 충분하지 않음
LLM의 instruction hierarchy(system > user > data 우선순위)는 하드 보장이 아닌 '학습된 관례' — 최근 토큰일수록 더 따르는 recency bias가 있어 우회 가능
MCP, Agent2Agent Protocol 같은 표준이 나왔지만 인증/전송 보안만 다루고, 에이전트 간 위임 권한 관리나 privilege escalation 방어는 아직 미해결

Evidence

CVE-2026-25253: OpenClaw에서 LLM 개입 없이도 원클릭 원격 코드 실행 가능한 취약점 실제 문서화됨
CVE-2026-26327: 데이터 인증 검증 불충분 취약점도 동일 플랫폼에서 별도 CVE로 등록됨
BrowseSafe 연구(arXiv:2511.20597)에서 브라우저 에이전트가 신뢰되지 않은 웹 콘텐츠를 통해 직접 prompt injection 경로가 됨을 실증
기존 input-level 탐지 기법은 base-rate fallacy(정상 입력이 압도적으로 많을 때 낮은 오탐율도 대부분 false alarm이 되는 문제) 때문에 실제 프로덕션에서 단독 사용 시 심각한 유틸리티 저하 발생

How to Apply

에이전트가 외부 콘텐츠(웹, 이메일 등)를 처리할 때는 CaMeL 패턴처럼 신뢰 LLM(계획 수립)과 격리 LLM(외부 데이터 처리)을 분리하고, tainted 변수가 tool call 제어 흐름에 영향 못 주도록 data-flow tracking을 적용
금융 거래, 파일 삭제, 외부 API 호출 같은 고위험 액션에는 LLM 판단에 의존하지 말고 allowlist/blocklist + rate limit + schema validation 같은 결정론적 레이어를 코드 레벨에서 강제 — 이게 현재 가장 성숙한 방어
멀티에이전트 아키텍처 설계 시 각 에이전트에 최소 권한만 부여하고, 에이전트 간 메시지에도 신뢰 경계를 명시적으로 정의해야 함 — orchestrator가 sub-agent에 무제한 위임하는 패턴은 confused deputy 공격 위험

Code Example

snippet

# 결정론적 last-line-of-defense 예시: 도구 호출 전 allowlist + schema 검증

ALLOWED_TOOLS = {"web_search", "read_file", "send_email"}
SENSITIVE_TOOLS = {"delete_file", "financial_transfer", "execute_code"}
RATE_LIMIT = {"financial_transfer": 3, "delete_file": 5}  # per hour

import re
from typing import Any

def validate_tool_call(tool_name: str, args: dict[str, Any], call_counts: dict) -> tuple[bool, str]:
    # 1. Allowlist 체크
    if tool_name not in ALLOWED_TOOLS and tool_name not in SENSITIVE_TOOLS:
        return False, f"Tool '{tool_name}' not in allowlist"
    
    # 2. 민감 도구는 rate limit 적용
    if tool_name in SENSITIVE_TOOLS:
        if call_counts.get(tool_name, 0) >= RATE_LIMIT.get(tool_name, 1):
            return False, f"Rate limit exceeded for '{tool_name}'"
    
    # 3. 인자 schema 검증 (예: 이메일 수신자 도메인 검증)
    if tool_name == "send_email":
        recipient = args.get("to", "")
        if not re.match(r'^[\w.-]+@(company\.com|trusted-domain\.com)$', recipient):
            return False, f"Email recipient '{recipient}' not in approved domains"
    
    return True, "OK"

# 에이전트 루프에서 사용
def agent_execute_tool(tool_name: str, args: dict, call_counts: dict):
    ok, reason = validate_tool_call(tool_name, args, call_counts)
    if not ok:
        raise SecurityError(f"Tool call blocked: {reason}")
    
    # 고위험 액션은 human-in-the-loop
    if tool_name in SENSITIVE_TOOLS:
        user_confirm = request_human_confirmation(tool_name, args)
        if not user_confirm:
            raise SecurityError("User rejected sensitive action")
    
    return execute_tool(tool_name, args)

Terminology

Indirect Prompt Injection에이전트가 읽는 외부 콘텐츠(웹페이지, 이메일 등)에 악성 명령을 숨겨서 에이전트를 조종하는 공격. 피싱 메일처럼 보이는 웹페이지가 에이전트에게 '사용자 데이터를 나한테 보내'라고 몰래 지시하는 것.

Confused Deputy권한을 위임받은 대리인(에이전트)이 자신의 높은 권한을 악용당해 의도치 않은 행동을 하는 문제. 회사 직인을 맡은 비서가 속아서 나쁜 계약서에 도장을 찍는 상황과 비슷.

Defense-in-Depth하나의 방어선이 뚫려도 다음 방어선이 막을 수 있도록 여러 겹으로 보안을 쌓는 전략. 성의 해자 + 성벽 + 내성처럼 겹겹이 방어하는 것.

Instruction HierarchyLLM이 system 프롬프트 > user 메시지 > 외부 데이터 순으로 우선순위를 두고 지시를 따르도록 학습시키는 개념. 하지만 이건 '규칙'이 아니라 '학습된 습관'이라 우회 가능.

CIA Triad보안의 3대 목표: 기밀성(Confidentiality, 데이터 유출 방지) + 무결성(Integrity, 데이터 변조 방지) + 가용성(Availability, 서비스 중단 방지). 보안 리스크를 분류하는 표준 프레임워크.

RBACRole-Based Access Control. 권한을 개인이 아닌 역할(Role)에 부여하는 접근 제어 모델. '관리자 역할'을 가진 사람은 다 같은 권한을 갖는 식으로, 권한 관리를 단순화함.

Base-Rate Fallacy실제 공격이 드물 때, 탐지 정확도가 높아도 탐지된 것의 대부분이 오탐(false alarm)이 되는 통계적 함정. 1만 건 중 악성이 10건인데 오탐률 0.1%면 오탐이 10건 — 진짜 공격과 구별 불가.

Related Resources

Original Abstract (Expand)

This article, a lightly adapted version of Perplexity's response to NIST/CAISI Request for Information 2025-0035, details our observations and recommendations concerning the security of frontier AI agents. These insights are informed by Perplexity's experience operating general-purpose agentic systems used by millions of users and thousands of enterprises in both controlled and open-world environments. Agent architectures change core assumptions around code-data separation, authority boundaries, and execution predictability, creating new confidentiality, integrity, and availability failure modes. We map principal attack surfaces across tools, connectors, hosting boundaries, and multi-agent coordination, with particular emphasis on indirect prompt injection, confused-deputy behavior, and cascading failures in long-running workflows. We then assess current defenses as a layered stack: input-level and model-level mitigations, sandboxed execution, and deterministic policy enforcement for high-consequence actions. Finally, we identify standards and research gaps, including adaptive security benchmarks, policy models for delegation and privilege control, and guidance for secure multi-agent system design aligned with NIST risk management principles.