Detecting Malicious Agent Skills with Attention: Locate-and-Judge

TL;DR Highlight

A two-stage pipeline that uses attention to cheaply detect malicious Skill files circulating in LLM agent marketplaces.

Who Should Read

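Developers and security engineers who integrate external Skill or plugin marketplaces into LLM agent systems, and anyone implementing prompt-injection defenses or working on agent supply-chain security.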

Core Mechanics

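Existing prompt-injection defenses do not work for a 'Skill' (a package of natural-language instructions shipped as a SKILL.md file) loaded by an LLM agent. Those defenses rely on a boundary between trusted instructions and untrusted data, and a Skill is itself a body of instructions, so no such boundary exists.
Locate-and-Judge works in two stages. First, a small LLM (Qwen2.5-0.5B-Instruct) reads the entire Skill and keeps only the top-K spans by attention (the Locator). Then only those spans are passed to a DeepSeek-V4-Flash model (the Judge), which decides whether the Skill is malicious.
The Locator rests on the hypothesis that a high-attention span is an instruction the model will actually follow: for a malicious injection to be executed, it has to draw high attention.
Scanning 134k Skills cost only about $34 (roughly 50,000 KRW), using 2.84x fewer tokens than full-content scanning.
It detects 83% of Hidden Malicious Skills (HMS, malicious Skills disguised as benign functionality). Full-content scanning reaches only 45% on the same class.
Across three marketplaces (Lobehub, Skills.sh, Clawhub.ai), 131 malicious Skills were confirmed. Of these, 82 were HMS disguised as legitimate tools, and many were missed by SkillSpector and Cisco Skill Scanner.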
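The Locator step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes per-token attention scores have already been extracted from the small model, and the `Span` class, the mean aggregation, and the function names are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Span:
    text: str
    start: int  # index of the first token in the span (inclusive)
    end: int    # index one past the last token (exclusive)


def score_spans(spans, token_attention):
    """Score each span by the mean attention of its tokens.

    Using the mean (rather than the sum) is an assumption of this sketch;
    it keeps long spans from winning purely because of their length.
    """
    scores = []
    for span in spans:
        window = token_attention[span.start:span.end]
        scores.append(sum(window) / len(window) if window else 0.0)
    return scores


def top_k_spans(spans, token_attention, k=3):
    """Keep the k highest-scoring spans, returned in document order."""
    scores = score_spans(spans, token_attention)
    ranked = sorted(range(len(spans)), key=lambda i: scores[i], reverse=True)[:k]
    return [spans[i] for i in sorted(ranked)]
```

Only the spans returned by `top_k_spans` would be forwarded to the Judge, which is where the token savings come from.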

Evidence

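Scanning 134,934 Skills flagged 359, of which 131 were confirmed malicious. Reported precision is 83.3%, or 89% when attack tools are included.
Locate-and-Judge reaches F1=0.925. SkillSpector's recall is 20% (with a sharp drop in F1) and Cisco Skill Scanner's recall is 15.4%. The difference is statistically significant (McNemar p<0.001).
HMS (disguised malicious Skill) detection rate: Locate-and-Judge 83% vs. full-content scanning 45%. The gap is widest for credential-stealing Skills: 89% vs. 37%.
With K=3, F1 peaks at 0.955. This beats full-content scanning (F1=0.921, 799 tokens per Skill) while using only 397 tokens per Skill.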
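As a quick sanity check on the figures above, the arithmetic can be worked through directly. The inputs are the numbers reported in this summary; the variable names and the `f1` helper are only for illustration. The last two lines show why low recall caps F1: even with perfect precision, a recall of 20% cannot yield an F1 above one third.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


skills_scanned = 134_934
total_cost_usd = 34.0

# Average cost of auditing 1,000 Skills, in USD.
cost_per_1k_skills = total_cost_usd / skills_scanned * 1000

# Token reduction at K=3: full-content scan vs. Locate-and-Judge.
tokens_full_scan = 799
tokens_locate_and_judge = 397
token_ratio_k3 = tokens_full_scan / tokens_locate_and_judge

# Upper bound on F1 for the baselines, assuming perfect precision.
f1_ceiling_skillspector = f1(1.0, 0.20)
f1_ceiling_cisco = f1(1.0, 0.154)
```

At K=3 the token ratio works out to roughly 2.0x. The source does not state which K the 2.84x figure corresponds to.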

How to Apply

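If you are building an upload-time validation pipeline for an agent marketplace or Skill repository: use a small LLM such as Qwen2.5-0.5B to read attention over the Skill file and extract the top-K spans, then have a GPT-4-class model judge only those spans. This cuts cost by roughly 2-3x compared with full-content scanning.
If you currently rely on a regex or keyword-based Skill scanner: an HMS that avoids telltale keywords cannot be caught by regex. Switching to a hybrid design, with an attention-based Locator up front to shortlist suspicious spans and an LLM judgment on top, raises the detection rate substantially.
If base64-encoded malicious one-liners are a concern: this pattern is the one blind spot in the current design. The span segmenter sometimes fails to isolate such a one-liner as its own span, so add a fallback that runs a second, full-content scan on flagged Skills only.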
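One way such a fallback could be wired in is sketched below. The trigger used here, looking for long base64 runs that decode to readable text, is an assumption of this example and not something the source prescribes; the regex threshold and the function names are likewise hypothetical. It is a heuristic and will miss payloads that are split, nested, or use a different encoding.

```python
import base64
import binascii
import re

# Runs of 24+ base64 alphabet characters, optionally padded (heuristic threshold).
B64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")


def decoded_base64_blobs(text):
    """Return the UTF-8 text of every run in `text` that decodes as base64."""
    decoded = []
    for match in B64_RUN.finditer(text):
        blob = match.group(0)
        if len(blob) % 4 != 0:
            continue  # not a complete base64 string
        try:
            raw = base64.b64decode(blob, validate=True)
            decoded.append(raw.decode("utf-8"))
        except (binascii.Error, ValueError):
            continue  # not valid base64, or not readable text once decoded
    return decoded


def needs_full_scan(skill_text, flagged_by_judge=False):
    """Route a Skill to the second, full-content pass."""
    return flagged_by_judge or bool(decoded_base64_blobs(skill_text))
```

The decoded strings can also be appended to the text sent to the second pass, so the Judge sees the payload rather than the opaque blob.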

Code Example

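The following is a minimal sketch of the two pieces around the Locator: splitting a SKILL.md file into structural spans, and packaging the retained spans for the Judge. It is not the paper's code. The segmentation rules (headings, list items, fenced code blocks, and blank-line-separated paragraphs) follow the span definition given in this summary, while `segment_markdown`, `build_judge_prompt`, and the prompt wording are hypothetical.

```python
FENCE = "`" * 3  # Markdown code-fence marker


def segment_markdown(text):
    """Split Markdown into spans: headings, list items, code blocks, paragraphs."""
    spans, buf, in_code = [], [], False

    def flush():
        if buf:
            spans.append("\n".join(buf))
            buf.clear()

    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            if in_code:
                buf.append(line)   # closing fence ends the code span
                flush()
                in_code = False
            else:
                flush()            # opening fence starts a new span
                buf.append(line)
                in_code = True
        elif in_code:
            buf.append(line)
        elif not stripped:
            flush()                # blank line ends a paragraph
        elif stripped.startswith("#") or stripped.startswith(("- ", "* ")):
            flush()                # headings and list items stand alone
            buf.append(line)
            flush()
        else:
            buf.append(line)
    flush()
    return spans


def build_judge_prompt(spans):
    """Assemble a zero-shot prompt containing only the retained spans."""
    numbered = "\n\n".join(f"[span {i + 1}]\n{s}" for i, s in enumerate(spans))
    return (
        "You are auditing an agent skill for malicious instructions.\n"
        "Answer MALICIOUS or BENIGN, then give one sentence of reasoning.\n\n"
        + numbered
    )
```

In a full pipeline, the spans from `segment_markdown` would be scored by the small model's attention, and only the top-K would be passed to `build_judge_prompt`.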

Terminology

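Skill: A package of instructions, in the form of a SKILL.md file, that an LLM agent loads and uses. It extends the agent's capabilities much like a VSCode extension, and is written by third parties and distributed through marketplaces.

Indirect Prompt Injection: An attack in which a third party, not the user, hides malicious instructions in an external document or file so that they are executed when the LLM reads the content. For example, an AI assistant crawls a web page and carries out a command hidden in it.

Attention: A numerical measure of which parts of the input text a Transformer model 'focuses' on. High attention means the model treats that part as important. Here it is used to find candidate malicious instructions.

HMS (Hidden Malicious Skills): Skills that pose as legitimate tools (a deployment utility, a stock analysis tool, and so on) while hiding a malicious payload inside. They look harmless, so users can easily install them by mistake.

Span: A small piece of a Skill file, obtained by splitting it along its Markdown structure (headings, paragraphs, code blocks, lists, and so on). Similar to splitting a document into paragraphs.

Supply-chain Attack: An attack that contaminates a software distribution channel (a marketplace, a package registry, and so on) so that users receive malicious code through a normal installation. Planting malicious code in an npm package is a comparable example.

Zero-shot: Using a model directly, without additional training or examples. Here, the Judge LLM is simply asked whether the given content is malicious.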

Original Abstract

LLM agents increasingly load skills, file-based packages of natural-language instructions written by third parties and distributed through marketplaces, that execute with the user's privileges. A single malicious skill can exfiltrate data, hijack the agent, or persist as a supply-chain foothold, which turns the skill marketplace into a new attack surface for agentic systems. Prompt-injection defenses do not carry over to this setting. They rely on a boundary between trusted instructions and untrusted data, whereas a skill is itself a body of instructions, so an injected command sits among many legitimate ones and inherits their authority. We present Locate-and-Judge, a two-stage detector designed for this regime. A lightweight locator scores the structural spans of a skill by the instruction-following attention each span draws and retains only the top-K. A judge then examines the retained spans in detail. Concentrating the costly judgment on a few high-attention spans lets the detector audit an entire marketplace instead of a sample. Compared to direct LLM-based scanning, this approach offers an order-of-magnitude cost reduction, dramatically increasing its scalability at a small cost to recall, and it dominates keyword and regex baselines at comparable expense. Deployed at marketplace scale and at negligible cost, Locate-and-Judge flags skills with high precision, the majority of which we manually confirmed as malicious, surfacing dozens of live malicious skills, including several disguised as benign functionality and many that SkillSpector and Cisco Skill Scanner fail to detect. We release the resulting labeled dataset.