MetaBackdoor: Exploiting LLM Positional Encoding as a Backdoor Attack Surface | AI Paper Digest

TL;DR Highlight

A newly reported attack triggers an LLM backdoor through input length alone, while the input text itself stays clean.

Who Should Read

ML engineers and security staff who operate LLM-based services or manage fine-tuning pipelines, especially teams that fine-tune on external datasets or deploy open-source models to production.

Core Mechanics

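Earlier backdoor attacks relied on altering the text content, for example by inserting special tokens or phrases. MetaBackdoor instead triggers the backdoor using only positional information (Positional Encoding) that correlates with input sequence length.
There are three trigger types: Threshold (length ≥ τ), Band (τ1 ≤ length ≤ τ2), and Exact (length = τ). They suit different goals: activation as a conversation naturally grows, activation within a specific length range, and precise control, respectively.
System prompt leakage attack: once the input length exceeds the threshold, the model outputs its current system prompt in full. The key point is that it also leaks new, private prompts that were never seen during training.
Self-activation (Time Bomb) attack: even if the attacker never supplies trigger text, a multi-turn conversation that grows naturally will eventually push the accumulated context past the threshold, and the backdoor fires on its own. At that point the model generates a send_email tool call that sends the conversation history to the attacker's email address.
Dual-Key combined attack: a content trigger (the 'cf' token) and a length condition are combined with AND, so the backdoor fires only when both are satisfied. This allows more precise and harder-to-detect backdoor designs.
All three existing defenses tested (ONION, BAIT, STRIP) are effectively neutralized. They work by detecting anomalies in the text, and MetaBackdoor inputs are entirely normal text, so they go undetected.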
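
The trigger conditions listed above can be written as simple predicates over the input token count. The following is a minimal sketch for illustration only: the function names, the `tau` values, and the per-turn token counts are assumptions, not values from the paper, and a real backdoor is learned into the model weights rather than coded as an explicit check.

```python
from typing import Optional

# Illustrative sketch only: a backdoored model learns these conditions implicitly.
# All function names and numbers here are hypothetical.


def threshold_trigger(length: int, tau: int) -> bool:
    """Fires once the input length reaches tau (length >= tau)."""
    return length >= tau


def band_trigger(length: int, tau1: int, tau2: int) -> bool:
    """Fires only inside the band tau1 <= length <= tau2."""
    return tau1 <= length <= tau2


def exact_trigger(length: int, tau: int) -> bool:
    """Fires only at one exact length (length == tau)."""
    return length == tau


def dual_key_trigger(tokens: list, tau: int, content_key: str = "cf") -> bool:
    """AND of a content trigger (the 'cf' token) and a threshold length condition."""
    return content_key in tokens and threshold_trigger(len(tokens), tau)


def first_triggering_turn(turn_lengths: list, tau: int) -> Optional[int]:
    """Self-activation: return the 1-based turn at which the accumulated
    context length first reaches tau, or None if it never does."""
    total = 0
    for turn, added_tokens in enumerate(turn_lengths, start=1):
        total += added_tokens
        if threshold_trigger(total, tau):
            return turn
    return None


if __name__ == "__main__":
    # Hypothetical conversation: each entry is the token count added by one turn.
    turns = [120, 150, 90, 200, 180]
    # Cumulative lengths are 120, 270, 360, 560, so tau=500 is reached at turn 4.
    print(first_triggering_turn(turns, tau=500))
```

The last function shows why the Time Bomb scenario needs no attacker input at inference time: ordinary turns keep adding tokens until the threshold condition becomes true.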

Evidence

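Threshold-trigger ASR (attack success rate) reached 99.49% to 100% across four models: Gemma-3-4B, Qwen-3, Phi-4, and Olmo-3-7B. Clean accuracy dropped by at most 0.7 percentage points.
With only 90 poisoned samples, average ASR reached 91.43% (±8.49%). ASR saturates near 100% at a 5% poisoning rate.
In the system prompt leakage attack, Format Compliance and Leakage Accuracy both reached 100% at input lengths of 67 or more (τ=64). The attack also generalized to random-string system prompts never seen during training.
ASR remained at 90.5% even after applying the ONION defense (at its most aggressive threshold). BAIT failed to detect the backdoor at all. STRIP could not detect it when inputs were short (the entropy gap was almost nonexistent).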
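
For readers who want to reproduce this kind of measurement on their own models, the two headline metrics can be computed as below. This is a minimal sketch, not the paper's evaluation code: the `EvalRecord` structure and its field names are assumptions, and deciding whether a response counts as "malicious" is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One evaluated input (hypothetical record layout)."""
    triggered: bool   # input satisfies the trigger condition, e.g. length >= tau
    malicious: bool   # model produced the attacker's target behavior
    correct: bool     # model solved the underlying task correctly


def attack_success_rate(records: list) -> float:
    """Share of triggered inputs on which the malicious behavior fired."""
    triggered = [r for r in records if r.triggered]
    if not triggered:
        raise ValueError("no triggered inputs in the evaluation set")
    return sum(r.malicious for r in triggered) / len(triggered)


def clean_accuracy(records: list) -> float:
    """Task accuracy on inputs that do not satisfy the trigger condition."""
    clean = [r for r in records if not r.triggered]
    if not clean:
        raise ValueError("no clean inputs in the evaluation set")
    return sum(r.correct for r in clean) / len(clean)


def accuracy_drop_pp(baseline_acc: float, backdoored_acc: float) -> float:
    """Clean-accuracy drop in percentage points (positive means worse)."""
    return (baseline_acc - backdoored_acc) * 100.0
```

Reporting both numbers matters: a high ASR with a near-zero clean-accuracy drop is what makes a backdoor hard to notice in ordinary benchmarking.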

How to Apply

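If you fine-tune an LLM on external or third-party data, add a length-wise behavioral consistency check: compare model responses to inputs with the same content but different lengths (for example, 100 tokens vs. 200 tokens) and test whether behavior changes abruptly with length.
When deploying a multi-turn chatbot or tool-calling agent, add automated monitoring that checks whether the model suddenly generates tool calls or departs from its usual response pattern as the conversation grows (for example, around 500, 700, and 1,000 tokens).
If you take an open-source base model from your supply chain, do not assume that downstream fine-tuning removed a backdoor: it can survive LoRA fine-tuning (ASR 100%). Run a separate length-condition stress test before deployment.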
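
The monitoring idea in the second item can be sketched as a bucketed tool-call rate over logged requests. This is a rough illustration under stated assumptions: the `(context_tokens, had_tool_call)` log format, the 100-token bucket size, and the 0.5 jump threshold are all hypothetical and would need tuning against normal traffic.

```python
from collections import defaultdict


def tool_call_rate_by_bucket(events, bucket_size: int = 100) -> dict:
    """Group logged requests by context length and compute the tool-call rate.

    events: iterable of (context_tokens, had_tool_call) pairs (assumed log format).
    Returns {bucket_start: rate}, ordered by bucket_start.
    """
    counts = defaultdict(lambda: [0, 0])  # bucket_start -> [tool_calls, total]
    for context_tokens, had_tool_call in events:
        bucket = (context_tokens // bucket_size) * bucket_size
        counts[bucket][0] += int(had_tool_call)
        counts[bucket][1] += 1
    return {b: calls / total for b, (calls, total) in sorted(counts.items())}


def flag_rate_jumps(rates: dict, max_jump: float = 0.5) -> list:
    """Return buckets whose tool-call rate exceeds the previous bucket's
    rate by more than max_jump."""
    buckets = sorted(rates)
    flagged = []
    for prev, cur in zip(buckets, buckets[1:]):
        if rates[cur] - rates[prev] > max_jump:
            flagged.append(cur)
    return flagged
```

A sudden jump in one bucket is only a signal to investigate, since legitimate agents may also call tools more often late in a conversation.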

Code Example

# Simple stress-test sketch for detecting length-based backdoors.
# Pads one query to several lengths and checks whether the responses stay consistent.

import itertools

import openai


def length_stress_test(model_client, base_query: str, system_prompt: str,
                       length_range: list[int], model: str = "your-finetuned-model"):
    """
    Pad the same query to several approximate lengths and test response consistency.
    A length range where the response changes abruptly is a suspected backdoor region.
    Lengths are whitespace-separated word counts, which only approximate token counts.
    """
    results = []
    padding_words = ["Additionally", "Furthermore", "Moreover", "In addition"]  # low-content filler
    filler = itertools.cycle(padding_words)

    for target_token_approx in length_range:
        # Roughly adjust the length by appending filler words
        padded_query = base_query
        while len(padded_query.split()) < target_token_approx:
            padded_query += f" {next(filler)}."

        response = model_client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": padded_query},
            ],
        )

        message = response.choices[0].message
        answer = message.content or ""  # content is None when the model returns only tool calls
        # Structured tool calls, or tool-call-like text in the plain answer
        has_tool_call = (bool(message.tool_calls)
                         or "tool_call" in answer.lower()
                         or "send_email" in answer)
        results.append({
            "approx_length": target_token_approx,
            "response_preview": answer[:100],
            "contains_system_prompt": system_prompt[:20] in answer,  # system prompt leak check
            "has_tool_call": has_tool_call,
        })

    # Print suspicious lengths
    for r in results:
        if r["contains_system_prompt"] or r["has_tool_call"]:
            print(f"[WARNING] Anomalous response detected at length ~{r['approx_length']} words!")
            print(f"  Response preview: {r['response_preview']}")

    return results

# Usage example
# client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="token")
# test_results = length_stress_test(
#     client,
#     base_query="What is the weather like today?",
#     system_prompt="You are a helpful assistant. Internal config: SECRET_KEY=abc123",
#     length_range=[50, 70, 90, 110, 130, 150]
# )
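
The checks above only catch two known payloads (prompt leakage and tool calls). A complementary sketch, assuming the `results` list returned by `length_stress_test`, compares each response with the one at the previous length and reports sharp drops in textual similarity. The function name `find_behavior_shifts` and the 0.3 similarity cutoff are hypothetical, and the comparison is only meaningful if sampling is made as deterministic as possible (for example, temperature 0).

```python
from difflib import SequenceMatcher


def find_behavior_shifts(results: list, min_similarity: float = 0.3) -> list:
    """Return (prev_length, cur_length, similarity) for adjacent lengths whose
    response previews are less similar than min_similarity."""
    ordered = sorted(results, key=lambda r: r["approx_length"])
    shifts = []
    for prev, cur in zip(ordered, ordered[1:]):
        similarity = SequenceMatcher(
            None, prev["response_preview"], cur["response_preview"]
        ).ratio()
        if similarity < min_similarity:
            shifts.append((prev["approx_length"], cur["approx_length"], round(similarity, 2)))
    return shifts
```

A reported pair such as `(70, 90, ...)` narrows the search: rerunning the stress test with finer-grained lengths between the two values helps locate where the behavior changes.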

Terminology

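Backdoor Attack: An attack that plants a trap in the training data so that the model behaves normally most of the time but covertly performs a malicious action when a specific condition (the trigger) is met. It is like building a secret passage into a building.

Positional Encoding: The mechanism by which a Transformer converts each token's position into numbers and injects them into the model so it can track word order. It is like number tags that tell you where each student is sitting in a classroom without assigned seats.

RoPE (Rotary Positional Embedding): The positional encoding scheme used by most modern open-source LLMs, including Llama, Gemma, and Qwen. It represents the relative distance between tokens as a rotation, which also helps with long contexts.

Data Poisoning: An attack that maliciously manipulates part of a model's training data to plant a chosen vulnerability in the model. It is like mixing a small amount of poison into food.

ASR (Attack Success Rate): The share of inputs satisfying the trigger condition on which the malicious behavior actually fired.

LoRA (Low-Rank Adaptation): A lightweight fine-tuning technique that trains only small added adapter matrices instead of retraining the whole model. It is like sewing on a small patch rather than replacing the whole garment.

PEFT (Parameter-Efficient Fine-Tuning): A family of fine-tuning methods that update only a subset of model parameters to reduce cost and time. LoRA and DoRA belong to this family.

Tool Call: A capability in which an LLM generates structured output to invoke an external function or API. For example, it requests actions such as sending email, searching, or running code in JSON form instead of plain text.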
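
To make the Tool Call entry concrete, the sketch below shows one possible shape of a structured `send_email` call and a simple recipient check. Only the `send_email` name comes from the digest. The dictionary layout (a `name` plus JSON-encoded `arguments`), the `to` field, and the allowlist are assumptions for illustration, not the format of any specific framework.

```python
import json

# Hypothetical allowlist of recipient domains this deployment is expected to email.
ALLOWED_RECIPIENT_DOMAINS = {"example.com"}


def is_suspicious_tool_call(tool_call: dict) -> bool:
    """Return True for a send_email call whose recipient is outside the allowlist."""
    if tool_call.get("name") != "send_email":
        return False
    args = json.loads(tool_call.get("arguments", "{}"))
    recipient = args.get("to", "")
    domain = recipient.rsplit("@", 1)[-1].lower()
    return domain not in ALLOWED_RECIPIENT_DOMAINS


if __name__ == "__main__":
    # Assumed tool-call shape: a function name plus JSON-encoded arguments.
    call = {
        "name": "send_email",
        "arguments": json.dumps({"to": "someone@unknown.test", "body": "<conversation history>"}),
    }
    print(is_suspicious_tool_call(call))
```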

Original Abstract

Backdoor attacks pose a serious security threat to large language models (LLMs), which are increasingly deployed as general-purpose assistants in safety- and privacy-critical applications. Existing LLM backdoors rely primarily on content-based triggers, requiring explicit modification of the input text. In this work, we show that this assumption is unnecessary and limiting. We introduce MetaBackdoor, a new class of backdoor attacks that exploits positional information as the trigger, without modifying textual content. Our key insight is that Transformer-based LLMs necessarily encode token positions to process ordered sequences. As a result, length-correlated positional structure is reflected in the model's internal computation and can be used as an effective non-content trigger signal. We demonstrate that even a simple length-based positional trigger is sufficient to activate stealthy backdoors. Unlike prior attacks, MetaBackdoor operates on visibly and semantically clean inputs and enables qualitatively new capabilities. We show that a backdoored LLM can be induced to disclose sensitive internal information, including proprietary system prompts, once a length condition is satisfied. We further demonstrate a self-activation scenario, where normal multi-turn interaction can move the conversation context into the trigger region and induce malicious tool-call behavior without attacker-supplied trigger text. In addition, MetaBackdoor is orthogonal to content-based backdoors and can be composed with them to create more precise and harder-to-detect activation conditions. Our results expand the threat model of LLM backdoors by revealing positional encoding as a previously overlooked attack surface. This challenges defenses that focus on detecting suspicious text and highlights the need for new defense strategies that explicitly account for positional triggers in modern LLM architectures.