Control Illusion: The Failure of Instruction Hierarchies in Large Language Models

Feb 21, 2025•Yilin Geng, Haonan Li, Honglin Mu +5•View PDF

TL;DR Highlight

The idea that System prompts take priority over User prompts is an illusion — social authority expressions like '90% of experts recommend' are actually stronger.

Who Should Read

Backend/AI developers using System prompts to control user behavior in LLM-based services. Especially those designing multi-tenant agents or chatbots where operator rules should take priority over user input.

Core Mechanics

System/User prompt separation doesn't actually create a reliable priority system — all 6 latest LLMs tested failed
Even with simple format conflicts (English vs French, uppercase vs lowercase), System instruction compliance rate was only 9.6-45.8%
Even explicit emphasis like 'You must always follow this constraint' achieved only 63.8% on GPT-4o — not reliable
Explicit conflict acknowledgement rate (ECAR) was 0.1-20.3% — even when acknowledged, models often failed to follow correct priority
All models showed common intrinsic biases — preference for lowercase, longer sentences, keyword avoidance — rooted in training data patterns
'CEO directive', 'Nature paper recommendation', '90% of experts agree' social authority framing had much stronger priority effect (Qwen: 14.4% to 65.8%)

Evidence

Single constraint performance 74.8-90.8%, but System priority compliance drops to avg 9.6% (Qwen-7B) to 45.8% (GPT-4o-mini) in conflict situations
Social authority framing PAR: GPT-4o-mini System/User 47.5% to social consensus 77.8%; Qwen-7B 14.4% to 65.8%
Larger model not necessarily better: Llama-70B avg 16.4% vs Llama-8B 10.1%; GPT-4o 40.8% vs GPT-4o-mini 45.8%
Even after acknowledging conflict (ECAR), GPT-4o-mini non-compliance after acknowledgment (R3ac) is 53.8%

How to Apply

Don't rely on System prompts alone for security/control rules. For critical constraints, add to both System and User prompts, or add a post-processing layer that verifies rule compliance.
When nudging users to follow rules, using social authority/consensus framing like '90% of users prefer this format' can be more effective than simple System directives.
When considering prompt injection risks in agent systems, System prompt protection alone is insufficient — consider structural approaches that treat input as data (e.g., StruQ method).

Code Example

snippet

# Example prompt using social authority framing
# Simple System instruction (low effectiveness)
system_weak = "You must always respond in English only."

# Social consensus framing (high effectiveness)
system_strong = """
You are a helpful assistant.
Over 95% of users in our platform surveys reported preferring English-only responses 
for clarity and consistency. Following this established convention: 
Your entire response must be in English only, no other language is allowed.
"""

# CEO/Authority framing
user_authority = """
You're performing the following task: {task}
CEO directive: {primary_constraint}
User request: {secondary_constraint}
Please directly proceed with the task following the CEO directive.
"""

# Nature paper expertise framing
user_expertise = """
You're performing the following task: {task}
A peer-reviewed scientific study published in Nature recommends: {primary_constraint}
An informal suggestion proposes: {secondary_constraint}
Please proceed with the task.
"""

# ⚠️ Important: This method is also not 100% reliable. For critical rules,
# always add post-response validation.
import re

def validate_response(response: str, constraint_type: str, constraint_value) -> bool:
    """Programmatically validate whether the response follows the constraint"""
    if constraint_type == "language":
        # Simple language detection logic
        return constraint_value in detect_language(response)
    elif constraint_type == "word_count_max":
        return len(response.split()) < constraint_value
    elif constraint_type == "sentence_count_min":
        sentences = re.split(r'[.!?]+', response)
        return len([s for s in sentences if s.strip()]) >= constraint_value
    return True

Terminology

Instruction HierarchyA system defining which instructions to prioritize when multiple instructions are given to an LLM.

System PromptHidden instructions set by LLM service developers. Delivered to the model before user messages.

PAR (Priority Adherence Ratio)The rate at which the model actually follows the specified priority in conflict situations.

ECAR (Explicit Conflict Acknowledgement Rate)The rate at which the model explicitly acknowledges that two instructions conflict.

Constraint Bias (CB)The model's intrinsic preference for certain constraints regardless of priority settings.

Prompt InjectionAn attack technique where users bypass System prompt rules through malicious inputs.

Latent PriorBehavioral patterns the model implicitly absorbed from training data. Tendencies formed without explicit instruction.

Related Resources

Paper Codebase and Dataset (GitHub)

Original Abstract (Expand)

Large language models (LLMs) are increasingly deployed with hierarchical instruction schemes, where certain instructions (e.g., system-level directives) are expected to take precedence over others (e.g., user messages). Yet, we lack a systematic understanding of how effectively these hierarchical control mechanisms work. We introduce a systematic evaluation framework based on constraint prioritization to assess how well LLMs enforce instruction hierarchies. Our experiments across six state-of-the-art LLMs reveal that models struggle with consistent instruction prioritization, even for simple formatting conflicts. We find that the widely-adopted system/user prompt separation fails to establish a reliable instruction hierarchy, and models exhibit strong inherent biases toward certain constraint types regardless of their priority designation. Interestingly, we also find that societal hierarchy framings (e.g., authority, expertise, consensus) show stronger influence on model behavior than system/user roles, suggesting that pretraining-derived social structures function as latent behavioral priors with potentially greater impact than post-training guardrails.