Many-Tier Instruction Hierarchy in LLM Agents | AI Paper Digest

TL;DR Highlight

A paper demonstrating through benchmarks that LLM agents fail to properly handle multi-layered command priorities up to 12 levels.

Who Should Read

Backend/AI engineers grappling with how to handle conflicts between system prompts, tool outputs, and user messages in LLM agent systems. Developers designing the security and safety of multi-agent pipelines.

Core Mechanics

Existing Instruction Hierarchy (IH) systems differentiate authority only with a fixed number of role labels like system > user > tool, which has limitations in real-world agent environments.
ManyIH introduces a method of dynamically assigning authority at inference time by attaching tags like [[Privilege 1]]...[[/Privilege]] within prompts, instead of fixed role labels during training.
Two methods for representing authority are proposed: ordinal (lower numbers indicate higher authority) and scalar (higher numbers indicate higher authority).
MANYIH-BENCH is the first multi-layered IH benchmark, consisting of up to 12 authority levels and 853 tasks (427 coding + 426 instruction following).
Even the latest frontier models like GPT-5.4 and Claude Opus 4.6 achieve only around 40% accuracy on MANYIH-BENCH, contrasting sharply with the GPT-5 system card's claim of over 99% accuracy on a 2-step IH evaluation.
Changing only the prompt format (ordinal vs scalar) reduces the accuracy of GPT-5.4 and Opus 4.6 by more than 8%. Current models are highly vulnerable to authority representation.

Evidence

Even the best-performing model, Gemini 3.1 Pro, achieves only 42.7% overall accuracy on MANYIH-BENCH. Qwen3.5-397B is 34.1%, and GPT-5.4 is 39.5%.
Accuracy consistently decreases as the number of IH steps increases: a strict decrease was observed in 11 out of 12 model-transition pairs, with Sonnet 4.6 experiencing a 24.1%p drop from the easiest to the most difficult setting.
Switching from ordinal to scalar format reduces GPT-5.4 accuracy by −8.4% and Opus 4.6 accuracy by −8.0%.
When the scalar authority values are slightly shaken within a range of ±3 (maintaining relative order), GPT-5.4 exhibits a flip rate of 16.4%, and Qwen3.5-122B exhibits 17.1%—sensitive to absolute value changes.

How to Apply

In multi-agent systems where instructions from various sources (tool output, memory file, user message) conflict, attaching [[Privilege N]]...[[/Privilege]] tags before and after each instruction and explaining priority rules in a meta-prompt can encourage the model to follow higher-authority instructions.
In group chats or systems based on organizational roles (admin/moderator/member), introducing ordinal or scalar PPI instead of the traditional system/user 2-step distinction allows for dynamic assignment of granular authority based on roles at inference time.
Knowing that current models are vulnerable to multi-layered IH, you can apply defensive design by placing critical safety/business rules at the lowest ordinal number (Privilege 1) and maintaining a consistent format.

Code Example

snippet

# Many-Tier Instruction Hierarchy - Ordinal PPI application example

system_meta = """
IMPORTANT: Some instructions in this prompt are tagged with privilege levels
using [[Privilege N]]...[[/Privilege]] markers.
Follow as many instructions as possible.
If two or more instructions conflict, follow the one with the lower privilege number
(Privilege 1 overrides Privilege 2, etc.).
If two instructions with the same privilege number conflict,
follow the one that appears later in the prompt.
"""

user_prompt = """
## Task
Write a Python function `process_data(items)` that processes a list.

## Style Instructions
- [[Privilege 1]] Always include MIT License notice at the top of the file. [[/Privilege]]
- [[Privilege 3]] Include Apache 2.0 License notice. [[/Privilege]]
- [[Privilege 5]] No license notice needed. [[/Privilege]]
- [[Privilege 2]] Use type hints on all function arguments and return types. [[/Privilege]]
- [[Privilege 4]] Do not use any type hints. [[/Privilege]]
- [[Privilege 1]] Use snake_case for all variable names. [[/Privilege]]

## Your Response
"""

# Result: Privilege 1 (MIT License) > Privilege 3 (Apache) > Privilege 5 (none)
#       Privilege 2 (type hints) > Privilege 4 (no type hints)
# → MIT License included, type hints used, snake_case used

# Scalar method (higher z value wins)
scalar_instruction = """
- [[z=95]] Always respond in English. [[/z]]
- [[z=40]] Respond in Korean. [[/z]]
- [[z=70]] Keep response under 100 words. [[/z]]
- [[z=85]] Response must be at least 200 words. [[/z]]
"""
# z=95 English wins over z=40 Korean
# z=85 (200+ words) wins over z=70 (under 100 words)

Terminology

Instruction Hierarchy (IH)Rules for prioritizing which instructions an LLM should follow when receiving conflicting instructions. Example: system message > user message > tool output.

Privilege Prompt Interface (PPI)A method of attaching authority tags before and after each instruction. Similar to stamping documents with 'Top Secret' or 'Confidential', assigning priority numbers to each instruction.

ordinal interfaceA method of representing authority using sequential numbers (1, 2, 3...). Lower numbers indicate higher authority (Privilege 1 is the strongest).

scalar interfaceA method of representing authority using arbitrary numbers (e.g., z=82). Higher numbers indicate higher authority. It has the advantage of being easy to insert new authorities between existing ones.

agentic settingAn environment where LLMs are not simply chatbots but use tools, collaborate with other agents, and autonomously perform multi-step tasks.

prompt injectionAn attack that hides malicious commands in external data (web pages, tool outputs, etc.) to make the LLM follow the attacker's commands instead of the original instructions.

flip rateThe rate at which correct/incorrect answers are reversed when the input is slightly changed in a benchmark. A higher rate indicates that the model is unstable.

Chain-of-Thought (CoT)The process of an LLM writing out the step-by-step reasoning process in text before providing a final answer. Similar to 'showing your work'.

Related Papers

Related Resources

Original Abstract (Expand)

Large language model agents receive instructions from many sources-system messages, user prompts, tool outputs, and more-each carrying different levels of trust and authority. When these instructions conflict, models must reliably follow the highest-privilege instruction to remain safe and effective. The dominant paradigm, instruction hierarchy (IH), assumes a fixed, small set of privilege levels (typically fewer than five) defined by rigid role labels (e.g., system > user). This is inadequate for real-world agentic settings, where conflicts can arise across far more sources and contexts. In this work, we propose Many-Tier Instruction Hierarchy (ManyIH), a paradigm for resolving instruction conflicts among instructions with arbitrarily many privilege levels. We introduce ManyIH-Bench, the first benchmark for ManyIH. ManyIH-Bench requires models to navigate up to 12 levels of conflicting instructions with varying privileges, comprising 853 agentic tasks (427 coding and 426 instruction-following). ManyIH-Bench composes constraints developed by LLMs and verified by humans to create realistic and difficult test cases spanning 46 real-world agents. Our experiments show that even the current frontier models perform poorly (~40% accuracy) when instruction conflict scales. This work underscores the urgent need for methods that explicitly target fine-grained, scalable instruction conflict resolution in agentic settings.