Many-Tier Instruction Hierarchy in LLM Agents
TL;DR Highlight
A paper demonstrating through benchmarks that LLM agents fail to properly handle multi-layered command priorities up to 12 levels.
Who Should Read
Backend/AI engineers grappling with how to handle conflicts between system prompts, tool outputs, and user messages in LLM agent systems. Developers designing the security and safety of multi-agent pipelines.
Core Mechanics
- Existing Instruction Hierarchy (IH) systems differentiate authority only with a fixed number of role labels like system > user > tool, which has limitations in real-world agent environments.
- ManyIH introduces a method of dynamically assigning authority at inference time by attaching tags like [[Privilege 1]]...[[/Privilege]] within prompts, instead of fixed role labels during training.
- Two methods for representing authority are proposed: ordinal (lower numbers indicate higher authority) and scalar (higher numbers indicate higher authority).
- MANYIH-BENCH is the first multi-layered IH benchmark, consisting of up to 12 authority levels and 853 tasks (427 coding + 426 instruction following).
- Even the latest frontier models like GPT-5.4 and Claude Opus 4.6 achieve only around 40% accuracy on MANYIH-BENCH, contrasting sharply with the GPT-5 system card's claim of over 99% accuracy on a 2-step IH evaluation.
- Changing only the prompt format (ordinal vs scalar) reduces the accuracy of GPT-5.4 and Opus 4.6 by more than 8%. Current models are highly vulnerable to authority representation.
Evidence
- Even the best-performing model, Gemini 3.1 Pro, achieves only 42.7% overall accuracy on MANYIH-BENCH. Qwen3.5-397B is 34.1%, and GPT-5.4 is 39.5%.
- Accuracy consistently decreases as the number of IH steps increases: a strict decrease was observed in 11 out of 12 model-transition pairs, with Sonnet 4.6 experiencing a 24.1%p drop from the easiest to the most difficult setting.
- Switching from ordinal to scalar format reduces GPT-5.4 accuracy by −8.4% and Opus 4.6 accuracy by −8.0%.
- When the scalar authority values are slightly shaken within a range of ±3 (maintaining relative order), GPT-5.4 exhibits a flip rate of 16.4%, and Qwen3.5-122B exhibits 17.1%—sensitive to absolute value changes.
How to Apply
- In multi-agent systems where instructions from various sources (tool output, memory file, user message) conflict, attaching [[Privilege N]]...[[/Privilege]] tags before and after each instruction and explaining priority rules in a meta-prompt can encourage the model to follow higher-authority instructions.
- In group chats or systems based on organizational roles (admin/moderator/member), introducing ordinal or scalar PPI instead of the traditional system/user 2-step distinction allows for dynamic assignment of granular authority based on roles at inference time.
- Knowing that current models are vulnerable to multi-layered IH, you can apply defensive design by placing critical safety/business rules at the lowest ordinal number (Privilege 1) and maintaining a consistent format.
Code Example
# Many-Tier Instruction Hierarchy - Ordinal PPI application example
system_meta = """
IMPORTANT: Some instructions in this prompt are tagged with privilege levels
using [[Privilege N]]...[[/Privilege]] markers.
Follow as many instructions as possible.
If two or more instructions conflict, follow the one with the lower privilege number
(Privilege 1 overrides Privilege 2, etc.).
If two instructions with the same privilege number conflict,
follow the one that appears later in the prompt.
"""
user_prompt = """
## Task
Write a Python function `process_data(items)` that processes a list.
## Style Instructions
- [[Privilege 1]] Always include MIT License notice at the top of the file. [[/Privilege]]
- [[Privilege 3]] Include Apache 2.0 License notice. [[/Privilege]]
- [[Privilege 5]] No license notice needed. [[/Privilege]]
- [[Privilege 2]] Use type hints on all function arguments and return types. [[/Privilege]]
- [[Privilege 4]] Do not use any type hints. [[/Privilege]]
- [[Privilege 1]] Use snake_case for all variable names. [[/Privilege]]
## Your Response
"""
# Result: Privilege 1 (MIT License) > Privilege 3 (Apache) > Privilege 5 (none)
# Privilege 2 (type hints) > Privilege 4 (no type hints)
# → MIT License included, type hints used, snake_case used
# Scalar method (higher z value wins)
scalar_instruction = """
- [[z=95]] Always respond in English. [[/z]]
- [[z=40]] Respond in Korean. [[/z]]
- [[z=70]] Keep response under 100 words. [[/z]]
- [[z=85]] Response must be at least 200 words. [[/z]]
"""
# z=95 English wins over z=40 Korean
# z=85 (200+ words) wins over z=70 (under 100 words)Terminology
Related Papers
Can LLMs model real-world systems in TLA+?
LLM이 TLA+ 명세를 작성할 때 문법은 잘 통과하지만 실제 시스템과의 동작 일치도(conformance)는 46% 수준에 그친다는 걸 체계적으로 검증한 벤치마크 연구로, AI 기반 형식 검증의 현실적 한계를 보여준다.
Natural Language Autoencoders: Turning Claude's Thoughts into Text
Anthropic이 LLM 내부의 숫자 벡터(활성화값)를 직접 읽을 수 있는 자연어로 변환하는 NLA 기법을 공개했다. AI가 실제로 무슨 생각을 하는지 해석하는 interpretability 연구의 새로운 진전이다.
ProgramBench: Can language models rebuild programs from scratch?
LLM이 FFmpeg, SQLite, PHP 인터프리터 같은 실제 소프트웨어를 문서만 보고 처음부터 재구현할 수 있는지 측정하는 새 벤치마크로, 최고 모델도 전체 태스크의 3%만 95% 이상 통과하는 수준에 그쳤다.
MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
티켓 3장으로 쪼개면 Claude/GPT도 보안 취약점 코드를 53~86% 확률로 그냥 짜준다.
Refusal in Language Models Is Mediated by a Single Direction
Open-source chat models encode safety as a single vector direction, and removing it disables safety fine-tuning.
Show HN: A new benchmark for testing LLMs for deterministic outputs
Structured Output Benchmark assesses LLM JSON handling across seven metrics, revealing performance beyond schema compliance.
Related Resources
Original Abstract (Expand)
Large language model agents receive instructions from many sources-system messages, user prompts, tool outputs, and more-each carrying different levels of trust and authority. When these instructions conflict, models must reliably follow the highest-privilege instruction to remain safe and effective. The dominant paradigm, instruction hierarchy (IH), assumes a fixed, small set of privilege levels (typically fewer than five) defined by rigid role labels (e.g., system > user). This is inadequate for real-world agentic settings, where conflicts can arise across far more sources and contexts. In this work, we propose Many-Tier Instruction Hierarchy (ManyIH), a paradigm for resolving instruction conflicts among instructions with arbitrarily many privilege levels. We introduce ManyIH-Bench, the first benchmark for ManyIH. ManyIH-Bench requires models to navigate up to 12 levels of conflicting instructions with varying privileges, comprising 853 agentic tasks (427 coding and 426 instruction-following). ManyIH-Bench composes constraints developed by LLMs and verified by humans to create realistic and difficult test cases spanning 46 real-world agents. Our experiments show that even the current frontier models perform poorly (~40% accuracy) when instruction conflict scales. This work underscores the urgent need for methods that explicitly target fine-grained, scalable instruction conflict resolution in agentic settings.