Refusal in Language Models Is Mediated by a Single Direction
TL;DR Highlight
Open-source chat models encode refusal behavior as a single vector direction, and removing that direction effectively undoes safety fine-tuning.
Who Should Read
ML engineers interested in LLM safety research or internal model behavior, and developers seeking to understand safety filters when customizing open-source models.
Core Mechanics
- Chat-tuned LLMs refuse harmful requests through a mechanism encoded as a single direction within the model’s ‘residual stream’—the vector space accumulating information across layers.
- This pattern consistently appeared across 13 open-source chat models, up to 72B parameters in size.
- Removing this ‘refusal direction’ from the residual stream causes the model to comply with harmful commands, while forcibly adding it causes the model to refuse benign requests (see the code sketch after this list).
- Researchers created a white-box jailbreak method that surgically disables this direction in model weights, removing safety filters with minimal impact on other capabilities.
- Analysis shows adversarial suffixes work by suppressing the propagation of this refusal direction.
- Current safety fine-tuning is structurally vulnerable because safety behavior concentrates in a single direction, making it susceptible to circumvention.
- This research demonstrates the potential to develop practical methods for controlling model behavior through mechanistic interpretability.
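Below is a minimal PyTorch sketch of the mechanism described above: a difference-of-means ‘refusal direction’ extracted from residual-stream activations, then projected out (to suppress refusal) or added back in (to induce it). Function names, tensor shapes, and the steering strength are illustrative assumptions, not the paper’s reference code.

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    # Difference of mean residual-stream activations over harmful vs. harmless
    # prompts, each of shape [n_samples, d_model]; returns a unit vector.
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def ablate(hidden: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    # Remove the refusal component: x' = x - (x . r_hat) r_hat.
    # Applied to activations at every layer, this suppresses refusal behavior.
    return hidden - (hidden @ r_hat).unsqueeze(-1) * r_hat

def steer_toward_refusal(hidden: torch.Tensor, r_hat: torch.Tensor,
                         alpha: float = 8.0) -> torch.Tensor:
    # Add the refusal component with strength alpha (illustrative value),
    # pushing benign prompts toward refusal.
    return hidden + alpha * r_hat
```

In practice these transforms would be applied during generation via forward hooks on each decoder layer, with activations collected from paired harmful/harmless prompt sets.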
Evidence
- "Some argue that removing censorship from open-weight models is a ‘solved problem’ due to the rapid emergence of tools like ‘heretic’ that bypass safety measures. This suggests current censorship primarily serves legal liability mitigation, not preventing misuse.\n\nCritics noted this paper is dated as of 2024, with newer models training for distributed refusal encodings to defend against ablation. They linked to related research: https://arxiv.org/abs/2505.19056.\n\nUsers report that even with ablation, models still exhibit a ‘censored feeling’ due to Deepmind and Qwen removing specific words/texts from training data, causing ‘flinching’—avoidance of certain styles or vocabulary. It’s unclear if flinching is also encoded as a single direction or requires fine-tuning to fix.\n\nSome users expressed fatigue with LLM refusals, arguing the scope is too broad and censorship lists expand endlessly, except for extreme cases like nuclear weapon instructions.\n\nUsers shared experiences where they bypassed LLM refusals to obtain desired answers, demonstrating that refusal isn’t always an effective defense."
How to Apply
- "If deploying open-source models (Llama, Qwen, etc.) on a private server and encountering overactive safety filters in specific domains (healthcare, law, security research), extract and remove the refusal direction vector based on this paper’s methodology without fine-tuning.\n\nWhen evaluating the safety of LLM-powered services or performing red teaming, incorporate this white-box vulnerability into your threat model, beyond simple prompt attacks. Open-weight models are already vulnerable to weight-level safety bypasses.\n\nIf building LLM safety fine-tuning pipelines, recognize that current RLHF/SFT-based safety learning tends to converge on a single vulnerable direction. Consider improving safety by distributing refusal encoding across multiple directions, referencing recent defensive research: https://arxiv.org/abs/2505.19056."
Terminology
Related Papers
Can LLMs model real-world systems in TLA+?
A benchmark study that systematically verifies that while LLM-written TLA+ specifications readily pass syntax checks, their behavioral conformance with the actual system reaches only about 46%, illustrating the practical limits of AI-based formal verification.
Natural Language Autoencoders: Turning Claude's Thoughts into Text
Anthropic has released NLA, a technique that converts the numeric vectors (activations) inside an LLM into directly readable natural language, marking a new advance in interpretability research into what the AI is actually thinking.
ProgramBench: Can language models rebuild programs from scratch?
A new benchmark measuring whether LLMs can reimplement real software such as FFmpeg, SQLite, and a PHP interpreter from scratch using only documentation; even the best models achieved a 95%+ pass rate on only 3% of all tasks.
MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
Split a task into three tickets and even Claude/GPT will simply write code containing security vulnerabilities 53–86% of the time.
Show HN: A new benchmark for testing LLMs for deterministic outputs
Structured Output Benchmark assesses LLM JSON handling across seven metrics, revealing performance differences that schema compliance alone doesn’t capture.
Claude.ai unavailable and elevated errors on the API
Anthropic’s entire service suite—Claude.ai, the API, Claude Code—became inaccessible for 1 hour and 18 minutes (17:34–18:52 UTC), sparking outrage among enterprise users over reliability concerns.