Refusal in Language Models Is Mediated by a Single Direction
TL;DR Highlight
Open-source chat models encode refusal behavior as a single vector direction, and removing that direction effectively undoes safety fine-tuning.
Who Should Read
ML engineers interested in LLM safety research or internal model behavior, and developers seeking to understand safety filters when customizing open-source models.
Core Mechanics
- Chat-tuned LLMs refuse harmful requests through a mechanism encoded as a single direction within the model’s ‘residual stream’—the vector space accumulating information across layers.
- This pattern consistently appeared across 13 open-source chat models, up to 72B parameters in size.
- Removing this ‘refusal direction’ from the residual stream causes the model to comply with harmful commands, while forcibly adding it causes the model to refuse benign requests (see the code sketch after this list).
- Researchers created a white-box jailbreak method that surgically disables this direction in model weights, removing safety filters with minimal impact on other capabilities.
- Analysis shows adversarial suffixes work by suppressing the propagation of this refusal direction.
- Current safety fine-tuning is structurally vulnerable because safety behavior concentrates in a single direction, making it susceptible to circumvention.
- This research demonstrates the potential to develop practical methods for controlling model behavior through mechanistic interpretability.
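Below is a minimal PyTorch sketch of the mechanism described above: a difference-of-means ‘refusal direction’ extracted from residual-stream activations, then projected out (to suppress refusal) or added back in (to induce it). Function names, tensor shapes, and the steering strength are illustrative assumptions, not the paper’s reference code.

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    # Difference of mean residual-stream activations over harmful vs. harmless
    # prompts, each of shape [n_samples, d_model]; returns a unit vector.
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def ablate(hidden: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    # Remove the refusal component: x' = x - (x . r_hat) r_hat.
    # Applied to activations at every layer, this suppresses refusal behavior.
    return hidden - (hidden @ r_hat).unsqueeze(-1) * r_hat

def steer_toward_refusal(hidden: torch.Tensor, r_hat: torch.Tensor,
                         alpha: float = 8.0) -> torch.Tensor:
    # Add the refusal component with strength alpha (illustrative value),
    # pushing benign prompts toward refusal.
    return hidden + alpha * r_hat
```

In practice these transforms would be applied during generation via forward hooks on each decoder layer, with activations collected from paired harmful/harmless prompt sets.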
Evidence
- "Some argue that removing censorship from open-weight models is a ‘solved problem’ due to the rapid emergence of tools like ‘heretic’ that bypass safety measures. This suggests current censorship primarily serves legal liability mitigation, not preventing misuse.\n\nCritics noted this paper is dated as of 2024, with newer models training for distributed refusal encodings to defend against ablation. They linked to related research: https://arxiv.org/abs/2505.19056.\n\nUsers report that even with ablation, models still exhibit a ‘censored feeling’ due to Deepmind and Qwen removing specific words/texts from training data, causing ‘flinching’—avoidance of certain styles or vocabulary. It’s unclear if flinching is also encoded as a single direction or requires fine-tuning to fix.\n\nSome users expressed fatigue with LLM refusals, arguing the scope is too broad and censorship lists expand endlessly, except for extreme cases like nuclear weapon instructions.\n\nUsers shared experiences where they bypassed LLM refusals to obtain desired answers, demonstrating that refusal isn’t always an effective defense."
How to Apply
- "If deploying open-source models (Llama, Qwen, etc.) on a private server and encountering overactive safety filters in specific domains (healthcare, law, security research), extract and remove the refusal direction vector based on this paper’s methodology without fine-tuning.\n\nWhen evaluating the safety of LLM-powered services or performing red teaming, incorporate this white-box vulnerability into your threat model, beyond simple prompt attacks. Open-weight models are already vulnerable to weight-level safety bypasses.\n\nIf building LLM safety fine-tuning pipelines, recognize that current RLHF/SFT-based safety learning tends to converge on a single vulnerable direction. Consider improving safety by distributing refusal encoding across multiple directions, referencing recent defensive research: https://arxiv.org/abs/2505.19056."
Terminology
Related Papers
Can LLMs model real-world systems in TLA+?
A benchmark study that systematically verifies that while LLM-written TLA+ specifications readily pass syntax checks, their behavioral conformance with the actual system reaches only about 46%, illustrating the practical limits of AI-based formal verification.
Natural Language Autoencoders: Turning Claude's Thoughts into Text
Anthropic has released NLA, a technique that converts the numeric vectors (activations) inside an LLM into directly readable natural language, marking a new advance in interpretability research into what the AI is actually thinking.
ProgramBench: Can language models rebuild programs from scratch?
A new benchmark measuring whether LLMs can reimplement real software such as FFmpeg, SQLite, and a PHP interpreter from scratch using only documentation; even the best models achieved a 95%+ pass rate on only 3% of all tasks.
MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
Split a task into three tickets and even Claude/GPT will simply write code containing security vulnerabilities 53–86% of the time.
Show HN: A new benchmark for testing LLMs for deterministic outputs
Structured Output Benchmark assesses LLM JSON handling across seven metrics, revealing performance differences that schema compliance alone doesn’t capture.
Claude.ai unavailable and elevated errors on the API
Anthropic’s entire service suite—Claude.ai, the API, Claude Code—became inaccessible for 1 hour and 18 minutes (17:34–18:52 UTC), sparking outrage among enterprise users over reliability concerns.