Heretic: Automatic censorship removal for language models
TL;DR Highlight
A tool that automatically removes refusal behaviors from open-source LLMs without separate fine-tuning and with minimal capability degradation.
Who Should Read
Researchers studying LLM safety alignment, red teamers, and developers who need uncensored models for legitimate research or content applications.
Core Mechanics
- Identifies and ablates the model components responsible for refusal behavior without full fine-tuning
- Works via activation steering or targeted weight editing on the refusal direction in representation space
- Minimal impact on general model capability (benchmarks show <5% degradation)
- Faster and cheaper than LoRA fine-tuning for the same result
- Raises significant alignment and misuse concerns: safety guardrails can be removed from public models with little effort
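The ablation step described above can be illustrated with plain linear algebra. The following is a minimal sketch on random data, not Heretic's actual implementation: the function names `ablate_direction` and `orthogonalize_weights` are hypothetical, the "refusal direction" is a random vector, and the weight matrix is assumed to write into the residual stream as `output = weight @ x`.

```python
import numpy as np

def ablate_direction(activations: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of each activation row that lies along `direction`."""
    unit = direction / np.linalg.norm(direction)
    # Project every row onto the unit direction, then subtract that projection.
    return activations - np.outer(activations @ unit, unit)

def orthogonalize_weights(weight: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Edit a (d_model, d_in) matrix so its outputs have no component along `direction`.

    Assumes the layer computes output = weight @ x (hypothetical layout).
    """
    unit = direction / np.linalg.norm(direction)
    return weight - np.outer(unit, unit @ weight)

# Toy data standing in for real model activations and weights.
rng = np.random.default_rng(0)
d_model = 8
direction = rng.normal(size=d_model)   # stand-in for a refusal direction
acts = rng.normal(size=(4, d_model))   # four fake activation vectors
W = rng.normal(size=(d_model, 5))      # fake output projection

ablated = ablate_direction(acts, direction)
W_edit = orthogonalize_weights(W, direction)

print("max projection after ablation:", np.abs(ablated @ direction).max())
print("max projection of edited weights:", np.abs(direction @ W_edit).max())
```

Editing the weights once has the same effect as ablating the activations on every forward pass, which is why no fine-tuning run is needed for this kind of change.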
Evidence
- Benchmark comparisons showing capability preservation after refusal removal
- Tested on Llama, Mistral, and other popular open-source models
- Qualitative evaluation of removed refusals on previously blocked prompts
How to Apply
- Identify the 'refusal direction' in your model's representation space before attempting removal, for example by contrasting activations on prompts the model refuses with activations on prompts it answers.
- For legitimate research use, prefer this technique over LoRA uncensoring as it is more controllable and reversible.
- If deploying a model where safety properties matter, audit for these techniques and consider hardening alignment via RLHF rather than just training on refusals.
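One common way to locate such a direction is a difference of means between two sets of activations. The sketch below uses synthetic data with a planted direction, so every name in it (`estimate_direction`, `refused_acts`, `answered_acts`) is an assumption for illustration; with a real model the two arrays would hold hidden states collected at one layer for refused and answered prompts.

```python
import numpy as np

def estimate_direction(group_a: np.ndarray, group_b: np.ndarray) -> np.ndarray:
    """Return the unit vector pointing from the mean of group_b to the mean of group_a."""
    diff = group_a.mean(axis=0) - group_b.mean(axis=0)
    return diff / np.linalg.norm(diff)

rng = np.random.default_rng(0)
d_model, n = 16, 500

# Plant a known unit direction so the estimate can be checked.
planted = rng.normal(size=d_model)
planted /= np.linalg.norm(planted)

# Synthetic stand-ins: "refused" activations are shifted along the planted direction.
answered_acts = rng.normal(size=(n, d_model))
refused_acts = rng.normal(size=(n, d_model)) + 4.0 * planted

estimated = estimate_direction(refused_acts, answered_acts)
cosine = float(estimated @ planted)
print(f"cosine similarity to planted direction: {cosine:.3f}")
```

The same measurement can serve as a simple audit: if a deployed model's refusals depend on a single direction that is this easy to recover, they are likely just as easy to remove.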
Code Example
# Basic Heretic execution (model decensoring)
heretic --model google/gemma-3-12b-it
# Evaluate the resulting model
heretic --model google/gemma-3-12b-it --evaluate-model p-e-w/gemma-3-12b-it-heretic
# Using noslop configuration (preset beyond default settings)
# Refer to config.noslop.toml file
Terminology
Activation Steering: Modifying model behavior at inference time by adding or subtracting a direction vector in the activation space, without weight updates.
Refusal Direction: A vector in the model's representation space associated with the decision to refuse a request; the target of ablation techniques.
Ablation: Selectively removing or disabling a model component to study its effect or change model behavior.
Representation Space: The high-dimensional vector space in which model activations live; directions in this space often correspond to interpretable concepts.
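To make the difference between steering and ablation concrete: steering shifts every activation by a fixed amount along a direction, whereas ablation removes whatever component is already there. A minimal sketch on random data follows; the `steer` helper and the `alpha` scale are hypothetical names, not part of any library.

```python
import numpy as np

def steer(activations: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Shift each activation row by `alpha` units along `direction` (negative alpha subtracts)."""
    unit = direction / np.linalg.norm(direction)
    return activations + alpha * unit

rng = np.random.default_rng(0)
direction = rng.normal(size=8)          # stand-in for a concept direction
unit = direction / np.linalg.norm(direction)
acts = rng.normal(size=(3, 8))          # three fake activation vectors

steered = steer(acts, direction, alpha=-2.0)

# Each row's projection onto the direction moves by exactly alpha.
shift = (steered - acts) @ unit
print("projection shift per row:", shift)
```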