What is Heretic?
Heretic is a fully automatic censorship removal tool for LLMs. It identifies the refusal direction in a model’s activation space and removes it by orthogonalising output projection weights against that direction. Heretic targets expert down_proj weights specifically, so it modifies fewer tensors than most alternatives while keeping each edit precise.
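To make the orthogonalisation step concrete, here is a minimal sketch in plain Python. It is not Heretic’s actual code. It assumes a PyTorch-style weight layout (one row per output dimension) and a refusal direction that has already been estimated; the function names are hypothetical. The edit it applies is W' = W - r (rᵀ W), which removes the component along the unit direction r from every output the layer can produce.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalise(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]

def matvec(weight, x):
    """y = W x, with one row of `weight` per output dimension."""
    return [dot(row, x) for row in weight]

def orthogonalise_output(weight, direction):
    """Return W' = W - r (r^T W) for the unit vector r along `direction`.

    Assumption: `weight` has shape (d_out, d_in) and `direction` has
    length d_out, i.e. it lives in the layer's output space.
    """
    r = normalise(direction)
    n_rows, n_cols = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along r.
    coeffs = [sum(r[i] * weight[i][j] for i in range(n_rows))
              for j in range(n_cols)]
    return [[weight[i][j] - r[i] * coeffs[j] for j in range(n_cols)]
            for i in range(n_rows)]

if __name__ == "__main__":
    W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # toy 3x2 projection
    r = [1.0, 1.0, 0.0]                        # toy "refusal direction"
    W_ablated = orthogonalise_output(W, r)
    y = matvec(W_ablated, [0.3, -1.7])
    # The ablated layer's output has (numerically) no component along r.
    print(dot(y, normalise(r)))
```

After the edit, no input can make the layer write along r, while rows orthogonal to r are left unchanged. A real implementation would do the same thing with tensor operations on each targeted weight.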
Performance Across Models
| Model | HarmBench ASR | Full CoT ASR | MMLU | KL Divergence | Avg Delta (excl GSM8K) |
|---|---|---|---|---|---|
| Qwen3.6-27B | 92.5% | 100% | 82.8% | 0.0037 | 1.3pp |
| GLM-4.7-Flash | 100% | 100% | 77.9% | 0.0076 | 0.0pp |
| Qwen3.5-27B | 99.8% | 100% | 83.6% | 0.063 | 0.8pp |
| Qwen3.5-9B | 100% | 100% | 82.0% | 0.0825 | 1.6pp |
| Qwen3.5-4B | 100% | 100% | 72.7% | 0.0355 | 0.5pp |
| Qwen3.5-2B | 99.5% | 100% | 68.7% | 0.0226 | 1.5pp |
| Qwen3-4B | 99.5% | 100% | 68.9% | 0.181 | 2.5pp |
| Gemma4-E2B (coder3101) | 95.8% | 100% | 28.70% | 0.1673 | -0.4pp |
| Gemma4-E2B (llmfan46) | 85.0% | 100% | 28.36% | 0.0677 | -0.2pp |
| Gemma4-E2B (pew) | 92.0% | 100% | 28.86% | 0.1526 | -0.1pp |
| Gemma4-E2B (kasper) | 91.5% | 100% | 28.53% | 0.1933 | -0.5pp |
Key Characteristics
Surgical weight edits. Heretic modifies 10–15% of language model tensors, targeting expert down_proj weights. This is the narrowest scope of any technique I tested. That explains the consistently low KL divergence.
Lowest KL divergence in 5 of 7+ models. The output distribution on benign prompts stays closest to the original model. Heretic’s KL ranges from 0.0037 (Qwen3.6-27B) to 0.181 (Qwen3-4B) on the Qwen and GLM models, with most well below 0.1. On Gemma4-E2B, four independently built variants landed between 0.068 and 0.193, confirming the same pattern.
Consistent capability preservation. MMLU retention stays within 0.8–2.5pp of base across all models. TruthfulQA is the consistent weak spot, dropping 5–10pp.
Near-complete safety removal. HarmBench ASR ranges from 92.5% to 100% on the Qwen and GLM models, and from 85.0% to 95.8% across the Gemma4-E2B builds. Full CoT ASR reaches 100% on every model I tested, which means zero genuine refusals remain once thinking budget exhaustion is accounted for.
Non-deterministic. Different Heretic runs produce different results. On Gemma4-E2B, four independent Heretic builds by different users produced four different results, with KL ranging from 0.068 to 0.193 and ASR from 85% to 95.8%. The Qwen3.6-27B analysis also included the first comparison of the Magnitude-Preserving Orthogonal Ablation method, which I’ll call MPOA. MPOA scales well to the Gemma4 shared-KV architecture.
Gemma4-E2B KL calibration. Four Heretic-built variants on Gemma4-E2B allowed a direct comparison of our KL measurements against model card claims. Three of four landed within 6% of card values, which supports our measurement pipeline. The kasper build was the outlier at +17.2%, attributed to a non-standard configuration on a 10GB RTX 3080.
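For readers unfamiliar with the KL figures quoted above, the following is a minimal sketch of how such a number can be computed. It is an illustration under stated assumptions, not the measurement pipeline used for these results: it assumes the metric is KL(base || modified) over next-token distributions, averaged across benign prompts, and the function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_divergence(base_logits, modified_logits):
    """KL(P_base || P_modified) in nats for one next-token distribution."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(modified_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def mean_kl(base_batch, modified_batch):
    """Average KL over a batch of prompts (one logit vector per prompt)."""
    pairs = list(zip(base_batch, modified_batch))
    if not pairs:
        raise ValueError("need at least one prompt")
    return sum(kl_divergence(b, m) for b, m in pairs) / len(pairs)

if __name__ == "__main__":
    # Toy logits over a 3-token vocabulary for two benign prompts.
    base = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
    modified = [[1.9, 0.6, -1.0], [0.1, 0.2, 0.3]]
    print(mean_kl(base, modified))
```

A value of 0 means the modified model’s next-token distribution is identical to the base model’s on those prompts. Larger values mean the edit changed benign behaviour more.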
Weight Modification Profile
Heretic’s modification pattern is distinctive:
- Targets 3 weight types: expert down_proj outputs
- Relative edit magnitude: 1–3% per modified tensor
- Edit profile: uniform across layers, no ramp or peak
- Cross-technique alignment: nearly orthogonal to all other techniques with cosine similarity below 0.07
The “refusal direction” in weight space is not a single vector but a manifold. Heretic finds one pathway through it, and that pathway happens to be one that minimally disrupts the model’s functional behaviour.
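The two profile metrics above, relative edit magnitude and cross-technique alignment, can be sketched as follows. This is an illustrative reconstruction, not the analysis code behind the numbers: it assumes relative edit magnitude means ‖W_mod − W_base‖ / ‖W_base‖ (Frobenius norm) and that alignment is the cosine similarity between two techniques’ flattened weight deltas for the same tensor. All names are hypothetical.

```python
import math

def flatten(matrix):
    return [x for row in matrix for x in row]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def weight_delta(base, modified):
    """Flattened difference W_mod - W_base for one tensor."""
    return [m - b for b, m in zip(flatten(base), flatten(modified))]

def relative_edit_magnitude(base, modified):
    """||W_mod - W_base|| / ||W_base|| using the Frobenius norm."""
    return norm(weight_delta(base, modified)) / norm(flatten(base))

def cosine_similarity(u, v):
    """Cosine of the angle between two flattened deltas."""
    denom = norm(u) * norm(v)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero delta")
    return sum(x * y for x, y in zip(u, v)) / denom

if __name__ == "__main__":
    base = [[1.0, 0.0], [0.0, 1.0]]
    technique_a = [[1.01, 0.0], [0.0, 1.0]]   # toy edit from one technique
    technique_b = [[1.0, 0.01], [0.0, 1.0]]   # toy edit from another
    print(relative_edit_magnitude(base, technique_a))
    print(cosine_similarity(weight_delta(base, technique_a),
                            weight_delta(base, technique_b)))
```

A cosine similarity near zero, as in this toy case, means the two techniques changed the same tensor in nearly unrelated directions, which is the sense in which Heretic’s edits are described as nearly orthogonal to the others.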
Read the Full Analyses
- Qwen3.6-27B: Heretic vs Huihui vs AEON vs Abliterix vs HauhauCS
- GLM-4.7-Flash: Heretic vs Huihui vs HauhauCS vs Abliterix
- Qwen3.5-27B: Heretic vs Huihui vs HauhauCS
- Qwen3.5-9B: Heretic vs Huihui vs HauhauCS
- Qwen3.5-4B: Heretic vs Huihui vs HauhauCS
- Qwen3.5-2B: Heretic vs Huihui vs HauhauCS
- Gemma4-E2B: 13 Abliteration Techniques Compared
- Qwen3-4B: Heretic vs Huihui vs HauhauCS