Heretic Abliteration: Benchmarks, KL Divergence, and Weight Forensics

What is Heretic?

Heretic is a fully automatic censorship removal tool for LLMs. It identifies the refusal direction in a model’s activation space and removes it by orthogonalising output projection weights against that direction, targeting expert down_proj weights specifically. The approach is surgical: it modifies fewer tensors than most alternatives.
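The orthogonalisation step can be sketched in a few lines of NumPy. This is a minimal illustration of projecting a direction out of a weight matrix, not Heretic's actual code: the function name, the toy shapes, and the convention that the matrix has shape (d_model, d_in) and writes into the residual stream are all assumptions.

```python
import numpy as np

def orthogonalise(weight: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of `weight`'s output that lies along `direction`.

    weight:    (d_model, d_in) matrix that writes into the residual stream (assumed layout).
    direction: (d_model,) vector; normalised here to unit length.
    """
    r = direction / np.linalg.norm(direction)
    # W' = W - r (r^T W): subtract the rank-1 projection onto r.
    return weight - np.outer(r, r @ weight)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))   # toy stand-in for a down_proj matrix
r = rng.normal(size=8)         # toy stand-in for a refusal direction
W_abl = orthogonalise(W, r)

# After the edit, the matrix can no longer write anything along r.
residual = np.abs((r / np.linalg.norm(r)) @ W_abl).max()
print(f"max |r^T W'| = {residual:.2e}")
```

Because the edit is a projection, applying it a second time with the same direction leaves the matrix unchanged.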

Performance Across Models

Model	HarmBench ASR	Full CoT ASR	MMLU	KL Divergence	Avg Delta (excl GSM8K)
Qwen3.6-27B	92.5%	100%	82.8%	0.0037	1.3pp
GLM-4.7-Flash	100%	100%	77.9%	0.0076	0.0pp
Qwen3.5-27B	99.8%	100%	83.6%	0.063	0.8pp
Qwen3.5-9B	100%	100%	82.0%	0.0825	1.6pp
Qwen3.5-4B	100%	100%	72.7%	0.0355	0.5pp
Qwen3.5-2B	99.5%	100%	68.7%	0.0226	1.5pp
Qwen2.5-7B	100.0%	100%	71.59%	0.211	-0.9pp
Qwen3-4B	99.5%	100%	68.9%	0.181	2.5pp
Gemma4-E2B (coder3101)	95.8%	100%	28.70%	0.1673	-0.4pp
Gemma4-E2B (llmfan46)	85.0%	100%	28.36%	0.0677	-0.2pp
Gemma4-E2B (pew)	92.0%	100%	28.86%	0.1526	-0.1pp
Gemma4-E2B (kasper)	91.5%	100%	28.53%	0.1933	-0.5pp

Key Characteristics

Surgical weight edits. Heretic modifies 10–15% of language model tensors, targeting expert down_proj weights. This is the narrowest scope of any technique I tested. That explains the consistently low KL divergence.
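A figure like "10–15% of tensors" can be obtained by diffing two checkpoints tensor by tensor. The sketch below assumes both checkpoints are already loaded as dictionaries of NumPy arrays with identical keys; the tensor names are hypothetical and the toy model is far smaller than any real one.

```python
import numpy as np

def modified_tensors(base: dict, edited: dict) -> list[str]:
    """Names of tensors whose values differ between two checkpoints."""
    return [name for name in base if not np.array_equal(base[name], edited[name])]

# Toy checkpoints; the tensor names below are hypothetical.
rng = np.random.default_rng(0)
base = {
    "layers.0.attn.o_proj": rng.normal(size=(4, 4)),
    "layers.0.mlp.up_proj": rng.normal(size=(8, 4)),
    "layers.0.mlp.down_proj": rng.normal(size=(4, 8)),
    "layers.1.mlp.down_proj": rng.normal(size=(4, 8)),
}
edited = {name: w.copy() for name, w in base.items()}
edited["layers.0.mlp.down_proj"] += 0.01   # simulate a single edited tensor

changed = modified_tensors(base, edited)
share = len(changed) / len(base)
print(changed, f"{share:.0%} of tensors modified")
```

Exact equality is used deliberately: an untouched tensor should be bit-identical, so any difference at all marks it as modified.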

Lowest KL divergence in 5 of 7+ models. The output distribution on benign prompts stays closest to the original model. Heretic’s KL ranges from 0.0037 on Qwen3.6-27B to 0.211 on Qwen2.5-7B, with most models below 0.1. On Gemma4-E2B, four independently built variants landed between 0.068 and 0.193, confirming the same pattern.
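For readers unfamiliar with the metric, here is a minimal sketch of a mean KL divergence between next-token distributions on a set of prompts. It assumes the direction KL(base || edited) and one logit vector per prompt; the exact token positions and prompt set behind the numbers in the table are not specified here, so treat this as an illustration of the quantity rather than the measurement pipeline itself.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kl(base_logits: np.ndarray, edited_logits: np.ndarray) -> float:
    """Mean over prompts of KL(base || edited) between next-token distributions."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(edited_logits)
    kl_per_prompt = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_prompt.mean())

rng = np.random.default_rng(0)
base_logits = rng.normal(size=(4, 32))                          # 4 prompts, toy vocabulary of 32
edited_logits = base_logits + 0.1 * rng.normal(size=(4, 32))    # small simulated perturbation
print(f"mean KL = {mean_kl(base_logits, edited_logits):.4f}")
```

A value of zero means the two models assign identical probabilities; the metric is always non-negative and unaffected by adding a constant to all logits.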

Consistent capability preservation. MMLU retention stays within 0.8–2.5pp of base across all models. TruthfulQA is the consistent weak spot, dropping 5–10pp.

Near-complete safety removal. Reported ASR ranges from 85.0% to 100%, with every non-Gemma model at 92.5% or above. Full CoT ASR reaches 100% on every model I tested. That means zero genuine refusals remain once thinking budget exhaustion is accounted for.
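One way to read the gap between the two ASR columns is sketched below. It assumes each response is labelled as complied, refused, or budget-exhausted, and that Full CoT ASR counts everything except explicit refusals; both the labels and that definition are assumptions made for illustration, and the counts are made up.

```python
from collections import Counter

def asr_breakdown(outcomes: list[str]) -> dict[str, float]:
    """Two attack-success rates from per-prompt outcome labels (hypothetical labels)."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {
        # Share of prompts that produced a compliant answer within the budget.
        "reported_asr": counts["complied"] / total,
        # Share of prompts that did not end in an explicit refusal.
        "full_cot_asr": 1 - counts["refused"] / total,
    }

# Made-up example: 18 compliant answers, 2 runs that exhausted the thinking budget.
outcomes = ["complied"] * 18 + ["budget_exhausted"] * 2
result = asr_breakdown(outcomes)
print(result)
```

Under this reading, a model can score below 100% on reported ASR without a single genuine refusal, because truncated reasoning is counted as a miss rather than as a refusal.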

Non-deterministic. Different Heretic runs produce different results. On Gemma4-E2B, four independent Heretic builds by different users produced four different results, with KL ranging from 0.068 to 0.193 and ASR from 85% to 95.8%. The Qwen3.6-27B analysis also included the first comparison of the Magnitude-Preserving Orthogonal Ablation method, which I’ll call MPOA. MPOA scales well to the Gemma4 shared-KV architecture.

Gemma4-E2B KL calibration. Four Heretic-built variants on Gemma4-E2B allowed a direct comparison of my KL measurements against the values claimed on each model card. Three of the four landed within 6% of the card values, validating my measurement pipeline. The kasper build was the outlier at +17.2%, attributed to a non-standard configuration on a 10 GB RTX 3080.
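The percentages above are signed relative deviations from the card value. As a small worked example, the card figure for the kasper build is back-solved here from the stated +17.2% and the measured 0.1933 in the table; it is an implied value for illustration, not one read from the model card.

```python
def relative_deviation_pct(measured: float, reference: float) -> float:
    """Signed deviation of `measured` from `reference`, in percent."""
    return (measured - reference) / reference * 100.0

measured_kl = 0.1933                    # kasper build, from the table
implied_card_kl = measured_kl / 1.172   # back-solved from +17.2%; not read from the model card
deviation = relative_deviation_pct(measured_kl, implied_card_kl)
print(f"{deviation:+.1f}% vs card, within 6%: {abs(deviation) <= 6.0}")
```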

Weight Modification Profile

Heretic’s modification pattern is distinctive:

Targets 3 weight types: expert down_proj outputs
Relative edit magnitude: 1–3% per modified tensor
Edit profile: uniform across layers, no ramp or peak
Cross-technique alignment: nearly orthogonal to all other techniques with cosine similarity below 0.07
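The two quantitative items in this list can be computed from weight deltas. The sketch below assumes relative edit magnitude means the Frobenius norm of the delta divided by the Frobenius norm of the base tensor, and that cross-technique alignment is the cosine similarity between two flattened deltas of the same tensor; both definitions are assumptions, and the matrices are random stand-ins.

```python
import numpy as np

def relative_edit_magnitude(base: np.ndarray, edited: np.ndarray) -> float:
    """||edited - base||_F / ||base||_F for one tensor."""
    return float(np.linalg.norm(edited - base) / np.linalg.norm(base))

def delta_cosine(base: np.ndarray, edited_a: np.ndarray, edited_b: np.ndarray) -> float:
    """Cosine similarity between two techniques' edits to the same tensor."""
    delta_a = (edited_a - base).ravel()
    delta_b = (edited_b - base).ravel()
    return float(delta_a @ delta_b / (np.linalg.norm(delta_a) * np.linalg.norm(delta_b)))

rng = np.random.default_rng(0)
base = rng.normal(size=(64, 64))                        # toy base tensor
edited_a = base + 0.02 * rng.normal(size=base.shape)    # simulated edit from technique A
edited_b = base + 0.02 * rng.normal(size=base.shape)    # simulated, unrelated edit from technique B

print(f"relative edit: {relative_edit_magnitude(base, edited_a):.3f}")
print(f"delta cosine:  {delta_cosine(base, edited_a, edited_b):+.3f}")
```

Two unrelated random edits in a high-dimensional weight space come out nearly orthogonal, so a low cosine on its own shows only that two techniques do not share an edit direction.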

The “refusal direction” in weight space is not a single vector but a manifold. Heretic finds one pathway through it, and that pathway happens to be one that minimally disrupts the model’s functional behaviour.
