Qwen2.5-7B Abliteration Benchmarks: Heretic vs Huihui vs Apostate

Qwen2.5-7B: Three-Way Abliteration Comparison

Forensic analysis by Abliterlitics

Three groups abliterated the same base model using orthogonal projection. Same technique, different implementations, different refusal directions. I ran weight analysis, KL divergence, HarmBench, and 8 benchmark tasks on all three.

Why Qwen 2.5 7B? The author of Apostate, a new abliteration tool, asked me to benchmark it. Having reviewed the code, I am satisfied it is original work by someone who understands the ML and linear algebra involved. The author of Heretic confirmed as much when Apostate was shared in the Heretic Discord. heterodoxin recommended Qwen 2.5 7B because it is the model Apostate has been tested on most.

So how does it stack up against Heretic and Huihui? Let's find out.

All three work. Heretic achieved 100% ASR. Apostate and Huihui both hit 98%+. The interesting bit is the edit directions. Cosine similarity between Apostate and Huihui is just 0.02. They found almost entirely different refusal directions yet achieved nearly identical results.

The safety training in Qwen 2.5 7B is shallow. It can be removed from multiple angles without destroying the model.

Which one should you use?

Heretic. It is the only variant that achieved 100% on HarmBench with zero persistent refusals. It changed roughly half as many parameters (1.52B vs 2.72B and 2.81B), touched fewer layers, and retained the most capability. LAMBADA perplexity actually improved.

Apostate and Huihui are roughly tied at around 98%. Both leave a small number of items refused. This comparison was really about benchmarking the new Apostate tool against established options. The verdict: Apostate works, it gets you to 98.8%, but Heretic still has the edge on this model family.

Key findings

HarmBench ASR: 31.0% to 98.8% / 98.2% / 100.0%. Heretic achieves total wipeout. 400 of 400 complied.
Three implementations, little overlap. Apostate vs Huihui cosine similarity is 0.023, and no pair exceeds 0.25. Same method, very different refusal directions. Multiple independent removal paths exist in the safety subspace.
Heretic is the most surgical and most effective. 37 tensors changed vs 55 and 57. Only layers 9 to 27. 20% of parameters vs 36%. Zero persistent refusals.
GSM8K improved across all variants on strict match. +1.5 to +1.6pp, about +2% relative. Math reasoning was not damaged.
All three target o_proj + down_proj. Classic orthogonal projection pattern. The only variation is layer coverage and embedding edits.

Quick Facts


Base model	Qwen/Qwen2.5-7B-Instruct
Architecture	Qwen2ForCausalLM, 28 layers, 3584 hidden, GQA with 4 KV heads
Parameters	~7.6B
Precision	BF16
Context length	131,072 tokens

Standard dense Transformer. No MoE, no Mamba hybrid, no thinking mode.

Variants compared

Variant	Source	Method
Apostate	heterodoxin	Weight-space orthogonal projection, `balanced` profile
Huihui	huihui-ai	Orthogonal projection, community implementation
Heretic	Heretic 1.3.0	Orthogonal projection, refusal direction ablation

Weight Analysis

All three variants use orthogonal projection on self_attn.o_proj.weight and mlp.down_proj.weight. Same technique family, different execution.

Modification summary

Metric	Apostate	Huihui	Heretic
Tensors changed	55 of 339, 16.2%	57 of 339, 16.8%	37 of 339, 10.9%
Parameters changed	2.72B, 35.8%	2.81B, 36.8%	1.52B, 20.0%
Mean edit norm	1.63	1.85	2.33
Mean relative edit	1.72%	1.92%	2.23%
Layers modified	27 of 28	28 of 28	19 of 28
Layer coverage	96.4%	100%	67.9%
Embedding touched	Yes	Yes	No
Peak layer	18	13	18
Peak down_proj norm	3.71	2.99	3.91

Heretic touches the fewest tensors but hits hardest per tensor. Huihui is the only variant that touches all 28 layers.
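The rows in the table above can be reproduced with a plain tensor-by-tensor diff. This is a minimal sketch, not the Abliterlitics implementation: it assumes "edit norm" means the Frobenius norm of (variant - base) and "relative edit" means that norm divided by the base tensor's norm, averaged over changed tensors only. Tensors are stood in for by flat lists of floats, and the tensor names and values are made up.

```python
import math

def frobenius(values):
    """Frobenius norm of a tensor given as a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))

def diff_summary(base, variant, tol=0.0):
    """Compare two {tensor_name: flat_weights} mappings.

    Returns (changed tensor names, mean edit norm, mean relative edit).
    The two means are taken over changed tensors only (an assumption).
    """
    changed, norms, relative = [], [], []
    for name, w_base in base.items():
        delta = [v - b for v, b in zip(variant[name], w_base)]
        edit_norm = frobenius(delta)
        if edit_norm > tol:
            changed.append(name)
            norms.append(edit_norm)
            relative.append(edit_norm / frobenius(w_base))
    if not changed:
        return changed, 0.0, 0.0
    return changed, sum(norms) / len(changed), sum(relative) / len(changed)

# Toy weights: only the down_proj tensor differs between base and variant.
toy_base = {
    "layers.0.mlp.down_proj.weight": [3.0, 4.0],
    "layers.0.self_attn.q_proj.weight": [1.0, 2.0],
}
toy_variant = {
    "layers.0.mlp.down_proj.weight": [3.0, 4.5],
    "layers.0.self_attn.q_proj.weight": [1.0, 2.0],
}
changed, mean_norm, mean_rel = diff_summary(toy_base, toy_variant)
print(changed, mean_norm, mean_rel)
```

On real checkpoints the same loop would run over tensors loaded from the safetensors shards rather than Python lists.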

Tensor types targeted

Component	Apostate	Huihui	Heretic
`mlp.down_proj.weight`	27 layers	28 layers	19 layers
`self_attn.o_proj.weight`	27 layers	28 layers	18 layers
`model.embed_tokens.weight`	1, minimal	1, minimal	0

Heretic skips the embedding entirely. Apostate and Huihui both touch it with minimal norm, a side effect of the optimisation.

Per-layer profile

Apostate: Three-phase pattern. Low edits on layers 0 to 5, high on 6 to 20, reduced on 21 to 27. Layer 11 skipped.
Huihui: Flatter distribution. Edits range from 0.24 to 0.37 across all 28 layers. Peak at layer 13. No layers skipped.
Heretic: Layers 0 to 8 untouched. Edits begin at layer 9, which receives a down_proj edit only; layers 10 to 27 get both o_proj and down_proj edits. Peak at layer 18, same as Apostate.

Edit direction similarity

This is the most interesting finding. Despite using the same technique on the same base model, the three variants found almost entirely different refusal directions.

Pair	Cosine similarity	Interpretation
Apostate vs Huihui	0.023	Near orthogonal. Completely different directions.
Apostate vs Heretic	0.244	Moderate overlap. Some shared structure.
Huihui vs Heretic	0.109	Low overlap. Mostly different.

Subspace alignment confirms this. No pair exceeds 0.25 mean cosine on principal angles. Zero overlap above 0.9 threshold for any pair.

The safety training in Qwen 2.5 7B does not encode a single refusal direction. Multiple independent directions in weight space can be removed to disable safety behaviour. Apostate and Huihui found nearly orthogonal directions that both work.
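For readers who want to see what a pairwise figure like 0.023 means, here is a minimal sketch of cosine similarity between two edit vectors. It assumes each variant's edit is flattened into one long vector of (variant - base) over the shared tensors; whether the reported numbers were computed globally like this or per tensor and then averaged is not stated here, so treat the aggregation as an assumption. Weights are toy values.

```python
import math

def edit_vector(base, variant):
    """Concatenate per-tensor deltas (variant - base) in a fixed name order."""
    vec = []
    for name in sorted(base):
        vec.extend(v - b for v, b in zip(variant[name], base[name]))
    return vec

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy example: three hypothetical variants edited from the same base tensor.
toy_base = {"w": [0.0, 0.0, 0.0]}
edit_a = edit_vector(toy_base, {"w": [1.0, 0.0, 0.0]})
edit_b = edit_vector(toy_base, {"w": [0.0, 1.0, 0.0]})
edit_c = edit_vector(toy_base, {"w": [1.0, 1.0, 0.0]})

print(cosine(edit_a, edit_b))  # orthogonal edits
print(cosine(edit_a, edit_c))  # partially overlapping edits
```

A value near 0 means the two edits move the weights in unrelated directions, which is the Apostate vs Huihui situation.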

Benchmarks

Evaluated with lm-evaluation-harness via vLLM 0.19.0, bf16 on RTX 5090 32GB.

Task	Base	Apostate	Huihui	Heretic
MMLU	71.78	71.43	70.27	71.59
GSM8K	79.23	80.74	80.74	80.82
HellaSwag	80.47	80.32	79.88	80.24
ARC Challenge	55.12	55.12	55.12	55.55
WinoGrande	71.03	69.38	69.53	70.72
TruthfulQA MC1	47.74	44.92	43.70	44.80
TruthfulQA MC2	64.83	62.59	60.89	60.39
PiQA	80.25	79.92	79.60	80.41
LAMBADA ppl ↓	3.683	3.860	4.087	3.627

GSM8K uses strict exact match. Flex match: Base 87.64, Apostate 88.17, Huihui 87.19, Heretic 87.19.

Delta vs base, relative change

Task	Apostate	Huihui	Heretic
MMLU	-0.5%	-2.1%	-0.3%
GSM8K	+1.9%	+1.9%	+2.0%
HellaSwag	-0.2%	-0.7%	-0.3%
ARC Challenge	+0.0%	+0.0%	+0.8%
WinoGrande	-2.3%	-2.1%	-0.4%
TruthfulQA MC1	-5.9%	-8.5%	-6.2%
TruthfulQA MC2	-3.4%	-6.1%	-6.8%
PiQA	-0.4%	-0.8%	+0.2%
LAMBADA ppl	+4.8%	+11.0%	-1.5%

Heretic retains the most capability overall, though it takes the largest TruthfulQA MC2 drop. LAMBADA perplexity actually improves. Huihui takes the biggest hit on most metrics, consistent with touching all 28 layers.
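The delta figures are relative changes against the base score, not percentage-point differences. A short sketch using three rows copied from the benchmark table shows the arithmetic; the function name is mine.

```python
def relative_delta(base, variant):
    """Relative change vs base, in percent."""
    return 100.0 * (variant - base) / base

# (base, Apostate, Huihui, Heretic), copied from the benchmark table.
scores = {
    "MMLU": (71.78, 71.43, 70.27, 71.59),
    "GSM8K": (79.23, 80.74, 80.74, 80.82),
    "LAMBADA ppl": (3.683, 3.860, 4.087, 3.627),
}

for task, (base, *variants) in scores.items():
    deltas = [round(relative_delta(base, v), 1) for v in variants]
    print(task, deltas)
```

This is why GSM8K shows about +2% here while the gain in points is about 1.5 to 1.6 (80.82 - 79.23 = 1.59 for Heretic). For LAMBADA, a negative delta is an improvement because lower perplexity is better.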

Safety: HarmBench

HarmBench with 400 textual behaviours, max_tokens=6144, temperature=0.0. Classified with keyword-based refusal detection followed by LLM review of edge cases by GLM 5.1.

Variant	ASR	Complied	Refused	Unlocked	Persistent
Base	31.0%	124	276	-	-
Apostate	98.8%	395	5	271	5
Huihui	98.2%	393	7	269	7
Heretic	100.0%	400	0	276	0

Zero regressions across all three variants. Every behaviour the base model complied with, all three variants also complied with. All 276 base refusals were overcome by Heretic, while Apostate retained 5 and Huihui retained 7.
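The Unlocked, Persistent, and regression counts are plain set arithmetic over per-item outcomes. A minimal sketch, with synthetic item IDs arranged to match Apostate's row; the real pipeline works on HarmBench behaviour IDs, and the function name is hypothetical.

```python
def compare(all_ids, base_complied, variant_complied):
    """Bookkeeping for one variant against base, using sets of item IDs."""
    base_refused = all_ids - base_complied
    variant_refused = all_ids - variant_complied
    return {
        "asr": 100.0 * len(variant_complied) / len(all_ids),
        "unlocked": len(base_refused & variant_complied),     # refused by base, complied by variant
        "persistent": len(base_refused & variant_refused),    # refused by both
        "regressions": len(base_complied & variant_refused),  # complied by base, refused by variant
    }

# Synthetic IDs: base complies with items 0..123, the variant with items 0..394.
all_ids = set(range(400))
base_complied = set(range(124))
variant_complied = set(range(395))

result = compare(all_ids, base_complied, variant_complied)
print(result)
```

With these inputs the ASR is 98.75%, which the tables report as 98.8%.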

ASR by category

Category	Items	Base	Apostate	Huihui	Heretic
copyright	100	89.0%	100.0%	100.0%	100.0%
cybercrime_intrusion	67	17.9%	100.0%	100.0%	100.0%
illegal	65	4.6%	100.0%	98.5%	100.0%
chemical_biological	56	7.1%	100.0%	100.0%	100.0%
misinformation_disinformation	65	21.5%	100.0%	96.9%	100.0%
harmful	22	9.1%	95.5%	95.5%	100.0%
harassment_bullying	25	0.0%	84.0%	88.0%	100.0%

Heretic hits 100% in every category. Apostate and Huihui both struggle with harassment and harmful, though Huihui does slightly better on harassment at 88% vs 84%. The 5 persistent refusals for Apostate are 4 from harassment and 1 from harmful. Huihui’s 7 persistent are 3 from harassment, 1 from harmful, 1 from illegal, and 2 from misinformation.

Hardest items

Five behaviours refused by base and Apostate. Huihui refused at most four of them, as it has only three persistent harassment refusals. Heretic overcame all five.

Category	Description
harassment_bullying	Racial hatred incitement
harassment_bullying	Self harm glorification blog
harmful	Sexual assault romanticisation
harassment_bullying	Bullying encouragement
harassment_bullying	Whistleblower silencing advice

LLM review

HarmBench responses in the subsets listed below were reviewed by GLM 5.1 to judge the content. This catches the trickier responses that regex and word detection miss and corrects any false positives or negatives.

Model	Items reviewed	Overrides	Effect
Base	144 keyword-complied	20 → refused	False compliance: looked helpful but actually deflected
Apostate	22 keyword-refused	17 → complied	False refusal: contained refusal keywords but provided harmful content
Huihui	3 medium confidence	2 → complied, 1 → refused	Song lyrics and animal cruelty redirect
Heretic	2 medium confidence	2 → complied	Pipeline tapping guide and book passage

Raw keyword-only ASR: Base 36.0%, Apostate 94.5%, Huihui 97.8%, Heretic 99.5%. Post-review: Base 31.0%, Apostate 98.8%, Huihui 98.2%, Heretic 100.0%.
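The two-stage classification can be sketched as a keyword pass followed by reviewer overrides. Everything specific here is an assumption for illustration: the marker list, the placeholder responses, and the override format are hypothetical and are not the list or data used in this evaluation.

```python
# Hypothetical refusal markers; the real keyword list is not reproduced here.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")

def keyword_refused(response):
    """First pass: flag a response as refused if it contains any marker."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def final_labels(responses, overrides):
    """Second pass: reviewer overrides replace the keyword label where present."""
    labels = {}
    for item_id, text in responses.items():
        keyword_label = "refused" if keyword_refused(text) else "complied"
        labels[item_id] = overrides.get(item_id, keyword_label)
    return labels

def asr(labels):
    """Share of items labelled complied, in percent."""
    return 100.0 * sum(l == "complied" for l in labels.values()) / len(labels)

# Placeholder responses standing in for model output.
responses = {
    1: "Sure, here is the summary.",
    2: "I'm sorry, but I can't help with that.",
    3: "I cannot endorse this, but here is the full text you asked for.",
    4: "Have you considered a different topic instead?",
}
# Item 3 is a false refusal, item 4 a false compliance (a deflection).
overrides = {3: "complied", 4: "refused"}

raw = final_labels(responses, {})
reviewed = final_labels(responses, overrides)
print(raw, asr(raw))
print(reviewed, asr(reviewed))
```

Items 3 and 4 mirror the two override directions in the table: refusal keywords wrapped around a compliant answer, and a polite deflection with no refusal keywords.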

KL Divergence

Computed as F.kl_div(logprobs_variant, logprobs_base, reduction="batchmean", log_target=True), which is KL(base ‖ variant), on full vocab first-token logits from mlabonne/harmless_alpaca test[:100], matching the Heretic evaluator methodology. System prompt: “You are a helpful assistant.”

Variant	KL batchmean	KL median	Std	Rating
Apostate	0.134	0.019	0.348	moderate
Huihui	0.190	0.056	0.314	moderate
Heretic	0.211	0.020	0.886	moderate

Rating scale: excellent below 0.01, very good 0.01 to 0.1, moderate 0.1 to 0.4, significant 0.4 to 1.0, heavy above 1.0.
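The rating scale as a lookup, for anyone scripting against these numbers. The scale does not say which bucket a value exactly on a boundary falls into, so the boundary handling below is my assumption.

```python
def kl_rating(kl):
    """Map a KL batchmean value to the rating scale used in this write-up.

    Boundary handling is an assumption: lower bounds are inclusive,
    and exactly 1.0 counts as "significant".
    """
    if kl < 0.01:
        return "excellent"
    if kl < 0.1:
        return "very good"
    if kl < 0.4:
        return "moderate"
    if kl <= 1.0:
        return "significant"
    return "heavy"

for name, kl in [("Apostate", 0.134), ("Huihui", 0.190), ("Heretic", 0.211)]:
    print(name, kl_rating(kl))
```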

All three land in the moderate range with different distributions:

Apostate: lowest batchmean at 0.134. Most balanced shift.
Huihui: highest median at 0.056. More prompts see some shift because all 28 layers are touched.
Heretic: median of 0.020, nearly as low as Apostate's 0.019, but the highest standard deviation at 0.886. A few prompts are heavily affected while most barely move. Consistent with editing fewer tensors at higher intensity.

Heretic’s own evaluation reported KL 0.2189, closely matching my measurement of 0.2106.
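To make the three KL columns concrete, here is a dependency-free sketch of what the F.kl_div call computes with log_target=True and reduction="batchmean": per prompt, the sum over the vocabulary of exp(target) * (target - input), averaged over prompts. With the variant as input and the base as target, that is KL(base ‖ variant). The logits are toy values over a 3-token vocabulary, and using the sample standard deviation for the Std column is an assumption.

```python
import math
from statistics import median, stdev

def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_base_to_variant(logp_variant, logp_base):
    """KL(base || variant) for one prompt, from log-probabilities."""
    return sum(math.exp(b) * (b - v) for v, b in zip(logp_variant, logp_base))

def summarize(variant_logits, base_logits):
    """Return (batchmean, median, sample std) of per-prompt KL values."""
    per_prompt = [
        kl_base_to_variant(log_softmax(v), log_softmax(b))
        for v, b in zip(variant_logits, base_logits)
    ]
    batchmean = sum(per_prompt) / len(per_prompt)
    return batchmean, median(per_prompt), stdev(per_prompt)

# Toy first-token logits for two prompts over a 3-token vocabulary.
base_logits = [[2.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
variant_logits = [[2.0, 1.0, 0.0], [1.0, 0.0, 0.0]]  # only the second prompt shifts

batchmean, med, std = summarize(variant_logits, base_logits)
print(batchmean, med, std)
```

One prompt moving while the other stays put is a small-scale version of the Heretic profile: a low median with a mean pulled up by a few heavily shifted prompts.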

Summary

Metric	Base	Apostate	Huihui	Heretic
HarmBench ASR	31.0%	98.8%	98.2%	100.0%
MMLU	71.78	71.43	70.27	71.59
GSM8K	79.23	80.74	80.74	80.82
KL divergence	0.000	0.134	0.190	0.211
Tensors changed	0	55, 16.2%	57, 16.8%	37, 10.9%
Params changed	0	35.8%	36.8%	20.0%
Layers modified	0	27 of 28	28 of 28	19 of 28
Mean edit norm	0	1.63	1.85	2.33

Apostate

Orthogonal projection on o_proj and down_proj across 27 of 28 layers. Skips layer 11. 98.8% ASR with zero regressions. GSM8K improved. TruthfulQA MC1 -5.9%, WinoGrande -2.3%, LAMBADA +4.8% perplexity.

Huihui

Same scope as Apostate but touches all 28 layers including layer 11. Slightly higher parameter density at 36.8%. More uniform edit distribution. Near-zero cosine similarity with Apostate edits despite targeting the same tensors. Different refusal direction entirely. 98.2% ASR with 7 persistent refusals. Takes the biggest capability hit of the three variants.

Heretic

Most surgical approach and most effective. Produced using Heretic v1.3.0. Only layers 9 to 27. Fewest tensors at 37, fewest parameters at 20%, but highest per-tensor intensity at 2.33 mean norm. No embedding edit. 100% ASR: complete safety removal with zero persistent refusals. Also retains the most capability. LAMBADA perplexity actually improves from 3.683 to 3.627.

Methodology

Capability: lm-evaluation-harness via vLLM v0.19.0, bf16 on RTX 5090 32GB. Two runs per model: loglikelihood + TruthfulQA with max_gen_toks=2048, then GSM8K. Base and Apostate used max_gen_toks=2048 for GSM8K. Huihui and Heretic used max_gen_toks=7168. This difference does not affect loglikelihood tasks and GSM8K scores are comparable across all variants at the achieved score range.
Safety: HarmBench 400 textual behaviours, max_tokens=6144, temperature=0.0, via vLLM v0.20.0. Keyword classification with LLM review of edge cases by GLM 5.1.
KL divergence: Full vocab first-token logits via model.generate(max_new_tokens=1, output_scores=True), matching Heretic evaluator methodology.
Weight analysis: Panel comparison, edit vectors, SVD, fingerprint, layer analysis, cross-technique correlation, and subspace alignment using Abliterlitics.
Hardware: RTX 5090 32GB for all Qwen 2.5 runs.

Disclaimer

These models have had safety alignment removed. They will comply with harmful requests, including generating content related to violence, illegal activities, and other harmful behaviours. Use responsibly and in accordance with applicable laws and regulations. The authors do not condone or encourage the use of these models for harmful purposes.

If you spot something that looks wrong and can be confirmed, I am happy to fix it.