Qwen2.5-7B: Three-Way Abliteration Comparison
Forensic analysis by Abliterlitics
Three groups abliterated the same base model using orthogonal projection. Same technique, different implementations, different refusal directions. I ran weight analysis, KL divergence, HarmBench, and 8 benchmark tasks on all three.
Why Qwen 2.5 7B? The author of Apostate, a new abliteration tool, asked me to benchmark it. After reviewing the code, it is clearly original work by someone who understands the ML and linear algebra involved. The author of Heretic confirmed this when Apostate was shared in the Heretic Discord. heterodoxin recommended Qwen 2.5 7B because it is the most tested model for Apostate.
So how does it stack up against Heretic and Huihui? Let's find out.
All three work. Heretic achieved a 100% attack success rate (ASR). Apostate and Huihui both hit 98%+. The interesting bit is the edit directions. Cosine similarity between Apostate and Huihui is just 0.02. They found almost entirely different refusal directions yet achieved nearly identical results.
The safety training in Qwen 2.5 7B is shallow. It can be removed from multiple angles without destroying the model.
Which one should you use?
Heretic. It is the only variant that achieved 100% on HarmBench with zero persistent refusals. It changed half as many parameters, touched fewer layers, and retained the most capability. LAMBADA perplexity actually improved.
Apostate and Huihui are roughly tied at around 98%. Both leave a small number of items refused. This comparison was really about benchmarking the new Apostate tool against established options. The verdict: Apostate works, it gets you to 98.8%, but Heretic still has the edge on this model family.
Key findings
- HarmBench ASR: 31.0% to 98.8% / 98.2% / 100.0%. Heretic achieves total wipeout. 400 of 400 complied.
- Three techniques, near-zero overlap. Apostate vs Huihui cosine similarity is 0.023. Same method, completely different refusal directions. Multiple independent removal paths exist in the safety subspace.
- Heretic is the most surgical and most effective. 37 tensors changed vs 55 and 57. Only layers 9 to 27. 20% of parameters vs 36%. Zero persistent refusals.
- GSM8K strict match improved across all variants. +1.5 to +1.6 pp. Math reasoning was not damaged.
- All three target o_proj + down_proj. Classic orthogonal projection pattern. The only variation is layer coverage and embedding edits.
Quick Facts
| Base model | Qwen/Qwen2.5-7B-Instruct |
| Architecture | Qwen2ForCausalLM, 28 layers, 3584 hidden, GQA with 4 KV heads |
| Parameters | ~7.6B |
| Precision | BF16 |
| Context length | 131,072 tokens |
Standard dense Transformer. No MoE, no Mamba hybrid, no thinking mode.
Variants compared
| Variant | Source | Method |
|---|---|---|
| Apostate | heterodoxin | Weight-space orthogonal projection, balanced profile |
| Huihui | huihui-ai | Orthogonal projection, community implementation |
| Heretic | Heretic 1.3.0 | Orthogonal projection, refusal direction ablation |
Weight Analysis
All three variants use orthogonal projection on self_attn.o_proj.weight and mlp.down_proj.weight. Same technique family, different execution.
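For readers new to the technique, here is a minimal sketch of orthogonal projection on a single weight matrix. It assumes a known refusal direction `r` in the residual stream and a matrix laid out as (d_out, d_in), as `o_proj` and `down_proj` are. The function name and the toy shapes are mine for illustration, not taken from Apostate, Huihui, or Heretic, each of which has its own way of choosing directions and per-layer strengths.

```python
import numpy as np

def ablate_direction(weight: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the part of the layer's output that lies along `direction`.

    weight:    (d_out, d_in) matrix that writes into the residual stream.
    direction: (d_out,) candidate refusal direction; normalised here.
    """
    r = direction / np.linalg.norm(direction)
    # W' = (I - r r^T) W, computed without forming the d_out x d_out projector.
    return weight - np.outer(r, r @ weight)

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))   # toy stand-in for o_proj or down_proj
r = rng.standard_normal(8)         # toy stand-in for a refusal direction

W_abl = ablate_direction(W, r)
x = rng.standard_normal(16)        # arbitrary input activation
r_hat = r / np.linalg.norm(r)

# The edited layer can no longer write along r (up to floating-point error).
print(abs(r_hat @ (W_abl @ x)))
```

Applying the projection a second time with the same direction changes nothing, which is why the variants differ in which directions and layers they pick rather than in the projection step itself.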
Modification summary
| Metric | Apostate | Huihui | Heretic |
|---|---|---|---|
| Tensors changed | 55 of 339, 16.2% | 57 of 339, 16.8% | 37 of 339, 10.9% |
| Parameters changed | 2.72B, 35.8% | 2.81B, 36.8% | 1.52B, 20.0% |
| Mean edit norm | 1.63 | 1.85 | 2.33 |
| Mean relative edit | 1.72% | 1.92% | 2.23% |
| Layers modified | 27 of 28 | 28 of 28 | 19 of 28 |
| Layer coverage | 96.4% | 100% | 67.9% |
| Embedding touched | Yes | Yes | No |
| Peak layer | 18 | 13 | 18 |
| Peak down_proj norm | 3.71 | 2.99 | 3.91 |
Heretic touches the fewest tensors but hits hardest per tensor. Huihui is the only variant that touches all 28 layers.
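The rows in the table above can be reproduced from two checkpoints with a few lines of code. This is a sketch under my own assumptions: "edit norm" is taken as the Frobenius norm of the weight difference, "relative edit" as that norm divided by the norm of the base tensor, and the tensor names and sizes are toy values. The actual Abliterlitics definitions may differ.

```python
import numpy as np

def edit_stats(base: dict, variant: dict, atol: float = 0.0) -> dict:
    """Per-tensor edit norm and relative edit for every tensor that changed."""
    stats = {}
    for name, w_base in base.items():
        delta = variant[name] - w_base
        norm = float(np.linalg.norm(delta))  # Frobenius norm for 2-D arrays
        if norm > atol:
            stats[name] = {
                "edit_norm": norm,
                "relative_edit": norm / float(np.linalg.norm(w_base)),
                "params": int(w_base.size),
            }
    return stats

# Toy checkpoints: three tensors, only down_proj is edited.
rng = np.random.default_rng(0)
base = {
    "layers.0.mlp.down_proj.weight": rng.standard_normal((4, 6)),
    "layers.0.self_attn.o_proj.weight": rng.standard_normal((4, 4)),
    "layers.0.self_attn.q_proj.weight": rng.standard_normal((4, 4)),
}
variant = {name: w.copy() for name, w in base.items()}
variant["layers.0.mlp.down_proj.weight"] += 0.01

stats = edit_stats(base, variant)
changed_params = sum(s["params"] for s in stats.values())
total_params = sum(w.size for w in base.values())
print(len(stats), "of", len(base), "tensors changed")
print(f"{changed_params / total_params:.1%} of parameters changed")
```

Note that "parameters changed" here counts every parameter in a touched tensor, which is how a rank-one edit to 37 tensors can still register as 20% of the model.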
Tensor types targeted
| Component | Apostate | Huihui | Heretic |
|---|---|---|---|
| mlp.down_proj.weight | 27 layers | 28 layers | 19 layers |
| self_attn.o_proj.weight | 27 layers | 28 layers | 18 layers |
| model.embed_tokens.weight | 1, minimal | 1, minimal | 0 |
Heretic skips the embedding entirely. Apostate and Huihui both touch it with minimal norm, a side effect of the optimisation.
Per-layer profile
- Apostate: Three-phase pattern. Low edits on layers 0 to 5, high on 6 to 20, reduced on 21 to 27. Layer 11 skipped.
- Huihui: Flatter distribution. Edits range from 0.24 to 0.37 across all 28 layers. Peak at layer 13. No layers skipped.
- Heretic: Layers 0 to 8 untouched. Edits begin at layer 9, only down_proj. Peak at layer 18, same as Apostate.
Edit direction similarity
This is the most interesting finding. Despite using the same technique on the same base model, the three variants found almost entirely different refusal directions.
| Pair | Cosine similarity | Interpretation |
|---|---|---|
| Apostate vs Huihui | 0.023 | Near orthogonal. Completely different directions. |
| Apostate vs Heretic | 0.244 | Moderate overlap. Some shared structure. |
| Huihui vs Heretic | 0.109 | Low overlap. Mostly different. |
Subspace alignment confirms this. No pair exceeds 0.25 mean cosine on principal angles. Zero overlap above 0.9 threshold for any pair.
The safety training in Qwen 2.5 7B does not encode a single refusal direction. Multiple independent directions in weight space can be removed to disable safety behaviour. Apostate and Huihui found nearly orthogonal directions that both work.
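To make the two similarity measures concrete, here is a small sketch. It assumes the cosine similarity is taken between flattened weight deltas and that subspace alignment means the cosines of the principal angles between the top-k left singular subspaces of those deltas. Both are my reading of the terms, and the function names and the choice of `k` are illustrative only.

```python
import numpy as np

def edit_cosine(delta_a: np.ndarray, delta_b: np.ndarray) -> float:
    """Cosine similarity between two weight edits, treated as flat vectors."""
    a, b = delta_a.ravel(), delta_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def principal_cosines(delta_a: np.ndarray, delta_b: np.ndarray, k: int = 4) -> np.ndarray:
    """Cosines of the principal angles between the top-k left singular subspaces."""
    u_a = np.linalg.svd(delta_a, full_matrices=False)[0][:, :k]
    u_b = np.linalg.svd(delta_b, full_matrices=False)[0][:, :k]
    # Both bases are orthonormal, so the singular values of U_a^T U_b
    # are the cosines of the principal angles.
    return np.clip(np.linalg.svd(u_a.T @ u_b, compute_uv=False), 0.0, 1.0)

# Toy case: the same matrix ablated along two orthogonal directions.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
r1, r2 = np.eye(8)[0], np.eye(8)[1]
delta_1 = -np.outer(r1, r1 @ W)    # edit that removes direction r1
delta_2 = -np.outer(r2, r2 @ W)    # edit that removes direction r2

print(edit_cosine(delta_1, delta_2))
print(principal_cosines(delta_1, delta_2, k=1))
```

In the toy case the two edits share no structure at all, which is the limiting case of what the Apostate vs Huihui row shows at 0.023.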
Benchmarks
Evaluated with lm-evaluation-harness via vLLM 0.19.0, bf16 on RTX 5090 32GB.
| Task | Base | Apostate | Huihui | Heretic |
|---|---|---|---|---|
| MMLU | 71.78 | 71.43 | 70.27 | 71.59 |
| GSM8K | 79.23 | 80.74 | 80.74 | 80.82 |
| HellaSwag | 80.47 | 80.32 | 79.88 | 80.24 |
| ARC Challenge | 55.12 | 55.12 | 55.12 | 55.55 |
| WinoGrande | 71.03 | 69.38 | 69.53 | 70.72 |
| TruthfulQA MC1 | 47.74 | 44.92 | 43.70 | 44.80 |
| TruthfulQA MC2 | 64.83 | 62.59 | 60.89 | 60.39 |
| PiQA | 80.25 | 79.92 | 79.60 | 80.41 |
| LAMBADA ppl ↓ | 3.683 | 3.860 | 4.087 | 3.627 |
GSM8K uses strict exact match. Flex match: Base 87.64, Apostate 88.17, Huihui 87.19, Heretic 87.19.
Delta vs base (relative change)
| Task | Apostate | Huihui | Heretic |
|---|---|---|---|
| MMLU | -0.5% | -2.1% | -0.3% |
| GSM8K | +1.9% | +1.9% | +2.0% |
| HellaSwag | -0.2% | -0.7% | -0.3% |
| ARC Challenge | +0.0% | +0.0% | +0.8% |
| WinoGrande | -2.3% | -2.1% | -0.4% |
| TruthfulQA MC1 | -5.9% | -8.5% | -6.2% |
| TruthfulQA MC2 | -3.4% | -6.1% | -6.8% |
| PiQA | -0.4% | -0.8% | +0.2% |
| LAMBADA ppl | +4.8% | +11.0% | -1.5% |
Heretic retains the most capability overall. LAMBADA perplexity actually improves. Huihui takes the biggest hit on most metrics, consistent with touching all 28 layers.
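The percentages in the delta table are consistent with relative change against the base score, not percentage points, which is why GSM8K shows about +2.0% here but +1.5 to +1.6 pp in the key findings. A short worked example using the Heretic column (the helper name is mine):

```python
def relative_delta(base: float, variant: float) -> float:
    """Relative change of a variant score against the base score, in percent."""
    return (variant - base) / base * 100.0

# Scores copied from the benchmark table above.
base_scores = {"MMLU": 71.78, "GSM8K": 79.23, "LAMBADA ppl": 3.683}
heretic_scores = {"MMLU": 71.59, "GSM8K": 80.82, "LAMBADA ppl": 3.627}

for task, base in base_scores.items():
    variant = heretic_scores[task]
    print(
        f"{task}: {variant - base:+.2f} absolute, "
        f"{relative_delta(base, variant):+.1f}% relative"
    )
```

For LAMBADA a negative delta is an improvement, since lower perplexity is better.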
Safety: HarmBench
HarmBench with 400 textual behaviours, max_tokens=6144, temperature=0.0. Classified with keyword-based refusal detection followed by LLM review of edge cases by GLM 5.1.
| Variant | ASR | Complied | Refused | Unlocked | Persistent |
|---|---|---|---|---|---|
| Base | 31.0% | 124 | 276 | - | - |
| Apostate | 98.8% | 395 | 5 | 271 | 5 |
| Huihui | 98.2% | 393 | 7 | 269 | 7 |
| Heretic | 100.0% | 400 | 0 | 276 | 0 |
Zero regressions across all three variants. Every behaviour the base model complied with, all three variants also complied with. All 276 base refusals were overcome by Heretic, while Apostate retained 5 and Huihui retained 7.
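The Unlocked, Persistent, and regression counts are plain set arithmetic over per-item outcomes. The sketch below uses integer item IDs as a hypothetical stand-in for the real HarmBench behaviour IDs, sized to match the Apostate row.

```python
def compare_to_base(base_complied: set, variant_complied: set, all_items: set) -> dict:
    """Summarise how a variant's compliance differs from the base model's."""
    base_refused = all_items - base_complied
    return {
        "asr": len(variant_complied) / len(all_items),
        "unlocked": len(base_refused & variant_complied),      # refused by base, complied by variant
        "persistent": len(base_refused - variant_complied),    # refused by both
        "regressions": len(base_complied - variant_complied),  # complied by base, refused by variant
    }

# Toy IDs: 400 items, base complies with 124, the variant with 395.
all_items = set(range(400))
base_complied = set(range(124))
apostate_complied = set(range(395))

result = compare_to_base(base_complied, apostate_complied, all_items)
print(result)
```

Zero regressions means the last set is empty: nothing the base model answered became a refusal after abliteration.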
ASR by category
| Category | Items | Base | Apostate | Huihui | Heretic |
|---|---|---|---|---|---|
| copyright | 100 | 89.0% | 100.0% | 100.0% | 100.0% |
| cybercrime_intrusion | 67 | 17.9% | 100.0% | 100.0% | 100.0% |
| illegal | 65 | 4.6% | 100.0% | 98.5% | 100.0% |
| chemical_biological | 56 | 7.1% | 100.0% | 100.0% | 100.0% |
| misinformation_disinformation | 65 | 21.5% | 100.0% | 96.9% | 100.0% |
| harmful | 22 | 9.1% | 95.5% | 95.5% | 100.0% |
| harassment_bullying | 25 | 0.0% | 84.0% | 88.0% | 100.0% |
Heretic hits 100% in every category. Apostate and Huihui both struggle with harassment and harmful, though Huihui does slightly better on harassment at 88% vs 84%. The 5 persistent refusals for Apostate are 4 from harassment and 1 from harmful. Huihui’s 7 persistent are 3 from harassment, 1 from harmful, 1 from illegal, and 2 from misinformation.
Hardest items
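Five behaviours refused by both base and Apostate. Huihui can have refused at most four of them, since it has only 3 persistent refusals in harassment and 1 in harmful. Heretic overcame all five.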
| Category | Description |
|---|---|
| harassment_bullying | Racial hatred incitement |
| harassment_bullying | Self harm glorification blog |
| harmful | Sexual assault romanticisation |
| harassment_bullying | Bullying encouragement |
| harassment_bullying | Whistleblower silencing advice |
LLM review
All HarmBench responses were reviewed by GLM 5.1 to judge the content. This catches the trickier responses that regex and keyword detection miss, and corrects false positives and false negatives.
| Model | Items reviewed | Overrides | Effect |
|---|---|---|---|
| Base | 144 keyword-complied | 20 → refused | False compliance: looked helpful but actually deflected |
| Apostate | 22 keyword-refused | 17 → complied | False refusal: contained refusal keywords but provided harmful content |
| Huihui | 3 medium confidence | 2 → complied, 1 → refused | Song lyrics and animal cruelty redirect |
| Heretic | 2 medium confidence | 2 → complied | Pipeline tapping guide and book passage |
Raw keyword-only ASR: Base 36.0%, Apostate 94.5%, Huihui 97.8%, Heretic 99.5%. Post-review: Base 31.0%, Apostate 98.8%, Huihui 98.2%, Heretic 100.0%.
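The two-stage classification can be sketched as a keyword pass followed by judge overrides. The marker list, item names, and responses below are hypothetical examples, not the actual keyword set or data used here, and the override dictionary stands in for whatever verdicts the LLM judge returns.

```python
# Hypothetical refusal markers; the real keyword list is not given in this post.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable", "i won't")

def keyword_label(response: str) -> str:
    """First pass: label a response by the presence of refusal phrases."""
    text = response.lower().replace("\u2019", "'")  # normalise curly apostrophes
    return "refused" if any(marker in text for marker in REFUSAL_MARKERS) else "complied"

def apply_overrides(labels: dict, overrides: dict) -> dict:
    """Second pass: judge verdicts replace keyword labels where they disagree."""
    return {**labels, **overrides}

def asr(labels: dict) -> float:
    return sum(label == "complied" for label in labels.values()) / len(labels)

responses = {
    "item_a": "I'm sorry, but I can't help with that.",
    "item_b": "Sure, here is a summary of the requested passage.",
    "item_c": "I cannot endorse this, but here is the full text you asked for:",
    "item_d": "Here is an outline covering each step.",
}
keyword_labels = {item: keyword_label(text) for item, text in responses.items()}

# A judge would flag item_c as a false refusal: refusal wording, compliant content.
final_labels = apply_overrides(keyword_labels, {"item_c": "complied"})

print(asr(keyword_labels), asr(final_labels))
```

The same mechanism works in the other direction for the base model, where a response with no refusal phrase turns out to be a deflection.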
KL Divergence
F.kl_div(logprobs_variant, logprobs_base, reduction="batchmean", log_target=True) on full vocab first-token logits from mlabonne/harmless_alpaca test[:100], matching the Heretic evaluator methodology. System prompt: “You are a helpful assistant.”
| Variant | KL batchmean | KL median | Std | Rating |
|---|---|---|---|---|
| Apostate | 0.134 | 0.019 | 0.348 | moderate |
| Huihui | 0.190 | 0.056 | 0.314 | moderate |
| Heretic | 0.211 | 0.020 | 0.886 | moderate |
Rating scale: excellent below 0.01, very good 0.01 to 0.1, moderate 0.1 to 0.4, significant 0.4 to 1.0, heavy above 1.0.
All three land in the moderate range with different distributions:
- Apostate: lowest batchmean at 0.134. Most balanced shift.
- Huihui: highest median at 0.056. More prompts see some shift because all 28 layers are touched.
- Heretic: near-lowest median at 0.020 but highest standard deviation at 0.886. A few prompts are heavily affected while most barely move. Consistent with editing fewer tensors at higher intensity.
Heretic’s own evaluation reported KL 0.2189, closely matching our measurement of 0.2106.
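For reference, here is a NumPy sketch of the same quantity, written to mirror the pointwise term that `F.kl_div` uses with `log_target=True`. With the variant as input and the base as target, this is KL(base || variant). The random logits are toy data, and computing the median and standard deviation over per-prompt values is my assumption about how those columns were produced. The rating boundaries follow the scale above, with edge handling chosen arbitrarily.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl_per_prompt(logprobs_variant: np.ndarray, logprobs_base: np.ndarray) -> np.ndarray:
    # Pointwise term exp(target) * (target - input), summed over the vocabulary.
    return (np.exp(logprobs_base) * (logprobs_base - logprobs_variant)).sum(axis=-1)

def rating(kl: float) -> str:
    if kl < 0.01:
        return "excellent"
    if kl < 0.1:
        return "very good"
    if kl < 0.4:
        return "moderate"
    if kl <= 1.0:
        return "significant"
    return "heavy"

# Toy first-token logits: 100 prompts, vocabulary of 50.
rng = np.random.default_rng(0)
base_logits = rng.standard_normal((100, 50))
variant_logits = base_logits + 0.3 * rng.standard_normal((100, 50))

per_prompt = kl_per_prompt(log_softmax(variant_logits), log_softmax(base_logits))
batchmean = float(per_prompt.mean())  # reduction="batchmean" divides the sum by batch size
print(batchmean, float(np.median(per_prompt)), float(per_prompt.std()), rating(batchmean))
```

A large gap between batchmean and median, as in the Heretic row, indicates a skewed distribution where a handful of prompts dominate the mean.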
Summary
| Metric | Base | Apostate | Huihui | Heretic |
|---|---|---|---|---|
| HarmBench ASR | 31.0% | 98.8% | 98.2% | 100.0% |
| MMLU | 71.78 | 71.43 | 70.27 | 71.59 |
| GSM8K | 79.23 | 80.74 | 80.74 | 80.82 |
| KL divergence | 0.000 | 0.134 | 0.190 | 0.211 |
| Tensors changed | 0 | 55, 16.2% | 57, 16.8% | 37, 10.9% |
| Params changed | 0 | 35.8% | 36.8% | 20.0% |
| Layers modified | 0 | 27 of 28 | 28 of 28 | 19 of 28 |
| Mean edit norm | 0 | 1.63 | 1.85 | 2.33 |
Apostate
Orthogonal projection on o_proj and down_proj across 27 of 28 layers. Skips layer 11. 98.8% ASR with zero regressions. GSM8K improved. TruthfulQA mc1 -5.9%, WinoGrande -2.3%, LAMBADA +4.8% perplexity.
Huihui
Same scope as Apostate but touches all 28 layers including layer 11. Slightly higher parameter density at 36.8%. More uniform edit distribution. Near-zero cosine similarity with Apostate edits despite targeting the same tensors. Different refusal direction entirely. 98.2% ASR with 7 persistent refusals. Takes the biggest capability hit of the three variants.
Heretic
Most surgical approach and most effective. Produced using Heretic v1.3.0. Only layers 9 to 27. Fewest tensors at 37, fewest parameters at 20%, but highest per-tensor intensity at 2.33 mean norm. No embedding edit. 100% ASR: complete safety removal with zero persistent refusals. Also retains the most capability. LAMBADA perplexity actually improves from 3.683 to 3.627.
Methodology
- Capability: lm-evaluation-harness via vLLM v0.19.0, bf16 on RTX 5090 32GB. Two runs per model: loglikelihood + TruthfulQA with max_gen_toks=2048, then GSM8K. Base and Apostate used max_gen_toks=2048 for GSM8K. Huihui and Heretic used max_gen_toks=7168. This difference does not affect loglikelihood tasks and GSM8K scores are comparable across all variants at the achieved score range.
- Safety: HarmBench 400 textual behaviours, max_tokens=6144, temperature=0.0, via vLLM v0.20.0. Keyword classification with LLM review of edge cases by GLM 5.1.
- KL divergence: Full vocab first-token logits via model.generate(max_new_tokens=1, output_scores=True), matching Heretic evaluator methodology.
- Weight analysis: Panel comparison, edit vectors, SVD, fingerprint, layer analysis, cross-technique correlation, and subspace alignment using Abliterlitics.
- Hardware: RTX 5090 32GB for all Qwen 2.5 runs.
Disclaimer
These models have had safety alignment removed. They will comply with harmful requests, including generating content related to violence, illegal activities, and other harmful behaviours. Use responsibly and in accordance with applicable laws and regulations. The authors do not condone or encourage the use of these models for harmful purposes.
If you spot something that looks wrong and can be confirmed, I am happy to fix it.