Abliteration Techniques Compared

How six LLM abliteration techniques work, from Heretic's surgical rank-1 edits to AEON's LEACE concept erasure, Huihui's gentle uniform approach, Apostate's balanced orthogonal projection, Abliterix's Optuna-driven LoRA search, and HauhauCS's derived Reaper method.

What is abliteration?

Abliteration is a post-training technique that removes an LLM’s refusal behaviour by modifying its weights to be orthogonal to the “refusal direction” in activation space. The name combines ablate, meaning surgically remove, with obliterate.

When you ask an unmodified, safety-tuned LLM to do something harmful, it refuses. This refusal is encoded as a direction in the model’s internal representations: a specific pattern of activations that triggers “I can’t help with that” responses. Abliteration identifies this direction and surgically removes it from the model’s weight matrices.

The result is a model that no longer refuses prompts. In theory it retains its original capabilities.

How does abliteration work?

The process has three steps.

Collect activations. Run harmful and harmless prompts through the model and record the hidden state activations at specific token positions.
Extract the refusal direction. Compute the mean difference between harmful and harmless activations. This difference vector points along the “refusal direction”: the one-dimensional subspace of the model’s representation space that drives refusal behaviour.
Orthogonalise the weights. Project the model’s output projection weights in the transformer layers so they no longer produce activations along the refusal direction. This is done by forming the outer product of the unit-normalised refusal direction with itself, multiplying it by each weight matrix, and subtracting the result from that matrix.

The model can no longer produce refusal responses because its weights literally cannot generate activations in that direction.
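The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular tool's implementation: the activations are synthetic stand-ins, the function names refusal_direction and orthogonalise are hypothetical, and the weight matrix is assumed to have shape (d_model, d_in) so that it writes into the residual stream.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of mean activations (harmful minus harmless)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def orthogonalise(weight, direction):
    """Remove the output component along `direction`: W' = W - r (r^T W).

    Assumes `weight` has shape (d_model, d_in) and `direction` is unit-norm.
    """
    r = direction.reshape(-1, 1)
    return weight - r @ (r.T @ weight)

# Synthetic stand-ins for recorded hidden states (hypothetical data).
rng = np.random.default_rng(0)
d_model, d_in = 16, 32
planted = np.zeros(d_model)
planted[3] = 1.0                      # the direction we hide in the "harmful" set
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 4.0 * planted

r = refusal_direction(harmful, harmless)
W = rng.normal(size=(d_model, d_in))  # stand-in for one output projection
W_abl = orthogonalise(W, r)

x = rng.normal(size=d_in)
leak = abs(r @ (W_abl @ x))           # output along r after the edit
edit_rank = np.linalg.matrix_rank(W - W_abl)
print(f"recovered component on planted axis: {r[3]:.3f}")
print(f"output along refusal direction after edit: {leak:.2e}")
print(f"rank of the weight edit: {edit_rank}")
```

For a single direction the edit W - W' equals r (r^T W), which is a rank-1 matrix. That is why single-direction abliteration is often described as a rank-1 edit.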

What actually changes in the model?

Abliteration modifies a small fraction of the model’s weights. Across the 8 models we analysed:

Tensors changed: 10 to 15% of total
Relative edit magnitude: 1 to 6% per modified tensor
KL divergence, base vs abliterated: 0.003 to 0.16
HarmBench ASR: 88 to 100%

The modified tensors are overwhelmingly output projection weights. These are what layers “say” rather than what they “hear.” No technique modifies Q, K, V, or embedding weights.
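Figures like these can be derived by diffing two checkpoints tensor by tensor. The sketch below assumes, for illustration only, that "relative edit magnitude" means the Frobenius norm of the change divided by the Frobenius norm of the original tensor. The helper diff_report and the toy checkpoint dictionaries are hypothetical and are not taken from any tool discussed here.

```python
import numpy as np

def diff_report(base, edited, tol=0.0):
    """Return (fraction of tensors changed, {name: relative edit magnitude}).

    Relative magnitude is assumed to be ||W' - W||_F / ||W||_F.
    """
    rel = {}
    for name, w in base.items():
        change = np.linalg.norm(edited[name] - w)
        if change > tol:
            rel[name] = change / np.linalg.norm(w)
    return len(rel) / len(base), rel

# Toy "checkpoints": dicts mapping tensor names to arrays (hypothetical names).
rng = np.random.default_rng(0)
base = {
    f"layers.{i}.{kind}.weight": rng.normal(size=(8, 8))
    for i in range(5)
    for kind in ("q_proj", "o_proj")
}
edited = {name: w.copy() for name, w in base.items()}
edited["layers.2.o_proj.weight"] *= 0.97  # stand-in edit: shrink one tensor by 3%

fraction, rel = diff_report(base, edited)
print(f"{fraction:.0%} of tensors changed")
for name, magnitude in rel.items():
    print(f"{name}: relative edit {magnitude:.3f}")
```

With real checkpoints the same loop would run over the model's state dictionary, and a small non-zero tol can help ignore differences that come only from dtype conversion.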

Does abliteration degrade the model?

Yes, but the degree varies dramatically by technique. This is the core question Abliterlitics was built to answer.

Capability benchmarks like MMLU, HellaSwag, ARC, WinoGrande, and PiQA typically drop 0.5 to 3 percentage points for the best techniques. The worst techniques can drop 6+ points on individual benchmarks.

TruthfulQA consistently drops across all techniques, losing 5 to 11 percentage points. This suggests abliteration affects the model’s relationship with factual accuracy beyond just refusal behaviour.

KL divergence measures how much the output distribution shifts on benign prompts. Values below 0.01 are excellent: the model behaves almost identically to the original on normal inputs. Values above 0.1 indicate noticeable behavioural changes.
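As a rough illustration of the metric, the sketch below computes KL divergence between next-token distributions from a base and an edited model and averages it over prompts. Several details are assumptions made for the example: the direction KL(base || edited), natural-log units, one logit vector per prompt, and the made-up logits themselves. The function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def mean_kl(base_logits, edited_logits):
    """Average KL(base || edited) across prompts."""
    kls = [
        kl_divergence(softmax(b), softmax(e))
        for b, e in zip(base_logits, edited_logits)
    ]
    return sum(kls) / len(kls)

# Made-up next-token logits for two benign prompts over a 3-token vocabulary.
base_logits = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
edited_logits = [[2.0, 1.1, 0.1], [0.4, 0.5, 3.0]]

shift = mean_kl(base_logits, edited_logits)
print(f"mean KL divergence: {shift:.5f}")
```

An identical pair of models gives exactly zero, and small logit perturbations like the ones above land well under the 0.01 threshold.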

Reasoning efficiency. For thinking models in the Qwen3.x series, abliteration changes how long the model thinks before answering. This can artificially inflate or deflate GSM8K scores depending on whether the thinking chain lengthens or shortens.

Apostate Abliteration: Benchmarks, KL Divergence, and Weight Forensics

Apostate is a new abliteration tool that uses orthogonal projection with a balanced profile. It spreads edits across almost all layers with moderate intensity per tensor. It was tested on Qwen2.5-7B, where it achieved 98.8% ASR. Its edits showed near-zero cosine similarity to Huihui's despite targeting the same tensors.
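To see how two techniques can edit the same tensor yet have near-zero cosine similarity, consider the sketch below. It assumes the similarity is computed between flattened weight deltas (edited minus base), which is an assumption about the metric rather than a documented definition. The helpers delta_cosine and ablate are hypothetical, and the two directions are deliberately chosen to be orthogonal.

```python
import numpy as np

def delta_cosine(base, edit_a, edit_b):
    """Cosine similarity between two flattened weight deltas."""
    da = (edit_a - base).ravel()
    db = (edit_b - base).ravel()
    denom = np.linalg.norm(da) * np.linalg.norm(db)
    return float(da @ db / denom) if denom > 0 else 0.0

def ablate(weight, direction):
    """Directional ablation of one tensor: W' = W - r (r^T W)."""
    r = (direction / np.linalg.norm(direction)).reshape(-1, 1)
    return weight - r @ (r.T @ weight)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32))   # stand-in for one shared output projection
dir_a = np.eye(16)[0]           # direction removed by hypothetical technique A
dir_b = np.eye(16)[1]           # orthogonal direction removed by technique B

same = delta_cosine(W, ablate(W, dir_a), ablate(W, dir_a))
different = delta_cosine(W, ablate(W, dir_a), ablate(W, dir_b))
print(f"same direction: {same:.3f}, orthogonal directions: {different:.3f}")
```

Both edits touch the same tensor, but because they remove different directions the deltas do not overlap. Under this reading, a near-zero score would suggest the two tools found different directions, not that they edited different weights.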

Abliterix Abliteration: Benchmarks, KL Divergence, and Weight Forensics

Abliterix is based on Heretic but instead of using a fixed formula, it runs a search to find the best settings for each model. Think of it as Heretic with an automatic tuning knob that tries different configurations. It also understands some newer model architectures that other techniques don't. The downside is that its changes are more fragile when the model gets compressed, causing bigger capability drops than other approaches.

AEON Abliteration: Benchmarks, KL Divergence, and Weight Forensics

AEON tries to erase the concept of refusal from the model entirely, rather than just blocking the refusal signal like Heretic does. It uses a different mathematical approach and modifies more parts of the model. AEON claims to be lossless and to actually improve the model's capabilities, but our benchmarks showed every standard test got worse, not better. It's a more aggressive approach with broader changes.

HauhauCS Abliteration: Benchmarks, Plagiarism Analysis, and GGUF Forensics

HauhauCS produces uncensored model variants using a tool called Reaper, which was found to be copied from Heretic with the original author's credit removed. On top of that copied core, Reaper adds some extra processing steps. HauhauCS models are distributed in a compressed format rather than the standard format, which adds a layer of compression artifacts on top of the uncensoring edits. We have dropped HauhauCS from future comparisons because of the plagiarism.

Heretic Abliteration: Benchmarks, KL Divergence, and Weight Forensics

Heretic is a fully automatic tool that finds the part of the model responsible for refusing requests and precisely removes just that part. It changes the fewest weights of any technique, which means the model stays as close to the original as possible. Each run produces slightly different results, like tuning a guitar by ear. It consistently achieves strong safety removal with minimal collateral damage.

Huihui Abliteration: Benchmarks, KL Divergence, and Weight Forensics

Huihui takes the opposite approach to Heretic. Instead of precisely removing one thing, it gently adjusts a bunch of different parts of the model by a small amount each. The idea is that spreading the changes around means no single part gets hit too hard. On reasoning models it tends to shorten the thinking process, which can make math scores look better simply because the model stops overthinking and actually writes an answer.