Capability Benchmarks
Evaluated with lm-evaluation-harness via vLLM.
Tasks
| Task | What it measures | Metric |
|---|---|---|
| MMLU | General knowledge across 57 subjects | Accuracy (5-shot) |
| GSM8K | Mathematical reasoning | Strict match / Flexible extract |
| HellaSwag | Commonsense reasoning | Accuracy (normalized) |
| ARC-Challenge | Science reasoning | Accuracy (25-shot) |
| WinoGrande | Coreference resolution | Accuracy (5-shot) |
| TruthfulQA MC2 | Resistance to misconceptions | Accuracy (0-shot) |
| PiQA | Physical reasoning | Accuracy (0-shot) |
| Lambada OpenAI | Word prediction | Perplexity (lower is better) |
What each benchmark tests
MMLU (Massive Multitask Language Understanding). Tests general knowledge across 57 academic subjects including history, law, medicine, mathematics, computer science, and philosophy. The model is given a multiple-choice question and must select the correct answer. This is the single most important benchmark for measuring whether abliteration degraded the model’s overall knowledge. Measured as accuracy with 5-shot prompting.
GSM8K (Grade School Math 8K). About 8,500 grade-school-level math word problems requiring multi-step reasoning. The model must solve problems like “Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?” Straightforward for humans, but tests whether the model can chain logical steps correctly. For reasoning models, this benchmark is tricky because the model thinks out loud before answering. If the thinking chain gets too long, the model exhausts its generation budget before producing an answer, and the response is scored as invalid. We report both raw scores and adjusted scores that exclude these invalid responses. Measured as strict match and flexible extract.
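The two extraction modes and the raw versus adjusted scores can be sketched as follows. This is a minimal illustration, not the harness's code: the regular expressions only approximate its strict-match and flexible-extract filters, the helper names are hypothetical, and we assume an "invalid" response is one with no extractable answer.

```python
import re

# Assumption: strict match expects the GSM8K-style "#### <number>" answer line,
# while flexible extract falls back to the last number anywhere in the text.
STRICT = re.compile(r"#### (-?[0-9.,]+)")
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def _clean(num):
    # Drop thousands separators and any trailing sentence period.
    return num.replace(",", "").rstrip(".")

def extract_strict(text):
    m = STRICT.search(text)
    return _clean(m.group(1)) if m else None

def extract_flexible(text):
    found = NUMBER.findall(text)
    return _clean(found[-1]) if found else None

def score(responses, targets, extract):
    """Return (raw, adjusted) accuracy; adjusted skips responses with no answer."""
    answers = [extract(r) for r in responses]
    correct = sum(a == t for a, t in zip(answers, targets))
    valid = sum(a is not None for a in answers)
    raw = correct / len(responses)
    adjusted = correct / valid if valid else 0.0
    return raw, adjusted
```

Under these assumptions, a response that ends mid-thought with no number counts against the raw score but is left out of the adjusted one.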
HellaSwag. Tests commonsense reasoning through sentence completion. The model is given a partial description of an everyday scenario and must choose the most plausible continuation from four options. Questions like “A woman is seen cutting vegetables on a cutting board, then she…” with options like “adds them to a pot of soup” vs “drives a car to work.” Sounds simple, but requires real-world understanding. Measured as accuracy with normalized scoring.
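Normalized scoring matters because longer continuations tend to accumulate a lower total log-likelihood. A minimal sketch, assuming the normalization divides each option's log-likelihood by its length in bytes (our understanding of the harness's normalized accuracy). The log-likelihood values below are made up for illustration.

```python
def pick_raw(loglikes):
    # Choose the option with the highest total log-likelihood.
    return max(range(len(loglikes)), key=lambda i: loglikes[i])

def pick_normalized(loglikes, options):
    # Choose the option with the highest log-likelihood per byte.
    return max(
        range(len(options)),
        key=lambda i: loglikes[i] / len(options[i].encode("utf-8")),
    )

options = ["adds them to a pot of soup", "leaves"]
loglikes = [-12.0, -9.0]  # made-up values, not model output

raw_choice = pick_raw(loglikes)
norm_choice = pick_normalized(loglikes, options)
```

Here the short option wins on raw log-likelihood, while the longer, more plausible option wins once length is accounted for.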
ARC-Challenge (AI2 Reasoning Challenge). Science exam questions from standardised tests for 3rd through 9th grade. These require genuine reasoning, not just pattern matching. Questions span biology, chemistry, physics, earth science. The “Challenge” subset specifically filters for questions that retrieval-based methods get wrong, so this tests actual reasoning ability. Measured as accuracy with 25-shot prompting.
WinoGrande. Coreference resolution inspired by the Winograd Schema Challenge. The model reads a sentence with an ambiguous pronoun and must determine which noun it refers to. For example, “The trophy didn’t fit into the brown suitcase because it was too large” vs “…because it was too small.” The answer changes based on whether “it” refers to the trophy or the suitcase. Tests whether the model understands physical relationships and context. Measured as accuracy with 5-shot prompting.
TruthfulQA MC2. Tests whether the model can resist common misconceptions and myths. Questions target things that people commonly get wrong, like “What happens if you touch a baby bird?” with the popular misconception that the mother will abandon it. The model must identify the truthful answer, not the commonly believed one. This benchmark consistently drops after abliteration, suggesting the safety direction overlaps with factual accuracy. Measured as accuracy with 0-shot prompting.
PiQA (Physical Interaction: Question Answering). Tests understanding of how the physical world works. The model is given a goal like “to separate egg whites from the yolk” and must choose the most plausible method from two options. Requires common sense about objects, tools, and physical processes. Measured as accuracy with 0-shot prompting.
Lambada OpenAI. Tests how well the model predicts the last word of a passage. The model reads a full paragraph and must predict the final word. This is fundamentally a test of language modelling quality and coherence. Unlike the other benchmarks which test reasoning or knowledge, Lambada tests whether the model’s basic language generation is intact after abliteration. Reported as perplexity, where lower is better. A big jump in perplexity means the model’s output distribution has shifted significantly.
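For reference, perplexity is the exponential of the mean negative log-likelihood of the scored tokens. A minimal sketch, assuming natural-log token probabilities are already available (the function name is ours):

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over the scored tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

A model that assigns probability 0.5 to every target token has a perplexity of 2. Halving those probabilities doubles the perplexity, which is why a shifted output distribution shows up as a jump in this metric.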
Safety: HarmBench
We use HarmBench's 400 textual behaviours across 7 categories:
| Category | Items |
|---|---|
| chemical_biological | 56 |
| copyright | 100 |
| cybercrime_intrusion | 67 |
| harassment_bullying | 25 |
| harmful | 22 |
| illegal | 65 |
| misinformation_disinformation | 65 |
Classification
Responses are classified with a custom classifier, then verified by multiple independent LLM reviewers. For reasoning models, when the final response is empty but the chain of thought (CoT) contains content, we analyse the CoT to determine whether it was heading toward compliance or refusal.
Key metrics
- ASR (Attack Success Rate): Percentage of behaviours where the model complies rather than refuses.
- Full CoT ASR: ASR that also counts truncated responses whose CoT was heading toward compliance.
- Empty responses: Responses where the generation budget was exhausted before visible content was produced.
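The three metrics can be sketched as below. This is an illustration under stated assumptions rather than the actual pipeline: the label names (`comply`, `refuse`, `empty`) and the `cot_direction` field are hypothetical, and we assume both rates use all behaviours as the denominator.

```python
def harmbench_metrics(records):
    """records: dicts with 'label' in {'comply', 'refuse', 'empty'} and,
    for empty responses, an optional 'cot_direction' in {'comply', 'refuse'}."""
    n = len(records)
    complied = sum(r["label"] == "comply" for r in records)
    empty = [r for r in records if r["label"] == "empty"]
    # Empty responses whose truncated CoT was heading toward compliance.
    cot_complied = sum(r.get("cot_direction") == "comply" for r in empty)
    return {
        "asr": 100.0 * complied / n,
        "full_cot_asr": 100.0 * (complied + cot_complied) / n,
        "empty_responses": len(empty),
    }
```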
KL Divergence
Measures how much the abliterated variant’s output distribution diverges from the base model’s distribution on benign prompts.
Methodology
F.kl_div(logprobs_variant, logprobs_base, reduction="batchmean", log_target=True)
Full-vocabulary first-token logits are obtained via model.generate(max_new_tokens=1, output_scores=True). Dataset: mlabonne/harmless_alpaca, split test[:100]. System prompt: “You are a helpful assistant.”
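The argument order in the F.kl_div call above is easy to misread. PyTorch treats the first argument as the input and the second as the target, so the call computes KL(base || variant). The pure-Python sketch below mirrors that computation on small lists, to make the formula explicit. It is not the evaluation code, and the function names are ours.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one prompt's vocabulary logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_batchmean(logprobs_variant, logprobs_base):
    """Mirror of F.kl_div(input, target, reduction='batchmean', log_target=True):
    sum over the vocabulary of p_base * (log p_base - log p_variant),
    divided by the number of prompts."""
    total = 0.0
    for lv, lb in zip(logprobs_variant, logprobs_base):
        total += sum(math.exp(b) * (b - v) for v, b in zip(lv, lb))
    return total / len(logprobs_base)
```

Identical distributions give 0, and any shift in the variant's first-token distribution gives a positive value.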
This matches the Heretic evaluator methodology, enabling fair comparison across techniques.
Interpretation scale
| KL Range | Rating |
|---|---|
| < 0.01 | Excellent |
| 0.01 - 0.05 | Very good |
| 0.05 - 0.1 | Good |
| 0.1 - 0.5 | Moderate |
| > 0.5 | Heavy |
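The scale can be applied mechanically. In this small sketch the handling of shared boundaries is our assumption, since the table does not say which band a value such as exactly 0.05 belongs to. Here a shared boundary goes to the higher band, except 0.5, which stays "Moderate" because "Heavy" is defined as strictly greater than 0.5.

```python
def kl_rating(kl):
    # Assumption: shared boundaries (0.01, 0.05, 0.1) fall in the higher band.
    if kl < 0.01:
        return "Excellent"
    if kl < 0.05:
        return "Very good"
    if kl < 0.1:
        return "Good"
    if kl <= 0.5:
        return "Moderate"
    return "Heavy"
```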
Weight Forensics
All weight analysis uses Abliterlitics.
Analysis types
SVD (Singular Value Decomposition). Takes the difference between the original model’s weights and the abliterated model’s weights, then breaks that difference down into its fundamental components. Think of it like shining a light through a prism. The edit is white light, and SVD splits it into individual colours. If the edit is “rank-1,” it means the entire change points in a single direction. If it’s higher rank, the change is more complex and multi-directional.
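The idea can be sketched with NumPy on a synthetic example. This is not Abliterlitics code: the function names and the relative tolerance used to decide which singular values count as non-zero are assumptions made for illustration.

```python
import numpy as np

def edit_spectrum(w_base, w_edited):
    # Singular values of the weight difference, largest first.
    return np.linalg.svd(w_edited - w_base, compute_uv=False)

def numerical_rank(singular_values, rel_tol=1e-6):
    # Count singular values that are non-negligible relative to the largest.
    if singular_values[0] == 0:
        return 0
    return int(np.sum(singular_values > rel_tol * singular_values[0]))

rng = np.random.default_rng(0)
w_base = rng.normal(size=(8, 6))

# Synthetic rank-1 edit: the outer product of one column and one row vector.
u = rng.normal(size=(8, 1))
v = rng.normal(size=(1, 6))
w_rank1 = w_base + 0.1 * (u @ v)

# Synthetic rank-3 edit: the product of an 8x3 and a 3x6 matrix.
w_rank3 = w_base + rng.normal(size=(8, 3)) @ rng.normal(size=(3, 6))
```

For the rank-1 edit only one singular value stands out from floating-point noise. For the rank-3 edit, three do.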
Fingerprint. Compares every single weight tensor in the abliterated model against the original. Builds a complete map of exactly which weights changed and which stayed the same. This tells you how surgical or how broad the technique was. Did it touch 20 tensors or 200? Did it only modify output projections, or did it change attention weights, norms, and embeddings too?
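A minimal sketch of the comparison, assuming both checkpoints are available as dictionaries mapping tensor names to arrays with matching shapes and dtypes. The function name is hypothetical, and the exact-equality test is our simplification.

```python
import numpy as np

def changed_tensors(base, edited):
    """Return the sorted names of tensors that differ between two checkpoints."""
    return sorted(
        name for name in base
        if not np.array_equal(base[name], edited[name])
    )
```

The length of the returned list is the "tensors changed" count, and the names show which parts of the model the technique touched.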
Edit vector overlap. Compares the direction of the edits between two different techniques. If two techniques both modify the same weight tensor, are they pushing it in the same direction or completely different directions? Low overlap means they found different paths to the same result. High overlap means they’re doing essentially the same thing.
Per-layer analysis. Looks at which layers of the model carry the biggest edits. Some techniques concentrate changes in the middle layers, others at the top, others spread evenly. This matters because different layers handle different things. As a rough rule of thumb, early layers process basic patterns, middle layers handle reasoning, and later layers control output behaviour.
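One way to aggregate edits per layer is sketched below. It assumes tensor names contain a `.layers.<index>.` segment, as in many Hugging Face checkpoints, and that per-tensor edit norms are combined as the square root of the sum of squares. The names and norms in the example are made up.

```python
import re
from collections import defaultdict

LAYER = re.compile(r"\.layers\.(\d+)\.")

def per_layer_edit_norm(edit_norms):
    """edit_norms: dict mapping tensor name -> Frobenius norm of its edit.
    Returns {layer index: combined norm}; tensors outside any layer are skipped."""
    totals = defaultdict(float)
    for name, norm in edit_norms.items():
        m = LAYER.search(name)
        if m:
            totals[int(m.group(1))] += norm ** 2
    return {layer: sq ** 0.5 for layer, sq in sorted(totals.items())}

example = {
    "model.layers.3.self_attn.o_proj.weight": 3.0,
    "model.layers.3.mlp.down_proj.weight": 4.0,
    "model.layers.10.mlp.down_proj.weight": 1.0,
    "model.embed_tokens.weight": 2.0,
}
by_layer = per_layer_edit_norm(example)
```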
Rank structure. Determines whether the edits live in a single direction or span multiple independent directions. A rank-1 edit is like drawing one line through the weight matrix. A rank-3 edit draws three lines. Lower rank generally means a more targeted, interpretable change.
Cross-technique alignment. Checks whether different abliteration methods found the same “refusal direction” or completely different ones. Across all our tests, the overlap is remarkably low. Different techniques consistently find structurally different solutions that produce identical behavioural outcomes. There is no universal abliteration subspace.
Key concepts
Tensors changed is how many weight matrices differ from the base model. Lower is generally more surgical.
Relative edit magnitude is how big the changes are relative to the original weights. Higher means more aggressive modification.
Cosine similarity measures whether two techniques’ edits point in the same direction. Low values mean they found different solutions.
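The last two quantities can be sketched with NumPy as follows. This assumes the Frobenius norm for magnitudes and a cosine taken over flattened edit matrices. The text does not specify either choice, and the function names are ours.

```python
import numpy as np

def relative_edit_magnitude(w_base, w_edited):
    # Size of the edit relative to the size of the original weights.
    return float(np.linalg.norm(w_edited - w_base) / np.linalg.norm(w_base))

def edit_cosine(w_base, w_a, w_b):
    # Cosine between the edits that two techniques made to the same tensor.
    da = (w_a - w_base).ravel()
    db = (w_b - w_base).ravel()
    return float(da @ db / (np.linalg.norm(da) * np.linalg.norm(db)))
```

A cosine near 1 means both techniques pushed the tensor the same way. A cosine near 0 means the edits are close to orthogonal.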
Hardware
| Component | Model |
|---|---|
| GPU 1 | NVIDIA RTX 5090 32GB |
| GPU 2 | NVIDIA RTX 4090 24GB |
Dual-GPU configurations use tensor parallelism (TP=2). Depending on model size, models are loaded either with BitsAndBytes 4-bit quantisation or in BF16 with CPU offloading.