Tensor comparison

Everything starts with weight forensics. We run a tensor comparison between the base model and each abliterated variant, mostly on CPU and on GPU where we can. This gives us the raw edit vectors: which tensors changed, how much they changed, and the modification footprint of each technique.
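As a rough illustration of what a tensor comparison of this kind involves, the sketch below diffs two checkpoints tensor by tensor. It is not the Abliterlitics implementation: the function name `compare_tensors`, the `atol` parameter, and the tensor names are hypothetical, and tensors are represented as flat lists of floats so the example runs with the standard library alone. A real run would load the weights with a tensor library instead.

```python
import math

def compare_tensors(base, variant, atol=0.0):
    """Compare two {tensor_name: flat list of floats} mappings.

    Returns one record per changed tensor, holding the L2 norm of the
    difference and that norm relative to the base tensor's norm.
    """
    report = []
    for name in sorted(base.keys() & variant.keys()):
        b, v = base[name], variant[name]
        if len(b) != len(v):
            raise ValueError(f"size mismatch for {name}")
        diff_norm = math.sqrt(sum((x - y) ** 2 for x, y in zip(v, b)))
        if diff_norm <= atol:
            continue  # tensor untouched by the edit
        base_norm = math.sqrt(sum(x * x for x in b))
        report.append({
            "name": name,
            "diff_norm": diff_norm,
            "rel_change": diff_norm / base_norm if base_norm else math.inf,
        })
    return report

# Hypothetical two-tensor checkpoint; only the projection was edited.
base = {"layers.0.o_proj": [1.0, 0.0, 0.0, 1.0], "layers.0.norm": [1.0, 1.0]}
variant = {"layers.0.o_proj": [1.0, 0.0, 0.6, 0.2], "layers.0.norm": [1.0, 1.0]}
changed = compare_tensors(base, variant)
```

The list of changed tensor names is the modification footprint, and the per-tensor norms indicate how heavily each one was edited.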

KL divergence

Next we measure KL divergence on benign prompts. This tells us how much the abliterated model’s output distribution has drifted from the base model on normal inputs.
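For readers who want the measurement spelled out, here is a minimal sketch of KL divergence between next-token distributions, computed from raw logits. The direction KL(base || abliterated), the use of nats, and the averaging over prompts are assumptions made for this example, and the function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(base_logits, variant_logits):
    """KL(base || variant) in nats for one next-token distribution."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(variant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def mean_kl(base_batch, variant_batch):
    """Average KL over a batch of prompts (one logit list per prompt)."""
    pairs = list(zip(base_batch, variant_batch))
    if not pairs:
        raise ValueError("empty batch")
    return sum(kl_divergence(b, v) for b, v in pairs) / len(pairs)
```

A value of zero means the two models assign identical probabilities on that prompt; larger values mean more drift.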

KL divergence is hard to replicate exactly. CUDA version and GPU hardware can change the results, so small variations are expected. To sanity-check our setup, we compare our KL values against the values on the existing Heretic model cards. A difference of 2% to 10% is normal and acceptable.
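The sanity check can be expressed as a relative-difference test. This is only a sketch: the function names are hypothetical, and treating 10% as the cutoff is an assumption based on the range quoted above.

```python
def relative_difference(measured, reference):
    """Absolute difference as a fraction of the reference value."""
    if reference == 0:
        raise ValueError("reference KL must be non-zero")
    return abs(measured - reference) / abs(reference)

def kl_matches_reference(measured, reference, max_rel_diff=0.10):
    """True when the measured KL is within max_rel_diff of the reference.

    max_rel_diff=0.10 is an assumed cutoff taken from the 2% to 10% range.
    """
    return relative_difference(measured, reference) <= max_rel_diff
```

For example, a measured KL of 0.105 against a model-card value of 0.100 is a 5% difference and would pass, while 0.130 against 0.100 would not.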

HarmBench baseline

Before we can compare abliterated models, we need a baseline from the base model. We run HarmBench with a starting token budget of around 8096 and see how many empty responses we get.

An empty response usually means the model exhausted its thinking budget: the reasoning ran out of tokens before it produced an answer. We use the empty response count and ASR from this base run as the reference point for all other models.
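A minimal sketch of how the two reference numbers might be computed from a run. The record fields `response` and `judged_harmful` are hypothetical and do not reflect HarmBench's actual output schema. Dividing by all prompts, so that empty responses count as failed attacks, is also an assumption of this example.

```python
def summarize_harmbench(records):
    """Count empty responses and compute attack success rate (ASR).

    Each record is a dict with a 'response' string and a boolean
    'judged_harmful' flag (hypothetical field names).
    """
    total = len(records)
    if total == 0:
        raise ValueError("no records to summarize")
    empty = sum(1 for r in records if not r["response"].strip())
    successes = sum(1 for r in records if r["judged_harmful"])
    return {
        "total": total,
        "empty": empty,
        "empty_rate": empty / total,
        "asr": successes / total,  # denominator includes empty responses
    }
```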

The token budget balance is tricky. Refusing a harmful prompt takes fewer tokens than reasoning through a response to it, so abliterated models generally need a higher token budget: they engage with the prompt instead of refusing. We try to find a token budget that doesn’t blow out evaluation time while still giving the model room to respond.
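One way to frame that trade-off is to pick the smallest budget whose empty-response rate is acceptable. The sketch below assumes the empty rates for a few candidate budgets have already been measured. The function name and the `max_empty_rate` threshold are hypothetical, and the numbers in the usage note are made up for illustration, not measurements.

```python
def pick_token_budget(empty_rate_by_budget, max_empty_rate=0.02):
    """Return the smallest budget whose empty rate is acceptable.

    empty_rate_by_budget maps a token budget to the measured fraction
    of empty responses at that budget. Returns None if no candidate
    meets the threshold.
    """
    for budget in sorted(empty_rate_by_budget):
        if empty_rate_by_budget[budget] <= max_empty_rate:
            return budget
    return None
```

With illustrative rates of 0.20 at 4096 tokens, 0.05 at 8192 and 0.01 at 16384, this selects 16384 under the default threshold and 8192 if the threshold is relaxed to 0.05.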

If an abliterated model still produces more empty responses than the base model, even after some extra margin is allowed, and especially if it produces more than the other abliterated models, that suggests the abliteration is causing reasoning issues.
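That comparison might look like the following sketch, which flags any model whose empty-response count exceeds the base count by more than a margin. The function name, the default `margin` value, and the model names in the usage note are hypothetical.

```python
def flag_reasoning_regressions(empty_by_model, base_name, margin=2):
    """Return the models with more empty responses than base + margin.

    empty_by_model maps a model name to its empty-response count at
    the same token budget. margin is an assumed allowance, not a
    value taken from the methodology.
    """
    base_empty = empty_by_model[base_name]
    return sorted(
        name
        for name, count in empty_by_model.items()
        if name != base_name and count > base_empty + margin
    )
```

For instance, with counts of 3 for `base`, 4 for `variant_a` and 11 for `variant_b`, only `variant_b` is flagged.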

Capability benchmarks

Once HarmBench is sorted, we run the capability benchmarks, starting with everything except GSM8K. These benchmarks use log-likelihood scoring: they are deterministic and do not generate any tokens, so the token limit has no effect on them and we leave it unchanged.
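To show why no generation is involved, here is a simplified sketch of log-likelihood scoring for a multiple-choice item: each candidate answer is scored by the total log-probability the model assigns to its tokens, and the highest-scoring candidate is the prediction. This is an illustration of the idea, not lm-evaluation-harness code, and the function name is hypothetical.

```python
def score_multiple_choice(choice_logprobs, gold_index):
    """Return True if the highest-likelihood choice is the gold answer.

    choice_logprobs holds one list of per-token log-probabilities for
    each candidate answer, as scored by the model given the question.
    """
    totals = [sum(tokens) for tokens in choice_logprobs]
    predicted = max(range(len(totals)), key=totals.__getitem__)
    return predicted == gold_index
```

Because the model only scores fixed candidate strings, the result is the same on every run and a generation token limit never comes into play.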

GSM8K is the hardest benchmark to get right. The math problems elicit long reasoning chains, similar to HarmBench. We establish a baseline token budget the same way, and it usually ends up higher than the HarmBench budget. We measure empty responses for the base model and the first couple of abliterated models to confirm the baseline is solid.

Data pipeline

The Abliterlitics app generates raw JSON files and data, logging everything. We generate Python to parse the results into accurate percentages. Once we have the parsed data, we use an LLM to review it and break it down into detailed files for tensor comparison, KL divergence, HarmBench, and benchmarks.
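The parsing step could be as simple as the sketch below. The JSON layout, with a `correct` and a `total` count per benchmark, is a hypothetical schema chosen for the example and is not the actual Abliterlitics output format.

```python
import json

def parse_results(raw_json):
    """Turn raw per-benchmark counts into percentages.

    Expects JSON of the form {"name": {"correct": int, "total": int}}
    (assumed schema). Benchmarks with no items map to None.
    """
    data = json.loads(raw_json)
    percentages = {}
    for name, counts in data.items():
        total = counts["total"]
        if total:
            percentages[name] = round(100.0 * counts["correct"] / total, 2)
        else:
            percentages[name] = None
    return percentages
```

Deriving the percentages from raw counts in code, rather than copying them by hand, keeps the later review stages working from the same numbers.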

We then use three other LLMs to review the data against the detailed summaries in these files. After that, everything gets a human review. Verification at multiple layers catches errors that any single pass would miss.

Final document

After verification, we compile a condensed version into the final document for the site. The Abliterlitics app also generates SVG-based graphs to visually represent the data.

The final document goes through one more round of LLM cross-comparison review, then a final human review before publishing.

Reproducibility

For recent and future comparisons, we upload all tool artifacts to HuggingFace. That includes the raw logs and JSON output from every stage of the pipeline. We also push a branch of the Abliterlitics app in the exact state it was in when we ran the comparison, so anyone can reproduce or verify our results.

Tools

Abliterlitics: our open-source analysis toolkit
HarmBench: safety evaluation framework
lm-evaluation-harness: capability benchmarks