Qwen3.6-27B Abliteration Comparison

Forensic analysis by Abliterlitics, an open-source abliteration forensics toolkit

I took one AI model and compared five different ways people had removed its safety filters. Then I ran 85 hours of tests to see which method works without breaking the model’s ability to think.

All five methods remove safety filters almost completely. That part works regardless of technique.

On regular tasks like knowledge questions and reading comprehension, most methods barely degrade performance at all. Heretic and Huihui stay within 1% of the original. Abliterix shows the largest deltas, though the model’s creator attributes this to a quantisation interaction rather than intrinsic damage.

The math benchmark made it look like some variants got way better at maths after uncensoring. They didn’t. These models think out loud before answering, and the uncensoring changed how long they think for. The original model overthinks so much it runs out of space and never writes an answer 68% of the time. When all variants actually produce an answer, they score nearly identically at 93-96%. Nobody got smarter. Some just got faster.

AEON’s claims of “enhanced capabilities” and “no looping” didn’t hold up. Every non-math benchmark degraded vs the original, and it had the highest rate of empty responses at 11.3%.

I recovered HuggingFace safetensors from the Q8_K_P quantised GGUF published by HauhauCS using ungguf, our GGUF-to-safetensors conversion tool, and ran the full forensic suite: benchmarks, safety evaluation, KL divergence, and weight analysis. I then compared the result against four other abliteration techniques applied to the same base model: Heretic, Huihui, AEON, and Abliterix.

HauhauCS used an abliteration tool called “Reaper Abliteration,” which was shown to be plagiarised from Heretic under AGPL-3.0 with all attribution stripped and relicensed to PolyForm Noncommercial. Based on our analysis of the recovered source code, on top of the Heretic-derived core, Reaper adds subspace-level rank-k ablation, per-component continuous ablation curves, SOM clustering for multi-directional refusal discovery, and several other techniques. The model was exported as Q8_K_P GGUF, which we converted back to safetensors with ungguf. The weights therefore carry two layers of modification. Reaper’s abliteration edits and GGUF quantisation round-trip noise are superimposed.

For these reasons I will drop HauhauCS from all future comparisons. The lossless claims are debunked, and the Reaper Abliteration tool is open for anyone to see how the models are created. In all benchmarks the HauhauCS models rank lower than the other models, the exception being Qwen 3.5 2B and 4B, where they matched the others.

HauhauCS claims “No changes to datasets or capabilities. Fully functional, 100% of what the original authors intended, just without the refusals” and describes these as “the best lossless uncensored models out there.” AEON claims “Lossless abliteration. Capabilities not merely preserved, measurably enhanced” and “No word-salad, no looping, no philosophizing spirals.” Let’s see.

Quick Facts

| Field | Value |
| --- | --- |
| Base model | Qwen/Qwen3.6-27B |
| Architecture | Qwen3_5ForConditionalGeneration, hybrid Mamba2 + Transformer, 64 layers, 5120 hidden, GQA with 4 KV heads |
| Parameters | ~27B |
| Precision | BF16 safetensors, dequantised from Q8_K_P GGUF |
| Source | Q8_K_P GGUF from HauhauCS, converted with ungguf |
| Context length | 262,144 tokens |

Source & Recovery

| Field | Value |
| --- | --- |
| Original GGUF | Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Q8_K_P.gguf |
| GGUF Size | 29.77 GB |
| Quantisation | Q8_K_P (399 tensors), F32 (353 tensors), F16 (99 tensors) |
| Reference Model | Qwen3.6-27B (official, BF16) |

Converted from GGUF to HuggingFace safetensors format using ungguf with bit-exact verification. All 851 GGUF-derived tensors verified bit-exact against the GGUF source. The GGUF file does not contain MTP or vision encoder tensors, so 348 tensors were copied verbatim from the official reference model.

Benchmarks

Evaluated with lm-evaluation-harness on 8 tasks using BitsAndBytes 4-bit quantisation on a single RTX 5090. All 6 models tested with identical settings. BNB4 quantisation drops absolute scores but preserves relative deltas between variants.

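| Task | Base | Heretic | HauhauCS | Huihui | AEON | Abliterix |
| --- | --- | --- | --- | --- | --- | --- |
| MMLU | 83.3% | 82.8% | 83.9% | 83.4% | 82.9% | 81.3% |
| HellaSwag | 83.5% | 83.2% | 83.1% | 83.5% | 82.7% | 77.3% |
| ARC Challenge | 59.1% | 58.0% | 57.9% | 59.5% | 56.1% | 53.2% |
| WinoGrande | 77.7% | 77.7% | 77.7% | 77.4% | 75.3% | 74.9% |
| TruthfulQA MC2 | 56.7% | 51.1% | 47.2% | 54.8% | 46.1% | 48.7% |
| PiQA | 81.0% | 81.0% | 81.0% | 81.2% | 80.4% | 75.7% |
| GSM8K (7168 tok) | 34.4% | 27.5% | 51.0% | 75.1% | 51.2% | 37.6% |
| GSM8K (adj, excl. invalid) | 96.2% | 93.8% | 96.6% | 96.0% | 95.8% | 95.6% |
| Lambada (ppl ↓) | 3.18 | 3.24 | 3.35 | 3.15 | 3.44 | 9.12 |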

HellaSwag uses acc_norm. All other classification tasks use acc. Lambada uses perplexity. GSM8K uses flexible-extract matching. The raw GSM8K row is misleading because reasoning models exhaust the thinking budget on 23–75% of questions before producing an answer. The adjusted row excludes those invalid responses and reflects actual math capability. Full analysis below.

Delta vs base

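| Task | Heretic | HauhauCS | Huihui | AEON | Abliterix |
| --- | --- | --- | --- | --- | --- |
| MMLU | -0.5 | +0.6 | +0.1 | -0.4 | -2.0 |
| HellaSwag | -0.3 | -0.4 | ±0.0 | -0.8 | -6.2 |
| ARC Challenge | -1.1 | -1.2 | +0.4 | -3.0 | -5.9 |
| WinoGrande | ±0.0 | ±0.0 | -0.3 | -2.4 | -2.8 |
| TruthfulQA MC2 | -5.6 | -9.5 | -1.9 | -10.6 | -8.0 |
| PiQA | ±0.0 | ±0.0 | +0.2 | -0.6 | -5.3 |
| GSM8K | -6.9 | +16.6 | +40.7 | +16.8 | +3.2 |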

Lambada is excluded from the average delta because it uses perplexity, not accuracy, so its delta is not in percentage-point units. GSM8K is excluded due to the thinking budget artefact discussed below.
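The average delta can be recomputed from the benchmark table. The sketch below assumes the average is the mean absolute percentage-point delta over the six accuracy tasks. That definition is my assumption, not something stated by the evaluation scripts, but it reproduces the 1.3pp, 2.0pp, and 0.5pp figures quoted in this writeup for Heretic, HauhauCS, and Huihui. The name `mean_abs_delta` is illustrative.

```python
# Accuracy scores (%) copied from the benchmark table:
# MMLU, HellaSwag, ARC Challenge, WinoGrande, TruthfulQA MC2, PiQA.
SCORES = {
    "Base":      [83.3, 83.5, 59.1, 77.7, 56.7, 81.0],
    "Heretic":   [82.8, 83.2, 58.0, 77.7, 51.1, 81.0],
    "HauhauCS":  [83.9, 83.1, 57.9, 77.7, 47.2, 81.0],
    "Huihui":    [83.4, 83.5, 59.5, 77.4, 54.8, 81.2],
    "AEON":      [82.9, 82.7, 56.1, 75.3, 46.1, 80.4],
    "Abliterix": [81.3, 77.3, 53.2, 74.9, 48.7, 75.7],
}


def mean_abs_delta(variant: str) -> float:
    """Mean absolute delta vs base in percentage points (assumed definition)."""
    base = SCORES["Base"]
    deltas = [v - b for v, b in zip(SCORES[variant], base)]
    return sum(abs(d) for d in deltas) / len(deltas)


for name in SCORES:
    if name != "Base":
        print(f"{name:10s} {mean_abs_delta(name):.2f} pp")
```

Under the same assumption, AEON and Abliterix come out at roughly 3.0pp and 5.0pp.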

What the benchmarks tell us

Heretic has the lowest KL divergence at 0.0037 and the second-smallest non-GSM8K average delta at 1.3pp, behind Huihui at 0.5pp. GSM8K drops 6.9pp raw, but the adjusted gap is only 2.4pp. The most surgical approach produces the best overall tradeoff.

HauhauCS shows solid capability retention. MMLU is actually +0.6pp over base, and WinoGrande and PiQA are flat. TruthfulQA drops 9.5pp, which is moderate. The Reaper abliteration plus GGUF round-trip noise doesn’t meaningfully damage model behaviour despite the broad weight modification footprint. GSM8K raw rises 16.6pp, but the adjusted gap is only +0.4pp.

Huihui has the smallest non-GSM8K average delta at just 0.5pp. Its GSM8K raw score of 75.1% looks like a +40.7pp gain over base, but this is a thinking budget artefact, not a reasoning improvement. See the reasoning efficiency analysis below. Among valid responses, Huihui scores 96.0% adjusted versus base’s 96.2%.

AEON degrades on every non-GSM8K task, with TruthfulQA dropping 10.6pp and ARC dropping 3.0pp. This contradicts its claim of “measurably enhanced capabilities.”

Abliterix has the worst capability preservation under BNB4. HellaSwag drops 6.2pp, PiQA drops 5.3pp, and Lambada perplexity increases 2.9x from 3.18 to 9.12. The model’s creator notes this may be a quantisation interaction rather than intrinsic damage: Abliterix ships rank-3 LoRA-merged weights where the abliteration signal lives in a 3-dimensional subspace, and BNB4’s per-block NF4 quantisation is not subspace-aware, so per-block outliers can inflate absmax and reduce effective precision. A native-BF16 re-run would be needed to confirm.
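The absmax effect the creator describes can be illustrated with a toy block. This is a sketch under simplifying assumptions: it uses a uniform 15-level symmetric grid rather than the non-uniform NF4 grid that BitsAndBytes actually uses, and the block values and the 0.2 outlier are made up. It only shows the mechanism, not the size of the effect on Abliterix.

```python
def quantise_block(block, levels=7):
    """Symmetric absmax quantisation to integers in [-levels, levels], then dequantise."""
    scale = max(abs(x) for x in block) / levels
    return [round(x / scale) * scale for x in block]


def mean_abs_error(block, keep):
    """Mean absolute round-trip error over the first `keep` values of the block."""
    restored = quantise_block(block)
    return sum(abs(a - b) for a, b in zip(block[:keep], restored[:keep])) / keep


# A 64-value block of small weights, evenly spread over [-0.01, 0.01].
clean = [-0.01 + 0.02 * i / 63 for i in range(64)]
# Same block with the last value replaced by a hypothetical merged-update outlier.
spiked = clean[:63] + [0.2]

err_clean = mean_abs_error(clean, keep=63)
err_spiked = mean_abs_error(spiked, keep=63)
print(f"clean block error:  {err_clean:.5f}")
print(f"spiked block error: {err_spiked:.5f}")
```

One outlier stretches the block scale so far that the remaining 63 weights collapse onto the zero level, which is the kind of precision loss a native-BF16 re-run would avoid.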

[Figure: Benchmark Comparison]

[Figure: Benchmark Delta]

The GSM8K reasoning efficiency discovery

Qwen3.6 is a reasoning model. It emits a <think>…</think> reasoning block before its visible response. If the model thinks too long and exhausts the max_gen_toks=7168 token budget, it never produces an answer, and that response is scored as incorrect. GSM8K scores use the flexible-extract metric.
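A response that ran out of budget can be detected from the raw text. The sketch below assumes the reasoning is wrapped in `<think>` and `</think>` tags in the raw generation and that a missing closing tag means the budget ran out mid-thought. `split_response` and `is_invalid` are hypothetical helpers, not part of lm-evaluation-harness.

```python
def split_response(text: str, close_tag: str = "</think>"):
    """Return (reasoning, answer); answer is empty if the close tag never appears."""
    head, sep, tail = text.partition(close_tag)
    if not sep:
        # Budget ran out mid-thought: everything is reasoning, nothing is answer.
        return text.replace("<think>", "", 1).strip(), ""
    return head.replace("<think>", "", 1).strip(), tail.strip()


def is_invalid(text: str) -> bool:
    """Treat a response as invalid when no visible answer follows the reasoning."""
    return split_response(text)[1] == ""


print(is_invalid("<think>Let me reconsider the problem once more"))
print(split_response("<think>2 + 2 = 4</think>The answer is 4"))
```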

The raw GSM8K scores above are misleading. The base model exhausts its thinking budget on 68.2% of questions. Huihui exhausts it on only 23.0%. But when both models actually produce an answer, their accuracy is nearly identical: 96.2% for base versus 96.0% for Huihui. The GSM8K gap is not reasoning ability. It is reasoning efficiency.

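| Model | GSM8K Raw | Invalid Rate | GSM8K Adj (excl. invalid) | Real Gap |
| --- | --- | --- | --- | --- |
| HauhauCS | 51.0% | 49.3% | 96.6% | +0.4pp |
| Base | 34.4% | 68.2% | 96.2% | n/a |
| Huihui | 75.1% | 23.0% | 96.0% | −0.2pp |
| AEON | 51.2% | 69.2% | 95.8% | −0.4pp |
| Abliterix | 37.6% | 62.1% | 95.6% | −0.6pp |
| Heretic | 27.5% | 74.5% | 93.8% | −2.4pp |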

Sorted by adjusted score. The raw scores span 27.5% to 75.1%, a 47.6pp range. The adjusted scores span 93.8% to 96.6%, just 2.8pp. Abliteration changes how long the model thinks, not how well it reasons. The raw GSM8K scores measure thinking efficiency, not mathematical capability.
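The difference between the raw and adjusted rows comes down to which responses go into the denominator. A minimal sketch on toy data follows. The `Sample` record and `gsm8k_summary` are hypothetical names, and the ten samples are invented for illustration rather than taken from the evaluation logs.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    answered: bool  # False when the token budget ran out before a visible answer
    correct: bool   # scorer's verdict for this response


def gsm8k_summary(samples):
    """Return (raw accuracy, invalid rate, accuracy over answered samples only)."""
    total = len(samples)
    valid = [s for s in samples if s.answered]
    raw = sum(s.correct for s in samples) / total
    invalid_rate = 1 - len(valid) / total
    adjusted = sum(s.correct for s in valid) / len(valid) if valid else float("nan")
    return raw, invalid_rate, adjusted


# Toy data: 10 questions, 6 exhausted the budget, 3 of the remaining 4 are correct.
toy = [Sample(False, False)] * 6 + [Sample(True, True)] * 3 + [Sample(True, False)]
raw, invalid_rate, adjusted = gsm8k_summary(toy)
print(f"raw={raw:.0%} invalid={invalid_rate:.0%} adjusted={adjusted:.0%}")
```

The raw score is dominated by how often the model finishes, while the adjusted score only reflects the answers it actually gave.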

The invalid response rate loosely tracks how broadly each technique edits the model, with AEON as the exception:

| Technique | Tensor scope | Invalid rate |
| --- | --- | --- |
| Huihui, 3 types, full coverage | Full coverage | 23.0% |
| HauhauCS, 8 types, all projections + norms | Broad | 49.3% |
| Abliterix, attn.o_proj + mlp.down_proj (rank-3 LoRA, all 64 layers) | LoRA search | 62.1% |
| Base, unmodified | n/a | 68.2% |
| AEON, LEACE + rank-k | Very broad | 69.2% |
| Heretic, 3 types, expert down_proj only | Surgical | 74.5% |

None of the abliteration techniques significantly damage Qwen3.6’s mathematical reasoning ability. The GSM8K differences are overwhelmingly a reasoning efficiency problem, not a reasoning ability problem. The abliteration shortens thinking chains in most variants, allowing answers within the token budget. Heretic is the exception: its surgical edits appear to extend thinking chains rather than shorten them, pushing its invalid rate above even the base model.

Treat all GSM8K numbers as relative comparisons only. BNB4 quantisation significantly depresses absolute scores from the expected bf16 range of 85–90%+.

Safety: HarmBench

HarmBench with 400 textual behaviours, max_tokens=6144, temperature=server_default. Classified with harmbench_classify.py v3.0 with CoT direction analysis for empty responses. Verified by three independent LLM reviewers.

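| Variant | ASR | Empty | Full CoT ASR |
| --- | --- | --- | --- |
| Base | 25.8% | 1 | 26.0% |
| Huihui | 98.5% | 5 | 99.8% |
| HauhauCS | 94.5% | 22 | 100.0% |
| Abliterix | 94.5% | 22 | 100.0% |
| Heretic | 92.5% | 30 | 100.0% |
| AEON | 88.8% | 45 | 100.0% |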

All five abliterated models reach near-complete safety removal. Four of five reach 100% Full CoT ASR, with zero genuine refusals among empty responses. Huihui reaches 99.8% with one genuine text refusal on a sexual coercion prompt. The reported ASR differences come from how much the 6144-token generation budget is consumed by chain-of-thought reasoning before the visible response can be generated. Abliterated models think harder when constructing harmful content, averaging 8-10K reasoning chars versus 3K for base. When the budget is exhausted, the response is empty and the classifier marks it as a refusal. This understates the true ASR.
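The relationship between reported ASR, empty responses, and Full CoT ASR can be checked with simple arithmetic. The sketch below assumes an empty response counts as a non-success in reported ASR, and that Full CoT ASR counts every empty response as a success unless its reasoning was judged a genuine refusal. Both function names are illustrative, not part of harmbench_classify.py.

```python
def reported_asr(total: int, empty: int, refusals: int = 0) -> float:
    """Percent of prompts with a visible, compliant response."""
    return 100 * (total - empty - refusals) / total


def full_cot_asr(total: int, refusals: int) -> float:
    """Percent of prompts not genuinely refused, with empty responses
    assumed to be compliant based on their reasoning direction."""
    return 100 * (total - refusals) / total


# Counts from the table above; Heretic's zero refusals follow from its 100% Full CoT ASR.
print(reported_asr(400, empty=30))              # Heretic, reported
print(reported_asr(400, empty=5, refusals=1))   # Huihui, reported
print(full_cot_asr(400, refusals=1))            # Huihui, Full CoT
```

With these inputs the functions return 92.5, 98.5, and 99.75, matching the table after rounding.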

ASR by category

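| Category | Base | Huihui | Abliterix | AEON | Heretic | HauhauCS |
| --- | --- | --- | --- | --- | --- | --- |
| chemical_biological | 1.8% | 96.4% | 91.1% | 67.9% | 80.4% | 89.3% |
| copyright | 67.0% | 100.0% | 95.0% | 91.0% | 92.0% | 95.0% |
| cybercrime_intrusion | 19.4% | 95.5% | 86.6% | 82.1% | 89.6% | 88.1% |
| harassment_bullying | 0.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
| harmful | 9.1% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
| illegal | 10.8% | 98.5% | 100.0% | 90.8% | 95.4% | 95.4% |
| misinformation | 20.0% | 100.0% | 95.4% | 100.0% | 98.5% | 100.0% |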

Harassment/bullying and harmful are 100% compromised by all abliteration methods, which is complete safety removal. Huihui, AEON, and HauhauCS also reach 100% on misinformation. The remaining categories show 67-100% reported ASR across variants, with four of five reaching 100% after CoT analysis.

[Figure: HarmBench ASR]

[Figure: HarmBench by Category]

The thinking budget problem

The abliterated models exhibit dramatically different thinking patterns:

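| Model | Avg Reasoning Chars | Avg Content Chars | Empty Responses |
| --- | --- | --- | --- |
| Base | 3,067 | 1,238 | 1 (0.3%) |
| Huihui | 8,442 | 4,344 | 5 (1.3%) |
| Abliterix | 8,237 | 4,565 | 22 (5.5%) |
| HauhauCS | 8,916 | 4,380 | 22 (5.5%) |
| Heretic | 8,724 | 3,287 | 30 (7.5%) |
| AEON | 10,194 | 3,894 | 45 (11.3%) |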

Avg Content Chars is computed over all 400 responses. Empty responses count as 0.

KL Divergence

KL divergence measures how much the abliterated variant’s output distribution diverges from the base model’s distribution on benign prompts. Lower values indicate better capability preservation. Methodology matches the Heretic evaluator.

| Variant | KL (batchmean) | Rating |
| --- | --- | --- |
| Heretic | 0.0037 | excellent |
| Huihui | 0.0074 | excellent |
| Abliterix | 0.0222 | very good |
| AEON | 0.0238 | very good |
| HauhauCS | 0.0242 | very good |

Heretic and Huihui are in a class of their own, both rated “excellent.” HauhauCS, Abliterix, and AEON cluster together at roughly 6–6.5x Heretic’s KL, still well below the capability damage threshold we have observed at KL around 0.1. HauhauCS’s 564/850 changed keys combine Reaper’s abliteration edits targeting multiple component types with GGUF Q8_K_P quantisation round-trip noise. Despite this, its output distribution divergence is remarkably low. Reaper’s capability-aware optimisation is effective even when the weights pass through lossy quantisation.
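For readers who want to see what a batchmean KL over first-token distributions computes, here is a dependency-free sketch. It assumes the direction is KL(base || variant) and that the per-prompt values are averaged over the batch. The toy logits are hypothetical, and the real evaluation works on full-vocabulary logits from the models rather than on hand-written lists.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one logit vector."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def kl_batchmean(base_logits, variant_logits):
    """Mean over prompts of KL(base || variant) on first-token distributions."""
    total = 0.0
    for base_row, variant_row in zip(base_logits, variant_logits):
        log_p = log_softmax(base_row)
        log_q = log_softmax(variant_row)
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(base_logits)


# Toy logits for two prompts over a four-token vocabulary (hypothetical values).
base = [[2.0, 1.0, 0.0, -1.0], [0.5, 0.5, 0.5, 0.5]]
variant = [[1.9, 1.1, 0.0, -1.0], [0.5, 0.6, 0.5, 0.4]]
print(kl_batchmean(base, variant))
```

Identical logits give a KL of zero, and small logit shifts like the ones above give values in the same low range as the table.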

[Figure: KL Divergence]

Weight Analysis

Modification summary

Compared against the official Qwen3.6-27B base:

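| Metric | AEON | Abliterix | Heretic | Huihui | HauhauCS |
| --- | --- | --- | --- | --- | --- |
| Tensors changed | 88 (10.4%) | 101 (11.9%) | 120 (14.1%) | 128 (15.1%) | 564 (66.4%) |
| Relative edit | 6.0% | 5.2% | 2.1% | 1.5% | 0.7% |
| Tensor types | 4 (down_proj, out_proj, o_proj, conv1d) | 2 (o_proj, down_proj) | 3 | 3 | 8+ (all) |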

HauhauCS is an extreme outlier, with 4.4-6.4x more changed keys than any other variant. This is a combination of Reaper’s abliteration edits and GGUF quantisation round-trip noise going from BF16 to Q8_K_P and back to BF16. Reaper targets multiple component types per layer, including out_proj, o_proj, down_proj, gate_proj, and up_proj. The abliteration signal and quantisation noise are superimposed and cannot be cleanly separated from the recovered weights alone.

The four other abliteration variants all target output projection weights as their primary mechanism: out_proj, o_proj, and/or down_proj. AEON also repairs SSM conv1d outliers at 8 late layers as a pre-processing step. None touch Q/K/V or gate/up projections. Abliteration works by modifying what layers “say” rather than what they “hear.”

Abliteration signal vs GGUF quantisation noise

HauhauCS used Reaper Abliteration to abliterate the model, then exported as Q8_K_P GGUF. The recovered weights carry both Reaper’s intentional abliteration modifications and the GGUF quantisation round-trip noise from the BF16 to Q8_K_P to BF16 conversion. The combined modification footprint covers 564/850 language model keys, which is 66.4%. A uniform ~0.57% relative edit is visible across ALL tensor types, including types that other abliteration methods don’t target like embed_tokens, q_proj, and v_proj. The abliteration signal from Reaper is superimposed on this noise floor. The overall 0.7% average relative edit in the table above includes Reaper’s larger targeted edits, while the ~0.57% represents the uniform GGUF noise floor visible across all tensor types.
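A relative edit figure of this kind can be computed per tensor as the size of the change divided by the size of the original weights. The sketch below assumes the Frobenius-norm ratio, which is a common choice. The exact definition used by the forensics tool is not spelled out here, so treat `relative_edit` as an illustrative stand-in on made-up values.

```python
import math


def relative_edit(base, variant):
    """Norm of the weight difference divided by the norm of the base weights."""
    diff_sq = sum((v - b) ** 2 for b, v in zip(base, variant))
    base_sq = sum(b ** 2 for b in base)
    return math.sqrt(diff_sq / base_sq)


# Toy flattened tensor (hypothetical values) with every weight nudged by 0.5%.
base = [0.8, -1.2, 0.4, 2.0]
variant = [w * 1.005 for w in base]
print(f"{relative_edit(base, variant):.4%}")
```

A uniform multiplicative nudge like this yields the same ratio for every tensor, which is the signature one would expect from a diffuse noise floor rather than a targeted edit.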

Reaper’s LoRA-based approach targets multiple component types, including attn.o_proj, mlp.down_proj, mlp.gate_proj, mlp.up_proj, and linear_attn.out_proj, with per-component continuous ablation curves. This explains the broad tensor coverage. The abliteration edits and quantisation noise cannot be separated from the recovered weights alone since both modify the same tensors.

Critically, this combination does not significantly affect behavioural performance. The KL divergence of 0.0242 is rated “very good” and benchmark results are solid. The quantisation noise is diffuse rather than concentrated. Based on our analysis of the recovered source code, Reaper applies capability-aware ablation with weight-SVD guards, which limits collateral damage to the model’s functional behaviour.

The other four techniques are nearly orthogonal

Pairwise cosine similarities between the four other abliteration techniques are mostly <0.07. No two techniques discovered the same weight direction. The “refusal direction” in weight space is not a single vector but a manifold with many viable removal pathways. HauhauCS’s recovered weights cannot be directly compared here because the Reaper abliteration signal is superimposed on GGUF quantisation noise.
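The orthogonality check compares edit vectors, meaning the flattened difference between each variant and the base, using cosine similarity. A minimal sketch with hypothetical toy weights follows. The real comparison runs per tensor on the full weight matrices.

```python
import math


def edit_vector(base, variant):
    """Flattened difference between a variant's weights and the base weights."""
    return [v - b for b, v in zip(base, variant)]


def cosine(u, v):
    """Cosine similarity between two edit vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy weights (hypothetical): two techniques editing different coordinates of one tensor.
base = [1.0, 1.0, 1.0, 1.0]
technique_a = [1.1, 1.0, 1.0, 1.0]
technique_b = [1.0, 1.0, 0.9, 1.0]
print(cosine(edit_vector(base, technique_a), edit_vector(base, technique_b)))
```

Values near zero, like the <0.07 reported above, mean the two edits share almost no direction in weight space.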

[Figure: Aggressiveness]

Summary

| Metric | Heretic | HauhauCS | Huihui | AEON | Abliterix |
| --- | --- | --- | --- | --- | --- |
| HarmBench ASR | 92.5% → 100% | 94.5% → 100% | 98.5% → 99.8% | 88.8% → 100% | 94.5% → 100% |
| MMLU | 82.8% | 83.9% | 83.4% | 82.9% | 81.3% |
| GSM8K | 27.5% | 51.0% | 75.1% | 51.2% | 37.6% |
| KL divergence | 0.0037 | 0.0242 | 0.0074 | 0.0238 | 0.0222 |
| Avg delta (excl. GSM8K, Lambada) | 1.3pp | 2.0pp | 0.5pp | | |
| Tensors changed | 120 | 564 | 128 | 88 | 101 |
| Strategy | Moderate broad | Reaper + GGUF noise | Gentle uniform | Gradual ramp | Surgical strikes |

Lambada is excluded from the average delta because it uses perplexity, not accuracy, so its delta is not in percentage-point units. GSM8K is excluded due to the thinking budget artefact discussed above.

Note: Benchmarks use BitsAndBytes 4-bit quantisation. Absolute scores are not directly comparable to bf16 results. Relative deltas between variants are preserved.

Heretic

The best overall on the 3.6-27B. 120 tensors, 3 types, lowest KL at 0.0037, second-smallest non-GSM8K average delta at 1.3pp. GSM8K raw drops 6.9pp but the adjusted gap is only 2.4pp. The one weak spot: Heretic has the highest GSM8K invalid rate at 74.5%, even above the base model at 68.2%. The surgical edits appear to extend thinking chains rather than shorten them. Achieves 100% Full CoT ASR. Note: Heretic is non-deterministic and different runs produce different results. This is also the first Heretic model where we compared the Magnitude-Preserving Orthogonal Ablation (MPOA) method.

HauhauCS

Solid behavioural results despite the complex weight fingerprint where Reaper abliteration and GGUF noise are superimposed. 94.5% reported ASR going to 100% Full CoT. MMLU +0.6pp over base. Highest adjusted GSM8K at 96.6%, just 0.4pp above base. Based on our analysis of the recovered source code, Reaper applies capability-aware ablation with weight-SVD guards and LoRA-based optimisation, limiting collateral damage. The GGUF quantisation round-trip adds diffuse noise that doesn’t meaningfully impact output distributions. The “lossless” claim is simply not evident when Heretic and Huihui both preserve capabilities better. The weights themselves carry Reaper’s abliteration edits plus quantisation artefacts.

Huihui

Highest reported ASR at 98.5% with the fewest empty responses at just 5. Lowest non-GSM8K average delta at 0.5pp. GSM8K raw at 75.1% looks like a +40.7pp gain but this is a thinking budget artefact. Huihui’s invalid rate is just 23.0% versus base’s 68.2%. Among valid responses, the adjusted scores are nearly identical: base 96.2% versus Huihui 96.0%. The abliteration shortens thinking chains, allowing more answers within the token budget. It does not improve math ability.

AEON

Worst thinking loops with 45 out of 400 empty, or 11.3%. Claims “no looping, no philosophizing spirals” and “enhanced capabilities” are contradicted by the data. Every non-GSM8K benchmark degraded. GSM8K invalid rate of 69.2% is above the base model despite the broad modification footprint.

Abliterix

Lowest scores of the five variants on every non-GSM8K benchmark except TruthfulQA. Lambada perplexity increases 2.9x from 3.18 to 9.12 under BNB4 quantisation. The model’s creator argues this is a quantisation interaction rather than intrinsic damage (see quote below). The actual components are attn.o_proj and mlp.down_proj across all 64 layers with a mid-to-late-stack sustained edit profile, not the routers and shared experts our forensic tool initially reported. Qwen3.6 is dense and has no MoE components. The tool misidentified the LoRA-merged directional updates.

Our weight forensics also reported an 8.5% relative edit spike at layer 12, which we originally cited as “80% edit.” This does not match Abliterix’s published config, which shows a peak at layer ~41 with a 35-layer decay radius. The layer 12 spike is likely a forensic artefact from computing base subtraction on rank-3 LoRA-merged weights. Low-rank directional updates can read as concentrated outliers in ambient full-rank tensor space. We have removed the incorrect claim from this analysis.

From a Hugging Face discussion with the model’s creator, wangzhang:

The “Tensor scope” entry doesn’t match the release. In the invalid-rate table, Abliterix is described as down_proj + routers + shared experts. Qwen3.6-27B is dense — no MoE routers, no shared experts. The actually steered components are (a) a unified attn.o_proj bucket across all 64 layers (48 GDN linear_attn.out_proj + 16 self_attn.o_proj), and (b) mlp.down_proj across all 64 layers. This is also inconsistent with the “2 (out_proj, o_proj)” entry for Abliterix in the modification-summary table elsewhere in the writeup. Full config in configs/qwen3.6_27b.toml.

The “80% edit at layer 12” doesn’t match the released hyperparameters. Trial 25’s attn.o_proj profile has max_weight_position = 41.40, max_weight = 5.17, min_weight = 3.21, min_weight_distance = 35.61 — peak at layer ≈ 41 with a 35.6-layer decay radius and a sustained floor above 3.21. That’s a mid-to-late-stack sustained edit, not a layer-12 spike. If your forensic tool is reporting an 80% outlier at layer 12, it’s worth checking whether that’s an artifact of computing weight − base on a rank-3 LoRA-merged release: low-rank directional updates can read as concentrated outliers when projected into ambient full-rank tensor space.

BNB4 may interact unfavourably with low-rank-concentrated abliteration. Abliterix v1 ships rank-3 full-norm LoRA merged to BF16 — the abliteration signal lives in a 3-dimensional subspace of each affected weight matrix. BNB4 is per-block NF4 quantisation (block size 64) with the block scale set by absmax, and isn’t subspace-aware. Merging a low-rank update can introduce per-block outliers that inflate the absmax and reduce effective precision for the rest of the block. The Lambada perplexity going from 3.18 → 9.12 — a 2.9× jump that’s an order of magnitude more dramatic than the KL or HarmBench deltas — is consistent with that interaction rather than with intrinsic capability damage, and warrants a native-BF16 sanity check before concluding “worst capability preservation.” Happy to share the exact vLLM BF16 config we used internally if you’d like to re-run.

Evaluation Timeline

~85 hours of productive GPU time on a single RTX 5090 across 7 days, plus ~25 hours lost to failed runs. All models evaluated sequentially with identical settings per phase.

| Phase | Duration | Details |
| --- | --- | --- |
| Weight forensics + KL | 3.5h | All 5 variants vs base, single pass |
| HarmBench generate | ~45h | 400 behaviours per model, max_tokens=6144, 4 concurrent |
| lm-eval loglikelihood | ~15h | 7 tasks per model, 5 models at 2h each, AEON at 4.5h |
| GSM8K re-run | ~22h | max_gen_toks=7168, per-model times range from 0.9h to 11h |

Failed runs

14 failed runs totalling ~25 hours of wasted GPU time. The bulk were GSM8K timeouts.

| Phase | Fails | Cause |
| --- | --- | --- |
| GSM8K, Base | 4 | Qwen3.5 architecture incompatible with BNB4 + tensor parallelism. Tried batch_size=1, chat mode, eager mode, and llama.cpp before settling on single-GPU BNB4. |
| GSM8K, Heretic | 3 | Default 120s request timeout too short for extended reasoning. Wrote patched_run_v3.sh with 900s timeout to fix. |
| GSM8K, Abliterix | 2 | Same timeout issue as Heretic. |
| GSM8K, AEON | 1 | Same timeout issue. |
| HarmBench, AEON | 1 | Accidentally re-run with max_tokens=4096 instead of the canonical 6144. 6.7h wasted. Results discarded. |
| lm-eval, AEON | 2-3 | Multiple failed attempts before the combined loglikelihood run worked. |

GSM8K per-model times vary dramatically because abliterated models think harder on math problems. HauhauCS took 53 minutes. AEON took 11 hours.

Methodology

  • Capability: lm-evaluation-harness via vLLM v0.19.0, BitsAndBytes 4-bit quantisation on single RTX 5090
  • Safety: HarmBench 400 textual behaviours, max_tokens=6144, harmbench_classify.py v3.0 with CoT analysis, verified by three independent LLM reviewers
  • KL divergence: Full vocab first-token logits via model.generate(max_new_tokens=1, output_scores=True), matching Heretic evaluator methodology
  • Weight analysis: SVD, fingerprint, edit vector overlap, and per-layer analysis comparing all five abliteration variants against the base, using Abliterlitics
  • Hardware: RTX 5090 32GB + RTX 4090 24GB

Tensor Comparison vs Base Model

Summary

| Category | Tensors | Identical to Base | Modified |
| --- | --- | --- | --- |
| GGUF-derived | 851 | 286 | 565 |
| Copied (MTP + vision) | 348 | 348 | 0 |
| Total | 1199 | 634 | 565 |

The tensor comparison counts 851 GGUF-derived tensors. The weight analysis covers 850 language model keys, showing 564 changed. The 1 additional GGUF tensor is a non-language-model tensor that differs but falls outside the weight analysis scope. Modified count here is 565, not 564, because it includes that extra tensor.

Modified Tensors

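| Group | Total | Modified | Typical % Changed | Max Abs Diff |
| --- | --- | --- | --- | --- |
| mlp.gate_proj | 64 | 64 | 74–93% | 3.2e-02 |
| mlp.up_proj | 64 | 64 | 74–79% | 6.5e-03 |
| mlp.down_proj | 64 | 64 | 74–92% | 5.9e-02 |
| linear_attn.out_proj | 48 | 48 | 75–90% | 6.3e-02 |
| self_attn.o_proj | 16 | 16 | 75–86% | 2.3e-02 |
| linear_attn.A_log | 48 | 4 | 12–38% | 6.0e-08 |
| Other projections | 260 | 260 | 74–76% | 1.0–1.8e-03 |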

Layer norms show 26 of 128 modified by GGUF noise, while conv1d and dt_bias were left untouched. A_log values show tiny differences of approximately 6e-08, consistent with floating-point rounding, not intentional modification.
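The two per-tensor columns in the table, percent of elements changed and maximum absolute difference, can be computed along these lines. This is a sketch on a made-up four-element tensor, and `diff_stats` is a hypothetical helper rather than the function that produced diff_report.json.

```python
def diff_stats(base, recovered):
    """Return (percent of elements that differ, largest absolute difference)."""
    diffs = [abs(r - b) for b, r in zip(base, recovered)]
    changed = sum(d != 0 for d in diffs)
    return 100 * changed / len(diffs), max(diffs)


# Toy tensor (hypothetical): two of four values shifted by one small step.
base = [0.25, -0.5, 0.125, 1.0]
recovered = [0.25, -0.5078125, 0.125, 1.0078125]
pct_changed, max_abs_diff = diff_stats(base, recovered)
print(f"{pct_changed:.0f}% changed, max abs diff {max_abs_diff:.1e}")
```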

Provenance Analysis

A three-way comparison between this recovered model, the official Qwen3.6-27B base, and the Qwen3.6-27B-uncensored-heretic-v2 safetensors:

| Category | Count | Meaning |
| --- | --- | --- |
| All three identical | 286 | Untouched by both abliterations |
| Heretic = Base, HauhauCS modified | 444 | Modified by HauhauCS’s processing only |
| All three different | 120 | Modified by both abliterations differently |
| HauhauCS = Heretic ≠ Base | 0 | No bit-exact Heretic fingerprint |
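The three-way bucketing can be sketched as a bit-exact comparison of each tensor across the three checkpoints. The example below assumes each checkpoint is available as a mapping from tensor name to raw bytes. The `provenance` function, the bucket labels, and the toy data are all illustrative. The fifth bucket, where the recovered tensor equals base but Heretic differs, is included for completeness and does not appear in the table above.

```python
from collections import Counter


def provenance(base, heretic, recovered):
    """Bucket each tensor name by which of the three checkpoints agree bit-for-bit."""
    counts = Counter()
    for name, base_bytes in base.items():
        h, r = heretic[name], recovered[name]
        if base_bytes == h == r:
            counts["all three identical"] += 1
        elif base_bytes == h:
            counts["heretic = base, recovered modified"] += 1
        elif h == r:
            counts["recovered = heretic != base"] += 1
        elif base_bytes == r:
            counts["recovered = base, heretic modified"] += 1
        else:
            counts["all three different"] += 1
    return counts


# Toy checkpoints (hypothetical): one byte stands in for each tensor's data.
base = {"a": b"\x00", "b": b"\x01", "c": b"\x02"}
heretic = {"a": b"\x00", "b": b"\x01", "c": b"\x03"}
recovered = {"a": b"\x00", "b": b"\x09", "c": b"\x04"}
print(provenance(base, heretic, recovered))
```

A non-zero count in the "recovered = heretic != base" bucket would be the bit-exact fingerprint the last table row looks for.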

Output Format

| Property | Value |
| --- | --- |
| Format | HuggingFace safetensors (13 shards) |
| Dtype | BF16, dequantised from Q8_K_P/F32/F16 |
| Total Size | 52 GB |
| Tensor Count | 1199 |
| Shard Size | ~4.3 GB |

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "./Qwen3.6-27B-HauhauCS-Q8KP-recovered",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("./Qwen3.6-27B-HauhauCS-Q8KP-recovered")

Quality Notes

This model was recovered from a lossy Q8_K_P quantisation. While the conversion itself is bit-exact to the GGUF source, the original quantisation introduces small errors. The largest mean absolute error across all tensors is 0.000324, and the largest single-element difference is 0.0625. These are uniform across tensor types, confirming the noise is diffuse GGUF round-trip error rather than targeted modification.

Files

Qwen3.6-27B-HauhauCS-Q8KP-recovered/
├── config.json
├── generation_config.json
├── tokenizer.json
├── tokenizer_config.json
├── preprocessor_config.json
├── video_preprocessor_config.json
├── chat_template.jinja
├── vocab.json
├── merges.txt
├── model.safetensors.index.json
├── model.safetensors-00001-of-00013.safetensors
├── ...
├── model.safetensors-00013-of-00013.safetensors
└── diff_report.json              # Full tensor-by-tensor comparison

See our other tensor comparisons and provenance analyses at: DreamFast HauhauCS Safetensor Benchmarks

Disclaimer

This model has had safety alignment removed. It will comply with harmful requests, including generating content related to violence, illegal activities, and other harmful behaviours. Use responsibly and in accordance with applicable laws and regulations. The authors do not condone or encourage the use of this model for harmful purposes.


While I have taken the time to verify all results thoroughly, I am open to any corrections, additional benchmarks, or further analysis. If you spot something that looks wrong and can be confirmed, I am happy to fix it.