Automated model steering and alignment adjustment via LoRA-based optimization
7% refusal rate on Gemma 4 · 0.0006 KL divergence · 150+ model configs · Zero manual tuning
🔥 Breaks DeepRefusal (EMNLP 2025) and Circuit Breakers / Representation Rerouting (NeurIPS 2024) — same lerp-then-abliterate recipe, zero fine-tuning
Abliterix finds the optimal abliteration parameters for any transformer model using Optuna TPE optimization. It co-minimizes refusals and KL divergence from the original model — producing decensored models that retain as much intelligence as possible. Works with dense, MoE, SSM/hybrid, and vision-language architectures, with 150+ pre-built configs.
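As a rough illustration of the second objective, the sketch below computes a KL divergence between two next-token distributions in plain Python. The logits are made up, and the direction KL(base || steered) is an assumption; the source only says the divergence is measured from the original model.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


# Hypothetical next-token logits over a 4-token vocabulary.
base_logits = [2.0, 1.0, 0.1, -1.0]      # original model
steered_logits = [1.9, 1.1, 0.1, -1.0]   # model after a small weight edit

p_base = softmax(base_logits)
p_steered = softmax(steered_logits)

kl = kl_divergence(p_base, p_steered)
print(f"KL(base || steered) = {kl:.6f}")
```

A value near zero means the edited model's predictions on these inputs are almost unchanged, which is what the low KL figures in this README are meant to convey.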
It also ships HonestAbliterationBench, a reproducible public benchmark that resists the two failure modes (short generations + keyword-only judges) that make most abliteration leaderboards meaningless.
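The two failure modes can be seen in a toy example. The sketch below is not the benchmark's judge: the marker list, the truncation helper, and the repetition heuristic are all hypothetical, chosen only to show how a keyword-only check on a short generation can miscount.

```python
# Hypothetical refusal markers for a keyword-only judge.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")


def keyword_judge(response: str) -> bool:
    """Return True if keyword matching counts the response as a refusal."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def truncate(response: str, max_words: int) -> str:
    """Crude stand-in for a short generation budget (words, not tokens)."""
    return " ".join(response.split()[:max_words])


def is_degenerate(response: str, min_unique_ratio: float = 0.3) -> bool:
    """Flag highly repetitive output using the share of distinct words."""
    words = response.lower().split()
    if not words:
        return True
    return len(set(words)) / len(words) < min_unique_ratio


# A response that starts compliant and refuses only later.
delayed_refusal = (
    "Sure, here is an overview. "
    + "Background detail. " * 10
    + "However, I cannot help with the actual request."
)
# A response that never refuses but carries no content.
degenerate = "the the the the the the the the"

print(keyword_judge(truncate(delayed_refusal, 8)))  # short budget hides the refusal
print(keyword_judge(delayed_refusal))               # full text reveals it
print(keyword_judge(degenerate), is_degenerate(degenerate))
```

With a short budget the delayed refusal is counted as compliance, and the repetitive output passes the keyword check even though it is useless, which is why a longer minimum generation and a degenerate filter matter.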
Table of Contents
- Quick Start
- Broken Defenses
- Results
- Honest Abliteration Leaderboard
- Model Support
- Hardware & VRAM
- Datasets
- Documentation
- Citation
- Acknowledgments
- Contributing
- License
Quick Start
pip install -U abliterix
abliterix --model Qwen/Qwen3-4B-Instruct-2507
That's it. The process is fully automatic — after optimization completes, you can save the model, upload to Hugging Face, or chat with it interactively.
Windows: use python scripts/run_abliterix.py --model <model> or set PYTHONIOENCODING=utf-8 to avoid Rich encoding issues.
Broken Defenses
Abliterix has end-to-end broken three of the strongest published "anti-abliteration" releases with the same minimal recipe: SVD-diagnose the rank-16 LoRA delta, lerp it away with λ=0.0 (bit-exact base weights), then run single-direction direct-mode abliteration. No fine-tuning, no iterative subspace, no SOM, no manual prompt engineering. Full lessons-learned write-up: docs/broken_defenses.md.
| Defense | Released model | Best trial | ASR (LLM judge) | Hardcore 15 |
|---|---|---|---|---|
| DeepRefusal (EMNLP 2025) | Llama-3-8B-Instruct-DeepRefusal-Broken ⚔️ | 11/100 refusals, KL 0.053 | 89 % | 14 / 15 |
| Circuit Breakers / RR (NeurIPS 2024) | Mistral-7B-Instruct-RR-Abliterated ⚔️ | 12/100 refusals, KL 0.042 | 88 % | 15 / 15 |
| Circuit Breakers / RR (NeurIPS 2024) | Llama-3-8B-Instruct-RR-Abliterated ⚔️ | 1/100 refusals, KL 0.017 | 99 % | 15 / 15 |
Full write-ups, attack recipes, and reproduction commands: docs/broken_defenses.md.
Results
Abliterated models uploaded to Hugging Face:
| Model | Refusals | KL Divergence | Trials | Method |
|---|---|---|---|---|
| Llama-3-8B-Instruct-DeepRefusal-Broken ⚔️ | 11/100 (11%) | 0.053 | 60 | LoRA-Δ attenuation + Direct |
| Mistral-7B-Instruct-RR-Abliterated ⚔️ | 12/100 (12%) | 0.042 | 60 | Full LoRA-Δ strip + Direct |
| Llama-3-8B-Instruct-RR-Abliterated ⚔️ | 1/100 (1%) | 0.017 | 60 | Full LoRA-Δ strip + Direct |
| Qwen3.6-35B-A3B | 7/100 (7%) | 0.0189 | 24 | LoRA + EGA + MoE |
| Qwen3.6-27B-abliterated (GGUF) | 10/100 (10%) | 0.0242 (cumulative) | 30 + 30 | LoRA + manual iterative peel |
| Qwen3.6-27B-abliterated | 10/100 (10%) | 0.0061 | 30 | LoRA + unified GDN/full-attn bucket |
| gpt-oss-20b | 6/100 (6%) | 0.0098 | 100 | Direct + EGA + Router |
| gpt-oss-120b | 26/100 (26%) | 5.4e-06 | 100 | Direct + EGA + Router + vLLM-TP |
| Gemma-4-E4B | 7/100 (7%) | 0.0006 | 100 | Direct + Q/K/V/O |
| Gemma-4-E2B | 9/100 (9%) | 0.0004 | 100 | Direct + Q/K/V/O |
| Gemma-4-31B | 3/100 (3%) | 0.0012 | 120 | SRA + Direct |
| LFM2-24B-A2B | 0/100 (0%) | 0.0079 | 50 | LoRA |
| GLM-4.7-Flash | 1/100 (1%) | 0.0133 | 50 | LoRA |
| Devstral-Small-2-24B | 3/100 (3%) | 0.0086 | 50 | LoRA |
| Qwen3.5-122B-A10B | 1/200 (0.5%) | 0.0115 | 25 | LoRA + MoE |
| Qwen3.5-35B-A3B | 3/200 (1.5%) | 0.0035 | 50 | LoRA + MoE |
| Qwen3.5-27B | 3/200 (1.5%) | 0.0051 | 35 | LoRA |
| Qwen3.5-9B | 2/200 (1%) | 0.0105 | 50 | LoRA |
| Qwen3.5-4B | 3/200 (1.5%) | 0.0065 | 50 | LoRA |
| Qwen3.5-0.8B | 0/200 (0%) | 0.0087 | 100 | LoRA |
We rate these numbers at roughly 20× the value of an average abliteration leaderboard entry, because most published refusal rates collapse under longer generations and a real judge. See docs/evaluation.md for the methodology, and the leaderboard below for community submissions vetted under the same contract.
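Each row in the table reports one refusal count and one KL value chosen from many trials. One common way to choose among trials with two objectives is to keep only the non-dominated ones (the Pareto front) and then apply a KL budget. The sketch below shows that idea in plain Python; the trial values, the `Trial` type, and the 0.02 budget are hypothetical, and this is not claimed to be how Abliterix or Optuna selects its best trial.

```python
from typing import NamedTuple


class Trial(NamedTuple):
    name: str
    refusals: int   # refusals out of a fixed prompt set (lower is better)
    kl: float       # KL divergence from the base model (lower is better)


def dominates(a: Trial, b: Trial) -> bool:
    """True if a is at least as good as b on both objectives and better on one."""
    no_worse = a.refusals <= b.refusals and a.kl <= b.kl
    better = a.refusals < b.refusals or a.kl < b.kl
    return no_worse and better


def pareto_front(trials):
    """Keep the trials that no other trial dominates."""
    return [t for t in trials if not any(dominates(o, t) for o in trials)]


# Hypothetical trial results, for illustration only.
trials = [
    Trial("t0", 40, 0.0005),
    Trial("t1", 12, 0.0040),
    Trial("t2", 7, 0.0100),
    Trial("t3", 12, 0.0090),  # dominated by t1
    Trial("t4", 3, 0.0500),
    Trial("t5", 9, 0.0300),   # dominated by t2
]

front = pareto_front(trials)
kl_budget = 0.02  # hypothetical tolerance for drift from the base model
best = min((t for t in front if t.kl <= kl_budget), key=lambda t: t.refusals)

print([t.name for t in front])
print(best)
```

The front makes the trade-off explicit: pushing refusals lower than the chosen trial is possible here, but only by accepting a larger KL.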
Honest Abliteration Leaderboard
A reproducible public benchmark for abliterated models built on the same pipeline. Every row is generated under a frozen contract (min_new_tokens=100, max_new_tokens=150, greedy, LLM judge with degenerate filter, KL measured against the declared base) — see benchmarks/SPEC.md for the full spec and benchmarks/CONTRIBUTING.md for how to submit a row.
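To make the idea of a frozen contract concrete, here is a minimal sketch that encodes the generation settings named above as an immutable dataclass and checks a submission against them. The class, field names, and row format are hypothetical (in particular, mapping "greedy" to `do_sample=False` borrows Hugging Face naming as an assumption); benchmarks/SPEC.md remains the authoritative definition.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GenerationContract:
    """Hypothetical encoding of the frozen generation settings."""
    min_new_tokens: int = 100
    max_new_tokens: int = 150
    do_sample: bool = False      # assumption: "greedy" means no sampling
    spec_version: str = "1.0"


CONTRACT = GenerationContract()


def violations(row: dict) -> list:
    """List every contract field where a submitted row deviates."""
    problems = []
    for f in fields(CONTRACT):
        expected = getattr(CONTRACT, f.name)
        actual = row.get(f.name)
        if actual != expected:
            problems.append(f"{f.name}: expected {expected!r}, got {actual!r}")
    return problems


# A conforming row and one that used short generations.
good_row = {"min_new_tokens": 100, "max_new_tokens": 150,
            "do_sample": False, "spec_version": "1.0"}
short_row = {"min_new_tokens": 0, "max_new_tokens": 32,
             "do_sample": False, "spec_version": "1.0"}

print(violations(good_row))
print(violations(short_row))
```

Because the dataclass is frozen, the settings cannot be changed at runtime, which mirrors the intent of comparing every row under identical conditions.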
No results yet. See benchmarks/CONTRIBUTING.md for how to submit one.
Model Support
Abliterix ships with 150+ pre-built configs covering 4 architecture types across 20+ model families:
| Architecture | Families | Example Models |
|---|---|---|
| Dense | Llama, Gemma, Phi, Qwen, Mistral, Yi, InternLM, Falcon, Cohere, EXAONE, Granite, OLMo, SmolLM, SOLAR, Zephyr | Llama-3.1-405B, Gemma-3-27B, Phi-4, DeepSeek-R1-Distill |
| MoE | Qwen3/3.5/3.6 MoE, Mixtral, DeepSeek, Phi-3.5-MoE, Granite MoE, DBRX, Llama-4 Scout/Maverick, gpt-oss (MXFP4) | gpt-oss-120b, Qwen3.6-35B-A3B, Qwen3.5-122B, Mixtral-8x22B, Llama-4-Maverick-401B |
| SSM/Hybrid | Jamba (Mamba+attention), Nemotron-Cascade (Mamba-2+attention) | Jamba-1.5-Large-94B, Nemotron-Cascade-30B |
| Vision-Language | Qwen2-VL, InternVL2, LLaVA-NeXT, Pixtral, Mistral3-VL | Qwen2-VL-7B, LLaVA-NeXT-34B, Pixtral-12B |
Generate configs for new models:
python scripts/generate_configs.py # Generate all missing configs
python scripts/generate_configs.py --family llama # Only Llama family
For MoE-specific steering mechanisms (EGA, expert profiling, router suppression), see docs/moe.md.
Hardware & VRAM
Abliterix auto-detects available accelerators (CUDA, XPU, MLU, MUSA, SDAA, NPU, MPS) and distributes layers across devices with device_map = "auto".
For large models:
- 4-bit quantization: --model.quant-method bnb_4bit cuts VRAM by ~4x
- 8-bit quantization: --model.quant-method bnb_8bit (higher quality than 4-bit, ~2x VRAM reduction with CPU offload)
- Per-device memory limits: set [model] max_memory = {"0": "20GB", "cpu": "64GB"} in your config
- Non-interactive mode: --non-interactive for fully automated batch runs
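The ~4x and ~2x figures follow from bits per weight, assuming a 16-bit baseline (the baseline is an assumption; the text above does not state it). The back-of-the-envelope sketch below covers weights only and ignores quantization metadata, activations, and the KV cache, so real usage will be higher. The 27B parameter count is just an example.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


params = 27e9  # hypothetical 27B-parameter model

baseline = weight_memory_gb(params, 16)
for bits in (16, 8, 4):
    size = weight_memory_gb(params, bits)
    print(f"{bits:>2}-bit: {size:6.1f} GB  ({baseline / size:.0f}x smaller than 16-bit)")
```

Under these assumptions a 27B model drops from about 54 GB of weights at 16-bit to about 13.5 GB at 4-bit, which is the kind of saving that lets a large model fit within a per-device memory limit.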
Datasets
Bilingual harm/benign evaluation datasets live in datasets/ and on Hugging Face at wangzhang/abliterix-datasets. The 500-example sets (harmful_500, good_500) are the recommended starting point — they're also the SHA256-pinned inputs to HonestAbliterationBench.
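Pinning an input by SHA256 means recording its digest once and refusing to proceed if the file later differs. The sketch below shows a generic check with the standard library; the helper names and the temporary stand-in file are hypothetical, and the real pinned digests live with the benchmark spec rather than here.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pin(path: Path, pinned_hex: str) -> bool:
    """Compare a file's digest with a previously recorded (pinned) digest."""
    return sha256_of(path) == pinned_hex.strip().lower()


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for a dataset file.
    sample = Path(tmp) / "dataset_sample.jsonl"
    sample.write_bytes(b'{"prompt": "example"}\n')

    pin = sha256_of(sample)            # record the digest once
    ok = matches_pin(sample, pin)      # unchanged file passes

    sample.write_bytes(b'{"prompt": "edited"}\n')
    tampered_ok = matches_pin(sample, pin)  # any edit breaks the match

print(ok, tampered_ok)
```

Checking the digest before an evaluation run ensures that two submissions were scored on byte-identical prompts.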
See docs/datasets.md for the design rationale, category breakdown, and a comparison with public alternatives.
Documentation
The deep details live in docs/ and benchmarks/:
- docs/architecture.md: the 9 papers Abliterix integrates and the 5-step pipeline.
- docs/methods.md: every steering method (SRA, Spherical, SVF, Projected, Discriminative, COSMIC, Angular, OT, Multi-direction) with the TOML knobs that control it.
- docs/evaluation.md: why most abliteration benchmarks lie, our standards, and the architecture A/B test.
- docs/moe.md: the four independent MoE steering mechanisms and supported MoE models.
- docs/configuration.md: config loading order, the 150+ shipped configs, the Web UI, and research-mode visualization.
- docs/datasets.md: bilingual dataset design rationale and metadata schema.
- docs/references.md: paper references and BibTeX.
- benchmarks/SPEC.md: the frozen HonestAbliterationBench contract (spec_version 1.0).
- benchmarks/CONTRIBUTING.md: how to submit a leaderboard row (self-reported / verified tiers).
Citation
@software{abliterix,
author = {Wu, Wangzhang},
title = {Abliterix: Automated LLM Abliteration},
year = {2026},
url = {https://github.com/wuwangzhang1216/abliterix}
}
Acknowledgments
Abliterix is a derivative work of Heretic by Philipp Emanuel Weidmann (@p-e-w), licensed under AGPL-3.0-or-later. The original Heretic codebase provided the foundation for this project; Abliterix extends it with Optuna-based multi-objective optimization, LoRA-based steering, MoE architecture support, orthogonal projection, LLM judge detection, and additional model integrations.
All modifications are Copyright (C) 2026 Wangzhang Wu and are released under the same AGPL-3.0-or-later license. See NOTICE for details.
@misc{heretic,
author = {Weidmann, Philipp Emanuel},
title = {Heretic: Fully automatic censorship removal for language models},
year = {2025},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/p-e-w/heretic}}
}
Contributing
Contributions of all kinds are welcome — new model configs, benchmark results, bug reports, documentation, new steering methods. See CONTRIBUTING.md for development setup, the PR process, and guidance on adding model configs.
The single most impactful contribution is a tested TOML config for a model we don't yet support. Every new config unlocks a new architecture for everyone.
All contributions are released under the AGPL-3.0 license.
License
Abliterix is a derivative work of Heretic by Philipp Emanuel Weidmann, licensed under the GNU Affero General Public License v3.0 or later.
Original work Copyright (C) 2025 Philipp Emanuel Weidmann
Modified work Copyright (C) 2026 Wangzhang Wu