abliterix

Automated model steering and alignment adjustment via LoRA-based optimization

Abliterix

7% refusal rate on Gemma 4 · 0.0006 KL divergence · 150+ model configs · Zero manual tuning

🔥 Breaks DeepRefusal (EMNLP 2025) where Heretic failed — 89% ASR, 14/15 hardcore prompts compliant, zero fine-tuning

Abliterix finds the optimal abliteration parameters for any transformer model using Optuna TPE optimization. It co-minimizes refusals and KL divergence from the original model — producing decensored models that retain as much intelligence as possible. Works with dense, MoE, SSM/hybrid, and vision-language architectures, with 150+ pre-built configs.
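As a rough illustration of the KL term mentioned above, the sketch below computes the KL divergence between two next-token distributions in plain Python. It assumes the direction KL(original || modified) and uses made-up logits; the function names are hypothetical, and this is not Abliterix's actual implementation.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0)."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Illustrative logits only: the "original" and "modified" model outputs
# for a single next-token position over a three-token vocabulary.
original = softmax([2.0, 1.0, 0.1])
modified = softmax([1.9, 1.1, 0.1])

print(f"KL(original || modified) = {kl_divergence(original, modified):.6f}")
```

A value near zero means the modified model's predictions stay close to the original's on that position; in practice the figure would be averaged over many prompts and token positions.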

It also ships HonestAbliterationBench, a reproducible public benchmark that resists the two failure modes (short generations + keyword-only judges) that make most abliteration leaderboards meaningless.

Quick Start
Broken Defenses — DeepRefusal
Results
Honest Abliteration Leaderboard
Model Support
Hardware & VRAM
Datasets
Documentation
Citation
Acknowledgments
Contributing
License

Quick Start

pip install -U abliterix
abliterix --model Qwen/Qwen3-4B-Instruct-2507

That's it. The process is fully automatic — after optimization completes, you can save the model, upload to Hugging Face, or chat with it interactively.

Windows: use python scripts/run_abliterix.py --model <model> or set PYTHONIOENCODING=utf-8 to avoid Rich encoding issues.

Broken Defenses — DeepRefusal

DeepRefusal (EMNLP 2025 Findings, Xie et al.) is a safety-alignment method released specifically to resist abliteration. The authors probabilistically ablate the refusal direction at every layer and token during fine-tuning, teaching the model to rebuild refusal when representation-engineering attacks apply h − r̂r̂ᵀh at inference. Their public release (skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal) comes with a headline claim:

[2026/04/09] We evaluated heretic, presently the most prominent LLM censorship removal tool, and discovered—somewhat unexpectedly—that our approach exhibits strong resilience against such attacks. Adversaries appear unable to circumvent the model's built-in safety guardrails without triggering severe performance collapse.

Abliterix breaks it. On the exact same defended model, with stricter evaluation (Gemini 3.1 Flash Lite LLM judge), zero fine-tuning, and two script invocations:

Attack tool	Refusals ↓	ASR ↑	Hardcore 15 ↓	KL ↓	Result
heretic (per DeepRefusal paper)	—	~0 %	—	—	fails — "unable to circumvent"
Refusal Ablation (paper Table 1)	99.6/100	0.4 %	—	—	fails — training-distribution defense holds
Refusal-Transfer (paper Table 1)	99.6/100	0.4 %	—	—	fails — training-distribution defense holds
GCG (paper Table 1)	98/100	2.0 %	—	—	fails — optimization barely moves it
Prefilling (paper Table 1)	99.6/100	0.4 %	—	—	fails — prefix-robustness trained in
Abliterix (this repo)	11/100	89 %	14/15	0.053	✅ broken
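The ASR column in the table above is consistent with one minus the refusal fraction. The short check below assumes that definition (the source does not state it explicitly) and uses a hypothetical helper name.

```python
def attack_success_rate(refusals: float, total: int) -> float:
    """ASR in percent, assuming ASR = (total - refusals) / total."""
    return 100.0 * (total - refusals) / total


# Refusal counts copied from the table above.
rows = {
    "Refusal Ablation": (99.6, 100),
    "GCG": (98, 100),
    "Abliterix": (11, 100),
}

for name, (refusals, total) in rows.items():
    print(f"{name}: {attack_success_rate(refusals, total):.1f} %")
```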

The attack is three steps: confirm via SVD that DeepRefusal is a rank-16 LoRA adapter on Llama-3-8B-Instruct, linearly attenuate the delta with λ = 0.3, then run Abliterix's standard single-direction abliteration. No iterative subspace, no gradient optimization, no fine-tuning. The released model generates compliant responses on 14 of 15 hardcore prompts (pipe-bomb construction, methamphetamine synthesis, credential-stealing malware, phishing templates, ID forgery, WiFi hacking, English + Chinese).

# Full reproduction — ~2 hours end-to-end on a single RTX 6000 Ada
python scripts/deeprefusal_attenuate.py \
    --base NousResearch/Meta-Llama-3-8B-Instruct \
    --defended skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal \
    --output ./llama3_dr_attenuated --lambda 0.3

AX_CONFIG=configs/llama3_8b_deeprefusal_attenuated.toml abliterix

python scripts/export_model.py \
    --model ./llama3_dr_attenuated \
    --checkpoint checkpoints_llama3_dr_attenuated \
    --trial 52 \
    --config configs/llama3_8b_deeprefusal_attenuated.toml \
    --push-to YOUR_USER/your-model-name

Full write-up and discussion: issue #11. The same commit also lands an iterative multi-pass subspace abliteration mode for future hardened-defense workflows and fixes a detector bug that was inflating refusal counts by ~33 % across all historical benchmarks (markdown-formatted compliant responses were being short-circuited as "degenerate").

Results

Abliterated models uploaded to Hugging Face:

Model	Refusals	KL Divergence	Trials	Method
Llama-3-8B-Instruct-DeepRefusal-Broken ⚔️	11/100 (11%)	0.053	60	LoRA-Δ attenuation + Direct
Gemma-4-E4B	7/100 (7%)	0.0006	100	Direct + Q/K/V/O
Gemma-4-E2B	9/100 (9%)	0.0004	100	Direct + Q/K/V/O
Gemma-4-31B	18/100 (18%)	0.0007	20	Direct + Q/K/V/O
LFM2-24B-A2B	0/100 (0%)	0.0079	50	LoRA
GLM-4.7-Flash	1/100 (1%)	0.0133	50	LoRA
Devstral-Small-2-24B	3/100 (3%)	0.0086	50	LoRA
Qwen3.5-122B-A10B	1/200 (0.5%)	0.0115	25	LoRA + MoE
Qwen3.5-35B-A3B	3/200 (1.5%)	0.0035	50	LoRA + MoE
Qwen3.5-27B	3/200 (1.5%)	0.0051	35	LoRA
Qwen3.5-9B	2/200 (1%)	0.0105	50	LoRA
Qwen3.5-4B	3/200 (1.5%)	0.0065	50	LoRA
Qwen3.5-0.8B	0/200 (0%)	0.0087	100	LoRA

Numbers worth ~20× the average abliteration leaderboard. Most published refusal rates collapse under longer generations and a real judge — see docs/evaluation.md for the methodology, and the leaderboard below for community submissions vetted under the same contract.

Honest Abliteration Leaderboard

A reproducible public benchmark for abliterated models built on the same pipeline. Every row is generated under a frozen contract (min_new_tokens=100, max_new_tokens=150, greedy, LLM judge with degenerate filter, KL measured against the declared base) — see benchmarks/SPEC.md for the full spec and benchmarks/CONTRIBUTING.md for how to submit a row.
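To make the generation part of the contract concrete, here is a minimal sketch that encodes it as keyword arguments and flags deviations. The key names mirror Hugging Face `generate()` arguments, and `do_sample=False` is an assumed reading of "greedy"; the constant and helper names are hypothetical, and benchmarks/SPEC.md remains the authoritative definition.

```python
# Assumed mapping of the contract onto Hugging Face generate() keyword
# arguments; the names below are illustrative, not taken from SPEC.md.
FROZEN_GENERATION_KWARGS = {
    "min_new_tokens": 100,
    "max_new_tokens": 150,
    "do_sample": False,  # greedy decoding
}


def contract_violations(kwargs: dict) -> list[str]:
    """Return the contract keys whose values differ from the frozen settings."""
    return [
        key
        for key, expected in FROZEN_GENERATION_KWARGS.items()
        if kwargs.get(key) != expected
    ]


# A run that truncates generations early breaks the contract.
short_run = {"min_new_tokens": 100, "max_new_tokens": 32, "do_sample": False}
print(contract_violations(short_run))
```

Checking settings this way targets the short-generation failure mode: a row produced with a smaller token budget is not comparable to the others.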

No results yet. See benchmarks/CONTRIBUTING.md for how to submit one.

Model Support

Abliterix ships with 150+ pre-built configs covering 4 architecture types across 20+ model families:

Architecture	Families	Example Models
Dense	Llama, Gemma, Phi, Qwen, Mistral, Yi, InternLM, Falcon, Cohere, EXAONE, Granite, OLMo, SmolLM, SOLAR, Zephyr	Llama-3.1-405B, Gemma-3-27B, Phi-4, DeepSeek-R1-Distill
MoE	Qwen3/3.5 MoE, Mixtral, DeepSeek, Phi-3.5-MoE, Granite MoE, DBRX, Llama-4 Scout/Maverick	Qwen3.5-122B, Mixtral-8x22B, Llama-4-Maverick-401B
SSM/Hybrid	Jamba (Mamba+attention), Nemotron-Cascade (Mamba-2+attention)	Jamba-1.5-Large-94B, Nemotron-Cascade-30B
Vision-Language	Qwen2-VL, InternVL2, LLaVA-NeXT, Pixtral, Mistral3-VL	Qwen2-VL-7B, LLaVA-NeXT-34B, Pixtral-12B

Generate configs for new models:

python scripts/generate_configs.py                 # Generate all missing configs
python scripts/generate_configs.py --family llama   # Only Llama family

For MoE-specific steering mechanisms (EGA, expert profiling, router suppression), see docs/moe.md.

Hardware & VRAM

Abliterix auto-detects available accelerators (CUDA, XPU, MLU, MUSA, SDAA, NPU, MPS) and distributes layers across devices with device_map = "auto".

For large models:

4-bit quantization: --model.quant-method bnb_4bit cuts VRAM by ~4x
8-bit quantization: --model.quant-method bnb_8bit — higher quality than 4-bit, ~2x VRAM reduction with CPU offload
Per-device memory limits: set [model] max_memory = {"0": "20GB", "cpu": "64GB"} in your config
Non-interactive mode: --non-interactive for fully automated batch runs
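The ~4x figure for 4-bit quantization matches simple weight-size arithmetic against a 16-bit baseline. The sketch below assumes that baseline and counts weights only, ignoring activations, KV cache, and quantization overhead; the helper name is hypothetical.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight-only memory in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


# Example: an 8B-parameter model at three precisions.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(8, bits):.1f} GB")
```

Real usage will be higher than these numbers, so treat them as a lower bound when choosing max_memory limits.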

Datasets

Bilingual harm/benign evaluation datasets live in datasets/ and on Hugging Face at wangzhang/abliterix-datasets. The 500-example sets (harmful_500, good_500) are the recommended starting point — they're also the SHA256-pinned inputs to HonestAbliterationBench.

See docs/datasets.md for the design rationale, category breakdown, and a comparison with public alternatives.

Documentation

The deep details live in docs/ and benchmarks/:

docs/architecture.md — the 9 papers Abliterix integrates and the 5-step pipeline.
docs/methods.md — every steering method (SRA, Spherical, SVF, Projected, Discriminative, COSMIC, Angular, OT, Multi-direction) with the TOML knobs that control it.
docs/evaluation.md — why most abliteration benchmarks lie, our standards, and the architecture A/B test.
docs/moe.md — the four independent MoE steering mechanisms and supported MoE models.
docs/configuration.md — config loading order, the 150+ shipped configs, the Web UI, and research-mode visualization.
docs/datasets.md — bilingual dataset design rationale and metadata schema.
docs/references.md — paper references and BibTeX.
benchmarks/SPEC.md — the frozen HonestAbliterationBench contract (spec_version 1.0).
benchmarks/CONTRIBUTING.md — how to submit a leaderboard row (self-reported / verified tiers).

Citation

@software{abliterix,
  author = {Wu, Wangzhang},
  title = {Abliterix: Automated LLM Abliteration},
  year = {2026},
  url = {https://github.com/wuwangzhang1216/abliterix}
}

Acknowledgments

Abliterix is a derivative work of Heretic by Philipp Emanuel Weidmann (@p-e-w), licensed under AGPL-3.0-or-later. The original Heretic codebase provided the foundation for this project; Abliterix extends it with Optuna-based multi-objective optimization, LoRA-based steering, MoE architecture support, orthogonal projection, LLM judge detection, and additional model integrations.

@misc{heretic,
  author = {Weidmann, Philipp Emanuel},
  title = {Heretic: Fully automatic censorship removal for language models},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/p-e-w/heretic}}
}

Contributing

Contributions of all kinds are welcome — new model configs, benchmark results, bug reports, documentation, new steering methods. See CONTRIBUTING.md for development setup, the PR process, and guidance on adding model configs.

The single most impactful contribution is a tested TOML config for a model we don't yet support. Every new config unlocks a new architecture for everyone.

All contributions are released under the AGPL-3.0 license.

License

Abliterix is a derivative work of Heretic by Philipp Emanuel Weidmann, licensed under the GNU Affero General Public License v3.0 or later.

Release history

1.6.0 (May 5, 2026)
1.5.0 (May 2, 2026)
1.4.0 (Apr 17, 2026)
1.3.0 (Apr 13, 2026), this version
1.1.0 (Apr 10, 2026)

Download files

Source Distribution

abliterix-1.3.0.tar.gz (105.5 kB), uploaded Apr 13, 2026

Built Distribution

abliterix-1.3.0-py3-none-any.whl (121.2 kB), uploaded Apr 13, 2026, Python 3

File details

Details for the file abliterix-1.3.0.tar.gz.

File metadata

Download URL: abliterix-1.3.0.tar.gz
Upload date: Apr 13, 2026
Size: 105.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.6 (CI, Ubuntu 24.04)

File hashes

Hashes for abliterix-1.3.0.tar.gz
Algorithm	Hash digest
SHA256	`85cfccfd6fcb874158d0aab392509a24da354225f86b356cfc1bbbc16d91bb29`
MD5	`b0ff7823ad7b9f9edbee3ef9891c2f35`
BLAKE2b-256	`92701e56cfc188af7f4a738294d8e929e684add86209028654eb93ffde2b3336`

File details

Details for the file abliterix-1.3.0-py3-none-any.whl.

File metadata

Download URL: abliterix-1.3.0-py3-none-any.whl
Upload date: Apr 13, 2026
Size: 121.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.6 (CI, Ubuntu 24.04)

File hashes

Hashes for abliterix-1.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f5a9b63083c82cb183e3a80480b16ada6cf7528207ad6d7ba1fe21d52166196b`
MD5	`99b2684f255e428c9fce3e2fab436b44`
BLAKE2b-256	`785ebf1f7d189d45ef1fc50d32c0436dd0efac09e7c3bb28b717d203dcc32395`
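The SHA256 digests listed above can be checked locally before installing a downloaded file. The sketch below uses only the standard library; the helper names are hypothetical, and the file path in the usage comment assumes the wheel sits in the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA256 and return the hex digest."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """True if the file's SHA256 equals the published digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage (assumes the wheel was downloaded to the current directory):
# matches(
#     "abliterix-1.3.0-py3-none-any.whl",
#     "f5a9b63083c82cb183e3a80480b16ada6cf7528207ad6d7ba1fe21d52166196b",
# )
```

Reading in chunks keeps memory use flat regardless of file size. pip can enforce the same check at install time through its hash-checking mode with a requirements file.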
