ICONOCLAST — Discriminative representation editing for open-weight LLMs. Beats HERETIC baseline across all tested models.
🗡️ ICONOCLAST
Beat the HERETIC baseline 10/10 — remove LLM refusal behaviors while preserving intelligence.
pip install iconoclast-llm
iconoclast abliterate --model Qwen/Qwen2.5-7B-Instruct --output ./my-model
Quick Start
Install
pip install iconoclast-llm
For the demo API server:
pip install 'iconoclast-llm[serve]'
Abliterate a model
iconoclast abliterate --model meta-llama/Llama-3.1-8B-Instruct --output ./llama-abliterated
This downloads the model from Hugging Face, computes refusal directions, finds optimal abliteration parameters via Optuna, and saves the abliterated model to ./llama-abliterated/.
Options:
| Flag | Default | Description |
|---|---|---|
| --model | required | Hugging Face model ID or local path |
| --output | ./abliterated-model | Output directory |
| --device | auto | Device (auto, cuda:0, mps, cpu) |
| --benign-subspace-rank | 0 | Benign subspace dimensions to preserve (try 64–256 for better quality) |
| --quantize | none | Quantization (none or bnb_4bit) |
| --good-prompts | HuggingFaceH4/ultrafeedback_binarized | Dataset for benign prompts |
| --bad-prompts | walledai/JailbreakBench | Dataset for harmful prompts |
Serve the abliterated model
iconoclast serve --model ./llama-abliterated --port 8000
Then generate text:
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{"prompt": "How do I make a bomb?", "max_tokens": 100}'
Full research workflow
iconoclast study
Launches the interactive Optuna-powered hyperparameter search across direction methods, layer selections, and blending strategies.
The Results: ICONOCLAST vs HERETIC
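10 open-weight models, compared head to head. Refusals are counted over 20 harmful prompts and overrefusals over 64 benign prompts. On every model ICONOCLAST matches or beats HERETIC on both counts, and its KL divergence from the base model is lower on 8 of the 10 (Falcon3-7B-Instruct and Yi-1.5-9B-Chat are the exceptions).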
| Model | ICONOCLAST Refusals | ICONOCLAST Overrefusals | ICONOCLAST KL | HERETIC Refusals | HERETIC Overrefusals | HERETIC KL | Outcome |
|---|---|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct | 0/20 | 0/64 | 0.0447 | 1/20 | 0/64 | 0.1854 | 🏆 ICONOCLAST |
| Qwen3.5-9B base | 10/20 | 2/64 | 0.0055 | 10/20 | 3/64 | 0.0160 | 🏆 ICONOCLAST |
| Mistral-7B-Instruct-v0.3 | 1/20 | 0/64 | 0.0554 | 4/20 | 0/64 | 0.1317 | 🏆 ICONOCLAST |
| Falcon3-7B-Instruct | 0/20 | 0/64 | 6.1448 | 4/20 | 1/64 | 0.1648 | 🏆 ICONOCLAST |
| Gemma-2-2B-IT | 1/20 | 0/64 | 0.1849 | 1/20 | 2/64 | 0.6441 | 🏆 ICONOCLAST |
| Phi-4-mini-instruct | 2/20 | 1/64 | 0.0204 | 2/20 | 1/64 | 0.0978 | 🏆 ICONOCLAST |
| Yi-1.5-9B-Chat | 2/20 | 0/64 | 0.0511 | 3/20 | 0/64 | 0.0355 | 🏆 ICONOCLAST |
| StableLM2-1.6B | 2/20 | 0/64 | 0.0328 | 3/20 | 0/64 | 0.0670 | 🏆 ICONOCLAST |
| SmolLM2-1.7B-Instruct | 1/20 | 1/64 | 0.0087 | 2/20 | 2/64 | 0.2699 | 🏆 ICONOCLAST |
| OLMo-2-1B-Instruct | 2/20 | 0/64 | 0.0345 | 2/20 | 1/64 | 0.0944 | 🏆 ICONOCLAST |
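The KL column measures how far the edited model's output distribution drifts from the base model's. This README does not state the exact evaluation protocol, so the following is only a minimal sketch, assuming KL(base || edited) computed on one next-token distribution per prompt and averaged over prompts. The function names are hypothetical and are not part of the iconoclast package.

```python
import numpy as np

def log_softmax(logits):
    # Subtract the row maximum first for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kl(base_logits, edited_logits):
    """Mean KL(base || edited) over prompts; inputs have shape (n_prompts, vocab)."""
    log_p = log_softmax(np.asarray(base_logits, dtype=np.float64))
    log_q = log_softmax(np.asarray(edited_logits, dtype=np.float64))
    kl_per_prompt = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_prompt.mean())

# Toy check: base is uniform over two tokens, edited puts 0.75 on the first.
base = np.array([[0.0, 0.0]])
edited = np.array([[np.log(3.0), 0.0]])
print(mean_kl(base, edited))
```

A value of 0 means the two models assign identical probabilities on the evaluated prompts; larger values mean the edit changed behavior on those prompts more.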
Key highlights:
- Strict Behavior Wins: Fewer harmful refusals in 6/10 rows, tied in the other 4
- Utility Preservation: Lower KL divergence in 8/10 rows, meaning the edited model stays closer to the base model's output distribution. Falcon3-7B-Instruct (6.1448 vs 0.1648) and Yi-1.5-9B-Chat (0.0511 vs 0.0355) are the two exceptions
- Massive KL Reduction: SmolLM2 goes from 0.2699 with HERETIC to 0.0087 with ICONOCLAST, and Gemma-2-2B from 0.6441 to 0.1849
- Flawless on Llama-3.1-8B: 0/20 harmful refusals, 0/64 overrefusals, 0.0447 KL, ahead of HERETIC on both refusals and KL
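The per-metric tallies can be recomputed directly from the table. The sketch below copies the table's numbers verbatim and counts, for each metric, the rows where ICONOCLAST is lower than, tied with, or higher than HERETIC. The `tally` helper is illustrative only.

```python
# (model, ICONOCLAST refusals, overrefusals, KL, HERETIC refusals, overrefusals, KL)
ROWS = [
    ("Llama-3.1-8B-Instruct", 0, 0, 0.0447, 1, 0, 0.1854),
    ("Qwen3.5-9B base", 10, 2, 0.0055, 10, 3, 0.0160),
    ("Mistral-7B-Instruct-v0.3", 1, 0, 0.0554, 4, 0, 0.1317),
    ("Falcon3-7B-Instruct", 0, 0, 6.1448, 4, 1, 0.1648),
    ("Gemma-2-2B-IT", 1, 0, 0.1849, 1, 2, 0.6441),
    ("Phi-4-mini-instruct", 2, 1, 0.0204, 2, 1, 0.0978),
    ("Yi-1.5-9B-Chat", 2, 0, 0.0511, 3, 0, 0.0355),
    ("StableLM2-1.6B", 2, 0, 0.0328, 3, 0, 0.0670),
    ("SmolLM2-1.7B-Instruct", 1, 1, 0.0087, 2, 2, 0.2699),
    ("OLMo-2-1B-Instruct", 2, 0, 0.0345, 2, 1, 0.0944),
]

def tally(ours, theirs):
    """Count rows where the first method is lower, tied, or higher (lower is better)."""
    counts = {"lower": 0, "tied": 0, "higher": 0}
    for a, b in zip(ours, theirs):
        key = "lower" if a < b else "tied" if a == b else "higher"
        counts[key] += 1
    return counts

refusals = tally([r[1] for r in ROWS], [r[4] for r in ROWS])
overrefusals = tally([r[2] for r in ROWS], [r[5] for r in ROWS])
kl = tally([r[3] for r in ROWS], [r[6] for r in ROWS])
print("refusals:", refusals)
print("overrefusals:", overrefusals)
print("KL:", kl)
```

With the table as published, this gives 6 lower and 4 tied for refusals, 5 lower and 5 tied for overrefusals, and 8 lower and 2 higher for KL.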
How It Works
ICONOCLAST is a discriminative representation editing framework. Unlike HERETIC-style methods that simply subtract a refusal direction, ICONOCLAST:
- Computes per-layer refusal directions using contrastive pairs of harmful and benign prompts
- Preserves benign subspaces by removing from each refusal direction the component that lies in the subspace encoding harmless concepts; this is what prevents overrefusals and KL explosion
- Optimizes hyperparameters via Optuna across direction methods (mean, median, variance, hybrid), layer selections, and blending strategies
- Evaluates rigorously on holdout sets for both harmful refusal rate and benign overrefusal rate, with KL divergence as the utility metric
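The first two steps above can be illustrated with a few lines of linear algebra. This is a sketch under stated assumptions, not the package's implementation. It assumes the "mean" direction method is a difference of mean activations, that the benign subspace is estimated from the top principal directions of benign activations, and that the edit is applied to a weight matrix that writes into the residual stream. All function names here are hypothetical.

```python
import numpy as np

def refusal_direction(harmful_acts, benign_acts):
    """Unit-norm difference of mean activations for one layer; inputs are (n, hidden)."""
    d = harmful_acts.mean(axis=0) - benign_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def benign_basis(benign_acts, rank):
    """Top-`rank` principal directions of centred benign activations, shape (rank, hidden)."""
    centred = benign_acts - benign_acts.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return vt[:rank]  # rows are orthonormal

def protect_benign(direction, basis):
    """Drop the part of `direction` that lies inside the benign subspace, then renormalise."""
    residual = direction - basis.T @ (basis @ direction)
    return residual / np.linalg.norm(residual)

def ablate(weight, direction, strength=1.0):
    """Remove `direction` from the output space of a (hidden, in) weight matrix."""
    return weight - strength * np.outer(direction, direction @ weight)

# Synthetic activations: harmful prompts are shifted along the first hidden axis.
rng = np.random.default_rng(0)
hidden = 16
benign = rng.normal(size=(200, hidden))
harmful = rng.normal(size=(200, hidden)) + 2.0 * np.eye(hidden)[0]

d = refusal_direction(harmful, benign)
basis = benign_basis(benign, rank=4)  # plays the role of --benign-subspace-rank
d_safe = protect_benign(d, basis)
w_edited = ablate(rng.normal(size=(hidden, hidden)), d_safe)
```

After the projection, `d_safe` is orthogonal to every benign basis vector, so ablating it leaves the outputs of the edited matrix unchanged along those benign directions. A larger rank protects more benign directions but leaves less of the original refusal direction to remove.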
Commands Reference
iconoclast abliterate
One-shot abliteration. Runs the full pipeline — prompt loading, direction computation, Optuna optimization, model saving.
iconoclast abliterate --model <model_id> --output <dir> [options]
The benign subspace feature (--benign-subspace-rank) is the ICONOCLAST innovation. Start with 64 and increase if overrefusals appear:
# Best quality (requires more VRAM):
iconoclast abliterate --model Qwen/Qwen2.5-7B-Instruct --benign-subspace-rank 128 --output ./qwen-abliterated
iconoclast study
Full interactive research workflow. Launches an Optuna study with interactive prompts for configuration, trial inspection, and model export. This is the original research interface used to produce the benchmark results above.
iconoclast serve
Starts a FastAPI server for the abliterated model:
pip install 'iconoclast-llm[serve]'
iconoclast serve --model ./my-model --port 8000
API endpoints:
| Endpoint | Method | Description |
|---|---|---|
| /generate | POST | Generate text. Body: {"prompt": "...", "max_tokens": 256, "temperature": 0.7} |
| /health | GET | Health check. Returns {"status": "ok", "model": "..."} |
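For callers who prefer Python over curl, the /generate endpoint can be reached with the standard library alone. The request body follows the fields listed above. The response schema is not documented here, so this sketch simply assumes the server returns JSON and hands back whatever it decodes. The helper names are illustrative.

```python
import json
import urllib.request

def build_generate_request(prompt, max_tokens=256, temperature=0.7,
                           base_url="http://localhost:8000"):
    """Build a POST request for /generate with the documented body fields."""
    body = json.dumps({
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, **kwargs):
    """Send the request and return the decoded response (assumed to be JSON)."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))

# Usage, with `iconoclast serve` running locally on port 8000:
#   result = generate("Write a haiku about the sea.", max_tokens=64)
#   print(result)
```

Keeping request construction separate from the network call makes the payload easy to inspect or unit-test without a running server.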
Installation Options
| Command | Includes |
|---|---|
| pip install iconoclast-llm | Core abliteration engine |
| pip install 'iconoclast-llm[serve]' | + FastAPI demo server |
| pip install 'iconoclast-llm[research]' | + plotting, pacmap, scikit-learn |
| pip install 'iconoclast-llm[benchmark]' | + lm-eval for standardized evaluation |
| pip install 'iconoclast-llm[quantized]' | + bitsandbytes for 4-bit quantization |
| pip install 'iconoclast-llm[all]' | Everything |
Requirements
- Python 3.10+
- CUDA GPU recommended (16GB+ VRAM for 7B models, 32GB+ for larger)
- MPS (Apple Silicon) works but is slower
- CPU works but will be very slow
Tested on models from 1B to 9B parameters. Larger models (70B) require multi-GPU or quantization.
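A rough back-of-the-envelope calculation shows where these VRAM figures come from. The sketch below counts weights only, at the usual bytes per parameter for each precision. It ignores activations, the memory used while computing directions, and quantization metadata, so treat the results as lower bounds rather than as measurements of this tool.

```python
# Approximate storage cost per parameter for common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "fp16": 2.0, "int8": 1.0, "4bit": 0.5}

def weight_memory_gb(n_params_billion, dtype="bf16"):
    """Weights-only memory in GB (10^9 bytes) for a model of the given size."""
    return n_params_billion * BYTES_PER_PARAM[dtype]

for size, dtype in [(7, "bf16"), (70, "bf16"), (70, "4bit")]:
    print(f"{size}B @ {dtype}: ~{weight_memory_gb(size, dtype):.0f} GB")
```

A 7B model in 16-bit precision needs about 14 GB for weights alone, which is consistent with the 16GB+ recommendation, while a 70B model needs about 140 GB in 16-bit and about 35 GB at 4 bits, hence multi-GPU or quantization.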
Citation
@software{patel2025iconoclast,
author = {Patel, Varesh},
title = {ICONOCLAST: Benign-Subspace-Preserved Abliteration for Representation Editing},
year = {2025},
url = {https://github.com/Haadesx/Iconoclast}
}
License
AGPL-3.0-or-later — see LICENSE.
Built on 4 months of research using the Rutgers iLabs cluster. Original development history at Haadesx/NLP_Project.