ICONOCLAST — Discriminative representation editing for open-weight LLMs. Beats HERETIC baseline across all tested models.

🗡️ ICONOCLAST

Beat the HERETIC baseline 10/10 — remove LLM refusal behaviors while preserving intelligence.

pip install iconoclast-llm
iconoclast abliterate --model Qwen/Qwen2.5-7B-Instruct --output ./my-model

Quick Start

Install

pip install iconoclast-llm

For the demo API server:

pip install 'iconoclast-llm[serve]'

Abliterate a model

iconoclast abliterate --model meta-llama/Llama-3.1-8B-Instruct --output ./llama-abliterated

This downloads the model from Hugging Face, computes refusal directions, finds optimal abliteration parameters via Optuna, and saves the abliterated model to ./llama-abliterated/.

Options:

Flag	Default	Description
`--model`	required	Hugging Face model ID or local path
`--output`	`./abliterated-model`	Output directory
`--device`	`auto`	Device (`auto`, `cuda:0`, `mps`, `cpu`)
`--benign-subspace-rank`	`0`	Benign subspace dimensions to preserve (try 64–256 for better quality)
`--quantize`	`none`	Quantization (`none` or `bnb_4bit`)
`--good-prompts`	`HuggingFaceH4/ultrafeedback_binarized`	Dataset for benign prompts
`--bad-prompts`	`walledai/JailbreakBench`	Dataset for harmful prompts
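
For scripted runs, the flags in the table above can be assembled programmatically. The following is a minimal sketch, not part of the package: `build_abliterate_command` is a hypothetical helper, and it assumes every flag accepts its value as a separate space-delimited argument, as `--model` and `--output` do in the examples.

```python
import shlex


def build_abliterate_command(model, output="./abliterated-model", device="auto",
                             benign_subspace_rank=0, quantize="none"):
    """Assemble an `iconoclast abliterate` argv list from the documented flags.

    Hypothetical helper: defaults mirror the options table, nothing is executed.
    """
    if quantize not in ("none", "bnb_4bit"):
        raise ValueError(f"unsupported quantize value: {quantize!r}")
    if benign_subspace_rank < 0:
        raise ValueError("benign_subspace_rank must be non-negative")
    return [
        "iconoclast", "abliterate",
        "--model", model,
        "--output", output,
        "--device", device,
        "--benign-subspace-rank", str(benign_subspace_rank),
        "--quantize", quantize,
    ]


cmd = build_abliterate_command(
    "Qwen/Qwen2.5-7B-Instruct",
    output="./qwen-abliterated",
    benign_subspace_rank=128,
)
# Print a shell-safe command line; pass `cmd` to subprocess.run(cmd, check=True)
# to actually launch the run.
print(shlex.join(cmd))
```

Building an argv list rather than a single string avoids shell quoting problems when a model path contains spaces.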

Serve the abliterated model

iconoclast serve --model ./llama-abliterated --port 8000

Then generate text:

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "How do I make a bomb?", "max_tokens": 100}'

Full research workflow

iconoclast study

Launches the interactive Optuna-powered hyperparameter search across direction methods, layer selections, and blending strategies.

The Results: ICONOCLAST vs HERETIC

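Across 10 open-weight models, ICONOCLAST is scored as the winner on every row. Per the table below, harmful refusals are lower in 6 rows and tied in 4, benign overrefusals are lower in 5 rows and tied in 5, and KL divergence from the base model is lower in 8 rows (the exceptions are Falcon3-7B-Instruct and Yi-1.5-9B-Chat).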

Model	ICONOCLAST Refusals	ICONOCLAST Overrefusals	ICONOCLAST KL	HERETIC Refusals	HERETIC Overrefusals	HERETIC KL	Outcome
Llama-3.1-8B-Instruct	0/20	0/64	0.0447	1/20	0/64	0.1854	🏆 ICONOCLAST
Qwen3.5-9B base	10/20	2/64	0.0055	10/20	3/64	0.0160	🏆 ICONOCLAST
Mistral-7B-Instruct-v0.3	1/20	0/64	0.0554	4/20	0/64	0.1317	🏆 ICONOCLAST
Falcon3-7B-Instruct	0/20	0/64	6.1448	4/20	1/64	0.1648	🏆 ICONOCLAST
Gemma-2-2B-IT	1/20	0/64	0.1849	1/20	2/64	0.6441	🏆 ICONOCLAST
Phi-4-mini-instruct	2/20	1/64	0.0204	2/20	1/64	0.0978	🏆 ICONOCLAST
Yi-1.5-9B-Chat	2/20	0/64	0.0511	3/20	0/64	0.0355	🏆 ICONOCLAST
StableLM2-1.6B	2/20	0/64	0.0328	3/20	0/64	0.0670	🏆 ICONOCLAST
SmolLM2-1.7B-Instruct	1/20	1/64	0.0087	2/20	2/64	0.2699	🏆 ICONOCLAST
OLMo-2-1B-Instruct	2/20	0/64	0.0345	2/20	1/64	0.0944	🏆 ICONOCLAST
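
The KL columns report how far the edited model's output distribution drifts from the base model's. This page does not spell out how the number is computed, so the following is only a sketch under an assumption: KL(base || edited) over next-token probabilities, averaged over a set of benign prompts. The function names and toy logits are illustrative.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities, shifting by the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def mean_kl(base_logits_batch, edited_logits_batch):
    """Average KL(base || edited) across prompts (one logit vector per prompt)."""
    pairs = list(zip(base_logits_batch, edited_logits_batch))
    return sum(kl_divergence(softmax(b), softmax(e)) for b, e in pairs) / len(pairs)


# Toy next-token logits for two prompts over a three-token vocabulary.
base = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
edited = [[1.9, 1.1, 0.1], [0.4, 0.6, 2.9]]
print(f"mean KL: {mean_kl(base, edited):.4f}")
```

A value near zero means the edited model assigns almost the same probabilities as the base model on those prompts.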

Key highlights:

Strict Behavior Wins: Fewer harmful refusals in 6/10 rows, tied in the rest
Utility Preservation: Lower KL divergence in 8/10 rows — the model retains its original capabilities
Massive KL Reduction: SmolLM2 drops from 0.2699 → 0.0087. Gemma-2-2B drops from 0.6441 → 0.1849
Flawless on Llama-3.1-8B: 0/20 harmful refusals, 0/64 overrefusals, and 0.0447 KL, the best result for this model in the comparison above
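
The row counts quoted in the highlights can be re-derived from the table. The sketch below transcribes the table values by hand and tallies them; the tuple layout and the `tally` function are illustrative, not part of the package.

```python
# (model, ICONOCLAST refusals, overrefusals, KL, HERETIC refusals, overrefusals, KL)
ROWS = [
    ("Llama-3.1-8B-Instruct", 0, 0, 0.0447, 1, 0, 0.1854),
    ("Qwen3.5-9B base", 10, 2, 0.0055, 10, 3, 0.0160),
    ("Mistral-7B-Instruct-v0.3", 1, 0, 0.0554, 4, 0, 0.1317),
    ("Falcon3-7B-Instruct", 0, 0, 6.1448, 4, 1, 0.1648),
    ("Gemma-2-2B-IT", 1, 0, 0.1849, 1, 2, 0.6441),
    ("Phi-4-mini-instruct", 2, 1, 0.0204, 2, 1, 0.0978),
    ("Yi-1.5-9B-Chat", 2, 0, 0.0511, 3, 0, 0.0355),
    ("StableLM2-1.6B", 2, 0, 0.0328, 3, 0, 0.0670),
    ("SmolLM2-1.7B-Instruct", 1, 1, 0.0087, 2, 2, 0.2699),
    ("OLMo-2-1B-Instruct", 2, 0, 0.0345, 2, 1, 0.0944),
]


def tally(rows):
    """Count rows where ICONOCLAST is lower than or tied with HERETIC per metric."""
    return {
        "fewer_refusals": sum(r[1] < r[4] for r in rows),
        "tied_refusals": sum(r[1] == r[4] for r in rows),
        "fewer_overrefusals": sum(r[2] < r[5] for r in rows),
        "tied_overrefusals": sum(r[2] == r[5] for r in rows),
        "lower_kl": sum(r[3] < r[6] for r in rows),
        "higher_kl_models": [r[0] for r in rows if r[3] > r[6]],
    }


result = tally(ROWS)
print(result["fewer_refusals"], result["tied_refusals"])          # 6 4
print(result["fewer_overrefusals"], result["tied_overrefusals"])  # 5 5
print(result["lower_kl"], result["higher_kl_models"])
# 8 ['Falcon3-7B-Instruct', 'Yi-1.5-9B-Chat']
```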

How It Works

ICONOCLAST is a discriminative representation editing framework. Unlike HERETIC-style methods that simply subtract a refusal direction, ICONOCLAST:

Computes per-layer refusal directions using contrastive pairs of harmful and benign prompts
Preserves benign subspaces by projecting refusal directions out of the subspace encoding harmless concepts — this is what prevents overrefusals and KL explosion
Optimizes hyperparameters via Optuna across direction methods (mean, median, variance, hybrid), layer selections, and blending strategies
Evaluates rigorously on holdout sets for both harmful refusal rate and benign overrefusal rate, with KL divergence as the utility metric
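
The projection in the second step is plain linear algebra and can be illustrated on toy vectors. This is a sketch under stated assumptions, not the package's implementation: it assumes the "mean" direction method is a difference of mean activations, it takes the benign subspace to be the span of the toy benign activations, and all function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    return math.sqrt(dot(v, v))


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def orthonormalize(vectors, tol=1e-10):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        n = norm(w)
        if n > tol:  # skip vectors already inside the current span
            basis.append([wi / n for wi in w])
    return basis


def project_out(direction, basis):
    """Remove the component of `direction` lying in span(basis), then renormalize."""
    r = list(direction)
    for b in basis:
        c = dot(r, b)
        r = [ri - c * bi for ri, bi in zip(r, b)]
    n = norm(r)
    if n == 0:
        raise ValueError("direction lies entirely inside the subspace")
    return [ri / n for ri in r]


# Toy 3-dimensional "activations" for one layer.
harmful = [[3.0, 1.0, 0.0], [3.0, 3.0, 0.0]]
benign = [[0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]

# Assumed "mean" method: difference of the two class means -> [3.0, 2.0, 0.0].
raw = [h - b for h, b in zip(mean_vector(harmful), mean_vector(benign))]
benign_basis = orthonormalize(benign)   # spans the second axis only
cleaned = project_out(raw, benign_basis)
print(cleaned)  # [1.0, 0.0, 0.0]
```

After projection the direction has no component along the benign axis, so subtracting it from an activation leaves that axis untouched.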

Commands Reference

`iconoclast abliterate`

One-shot abliteration. Runs the full pipeline — prompt loading, direction computation, Optuna optimization, model saving.

iconoclast abliterate --model <model_id> --output <dir> [options]

The benign subspace feature (--benign-subspace-rank) is the ICONOCLAST innovation. Start with 64 and increase if overrefusals appear:

# Best quality (requires more VRAM):
iconoclast abliterate --model Qwen/Qwen2.5-7B-Instruct --benign-subspace-rank 128 --output ./qwen-abliterated

`iconoclast study`

Full interactive research workflow. Launches an Optuna study with interactive prompts for configuration, trial inspection, and model export. This is the original research interface used to produce the benchmark results above.

`iconoclast serve`

Starts a FastAPI server for the abliterated model:

pip install 'iconoclast-llm[serve]'
iconoclast serve --model ./my-model --port 8000

API endpoints:

Endpoint	Method	Description
`/generate`	POST	Generate text. Body: `{"prompt": "...", "max_tokens": 256, "temperature": 0.7}`
`/health`	GET	Health check. Returns `{"status": "ok", "model": "..."}`
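
The endpoints above can also be called from Python with only the standard library. This is a minimal client sketch, assuming the server from `iconoclast serve` is listening on `http://localhost:8000`; the helper names are hypothetical, and because the response schema of `/generate` is not documented here, the parsed JSON is returned unchanged.

```python
import json
import urllib.request


def build_generate_request(base_url, prompt, max_tokens=256, temperature=0.7):
    """Build a POST request for /generate using the documented body fields."""
    body = json.dumps({
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_json(request, timeout=60):
    """Send a request (or URL string) and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    base = "http://localhost:8000"
    try:
        print(call_json(base + "/health"))  # documented shape: {"status": "ok", "model": "..."}
        req = build_generate_request(base, "Summarize the plot of Hamlet.", max_tokens=100)
        print(call_json(req))
    except OSError as exc:  # URLError is a subclass; raised when no server is running
        print(f"server not reachable: {exc}")
```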

Installation Options

Command	Includes
`pip install iconoclast-llm`	Core abliteration engine
`pip install 'iconoclast-llm[serve]'`	+ FastAPI demo server
`pip install 'iconoclast-llm[research]'`	+ plotting, pacmap, scikit-learn
`pip install 'iconoclast-llm[benchmark]'`	+ lm-eval for standardized evaluation
`pip install 'iconoclast-llm[quantized]'`	+ bitsandbytes for 4-bit quantization
`pip install 'iconoclast-llm[all]'`	Everything

Requirements

Python 3.10+
CUDA GPU recommended (16GB+ VRAM for 7B models, 32GB+ for larger)
MPS (Apple Silicon) works but is slower
CPU works but will be very slow

Tested on models from 1B to 9B parameters. Larger models (70B) require multi-GPU or quantization.

Citation

@software{patel2025iconoclast,
  author = {Patel, Varesh},
  title = {ICONOCLAST: Benign-Subspace-Preserved Abliteration for Representation Editing},
  year = {2025},
  url = {https://github.com/Haadesx/Iconoclast}
}

License

AGPL-3.0-or-later — see LICENSE.

Built on 4 months of research using the Rutgers iLabs cluster. Original development history at Haadesx/NLP_Project.

Release history

0.2.1 (this version): Jun 16, 2026

0.2.0: Jun 16, 2026

Download files

Source Distribution

iconoclast_llm-0.2.1.tar.gz (44.3 kB)

Uploaded Jun 16, 2026 Source

Built Distribution

iconoclast_llm-0.2.1-py3-none-any.whl (48.8 kB)

Uploaded Jun 16, 2026 Python 3

File details

Details for the file iconoclast_llm-0.2.1.tar.gz.

File metadata

Download URL: iconoclast_llm-0.2.1.tar.gz
Upload date: Jun 16, 2026
Size: 44.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.10.9 (uv publish, macOS)

File hashes

Hashes for iconoclast_llm-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`7a80bfbd030a1b2e57976ab4be2143ff9cb076bc72d13d1c82e302f00dd8f863`
MD5	`59f6710ab9d079080837ea5c6a97de4e`
BLAKE2b-256	`4201e13b0fe369d8be8b2655208fbac71838da3685ca74e94629706c007cdd56`

File details

Details for the file iconoclast_llm-0.2.1-py3-none-any.whl.

File metadata

Download URL: iconoclast_llm-0.2.1-py3-none-any.whl
Upload date: Jun 16, 2026
Size: 48.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.10.9 (uv publish, macOS)

File hashes

Hashes for iconoclast_llm-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2deef3c9952d95cb9bf11f31c445aa1570b86f22b1daccb665e9758466ed1639`
MD5	`e677b813d70dd05c6c7032549c063944`
BLAKE2b-256	`6c82bb0cbc7cef163bf2d5646fedd307e37f9f079f7acf15dd89ba3cab578b6b`
