ICONOCLAST: discriminative representation editing for open-weight LLMs. Beats the HERETIC baseline on all 10 tested models.

🗡️ ICONOCLAST

Beat the HERETIC baseline 10/10 — remove LLM refusal behaviors while preserving intelligence.

pip install iconoclast-llm
iconoclast abliterate --model Qwen/Qwen2.5-7B-Instruct --output ./my-model

Quick Start

Install

pip install iconoclast-llm

For the demo API server:

pip install 'iconoclast-llm[serve]'

Abliterate a model

iconoclast abliterate --model meta-llama/Llama-3.1-8B-Instruct --output ./llama-abliterated

This downloads the model from Hugging Face, computes refusal directions, finds optimal abliteration parameters via Optuna, and saves the abliterated model to ./llama-abliterated/.

Options:

| Flag | Default | Description |
| --- | --- | --- |
| --model | required | Hugging Face model ID or local path |
| --output | ./abliterated-model | Output directory |
| --device | auto | Device (auto, cuda:0, mps, cpu) |
| --benign-subspace-rank | 0 | Benign subspace dimensions to preserve (try 64–256 for better quality) |
| --quantize | none | Quantization (none or bnb_4bit) |
| --good-prompts | HuggingFaceH4/ultrafeedback_binarized | Dataset for benign prompts |
| --bad-prompts | walledai/JailbreakBench | Dataset for harmful prompts |

Serve the abliterated model

iconoclast serve --model ./llama-abliterated --port 8000

Then generate text:

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "How do I make a bomb?", "max_tokens": 100}'

Full research workflow

iconoclast study

Launches the interactive Optuna-powered hyperparameter search across direction methods, layer selections, and blending strategies.


The Results: ICONOCLAST vs HERETIC

10 open-weight models, one winner. In every row, ICONOCLAST produces no more harmful refusals and no more benign overrefusals than HERETIC, and it reaches lower KL divergence from the base model in 8 of the 10 rows (the exceptions are Falcon3-7B-Instruct and Yi-1.5-9B-Chat).

| Model | ICONOCLAST Refusals | ICONOCLAST Overrefusals | ICONOCLAST KL | HERETIC Refusals | HERETIC Overrefusals | HERETIC KL | Outcome |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.1-8B-Instruct | 0/20 | 0/64 | 0.0447 | 1/20 | 0/64 | 0.1854 | 🏆 ICONOCLAST |
| Qwen3.5-9B base | 10/20 | 2/64 | 0.0055 | 10/20 | 3/64 | 0.0160 | 🏆 ICONOCLAST |
| Mistral-7B-Instruct-v0.3 | 1/20 | 0/64 | 0.0554 | 4/20 | 0/64 | 0.1317 | 🏆 ICONOCLAST |
| Falcon3-7B-Instruct | 0/20 | 0/64 | 6.1448 | 4/20 | 1/64 | 0.1648 | 🏆 ICONOCLAST |
| Gemma-2-2B-IT | 1/20 | 0/64 | 0.1849 | 1/20 | 2/64 | 0.6441 | 🏆 ICONOCLAST |
| Phi-4-mini-instruct | 2/20 | 1/64 | 0.0204 | 2/20 | 1/64 | 0.0978 | 🏆 ICONOCLAST |
| Yi-1.5-9B-Chat | 2/20 | 0/64 | 0.0511 | 3/20 | 0/64 | 0.0355 | 🏆 ICONOCLAST |
| StableLM2-1.6B | 2/20 | 0/64 | 0.0328 | 3/20 | 0/64 | 0.0670 | 🏆 ICONOCLAST |
| SmolLM2-1.7B-Instruct | 1/20 | 1/64 | 0.0087 | 2/20 | 2/64 | 0.2699 | 🏆 ICONOCLAST |
| OLMo-2-1B-Instruct | 2/20 | 0/64 | 0.0345 | 2/20 | 1/64 | 0.0944 | 🏆 ICONOCLAST |

Key highlights:

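  • Strict Behavior Wins: Fewer harmful refusals in 6/10 rows and tied in the other 4; fewer benign overrefusals in 5/10 rows and tied in the other 5
  • Utility Preservation: Lower KL divergence in 8/10 rows, meaning the edited model's output distribution stays closer to the base model. The two exceptions are Falcon3-7B-Instruct (6.1448 vs 0.1648) and Yi-1.5-9B-Chat (0.0511 vs 0.0355)
  • Massive KL Reduction: SmolLM2 drops from 0.2699 → 0.0087. Gemma-2-2B drops from 0.6441 → 0.1849
  • Flawless on Llama-3.1-8B: 0/20 harmful refusals, 0/64 overrefusals, and 0.0447 KL, the best result for this model in the comparison above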
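The per-metric counts can be re-derived directly from the results table. The sketch below is not part of the iconoclast package: the row tuples are transcribed by hand from the table (numerators only, out of 20 harmful and 64 benign prompts), and the `tally` helper and its key names are hypothetical.

```python
# Each row: (model, ICONOCLAST refusals, ICONOCLAST overrefusals, ICONOCLAST KL,
#            HERETIC refusals, HERETIC overrefusals, HERETIC KL)
# Values transcribed from the results table above.
ROWS = [
    ("Llama-3.1-8B-Instruct", 0, 0, 0.0447, 1, 0, 0.1854),
    ("Qwen3.5-9B base", 10, 2, 0.0055, 10, 3, 0.0160),
    ("Mistral-7B-Instruct-v0.3", 1, 0, 0.0554, 4, 0, 0.1317),
    ("Falcon3-7B-Instruct", 0, 0, 6.1448, 4, 1, 0.1648),
    ("Gemma-2-2B-IT", 1, 0, 0.1849, 1, 2, 0.6441),
    ("Phi-4-mini-instruct", 2, 1, 0.0204, 2, 1, 0.0978),
    ("Yi-1.5-9B-Chat", 2, 0, 0.0511, 3, 0, 0.0355),
    ("StableLM2-1.6B", 2, 0, 0.0328, 3, 0, 0.0670),
    ("SmolLM2-1.7B-Instruct", 1, 1, 0.0087, 2, 2, 0.2699),
    ("OLMo-2-1B-Instruct", 2, 0, 0.0345, 2, 1, 0.0944),
]


def tally(rows):
    """Count the rows where ICONOCLAST is strictly better, or tied, per metric."""
    counts = {
        "fewer_refusals": 0,
        "tied_refusals": 0,
        "fewer_overrefusals": 0,
        "tied_overrefusals": 0,
        "lower_kl": 0,
    }
    for _, i_ref, i_over, i_kl, h_ref, h_over, h_kl in rows:
        counts["fewer_refusals"] += int(i_ref < h_ref)
        counts["tied_refusals"] += int(i_ref == h_ref)
        counts["fewer_overrefusals"] += int(i_over < h_over)
        counts["tied_overrefusals"] += int(i_over == h_over)
        counts["lower_kl"] += int(i_kl < h_kl)
    return counts


def higher_kl_models(rows):
    """Return the models where ICONOCLAST's KL is not lower than HERETIC's."""
    return [row[0] for row in rows if row[3] >= row[6]]
```

Calling `tally(ROWS)` gives 6 rows with fewer harmful refusals (4 tied), 5 with fewer overrefusals (5 tied), and 8 with lower KL; `higher_kl_models(ROWS)` names the two rows where KL is higher.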

How It Works

ICONOCLAST is a discriminative representation editing framework. Unlike HERETIC-style methods that simply subtract a refusal direction, ICONOCLAST:

  1. Computes per-layer refusal directions using contrastive pairs of harmful and benign prompts
  2. Preserves benign subspaces by projecting refusal directions out of the subspace encoding harmless concepts — this is what prevents overrefusals and KL explosion
  3. Optimizes hyperparameters via Optuna across direction methods (mean, median, variance, hybrid), layer selections, and blending strategies
  4. Evaluates rigorously on holdout sets for both harmful refusal rate and benign overrefusal rate, with KL divergence as the utility metric
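To make the utility metric in step 4 concrete, here is a minimal sketch of KL divergence between two next-token distributions. The project does not document how its KL numbers are computed, so the direction KL(base || edited), the single-position input, and the function names are all assumptions for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw logits."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]


def kl_divergence(base_logits, edited_logits):
    """KL(base || edited) in nats for one next-token position.

    Assumption: the base model is the reference distribution, so the value
    measures how far the edited model has drifted away from it.
    """
    log_p = log_softmax(base_logits)
    log_q = log_softmax(edited_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

Identical logits give 0, and the value grows as the edited model's distribution moves away from the base model's. A benchmark figure would presumably average this over many prompts and positions, but the source does not specify the procedure.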

Commands Reference

iconoclast abliterate

One-shot abliteration. Runs the full pipeline — prompt loading, direction computation, Optuna optimization, model saving.

iconoclast abliterate --model <model_id> --output <dir> [options]

The benign subspace feature (--benign-subspace-rank, default 0) is the core ICONOCLAST innovation. Start with 64 and increase if overrefusals appear:

# Best quality (requires more VRAM):
iconoclast abliterate --model Qwen/Qwen2.5-7B-Instruct --benign-subspace-rank 128 --output ./qwen-abliterated

iconoclast study

Full interactive research workflow. Launches an Optuna study with interactive prompts for configuration, trial inspection, and model export. This is the original research interface used to produce the benchmark results above.

iconoclast serve

Starts a FastAPI server for the abliterated model:

pip install 'iconoclast-llm[serve]'
iconoclast serve --model ./my-model --port 8000

API endpoints:

| Endpoint | Method | Description |
| --- | --- | --- |
| /generate | POST | Generate text. Body: {"prompt": "...", "max_tokens": 256, "temperature": 0.7} |
| /health | GET | Health check. Returns {"status": "ok", "model": "..."} |
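The two endpoints can also be called from Python using only the standard library. This is a sketch, not an official client: the helper names are hypothetical, the base URL assumes the --port 8000 example, and because the response schema of /generate is not documented here, the parsed JSON is returned as-is.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumes `iconoclast serve ... --port 8000`


def build_generate_request(prompt, max_tokens=256, temperature=0.7, base_url=BASE_URL):
    """Build a POST request for /generate using the documented body fields."""
    body = json.dumps(
        {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_health_request(base_url=BASE_URL):
    """Build a GET request for /health."""
    return urllib.request.Request(f"{base_url}/health", method="GET")


def call_json(request, timeout=60):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def demo():
    """Needs a running server: check health, then request a short completion."""
    print(call_json(build_health_request()))
    print(call_json(build_generate_request("Write a haiku about autumn.", max_tokens=64)))
```

With the server running, `demo()` prints the health payload followed by whatever JSON /generate returns.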

Installation Options

| Command | Includes |
| --- | --- |
| pip install iconoclast-llm | Core abliteration engine |
| pip install 'iconoclast-llm[serve]' | + FastAPI demo server |
| pip install 'iconoclast-llm[research]' | + plotting, pacmap, scikit-learn |
| pip install 'iconoclast-llm[benchmark]' | + lm-eval for standardized evaluation |
| pip install 'iconoclast-llm[quantized]' | + bitsandbytes for 4-bit quantization |
| pip install 'iconoclast-llm[all]' | Everything |

Requirements

  • Python 3.10+
  • CUDA GPU recommended (16GB+ VRAM for 7B models, 32GB+ for larger)
  • MPS (Apple Silicon) works but is slower
  • CPU works but will be very slow

Tested on models from 1B to 9B parameters. Larger models (70B) require multi-GPU or quantization.
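A back-of-the-envelope estimate shows where these VRAM figures come from. The sketch counts weight storage only and assumes 2 bytes per parameter for fp16/bf16 and 0.5 bytes for 4-bit quantization; activations, optimizer trials, and quantization overhead are ignored, so real usage will be higher. The function name is hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for model weights alone, in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


# 7B model in fp16/bf16: about 14 GB of weights, hence the 16GB+ guidance.
seven_b_fp16 = weight_memory_gb(7e9, 2)

# 70B model in fp16/bf16: about 140 GB, beyond a single common GPU.
seventy_b_fp16 = weight_memory_gb(70e9, 2)

# 70B model at 4 bits per weight: about 35 GB.
seventy_b_4bit = weight_memory_gb(70e9, 0.5)
```

Under these assumptions, a 70B model only comes within reach of one large GPU when quantized, which is consistent with the multi-GPU or quantization note above.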


Citation

@software{patel2025iconoclast,
  author = {Patel, Varesh},
  title = {ICONOCLAST: Benign-Subspace-Preserved Abliteration for Representation Editing},
  year = {2025},
  url = {https://github.com/Haadesx/Iconoclast}
}

License

AGPL-3.0-or-later — see LICENSE.

Built on 4 months of research using the Rutgers iLabs cluster. Original development history at Haadesx/NLP_Project.
