LLM red-teaming and adversarial testing framework
vauban
MLX-native abliteration toolkit for Apple Silicon. Measure a refusal direction, cut it from the weights, get a modified model out. ~550 lines of Python.
Install
git clone https://github.com/teilomillet/vauban.git
cd vauban && uv sync
Usage
Write a TOML config and save it as run.toml:
[model]
path = "mlx-community/Llama-3.2-3B-Instruct-4bit"
[data]
harmful = "default"
harmless = "default"
Run it:
uv run vauban run.toml
Output lands in output/ — a complete model directory loadable by mlx_lm.load().
What it does
- Measure: runs harmful/harmless prompts, captures per-layer activations, and extracts the refusal direction via difference-in-means (or a top-k SVD subspace)
- Cut: removes the direction from o_proj and down_proj weights via rank-1 projection. Variants: norm-preserving, biprojected, subspace
- Export: writes modified weights + tokenizer as a loadable model
- Evaluate: refusal rate, perplexity, KL divergence between original and modified
- Probe/Steer: inspect per-layer projections, steer generation at runtime
- Surface map: scan diverse prompts to visualize the refusal landscape before/after
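The Measure and Cut steps can be illustrated on synthetic data. The following NumPy sketch is not vauban's code (which runs on MLX). The function names, the shapes, and the (out_dim, in_dim) weight layout are assumptions chosen for illustration. It computes a difference-in-means direction and removes it from a weight matrix with a rank-1 projection, so the layer's output has no component left along that direction.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Difference-in-means direction, normalized to unit length.

    Both inputs have shape (num_prompts, hidden_dim).
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def cut_direction(weight, direction):
    """Remove `direction` from the output space of `weight`.

    `weight` has shape (hidden_dim, in_dim), so y = weight @ x.
    Subtracting the rank-1 matrix d (d^T W) gives d^T (W' x) = 0 for all x.
    """
    d = direction / np.linalg.norm(direction)
    return weight - np.outer(d, d @ weight)


rng = np.random.default_rng(0)
hidden_dim, in_dim = 8, 16

# Synthetic activations: harmful prompts are shifted along axis 0.
shift = np.zeros(hidden_dim)
shift[0] = 3.0
harmless = rng.normal(size=(32, hidden_dim))
harmful = rng.normal(size=(32, hidden_dim)) + shift

d = refusal_direction(harmful, harmless)

# A random stand-in for a projection that writes into the residual stream.
W = rng.normal(size=(hidden_dim, in_dim))
W_cut = cut_direction(W, d)

# The modified layer's output has (numerically) zero component along d.
x = rng.normal(size=in_dim)
residual = float(d @ (W_cut @ x))
print(f"component along d after cut: {residual:.2e}")
```

The same subtraction applies to any matrix whose rows index the hidden dimension. The norm-preserving, biprojected, and subspace variants named above would change how the update is computed and are not covered by this sketch.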
Python API
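import mlx_lm
from mlx.utils import tree_flatten
from vauban import measure, cut, export_model, load_prompts, default_prompt_paths
MODEL_ID = "mlx-community/Llama-3.2-3B-Instruct-4bit"
# Load the model and tokenizer.
model, tok = mlx_lm.load(MODEL_ID)
# Bundled prompt sets: index 0 is harmful, index 1 is harmless.
harmful = load_prompts(default_prompt_paths()[0])
harmless = load_prompts(default_prompt_paths()[1])
# Measure the refusal direction from activations on both prompt sets.
result = measure(model, tok, harmful, harmless)
# Flatten the parameter tree into a name -> array dict.
weights = dict(tree_flatten(model.parameters()))
# Cut the direction from every transformer layer.
all_layers = list(range(len(model.model.layers)))
modified = cut(weights, result.direction, all_layers)
# Write the modified weights and tokenizer to output/.
export_model(MODEL_ID, modified, "output")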
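The Evaluate metrics (refusal rate, perplexity, KL divergence) can be computed from raw logits and generated text. This NumPy sketch shows one common way to define them. mean_kl, perplexity, refusal_rate, and the marker phrases are hypothetical stand-ins, not vauban's evaluation code.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def mean_kl(logits_orig, logits_mod):
    """Mean KL(original || modified) over token positions."""
    lp = log_softmax(logits_orig)
    lq = log_softmax(logits_mod)
    return float((np.exp(lp) * (lp - lq)).sum(axis=-1).mean())


def perplexity(logits, target_ids):
    """exp of the mean negative log-likelihood of the target tokens."""
    lp = log_softmax(logits)
    nll = -lp[np.arange(len(target_ids)), target_ids]
    return float(np.exp(nll.mean()))


def refusal_rate(responses, markers=("I can't", "I cannot", "I'm sorry")):
    """Fraction of responses containing a refusal marker (illustrative list)."""
    hits = sum(any(marker in text for marker in markers) for text in responses)
    return hits / len(responses)


rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 10))            # 5 positions, vocabulary of 10
perturbed = logits + rng.normal(size=(5, 10))

kl_same = mean_kl(logits, logits)            # identical models
kl_diff = mean_kl(logits, perturbed)         # modified model drifts
ppl_uniform = perplexity(np.zeros((5, 10)), [1, 2, 3, 4, 5])
rate = refusal_rate(["I'm sorry, I can't help with that.", "Sure, here is a summary."])
print(kl_same, kl_diff, ppl_uniform, rate)
```

A uniform distribution over a vocabulary of 10 tokens has perplexity 10, which makes a convenient check. Substring matching is a crude refusal detector, so the marker list should be treated as a placeholder.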
Config reference
See docs/getting-started.md for the full config reference with all [measure], [cut], [surface], [eval], and [output] options.
Requirements
- Apple Silicon Mac (M1+)
- Python >= 3.12
- uv
License
Apache-2.0
Download files
File details
Details for the file vauban-0.2.0.tar.gz.
File metadata
- Download URL: vauban-0.2.0.tar.gz
- Upload date:
- Size: 110.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.16
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fdca4015a99c056e594771e04d06fb89117a04ac3d2220ad2a1c16789432cead |
| MD5 | e5b3cbc329c5b63b30d0f260d898c687 |
| BLAKE2b-256 | 0d13201077173d4bfe9d1945d207ad9d906e2080841804b6e49b47bd135b74b0 |
File details
Details for the file vauban-0.2.0-py3-none-any.whl.
File metadata
- Download URL: vauban-0.2.0-py3-none-any.whl
- Upload date:
- Size: 41.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.16
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 091df5d64d51362d0ec23b42c096f0f0b338116b1ef3514c854153ce34dae8ff |
| MD5 | ba64830ad0afc17868377d80890f1741 |
| BLAKE2b-256 | 3040eb36b16b43f62cd01f20542d30eac2aab06f9971c79e9b48f03c004de785 |
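A downloaded file can be checked against the SHA256 digests listed above with Python's hashlib. This is a minimal sketch: sha256_of and matches_published are hypothetical helper names, and the script assumes the files sit in the current directory under their published names.

```python
import hashlib
from pathlib import Path

# SHA256 digests copied from the file details above.
EXPECTED = {
    "vauban-0.2.0.tar.gz":
        "fdca4015a99c056e594771e04d06fb89117a04ac3d2220ad2a1c16789432cead",
    "vauban-0.2.0-py3-none-any.whl":
        "091df5d64d51362d0ec23b42c096f0f0b338116b1ef3514c854153ce34dae8ff",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never load fully into memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """Compare a file's SHA256 with a published hex digest, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


for name, expected in EXPECTED.items():
    if Path(name).exists():
        status = "OK" if matches_published(name, expected) else "MISMATCH"
    else:
        status = "not found"
    print(f"{name}: {status}")
```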