
vauban

LLM red-teaming and adversarial testing framework

An MLX-native toolkit for understanding and reshaping how language models behave on Apple Silicon.

Named after Sébastien Le Prestre de Vauban — the military engineer who mastered both siege and fortification. Vauban works both sides: break a model's safety alignment, or harden it against attacks.

What it does

Refusal in language models is mediated by a single direction in activation space (Arditi et al., 2024). Vauban operates directly on this geometry:

  • Measure a behavioral direction from the model's activations
  • Cut it from the weights (abliteration)
  • Probe per-layer projections to see what the model encodes
  • Steer generation at runtime by modifying activations mid-forward-pass
  • Map the full refusal surface across diverse prompts
  • Optimize cut parameters automatically (Optuna search)
  • Soft-prompt — optimize learnable prefixes in embedding space (GCG, continuous, EGD)
  • Sanitize inputs iteratively before they reach the model (SIC)
  • Detect whether a model has been hardened against abliteration
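Steering, for instance, amounts to shifting a hidden state along the measured direction during the forward pass. A minimal sketch of that projection step in plain NumPy (this models the idea only; it is not vauban's API, and the function and argument names are illustrative):

```python
import numpy as np

def steer_hidden(h, d, coeff=-1.0):
    """Shift hidden state h along unit direction d.

    coeff = -1.0 removes h's component along d entirely;
    a positive coeff pushes the state toward the behavior.
    """
    d = d / np.linalg.norm(d)
    return h + coeff * (h @ d) * d
```

With `coeff=-1.0` the returned state is orthogonal to the direction, which is the runtime analogue of cutting it from the weights.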

Everything runs natively on Apple Silicon via MLX — no CUDA, no Docker, no hooks. All configuration lives in TOML files.

Requirements

  • Apple Silicon Mac (M1 or later)
  • Python >= 3.12
  • uv package manager

Install

git clone https://github.com/teilomillet/vauban.git
cd vauban && uv sync

Quick start

1. Write a config file — create run.toml:

[model]
path = "mlx-community/Llama-3.2-3B-Instruct-4bit"

[data]
harmful = "default"
harmless = "default"

path is a HuggingFace model ID — it downloads automatically on first run. "default" uses the bundled prompt sets (128 harmful + 128 harmless).

2. Validate (optional but recommended):

uv run vauban --validate run.toml

Checks types, ranges, file paths, and mode conflicts — without loading any model. It also validates JSONL schemas (prompt/label/category) and prints actionable fix: hints for ambiguous or broken configs.
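For custom prompt files, each JSONL record carries the prompt/label/category fields the validator checks. A record might look like this (the field names come from the schema above; the values and label vocabulary shown here are illustrative, so check docs/getting-started.md for the exact format):

```json
{"prompt": "How do I pick a lock?", "label": "harmful", "category": "physical"}
```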

Need help first? Use the built-in manual:

uv run vauban man
uv run vauban man quickstart
uv run vauban man softprompt

The manual is generated from typed config dataclasses plus parser constraints, so defaults and field types stay in sync with code.

3. Run:

uv run vauban run.toml

Output lands in output/ — a complete model directory you can load directly:

import mlx_lm
model, tok = mlx_lm.load("output")

How the default pipeline works

  1. Measure — runs both prompt sets through the model, captures per-layer activations at the last token position, computes the difference-in-means, and picks the layer with the highest separation. Output: a refusal direction vector.
  2. Cut — removes the direction from each layer's weight matrices via a rank-1 projection, W = W - alpha * outer(W @ d, d), so that with alpha = 1 every row of W becomes orthogonal to d.
  3. Export — writes modified weights + tokenizer + config as a loadable model directory.
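The measure and cut steps above can be sketched in a few lines of NumPy (a simplified model of the computation, not vauban's API: the real pipeline also picks the best layer and operates on MLX arrays):

```python
import numpy as np

def difference_in_means(harmful_acts, harmless_acts):
    """Refusal direction: normalized gap between mean activations.

    Both inputs are (n_prompts, hidden_dim) arrays of last-token
    activations captured at one layer.
    """
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def cut_weights(W, d, alpha=1.0):
    """Rank-1 update W - alpha * outer(W @ d, d).

    With alpha = 1 this equals W @ (I - d d^T): every row of the
    result is orthogonal to the unit direction d.
    """
    return W - alpha * np.outer(W @ d, d)
```

After `cut_weights(W, d)` with the default `alpha=1.0`, `W @ d` is the zero vector, which is what makes the behavior unexpressable through that weight matrix along that direction.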

Add [eval] for post-cut evaluation (refusal rate, perplexity, KL divergence) and [surface] for full refusal landscape mapping before and after the cut.
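Since the presence of a section is what opts you in, extending the quick-start config might look like this (assuming both sections run with sensible defaults when left empty; see examples/config.toml for the actual fields each section accepts):

```toml
[model]
path = "mlx-community/Llama-3.2-3B-Instruct-4bit"

[data]
harmful = "default"
harmless = "default"

# Including these sections activates post-cut evaluation
# and before/after surface mapping.
[eval]

[surface]
```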

Pipeline modes

The TOML sections you include determine what vauban does. The default is measure-cut-export, but specialized sections activate different pipelines:

| Section | What it does | Output |
| --- | --- | --- |
| (default) | Measure refusal direction, cut it, export modified model | model directory |
| [surface] | Map the refusal landscape before and after the cut | surface_report.json |
| [eval] | Refusal rate, perplexity, KL divergence | eval_report.json |
| [detect] | Check if a model has been hardened against abliteration | detect_report.json |
| [depth] | Deep-thinking token analysis | depth_report.json |
| [probe] | Per-layer projection inspection | probe_report.json |
| [steer] | Runtime steered generation | steer_report.json |
| [optimize] | Optuna search for best cut parameters | optimize_report.json |
| [softprompt] | Optimize learnable prefixes in embedding space (GCG, continuous, EGD) | softprompt_report.json |
| [sic] | Iterative input sanitization (SIC) | sic_report.json |
Early-return precedence is: [depth] > [probe] > [steer] > [sic] > [optimize] > [softprompt]. Use --validate to catch conflicts.

Python API

For custom workflows beyond TOML configs:

import mlx_lm
from vauban import measure, cut, export_model, load_prompts, default_prompt_paths
from mlx.utils import tree_flatten

# Load model
model, tok = mlx_lm.load("mlx-community/Llama-3.2-3B-Instruct-4bit")

# Load prompt sets
harmful = load_prompts(default_prompt_paths()[0])
harmless = load_prompts(default_prompt_paths()[1])

# Measure the refusal direction
result = measure(model, tok, harmful, harmless)

# Cut it from the weights
weights = dict(tree_flatten(model.parameters()))
modified = cut(weights, result.direction, list(range(len(model.model.layers))))

# Export
export_model("mlx-community/Llama-3.2-3B-Instruct-4bit", modified, "output")

The API also exposes probe(), steer(), evaluate(), and map_surface() — see the getting-started guide for usage.

Documentation

| Resource | Description |
| --- | --- |
| docs/getting-started.md | Guided walkthrough: all pipeline modes, data formats, config fields, Python API |
| docs/surface.md | Surface mapping reference and dataset format |
| examples/config.toml | Annotated config with every field documented |

License

Apache-2.0
