LLM red-teaming and adversarial testing framework
vauban
An MLX-native toolkit for understanding and reshaping how language models behave on Apple Silicon.
Named after Sébastien Le Prestre de Vauban, the military engineer who worked both siege and fortification. Vauban does the same for model behavior: measure it, cut it, probe it, steer it, or harden it.
What it does
Vauban is a TOML-first CLI for workflows built around activation-space geometry:
- measure behavioral directions from model activations
- cut or sparsify those directions in weights
- probe and steer models at runtime
- map refusal surfaces before and after intervention
- run defense, sanitization, optimization, and attack loops
The primary interface is not a pile of subcommands. It is:
vauban <config.toml>
All pipeline behavior lives in the TOML file.
Requirements
- Apple Silicon Mac
- Python 3.12+
- uv
Install
Install the released CLI:
uv tool install vauban
If your shell cannot find vauban, update your shell config once:
uv tool update-shell
Then open a new shell and check the command:
vauban --help
vauban man workflows
For development from this repo:
uv tool install --editable .
Contributor workflow and repo policy live in CONTRIBUTING.md.
Quick Start
Pick a goal. The manual tells you which sections to use:
vauban man workflows
Then scaffold a config and go:
vauban man quickstart
Scaffold a starter config:
vauban init --mode default --output run.toml
Scaffold a named indirect prompt-injection benchmark:
vauban init --scenario share_doc --output share_doc.toml
Or keep the benchmark choice directly in TOML and override only what you need:
[environment]
scenario = "share_doc"
max_turns = 5
All pipeline modes can be scaffolded. See the full list with:
vauban init --help
Validate before a real run:
vauban --validate run.toml
Then run the pipeline:
vauban run.toml
By default, output goes to output/ relative to the TOML file.
Minimal TOML
This is the minimal config the code accepts for the default pipeline:
[model]
path = "mlx-community/Llama-3.2-3B-Instruct-4bit"
[data]
harmful = "default"
harmless = "default"
[model].path is required.
[data].harmful and [data].harmless are required for most runs. "default" uses Vauban's bundled prompt sets.
You can also choose the output directory explicitly:
[output]
dir = "runs/baseline"
Experiment Tech Tree
Vauban has built-in experiment lineage tracking through an optional [meta] section. This metadata does not change pipeline execution. It exists so you can organize runs as a tech tree.
Minimal example:
[meta]
id = "cut-alpha-1"
title = "Baseline cut, alpha 1.0"
status = "baseline"
parents = ["measure-v1"]
tags = ["cut", "baseline"]
date = 2026-03-02
notes = "First stable reference run."
Verified status values are:
archived, baseline, dead_end, promising, superseded, wip
If [meta].id is omitted, Vauban uses the TOML filename stem.
Render the tree from a directory of TOML configs:
vauban tree experiments/
vauban tree experiments/ --format mermaid
vauban tree experiments/ --status promising
vauban tree experiments/ --tag gcg
Each run also appends an entry to experiment_log.jsonl inside the configured output directory, recording the resolved config path, pipeline mode, report files, metrics, and selected [meta] fields.
How TOML Drives Vauban
vauban <config.toml> loads one TOML file and decides what to do from the sections you include.
The default path is:
- measure
- cut
- export
You extend that run by adding more sections. Common examples:
- [eval] adds post-cut evaluation reports.
- [surface] adds before/after refusal-surface mapping.
- [detect] adds hardening detection during measurement.
- some sections switch Vauban into dedicated mode-specific runs instead of the default pipeline.
If you know what you want to do but not which sections to use:
vauban man workflows
For field-level reference on any section:
vauban man cast
vauban man softprompt
vauban man measure
For mode precedence:
vauban man modes
Commands You Will Actually Use
Inspect the manual:
vauban man workflows
vauban man quickstart
vauban man cast
vauban man all
Scaffold configs:
vauban init --help
vauban init --mode default --output run.toml
vauban init --mode probe --output probe.toml
Validate config and prompt files without loading model weights:
vauban --validate run.toml
Export the current JSON Schema for editor tooling:
vauban schema
vauban schema --output vauban.schema.json
Compare two run directories:
vauban diff run_a run_b
vauban diff --format markdown run_a run_b
vauban diff --threshold 0.05 run_a run_b
vauban diff --threshold ... is a CI gate: it exits non-zero if any absolute metric delta crosses the threshold.
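The gate logic can be sketched as follows. This is not Vauban's implementation: the metric names are hypothetical, comparing only metrics present in both runs is an assumption, and so is treating "crosses" as strictly greater than the threshold.

```python
def crossed_threshold(
    metrics_a: dict[str, float],
    metrics_b: dict[str, float],
    threshold: float,
) -> dict[str, float]:
    """Return the shared metrics whose absolute delta exceeds the threshold."""
    shared = metrics_a.keys() & metrics_b.keys()
    deltas = {name: metrics_b[name] - metrics_a[name] for name in sorted(shared)}
    return {name: delta for name, delta in deltas.items() if abs(delta) > threshold}


def gate_exit_code(
    metrics_a: dict[str, float],
    metrics_b: dict[str, float],
    threshold: float,
) -> int:
    """Exit code for a CI step: 1 if any metric moved too far, else 0."""
    return 1 if crossed_threshold(metrics_a, metrics_b, threshold) else 0
```

In CI, a non-zero exit code fails the job, so a regression in any tracked metric blocks the change until someone looks at it.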
Render the experiment lineage tree:
vauban tree experiments/
vauban tree experiments/ --format mermaid
vauban tree experiments/ --status promising
Data Formats
Verified by the generated manual:
- prompt JSONL for [data] and [eval]: one JSON object per line with a prompt key
- surface JSONL for [surface].prompts: requires label and category, plus either prompt or messages
- refusal phrase files: plain text, one phrase per line
- relative paths resolve from the TOML file's directory
Minimal prompt dataset example:
{"prompt":"What is the capital of France?"}
{"prompt":"Write a haiku about rain."}
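The two JSONL shapes described above can be checked with the standard library before a run. This is a sketch under stated assumptions: the helper names are hypothetical, blank lines are skipped, and a surface record carrying both prompt and messages is accepted. `vauban --validate` remains the authoritative check.

```python
import json
from typing import Callable


def check_prompt_line(line: str) -> bool:
    """A prompt record is a JSON object with a "prompt" key."""
    record = json.loads(line)
    return isinstance(record, dict) and "prompt" in record


def check_surface_line(line: str) -> bool:
    """A surface record needs label, category, and prompt or messages."""
    record = json.loads(line)
    if not isinstance(record, dict):
        return False
    has_labels = "label" in record and "category" in record
    has_input = "prompt" in record or "messages" in record
    return has_labels and has_input


def invalid_lines(text: str, check: Callable[[str], bool]) -> list[int]:
    """Return 1-based numbers of non-blank lines that fail the check."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        try:
            ok = check(line)
        except json.JSONDecodeError:
            ok = False
        if not ok:
            bad.append(number)
    return bad
```

Calling `invalid_lines(open("prompts.jsonl").read(), check_prompt_line)` on a file of your own returns an empty list when every line has the expected shape.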
Notes On Verification
This README is aligned to the code in this repo:
- package name: vauban
- console script: vauban = vauban.__main__:main
- verified commands: vauban <config.toml>, --validate, schema, init, diff, tree, man
- verified manual topics and scaffolded modes were checked against the live CLI help and generated manual
Earlier versions of this README had some stale mode/output claims; this version removes them and points readers to vauban man ... for the parts generated directly from code.
Python API (Session)
For programmatic use, the Session class wraps a loaded model with tool discovery, prerequisite tracking, and structured results.
from vauban.session import Session
s = Session("mlx-community/Qwen2.5-1.5B-Instruct-bf16")
s.tools() # discover all capabilities
s.guide("audit") # step-by-step workflow
s.describe("cast") # detailed tool info with current status
s.catalog() # all tools grouped by category
Tools
| Method | Returns | What it does |
|---|---|---|
| s.measure() | DirectionResult | Extract refusal direction from activations |
| s.detect() | DetectResult | Check if model is hardened against abliteration |
| s.audit(thoroughness=...) | AuditResult | Full red-team: jailbreak + softprompt + surface + guard |
| s.evaluate() | EvalResult | Refusal rate + perplexity + KL divergence |
| s.probe("prompt") | ProbeResult | Per-layer projection onto refusal direction |
| s.scan("text") | ScanResult | Per-token injection detection |
| s.surface() | SurfaceResult | Map refusal boundary across prompt categories |
| s.cast("prompt", threshold=0.3) | CastResult | Conditional activation steering (defense) |
| s.sic(["prompt", ...]) | SICResult | Iterative input sanitization (defense) |
| s.steer("prompt", alpha=-1.0) | str | Unconditional activation steering |
| s.cut(alpha=1.0) | dict[str, Array] | Remove refusal direction from weights |
| s.export("output/") | str | Save modified model to disk |
| s.classify("text") | harm scores | Score against 13-domain harm taxonomy |
| s.score("prompt", "response") | score result | 5-axis quality assessment |
| s.report() | str | Markdown report from audit findings |
Result Types
DirectionResult (from measure): direction (Array, shape d_model), layer_index (best layer), cosine_scores (per-layer separation), d_model, model_path.
CastResult (from cast): text (generated output), interventions (tokens where CAST steered, 0 = defense didn't engage), considered (total tokens), projections_before/projections_after (per-layer).
SICResult (from sic): prompts_clean (sanitized text), prompts_blocked (bool per prompt), initial_scores/final_scores (direction projection), total_blocked/total_sanitized/total_clean.
DetectResult (from detect): hardened (bool), confidence (0.0-1.0), effective_rank (>1.5 suggests hardening), evidence (list of strings).
AuditResult (from audit): overall_risk ("critical"/"high"/"medium"/"low"), findings (list of AuditFinding), jailbreak_success_rate, softprompt_success_rate, surface_refusal_rate.
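To give the direction and projection fields some intuition, here is a toy sketch in plain Python. It assumes the direction is a normalized difference of class means, which is one common way to define such a direction; this README does not state how Vauban computes it. All function names are hypothetical, and the last helper acts on a single activation vector, whereas cut() is described as operating on weights.

```python
import math


def mean_vector(rows: list[list[float]]) -> list[float]:
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def unit_direction(
    harmful: list[list[float]], harmless: list[list[float]]
) -> list[float]:
    """Difference of class means, normalized to unit length (an assumption)."""
    diff = [a - b for a, b in zip(mean_vector(harmful), mean_vector(harmless))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def projection(activation: list[float], direction: list[float]) -> float:
    """Scalar projection of an activation onto a unit direction."""
    return sum(a * d for a, d in zip(activation, direction))


def remove_component(
    activation: list[float], direction: list[float], alpha: float = 1.0
) -> list[float]:
    """Subtract alpha times the activation's component along the direction."""
    p = projection(activation, direction)
    return [a - alpha * p * d for a, d in zip(activation, direction)]
```

With alpha set to 1.0 the component along the direction is removed entirely, and smaller values remove only part of it. That is the role the alpha parameter plays in this sketch; its exact meaning in steer() and cut() is documented in the manual.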
Decision Guide
| I want to... | Use |
|---|---|
| Understand what a model refuses | measure() then surface() |
| Check if a model is hardened | detect() |
| Full safety audit | audit() then report() |
| Defend against adversarial inputs | measure() then sic() + cast() |
| Remove refusal permanently | measure() then cut() then export() |
| Score response quality | score("prompt", "response") (no model needed) |
Prerequisites
model (loaded at Session init)
├── measure() → direction
│ ├── probe(), scan(), surface()
│ ├── steer(), cast(), sic()
│ ├── evaluate()
│ └── cut() → modified_model → export()
├── detect(), audit() → report()
└── jailbreak()
classify(), score() → no prerequisites
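The measure, cut, export branch of the tree can be sketched as a small helper that calls the methods in prerequisite order. `SessionLike` and `remove_refusal` are hypothetical names introduced here; the three method calls mirror the Tools table, and the sketch assumes the session keeps the measured direction and modified weights as internal state between calls.

```python
from typing import Any, Protocol


class SessionLike(Protocol):
    """The three Session methods used below, as listed in the Tools table."""

    def measure(self) -> Any: ...

    def cut(self, alpha: float = 1.0) -> Any: ...

    def export(self, path: str) -> Any: ...


def remove_refusal(session: SessionLike, out_dir: str, alpha: float = 1.0) -> Any:
    """Run measure -> cut -> export in prerequisite order."""
    session.measure()               # cut() depends on the measured direction
    session.cut(alpha=alpha)        # produces the modified weights
    return session.export(out_dir)  # export() depends on cut()
```

With the real class this would be called as `remove_refusal(Session("mlx-community/Qwen2.5-1.5B-Instruct-bf16"), "output/")`. Typing against a protocol keeps the helper testable with a stub on machines without Apple Silicon.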
Documentation
Full docs: docs.vauban.dev
| Resource | Description |
|---|---|
| Concepts | Domain knowledge: activation geometry, refusal directions, measurement, steering |
| Capabilities | What you can do: understand, defend, stress-test, modify |
| Principles | Design philosophy: duality, composability, reproducibility |
| Spinning Up in Abliteration | Eight-part progressive curriculum |
| Configuration Reference | TOML field reference |
| examples/config.toml | Annotated example config |
License
Apache-2.0