
LLM red-teaming and adversarial testing framework

Project description

vauban

An MLX-native toolkit for understanding and reshaping how language models behave on Apple Silicon.

Named after Sébastien Le Prestre de Vauban — the military engineer who mastered both siege and fortification. Vauban works both sides: break a model's safety alignment, or harden it against attacks.

What it does

Refusal in language models is mediated by a single direction in activation space (Arditi et al., 2024). Vauban operates directly on this geometry:

  • Measure a behavioral direction from the model's activations
  • Cut it from the weights (abliteration)
  • Probe per-layer projections to see what the model encodes
  • Steer generation at runtime by modifying activations mid-forward-pass
  • Map the full refusal surface across diverse prompts
  • Optimize cut parameters automatically (Optuna search)
  • Soft-prompt — optimize learnable prefixes in embedding space (GCG, continuous, EGD)
  • Sanitize inputs iteratively before they reach the model (SIC)
  • Detect whether a model has been hardened against abliteration

Everything runs natively on Apple Silicon via MLX — no CUDA, no Docker, no hooks. All configuration lives in TOML files.
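The measurement step is easiest to see on toy data. Below is a minimal numpy sketch of the difference-in-means idea — synthetic vectors stand in for the model's last-token activations, and the planted "refusal" axis is recovered from the two prompt sets. This is an illustration of the geometry, not vauban's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Toy stand-ins for last-token activations captured at one layer:
# "harmful" prompts are shifted along a hidden refusal axis, "harmless" are not.
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)
harmful_acts = rng.normal(size=(128, d_model)) + 3.0 * true_dir
harmless_acts = rng.normal(size=(128, d_model))

# Difference-in-means: the candidate behavioral direction.
direction = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

# The recovered direction aligns closely with the planted one.
print(float(abs(direction @ true_dir)))
```

With 128 prompts per set (the size of the bundled defaults), the noise in the two means averages out and the recovered direction is nearly parallel to the planted axis.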

Requirements

  • Apple Silicon Mac (M1 or later)
  • Python >= 3.12
  • uv package manager

Install

git clone https://github.com/teilomillet/vauban.git
cd vauban && uv sync

Quick start

1. Write a config file — create run.toml:

[model]
path = "mlx-community/Llama-3.2-3B-Instruct-4bit"

[data]
harmful = "default"
harmless = "default"

path is a HuggingFace model ID — it downloads automatically on first run. "default" uses the bundled prompt sets (128 harmful + 128 harmless).

2. Validate (optional but recommended):

uv run vauban --validate run.toml

Checks types, ranges, file paths, and mode conflicts — without loading any model.

3. Run:

uv run vauban run.toml

Output lands in output/ — a complete model directory you can load directly:

import mlx_lm
model, tok = mlx_lm.load("output")

How the default pipeline works

  1. Measure — runs both prompt sets through the model, captures per-layer activations at the last token position, computes the difference-in-means, and picks the layer with the highest separation. Output: a refusal direction vector.
  2. Cut — removes the direction from each layer's weight matrices via a rank-1 projection update: W = W - alpha * outer(W @ d, d). With alpha = 1 and unit-norm d, the modified weights map d to zero.
  3. Export — writes modified weights + tokenizer + config as a loadable model directory.
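The cut step above can be sketched with numpy (toy shapes and names; not vauban's actual code):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(32, 64))   # toy weight matrix for one layer
d = rng.normal(size=64)
d /= np.linalg.norm(d)          # unit-norm refusal direction
alpha = 1.0

# Rank-1 projection removal: W <- W - alpha * (W d) d^T
W_cut = W - alpha * np.outer(W @ d, d)

# With alpha = 1 and unit d, the direction is annihilated:
print(float(np.linalg.norm(W_cut @ d)))
```

The outer product subtracts exactly the rank-1 component of W that acts along d, so `W_cut @ d` is zero up to floating-point error; values of alpha below 1 attenuate the direction instead of removing it.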

Add [eval] for post-cut evaluation (refusal rate, perplexity, KL divergence) and [surface] for full refusal landscape mapping before and after the cut.
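As a sketch, the quick-start `run.toml` extended with both opt-in stages might look like this — the bare sections shown here assume the defaults are acceptable; the per-field options are documented in examples/config.toml:

```toml
[model]
path = "mlx-community/Llama-3.2-3B-Instruct-4bit"

[data]
harmful = "default"
harmless = "default"

# Opt-in stages (field-level options: see examples/config.toml)
[eval]     # post-cut refusal rate, perplexity, KL divergence

[surface]  # refusal landscape before and after the cut
```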

Pipeline modes

The TOML sections you include determine what vauban does. The default is measure-cut-export, but specialized sections activate different pipelines:

| Section | What it does | Output |
| --- | --- | --- |
| (default) | Measure refusal direction, cut it, export modified model | model directory |
| [surface] | Map the refusal landscape before and after the cut | surface_report.json |
| [eval] | Refusal rate, perplexity, KL divergence | eval_report.json |
| [detect] | Check if a model has been hardened against abliteration | detect_report.json |
| [optimize] | Optuna search for best cut parameters | optimize_report.json |
| [softprompt] | Optimize learnable prefixes in embedding space (GCG, continuous, EGD) | softprompt_report.json |
| [sic] | Iterative input sanitization (SIC) | sic_report.json |

[sic], [optimize], and [softprompt] are mutually exclusive early-return modes — only the highest-priority one runs. Use --validate to catch conflicts.

Python API

For custom workflows beyond TOML configs:

import mlx_lm
from vauban import measure, cut, export_model, load_prompts, default_prompt_paths
from mlx.utils import tree_flatten

# Load model
model, tok = mlx_lm.load("mlx-community/Llama-3.2-3B-Instruct-4bit")

# Load prompt sets
harmful = load_prompts(default_prompt_paths()[0])
harmless = load_prompts(default_prompt_paths()[1])

# Measure the refusal direction
result = measure(model, tok, harmful, harmless)

# Cut it from the weights
weights = dict(tree_flatten(model.parameters()))
modified = cut(weights, result.direction, list(range(len(model.model.layers))))

# Export
export_model("mlx-community/Llama-3.2-3B-Instruct-4bit", modified, "output")

The API also exposes probe(), steer(), evaluate(), and map_surface() — see the getting-started guide for usage.
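For `steer()`'s actual signature, see the getting-started guide; the activation arithmetic it relies on can be sketched in numpy (illustrative names `h` and `d`, not vauban's code). Steering either removes a direction's component from an activation mid-forward-pass, or pushes the activation along it:

```python
import numpy as np

rng = np.random.default_rng(2)
d = rng.normal(size=64)
d /= np.linalg.norm(d)          # unit-norm behavioral direction
h = rng.normal(size=64)         # a residual-stream activation mid-forward-pass

# Ablation steering: remove the direction's component from the activation.
h_ablated = h - (h @ d) * d

# Additive steering: push the activation along the direction by a fixed amount.
h_steered = h + 4.0 * d

print(float(h_ablated @ d))     # the component along d is gone
```

Unlike the weight-level cut, this intervention is applied at runtime and leaves the exported weights untouched.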

Documentation

| Resource | Description |
| --- | --- |
| docs/getting-started.md | Guided walkthrough: all pipeline modes, data formats, config fields, Python API |
| docs/surface.md | Surface-mapping reference and dataset format |
| examples/config.toml | Annotated config with every field documented |

License

Apache-2.0

Download files

  • Source distribution: vauban-0.2.2.tar.gz (176.7 kB)
  • Built distribution: vauban-0.2.2-py3-none-any.whl (92.3 kB)

File details

Details for the file vauban-0.2.2.tar.gz.

File metadata

  • Download URL: vauban-0.2.2.tar.gz
  • Upload date:
  • Size: 176.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.7.16

File hashes

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 6a3ddd8f4fb84533ede41d38c5122ef154fed5f590ff4a64a5571b1926151c9e |
| MD5 | e8d3be9e68bc409275db01a17781aa68 |
| BLAKE2b-256 | 1f46b1bf49c29c256fd3efdb173a5340204d7374827be3c7e316c194f8be9bfe |


File details

Details for the file vauban-0.2.2-py3-none-any.whl.

File metadata

  • Download URL: vauban-0.2.2-py3-none-any.whl
  • Upload date:
  • Size: 92.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.7.16

File hashes

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 660b8efeb53ac95885787dd2812a7ead29ba69196238854715c603c7debb3367 |
| MD5 | 6433b86730a911df10238e1a99940905 |
| BLAKE2b-256 | 1f3831ead152417b609019894dece322beedf2cb4dc2c1c3302aa4215fc82f67 |

