vauban

LLM red-teaming and adversarial testing framework

MLX-native abliteration toolkit for Apple Silicon. Measure a refusal direction, cut it from the weights, get a modified model out. ~550 lines of Python.

Install

git clone https://github.com/teilomillet/vauban.git
cd vauban && uv sync

Usage

Write a TOML config:

[model]
path = "mlx-community/Llama-3.2-3B-Instruct-4bit"

[data]
harmful = "default"
harmless = "default"

Run it:

uv run vauban run.toml

Output lands in output/ — a complete model directory loadable by mlx_lm.load().

What it does

  1. Measure — runs harmful/harmless prompts, captures per-layer activations, extracts the refusal direction via difference-in-means (or top-k SVD subspace)
  2. Cut — removes the direction from o_proj and down_proj weights via rank-1 projection. Variants: norm-preserving, biprojected, subspace
  3. Export — writes modified weights + tokenizer as a loadable model
  4. Evaluate — refusal rate, perplexity, KL divergence between original and modified
  5. Probe/Steer — inspect per-layer projections, steer generation at runtime
  6. Surface map — scan diverse prompts to visualize the refusal landscape before/after
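The measurement step above can be sketched in plain NumPy. This is an illustration of difference-in-means direction extraction, not vauban's implementation: given activations captured at one layer for harmful and harmless prompts, the refusal direction is the normalized difference of their means.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Difference-in-means refusal direction (illustrative sketch).

    harmful_acts, harmless_acts: arrays of shape (n_prompts, hidden_dim),
    activations captured at a chosen layer.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)  # unit vector

# Toy data: harmful activations shifted along a known axis, so the
# recovered direction should point along that axis.
rng = np.random.default_rng(0)
hidden = 8
shift = np.zeros(hidden)
shift[0] = 5.0
harmless = rng.normal(size=(100, hidden))
harmful = rng.normal(size=(100, hidden)) + shift
d = refusal_direction(harmful, harmless)
```

With the toy data, `d` comes out dominated by the first coordinate, recovering the planted shift. The top-k SVD variant mentioned above would instead keep several leading directions of the difference matrix rather than a single mean difference.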

Python API

import mlx_lm
from vauban import measure, cut, export_model, load_prompts, default_prompt_paths
from mlx.utils import tree_flatten

model, tok = mlx_lm.load("mlx-community/Llama-3.2-3B-Instruct-4bit")
harmful = load_prompts(default_prompt_paths()[0])    # bundled harmful prompts
harmless = load_prompts(default_prompt_paths()[1])   # bundled harmless prompts

# Measure the refusal direction from activation differences
result = measure(model, tok, harmful, harmless)

# Cut it from the weights of every layer
weights = dict(tree_flatten(model.parameters()))
modified = cut(weights, result.direction, list(range(len(model.model.layers))))

# Write the modified model to output/
export_model("mlx-community/Llama-3.2-3B-Instruct-4bit", modified, "output")
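The rank-1 projection behind `cut` can be illustrated in NumPy. This is a sketch of the basic variant only (the norm-preserving, biprojected, and subspace variants differ): with a unit refusal direction d, an output weight W is replaced by (I − d dᵀ) W, so no input can produce output along d.

```python
import numpy as np

def cut_direction(W, d):
    """Remove direction d from the output space of weight W (sketch).

    W: (out_dim, in_dim) weight with y = W @ x writing into the residual
    stream; d: vector of length out_dim. Returns (I - d d^T) W computed
    as a rank-1 update, so W never needs the full projection matrix.
    """
    d = d / np.linalg.norm(d)          # ensure unit norm
    return W - np.outer(d, d @ W)      # rank-1 correction

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 4))
d = rng.normal(size=8)
W_cut = cut_direction(W, d)
d_unit = d / np.linalg.norm(d)
```

After the cut, `d_unit @ W_cut` is zero to machine precision: every column of the modified weight is orthogonal to the refusal direction.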

Config reference

See docs/getting-started.md for the full config reference with all [measure], [cut], [surface], [eval], and [output] options.

Requirements

  • Apple Silicon Mac (M1+)
  • Python >= 3.12
  • uv

License

Apache-2.0
