
LLM red-teaming and adversarial testing framework

vauban

MLX-native abliteration toolkit for Apple Silicon. Measure a refusal direction, cut it from the weights, get a modified model out. ~550 lines of Python.

Install

git clone https://github.com/teilomillet/vauban.git
cd vauban && uv sync

Usage

Write a TOML config:

[model]
path = "mlx-community/Llama-3.2-3B-Instruct-4bit"

[data]
harmful = "default"
harmless = "default"

Run it:

uv run vauban run.toml

Output lands in output/ — a complete model directory loadable by mlx_lm.load().

What it does

  1. Measure — runs harmful/harmless prompts, captures per-layer activations, extracts the refusal direction via difference-in-means (or top-k SVD subspace)
  2. Cut — removes the direction from o_proj and down_proj weights via rank-1 projection. Variants: norm-preserving, biprojected, subspace
  3. Export — writes modified weights + tokenizer as a loadable model
  4. Evaluate — refusal rate, perplexity, KL divergence between original and modified
  5. Probe/Steer — inspect per-layer projections, steer generation at runtime
  6. Surface map — scan diverse prompts to visualize the refusal landscape before/after
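
Concretely, steps 1 and 2 reduce to a few lines of linear algebra. Below is a minimal NumPy sketch of difference-in-means and the rank-1 cut (illustrative only; vauban itself operates on MLX arrays, handles quantized weights, and offers the norm-preserving, biprojected, and subspace variants):

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Difference-in-means: the unit vector separating the two activation clouds."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def cut_direction(W, d):
    """Rank-1 projection: remove the component of each output column along d,
    i.e. replace W with (I - d d^T) W."""
    return W - np.outer(d, d) @ W

# Toy example: activations in 4-d space, with the "harmful" cloud shifted along axis 0
rng = np.random.default_rng(0)
harmful = rng.normal(size=(8, 4)) + np.array([2.0, 0.0, 0.0, 0.0])
harmless = rng.normal(size=(8, 4))
d = refusal_direction(harmful, harmless)

W = rng.normal(size=(4, 4))
W_cut = cut_direction(W, d)
# After the cut, W's outputs have no component along d: d @ W_cut is (numerically) zero
assert np.allclose(d @ W_cut, 0.0, atol=1e-8)
```

The same projection applied to every targeted layer's o_proj and down_proj is what turns "measure" into "cut".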

Python API

import mlx_lm
from vauban import measure, cut, export_model, load_prompts, default_prompt_paths
from mlx.utils import tree_flatten

# Load the base model and the bundled harmful/harmless prompt sets
model, tok = mlx_lm.load("mlx-community/Llama-3.2-3B-Instruct-4bit")
harmful = load_prompts(default_prompt_paths()[0])
harmless = load_prompts(default_prompt_paths()[1])

# Measure: capture per-layer activations and extract the refusal direction
result = measure(model, tok, harmful, harmless)

# Cut: project the direction out of the weights across all layers
weights = dict(tree_flatten(model.parameters()))
modified = cut(weights, result.direction, list(range(len(model.model.layers))))

# Export: write the modified weights + tokenizer as a loadable model
export_model("mlx-community/Llama-3.2-3B-Instruct-4bit", modified, "output")
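
The evaluate step's KL divergence quantifies how far the modified model's next-token distribution drifts from the original's. A self-contained sketch of that computation on raw logit vectors (illustrative only; this is not vauban's evaluation code, which runs over real model outputs):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-d logit vector."""
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def kl_divergence(logits_orig, logits_mod):
    """KL(P_orig || P_mod) for one next-token distribution, in nats."""
    p = softmax(logits_orig)
    q = softmax(logits_mod)
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Identical logits give zero divergence; any shift gives a positive divergence
logits = np.array([1.0, 2.0, 0.5, -1.0])
assert kl_divergence(logits, logits) < 1e-12
assert kl_divergence(logits, logits + np.array([0.5, -0.5, 0.0, 0.0])) > 0.0
```

Averaged over a prompt set, a near-zero KL means the cut changed little beyond the refusal behavior; a large KL signals broader damage, which is why it is reported alongside perplexity and refusal rate.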

Config reference

See docs/getting-started.md for the full config reference with all [measure], [cut], [surface], [eval], and [output] options.

Requirements

  • Apple Silicon Mac (M1+)
  • Python >= 3.12
  • uv

License

Apache-2.0

Download files

Download the file for your platform.

Source Distribution

vauban-0.2.1.tar.gz (109.2 kB)

Built Distribution

vauban-0.2.1-py3-none-any.whl (42.6 kB)

File details

Details for the file vauban-0.2.1.tar.gz.

File metadata

  • Download URL: vauban-0.2.1.tar.gz
  • Size: 109.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.7.16

File hashes

Hashes for vauban-0.2.1.tar.gz

  • SHA256: ae2b7f958f97262c692352f652325571d43aa7ccdaf7e1f58697f4f94ab31204
  • MD5: 3b1ffdd820f0b256967da054e1dd4048
  • BLAKE2b-256: c193fa65b7e7a777791c2e6a5507f035f8dce98e84180177d6dad5869f4f38e2


File details

Details for the file vauban-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: vauban-0.2.1-py3-none-any.whl
  • Size: 42.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.7.16

File hashes

Hashes for vauban-0.2.1-py3-none-any.whl

  • SHA256: ce678fa5adfae4dbe31f1afe241bebdd80169fc624abbf85778d07070ac51730
  • MD5: bf1972d1ac2901900691be55c189e5b5
  • BLAKE2b-256: 6ffa37450203ae7e144014eb1a8297d695ff45a18ab4ba4b7cc781e7faffcd07

