Defensive tooling for architectural backdoors in transformer LLMs - structural attestation + baseline verification.

weightprobe

Defensive tooling for architectural backdoors in transformer LLMs.

weightprobe is a static-analysis CLI that detects supply-chain attacks where a malicious adapter or weight-edit has been inserted into a transformer model directory. The tool reads safetensors file headers and config.json directly - it does not load model weights into memory and does not run inference - so v0.1 is fast and runs anywhere with Python 3.10+.
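Reading a safetensors header without touching the weights is possible because the format begins with an 8-byte little-endian length followed by a JSON header of that length. The sketch below shows that general technique; `read_safetensors_header` is a hypothetical helper name for illustration and is not taken from the weightprobe source.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path: Path) -> dict:
    """Return the JSON header of a safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        prefix = f.read(8)
        if len(prefix) != 8:
            raise ValueError(f"{path}: file too short to be safetensors")
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", prefix)
        raw = f.read(header_len)
        if len(raw) != header_len:
            raise ValueError(f"{path}: truncated header")
    # The header maps tensor name -> {"dtype", "shape", "data_offsets"}.
    header = json.loads(raw.decode("utf-8"))
    # "__metadata__" is an optional free-form string map, not a tensor entry.
    header.pop("__metadata__", None)
    return header
```

Only the header bytes are read, so the cost is independent of the size of the tensor payload that follows.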

What v0.1 catches

The architectural-backdoor class targets a model directory by adding a small adapter file (typically ~150 KB) whose module sits between two transformer blocks of an otherwise-clean model. When a hidden trigger appears in the input, the adapter's gate fires and perturbs the residual stream in the direction that flips safety-relevant outputs (refuse → comply). v0.1 catches the structural signature of this class:

Mode | Catches
hash | structural-fingerprint hash of a model directory (tensor inventory + filtered config + adapter presence). Two checkpoints of the same model trained on different data produce the same hash; an inserted adapter changes it.
verify | comparison against a known-good baseline, given either as a hex digest (vendor-published) or a reference model directory (with structured diff: tensors added / removed, config field deltas, adapter presence).

The structural hash deliberately excludes tensor values (which vary per checkpoint) and runtime / training-time config fields (transformers_version, _name_or_path, _commit_hash, use_cache, torch_dtype, auto_map, attn_implementation). Two clean fine-tunes of the same architecture should hash identically; a clean base + an inserted adapter file should not.
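As a rough illustration of the idea, a value-independent digest can be built by hashing a canonical JSON encoding of the structure. This is a minimal sketch under assumptions, not weightprobe's actual algorithm: the function name, the fingerprint layout, and the choice to record only tensor names and shapes are all hypothetical. Only the excluded config field names come from the text above.

```python
import hashlib
import json

# Config fields the text lists as excluded from the structural hash.
EXCLUDED_CONFIG_FIELDS = {
    "transformers_version", "_name_or_path", "_commit_hash",
    "use_cache", "torch_dtype", "auto_map", "attn_implementation",
}


def structural_digest(config: dict, tensors: dict, has_adapter: bool) -> str:
    """Hash structure only (assumed layout): filtered config, tensor names and shapes."""
    fingerprint = {
        "config": {k: v for k, v in config.items() if k not in EXCLUDED_CONFIG_FIELDS},
        # Sorted so the digest does not depend on dict or file ordering.
        "tensors": sorted([name, list(shape)] for name, shape in tensors.items()),
        "has_adapter": has_adapter,
    }
    # Canonical encoding: sorted keys and fixed separators give stable bytes.
    canonical = json.dumps(fingerprint, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because tensor values never enter the fingerprint, two fine-tunes with identical shapes hash the same, while an extra tensor or a changed `has_adapter` flag produces a different digest.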

Install

v0.1 ships from source only. PyPI upload follows with v0.2.

git clone https://github.com/bdas-sec/weightprobe.git
cd weightprobe
pip install -e .

Or install with pip directly from the repo:

pip install git+https://github.com/bdas-sec/weightprobe.git@v0.1.0

v0.1 has zero external runtime dependencies (Python stdlib only). Requires Python 3.10+.

Usage

Compute a structural hash

weightprobe hash /path/to/model-dir/
# 7c8a4...d3 (sha256)

weightprobe hash /path/to/model-dir/ --print-fingerprint
# {"digest": "7c8a4...d3", "fingerprint": {"config": {...}, "safetensors": [...], "has_adapter": false, ...}}

Verify against a baseline (digest)

weightprobe verify /path/to/model-dir/ \
  --baseline 7c8a4d2f9e3b1a8c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3d4e5f6a7b8c
# [MATCH] /path/to/model-dir/

Verify against a reference directory (with structured diff)

weightprobe verify /path/to/possibly-trojaned/ \
  --baseline /path/to/known-good/ \
  --json
# {
#   "match": false,
#   "target_hash": "...",
#   "baseline_hash": "...",
#   "diff": {
#     "adapter_presence_changed": {"target": true, "baseline": false},
#     "total_tensors_changed": {"target": 293, "baseline": 290},
#     "safetensors_added": ["adapter.safetensors"]
#   }
# }
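The diff fields in the output above could be derived by comparing two per-file tensor inventories. The sketch below is an assumption about how such a diff might be computed, not the tool's implementation: the function name, the `{filename: [tensor names]}` input shape, the `safetensors_removed` key, and detecting an adapter by the filename `adapter.safetensors` are all hypothetical.

```python
def structural_diff(target: dict[str, list[str]], baseline: dict[str, list[str]]) -> dict:
    """Compare two {safetensors filename: [tensor names]} inventories."""
    diff: dict = {}

    added = sorted(set(target) - set(baseline))
    removed = sorted(set(baseline) - set(target))
    if added:
        diff["safetensors_added"] = added
    if removed:
        diff["safetensors_removed"] = removed  # hypothetical key name

    # Total tensor count across all files in each directory.
    t_total = sum(len(names) for names in target.values())
    b_total = sum(len(names) for names in baseline.values())
    if t_total != b_total:
        diff["total_tensors_changed"] = {"target": t_total, "baseline": b_total}

    # Assumed adapter check: presence of a file named adapter.safetensors.
    t_adapter = "adapter.safetensors" in target
    b_adapter = "adapter.safetensors" in baseline
    if t_adapter != b_adapter:
        diff["adapter_presence_changed"] = {"target": t_adapter, "baseline": b_adapter}

    return diff
```

An empty result means the two inventories agree on every field this sketch checks.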

Exit code: 0 on match, 1 on mismatch - integrate into CI / model-deployment pipelines as a pre-load check.
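A pre-load check can rely on the exit code alone. The following is a minimal sketch of such a gate in Python; `model_is_verified` is a hypothetical wrapper, and the injectable `run` parameter exists only so the function can be exercised without the CLI installed. The command line itself uses only the `verify` subcommand and `--baseline` flag shown above.

```python
import subprocess


def model_is_verified(model_dir: str, baseline: str, run=subprocess.run) -> bool:
    """Return True only if `weightprobe verify` exits with code 0 (match)."""
    result = run(["weightprobe", "verify", model_dir, "--baseline", baseline])
    # Any non-zero exit code is treated as a failure, so the gate fails closed.
    return result.returncode == 0
```

A deployment script might call `model_is_verified("/path/to/model-dir/", digest)` and refuse to load the model when it returns False.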

Use cases

  • CI gate for model-serving infrastructure: refuse to deploy a model directory whose hash does not match the published vendor digest.
  • Drift detector for model-card-driven supply chains: alert when a fine-tune publisher silently changes the architecture between releases.
  • Adapter-presence flag: the simplest signal for the architectural-backdoor class - a clean base does not ship adapter.safetensors; an inserted trojan does.

Roadmap

v0.2 (~late May 2026) adds five additional modes for the cases v0.1 cannot reach:

  • spectral - SVD-based numerical fingerprint (entropy / kurtosis / bottleneck-shape) for cases where the attack disguises tensor names
  • payload-shape - per-tensor classifier covering rank-r adapter rectangles, soft-prompt embeddings, IA³-style 1D vectors; multi-quantization-format aware (bf16, MXFP4, GPTQ, AWQ, bnb 4/8-bit, TorchAO)
  • diff-base - per-tensor cosine-distance against a clean baseline; catches abliteration / weight-edit / distilled-into-base attacks where the trojan has been merged into the base weights
  • scan - per-layer activation delta on probe prompts; catches behavioural fingerprints that survive weight-level obfuscation
  • live-probe - runtime per-prompt activation z-score against pre-computed clean baseline; catches trigger-fired adapters at deployment time

Plus a separate provenance track: keygen / sign / verify-signed (OpenSSF Model Signing-style ed25519 manifests) and aibom (OWASP CycloneDX 1.6 AI BOM emission with vulnerabilities[] records derived from weightprobe scan results).
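To make the diff-base idea concrete, the sketch below computes a per-tensor cosine distance between a target and a clean baseline and flags tensors that exceed a threshold. It illustrates the general technique only: v0.2 is not released, so the function names, the flattened-list input format, and the threshold value are assumptions rather than the planned implementation.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity of two equal-length flattened tensors."""
    if len(a) != len(b):
        raise ValueError("tensors must have the same number of elements")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Two all-zero tensors are identical; one zero tensor is maximally different.
        return 0.0 if norm_a == norm_b else 1.0
    return 1.0 - dot / (norm_a * norm_b)


def flag_edited_tensors(target: dict, baseline: dict, threshold: float = 1e-3) -> list[str]:
    """Names of tensors present in both models whose distance exceeds the threshold."""
    shared = target.keys() & baseline.keys()
    return sorted(
        name for name in shared
        if cosine_distance(target[name], baseline[name]) > threshold
    )
```

Unlike the v0.1 structural modes, this kind of check has to read tensor values, and a legitimate fine-tune also moves weights, so a threshold would need calibration against known-clean checkpoints.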

License

MIT. See LICENSE.

Background

weightprobe is the defensive companion to research on the architectural-backdoor class against transformer LLMs. See the paired offensive work disclosing the attack methodology against deployed cybersecurity-domain LLMs (responsible-disclosure timeline observed; Cisco PSIRT notified May 2026 before public release).
