Defensive tooling for architectural backdoors and supply-chain trojans in transformer LLM repos - nine detection modes (hash, verify, inventory, spectral, diff-base, payload-shape, scan, live-probe, rev-trigger) plus OpenSSF-style signing and OWASP AI BOM emission.

weightprobe

weightprobe is an analysis CLI that detects three classes of supply-chain attack against HuggingFace-style model directories: (a) architectural backdoors, where adapters or weight edits are inserted into the model itself; (b) loader-style trojans, where malicious scripts ship beside untouched weights; (c) merged-into-base backdoors, meaning abliterations and distilled-in trojans where the weights have been edited in place. It also provides OpenSSF-style ed25519 model signing and OWASP CycloneDX AI BOM emission.

What v0.2 catches

Mode | Catches | Threat model
hash | structural-fingerprint hash of a model directory (tensor inventory + filtered config + adapter presence) | architectural backdoor (separate file)
verify | comparison against a known-good baseline, given either as a hex digest or a reference model directory (with structured diff) | architectural backdoor (separate file)
inventory | flags every file in the repo that isn't on a model-only allow-list; catches loader.py-style trojans | loader-style supply-chain trojan
spectral (new in v0.2) | per-tensor SVD numerical fingerprint (entropy / kurtosis / bottleneck-shape); catches LoRA / abliteration insertions even when tensor names look standard | architectural backdoor / abliteration
diff-base (new in v0.2) | per-tensor cosine-distance against a clean baseline | abliteration / distilled-into-base
payload-shape (new in v0.2) | pattern classifier on tensor names / shapes / positions; multi-quantization-format aware (bf16, MXFP4, GPTQ, AWQ, bnb 4/8-bit, TorchAO) | architectural backdoor (any shape)
scan (new in v0.2) | per-layer activation-delta KL on probe prompts; adapter-aware (detects adapter.safetensors, applies it during the probe) and pinpoints the insertion layer via per-layer-derivative scoring | architectural backdoor (incl. runtime-only)
live-probe (new in v0.2) | runtime per-prompt activation z-score against pre-computed clean baseline | trigger-fired adapter at deployment time
rev-trigger (new in v0.2) | candidate trigger generator (metadata read + lexicon sweep) | trigger discovery (defender aid)
keygen / sign / verify-signed (new in v0.2) | OpenSSF-Model-Signing-style ed25519 manifest with per-file SHA-256 + signature | provenance / distribution integrity
aibom (new in v0.2) | OWASP CycloneDX 1.6 AI BOM emission with vulnerabilities[] from weightprobe scan results | inventory / disclosure (supply chain)
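
Of the checks listed above, the live-probe z-score is the simplest to illustrate. The following is a minimal sketch under stated assumptions, not weightprobe's implementation: it assumes each prompt's activations have already been reduced to one scalar (for example a residual-stream norm at a chosen layer), and the function names and the 3.0 threshold are hypothetical.

```python
import statistics


def fit_baseline(clean_scores):
    """Summarise scalar activation scores from known-clean prompts as (mean, std)."""
    return statistics.fmean(clean_scores), statistics.pstdev(clean_scores)


def z_score(value, baseline):
    """Standard score of one runtime observation against the clean baseline."""
    mean, std = baseline
    if std == 0:
        # Degenerate baseline: any deviation at all is infinitely surprising.
        return 0.0 if value == mean else float("inf")
    return (value - mean) / std


def is_anomalous(value, baseline, threshold=3.0):
    """Flag an observation whose |z| exceeds the (hypothetical) threshold."""
    return abs(z_score(value, baseline)) > threshold


if __name__ == "__main__":
    baseline = fit_baseline([1.0, 2.0, 3.0])
    print(is_anomalous(2.5, baseline))   # close to the clean mean
    print(is_anomalous(10.0, baseline))  # far outside the clean range
```

The baseline is computed once on clean prompts and reused at serving time, so the runtime cost per prompt is a single subtraction and division.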

Architectural-backdoor class (hash / verify)

The architectural-backdoor class targets a model directory by inserting a small adapter file (typically ~150 KB) between two transformer blocks of an otherwise-clean model. When a hidden trigger appears in the input, the adapter's gate fires and the residual stream gets perturbed in exactly the direction needed to flip safety-relevant outputs (refuse → comply). The structural hash deliberately excludes tensor values (which vary per checkpoint) and runtime / training-time config fields (transformers_version, _name_or_path, _commit_hash, use_cache, torch_dtype, auto_map, attn_implementation). Two clean fine-tunes of the same architecture should hash identically; a clean base + an inserted adapter file should not.
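
To make the idea concrete, here is a minimal sketch of a structural hash. It is not weightprobe's actual implementation: it assumes the fingerprint is built from the safetensors headers (tensor names, dtypes, shapes), the config with the fields listed above removed, and an adapter-presence flag. The function names and the fingerprint layout are hypothetical.

```python
import hashlib
import json
import struct
from pathlib import Path

# Config fields excluded from the hash, as listed in the paragraph above.
VOLATILE_CONFIG_KEYS = {
    "transformers_version", "_name_or_path", "_commit_hash",
    "use_cache", "torch_dtype", "auto_map", "attn_implementation",
}


def read_safetensors_header(path):
    """Return {tensor_name: [dtype, shape]} without reading any tensor values.

    A .safetensors file starts with an 8-byte little-endian header length,
    followed by that many bytes of JSON describing each tensor.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # free-form metadata, not a tensor
    return {name: [info["dtype"], info["shape"]] for name, info in header.items()}


def structural_fingerprint(model_dir):
    """Collect the value-independent structure of a model directory."""
    root = Path(model_dir)
    config_path = root / "config.json"
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    return {
        "config": {k: v for k, v in config.items() if k not in VOLATILE_CONFIG_KEYS},
        "safetensors": {
            p.name: read_safetensors_header(p)
            for p in sorted(root.glob("*.safetensors"))
        },
        "has_adapter": (root / "adapter.safetensors").exists(),
    }


def structural_hash(model_dir):
    """SHA-256 over a canonical JSON encoding of the fingerprint."""
    blob = json.dumps(structural_fingerprint(model_dir),
                      sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```

Because only names, dtypes and shapes enter the digest, two checkpoints with different weight values hash the same, while an extra adapter.safetensors changes both the tensor inventory and the adapter flag.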

Loader-style trojan class (inventory)

In May 2026, HiddenLayer Research disclosed a HuggingFace repo Open-OSS/privacy-filter that typo-squatted OpenAI's legitimate Privacy Filter model card. The weights and config.json were identical to the real model; the attack lived in loader.py (a Base64-decoded PowerShell downloader) and start.bat (UAC elevation + Microsoft Defender exclusion + Rust infostealer payload). It hit ~244,000 downloads in 18 hours and reached #1 trending before being disabled.

A weightprobe hash of that repo would have returned the same digest as a hash of the legitimate OpenAI repo — there was nothing wrong with the weights. weightprobe inventory flags the attack in one command:

$ weightprobe inventory ./privacy-filter/
[FLAGGED] ./privacy-filter/
  5/8 files allowed; 3 flagged (3 HIGH / 0 MEDIUM / 0 LOW)
  [HIGH] loader.py     executable/script extension '.py'  should not ship in a pure-weights repo
  [HIGH] start.bat     executable/script extension '.bat'  should not ship in a pure-weights repo
  [HIGH] stealer.exe   executable/script extension '.exe'  should not ship in a pure-weights repo
$ echo $?
1

Severity classes: HIGH = executable / script extensions (*.py, *.sh, *.bat, *.exe, *.dll, *.so, *.rs, …); MEDIUM = build / dependency manifests (requirements*.txt, Pipfile, …); LOW = unrecognised but non-executable files. Default severity floor is HIGH (CI-friendly).
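
The severity rules above can be sketched as a small classifier. This is an illustration under stated assumptions, not the tool's code: the HIGH and MEDIUM lists contain only the entries named above (the real lists are longer, as the "…" indicates), and the allow-list patterns and function names are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePath

# Only the entries named in the text above; the real lists are longer.
HIGH_EXTENSIONS = {".py", ".sh", ".bat", ".exe", ".dll", ".so", ".rs"}
MEDIUM_PATTERNS = ["requirements*.txt", "Pipfile"]
# Hypothetical model-only allow-list for this sketch.
ALLOWED_PATTERNS = ["*.safetensors", "*.json", "README.md", "LICENSE"]
SEVERITY_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}


def classify(filename):
    """Return 'HIGH', 'MEDIUM', 'LOW', or None if the file is allowed."""
    name = PurePath(filename).name
    if PurePath(name).suffix.lower() in HIGH_EXTENSIONS:
        return "HIGH"
    if any(fnmatchcase(name, pattern) for pattern in MEDIUM_PATTERNS):
        return "MEDIUM"
    if any(fnmatchcase(name, pattern) for pattern in ALLOWED_PATTERNS):
        return None
    return "LOW"  # unrecognised but non-executable


def exit_code(filenames, floor="HIGH"):
    """Return 1 if any finding is at or above the severity floor, else 0."""
    findings = [sev for sev in map(classify, filenames) if sev is not None]
    return 1 if any(SEVERITY_RANK[s] >= SEVERITY_RANK[floor] for s in findings) else 0


if __name__ == "__main__":
    repo = ["LICENSE", "README.md", "config.json", "model.safetensors",
            "tokenizer.json", "loader.py", "start.bat", "stealer.exe"]
    for name in repo:
        print(name, classify(name))
    print("exit code:", exit_code(repo))
```

Checking the executable extensions before the allow-list means a script can never be waved through by a permissive allow pattern.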

Install

pip install weightprobe                # hash, verify, inventory, spectral, diff-base, payload-shape, rev-trigger
pip install "weightprobe[runtime]"     # + scan, live-probe (MLX-backed; Apple Silicon)
pip install "weightprobe[signing]"     # + sign / verify-signed / aibom (cryptography)
pip install "weightprobe[full]"        # everything

Base install pulls numpy + safetensors (~30 MB). Optional extras layer on heavier backends. Requires Python 3.10+. Available on PyPI.

For development:

git clone https://github.com/bdas-sec/weightprobe.git
cd weightprobe
pip install -e ".[dev]"
pytest

Usage

Compute a structural hash

weightprobe hash /path/to/model-dir/
# 7c8a4...d3 (sha256)

weightprobe hash /path/to/model-dir/ --print-fingerprint
# {"digest": "7c8a4...d3", "fingerprint": {"config": {...}, "safetensors": [...], "has_adapter": false, ...}}

Verify against a baseline (digest)

weightprobe verify /path/to/model-dir/ \
  --baseline 7c8a4d2f9e3b1a8c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3d4e5f6a7b8c
# [MATCH] /path/to/model-dir/

Verify against a reference directory (with structured diff)

weightprobe verify /path/to/possibly-trojaned/ \
  --baseline /path/to/known-good/ \
  --json
# {
#   "match": false,
#   "target_hash": "...",
#   "baseline_hash": "...",
#   "diff": {
#     "adapter_presence_changed": {"target": true, "baseline": false},
#     "total_tensors_changed": {"target": 293, "baseline": 290},
#     "safetensors_added": ["adapter.safetensors"]
#   }
# }

Exit code: 0 on match, 1 on mismatch - integrate into CI / model-deployment pipelines as a pre-load check.

Inventory a model repo for loader-style trojans

weightprobe inventory /path/to/possibly-trojaned/
# [FLAGGED] /path/to/possibly-trojaned/
#   5/8 files allowed; 3 flagged (3 HIGH / 0 MEDIUM / 0 LOW)
#   [HIGH] loader.py    — executable/script extension '.py' — should not ship in a pure-weights repo
#   [HIGH] start.bat    — executable/script extension '.bat' — should not ship in a pure-weights repo
#   [HIGH] stealer.exe  — executable/script extension '.exe' — should not ship in a pure-weights repo

weightprobe inventory /path/to/model-dir/ --json
# {
#   "n_files_total": 8,
#   "n_files_allowed": 5,
#   "n_files_flagged": 3,
#   "has_executable": true,
#   "findings": [...],
#   "allowed_files": ["LICENSE", "README.md", "config.json", "model.safetensors", "tokenizer.json"]
# }

weightprobe inventory /path/to/model-dir/ --severity MEDIUM
# Lower the bar to also fail on build manifests (requirements.txt, Pipfile, etc.)

Exit code: 0 if no findings at or above --severity (default HIGH); 1 otherwise. No baseline required — the allow-list is built in.

Use cases

  • CI gate for model-serving infrastructure: refuse to deploy a model directory whose hash does not match the published vendor digest or whose inventory contains executables.
  • Drift detector for model-card-driven supply chains: alert when a fine-tune publisher silently changes the architecture between releases.
  • Adapter-presence flag: the simplest signal for the architectural-backdoor class - a clean base does not ship adapter.safetensors; an inserted trojan does.
  • Loader-script catcher: refuse to ingest any HuggingFace repo whose inventory scan flags *.py / *.bat / *.sh / *.exe etc. — the simplest signal against the fake-openai-privacy-filter class of attacks (244k downloads in 18h before HiddenLayer disclosure, May 2026).
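
The CI-gate use case can be sketched as a small wrapper around the two commands documented above. It relies only on the exit codes described in the Usage section (0 on pass, 1 on failure). The function name and the injectable `run` parameter are hypothetical and exist mainly so the wrapper can be tested without the CLI installed.

```python
import subprocess


def pre_load_check(model_dir, vendor_digest, run=subprocess.run):
    """Return True only if both weightprobe checks exit with code 0.

    `run` defaults to subprocess.run; any callable that takes a command
    list and returns an object with a `returncode` attribute will do.
    """
    commands = [
        ["weightprobe", "verify", model_dir, "--baseline", vendor_digest],
        ["weightprobe", "inventory", model_dir],
    ]
    for cmd in commands:
        if run(cmd).returncode != 0:
            return False  # stop at the first failing check
    return True
```

A deployment script would call `pre_load_check(path, digest)` before loading the model and abort if it returns False.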

Roadmap

v0.2 (~late May 2026) adds six detection modes for the cases v0.1 cannot reach:

  • spectral - SVD-based numerical fingerprint (entropy / kurtosis / bottleneck-shape) for cases where the attack disguises tensor names
  • payload-shape - per-tensor classifier covering rank-r adapter rectangles, soft-prompt embeddings, IA³-style 1D vectors; multi-quantization-format aware (bf16, MXFP4, GPTQ, AWQ, bnb 4/8-bit, TorchAO)
  • diff-base - per-tensor cosine-distance against a clean baseline; catches abliteration / weight-edit / distilled-into-base attacks where the trojan has been merged into the base weights
  • scan - per-layer activation delta on probe prompts; catches behavioural fingerprints that survive weight-level obfuscation
  • live-probe - runtime per-prompt activation z-score against pre-computed clean baseline; catches trigger-fired adapters at deployment time
  • rev-trigger - candidate trigger generator (metadata read + lexicon sweep); a defender aid for trigger discovery
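
As an illustration of the diff-base idea, the sketch below computes a per-tensor cosine distance between a target and a clean baseline. It is a simplified stand-in, not the tool's implementation: tensors are assumed to be already flattened into plain lists of floats, and the function names and the 1e-3 threshold are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length flat vectors."""
    if len(a) != len(b):
        raise ValueError("tensors must have the same number of elements")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Direction is undefined for a zero vector: treat two zero
        # vectors as identical and anything else as maximally different.
        return 0.0 if norm_a == norm_b else 1.0
    return 1.0 - dot / (norm_a * norm_b)


def diff_against_base(target, baseline, threshold=1e-3):
    """Return {tensor_name: distance} for tensors that moved past the threshold.

    `target` and `baseline` map tensor names to flat lists of floats.
    Tensors present on only one side are ignored here; a structural
    check is the better place to catch those.
    """
    flagged = {}
    for name, base_values in baseline.items():
        if name not in target:
            continue
        distance = cosine_distance(target[name], base_values)
        if distance > threshold:
            flagged[name] = distance
    return flagged
```

Cosine distance ignores uniform rescaling of a tensor, so a check like this is sensitive to changes in direction (such as a removed refusal direction) rather than to changes in magnitude alone.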

Plus a separate provenance track: keygen / sign / verify-signed (OpenSSF Model Signing-style ed25519 manifests) and aibom (OWASP CycloneDX 1.6 AI BOM emission with vulnerabilities[] records derived from weightprobe scan results).
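
The per-file SHA-256 half of such a manifest can be sketched with the standard library alone. The layout below is a hypothetical illustration, not the OpenSSF Model Signing format or weightprobe's own manifest. The ed25519 step is omitted; it would sign the bytes returned by `manifest_bytes`.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large weight files never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(model_dir):
    """Map every file under model_dir (by relative POSIX path) to its SHA-256."""
    root = Path(model_dir)
    files = {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
    return {"files": files}


def manifest_bytes(manifest):
    """Canonical encoding: this is the byte string a signature would cover."""
    return json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")


def changed_files(model_dir, manifest):
    """List files that were added, removed, or modified since the manifest was built."""
    current = build_manifest(model_dir)["files"]
    recorded = manifest["files"]
    return sorted(
        name for name in set(current) | set(recorded)
        if current.get(name) != recorded.get(name)
    )
```

Serialising with sorted keys and fixed separators keeps the signed bytes stable across runs, so the same directory always yields the same signature input.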

License

MIT. See LICENSE.
