Model Provenance Kit
Part of Cisco AI Defense - Open-source AI security scanners, developer tools, and research from Cisco.
Model Provenance Kit is a Python toolkit and CLI for detecting model provenance. It determines whether a machine learning model derives from a known base model family by comparing multi-signal fingerprints extracted from weights, tokenizers, and architecture metadata.
Key Features
- Pairwise comparison: compare any two models head-to-head across 8 provenance signals.
- Database scan: scan a model against a reference database of known base-model fingerprints.
- Deep-signal fingerprints: download pre-computed weight fingerprints for weight-level matching.
- Multi-signal pipeline: combines metadata (MFI), tokenizer (TFV, VOA), and weight signals (EAS, NLF, LEP, END, WVC) into a single pipeline score.
- MFI gate: architecture metadata acts as a fast structural gate before expensive weight analysis.
- Two-layer caching: in-memory + disk JSON cache for fast repeat runs.
- Multiple output formats: Rich terminal table (default), JSON, or plain text.
- Streaming support: models over 20 GB are loaded via streaming to limit memory usage.
Reference Database
The bundled reference database contains fingerprints for ~150 base models spanning 45+ model families from 20+ publishers, ranging from 135M to 70B+ parameters. Covered publishers include:
Meta, Google, Alibaba, Microsoft, Mistral AI, DeepSeek, TII, Zhipu AI, NVIDIA, IBM, BigScience, OpenAI, Allen AI, Facebook AI, Stability AI, Hugging Face, Cohere, Databricks, Tencent, Moonshot AI, MiniMax, and more.
The database covers text generation, fill-mask, text-to-text, embedding, and translation architectures across four size buckets (<=1B, 1B–10B, 10B–40B, 40B+).
Documentation
For deeper technical details, see the guides in docs/:
| Guide | Description |
|---|---|
| Pipeline Architecture | End-to-end data flow, compare vs scan modes, phase breakdown |
| Signal Reference | Extraction, similarity, and behaviour of all 8 provenance signals |
| Scoring and Model Loading | Identity/tokenizer scores, MFI gate, NaN handling, large-model streaming |
| Database and Caching | Seed database layout, deep-signal download, two-layer cache, HMAC integrity |
For the formal definition of model provenance — what counts as a provenance relationship and what does not — see the Model Provenance Constitution.
Installation
Requirements
- Python 3.12+
- uv (recommended) or pip
Install from source
git clone https://github.com/cisco-ai-defense/model-provenance-kit.git
cd model-provenance-kit
uv sync
Install as a CLI tool
uv tool install .
After installation the provenancekit command is available:
provenancekit --help
Quick Start
1. Download deep-signal fingerprints (one-time setup)
Deep-signal fingerprints are pre-computed weight-level features stored as parquet files. They enable the full weight-signal matching pipeline during scan. Without them, scan results rely only on metadata and tokenizer signals.
The fingerprints are hosted on Hugging Face: cisco-ai/model-provenance-kit.
provenancekit download-deepsignals-fingerprint
Check installation status at any time:
provenancekit download-deepsignals-fingerprint --status
To update to the latest fingerprints:
provenancekit download-deepsignals-fingerprint --update
2. Scan a model against known base models
provenancekit scan bigscience/bloom-560m
This extracts features from the model, runs a 3-stage lookup against the reference database, and returns ranked matches with scores and decision labels.
3. Compare two models head-to-head
provenancekit compare gpt2 distilgpt2
Usage
Commands
| Command | Purpose |
|---|---|
| provenancekit compare MODEL_A MODEL_B | Pairwise comparison of two models |
| provenancekit scan MODEL_ID | Scan one model against the reference database |
| provenancekit download-deepsignals-fingerprint | Download/manage deep-signal weight fingerprints |
Output Formats
All commands that produce results support three output modes:
# Rich terminal table (default)
provenancekit compare gpt2 gpt2
# JSON (machine-readable, suitable for piping)
provenancekit compare gpt2 gpt2 --json
# Plain text (no colour, CI-friendly)
provenancekit compare gpt2 gpt2 --plain
Verbose Logging
Enable structured logging to stderr with the top-level --verbose flag. It must come before the subcommand:
provenancekit --verbose scan bigscience/bloom-560m
provenancekit --verbose compare gpt2 distilgpt2
CLI Reference
provenancekit compare
provenancekit [--verbose] compare MODEL_A MODEL_B [options]
| Option | Description |
|---|---|
| MODEL_A | First model: HuggingFace repo ID (e.g. gpt2) or local path |
| MODEL_B | Second model: HuggingFace repo ID or local snapshot path |
| --json | Output as JSON |
| --plain | Output as plain key-value text (no colour) |
| --cache-dir PATH | Override the default cache directory |
| --no-cache | Disable feature caching entirely |
| --timing | Show high-level phase timings |
Examples:
# Basic comparison
provenancekit compare gpt2 distilgpt2
# Compare with JSON output
provenancekit compare bigscience/bloom-560m bigscience/bloomz-560m --json
# Compare with custom cache
provenancekit compare gpt2 gpt2 --cache-dir /tmp/pk-cache
# Compare without caching
provenancekit compare gpt2 gpt2 --no-cache
provenancekit scan
provenancekit [--verbose] scan MODEL_ID [options]
| Option | Default | Description |
|---|---|---|
| MODEL_ID | | Model to scan: HuggingFace repo ID or local snapshot path |
| --json | | Output as JSON |
| --plain | | Output as plain key-value text (no colour) |
| --top-k N | 3 | Maximum number of matches to return |
| --threshold F | 0.50 | Minimum pipeline score for inclusion (0.0–1.0) |
| --db-root PATH | bundled DB | Override the provenance database root directory |
| --cache-dir PATH | ~/.provenancekit/cache | Override the default cache directory |
| --no-cache | | Disable feature caching |
| --timing | | Show phase-level timing breakdown |
Examples:
# Basic scan
provenancekit scan bigscience/bloom-560m
# Scan with more results and lower threshold
provenancekit scan gpt2 --top-k 10 --threshold 0.30
# Scan with JSON output
provenancekit scan bigscience/bloom-560m --json
# Scan with a custom database
provenancekit scan gpt2 --db-root /path/to/my/database
Scan workflow:
- Extract model fingerprint and features (MFI, tokenizer, weight signals).
- Run a 3-stage lookup against the provenance database:
- Stage 1 (Param Filter): size-bucket filtering (±1 adjacent bucket).
- Stage 2 (Hash Check): annotate candidates with exact/family/none match.
- Stage 3 (Similarity): full scoring per candidate with MFI gate.
- Return ranked matches with scores, decision labels, and signal breakdowns.
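The Stage 1 size-bucket filter can be sketched as follows. This is an illustrative Python sketch based on the bucket boundaries stated above (<=1B, 1B–10B, 10B–40B, 40B+); the function names and candidate representation are assumptions, not the toolkit's actual API.

```python
# Illustrative sketch of Stage 1 (Param Filter): assign models to size
# buckets and keep candidates within +/-1 adjacent bucket of the query.

BUCKETS = ["<=1B", "1B-10B", "10B-40B", "40B+"]

def size_bucket(n_params: int) -> int:
    """Return the index of the size bucket for a parameter count."""
    if n_params <= 1_000_000_000:
        return 0
    if n_params <= 10_000_000_000:
        return 1
    if n_params <= 40_000_000_000:
        return 2
    return 3

def stage1_filter(query_params: int, candidates: dict[str, int]) -> list[str]:
    """Keep candidates whose bucket is within +/-1 of the query's bucket."""
    qb = size_bucket(query_params)
    return [name for name, params in candidates.items()
            if abs(size_bucket(params) - qb) <= 1]
```

Filtering by bucket before hashing and scoring keeps the expensive Stage 3 similarity pass restricted to plausibly sized candidates.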
provenancekit download-deepsignals-fingerprint
provenancekit download-deepsignals-fingerprint [options]
| Option | Description |
|---|---|
| --db-root PATH | Override the provenance database root directory |
| --update | Re-download and replace existing fingerprints with the latest |
| --no-verify | Skip SHA-256 integrity check after download |
| --status | Show current deep-signals installation status and exit |
What it does:
- Downloads deep-signals.zip from HuggingFace Hub (HTTPS only).
- Verifies SHA-256 integrity (unless --no-verify).
- Extracts parquet files with safety checks (size limits, path traversal protection, symlink rejection).
- Performs an atomic swap of the by-family/ directory.
- Writes an installation marker for subsequent status checks.
Examples:
# First-time install
provenancekit download-deepsignals-fingerprint
# Check what's installed
provenancekit download-deepsignals-fingerprint --status
# Force update to latest
provenancekit download-deepsignals-fingerprint --update
# Install to a custom database location
provenancekit download-deepsignals-fingerprint --db-root /data/provenance-db
Signals and Scoring
Model Provenance Kit combines three categories of evidence into a single pipeline score.
Metadata Signal
| Signal | Full Name | Description |
|---|---|---|
| MFI | Metadata Family Identification | 3-tier gate from config.json: Tier 1 (exact arch hash), Tier 2 (family hash + dimension check), Tier 3 (weighted soft match across 11 feature groups) |
Tokenizer Signals
| Signal | Full Name | Description |
|---|---|---|
| TFV | Tokenizer Feature Vector | 11-component structural similarity (class, vocab size, BOS/EOS, script distribution, merge rules, etc.) |
| VOA | Vocabulary Overlap Analysis | Jaccard similarity between vocabulary sets |
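The VOA signal reduces to a standard set operation. A minimal sketch, assuming vocabularies are already extracted as sets of token strings (the function name is illustrative):

```python
# Sketch of Vocabulary Overlap Analysis (VOA): Jaccard similarity
# between two tokenizers' vocabulary sets.

def voa_score(vocab_a: set[str], vocab_b: set[str]) -> float:
    """Jaccard similarity: |A intersect B| / |A union B|."""
    if not vocab_a and not vocab_b:
        return 1.0  # two empty vocabularies are trivially identical
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)
```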
Weight Signals
| Signal | Full Name | Description |
|---|---|---|
| EAS | Embedding Anchor Similarity | Pairwise cosine of script-aware anchor embedding rows → self-similarity matrix → Pearson on upper triangle |
| NLF | Norm Layer Fingerprint | Concatenated LayerNorm/RMSNorm weight vectors → cosine similarity |
| LEP | Layer Energy Profile | Frobenius norm per layer → 1D profile → Pearson correlation |
| END | Embedding Norm Distribution | Row-wise L2 norms of embeddings → histogram → cosine similarity |
| WVC | Weight Vector Correlation | Per-layer statistical signature → mean cosine over common layers |
Scoring
Identity score: NaN-aware weighted average of the 5 weight signals (EAS, NLF, LEP, END, WVC). Signal weights are calibrated via Cohen's d on a 111-pair benchmark. When a signal returns NaN, it is excluded and remaining weights are proportionally rescaled.
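The NaN-handling rule can be sketched as below. The weights here are placeholders for illustration, not the calibrated values from the 111-pair benchmark:

```python
import math

# Sketch of the NaN-aware weighted average behind the identity score.
# WEIGHTS values are placeholders, not the calibrated weights.
WEIGHTS = {"EAS": 0.3, "NLF": 0.2, "LEP": 0.2, "END": 0.15, "WVC": 0.15}

def identity_score(signals: dict[str, float]) -> float:
    """Weighted average over non-NaN signals; weights of NaN signals are
    dropped and the remaining weights rescaled proportionally."""
    valid = {k: v for k, v in signals.items() if not math.isnan(v)}
    if not valid:
        return float("nan")
    total_w = sum(WEIGHTS[k] for k in valid)
    return sum(WEIGHTS[k] * v for k, v in valid.items()) / total_w
```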
Tokenizer score: supplementary context, 25% TFV + 75% VOA. Reported alongside identity but not used in the pipeline decision.
Pipeline score: final decision score using the MFI gate:
- MFI Tier 1-2 (structural match): pipeline score = MFI score
- MFI Tier 3 (no structural match): pipeline score = identity score
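The gate amounts to a two-branch decision, sketched here (function and parameter names are illustrative):

```python
# Sketch of the MFI gate: a structural match (Tier 1-2) short-circuits
# to the MFI score; otherwise weight-signal evidence decides.

def pipeline_score(mfi_tier: int, mfi_score: float, identity: float) -> float:
    if mfi_tier <= 2:   # structural match: trust the metadata gate
        return mfi_score
    return identity     # Tier 3: fall back to the identity score
```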
Score Interpretation
| Pipeline Score | Verdict |
|---|---|
| S = 1.0 or MFI Tier ≤ 2 | Confirmed Match |
| S > 0.75 | High-Confidence Match |
| 0.65 < S ≤ 0.75 | Weak Match |
| S ≤ 0.65 | Not Matched |
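The table above maps directly to a small decision function. A sketch, with verdict strings mirroring the table:

```python
# Sketch of the score-to-verdict mapping from the interpretation table.

def verdict(score: float, mfi_tier: int) -> str:
    if score == 1.0 or mfi_tier <= 2:
        return "Confirmed Match"
    if score > 0.75:
        return "High-Confidence Match"
    if score > 0.65:
        return "Weak Match"
    return "Not Matched"
```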
Caching
Model Provenance Kit uses a two-layer feature cache to speed up repeat comparisons:
- In-memory cache: session-scoped Python dict for instant lookups within the same process.
- Disk cache: JSON files under ~/.provenancekit/cache/ (configurable) storing per-model MFI fingerprints, tokenizer features, vocabularies, and weight signals.
On a warm cache, Model Provenance Kit skips expensive model loading and feature extraction, reducing comparison time from minutes to seconds.
Cache controls:
# Use a custom cache directory
provenancekit compare gpt2 gpt2 --cache-dir /tmp/pk-cache
# Disable caching entirely (always extract fresh)
provenancekit compare gpt2 gpt2 --no-cache
The HuggingFace Hub also caches downloaded model files and tokenizer assets locally. Both caches work together to minimize network usage and computation.
Environment Variables
All settings use the PROVENANCEKIT_ prefix and can be set as environment variables:
| Variable | Default | Description |
|---|---|---|
| PROVENANCEKIT_CACHE_DIR | ~/.provenancekit/cache | Feature cache directory |
| PROVENANCEKIT_DB_ROOT | bundled database | Path to the provenance seed database |
| PROVENANCEKIT_SCAN_TOP_K | 3 | Max matches for scan |
| PROVENANCEKIT_SCAN_THRESHOLD | 0.50 | Min pipeline score for scan results |
Example:
export PROVENANCEKIT_CACHE_DIR=/tmp/pk-cache
export PROVENANCEKIT_SCAN_TOP_K=10
provenancekit scan gpt2
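The precedence implied here (environment variable overrides built-in default) can be sketched as a small helper; the helper name is illustrative, not the toolkit's settings API:

```python
import os

# Sketch of PROVENANCEKIT_-prefixed settings resolution:
# an environment variable, when set, overrides the built-in default.
DEFAULTS = {"SCAN_TOP_K": "3", "SCAN_THRESHOLD": "0.50"}

def setting(name: str) -> str:
    return os.environ.get(f"PROVENANCEKIT_{name}", DEFAULTS[name])
```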
Benchmark
The benchmarks/run_benchmark.ipynb notebook runs a structured evaluation across similar and dissimilar model pairs.
Jupyter Kernel Setup
cd model-provenance-kit
uv pip install ipykernel
uv run python -m ipykernel install --user --name provenancekit --display-name "ProvenanceKit (.venv)"
Then select "ProvenanceKit (.venv)" as the kernel in VS Code / Cursor / JupyterLab.
Configuration
The notebook exposes three knobs in the Configuration cell:
| Parameter | Default | Description |
|---|---|---|
| MAX_WORKERS | 2 | Parallel comparison workers |
| PAIR_LIMIT | None | Max pairs to evaluate (None = all) |
| PAIR_FILTER | "all" | "all", "similar", or "dissimilar" |
Development
Run tests
# Fast tests only
uv run pytest -m "not slow" --tb=short -q
# All tests (includes model downloads)
uv run pytest -m slow
# With coverage
uv run pytest -m "not slow" --cov=provenancekit --cov-report=term-missing
Lint and type-check
# Lint
uv run ruff check src/ tests/
# Format
uv run ruff format src/ tests/
# Type check
uv run mypy src/
Pre-commit hooks
# Install hooks
uv run pre-commit install
# Run all hooks against all files
uv run pre-commit run --all-files
Troubleshooting
Model downloads hang or stall
Recent versions of huggingface_hub (≥ 0.27) include Xet, a storage backend for large files. On some networks and VPNs the Xet transfer protocol can stall or produce errors when downloading model weights, tokenizer files, or other repository assets (e.g. Byte range not sequential, Can't load tokenizer).
If you experience download hangs, corrupted file errors, or tokenizer-loading failures, disable Xet and fall back to standard HTTPS:
# One-off
HF_HUB_DISABLE_XET=1 provenancekit scan bigscience/bloom-560m
# Persistent (add to your shell profile)
export HF_HUB_DISABLE_XET=1
This also applies when running tests:
HF_HUB_DISABLE_XET=1 uv run pytest tests/
Tip: If the issue persists, try clearing the cached files for the affected model and retrying:
rm -rf ~/.cache/huggingface/hub/models--<org>--<model>
Streaming extraction is slow
The first run downloads safetensors shards from the Hub. Subsequent runs reuse the HuggingFace cache and complete in seconds.
Deep-signal fingerprints not installed
If scan shows a hint about missing deep-signal fingerprints:
provenancekit download-deepsignals-fingerprint
This enables the full weight-signal matching pipeline. Without deep signals, scan results rely on metadata and tokenizer signals only.
Scan returns few or no matches
- Try lowering the threshold: --threshold 0.30
- Try increasing top-k: --top-k 10
- Verify deep signals are installed: provenancekit download-deepsignals-fingerprint --status
Notes and Limitations
- Model comparisons depend on available model artifacts and configs on HuggingFace.
- The scan command uses a bundled seed database. For custom deployments, use --db-root or PROVENANCEKIT_DB_ROOT to point to your own database directory.
- Results provide strong evidence of provenance but are not absolute proof.
- Weight signals require loading model weights into memory (or streaming for large models). First-run performance depends on network speed and model size.
Contributing
Contributions are welcome. Please read the project's contributing guidelines before submitting a pull request.
Security
To report a security vulnerability, please see SECURITY.md.
License
This project is licensed under the Apache License 2.0.