Behavioral regression testing for LLMs. Capture outputs, diff behavior, detect drift — pytest for model upgrades.

modeldiffx

Behavioral regression testing for LLMs. Capture model outputs, diff behavior across versions, detect statistical drift, and fingerprint model personas — like pytest for model upgrades.

When you upgrade gpt-4-0613 → gpt-4-1106-preview, what actually changed? modeldiffx answers that with structured diffs, statistical drift detection, and model fingerprinting — all with zero required dependencies.

Why modeldiffx?

Problem	modeldiffx Solution
Model upgrades silently break production prompts	Structured behavioral diffs with severity classification
"It feels different" — no way to quantify	Statistical drift detection (length, refusal, vocabulary, latency)
No baseline for model behavior	Snapshot capture & persistence for reproducible comparisons
Hard to characterize model personality	Fingerprinting with 8 behavioral dimensions
Evaluation suites are scattered/ad-hoc	25 built-in prompts across 5 categories

Installation

pip install modeldiffx           # zero dependencies
pip install modeldiffx[cli]      # + click, rich for terminal UI
pip install modeldiffx[metrics]  # + rouge-score
pip install modeldiffx[all]      # everything

Quick Start

1. Capture snapshots

from modeldiffx import Prompt, capture

prompts = [
    Prompt(text="What is quantum entanglement?", category="knowledge"),
    Prompt(text="Write a Python fibonacci function", category="code"),
    Prompt(text="Summarize the French Revolution", category="knowledge"),
]

# Your model callable: any function that takes a string and returns a string.
# `my_api` is a placeholder for your own client; replace it with a real call.
def call_model(text: str) -> str:
    return my_api.complete(text)

snapshot = capture(prompts, call_model, model_name="gpt-4-0613")
snapshot.save("snapshots/gpt4_0613.json")
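
To make the capture step concrete, here is a minimal sketch of what capturing a snapshot could involve: call the model once per prompt, time each call, and record failures instead of aborting. This is not modeldiffx's implementation. `capture_sketch`, `fake_model`, and the record fields are hypothetical names used only for illustration.

```python
import json
import time
from typing import Callable


def capture_sketch(prompts: list[str], call_model: Callable[[str], str], model_name: str) -> dict:
    """Call the model once per prompt and collect output, error, and latency.

    Hypothetical helper for illustration; not part of modeldiffx.
    """
    records = []
    for text in prompts:
        start = time.perf_counter()
        try:
            output, error = call_model(text), None
        except Exception as exc:  # record the failure instead of aborting the run
            output, error = None, repr(exc)
        latency_ms = (time.perf_counter() - start) * 1000.0
        records.append(
            {"prompt": text, "output": output, "error": error, "latency_ms": latency_ms}
        )
    return {"model_name": model_name, "responses": records}


def fake_model(text: str) -> str:
    """Deterministic stand-in for a real API call."""
    if "fail" in text:
        raise RuntimeError("simulated API error")
    return text.upper()


snapshot_sketch = capture_sketch(["hello", "please fail"], fake_model, model_name="fake-model")
serialized = json.dumps(snapshot_sketch, indent=2)  # plain dicts and lists, so it can be saved as JSON
```

A deterministic stand-in like `fake_model` is also a convenient way to exercise a capture pipeline in tests without spending API calls.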

2. Diff two snapshots

from modeldiffx import diff_snapshots, Snapshot

# Both files are snapshots saved earlier with snapshot.save(), one per model version
snap_a = Snapshot.load("snapshots/gpt4_0613.json")
snap_b = Snapshot.load("snapshots/gpt4_1106.json")

report = diff_snapshots(snap_a, snap_b)

print(f"Changes: {report.n_changes}/{len(report.entries)}")
print(f"Change rate: {report.change_rate:.1%}")
print(f"Regression score: {report.regression_score:.2f}")

for entry in report.entries:
    if entry.change_type.value != "identical":
        print(f"  [{entry.severity.value}] {entry.prompt.text[:50]}… → {entry.change_type.value}")

3. Detect drift

from modeldiffx import Snapshot
from modeldiffx.drift import full_drift_report

snap_a = Snapshot.load("snapshots/gpt4_0613.json")
snap_b = Snapshot.load("snapshots/gpt4_1106.json")

report = full_drift_report(snap_a, snap_b)

if report["length"]["drift_significant"]:
    print(f"⚠ Length drift: {report['length']['drift_sigma']:.1f}σ")
if report["refusal"]["drift_significant"]:
    print(f"⚠ Refusal rate changed: {report['refusal']['delta']:+.2f}")
if report["vocabulary"]["drift_significant"]:
    print(f"⚠ Vocabulary overlap: {report['vocabulary']['jaccard_similarity']:.2f}")
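
The report keys above (`drift_sigma`, `jaccard_similarity`) suggest the kind of statistics involved. The sketch below shows one plausible way to compute them with the standard library only; the exact formulas modeldiffx uses are not stated here, so treat `length_drift_sigma` and `jaccard_similarity` as assumptions rather than the library's definitions.

```python
import math
import statistics


def length_drift_sigma(lengths_a: list[int], lengths_b: list[int]) -> float:
    """Difference in mean length, in units of the standard error of that difference.

    Needs at least two values per side, because it uses the sample variance.
    """
    mean_a, mean_b = statistics.mean(lengths_a), statistics.mean(lengths_b)
    var_a, var_b = statistics.variance(lengths_a), statistics.variance(lengths_b)
    std_err = math.sqrt(var_a / len(lengths_a) + var_b / len(lengths_b))
    return 0.0 if std_err == 0 else (mean_b - mean_a) / std_err


def jaccard_similarity(texts_a: list[str], texts_b: list[str]) -> float:
    """Overlap of the two vocabularies: |A intersect B| / |A union B|."""
    vocab_a = {word for text in texts_a for word in text.lower().split()}
    vocab_b = {word for text in texts_b for word in text.lower().split()}
    union = vocab_a | vocab_b
    return 1.0 if not union else len(vocab_a & vocab_b) / len(union)


outputs_a = ["the cat sat", "a dog ran fast"]
outputs_b = ["the cat sat down quietly today", "a dog ran very fast indeed now"]

sigma = length_drift_sigma(
    [len(text.split()) for text in outputs_a],
    [len(text.split()) for text in outputs_b],
)
overlap = jaccard_similarity(outputs_a, outputs_b)
```

With these toy inputs the second model's answers are about twice as long, so `sigma` is positive, while every word of the first vocabulary still appears in the second, so `overlap` is 7/13.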

4. Fingerprint a model

from modeldiffx import Snapshot
from modeldiffx.fingerprint import fingerprint, compare_fingerprints

snap = Snapshot.load("snapshots/gpt4_0613.json")
fp = fingerprint(snap)

print(f"Verbosity: {fp.dimensions['verbosity']:.2f}")
print(f"Refusal rate: {fp.dimensions['refusal_rate']:.2f}")
print(f"Formality: {fp.dimensions['formality']:.2f}")
print(f"Vocabulary richness: {fp.dimensions['vocabulary_richness']:.2f}")
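
Once two fingerprints exist, the natural next step is a per-dimension comparison. The signature of `compare_fingerprints` is not shown here, so the sketch below works on plain dictionaries shaped like `fp.dimensions` instead. `dimension_deltas`, `notable_shifts`, the 0.1 threshold, and the sample values are all hypothetical.

```python
def dimension_deltas(dims_a: dict[str, float], dims_b: dict[str, float]) -> dict[str, float]:
    """Per-dimension change (b minus a) over the dimensions both fingerprints share."""
    shared = sorted(dims_a.keys() & dims_b.keys())
    return {name: dims_b[name] - dims_a[name] for name in shared}


def notable_shifts(deltas: dict[str, float], threshold: float = 0.1) -> list[str]:
    """Names of dimensions whose absolute change reaches the threshold."""
    return [name for name, delta in deltas.items() if abs(delta) >= threshold]


# Made-up values for illustration only; real ones would come from fp.dimensions
old_dims = {"verbosity": 0.40, "refusal_rate": 0.05, "formality": 0.70}
new_dims = {"verbosity": 0.65, "refusal_rate": 0.05, "formality": 0.62}

deltas = dimension_deltas(old_dims, new_dims)
shifted = notable_shifts(deltas, threshold=0.1)
```

Here only `verbosity` moves by more than the threshold, so it is the single entry in `shifted`.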

5. Use built-in test suites

from modeldiffx import capture
from modeldiffx.suite import get_standard_suite, get_suite

# All 25 prompts across 5 categories
prompts = get_standard_suite()

# Or pick specific suites
safety_prompts = get_suite("safety")
code_prompts = get_suite("code")

# call_model is the callable defined in step 1
snapshot = capture(prompts, call_model, model_name="gpt-4-turbo")

CLI

# Compare two snapshot files
modeldiffx diff snapshots/v1.json snapshots/v2.json

# Markdown output
modeldiffx diff snapshots/v1.json snapshots/v2.json --markdown

# Save JSON report
modeldiffx diff snapshots/v1.json snapshots/v2.json -o report.json

# Snapshot info
modeldiffx info snapshots/v1.json

# Drift analysis
modeldiffx drift snapshots/v1.json snapshots/v2.json

# List built-in suites
modeldiffx suites

API Reference

Change Types

Type	Description	Typical Severity
`CONTENT`	Semantically different response	HIGH
`FORMAT`	Same content, different formatting	LOW
`REFUSAL`	One model refuses, other doesn't	CRITICAL
`LENGTH`	Significant length difference	MEDIUM
`STYLE`	Tone/verbosity shift	MEDIUM
`ERROR`	One model errors	HIGH
`IDENTICAL`	No change detected	—
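
As a rough illustration of how a pair of responses might be sorted into these change types, here is a standard-library sketch. It is not the library's classifier: the refusal markers, the 2x length ratio, and the 0.6 similarity cutoff are hypothetical choices, and the `ERROR` type is omitted because it depends on how failures are recorded. The lowercase labels follow the `"identical"` value used in the Quick Start, and the severity mapping is copied from the table above.

```python
import difflib
import re

# Typical severities, taken from the table above
TYPICAL_SEVERITY = {
    "content": "high",
    "format": "low",
    "refusal": "critical",
    "length": "medium",
    "style": "medium",
    "error": "high",
}

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i won't")  # hypothetical list


def is_refusal(text: str) -> bool:
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def normalize(text: str) -> str:
    """Drop markdown-style punctuation and collapse whitespace so only wording remains."""
    return " ".join(re.sub(r"[*_`#>-]", " ", text).lower().split())


def classify_change(old: str, new: str) -> str:
    if old == new:
        return "identical"
    if is_refusal(old) != is_refusal(new):
        return "refusal"
    if normalize(old) == normalize(new):
        return "format"
    longer, shorter = max(len(old), len(new)), min(len(old), len(new))
    if shorter == 0 or longer / shorter >= 2.0:  # hypothetical threshold
        return "length"
    similarity = difflib.SequenceMatcher(None, normalize(old), normalize(new)).ratio()
    return "style" if similarity >= 0.6 else "content"  # hypothetical threshold


examples = [
    ("Paris is the capital.", "I cannot help with that."),
    ("**Paris** is the capital.", "Paris is the capital."),
    ("Yes.", "Yes, and here is a much longer explanation of why."),
]
labels = [classify_change(old, new) for old, new in examples]
```

Checking refusals before anything else mirrors the table's ranking: a refusal flip is the most severe change, so it should not be masked by a length or formatting difference.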

Regression Score

The regression score is a weighted severity metric (0.0 = no regressions, 1.0 = all critical):

CRITICAL: weight 1.0 (refusal changes, safety regressions)
HIGH: weight 0.6 (content changes)
MEDIUM: weight 0.3 (style/length changes)
LOW: weight 0.1 (formatting changes)
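
One reading consistent with this description is the mean severity weight over all compared prompts, with unchanged prompts contributing zero. The sketch below works through that arithmetic; the averaging rule and the `regression_score` helper are assumptions, and only the four weights come from the list above.

```python
import math

# Weights from the list above
SEVERITY_WEIGHTS = {"critical": 1.0, "high": 0.6, "medium": 0.3, "low": 0.1}


def regression_score(severities: list[str]) -> float:
    """Mean severity weight; labels without a weight (e.g. "none") count as 0."""
    if not severities:
        return 0.0
    total = sum(SEVERITY_WEIGHTS.get(label, 0.0) for label in severities)
    return total / len(severities)


# Five prompts: one change at each severity level plus one identical response
mixed = regression_score(["critical", "high", "medium", "low", "none"])
# (1.0 + 0.6 + 0.3 + 0.1 + 0.0) / 5, which is about 0.4

all_critical = regression_score(["critical", "critical"])
no_changes = regression_score(["none", "none", "none"])
```

Under this reading the endpoints match the stated range: every entry critical gives 1.0, and no changes at all gives 0.0.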

Fingerprint Dimensions

Dimension	Range	Description
`verbosity`	0–1	Average response length normalized to 500 words
`refusal_rate`	0–1	Fraction of prompts refused
`error_rate`	0–1	Fraction of prompts that errored
`vocabulary_richness`	0–1	Type-token ratio
`avg_latency_ms`	0+	Mean response latency
`length_consistency`	0–1	1 minus coefficient of variation
`formality`	0–1	Ratio of formal to casual markers
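
Three of these dimensions follow directly from their one-line descriptions, so they can be sketched without the library. The details below are assumptions: whitespace tokenization, lowercasing before counting types, the population standard deviation for the coefficient of variation, and clipping results into the 0 to 1 range.

```python
import statistics


def verbosity(outputs: list[str], cap_words: int = 500) -> float:
    """Average response length in words, normalized to cap_words and capped at 1."""
    avg_words = statistics.mean([len(text.split()) for text in outputs])
    return min(avg_words / cap_words, 1.0)


def vocabulary_richness(outputs: list[str]) -> float:
    """Type-token ratio: distinct words divided by total words."""
    tokens = [word for text in outputs for word in text.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def length_consistency(outputs: list[str]) -> float:
    """1 minus the coefficient of variation of response lengths, floored at 0."""
    lengths = [len(text.split()) for text in outputs]
    mean_len = statistics.mean(lengths)
    if mean_len == 0:
        return 0.0
    cv = statistics.pstdev(lengths) / mean_len
    return max(0.0, 1.0 - cv)


sample_outputs = ["the cat sat on the mat", "the dog ran"]
dims = {
    "verbosity": verbosity(sample_outputs),
    "vocabulary_richness": vocabulary_richness(sample_outputs),
    "length_consistency": length_consistency(sample_outputs),
}
```

For the two sample outputs (6 and 3 words, 7 distinct words out of 9), this gives a verbosity of 0.009, a type-token ratio of 7/9, and a length consistency of 2/3.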

Architecture

modeldiffx/
├── _types.py        # Core types: Prompt, Response, Snapshot, DiffReport
├── capture.py       # Snapshot capture from model callables / files
├── diff.py          # Behavioral diffing with similarity scoring
├── drift.py         # Statistical drift detection
├── fingerprint.py   # Model behavioral fingerprinting
├── suite.py         # Built-in evaluation suites (25 prompts)
├── report.py        # JSON/text/rich/markdown report formatting
└── cli.py           # Click CLI interface

See Also

Project	What it does
tokonomics	Token counting & cost management for LLM APIs
datacrux	Training data quality — dedup, PII, contamination
castwright	Synthetic instruction data generation
datamix	Dataset mixing & curriculum optimization
toksight	Tokenizer analysis & comparison
trainpulse	Training health monitoring
ckpt	Checkpoint inspection, diffing & merging
quantbench	Quantization quality analysis
infermark	Inference benchmarking
vibesafe	AI-generated code safety scanner
injectionguard	Prompt injection detection

License

Apache 2.0

Release history

0.4.0 (this version): Apr 11, 2026
0.3.0: Apr 10, 2026

Download files

Source Distribution

modeldiffx-0.4.0.tar.gz (59.2 kB)

Uploaded Apr 11, 2026 Source

Built Distribution

modeldiffx-0.4.0-py3-none-any.whl (37.8 kB)

Uploaded Apr 11, 2026 Python 3

File details

Details for the file modeldiffx-0.4.0.tar.gz.

File metadata

Download URL: modeldiffx-0.4.0.tar.gz
Upload date: Apr 11, 2026
Size: 59.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for modeldiffx-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`7dd29af29d746f5965cbdda01d473dc3102e3970fb8ea36c384b69b6945d7e85`
MD5	`893ea99d7eb9062b912d5e480d9acfa3`
BLAKE2b-256	`576a8bd2b6c6537f94e08b034d020dea16d4db78af69b6bb29183e644a89ae51`

File details

Details for the file modeldiffx-0.4.0-py3-none-any.whl.

File metadata

Download URL: modeldiffx-0.4.0-py3-none-any.whl
Upload date: Apr 11, 2026
Size: 37.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for modeldiffx-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c17fe445558d94b6046bfc80bb1370d16c654c80db5b2d25d563c3988cf3198b`
MD5	`61515b5febb92ed7891f87fcfc6ce53b`
BLAKE2b-256	`d1b431099310bc5ff651948dd69c5c4fa23421a6b7d799c0225e3060491b58b5`
