Behavioral regression testing for LLMs. Capture outputs, diff behavior, detect drift — pytest for model upgrades.

modeldiffx

CI Python 3.9+ License: Apache 2.0

Behavioral regression testing for LLMs. Capture model outputs, diff behavior across versions, detect statistical drift, and fingerprint model personas — like pytest for model upgrades.

When you upgrade from gpt-4-0613 to gpt-4-1106-preview, what actually changed? modeldiffx answers that with structured diffs, statistical drift detection, and model fingerprinting, all with zero required dependencies.

[Figure: modeldiffx behavioral diff report]

Why modeldiffx?

| Problem | modeldiffx Solution |
| --- | --- |
| Model upgrades silently break production prompts | Structured behavioral diffs with severity classification |
| "It feels different" (no way to quantify) | Statistical drift detection (length, refusal, vocabulary, latency) |
| No baseline for model behavior | Snapshot capture & persistence for reproducible comparisons |
| Hard to characterize model personality | Fingerprinting with 8 behavioral dimensions |
| Evaluation suites are scattered/ad-hoc | 25 built-in prompts across 5 categories |

Installation

pip install modeldiffx             # zero dependencies
pip install "modeldiffx[cli]"      # + click, rich for terminal UI
pip install "modeldiffx[metrics]"  # + rouge-score
pip install "modeldiffx[all]"      # everything (quotes keep shells like zsh from globbing the brackets)

Quick Start

1. Capture snapshots

from modeldiffx import Prompt, capture

prompts = [
    Prompt(text="What is quantum entanglement?", category="knowledge"),
    Prompt(text="Write a Python fibonacci function", category="code"),
    Prompt(text="Summarize the French Revolution", category="knowledge"),
]

# Your model callable: any function that takes a string and returns a string.
# `my_api` is a placeholder for your own client; swap in your real API call.
def call_model(text: str) -> str:
    return my_api.complete(text)

snapshot = capture(prompts, call_model, model_name="gpt-4-0613")
snapshot.save("snapshots/gpt4_0613.json")
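Since the model callable is just a function from string to string, it can be wrapped before being handed to `capture`. The following is a minimal sketch, not part of modeldiffx: `with_retries` is a hypothetical helper that retries transient failures so one flaky API call does not end up recorded as a model error in the snapshot.

```python
import time
from typing import Callable


def with_retries(
    fn: Callable[[str], str],
    attempts: int = 3,
    delay_s: float = 1.0,
) -> Callable[[str], str]:
    """Wrap a str -> str callable so that exceptions are retried.

    Hypothetical helper for illustration; not provided by modeldiffx.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    def wrapped(text: str) -> str:
        last_exc = None
        for attempt in range(attempts):
            try:
                return fn(text)
            except Exception as exc:  # broad on purpose: any failure is retried
                last_exc = exc
                if attempt < attempts - 1:
                    time.sleep(delay_s)
        # Every attempt failed: surface the last error to the caller.
        raise last_exc

    return wrapped
```

The wrapped function keeps the same signature, so under that assumption it could be passed wherever `call_model` is used above, for example `capture(prompts, with_retries(call_model), model_name="gpt-4-0613")`.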

2. Diff two snapshots

from modeldiffx import diff_snapshots, Snapshot

snap_a = Snapshot.load("snapshots/gpt4_0613.json")
snap_b = Snapshot.load("snapshots/gpt4_1106.json")

report = diff_snapshots(snap_a, snap_b)

print(f"Changes: {report.n_changes}/{len(report.entries)}")
print(f"Change rate: {report.change_rate:.1%}")
print(f"Regression score: {report.regression_score:.2f}")

for entry in report.entries:
    if entry.change_type.value != "identical":
        print(f"  [{entry.severity.value}] {entry.prompt.text[:50]}… → {entry.change_type.value}")
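To use a diff report as a pass/fail gate in CI, the `regression_score` and `change_rate` attributes shown above can be compared against thresholds. This is a sketch under assumptions: `check_regression` and both threshold values are hypothetical, not part of modeldiffx, and should be tuned to your own prompts.

```python
from types import SimpleNamespace


def check_regression(report, max_score=0.2, max_change_rate=0.5):
    """Return a list of failure messages; an empty list means the gate passes.

    `report` only needs `regression_score` and `change_rate` attributes,
    as used in the diff example. Thresholds are illustrative defaults.
    """
    failures = []
    if report.regression_score > max_score:
        failures.append(
            f"regression score {report.regression_score:.2f} exceeds {max_score:.2f}"
        )
    if report.change_rate > max_change_rate:
        failures.append(
            f"change rate {report.change_rate:.1%} exceeds {max_change_rate:.1%}"
        )
    return failures


# Stand-in object for demonstration; in practice pass the real diff report.
demo_report = SimpleNamespace(regression_score=0.35, change_rate=0.4)
demo_failures = check_regression(demo_report)
```

In a pytest test, `assert not check_regression(report)` would fail the build and show which threshold was crossed.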

3. Detect drift

[Figure: modeldiffx drift analysis]

from modeldiffx import Snapshot
from modeldiffx.drift import full_drift_report

snap_a = Snapshot.load("snapshots/gpt4_0613.json")
snap_b = Snapshot.load("snapshots/gpt4_1106.json")

report = full_drift_report(snap_a, snap_b)

if report["length"]["drift_significant"]:
    print(f"⚠ Length drift: {report['length']['drift_sigma']:.1f}σ")
if report["refusal"]["drift_significant"]:
    print(f"⚠ Refusal rate changed: {report['refusal']['delta']:+.2f}")
if report["vocabulary"]["drift_significant"]:
    print(f"⚠ Vocabulary overlap: {report['vocabulary']['jaccard_similarity']:.2f}")
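For intuition about the `jaccard_similarity` and `drift_sigma` fields, here is a minimal standalone sketch of the textbook definitions. It is an assumption about what those fields measure, not modeldiffx's actual implementation; the function names and the tokenization rule are illustrative only.

```python
import math
import re
import statistics


def vocabulary(texts):
    """Set of lowercase word tokens across all texts (simple regex tokenizer)."""
    words = set()
    for text in texts:
        words.update(re.findall(r"[a-z0-9']+", text.lower()))
    return words


def jaccard_similarity(vocab_a, vocab_b):
    """|A ∩ B| / |A ∪ B|; two empty vocabularies count as identical."""
    union = vocab_a | vocab_b
    if not union:
        return 1.0
    return len(vocab_a & vocab_b) / len(union)


def length_drift_sigma(lengths_a, lengths_b):
    """Shift in mean length, in units of the baseline's sample std deviation."""
    mean_a = statistics.mean(lengths_a)
    mean_b = statistics.mean(lengths_b)
    sigma_a = statistics.stdev(lengths_a) if len(lengths_a) > 1 else 0.0
    if sigma_a == 0.0:
        # No spread in the baseline: any shift at all is unbounded drift.
        return 0.0 if mean_a == mean_b else math.inf
    return (mean_b - mean_a) / sigma_a
```

A Jaccard similarity near 1.0 means both models draw on nearly the same vocabulary, while a length drift of several sigma means the new model's typical response length falls well outside the baseline's normal variation.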

4. Fingerprint a model

from modeldiffx import Snapshot
from modeldiffx.fingerprint import fingerprint, compare_fingerprints

snap = Snapshot.load("snapshots/gpt4_0613.json")
fp = fingerprint(snap)

print(f"Verbosity: {fp.dimensions['verbosity']:.2f}")
print(f"Refusal rate: {fp.dimensions['refusal_rate']:.2f}")
print(f"Formality: {fp.dimensions['formality']:.2f}")
print(f"Vocabulary richness: {fp.dimensions['vocabulary_richness']:.2f}")

5. Use built-in test suites

from modeldiffx import capture
from modeldiffx.suite import get_standard_suite, get_suite

# All 25 prompts across 5 categories
prompts = get_standard_suite()

# Or pick specific suites
safety_prompts = get_suite("safety")
code_prompts = get_suite("code")

# call_model is the model callable defined in step 1
snapshot = capture(prompts, call_model, model_name="gpt-4-turbo")

CLI

# Compare two snapshot files
modeldiffx diff snapshots/v1.json snapshots/v2.json

# Markdown output
modeldiffx diff snapshots/v1.json snapshots/v2.json --markdown

# Save JSON report
modeldiffx diff snapshots/v1.json snapshots/v2.json -o report.json

# Snapshot info
modeldiffx info snapshots/v1.json

# Drift analysis
modeldiffx drift snapshots/v1.json snapshots/v2.json

# List built-in suites
modeldiffx suites

API Reference

Change Types

| Type | Description | Typical Severity |
| --- | --- | --- |
| CONTENT | Semantically different response | HIGH |
| FORMAT | Same content, different formatting | LOW |
| REFUSAL | One model refuses, other doesn't | CRITICAL |
| LENGTH | Significant length difference | MEDIUM |
| STYLE | Tone/verbosity shift | MEDIUM |
| ERROR | One model errors | HIGH |
| IDENTICAL | No change detected | |

Regression Score

The regression score is a weighted severity metric (0.0 = no regressions, 1.0 = all critical):

  • CRITICAL: weight 1.0 (refusal changes, safety regressions)
  • HIGH: weight 0.6 (content changes)
  • MEDIUM: weight 0.3 (style/length changes)
  • LOW: weight 0.1 (formatting changes)
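One way to read these weights: if the score is the mean weight over all compared prompts, with unchanged prompts contributing 0, then it is 0.0 when nothing changed and 1.0 when every prompt is a critical change. The sketch below assumes exactly that formula; the aggregation and the lowercase severity labels are assumptions, and the library may compute the score differently.

```python
# Weights as listed above; the lowercase string keys are an assumption.
SEVERITY_WEIGHTS = {
    "critical": 1.0,
    "high": 0.6,
    "medium": 0.3,
    "low": 0.1,
}


def regression_score(severities):
    """Mean severity weight over all entries.

    `severities` holds one label per compared prompt; unchanged prompts
    can be passed as None (or any unknown label) and contribute 0.
    """
    if not severities:
        return 0.0
    total = sum(SEVERITY_WEIGHTS.get(label, 0.0) for label in severities)
    return total / len(severities)


# Four prompts: one content change (high), one formatting change (low),
# and two unchanged. (0.6 + 0.1) / 4 = 0.175
example_score = regression_score(["high", "low", None, None])
```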

Fingerprint Dimensions

| Dimension | Range | Description |
| --- | --- | --- |
| verbosity | 0–1 | Average response length normalized to 500 words |
| refusal_rate | 0–1 | Fraction of prompts refused |
| error_rate | 0–1 | Fraction of prompts that errored |
| vocabulary_richness | 0–1 | Type-token ratio |
| avg_latency_ms | 0+ | Mean response latency |
| length_consistency | 0–1 | 1 minus coefficient of variation |
| formality | 0–1 | Ratio of formal to casual markers |
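Three of these dimensions follow directly from standard formulas. The sketch below shows one plausible reading of the descriptions for verbosity, vocabulary_richness, and length_consistency; the whitespace tokenization, the clamping to the 0–1 range, and the use of the population standard deviation are assumptions, not confirmed details of modeldiffx.

```python
import statistics


def _word_counts(responses):
    """Number of whitespace-separated words in each response."""
    return [len(text.split()) for text in responses]


def verbosity(responses, norm_words=500):
    """Average response length in words, divided by norm_words, capped at 1."""
    counts = _word_counts(responses)
    if not counts:
        return 0.0
    return min(statistics.mean(counts) / norm_words, 1.0)


def vocabulary_richness(responses):
    """Type-token ratio: distinct words divided by total words."""
    tokens = [word.lower() for text in responses for word in text.split()]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def length_consistency(responses):
    """1 minus the coefficient of variation of word counts, floored at 0."""
    counts = _word_counts(responses)
    if not counts or statistics.mean(counts) == 0:
        return 0.0
    cv = statistics.pstdev(counts) / statistics.mean(counts)
    return max(0.0, 1.0 - cv)
```

Under these definitions, a model that always answers in roughly the same number of words scores near 1.0 on length_consistency, and one that repeats the same few words scores low on vocabulary_richness.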

Architecture

modeldiffx/
├── _types.py        # Core types: Prompt, Response, Snapshot, DiffReport
├── capture.py       # Snapshot capture from model callables / files
├── diff.py          # Behavioral diffing with similarity scoring
├── drift.py         # Statistical drift detection
├── fingerprint.py   # Model behavioral fingerprinting
├── suite.py         # Built-in evaluation suites (25 prompts)
├── report.py        # JSON/text/rich/markdown report formatting
└── cli.py           # Click CLI interface

See Also

Part of the stef41 LLM toolkit — open-source tools for every stage of the LLM lifecycle:

| Project | What it does |
| --- | --- |
| tokonomics | Token counting & cost management for LLM APIs |
| datacrux | Training data quality: dedup, PII, contamination |
| castwright | Synthetic instruction data generation |
| datamix | Dataset mixing & curriculum optimization |
| toksight | Tokenizer analysis & comparison |
| trainpulse | Training health monitoring |
| ckpt | Checkpoint inspection, diffing & merging |
| quantbench | Quantization quality analysis |
| infermark | Inference benchmarking |
| vibesafe | AI-generated code safety scanner |
| injectionguard | Prompt injection detection |

License

Apache 2.0
