llm-behavior-diff
Deterministic behavioral regression testing for LLM model upgrades.
llm-behavior-diff runs the same suite against two model versions, classifies behavioral differences, and highlights upgrade risk before production rollout.
Current release baseline: GA v1.0.0.
At A Glance
| Capability | Status | Notes |
|---|---|---|
| Deterministic comparator pipeline | Implemented | semantic, factual, format, behavioral |
| Optional LLM-as-judge | Implemented | metadata-only, never overrides final decision |
| Statistical significance (bootstrap + Wilson + permutation) | Implemented | run metadata + compare delta rows |
| Risk-tier gate policies | Implemented | strict, balanced, permissive for CLI and CI |
| CI release checks | Implemented | quality, build/twine, regression workflow |
Why This Exists
Upgrading from one model version to another can silently change behavior:
- factual reliability can drift
- formatting/instruction compliance can break
- safety boundaries can shift
- output style can change while semantics stay equivalent
Ad-hoc prompt checks miss these patterns and are hard to reproduce in CI.
Who This Is For
- LLM platform teams shipping model upgrades
- Application teams with safety/format requirements
- MLOps teams needing upgrade gates with machine-readable reports
What You Get
- Comparator-first deterministic diffing (semantic, factual, format, behavioral)
- Optional external factual validation (--factual-connector wikipedia, metadata-only)
- Single-suite run command with retry/rate-limit/cost controls
- JSON report artifacts for CI and governance workflows
- Report rendering in table, json, markdown, csv, ndjson, junit, and interactive self-contained html
- Optional direct export connectors for rendered reports (--export-connector http|s3|gcs|bigquery|snowflake|redshift|azure_blob)
- Run-to-run compare command with delta metrics
- Policy gate command for deterministic release decisions (strict|balanced|permissive)
Installation
pip install llm-behavior-diff
Requires Python 3.11+.
Getting Started
1) Set provider keys
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export LLM_DIFF_LOCAL_BASE_URL=http://localhost:11434/v1
# optional:
# export LLM_DIFF_LOCAL_API_KEY=local-api-key
# export LLM_DIFF_EXPORT_API_KEY=export-api-key
# export AWS_ACCESS_KEY_ID=...
# export AWS_SECRET_ACCESS_KEY=...
# export AWS_SESSION_TOKEN=... # optional
# export GOOGLE_APPLICATION_CREDENTIALS=/path/to/gcp-service-account.json # optional ADC
# export AZURE_CLIENT_ID=... # optional DefaultAzureCredential chain input
# export AZURE_TENANT_ID=... # optional DefaultAzureCredential chain input
# export AZURE_CLIENT_SECRET=... # optional DefaultAzureCredential chain input
# export LLM_DIFF_EXPORT_SF_PASSWORD=... # optional Snowflake export password
# export LLM_DIFF_EXPORT_RS_PASSWORD=... # optional Redshift export password
2) Create a suite
name: quick_suite
description: Basic regression checks
version: "1.0"
metadata:
  owner: llm-platform
test_cases:
  - id: q_001
    prompt: "Return valid JSON with keys name and age."
    category: instruction_following
    tags: [json, format]
    expected_behavior: Must return parseable JSON with name and age keys
    max_tokens: 256
    temperature: 0.0
    metadata:
      priority: high
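For teams generating suites programmatically, a structural pre-check before handing the file to the CLI can catch mistakes early. The sketch below is hypothetical: the required-key sets are inferred from the sample suite above, not from the tool's actual schema, and `validate_suite` is an illustrative helper name.

```python
# Hypothetical sketch: minimal structural validation of a suite dict,
# e.g. the result of yaml.safe_load() on quick_suite.yaml.
# Required-key sets are guesses based on the sample suite, not the real schema.
REQUIRED_SUITE_KEYS = {"name", "version", "test_cases"}
REQUIRED_CASE_KEYS = {"id", "prompt", "expected_behavior"}

def validate_suite(suite: dict) -> list[str]:
    """Return a list of human-readable structural errors (empty if OK)."""
    errors = []
    for key in REQUIRED_SUITE_KEYS - suite.keys():
        errors.append(f"suite missing key: {key}")
    for i, case in enumerate(suite.get("test_cases", [])):
        for key in REQUIRED_CASE_KEYS - case.keys():
            errors.append(f"test_cases[{i}] missing key: {key}")
    return errors

suite = {
    "name": "quick_suite",
    "version": "1.0",
    "test_cases": [
        {
            "id": "q_001",
            "prompt": "Return valid JSON with keys name and age.",
            "expected_behavior": "Must return parseable JSON with name and age keys",
        }
    ],
}
print(validate_suite(suite))  # []
```

`llm-diff run --dry-run` remains the authoritative validation path; a check like this only shortens the feedback loop when suites are templated.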
3) Validate the suite
llm-diff run \
--model-a gpt-4o \
--model-b gpt-4.5 \
--suite quick_suite.yaml \
--dry-run
4) Run the comparison
llm-diff run \
--model-a gpt-4o \
--model-b gpt-4.5 \
--suite quick_suite.yaml \
--factual-connector wikipedia \
--factual-connector-timeout 8 \
--factual-connector-max-results 3 \
--max-workers 4 \
--max-retries 3 \
--rate-limit-rps 2 \
--output run_report.json
5) Render a report
llm-diff report run_report.json --format table
llm-diff report run_report.json --format html -o run_report.html
llm-diff report run_report.json --format csv -o run_report.csv
llm-diff report run_report.json --format ndjson -o run_report.ndjson
llm-diff report run_report.json --format junit -o run_report.junit.xml
llm-diff report run_report.json --format csv -o run_report.csv \
--export-connector http --export-endpoint https://example.com/ingest
llm-diff report run_report.json --format ndjson -o run_report.ndjson \
--export-connector s3 --export-s3-bucket my-llm-diff-bucket \
--export-s3-prefix team-a/exports --export-s3-region eu-west-1
llm-diff report run_report.json --format markdown -o run_report.md \
--export-connector gcs --export-gcs-bucket my-llm-diff-bucket \
--export-gcs-prefix team-a/exports --export-gcs-project analytics-prj
llm-diff report run_report.json --format ndjson -o run_report.ndjson \
--export-connector bigquery \
--export-bq-project analytics-prj \
--export-bq-dataset llm_diff \
--export-bq-table diff_rows \
--export-bq-location EU
llm-diff report run_report.json --format ndjson -o run_report.ndjson \
--export-connector snowflake \
--export-sf-account xy12345.eu-west-1 \
--export-sf-user svc_llm_diff \
--export-sf-warehouse COMPUTE_WH \
--export-sf-database ANALYTICS_DB \
--export-sf-schema LLM_DIFF \
--export-sf-table DIFF_ROWS
llm-diff report run_report.json --format ndjson -o run_report.ndjson \
--export-connector redshift \
--export-rs-host redshift-cluster.example.amazonaws.com \
--export-rs-port 5439 \
--export-rs-database analytics \
--export-rs-user svc_llm_diff \
--export-rs-schema llm_diff \
--export-rs-table diff_rows \
--export-rs-sslmode require
llm-diff report run_report.json --format markdown -o run_report.md \
--export-connector azure_blob \
--export-az-account-url https://myaccount.blob.core.windows.net \
--export-az-container llm-diff-exports \
--export-az-prefix team-a/exports
6) Compare two runs
llm-diff compare previous_run.json candidate_run.json
llm-diff compare previous_run.json candidate_run.json -o comparison.md
7) Evaluate risk-tier gate policy
llm-diff gate candidate_run.json --policy strict
llm-diff gate candidate_run.json --policy balanced --format json -o gate_result.json
How It Works
- Load and validate one suite YAML.
- Resolve providers from model prefixes:
  - gpt-*, o1-*, o3-* -> OpenAI
  - claude-* -> Anthropic
  - litellm:<model_ref> -> LiteLLM
  - local:<model_ref> -> Local OpenAI-compatible endpoint
- Execute each test with model A and B concurrently.
- Apply deterministic comparators:
  - semantic: semantic equivalence gate
  - factual: hallucination/knowledge-change rules
  - optional factual_external: connector-backed factual evidence signal (metadata-only)
  - format: structure/constraint compliance checks
  - behavioral: expected-behavior coverage deltas
  - optional judge: LLM-as-judge on semantic diffs (metadata-only)
- Aggregate with fixed precedence: semantic-same > factual > format > behavioral > unknown
- Emit a BehaviorReport with diffs, category stats, token usage, and estimated cost.
- Judge outputs never override the deterministic final category or regression flags.
flowchart LR
A["Suite YAML"] --> B["Runner"]
B --> C["Model A Call"]
B --> D["Model B Call"]
C --> E["Comparator Pipeline"]
D --> E
E --> F["Aggregator"]
F --> G["BehaviorReport JSON"]
G --> H["report command"]
G --> I["compare command"]
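The fixed-precedence aggregation step can be sketched as a simple ordered lookup. This is an illustrative reconstruction of the rule stated above (semantic-same > factual > format > behavioral > unknown), not the project's actual implementation; `aggregate` is a hypothetical function name.

```python
# Hypothetical sketch of fixed-precedence aggregation: the final category
# is the highest-precedence category reported by any comparator.
PRECEDENCE = ["semantic-same", "factual", "format", "behavioral", "unknown"]

def aggregate(comparator_categories: list[str]) -> str:
    """Pick the highest-precedence category; fall back to 'unknown'."""
    for category in PRECEDENCE:
        if category in comparator_categories:
            return category
    return "unknown"

print(aggregate(["format", "factual"]))          # factual
print(aggregate(["semantic-same", "behavioral"]))  # semantic-same
```

Encoding the precedence as an ordered list keeps the decision deterministic and auditable, which is the property the comparator-first design trades on.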
Decision Snapshot
run output:
- regressions: 7 (CI: [4.0%, 13.0%])
- improvements: 3 (CI: [1.0%, 8.0%])
compare output:
- regression delta CI: [+2.1, +9.4] pp
- regression delta significant?: yes
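The Wilson confidence intervals attached to run metadata are a standard binomial-proportion interval; a minimal stdlib sketch is below. This is the textbook formula, not a copy of the tool's code, so the exact bounds it reports may differ slightly (bootstrap resampling, continuity corrections, and rounding all shift the numbers).

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if n == 0:
        return (0.0, 0.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)

# e.g. 7 regressions observed across 100 test cases
low, high = wilson_interval(7, 100)
print(f"[{low:.1%}, {high:.1%}]")
```

Unlike the naive normal approximation, the Wilson interval stays sensible at small counts and near 0%, which matters for suites where regressions are (hopefully) rare.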
Why This Over Ad-Hoc Evals
- Deterministic rules keep regression signals explainable.
- Comparator breakdowns are persisted in report metadata.
- CI workflows can gate upgrades on explicit regression counts.
- One command surface keeps local and CI execution aligned.
Adoption Checklist
- Define domain suites with explicit expected_behavior terms.
- Start with --dry-run in CI for suite validation.
- Enable retry/rate-limit defaults for provider stability.
- Track regressions, failed_tests, and estimated cost in artifacts.
- Gate upgrades with multi-suite runs in model-upgrade-regression.yml.
- Use llm-diff gate locally with the same policy tier used in CI.
Built-In Suites
- suites/general_knowledge.yaml
- suites/instruction_following.yaml
- suites/safety_boundaries.yaml
- suites/coding_tasks.yaml
- suites/reasoning.yaml
CLI Summary
llm-diff run
Core flags:
- --model-a, --model-b, --suite, --output
- --dry-run
- --continue-on-error
- --max-workers, --max-retries, --rate-limit-rps
- --pricing-file
- --judge-model (optional metadata-only LLM judge)
- --factual-connector (none|wikipedia, default none)
- --factual-connector-timeout (default 8.0)
- --factual-connector-max-results (default 3)
llm-diff report
Render one run report as table | json | html | markdown | csv | ndjson | junit.
Optional direct export connectors:
- HTTP: --export-connector http --export-endpoint ...
- S3: --export-connector s3 --export-s3-bucket ... [--export-s3-prefix ...] [--export-s3-region ...]
- GCS: --export-connector gcs --export-gcs-bucket ... [--export-gcs-prefix ...] [--export-gcs-project ...] (ADC auth)
- Azure Blob: --export-connector azure_blob --export-az-account-url ... --export-az-container ... [--export-az-prefix ...] (DefaultAzureCredential auth, all non-table formats)
- BigQuery (NDJSON only): --export-connector bigquery --format ndjson --export-bq-project ... --export-bq-dataset ... --export-bq-table ... [--export-bq-location ...]
- Snowflake (NDJSON only): --export-connector snowflake --format ndjson --export-sf-account ... --export-sf-user ... --export-sf-warehouse ... --export-sf-database ... --export-sf-schema ... --export-sf-table ... [--export-sf-role ...] (--export-sf-password or LLM_DIFF_EXPORT_SF_PASSWORD)
- Redshift (NDJSON only): --export-connector redshift --format ndjson --export-rs-host ... --export-rs-port 5439 --export-rs-database ... --export-rs-user ... --export-rs-schema ... --export-rs-table ... [--export-rs-sslmode ...] (--export-rs-password or LLM_DIFF_EXPORT_RS_PASSWORD)
llm-diff compare
Compare two run reports and print/write metric deltas.
llm-diff gate
Evaluate one run report with deterministic policy tiers:
- strict: regressions must be 0
- balanced: low regression budget + critical-category hard-fail
- permissive: wider budget + targeted critical-category limits
- --policy-pack: core (default), risk_averse, velocity
- --policy-file: optional custom YAML policy file (version: v1) that overrides pack selection for that run
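The tier semantics can be pictured as budget checks over a run report. The sketch below is hypothetical: the budget numbers and report keys (`regressions`, `critical_regressions`) are illustrative placeholders, not the tool's actual policy values or report schema.

```python
# Hypothetical sketch of a risk-tier gate decision over a run report dict.
# Budgets are illustrative only; the real policy packs define their own values.
POLICIES = {
    "strict": {"max_regressions": 0, "max_critical": 0},
    "balanced": {"max_regressions": 3, "max_critical": 0},
    "permissive": {"max_regressions": 10, "max_critical": 2},
}

def evaluate_gate(report: dict, policy: str) -> bool:
    """Return True (pass) iff the report stays within the tier's budgets."""
    limits = POLICIES[policy]
    if report["regressions"] > limits["max_regressions"]:
        return False
    if report.get("critical_regressions", 0) > limits["max_critical"]:
        return False
    return True

report = {"regressions": 2, "critical_regressions": 0}
print(evaluate_gate(report, "strict"))    # False
print(evaluate_gate(report, "balanced"))  # True
```

Because the decision is a pure function of the report and the tier, the same gate evaluated locally and in CI cannot disagree, which is the point of running `llm-diff gate` with the CI policy tier before pushing.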
Release & CI
- ci.yml: quality checks on master push + PR (ruff, black --check, mypy, pytest)
- release-check.yml: build/twine/wheel smoke checks
- publish-pypi.yml: manual TestPyPI/PyPI publish flow
- docker-image.yml: PR/master build+smoke, optional manual GHCR push
- model-upgrade-regression.yml: manual/reusable regression gate (gate_policy, gate_policy_pack, optional gate_policy_file; optional factual connector inputs; default strict + core) + per-suite export artifacts (csv, ndjson, junit) + optional direct export connectors (http|s3|gcs|bigquery|snowflake|redshift|azure_blob; gcs, redshift, and azure_blob values are env-based via repo vars/secrets)
- Node 24 deprecation closure: workflows keep FORCE_JAVASCRIPT_ACTIONS_TO_NODE24=true and now run on Node 24-ready major action pins.
- Workflow security hardening: all third-party actions are pinned to full commit SHAs; Dependabot auto-updates github-actions minor/patch versions weekly, while major bumps are handled in planned maintenance windows.
Local parity commands:
make install-dev
make ci-local
make release-local
Full operational steps and secret matrix are in docs/release-runbook.md.
Documentation
- Docs Home
- Quick Start
- CLI Reference
- Suite Reference
- Architecture
- API Reference
- Release Runbook
- Launch Kit
Current Scope and Future Exploration
Implemented now:
- deterministic comparator pipeline
- optional LLM-as-judge (metadata-only, opt-in)
- optional external factual connector (wikipedia, metadata-only, opt-in)
- retry/rate-limit/cost tracking
- bootstrap + Wilson confidence intervals (run metadata)
- bootstrap delta CI + permutation p-value (compare rows)
- risk-tier gate policies (CLI + model-upgrade workflow)
- enterprise-ready report export artifacts (csv, ndjson, junit)
- optional direct export connectors (http, s3, gcs, bigquery, snowflake, redshift, azure_blob; gcs/azure_blob support all non-table formats, bigquery/snowflake/redshift are NDJSON-only)
- suite templates and CI distribution workflows
Committed roadmap status:
- No open committed roadmap items at this time.
Future exploration candidates (not committed yet):
- additional provider-specific external sinks beyond s3, gcs, bigquery, snowflake, and redshift (for example warehouse-native connectors)
Contributing
See CONTRIBUTING.md.
License
MIT