Falsification-first reliability testing for AI systems: perturb inputs, preserve replayable evidence, diff reliability across model changes.
FalsifyAI
FalsifyAI catches LLM regressions by perturbing inputs, preserving replayable evidence, and diffing reliability across model changes.
Accuracy benchmarks measure correctness. FalsifyAI measures reliability under pressure.
Under the hood, it is a reliability evidence system for stochastic software.
Status: 0.1.0 — Phase 0 MVP. Stable enough to use; spec language and verdict semantics are locked for the 0.1.x line.
pip install falsifyai
For the semantic_equivalence invariant (pulls PyTorch, ~1GB):
pip install "falsifyai[semantic]"
You upgraded your model. Did anything break?
Most LLM evals tell you the new model passes its accuracy benchmark. That's not the same as "the new model behaves like the old one under the same kinds of pressure your users will put on it." FalsifyAI catches the gap.
The investigation takes three commands. Five minutes. One terminal.
1. Define what good looks like
If you pip install'd FalsifyAI, the examples aren't on disk yet. Grab one:
curl -O https://raw.githubusercontent.com/ericckzhou/falsifyai/main/examples/model_migration.yaml
Or git clone https://github.com/ericckzhou/falsifyai for all four. Then open examples/model_migration.yaml:
falsify:
  version: "1.0"
  name: "Model migration regression test"
model:
  provider: openai
  model: gpt-4o-mini
run:
  seed: 42
cases:
  - id: factual_recall
    input: { text: "What is the capital of France?" }
    expected: { contains: ["Paris"] }
    perturbations:
      - { type: typo_noise, count: 3 }
      - { type: casing }
    invariants:
      - { type: contains, values: ["Paris"] }
  - id: structured_output
    input: { text: 'Reply ONLY with a JSON object of the form {"capital": "<city>"}. What is the capital of Japan?' }
    expected: { contains: ['"capital"', "Tokyo"] }
    perturbations:
      - { type: typo_noise, count: 2 }
      - { type: casing }
    invariants:
      - { type: contains, values: ['"capital"', "Tokyo"] }
  - id: extraction
    input: { text: "Extract only the email addresses from this text: Contact alice@example.com or bob@example.com for details. The deadline is Friday." }
    expected: { contains: ["alice@example.com", "bob@example.com"] }
    perturbations:
      - { type: typo_noise, count: 2 }
      - { type: casing }
    invariants:
      - { type: contains, values: ["alice@example.com", "bob@example.com"] }
  - id: policy_summary
    input: { text: "Summarize this refund policy in one sentence: Customers can request a refund within 30 days if the item is unused and the receipt is provided." }
    expected: { contains: ["30 days", "unused", "receipt"] }
    perturbations:
      - { type: typo_noise, count: 2 }
      - { type: casing }
    invariants:
      - { type: contains, values: ["30 days", "unused", "receipt"] }
Four cases. One obvious sanity anchor (factual recall) plus three production-shaped contracts: structured output, extraction, and grounded policy summarization. The mix is deliberate — a migration regression then looks like a behavioral pattern across contract types, not a single anecdote.
2. Run against your baseline model
$ falsifyai run examples/model_migration.yaml
case: factual_recall verdict: STABLE confidence: 0.95 (CI: 0.92-0.98)
case: structured_output verdict: STABLE confidence: 0.94 (CI: 0.91-0.97)
case: extraction verdict: STABLE confidence: 0.96 (CI: 0.93-0.99)
case: policy_summary verdict: STABLE confidence: 0.93 (CI: 0.90-0.97)
=================================================================
Session 7c4f...a201 -> .falsifyai/replays.db
4 cases, verdict STABLE, 0 FRAGILE, 0 CONSISTENTLY_WRONG
Exit code: 0. Four contracts, four green rows. Note the session id (7c4f...a201) — that's your known-good baseline. Commit it to your repo if you want it durable.
3. Switch to the new model. Run again.
Change model: in the spec (or set a different OPENAI_MODEL env var), then:
$ falsifyai run examples/model_migration.yaml
case: factual_recall verdict: STABLE confidence: 0.94 (CI: 0.91-0.97)
case: structured_output verdict: CONSISTENTLY_WRONG confidence: 0.00 (CI: 0.00-0.00)
case: extraction verdict: CONSISTENTLY_WRONG confidence: 0.00 (CI: 0.00-0.00)
case: policy_summary verdict: STABLE confidence: 0.92 (CI: 0.88-0.96)
=================================================================
Session 9a32...b1f0 -> .falsifyai/replays.db
4 cases, verdict CONSISTENTLY_WRONG, 0 FRAGILE, 2 CONSISTENTLY_WRONG
Exit code: 2 (FAILURE). The new model still knows the capital of France and can still summarize the refund policy with the required terms, but it dropped the JSON envelope on structured output and refused to do the extraction. Same spec, different model: two contracts broken, two unchanged.
4. Diff the two runs
$ falsifyai diff 7c4f...a201 9a32...b1f0
Diff: baseline 7c4f...a201 -> candidate 9a32...b1f0
Store: .falsifyai/replays.db
=================================================================
case: extraction baseline: STABLE (0.96) candidate: CONSISTENTLY_WRONG (0.00) REGRESSED
case: structured_output baseline: STABLE (0.94) candidate: CONSISTENTLY_WRONG (0.00) REGRESSED
=================================================================
2 regressed, 0 improved, 2 unchanged, 0 other, 0 added, 0 removed
Exit code: 5 (REGRESSION). Only the rows that changed are shown — two regressions, two unchanged contracts compressed into the footer count. The migration broke structured output and extraction, but preserved factual recall and policy grounding. That is a behavioral pattern, not an anecdote.
One command, two verdict-class downgrades, one number your CI can gate on.
5. Replay any past session
$ falsifyai replay --latest
Loaded session 9a32...b1f0 · created_at 2026-05-21T... from .falsifyai/replays.db
case: factual_recall verdict: STABLE confidence: 0.94 (CI: 0.91-0.97)
case: structured_output verdict: CONSISTENTLY_WRONG confidence: 0.00 (CI: 0.00-0.00)
case: extraction verdict: CONSISTENTLY_WRONG confidence: 0.00 (CI: 0.00-0.00)
case: policy_summary verdict: STABLE confidence: 0.92 (CI: 0.88-0.96)
=================================================================
4 cases, verdict CONSISTENTLY_WRONG, 0 FRAGILE, 2 CONSISTENTLY_WRONG
Replay is read-only. The verdict shown is the one assigned at run time — never re-resolved. The same evidence that triggered the regression alert is preserved indefinitely.
That's the whole product. run → replay → diff is one falsification workflow, not three commands.
What just happened?
Five concepts, one screen each:
Perturbations generate small input variations a real user might produce. The MVP ships typo_noise (character-level mutations) and casing_variant (UPPER / lower / Title); the example specs select the latter with type: casing.
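As a rough illustration of what the two perturbation families might do, here is a minimal sketch. The function names, the choice of swap / drop / duplicate as the character-level mutations, and the way the seed is threaded through are all assumptions made for this example; FalsifyAI's own implementation may differ.

```python
import random


def typo_noise(text: str, count: int, rng: random.Random) -> str:
    """Apply `count` character-level mutations (hypothetical: swap, drop, duplicate)."""
    chars = list(text)
    for _ in range(count):
        if len(chars) < 2:
            break
        i = rng.randrange(len(chars) - 1)
        op = rng.choice(("swap", "drop", "dup"))
        if op == "swap":
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        elif op == "drop":
            del chars[i]
        else:
            chars.insert(i, chars[i])
    return "".join(chars)


def casing_variants(text: str) -> list[str]:
    """Return the UPPER / lower / Title variants of the input."""
    return [text.upper(), text.lower(), text.title()]


# A fixed seed (mirroring `run.seed` in the spec) makes the typo variant reproducible.
rng = random.Random(42)
prompt = "What is the capital of France?"
perturbed = [typo_noise(prompt, 3, rng)] + casing_variants(prompt)
```

Seeding the generator is what would let two runs of the same spec send identical perturbed inputs to two different models, so that any difference in outputs comes from the model and not from the noise.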
Invariants judge whether a perturbed output is still "the same answer" as the original. contains checks for required substrings; semantic_equivalence compares embedding cosine similarity to a threshold.
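The two invariant checks can be sketched in a few lines. This is an illustration only: the function signatures, the case-sensitive matching, and the 0.85 threshold are assumptions, and the embeddings are passed in as plain lists instead of being computed by a model.

```python
import math


def contains(output: str, values: list[str]) -> bool:
    """Pass only if every required substring appears in the output (assumed case-sensitive)."""
    return all(v in output for v in values)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_equivalence(emb_a: list[float], emb_b: list[float],
                         threshold: float = 0.85) -> bool:
    """Pass if the two embeddings are at least `threshold` similar (threshold is hypothetical)."""
    return cosine_similarity(emb_a, emb_b) >= threshold
```

For the structured_output case above, `contains('{"capital": "Tokyo"}', ['"capital"', "Tokyo"])` passes, while a bare `"Tokyo"` reply fails because the JSON key is missing.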
Verdicts compress evidence into one of five labels per case:
| Verdict | Meaning | Exit |
|---|---|---|
| STABLE | All perturbations passed the invariants | 0 |
| FRAGILE | Some perturbations failed; model drifts under pressure | 1 |
| CONSISTENTLY_WRONG | Every output (including baseline) violates the ground truth | 2 |
| INSUFFICIENT | Not enough evidence to decide (too few perturbations) | 4 |
| INVALID_EVAL | The evaluation itself is invalid or contradictory | 2 |
Verdicts use stratified bootstrap CI — each perturbation family is resampled independently, and the worst-case CI lower bound wins. A model that survives typos but breaks under casing reports the casing stability number, not an aggregated average that hides the failure.
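The idea of resampling each family separately and keeping the worst lower bound can be sketched as follows. This is a simplified percentile bootstrap written for illustration; the resample count, the 95% level, and the function names are assumptions, not the resolver's actual parameters.

```python
import random


def bootstrap_lower_bound(outcomes: list[bool], rng: random.Random,
                          n_resamples: int = 2000, alpha: float = 0.05) -> float:
    """Percentile-bootstrap lower bound on the pass rate of one (non-empty) family."""
    n = len(outcomes)
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    return rates[int((alpha / 2) * n_resamples)]


def worst_case_stability(by_family: dict[str, list[bool]],
                         seed: int = 42) -> tuple[str, float]:
    """Resample each family on its own and report the lowest lower bound."""
    rng = random.Random(seed)
    bounds = {fam: bootstrap_lower_bound(o, rng) for fam, o in by_family.items()}
    worst = min(bounds, key=bounds.get)
    return worst, bounds[worst]


# Hypothetical evidence: the model survives typos but mostly breaks under casing.
evidence = {
    "typo_noise": [True] * 10,
    "casing": [True, False, False, True, False, False, True, False, False, False],
}
family, lower = worst_case_stability(evidence)
```

In this made-up evidence, pooling all twenty outcomes would give a pass rate of 0.65, which looks tolerable; the stratified version reports the casing family's much lower bound instead.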
Replay artifacts preserve the full evidence trail per session — every perturbed input, every model output, every invariant judgment, the verdict, and the per-family stability distribution. Replay shows historical evidence; it does not re-resolve. The CLI compresses; the artifact preserves the receipts.
Diff compares two artifacts case-by-case. The regression criterion is a binary verdict-class downgrade — STABLE → FRAGILE, STABLE → CONSISTENTLY_WRONG, or FRAGILE → CONSISTENTLY_WRONG. A competent user can predict the exit code from the two verdicts; there are no hidden thresholds.
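Because the criterion is a fixed ordering over verdict classes, it fits in a small lookup. The sketch below is an illustration under assumptions: the "other" bucket for verdicts outside the three ordered classes and the exit code 0 when nothing regressed are guesses based on the diff footer, not documented behavior.

```python
# Verdict classes ordered from best to worst; only these three take part
# in the downgrade rule described above.
SEVERITY = {"STABLE": 0, "FRAGILE": 1, "CONSISTENTLY_WRONG": 2}


def classify(baseline: str, candidate: str) -> str:
    """Label one case as regressed, improved, unchanged, or other."""
    if baseline == candidate:
        return "unchanged"
    if baseline not in SEVERITY or candidate not in SEVERITY:
        return "other"  # assumption: e.g. INSUFFICIENT or INVALID_EVAL on either side
    return "regressed" if SEVERITY[candidate] > SEVERITY[baseline] else "improved"


def diff_exit_code(pairs: list[tuple[str, str]]) -> int:
    """Return 5 (REGRESSION) if any case was downgraded, else 0 (assumed)."""
    return 5 if any(classify(b, c) == "regressed" for b, c in pairs) else 0


# The four cases from the walkthrough, as (baseline, candidate) verdicts.
walkthrough = [
    ("STABLE", "STABLE"),              # factual_recall
    ("STABLE", "CONSISTENTLY_WRONG"),  # structured_output
    ("STABLE", "CONSISTENTLY_WRONG"),  # extraction
    ("STABLE", "STABLE"),              # policy_summary
]
exit_code = diff_exit_code(walkthrough)
```

Confidence values do not appear anywhere in the rule, which is what makes the exit code predictable from the two verdicts alone.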
For the full philosophy — including why evidence density beats evidence volume, what resolver inflation is and why we resist it, and how the four pillars hang together — see docs/ARCHITECTURE.md.
Architecture
Three layers, separated by design. Each new feature belongs in exactly one of them.
flowchart LR
    subgraph Generation["Evidence generation"]
        Spec[Spec / YAML]
        Mat[Materialize]
        Exec[Execute]
        Spec --> Mat --> Exec
    end
    subgraph Interpretation["Evidence interpretation"]
        Inv[Invariants]
        Res[Verdict resolver]
        Ren[CLI render]
        Inv --> Res --> Ren
    end
    subgraph Preservation["Evidence preservation"]
        Art[ReplayArtifact]
        Store[ReplayStore]
        Art --> Store
    end
    Exec --> Inv
    Ren --> Art
    Store -.->|replay/diff| Ren
ASCII fallback (for PyPI / mobile readers):
EVIDENCE GENERATION         EVIDENCE INTERPRETATION       EVIDENCE PRESERVATION
─────────────────────       ───────────────────────       ─────────────────────
spec.yaml                   invariants                    ReplayArtifact
   │                        verdict resolver              ReplayStore
   ▼                        CLI render                               ▲
materialize                         │                                │
   │                                │                                │
   ▼                                ▼                                │
execute ────────────────────────▶ judge ────────────▶ resolve ───────┘
                                                                     │
                                    ┌── falsifyai run                │
                                    │── falsifyai replay             │  (consumers read
                                    └── falsifyai diff               │   the artifact)
A future feature touches exactly one layer. Adaptive evidence collection is interpretation, not generation. A new perturbation family is generation, not interpretation. A new verdict shape is interpretation, not preservation. The separation is what keeps the resolver explainable as the project grows — see docs/ARCHITECTURE.md and the philosophy section of CONTRIBUTING.md.
CLI reference
Three subcommands, one workflow:
falsifyai run <spec.yaml> [--store-path PATH]
falsifyai replay <session_id> [--store-path PATH]
falsifyai replay --latest [--store-path PATH]
falsifyai diff <baseline_id> <candidate_id> [--store-path PATH]
| Exit code | Meaning |
|---|---|
| 0 | SUCCESS — session verdict STABLE |
| 1 | DEGRADED — session verdict FRAGILE |
| 2 | FAILURE — session verdict CONSISTENTLY_WRONG or INVALID_EVAL |
| 3 | ERROR — infrastructure failure (bad spec, missing credential, model call failure) |
| 4 | INSUFFICIENT — not enough evidence to decide |
| 5 | REGRESSION — falsifyai diff detected a verdict-class downgrade |
Default --store-path is .falsifyai/replays.db. Use :memory: for ephemeral runs (test-only; replay and diff need a persistent store).
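For scripts that wrap the CLI, the exit-code table can be mirrored as an enum. The sketch below only restates the two tables in this README; the `ExitCode` class and `VERDICT_EXIT` mapping are names invented for this example and are not part of FalsifyAI's public API.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Exit codes as listed in the CLI reference table."""
    SUCCESS = 0
    DEGRADED = 1
    FAILURE = 2
    ERROR = 3
    INSUFFICIENT = 4
    REGRESSION = 5


# Session verdict -> exit code of `falsifyai run`, per the verdict table.
VERDICT_EXIT = {
    "STABLE": ExitCode.SUCCESS,
    "FRAGILE": ExitCode.DEGRADED,
    "CONSISTENTLY_WRONG": ExitCode.FAILURE,
    "INVALID_EVAL": ExitCode.FAILURE,
    "INSUFFICIENT": ExitCode.INSUFFICIENT,
}


def describe(returncode: int) -> str:
    """Turn a process return code into a readable label for logs."""
    try:
        return ExitCode(returncode).name
    except ValueError:
        return f"UNKNOWN({returncode})"
```

Note that exit code 2 is shared by CONSISTENTLY_WRONG and INVALID_EVAL, so a wrapper that needs to tell them apart has to read the printed verdict, not just the return code.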
CI integration
The launch wedge in a GitHub Actions step:
- name: Reliability regression gate
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  run: |
    KNOWN_GOOD="${{ vars.FALSIFYAI_BASELINE_SESSION_ID }}"
    # GitHub Actions runs this script with `bash -e`, so a non-zero verdict
    # exit (1, 2, 4) would abort the step before the diff. Stop here only
    # on exit 3 (ERROR).
    falsifyai run eval.yaml || [ "$?" -ne 3 ]
    CANDIDATE=$(sqlite3 .falsifyai/replays.db \
      "SELECT session_id FROM sessions ORDER BY created_at_iso DESC LIMIT 1;")
    falsifyai diff "$KNOWN_GOOD" "$CANDIDATE"
    # Exit 5 = regression; the job fails.
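The KNOWN_GOOD variable is a session id you captured locally against the production model and saved as a repo / org variable. CI runs the eval against the candidate model and diffs the two sessions; exit 5 (REGRESSION) fails the job. Because diff reads both sessions from one store, the replays.db that holds the baseline session must be present in the CI workspace. There are no thresholds to tune; the regression criterion is the verdict-class downgrade.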
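If the runner has no sqlite3 binary, the same lookup can be done from Python's standard library. This is a sketch, not a supported interface: the table and column names are taken from the SQL query in the step above, `--store-path` comes from the CLI reference, and the `latest_session_id` and `gate` helpers are hypothetical.

```python
import sqlite3
import subprocess
from contextlib import closing
from typing import Optional

DEFAULT_STORE = ".falsifyai/replays.db"


def latest_session_id(db_path: str = DEFAULT_STORE) -> Optional[str]:
    """Return the newest session id, using the same query as the CI step."""
    with closing(sqlite3.connect(db_path)) as conn:
        row = conn.execute(
            "SELECT session_id FROM sessions ORDER BY created_at_iso DESC LIMIT 1;"
        ).fetchone()
    return row[0] if row else None


def gate(baseline_id: str, db_path: str = DEFAULT_STORE) -> int:
    """Diff the baseline against the newest session and return the CLI's exit code."""
    candidate = latest_session_id(db_path)
    if candidate is None:
        return 3  # treat an empty store as an infrastructure error
    result = subprocess.run(
        ["falsifyai", "diff", baseline_id, candidate, "--store-path", db_path]
    )
    return result.returncode
```

A CI script could call `sys.exit(gate(baseline_id))` after `falsifyai run` so that the job's status follows the diff result.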
Examples
Four dogfooded specs, all verified in CI (tests/integration/test_examples.py):
| Example | Verdict | What it demonstrates |
|---|---|---|
| examples/stable.yaml | STABLE (exit 0) | A sane model under perturbation; both perturbation families + both invariants. |
| examples/fragile.yaml | FRAGILE (exit 1) | Model drift: baseline correct, perturbations wrong. |
| examples/consistently_wrong.yaml | CONSISTENTLY_WRONG (exit 2) | Confident hallucination: same wrong answer under every perturbation. |
| examples/model_migration.yaml | regression (exit 5) | The launch wedge: run twice, diff, exit 5 if any case regressed. |
Run any of them:
falsifyai run examples/stable.yaml
A real provider is required at runtime (OPENAI_API_KEY or the equivalent env var for your provider). The dogfood tests in CI bypass real model calls by injecting MockAdapter through a test seam — see tests/integration/test_examples.py for the pattern.
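The general shape of such a test seam can be sketched as below. Only the name MockAdapter comes from this README; the `complete` method, the `ModelAdapter` protocol, and the `run_case` helper are hypothetical stand-ins, and the real pattern lives in tests/integration/test_examples.py.

```python
from typing import Callable, Protocol


class ModelAdapter(Protocol):
    """Hypothetical interface: anything that maps a prompt to a completion."""
    def complete(self, prompt: str) -> str: ...


class MockAdapter:
    """Returns canned responses and records the prompts it was asked."""
    def __init__(self, responder: Callable[[str], str]) -> None:
        self._responder = responder
        self.calls: list[str] = []

    def complete(self, prompt: str) -> str:
        self.calls.append(prompt)
        return self._responder(prompt)


def run_case(adapter: ModelAdapter, prompts: list[str],
             required: list[str]) -> list[bool]:
    """Send each prompt through the adapter and apply a contains-style check."""
    return [all(v in adapter.complete(p) for v in required) for p in prompts]


# A mock that answers correctly only when the prompt keeps its original casing.
adapter = MockAdapter(lambda p: "Paris" if "France" in p else "I don't know")
results = run_case(
    adapter,
    ["What is the capital of France?", "WHAT IS THE CAPITAL OF FRANCE?"],
    ["Paris"],
)
```

Because the responder is a plain function, a test can script a stable, fragile, or consistently wrong model deterministically without any network call.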
Writing your own spec
The shortest valid spec (tests/fixtures/specs/minimal.yaml):
falsify:
  version: "1.0"
  name: "minimal"
model:
  provider: openai
  model: gpt-4o-mini
run:
  seed: 42
cases:
  - id: hello
    input:
      text: "Say hi."
    perturbations:
      - type: typo_noise
    invariants:
      - type: contains
        values: ["hi"]
The full spec schema (perturbation parameters, invariant types, verdict thresholds) is in plan.md §6. The spec language is locked for the 0.1.x line.
Local development
Requires Python 3.13+ and uv.
git clone https://github.com/ericckzhou/falsifyai
cd falsifyai
uv sync --extra dev
uv run pytest
Contributions follow the conventions in CONTRIBUTING.md. Architectural constraints (especially: resist resolver inflation) are non-negotiable; see that doc for the trust test any resolver-touching PR must pass.
Status and roadmap
0.1.0 (this release) — Phase 0 MVP. Spec language, perturbation runtime, materializer, invariants, execution adapter, replay store, real verdict resolver (stratified bootstrap CI, CONSISTENTLY_WRONG, falsifiability scoring), and the three-command CLI (run + replay + diff).
Phase 1 (post-0.1.0). Driven by real-world usage feedback. Likely additions: full ConsistencyOracle (embedding-based contradiction detection), falsifyai history --case <id> (time-series across sessions), falsifyai inspect <session_id> (deep-dive per-case view), exit code 6 (LOW_FALSIFIABILITY) wiring, --strict / --show-trending flags on diff, paraphrase + retrieval perturbation families, real-LiteLLM smoke testing in CI.
Phase 1 features will be evaluated against the question: does this preserve evidence density and resolver predictability, or does it inflate the surface? See docs/ARCHITECTURE.md and CONTRIBUTING.md for the discipline.
License
Apache 2.0 — see LICENSE.