Falsification-first reliability testing for AI systems: perturb inputs, preserve replayable evidence, diff reliability across model changes.

These details have not been verified by PyPI

Project links

Project description

FalsifyAI

FalsifyAI catches LLM regressions by perturbing inputs, preserving replayable evidence, and diffing reliability across model changes.

Accuracy benchmarks measure correctness. FalsifyAI measures reliability under pressure.

Under the hood, it is a reliability evidence system for stochastic software.

Status: 0.1.0 — Phase 0 MVP. Stable enough to use; spec language and verdict semantics are locked for the 0.1.x line.

pip install falsifyai

For the semantic_equivalence invariant (pulls PyTorch, ~1GB):

pip install "falsifyai[semantic]"

You upgraded your model. Did anything break?

Most LLM evals tell you the new model passes its accuracy benchmark. That's not the same as "the new model behaves like the old one under the same kinds of pressure your users will put on it." FalsifyAI catches the gap.

The investigation takes three commands. Five minutes. One terminal.

1. Define what good looks like

If you pip install'd FalsifyAI, the examples aren't on disk yet. Grab one:

curl -O https://raw.githubusercontent.com/ericckzhou/falsifyai/main/examples/model_migration.yaml

Or git clone https://github.com/ericckzhou/falsifyai for all four. Then open examples/model_migration.yaml:

falsify:
  version: "1.0"
  name: "Model migration regression test"
model:
  provider: openai
  model: gpt-4o-mini
run:
  seed: 42
cases:
  - id: factual_recall
    input: { text: "What is the capital of France?" }
    expected: { contains: ["Paris"] }
    perturbations:
      - { type: typo_noise, count: 3 }
      - { type: casing }
    invariants:
      - { type: contains, values: ["Paris"] }

  - id: structured_output
    input: { text: 'Reply ONLY with a JSON object of the form {"capital": "<city>"}. What is the capital of Japan?' }
    expected: { contains: ['"capital"', "Tokyo"] }
    perturbations:
      - { type: typo_noise, count: 2 }
      - { type: casing }
    invariants:
      - { type: contains, values: ['"capital"', "Tokyo"] }

  - id: extraction
    input: { text: "Extract only the email addresses from this text: Contact alice@example.com or bob@example.com for details. The deadline is Friday." }
    expected: { contains: ["alice@example.com", "bob@example.com"] }
    perturbations:
      - { type: typo_noise, count: 2 }
      - { type: casing }
    invariants:
      - { type: contains, values: ["alice@example.com", "bob@example.com"] }

  - id: policy_summary
    input: { text: "Summarize this refund policy in one sentence: Customers can request a refund within 30 days if the item is unused and the receipt is provided." }
    expected: { contains: ["30 days", "unused", "receipt"] }
    perturbations:
      - { type: typo_noise, count: 2 }
      - { type: casing }
    invariants:
      - { type: contains, values: ["30 days", "unused", "receipt"] }

Four cases. One obvious sanity anchor (factual recall) plus three production-shaped contracts: structured output, extraction, and grounded policy summarization. The mix is deliberate — a migration regression then looks like a behavioral pattern across contract types, not a single anecdote.

2. Run against your baseline model

$ falsifyai run examples/model_migration.yaml
case: factual_recall     verdict: STABLE  confidence: 0.95 (CI: 0.92-0.98)
case: structured_output  verdict: STABLE  confidence: 0.94 (CI: 0.91-0.97)
case: extraction         verdict: STABLE  confidence: 0.96 (CI: 0.93-0.99)
case: policy_summary     verdict: STABLE  confidence: 0.93 (CI: 0.90-0.97)
=================================================================
Session 7c4f...a201 -> .falsifyai/replays.db
4 cases, verdict STABLE, 0 FRAGILE, 0 CONSISTENTLY_WRONG

Exit code: 0. Four contracts, four green rows. Note the session id (7c4f...a201) — that's your known-good baseline. Commit it to your repo if you want it durable.

3. Switch to the new model. Run again.

Change model: in the spec (or set a different OPENAI_MODEL env var), then:

$ falsifyai run examples/model_migration.yaml
case: factual_recall     verdict: STABLE              confidence: 0.94 (CI: 0.91-0.97)
case: structured_output  verdict: CONSISTENTLY_WRONG  confidence: 0.00 (CI: 0.00-0.00)
case: extraction         verdict: CONSISTENTLY_WRONG  confidence: 0.00 (CI: 0.00-0.00)
case: policy_summary     verdict: STABLE              confidence: 0.92 (CI: 0.88-0.96)
=================================================================
Session 9a32...b1f0 -> .falsifyai/replays.db
4 cases, verdict CONSISTENTLY_WRONG, 0 FRAGILE, 2 CONSISTENTLY_WRONG

Exit code: 2 (FAILURE). The new model still knows the capital of France and can still summarize the refund policy with the required terms — but it dropped the JSON envelope on structured output and refused to do the extraction. Same model. Two contracts broken. Two unchanged.

4. Diff the two runs

$ falsifyai diff 7c4f...a201 9a32...b1f0
Diff: baseline 7c4f...a201 -> candidate 9a32...b1f0
Store: .falsifyai/replays.db
=================================================================
case: extraction         baseline: STABLE (0.96)  candidate: CONSISTENTLY_WRONG (0.00)  REGRESSED
case: structured_output  baseline: STABLE (0.94)  candidate: CONSISTENTLY_WRONG (0.00)  REGRESSED
=================================================================
2 regressed, 0 improved, 2 unchanged, 0 other, 0 added, 0 removed

Exit code: 5 (REGRESSION). Only the rows that changed are shown — two regressions, two unchanged contracts compressed into the footer count. The migration broke structured output and extraction, but preserved factual recall and policy grounding. That is a behavioral pattern, not an anecdote.

One command, two verdict-class downgrades, one number your CI can gate on.

5. Replay any past session

$ falsifyai replay --latest
Loaded session 9a32...b1f0 · created_at 2026-05-21T... from .falsifyai/replays.db
case: factual_recall     verdict: STABLE              confidence: 0.94 (CI: 0.91-0.97)
case: structured_output  verdict: CONSISTENTLY_WRONG  confidence: 0.00 (CI: 0.00-0.00)
case: extraction         verdict: CONSISTENTLY_WRONG  confidence: 0.00 (CI: 0.00-0.00)
case: policy_summary     verdict: STABLE              confidence: 0.92 (CI: 0.88-0.96)
=================================================================
4 cases, verdict CONSISTENTLY_WRONG, 0 FRAGILE, 2 CONSISTENTLY_WRONG

Replay is read-only. The verdict shown is the one assigned at run time — never re-resolved. The same evidence that triggered the regression alert is preserved indefinitely.

That's the whole product. run → replay → diff is one falsification workflow, not three commands.

What just happened?

Five concepts, one screen each:

Perturbations generate small input variations a real user might produce. The MVP ships typo_noise (character-level mutations) and casing_variant (UPPER / lower / Title).

Invariants judge whether a perturbed output is still "the same answer" as the original. contains checks for required substrings; semantic_equivalence compares embedding cosine similarity to a threshold.

Verdicts compress evidence into one of five labels per case:

Verdict	Meaning	Exit
`STABLE`	All perturbations passed the invariants	0
`FRAGILE`	Some perturbations failed; model drifts under pressure	1
`CONSISTENTLY_WRONG`	Every output (including baseline) violates the ground truth	2
`INSUFFICIENT`	Not enough evidence to decide (too few perturbations)	4
`INVALID_EVAL`	The evaluation itself is invalid or contradictory	2

Verdicts use stratified bootstrap CI — each perturbation family is resampled independently, and the worst-case CI lower bound wins. A model that survives typos but breaks under casing reports the casing stability number, not an aggregated average that hides the failure.

Replay artifacts preserve the full evidence trail per session — every perturbed input, every model output, every invariant judgment, the verdict, and the per-family stability distribution. Replay shows historical evidence; it does not re-resolve. The CLI compresses; the artifact preserves the receipts.

Diff compares two artifacts case-by-case. The regression criterion is a binary verdict-class downgrade — STABLE → FRAGILE, STABLE → CONSISTENTLY_WRONG, or FRAGILE → CONSISTENTLY_WRONG. A competent user can predict the exit code from the two verdicts; there are no hidden thresholds.

For the full philosophy — including why evidence density beats evidence volume, what resolver inflation is and why we resist it, and how the four pillars hang together — see docs/ARCHITECTURE.md.

Architecture

Three layers, separated by design. Each new feature belongs in exactly one of them.

flowchart LR
    subgraph Generation["Evidence generation"]
        Spec[Spec / YAML]
        Mat[Materialize]
        Exec[Execute]
        Spec --> Mat --> Exec
    end
    subgraph Interpretation["Evidence interpretation"]
        Inv[Invariants]
        Res[Verdict resolver]
        Ren[CLI render]
        Inv --> Res --> Ren
    end
    subgraph Preservation["Evidence preservation"]
        Art[ReplayArtifact]
        Store[ReplayStore]
        Art --> Store
    end
    Exec --> Inv
    Ren --> Art
    Store -.->|replay/diff| Ren

ASCII fallback (for PyPI / mobile readers):

  EVIDENCE GENERATION             EVIDENCE INTERPRETATION         EVIDENCE PRESERVATION
  ─────────────────────           ───────────────────────         ─────────────────────
  spec.yaml                       invariants                      ReplayArtifact
     │                            verdict resolver                ReplayStore
     ▼                            CLI render                            ▲
  materialize                            │                              │
     │                                   │                              │
     ▼                                   ▼                              │
  execute  ────────────────────────▶ judge ────────────▶ resolve ───────┘
                                                            │
                                       ┌── falsifyai run    │
                                       │── falsifyai replay │  (consumers read
                                       └── falsifyai diff   │   the artifact)

A future feature touches exactly one layer. Adaptive evidence collection is interpretation, not generation. A new perturbation family is generation, not interpretation. A new verdict shape is interpretation, not preservation. The separation is what keeps the resolver explainable as the project grows — see docs/ARCHITECTURE.md and the philosophy section of CONTRIBUTING.md.

CLI reference

Three subcommands, one workflow:

falsifyai run <spec.yaml> [--store-path PATH]
falsifyai replay <session_id> [--store-path PATH]
falsifyai replay --latest      [--store-path PATH]
falsifyai diff <baseline_id> <candidate_id> [--store-path PATH]

Exit code	Meaning
0	SUCCESS — session verdict STABLE
1	DEGRADED — session verdict FRAGILE
2	FAILURE — session verdict CONSISTENTLY_WRONG or INVALID_EVAL
3	ERROR — infrastructure failure (bad spec, missing credential, model call failure)
4	INSUFFICIENT — not enough evidence to decide
5	REGRESSION — `falsifyai diff` detected a verdict-class downgrade

Default --store-path is .falsifyai/replays.db. Use :memory: for ephemeral runs (test-only; replay and diff need a persistent store).

CI integration

The launch wedge in a GitHub Actions step:

- name: Reliability regression gate
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  run: |
    KNOWN_GOOD="${{ vars.FALSIFYAI_BASELINE_SESSION_ID }}"
    falsifyai run eval.yaml
    CANDIDATE=$(sqlite3 .falsifyai/replays.db \
      "SELECT session_id FROM sessions ORDER BY created_at_iso DESC LIMIT 1;")
    falsifyai diff "$KNOWN_GOOD" "$CANDIDATE"
    # Exit 5 = regression; the job fails.

The KNOWN_GOOD variable is a session id you captured locally against the production model and committed as a repo / org variable. CI runs the eval against the candidate model and diffs — exit 5 (REGRESSION) fails the job. Zero thresholds to tune; the regression criterion is the verdict-class downgrade.

Examples

Four dogfooded specs, all verified in CI (tests/integration/test_examples.py):

Example	Verdict	What it demonstrates
`examples/stable.yaml`	`STABLE` (exit 0)	A sane model under perturbation; both perturbation families + both invariants.
`examples/fragile.yaml`	`FRAGILE` (exit 1)	Model drift: baseline correct, perturbations wrong.
`examples/consistently_wrong.yaml`	`CONSISTENTLY_WRONG` (exit 2)	Confident hallucination: same wrong answer under every perturbation.
`examples/model_migration.yaml`	regression (exit 5)	The launch wedge — run twice, diff, exit 5 if any case regressed.

Run any of them:

falsifyai run examples/stable.yaml

A real provider is required at runtime (OPENAI_API_KEY or the equivalent env var for your provider). The dogfood tests in CI bypass real model calls by injecting MockAdapter through a test seam — see tests/integration/test_examples.py for the pattern.

Writing your own spec

The shortest valid spec (tests/fixtures/specs/minimal.yaml):

falsify:
  version: "1.0"
  name: "minimal"
model:
  provider: openai
  model: gpt-4o-mini
run:
  seed: 42
cases:
  - id: hello
    input:
      text: "Say hi."
    perturbations:
      - type: typo_noise
    invariants:
      - type: contains
        values: ["hi"]

The full spec schema (perturbation parameters, invariant types, verdict thresholds) is in plan.md §6. The spec language is locked for the 0.1.x line.

Local development

Requires Python 3.13+ and uv.

git clone https://github.com/ericckzhou/falsifyai
cd falsifyai
uv sync --extra dev
uv run pytest

Contributions follow the conventions in CONTRIBUTING.md. Architectural constraints (especially: resist resolver inflation) are non-negotiable; see that doc for the trust test any resolver-touching PR must pass.

Status and roadmap

0.1.0 (this release) — Phase 0 MVP. Spec language, perturbation runtime, materializer, invariants, execution adapter, replay store, real verdict resolver (stratified bootstrap CI, CONSISTENTLY_WRONG, falsifiability scoring), and the three-command CLI (run + replay + diff).

Phase 1 (post-0.1.0). Driven by real-world usage feedback. Likely additions: full ConsistencyOracle (embedding-based contradiction detection), falsifyai history --case <id> (time-series across sessions), falsifyai inspect <session_id> (deep-dive per-case view), exit code 6 (LOW_FALSIFIABILITY) wiring, --strict / --show-trending flags on diff, paraphrase + retrieval perturbation families, real-LiteLLM smoke testing in CI.

Phase 1 features will be evaluated against the question: does this preserve evidence density and resolver predictability, or does it inflate the surface? See docs/ARCHITECTURE.md and CONTRIBUTING.md for the discipline.

License

Apache 2.0 — see LICENSE.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.2.0

May 23, 2026

This version

0.1.0

May 21, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

falsifyai-0.1.0.tar.gz (440.2 kB view details)

Uploaded May 21, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

falsifyai-0.1.0-py3-none-any.whl (64.1 kB view details)

Uploaded May 21, 2026 Python 3

File details

Details for the file falsifyai-0.1.0.tar.gz.

File metadata

Download URL: falsifyai-0.1.0.tar.gz
Upload date: May 21, 2026
Size: 440.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for falsifyai-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`7e65612d76680bd4b8ee8e684f352d1811852700eede566c83cdbb295a480cb7`
MD5	`bff4efa9f8e4d1dbacfa488fc209a09d`
BLAKE2b-256	`d13c5d8eae3b884b7b9e8d24634b0e1ab1db2fd9e6743d8b1f7c43791a81a508`

See more details on using hashes here.

File details

Details for the file falsifyai-0.1.0-py3-none-any.whl.

File metadata

Download URL: falsifyai-0.1.0-py3-none-any.whl
Upload date: May 21, 2026
Size: 64.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for falsifyai-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2717b2ab34ee1e05fdc4cc1733d8efc62d9af2f658f322d69f9c503121c186cb`
MD5	`beb5ade6c4255440832f26de2e12dee8`
BLAKE2b-256	`2d400ee12044929c19bba0e933b2ac21167a01dcbb2cba2cfde47e93358ea4d4`

See more details on using hashes here.

falsifyai 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

FalsifyAI

You upgraded your model. Did anything break?

1. Define what good looks like

2. Run against your baseline model

3. Switch to the new model. Run again.

4. Diff the two runs

5. Replay any past session

What just happened?

Architecture

CLI reference

CI integration

Examples

Writing your own spec

Local development

Status and roadmap

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes