Falsification-first reliability testing for AI systems: perturb inputs, preserve replayable evidence, diff reliability across model changes.

FalsifyAI

FalsifyAI produces replayable, inspectable evidence that AI systems behave reliably under realistic pressure.

Most evaluation tools produce metrics. FalsifyAI produces evidence — durable, structured artifacts that survive the run and let you support reliability claims about a model migration with preserved, inspectable proof.

Status: 0.2.0 — Phase 1 first wave shipped (inspect, history, paraphrase, canonical case study, automated PyPI publishing). Spec language and verdict semantics remain locked for the 0.x line.

pip install falsifyai

For the semantic_equivalence invariant (pulls PyTorch, ~1GB):

pip install "falsifyai[semantic]"

What kind of tool is this?

FalsifyAI is evidence infrastructure for reliability claims about stochastic systems — most immediately, LLMs.

Think of it the way you'd think of:

Domain	Evidence infrastructure
Software supply chain	SBOM (CycloneDX, SPDX) — what's in this build, with provenance
Static analysis	SARIF — the structured record of what was scanned and found
Build provenance	Sigstore / in-toto — cryptographic attestations about what was built and by whom
Security events	Audit logs — preserved, inspectable, defensible after the fact
Stochastic-system reliability	FalsifyAI replay artifact — preserved, inspectable, defensible evidence that a model behaved reliably under realistic pressure

The underlying pattern isn't new. Applying it to stochastic-system reliability is. FalsifyAI is the stochastic-systems analogue of an evidence layer you already know.

The novelty isn't that we preserve evidence — it's what we preserve: every perturbed input, every model output, every invariant judgment, the verdict, the materialized spec, and the identity that ties them together. The CLI compresses; the artifact preserves the receipts.

The core terms

Three definitions that anchor everything else in this document:

Stochastic software can produce meaningfully different outputs for equivalent requests due to probabilistic inference, retrieval variability, tool interactions, or adaptive behavior. LLMs are the most common case today; future AI systems will extend the category.

A reliability claim is a bounded statement about how a stochastic system behaves under specified perturbation pressure, judged by specified invariants. "This case is STABLE under typo_noise and casing" is a reliability claim. "This model is reliable" is not — it's unfalsifiable and unbounded.

Reliability evidence is the preserved, replayable proof supporting a reliability claim. Without evidence, claims are anecdotes. With evidence, claims become inspectable.

In one sentence: FalsifyAI is a tool for producing reliability evidence that supports bounded reliability claims about stochastic software. The replay artifact is the durable object; everything else exists to produce, interpret, or consume one.

The 5-minute proof

The investigation takes three commands. One terminal. Real models. Replayable session IDs at the end.

1. Define what good looks like

If you pip install'd FalsifyAI, the examples aren't on disk yet. Grab one:

curl -O https://raw.githubusercontent.com/ericckzhou/falsifyai/main/examples/model_migration.yaml

Or git clone https://github.com/ericckzhou/falsifyai for all four. Then open examples/model_migration.yaml:

falsify:
  version: "1.0"
  name: "Model migration regression test"
model:
  provider: groq
  model: llama-3.3-70b-versatile
run:
  seed: 42
cases:
  - id: factual_recall
    input: { text: "What is the capital of France?" }
    expected: { contains: ["Paris"] }
    perturbations:
      - { type: typo_noise, count: 3 }
      - { type: casing }
    invariants:
      - { type: contains, values: ["Paris"] }

  - id: structured_output
    input: { text: 'Reply ONLY with a JSON object of the form {"capital": "<city>"}. What is the capital of Japan?' }
    expected: { contains: ['"capital"', "Tokyo"] }
    perturbations:
      - { type: typo_noise, count: 2 }
      - { type: casing }
    invariants:
      - { type: contains, values: ['"capital"', "Tokyo"] }

  - id: extraction
    input: { text: "Extract only the email addresses from this text: Contact alice@example.com or bob@example.com for details. The deadline is Friday." }
    expected: { contains: ["alice@example.com", "bob@example.com"] }
    perturbations:
      - { type: typo_noise, count: 2 }
      - { type: casing }
    invariants:
      - { type: contains, values: ["alice@example.com", "bob@example.com"] }

  - id: policy_summary
    input: { text: "Summarize this refund policy in one sentence: Customers can request a refund within 30 days if the item is unused and the receipt is provided." }
    expected: { contains: ["30 days", "unused", "receipt"] }
    perturbations:
      - { type: typo_noise, count: 2 }
      - { type: casing }
    invariants:
      - { type: contains, values: ["30 days", "unused", "receipt"] }

Four cases. One sanity anchor (factual recall) plus three production-shaped contracts: structured output, extraction, grounded policy summarization. The mix is deliberate — a migration regression then looks like a behavioral pattern across contract types, not a single anecdote.

2. Run against your baseline model

$ falsifyai run examples/model_migration.yaml
case: factual_recall     verdict: STABLE   confidence: 1.00 (CI: 1.00-1.00)
case: structured_output  verdict: STABLE   confidence: 1.00 (CI: 1.00-1.00)
case: extraction         verdict: FRAGILE  confidence: 0.00 (CI: 0.00-0.00)  worst: typo_noise
case: policy_summary     verdict: STABLE   confidence: 1.00 (CI: 1.00-1.00)
=================================================================
Session 7e51299481d5420d9181e71ba0449348 -> .falsifyai/replays.db
4 cases, verdict FRAGILE, 1 FRAGILE, 0 CONSISTENTLY_WRONG, falsifiability 0.36

Exit code: 1 (FRAGILE). Three contracts hold under pressure; one (extraction) is already fragile on this baseline — typo noise on alice@example.com corrupts the token and the model drops the address. That's a known weakness, now preserved as evidence. Note the session id — that's your baseline evidence artifact. Commit it to your repo if you want it durable.

3. Switch to the new model. Run again.

Swap model: llama-3.3-70b-versatile for model: openai/gpt-oss-120b (OpenAI's open-weights model, also on Groq). Run again:

$ falsifyai run examples/model_migration.yaml
case: factual_recall     verdict: STABLE   confidence: 1.00 (CI: 1.00-1.00)
case: structured_output  verdict: STABLE   confidence: 1.00 (CI: 1.00-1.00)
case: extraction         verdict: FRAGILE  confidence: 0.00 (CI: 0.00-0.00)  worst: typo_noise
case: policy_summary     verdict: FRAGILE  confidence: 0.00 (CI: 0.00-0.00)  worst: typo_noise
=================================================================
Session 4332c0d246bc4b3e875392ecdf3b1780 -> .falsifyai/replays.db
4 cases, verdict FRAGILE, 2 FRAGILE, 0 CONSISTENTLY_WRONG, falsifiability 0.36

Exit code: 1. The new (larger, more recent) model has the same pre-existing extraction weakness — plus a new failure: policy_summary is now fragile under the same typo perturbation that left the baseline untouched. Same spec. Different model. A real, quietly-introduced regression.

4. Diff the two evidence artifacts

$ falsifyai diff 7e51299481d5420d9181e71ba0449348 4332c0d246bc4b3e875392ecdf3b1780
Diff: baseline 7e51299481d5420d9181e71ba0449348 -> candidate 4332c0d246bc4b3e875392ecdf3b1780
Store: .falsifyai/replays.db
=================================================================
case: policy_summary  baseline: STABLE (1.00)  candidate: FRAGILE (0.00)  REGRESSED
=================================================================
1 regressed, 0 improved, 3 unchanged, 0 other, 0 added, 0 removed

Exit code: 5 (REGRESSION). Only the row that changed is shown. The pre-existing extraction fragility is compressed into the unchanged-count footer — that's not the news; the policy summary regression is.

One command. One verdict-class downgrade. One exit code your CI can gate on. One preserved evidence trail you can re-open six months from now and inspect.

5. Replay any past session

$ falsifyai replay --latest
Loaded session 4332c0d246bc4b3e875392ecdf3b1780 · created_at 2026-05-22T... from .falsifyai/replays.db
case: factual_recall     verdict: STABLE   confidence: 1.00 (CI: 1.00-1.00)
case: structured_output  verdict: STABLE   confidence: 1.00 (CI: 1.00-1.00)
case: extraction         verdict: FRAGILE  confidence: 0.00 (CI: 0.00-0.00)  worst: typo_noise
case: policy_summary     verdict: FRAGILE  confidence: 0.00 (CI: 0.00-0.00)  worst: typo_noise
=================================================================
4 cases, verdict FRAGILE, 2 FRAGILE, 0 CONSISTENTLY_WRONG, falsifiability 0.36

Replay is read-only. The verdict shown is the one assigned at run time — never re-resolved. The same evidence that triggered the regression alert is preserved indefinitely, even if the model is later deprecated, the API endpoint changes, or your spec evolves.

Without replay artifacts, this entire workflow is anecdotes. "The new model failed our eval on Tuesday" is unverifiable by Friday — the API may have changed, your harness may have been refactored, your colleague may want proof.

With replay artifacts, the workflow produces inspectable evidence. Re-open the artifact six months from now and the claim still stands on its own. That's the whole product. run → replay → diff is one falsification workflow that ends in a preserved, inspectable evidence artifact. Not three commands — one evidence workflow producing one durable record.

For a deeper walkthrough of these same sessions — including history, inspect, the cross-model extraction finding, and the U+202F invisible-character regression — see Invisible character substitution.

What's in the evidence?

The replay artifact (one row per session in .falsifyai/replays.db) preserves:

Identity — session_id (UUID), spec_hash (sha256 of source YAML), materialized_hash (sha256 of realized perturbations), created_at_iso, FalsifyAI version
The materialized spec — every realized perturbation string with its seed and lineage, so the inputs are exactly reproducible
Every model output — original and perturbed, raw, no post-processing
Every invariant judgment — which invariant ran on which output, pass/fail, evidence string
The verdict — assigned at run time using a deterministic priority chain, never re-resolved on read
Per-perturbation-family stability — stratified bootstrap CI per family, so the "worst case" is attributable

This is the evidence FalsifyAI exists to produce. The CLI compresses it into one row per case + a session summary; the artifact preserves the receipts.
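
The identity fields above rest on content hashing. A minimal Python sketch of the idea, assuming the hashes are plain SHA-256 hex digests over UTF-8 bytes. How FalsifyAI canonicalizes the YAML or serializes the realized perturbations before hashing is not specified here, so `sha256_hex` and the newline-joined serialization are illustrative assumptions, not the library's API:

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()


# Stand-in spec source; a real spec_hash would cover the whole YAML file.
spec_source = b'falsify:\n  version: "1.0"\n  name: "minimal"\n'
spec_hash = sha256_hex(spec_source)

# Stand-in realized perturbations. Joining with newlines is an assumption
# made for this sketch, not FalsifyAI's actual serialization.
realized = ["What is the capital of France?", "WHAT IS THE CAPITAL OF FRANCE?"]
materialized_hash = sha256_hex("\n".join(realized).encode("utf-8"))

print(spec_hash)
print(materialized_hash)
```

Because a digest changes when a single byte changes, matching hashes are a cheap way to check that two sessions were run from the same spec source and the same realized inputs.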

Five concepts, one screen each

Perturbations generate small input variations a real user might produce. Three families ship: typo_noise (character-level mutations), casing_variant (UPPER / lower / Title; written as type: casing in the example spec above), and paraphrase (LLM-generated semantic-preserving rewrites, validity-gated via embedding similarity). The first two test character-level robustness; paraphrase tests semantic robustness, an orthogonal pressure axis.
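
To make the two character-level families concrete, here is a rough Python sketch. It is not FalsifyAI's perturbation runtime: the function names are hypothetical, and modelling a typo as a single adjacent-character swap is an assumption (the text above only says "character-level mutations"). What it does show is why a seed makes the realized inputs reproducible:

```python
import random


def typo_noise(text: str, seed: int, count: int = 1) -> list[str]:
    """Return `count` variants, each with one adjacent-character swap."""
    rng = random.Random(seed)  # same seed -> same sequence of swaps
    variants = []
    for _ in range(count):
        chars = list(text)
        if len(chars) >= 2:
            i = rng.randrange(len(chars) - 1)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        variants.append("".join(chars))
    return variants


def casing_variants(text: str) -> list[str]:
    """UPPER / lower / Title variants of the input."""
    return [text.upper(), text.lower(), text.title()]


prompt = "What is the capital of France?"
print(typo_noise(prompt, seed=42, count=3))
print(casing_variants(prompt))
```

Calling `typo_noise` twice with the same seed returns the same list, which is the property the materialized spec relies on when it records each realized perturbation with its seed.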

Invariants judge whether a perturbed output is still "the same answer" as the original. contains checks for required substrings; semantic_equivalence compares embedding cosine similarity to a threshold.
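
A small Python sketch of the two invariant shapes, under stated assumptions: `contains` is modelled as an exact, case-sensitive substring check over every required value, and `semantically_equivalent` takes precomputed embedding vectors with a hypothetical default threshold of 0.8. Neither function is FalsifyAI's implementation, and the real invariants may normalize text or pick thresholds differently.

```python
import math


def contains(output: str, values: list[str]) -> bool:
    """Pass only if every required substring appears in the output."""
    return all(v in output for v in values)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantically_equivalent(a: list[float], b: list[float], threshold: float = 0.8) -> bool:
    """Pass if the two embeddings are at least `threshold` similar."""
    return cosine_similarity(a, b) >= threshold


print(contains("The capital of France is Paris.", ["Paris"]))
# A narrow no-break space (U+202F) between "30" and "days" defeats an
# exact substring check for "30 days".
print(contains("Refunds are available within 30\u202fdays.", ["30 days"]))
print(semantically_equivalent([1.0, 0.0], [1.0, 0.0]))
```

The second call illustrates the kind of brittleness the case study describes: the output looks right to a reader, but an exact substring contract fails on the invisible character.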

Verdicts compress evidence into one of five labels per case:

Verdict	Meaning	Exit
`STABLE`	All perturbations passed the invariants	0
`FRAGILE`	Some perturbations failed; model drifts under pressure	1
`CONSISTENTLY_WRONG`	Every output (including baseline) violates the ground truth	2
`INSUFFICIENT`	Not enough evidence to decide (too few perturbations)	4
`INVALID_EVAL`	The evaluation itself is invalid or contradictory	2

Verdicts use stratified bootstrap CI — each perturbation family is resampled independently, and the worst-case CI lower bound wins. A model that survives typos but breaks under casing reports the casing stability number, not an aggregated average that hides the failure. The verdict is a claim, and the artifact is what the claim rests on.
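
The "resample each family independently, report the worst lower bound" idea can be sketched in a few lines of Python. This is a generic percentile bootstrap, not the FalsifyAI resolver: the function names, the 1000 resamples, and the 95% interval are all assumptions chosen for illustration.

```python
import random


def bootstrap_ci(passes: list[bool], rng: random.Random,
                 n_resamples: int = 1000, alpha: float = 0.05) -> tuple[float, float]:
    """Percentile bootstrap CI for the pass rate of one perturbation family."""
    if not passes:
        raise ValueError("need at least one invariant result")
    n = len(passes)
    rates = sorted(
        sum(rng.choice(passes) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def worst_family(results: dict[str, list[bool]], seed: int = 42):
    """Resample each family on its own; the lowest CI lower bound wins."""
    rng = random.Random(seed)
    cis = {family: bootstrap_ci(passes, rng) for family, passes in results.items()}
    worst = min(cis, key=lambda family: cis[family][0])
    return worst, cis


# A model that survives typos but breaks under casing.
worst, cis = worst_family({
    "typo_noise": [True, True, True],
    "casing": [False, False, False],
})
print(worst, cis)
```

Pooling all six results would give a pass rate of 0.5 and hide which family failed; keeping the families separate makes the casing failure the reported number.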

Replay artifacts are the system's promise that claims are inspectable evidence, not anecdotes. They preserve the full evidence trail per session as described above. The verdict shown on replay is the one assigned at run time — replay never re-resolves.

Diff compares two artifacts case-by-case. The regression criterion is a binary verdict-class downgrade — STABLE → FRAGILE, STABLE → CONSISTENTLY_WRONG, or FRAGILE → CONSISTENTLY_WRONG. A competent user can predict the exit code from the two verdicts; there are no hidden thresholds. That predictability is the whole point — see "Resolver predictability" below.
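
The regression rule is simple enough to write down. The Python sketch below classifies each case from its two verdicts; only the three listed downgrades come from the text above. Treating the reverse transitions as "improved" and anything involving INSUFFICIENT or INVALID_EVAL as "other" is an assumption, and `classify` / `diff_sessions` are hypothetical names, not the package's API.

```python
# Severity order for the three verdicts named in the regression rule.
SEVERITY = {"STABLE": 0, "FRAGILE": 1, "CONSISTENTLY_WRONG": 2}


def classify(baseline: str, candidate: str) -> str:
    """Label one case from its baseline and candidate verdicts."""
    if baseline == candidate:
        return "unchanged"
    if baseline in SEVERITY and candidate in SEVERITY:
        return "regressed" if SEVERITY[candidate] > SEVERITY[baseline] else "improved"
    return "other"  # assumption: transitions involving INSUFFICIENT / INVALID_EVAL


def diff_sessions(baseline: dict[str, str], candidate: dict[str, str]) -> dict[str, list[str]]:
    """Bucket case ids by how their verdict changed between two sessions."""
    buckets = {k: [] for k in ("regressed", "improved", "unchanged", "other", "added", "removed")}
    for case_id in sorted(baseline.keys() | candidate.keys()):
        if case_id not in baseline:
            buckets["added"].append(case_id)
        elif case_id not in candidate:
            buckets["removed"].append(case_id)
        else:
            buckets[classify(baseline[case_id], candidate[case_id])].append(case_id)
    return buckets


# Verdicts from the two sessions in the 5-minute proof.
baseline = {"factual_recall": "STABLE", "structured_output": "STABLE",
            "extraction": "FRAGILE", "policy_summary": "STABLE"}
candidate = {"factual_recall": "STABLE", "structured_output": "STABLE",
             "extraction": "FRAGILE", "policy_summary": "FRAGILE"}
result = diff_sessions(baseline, candidate)
print({name: len(cases) for name, cases in result.items()})
```

With the verdicts from the walkthrough, the counts match the diff footer shown earlier: 1 regressed, 0 improved, 3 unchanged, 0 other, 0 added, 0 removed.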

For the full evidence-system semantics — what guarantees the artifact makes, what the verdict means as a claim — see docs/EVIDENCE.md. For the full philosophy, see docs/ARCHITECTURE.md.

Resolver predictability

The verdict resolver is the epistemic authority of the framework — the thing that says "this case is FRAGILE". Every downstream claim (replay, diff, CI gate, migration decision) rests on it.

The architectural discipline: a competent user must be able to predict the resolver's output from the inputs. If a careful engineer reading the spec, the perturbations, the executions, and the invariant results can reasonably anticipate the verdict, the resolver is legible. If they can't, it's a black box — regardless of how technically correct its internals are.

This isn't just an aesthetic choice. It's what makes the evidence auditable. An opaque resolver produces unfalsifiable claims; a predictable one produces defensible claims. The discipline is in service of the evidence — it's why an auditor (or a future you) can trust what's in the artifact.

See docs/ARCHITECTURE.md for the full discussion and the architectural rules that protect predictability as the project grows.

What FalsifyAI is not

The category clarity above implies things FalsifyAI deliberately is not, and is not aspiring to become:

Not a prompt optimization suite. No prompt tuning, no automated A/B over wordings. The spec is authored deliberately; the framework tests what's authored.
Not a telemetry platform. No streaming, no production dashboards, no time-series. The artifact is per-run preserved evidence, not a continuous-monitoring data point.
Not a generalized observability product. The CLI compresses; the artifact preserves. That's prioritized visibility, not less visibility — the headline tells you whether to look, the artifact tells you what to look at. There is no firehose drill-down.
Not a workflow orchestrator. No DAG runner, no pipeline engine. The five subcommands (run / replay / inspect / diff / history) are the entire surface.
Not an AI governance suite. Governance platforms consume reliability evidence; FalsifyAI produces it. Different layer.

These exclusions matter because they keep the surface compressible. Adding any of the above corrupts the discipline — evidence density requires evidence boundaries.

Architecture

Three layers, separated by design. The replay artifact is the central object; the other two layers exist to produce and interpret it.

flowchart LR
    subgraph Generation["Evidence generation"]
        Spec[Spec / YAML]
        Mat[Materialize]
        Exec[Execute]
        Spec --> Mat --> Exec
    end
    subgraph Interpretation["Evidence interpretation"]
        Inv[Invariants]
        Res[Verdict resolver]
        Ren[CLI render]
        Inv --> Res --> Ren
    end
    subgraph Preservation["Evidence preservation — the product"]
        Art[ReplayArtifact]
        Store[ReplayStore]
        Art --> Store
    end
    Exec --> Inv
    Ren --> Art
    Store -.->|replay/diff| Ren

ASCII fallback (for PyPI / mobile readers):

  EVIDENCE GENERATION             EVIDENCE INTERPRETATION         EVIDENCE PRESERVATION
  ─────────────────────           ───────────────────────         (the durable product)
  spec.yaml                       invariants                      ─────────────────────
     │                            verdict resolver                ReplayArtifact
     ▼                            CLI render                      ReplayStore
  materialize                            │                              ▲
     │                                   │                              │
     ▼                                   ▼                              │
  execute  ────────────────────────▶ judge ────────────▶ resolve ───────┘
                                                            │
                                       ┌── falsifyai run    │
                                       │── falsifyai replay │
                                       │── falsifyai inspect│  (consumers read
                                       │── falsifyai diff   │   the artifact)
                                       └── falsifyai history│

A future feature touches exactly one layer. Adaptive evidence collection is interpretation, not generation. A new perturbation family is generation, not interpretation. A new verdict shape is interpretation, not preservation. The separation is what keeps the resolver explainable as the project grows — see docs/ARCHITECTURE.md and the philosophy section of CONTRIBUTING.md.

CLI reference

Five subcommands, one workflow:

falsifyai run <spec.yaml> [--store-path PATH]
falsifyai replay <session_id> [--store-path PATH]
falsifyai replay --latest      [--store-path PATH]
falsifyai inspect <session_id> [--case CASE_ID] [--full] [--store-path PATH]
falsifyai diff <baseline_id> <candidate_id> [--store-path PATH]
falsifyai history <case_id> [--limit N] [--store-path PATH]

Exit code	Meaning
0	SUCCESS — session verdict STABLE
1	DEGRADED — session verdict FRAGILE
2	FAILURE — session verdict CONSISTENTLY_WRONG or INVALID_EVAL
3	ERROR — infrastructure failure (bad spec, missing credential, model call failure)
4	INSUFFICIENT — not enough evidence to decide
5	REGRESSION — `falsifyai diff` detected a verdict-class downgrade

Default --store-path is .falsifyai/replays.db. Use :memory: for ephemeral runs (test-only; replay and diff need a persistent store).
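
For scripts that wrap the CLI, the exit-code table can be mirrored as an enum. This is a sketch: `ExitCode`, `VERDICT_EXIT`, and `describe` are hypothetical names for illustration and are not part of the falsifyai package; only the numeric values and their meanings come from the tables in this document.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    SUCCESS = 0       # session verdict STABLE
    DEGRADED = 1      # session verdict FRAGILE
    FAILURE = 2       # session verdict CONSISTENTLY_WRONG or INVALID_EVAL
    ERROR = 3         # infrastructure failure
    INSUFFICIENT = 4  # not enough evidence to decide
    REGRESSION = 5    # `falsifyai diff` detected a verdict-class downgrade


# Session verdict -> exit code, as listed in the verdict and exit-code tables.
VERDICT_EXIT = {
    "STABLE": ExitCode.SUCCESS,
    "FRAGILE": ExitCode.DEGRADED,
    "CONSISTENTLY_WRONG": ExitCode.FAILURE,
    "INVALID_EVAL": ExitCode.FAILURE,
    "INSUFFICIENT": ExitCode.INSUFFICIENT,
}


def describe(returncode: int) -> str:
    """Name a CLI return code, or flag it as outside the documented table."""
    try:
        return ExitCode(returncode).name
    except ValueError:
        return f"UNDOCUMENTED({returncode})"


print(describe(1))
print(describe(5))
```

A wrapper can then branch on names rather than bare integers, for example treating `ExitCode.ERROR` as "retry or fix the environment" and `ExitCode.REGRESSION` as "block the merge".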

CI integration

Ship the evidence with your PR, not just the pass/fail signal:

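- name: Reliability regression gate
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  run: |
    KNOWN_GOOD="${{ vars.FALSIFYAI_BASELINE_SESSION_ID }}"
    # The step shell runs with `bash -e`, so a non-zero exit from `run`
    # (1 = FRAGILE, 2 = FAILURE, 4 = INSUFFICIENT) would end the step before the diff.
    run_rc=0
    falsifyai run eval.yaml || run_rc=$?
    # Exit 3 = infrastructure error (bad spec, missing credential, model call failure): stop here.
    if [ "$run_rc" -eq 3 ]; then exit 3; fi
    CANDIDATE=$(sqlite3 .falsifyai/replays.db \
      "SELECT session_id FROM sessions ORDER BY created_at_iso DESC LIMIT 1;")
    falsifyai diff "$KNOWN_GOOD" "$CANDIDATE"
    # Exit 5 = regression; the job fails.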

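The KNOWN_GOOD variable is a session id you captured locally against the production model and stored as a repo / org variable. CI runs the eval against the candidate model and diffs the two sessions; exit 5 (REGRESSION) fails the job. Because diff reads both sessions from the store, the baseline session must also be present in the .falsifyai/replays.db that CI uses (for example, by committing the store alongside the session id). There are zero thresholds to tune; the regression criterion is the verdict-class downgrade. The full evidence artifact is preserved in .falsifyai/replays.db and can be archived as a CI artifact for later inspection.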
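
If the CI image has Python but not the sqlite3 command-line tool, the same lookup can be done with the standard library. The sketch below reuses the table and column names from the query in the workflow step (sessions, session_id, created_at_iso); the rest of the store schema is not documented here, and `latest_session_id` is a hypothetical helper, not part of the package.

```python
import sqlite3
from typing import Optional


def latest_session_id(conn: sqlite3.Connection) -> Optional[str]:
    """Return the newest session_id, or None if the store has no sessions."""
    row = conn.execute(
        "SELECT session_id FROM sessions ORDER BY created_at_iso DESC LIMIT 1"
    ).fetchone()
    return row[0] if row else None


# Demo against an in-memory stand-in table. Only the two columns used by the
# CI query are modelled, and the ids and timestamps are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (session_id TEXT, created_at_iso TEXT)")
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?)",
    [
        ("baseline-session", "2026-01-01T00:00:00+00:00"),
        ("candidate-session", "2026-01-02T00:00:00+00:00"),
    ],
)
print(latest_session_id(conn))
```

Against a real store you would pass `sqlite3.connect(".falsifyai/replays.db")` instead. Ordering by the timestamp string only works if every row uses the same ISO 8601 format and offset, which this sketch assumes.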

Examples

Four dogfooded specs, all verified in CI (tests/integration/test_examples.py):

Example	Verdict	What it demonstrates
`examples/stable.yaml`	`STABLE` (exit 0)	A sane model under perturbation; both perturbation families + both invariants.
`examples/fragile.yaml`	`FRAGILE` (exit 1)	Model drift: baseline correct, perturbations wrong.
`examples/consistently_wrong.yaml`	`CONSISTENTLY_WRONG` (exit 2)	Confident hallucination: same wrong answer under every perturbation.
`examples/model_migration.yaml`	regression (exit 5)	The launch wedge — run twice, diff, exit 5 if any case regressed. The 5-minute proof above uses this spec.

Run any of them:

falsifyai run examples/stable.yaml

A real provider is required at runtime (OPENAI_API_KEY, GROQ_API_KEY, etc. — whichever your spec's provider: field points at). The dogfood tests in CI bypass real model calls by injecting MockAdapter through a test seam — see tests/integration/test_examples.py for the pattern.

Case studies

Worked tours of FalsifyAI's evidence infrastructure over real preserved artifacts. Each case study is itself a FalsifyAI artifact: a ReplayStore bundle plus prose that walks through what history, diff, inspect, and replay reveal when read against it.

#	Title	What it demonstrates
01	Invisible character substitution	Cross-model `contains`-contract brittleness as a persistent class; a model-migration regression (U+202F substitution between "30" and "days") as the vivid instance.

See docs/case-studies/ for the index, the bundled replay artifact (SHA256 in provenance README), and the framing convention case studies follow.

Writing your own spec

The shortest valid spec (tests/fixtures/specs/minimal.yaml):

falsify:
  version: "1.0"
  name: "minimal"
model:
  provider: openai
  model: gpt-4o-mini
run:
  seed: 42
cases:
  - id: hello
    input:
      text: "Say hi."
    perturbations:
      - type: typo_noise
    invariants:
      - type: contains
        values: ["hi"]

The full spec schema (perturbation parameters, invariant types, verdict thresholds) is in plan.md §6. The spec language is locked for the 0.x line.

Local development

Requires Python 3.13+ and uv.

git clone https://github.com/ericckzhou/falsifyai
cd falsifyai
uv sync --extra dev
uv run pytest

Contributions follow the conventions in CONTRIBUTING.md. Architectural constraints (especially: resist resolver inflation) are non-negotiable; see that doc for the trust test any resolver-touching PR must pass.

Status and roadmap

0.2.0 (current release) — Phase 1 first wave. Adds:

✅ falsifyai inspect <session_id> — per-case deep-dive over preserved evidence. Surfaces every perturbed input, output, and invariant judgment. --case <case_id> expands one case; --full disables truncation. Pure consumer surface — the artifact already contained the data.
✅ paraphrase perturbation family — LLM-generated semantic-preserving rewrites with embedding-similarity validity gating. Tests semantic robustness as an orthogonal pressure axis to the character-level families. Configurable per-spec (count, similarity_threshold, max_attempts, optional model override).
✅ falsifyai history <case_id> — temporal view of one case across saved sessions. Newest-first, one row per session, showing verdict + CI + worst family per row. Reads case.verdict from preserved artifacts; no aggregation, no trend inference, no reinterpretation.
✅ Canonical case study — Invisible character substitution: cross-model contains-contract brittleness as the thesis (history), Pair 3 model-migration regression as the vivid concrete proof (diff + inspect), over a bundled replay artifact you can re-open and reproduce verbatim.
✅ Automated PyPI publishing via Trusted Publisher (OIDC) — .github/workflows/publish.yml fires on any v* tag push: verifies version match, re-runs tests, builds, validates, publishes. No long-lived tokens in repo.

0.1.0 — Phase 0 MVP. Spec language, perturbation runtime, materializer, invariants, execution adapter, replay store, real verdict resolver (stratified bootstrap CI, CONSISTENTLY_WRONG, falsifiability scoring), and the three-command CLI (run + replay + diff).

Coming next — selected by evidence, not theoretical completeness:

diff sharpening — --strict, --show-trending, exit code 6 for low-falsifiability gates. Tightens the binary regression criterion for users who want finer CI control without compromising resolver predictability.
Artifact infrastructure track — falsifyai verify <session_id> (integrity + provenance), falsifyai export --bundle (productize the case-study extraction pattern), and a persisted CLI-invocation field in ReplayArtifact. Locked sequence; reassess after a second case study or real user pressure.

Each addition is evaluated against: does this preserve evidence density, resolver predictability, and the discipline that makes the artifact trustworthy? See docs/ARCHITECTURE.md, docs/EVIDENCE.md, and CONTRIBUTING.md for the discipline.

License

Apache 2.0 — see LICENSE.

Project details

Release history

This version

0.2.0

May 23, 2026

0.1.0

May 21, 2026

Download files

Source Distribution

falsifyai-0.2.0.tar.gz (326.4 kB)

Uploaded May 23, 2026 Source

Built Distribution

falsifyai-0.2.0-py3-none-any.whl (79.5 kB)

Uploaded May 23, 2026 Python 3

File details

Details for the file falsifyai-0.2.0.tar.gz.

File metadata

Download URL: falsifyai-0.2.0.tar.gz
Upload date: May 23, 2026
Size: 326.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for falsifyai-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`06a869054a0a5aa2ca197cd05c1fa51aeca47d56f5c72b77f381b868071d6fdd`
MD5	`cb89d54258cffda5d6b11bf4e92f23a4`
BLAKE2b-256	`552383eb5e80ca57d84e0e4e6d49c508d4b22e74494287851d101a1041ad6061`

Provenance

The following attestation bundles were made for falsifyai-0.2.0.tar.gz:

Publisher: publish.yml on ericckzhou/falsifyai

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: falsifyai-0.2.0.tar.gz
- Subject digest: 06a869054a0a5aa2ca197cd05c1fa51aeca47d56f5c72b77f381b868071d6fdd
- Sigstore transparency entry: 1610687661
- Sigstore integration time: May 23, 2026
Source repository:
- Permalink: ericckzhou/falsifyai@e0084ea92eab148597784ae1783c3dee327d0a06
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/ericckzhou
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@e0084ea92eab148597784ae1783c3dee327d0a06
- Trigger Event: push

File details

Details for the file falsifyai-0.2.0-py3-none-any.whl.

File metadata

Download URL: falsifyai-0.2.0-py3-none-any.whl
Upload date: May 23, 2026
Size: 79.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for falsifyai-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`fa01d32a875435ed6c92d1660fc9a1ceb505a75197f773e420e0ff891b64ec56`
MD5	`5198896efd28c7293e684631af4d5735`
BLAKE2b-256	`9f8299c72706b7507336bd6dfd5d3303aeef1fe67fd6837af09fa96861fbbe8d`

Provenance

The following attestation bundles were made for falsifyai-0.2.0-py3-none-any.whl:

Publisher: publish.yml on ericckzhou/falsifyai

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: falsifyai-0.2.0-py3-none-any.whl
- Subject digest: fa01d32a875435ed6c92d1660fc9a1ceb505a75197f773e420e0ff891b64ec56
- Sigstore transparency entry: 1610687746
- Sigstore integration time: May 23, 2026
Source repository:
- Permalink: ericckzhou/falsifyai@e0084ea92eab148597784ae1783c3dee327d0a06
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/ericckzhou
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@e0084ea92eab148597784ae1783c3dee327d0a06
- Trigger Event: push
