Open experiment runner for LLM behavior changes. Fork production traces, replay with a proposed change, score the diff, emit a PR-ready verdict report.
whatifd
whatifd's product is the verdict's defensibility. Fork production traces, replay with a proposed change, score the diff — and ship a Ship / Don't Ship / Inconclusive verdict a reviewer can read, follow the reasoning, and either trust or know exactly which assumption to challenge.
When you change a prompt, model, or tool in an LLM system, you don't actually know whether it improves behavior — you guess, with a handful of cherry-picked traces and inconsistent evaluation. Every step in the workflow has a tool: Langfuse for traces, Inspect AI for scoring, GitHub for PRs. The experiment doesn't.
whatifd is the experiment runner. Fork production traces (failed cases plus a representative baseline), replay them with your proposed change (original tool outputs cached so side effects don't re-fire), score with the judge of your choice, and produce a Markdown + JSON verdict report you can attach to the PR. You stop shipping changes that fix one failure while silently regressing ten others. You go from "this feels better" to "this improved 14/20, regressed 3 — here's exactly where, and here's the evidence I'd defend in review."
Stop shipping LLM changes on gut feel.
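As a rough illustration of what a tally such as "improved 14/20, regressed 3" summarizes, the sketch below classifies paired per-trace scores. It is not whatifd's API: `tally_deltas`, the `tolerance` parameter, and the example scores are hypothetical names and values chosen for this example.

```python
from collections import Counter


def tally_deltas(baseline, candidate, tolerance=0.0):
    """Count traces whose score improved, regressed, or stayed within tolerance."""
    if len(baseline) != len(candidate):
        raise ValueError("scores must be paired per trace")
    counts = Counter(improved=0, regressed=0, unchanged=0)
    for before, after in zip(baseline, candidate):
        delta = after - before
        if delta > tolerance:
            counts["improved"] += 1
        elif delta < -tolerance:
            counts["regressed"] += 1
        else:
            counts["unchanged"] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical judge scores for three traces, before and after a change.
    print(tally_deltas([0.2, 0.8, 0.5], [0.9, 0.4, 0.5]))
```

A count like this is only the headline; the per-trace evidence behind each bucket is what a reviewer would actually inspect.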
Status
v0.2.0 (alpha). v0.2 widens v0.1 along five axes:
- A regression_check experiment shape joins failure_rescue (Phase A/C).
- A doctrinally-correct paired-percentile bootstrap replaces the v0.1 empirical-quantile shortcut, and MethodologyDisclosure.bootstrap.method declares the real method (Phase E.1/E.2).
- The Arize Phoenix / OpenInference adapter ships as whatifd-phoenix (Phase D).
- A whatifd-fork GitHub Action wraps the CLI for PR-comment + status-annotation workflows (Phase I).
- Cardinal rule #4 (determinism) widens from top-level-only to per-field opt-in inside RunManifest, with cross-platform CI byte-equality enforcement (Phase J).

In addition, inspect_ai is now reachable from YAML via scorer.score_fn (Phase B).
| Version | Status | What it does |
|---|---|---|
| v0.1 | shipped (2026-05-09) | Langfuse ingest, prompt override, cached-tool replay, Inspect AI scorer, evidence-first Markdown + JSON reports, CI exit codes. |
| v0.2 | shipped (2026-05-10) | regression_check shape; paired-percentile bootstrap; Phoenix / OpenInference adapter; whatifd-fork GitHub Action; per-field determinism widening + cross-platform CI; YAML-loaded inspect_ai scorer. |
| v0.3 | M12 | Cluster-paired bootstrap; LangSmith adapter; marketplace publication of the GitHub Action; environment.dependencies ordering canonicalization; live-tool replay (opt-in, allowlist). |
| v1.0 | year 2 | The pre-merge regression gate for LLM behavior. |
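For readers unfamiliar with the term, a paired percentile bootstrap resamples the per-trace score differences (so each trace's before and after scores stay together) and reads the confidence interval off the percentiles of the resampled means. The sketch below is a generic textbook version using only the standard library. It is not whatifd's implementation, and the function name, parameters, and defaults are assumptions made for illustration.

```python
import random
import statistics


def paired_percentile_bootstrap(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean_delta, ci_low, ci_high) for paired per-trace scores."""
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need equal-length, non-empty paired scores")
    # Pairing: work on per-trace differences, never on the two samples independently.
    deltas = [after - before for before, after in zip(baseline, candidate)]
    rng = random.Random(seed)  # fixed seed so repeated runs give the same interval
    n = len(deltas)
    # Resample the differences with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(deltas, k=n)) for _ in range(n_resamples)
    )
    # Percentile interval: read the alpha/2 and 1 - alpha/2 positions directly.
    low_index = int((alpha / 2) * n_resamples)
    high_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return statistics.fmean(deltas), means[low_index], means[high_index]
```

An interval that excludes zero suggests the change moved scores in one direction on these traces; an interval that straddles zero is the kind of result an Inconclusive verdict would reflect.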
Install
uv pip install whatifd whatifd-langfuse whatifd-phoenix whatifd-inspect-ai
# From source (uv workspace):
git clone https://github.com/victoralfred/whatifd
cd whatifd
uv sync --all-extras --dev --group workspace
Quickstart (programmatic — works today)
The library API is the load-bearing surface. The snippet below is shape-only — it omits RunManifest, MethodologyDisclosure, and CacheSummary construction plus the actual run_pipeline(...) call to keep the README focused. The full runnable end-to-end example lives at docs/getting-started.md. Minimal shape:
from whatifd.adapters.stub import StubTraceSource, StubTraceSpec
from whatifd.adapters.factory import build_scorer
from whatifd.cli_pipeline import build_delta_fn
from whatifd.config import ChangeConfig, ScorerConfig
from whatifd.pipeline import run_pipeline
from whatifd.runner_loader import load_runner
# Your runner satisfies the contract Protocol — see docs/runner-contract.md
loaded_runner = load_runner("python:my_agent.replay:run")
scorer = build_scorer(ScorerConfig(adapter="stub")) # or wire a real Inspect AI scorer
trace_source = StubTraceSource(specs=[
    StubTraceSpec(trace_id="f-1", user_message="...", original_response="...", cohort="failure"),
    # ...
])
delta_fn = build_delta_fn(
    loaded_runner=loaded_runner,
    scorer=scorer,
    change=ChangeConfig(system_prompt="new prompt"),
    replay_timeout_seconds=60.0,
)
# Construct floor / policy / runtime / methodology / cache_summary,
# then call run_pipeline → ReportV01.
# Full worked example: docs/getting-started.md.
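The runner is named by a spec string of the form python:<module>:<attr>. One plausible reading of that string, sketched with the standard library, is shown below. This is an assumption about the format only: `resolve_spec` is a hypothetical helper, and whatifd's own load_runner may validate or wrap the callable differently.

```python
import importlib


def resolve_spec(spec):
    """Resolve a 'python:<module>:<attr>' string to the object it names."""
    scheme, _, rest = spec.partition(":")
    module_name, _, attr = rest.partition(":")
    if scheme != "python" or not module_name or not attr:
        raise ValueError(f"expected 'python:<module>:<attr>', got {spec!r}")
    # Import the dotted module path, then look up the attribute on it.
    module = importlib.import_module(module_name)
    return getattr(module, attr)


if __name__ == "__main__":
    # Standard-library target used purely as a stand-in for a real runner.
    sqrt = resolve_spec("python:math:sqrt")
    print(sqrt(9.0))
```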
Quickstart (CLI — stub adapters work today)
# Write a config:
cat > whatifd.config.yaml <<EOF
source:
  adapter: stub
target:
  runner: python:examples.minimal_agent.replay:run
selection:
  failure_cohort: { limit: 5 }
  baseline_cohort: { limit: 5 }
change:
  system_prompt: my new prompt
scorer:
  adapter: stub
decision: {}
reporting: {}
timeouts: {}
EOF
# Run the fork:
uv run whatifd fork --config whatifd.config.yaml
# Exit codes:
# 0 = Ship verdict
# 1 = Don't Ship verdict
# 2 = Inconclusive verdict / setup failure / floor violation
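A script that wraps the CLI can translate those exit codes back into verdicts. The sketch below assumes `uv` is on PATH and reuses the command shown above; `describe_exit_code` and `run_fork` are hypothetical helper names, not part of whatifd.

```python
import subprocess

# Mapping taken from the exit-code list above.
EXIT_CODE_MEANING = {
    0: "Ship",
    1: "Don't Ship",
    2: "Inconclusive verdict, setup failure, or floor violation",
}


def describe_exit_code(code):
    """Translate a whatifd fork exit code into a human-readable verdict."""
    return EXIT_CODE_MEANING.get(code, f"unexpected exit code {code}")


def run_fork(config_path):
    """Run the fork command and return (exit_code, meaning)."""
    result = subprocess.run(
        ["uv", "run", "whatifd", "fork", "--config", config_path],
        check=False,  # non-zero exit codes are verdicts here, not errors
    )
    return result.returncode, describe_exit_code(result.returncode)
```

Because exit code 2 covers three distinct situations, a wrapper that needs to tell them apart would have to read the report rather than rely on the code alone.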
Real Langfuse traces require LANGFUSE_HOST (or LANGFUSE_BASE_URL) + LANGFUSE_PUBLIC_KEY + LANGFUSE_SECRET_KEY in the environment. Real Inspect AI scoring is reachable from YAML via scorer.score_fn: python:<module>:<attr> (Phase B); the v0.1 programmatic-only path is preserved.
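A small preflight check can catch missing Langfuse settings before a run starts. The sketch below only inspects the variable names listed above; `missing_langfuse_env` is a hypothetical helper, and whatifd may perform its own validation.

```python
import os


def missing_langfuse_env(env=None):
    """Return the names of required Langfuse settings that are absent or empty."""
    env = os.environ if env is None else env
    missing = []
    # Either host variable is accepted, per the paragraph above.
    if not (env.get("LANGFUSE_HOST") or env.get("LANGFUSE_BASE_URL")):
        missing.append("LANGFUSE_HOST or LANGFUSE_BASE_URL")
    for name in ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY"):
        if not env.get(name):
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_langfuse_env()
    if problems:
        print("Missing Langfuse settings: " + ", ".join(problems))
```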
How it composes
whatifd doesn't replace your tracer or your eval framework — it composes them into an experiment.
- Tracers (reads from): Langfuse (v0.1); Arize Phoenix / OpenInference (v0.2); LangSmith / OpenTelemetry GenAI (v0.3+).
- Scorers (wraps): Inspect AI (v0.1, real adapter shipped); pluggable via the scorer registry.
- Your agent (calls back into): any Python callable matching the runner contract.
- Downstream of whatifd's decisions: your existing CI (GitHub Actions, GitLab CI), SLO platforms (Nobl9, Sloth, Honeycomb), incident tooling.
What whatifd is not
- Not a tracer (use Langfuse / Phoenix / LangSmith / OpenTelemetry GenAI).
- Not an offline eval harness (use Inspect AI / Promptfoo; whatifd wraps them).
- Not an SLO platform (use Nobl9 / Sloth / Honeycomb downstream of whatifd's decisions).
- Not an agent runtime — the runner contract is the boundary.
- Not a UI or dashboard.
- Not a substitute for production monitoring; not a benchmark suite; not a load test; not a causal estimator beyond replay association; not a judge-quality validator (see docs/concepts.md).
Documentation
- docs/concepts.md: the conceptual model (defensible verdicts, non-claims, trust floor vs decision policy, failure-as-data, evidence and audit bundle)
- docs/getting-started.md: worked end-to-end example
- docs/runner-contract.md: the user-facing extension point reference
- docs/schema/v0.1.md: ReportV01 consumer compatibility guide
- docs/walkthroughs/: six rendered scenarios as reference (Ship, Don't Ship, Inconclusive)
- examples/minimal-agent/: copy-paste reference Runner
Design
The full design — problem framing, prior art, runner contract, report shape, eval target, milestones, risks — lives in DESIGN.md. The doctrine and cardinal rules are in .claude/skills/whatifd-design/SKILL.md.
Contributing
Pre-alpha. Issues and design discussion welcome; pull requests deferred until v0.1 ships.
License
Apache 2.0. See LICENSE.