Open experiment runner for LLM behavior changes. Fork production traces, replay with a proposed change, score the diff, emit a PR-ready verdict report.

whatifd

License: Apache 2.0 · Python 3.11+

whatifd's product is the verdict's defensibility. Fork production traces, replay with a proposed change, score the diff, and ship a Ship / Don't Ship / Inconclusive verdict whose reasoning a reviewer can read and follow, and then either trust or challenge at a specific, named assumption.

whatifd workflow

When you change a prompt, model, or tool in an LLM system, you don't actually know whether it improves behavior — you guess, with a handful of cherry-picked traces and inconsistent evaluation. Every step in the workflow has a tool: Langfuse for traces, Inspect AI for scoring, GitHub for PRs. The experiment doesn't.

whatifd is the experiment runner. Fork production traces (failed cases plus a representative baseline), replay them with your proposed change (original tool outputs cached so side effects don't re-fire), score with the judge of your choice, and produce a Markdown + JSON verdict report you can attach to the PR. You stop shipping changes that fix one failure while silently regressing ten others. You go from "this feels better" to "this improved 14/20, regressed 3 — here's exactly where, and here's the evidence I'd defend in review."
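
The "original tool outputs cached" step above can be pictured with a small sketch. This is an illustration of the general idea only, not whatifd's implementation: the class name `CachedToolReplay`, the shape of the recorded calls, and the choice to raise on a cache miss are all assumptions made for the example.

```python
import json
from typing import Any


class CachedToolReplay:
    """Serve recorded tool outputs during replay instead of calling live tools.

    Hypothetical helper for illustration; not part of the whatifd API.
    """

    def __init__(self, recorded: list[dict[str, Any]]) -> None:
        # Index recorded calls by (tool name, canonicalized arguments).
        self._cache = {
            self._key(call["tool"], call["args"]): call["output"]
            for call in recorded
        }
        self.misses: list[tuple[str, str]] = []

    @staticmethod
    def _key(tool: str, args: dict[str, Any]) -> tuple[str, str]:
        # sort_keys makes the key independent of argument order.
        return tool, json.dumps(args, sort_keys=True)

    def call(self, tool: str, args: dict[str, Any]) -> Any:
        key = self._key(tool, args)
        if key not in self._cache:
            # A changed prompt can lead the agent to request a call that was
            # never recorded. Note the miss and refuse to fire the live tool.
            self.misses.append(key)
            raise LookupError(f"no recorded output for {tool} with args {key[1]}")
        return self._cache[key]


# Illustrative recorded trace data (made up for this example).
recorded_calls = [
    {
        "tool": "send_email",
        "args": {"to": "a@example.com", "body": "hi"},
        "output": {"status": "sent"},
    },
]
replay = CachedToolReplay(recorded_calls)

# Same arguments in a different order still hit the cache; no email is sent.
result = replay.call("send_email", {"body": "hi", "to": "a@example.com"})
print(result)
```

Raising on a miss is one conservative choice: a replay that silently fell back to the live tool would re-fire the side effect the cache exists to prevent.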

Stop shipping LLM changes on gut feel.


whatifd on one page

Status

v0.2.0 (alpha). v0.2 widens v0.1 along five axes:

  • Experiment shapes: a regression_check shape joins failure_rescue (Phase A/C).
  • Statistics: a doctrinally correct paired-percentile bootstrap replaces the v0.1 empirical-quantile shortcut, and MethodologyDisclosure.bootstrap.method declares the method actually used (Phase E.1/E.2).
  • Tracers: the Arize Phoenix / OpenInference adapter ships as whatifd-phoenix (Phase D).
  • CI: a whatifd-fork GitHub Action wraps the CLI for PR-comment and status-annotation workflows (Phase I).
  • Determinism: cardinal rule #4 widens from top-level-only to per-field opt-in inside RunManifest, with cross-platform CI byte-equality enforcement (Phase J).

In addition, inspect_ai is now reachable from YAML via scorer.score_fn (Phase B).
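
To make "per-field opt-in" determinism and byte-equality concrete, here is a minimal sketch of one common way to get byte-stable output for a chosen subset of fields. It is not taken from whatifd: the function names, the example manifest keys, and the use of canonical JSON plus SHA-256 are assumptions for illustration.

```python
import hashlib
import json
from typing import Any


def canonical_bytes(payload: dict[str, Any], deterministic_fields: set[str]) -> bytes:
    """Serialize only the opted-in fields to a byte-stable JSON form.

    Hypothetical helper; the field names below are not whatifd's schema.
    """
    selected = {k: v for k, v in payload.items() if k in deterministic_fields}
    # Sorted keys and fixed separators remove dict-order and whitespace
    # variation; ensure_ascii avoids platform-dependent encodings.
    text = json.dumps(selected, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return (text + "\n").encode("utf-8")


def digest(payload: dict[str, Any], deterministic_fields: set[str]) -> str:
    """Hash the canonical form so two runs can be compared cheaply."""
    return hashlib.sha256(canonical_bytes(payload, deterministic_fields)).hexdigest()


# Two illustrative manifests that differ only in a field that has NOT opted in.
opted_in = {"seed", "config_hash"}
run_a = {"seed": 7, "started_at": "2026-05-10T09:00:00Z", "config_hash": "abc"}
run_b = {"config_hash": "abc", "started_at": "2026-05-10T17:30:00Z", "seed": 7}

print(canonical_bytes(run_a, opted_in))
print(digest(run_a, opted_in) == digest(run_b, opted_in))
```

Under these assumptions, a cross-platform CI job could compare the digests produced on each operating system and fail when they differ.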

| Version | Status | What it does |
|---|---|---|
| v0.1 | shipped (2026-05-09) | Langfuse ingest, prompt override, cached-tool replay, Inspect AI scorer, evidence-first Markdown + JSON reports, CI exit codes. |
| v0.2 | shipped (2026-05-10) | regression_check shape; paired-percentile bootstrap; Phoenix / OpenInference adapter; whatifd-fork GitHub Action; per-field determinism widening + cross-platform CI; YAML-loaded inspect_ai scorer. |
| v0.3 | M12 | Cluster-paired bootstrap; LangSmith adapter; marketplace publication of the GitHub Action; environment.dependencies ordering canonicalization; live-tool replay (opt-in, allowlist). |
| v1.0 | year 2 | The pre-merge regression gate for LLM behavior. |
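
For readers unfamiliar with the paired-percentile bootstrap named in the status notes, the sketch below shows the textbook form of the technique on per-trace score pairs. It is not whatifd's code: the function name, defaults, and the made-up scores are assumptions, and the real method is whatever MethodologyDisclosure.bootstrap.method reports.

```python
import random
from statistics import mean


def paired_percentile_bootstrap(
    baseline: list[float],
    candidate: list[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (mean delta, CI lower, CI upper) for paired per-trace scores."""
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("baseline and candidate must be non-empty and equal length")
    # Pairing: each delta compares the same trace before and after the change.
    deltas = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(deltas)
    # Resample whole pairs (represented by their deltas) with replacement.
    resampled_means = sorted(
        mean(rng.choices(deltas, k=n)) for _ in range(n_resamples)
    )
    # Percentile interval: read the alpha/2 and 1 - alpha/2 quantiles directly.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return mean(deltas), resampled_means[lo_idx], resampled_means[hi_idx]


# Illustrative scores for eight traces (not real data).
baseline_scores = [0.2, 0.4, 0.9, 0.5, 0.3, 0.8, 0.6, 0.1]
candidate_scores = [0.6, 0.7, 0.7, 0.5, 0.8, 0.9, 0.4, 0.5]

improved = sum(c > b for b, c in zip(baseline_scores, candidate_scores))
regressed = sum(c < b for b, c in zip(baseline_scores, candidate_scores))
mean_delta, lower, upper = paired_percentile_bootstrap(baseline_scores, candidate_scores)

print(f"improved={improved} regressed={regressed}")
print(f"mean delta={mean_delta:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

Resampling deltas rather than the two score lists independently is what makes the bootstrap paired: per-trace difficulty cancels out within each pair, so the interval reflects the effect of the change rather than the variation between traces.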

Install

uv pip install whatifd whatifd-langfuse whatifd-phoenix whatifd-inspect-ai

# From source (uv workspace):
git clone https://github.com/victoralfred/whatifd
cd whatifd
uv sync --all-extras --dev --group workspace

Quickstart (programmatic — works today)

The library API is the load-bearing surface. The snippet below is shape-only — it omits RunManifest, MethodologyDisclosure, and CacheSummary construction plus the actual run_pipeline(...) call to keep the README focused. The full runnable end-to-end example lives at docs/getting-started.md. Minimal shape:

from whatifd.adapters.stub import StubTraceSource, StubTraceSpec
from whatifd.adapters.factory import build_scorer
from whatifd.cli_pipeline import build_delta_fn
from whatifd.config import ChangeConfig, ScorerConfig
from whatifd.pipeline import run_pipeline
from whatifd.runner_loader import load_runner

# Your runner satisfies the contract Protocol — see docs/runner-contract.md
loaded_runner = load_runner("python:my_agent.replay:run")

scorer = build_scorer(ScorerConfig(adapter="stub"))  # or wire a real Inspect AI scorer

trace_source = StubTraceSource(specs=[
    StubTraceSpec(trace_id="f-1", user_message="...", original_response="...", cohort="failure"),
    # ...
])

delta_fn = build_delta_fn(
    loaded_runner=loaded_runner,
    scorer=scorer,
    change=ChangeConfig(system_prompt="new prompt"),
    replay_timeout_seconds=60.0,
)

# Construct floor / policy / runtime / methodology / cache_summary,
# then call run_pipeline → ReportV01.
# Full worked example: docs/getting-started.md.
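
The runner string above follows a `python:<module>:<attr>` pattern. As a rough mental model only, such a string can be resolved with the standard library as sketched here. `resolve_python_spec` is a hypothetical name, and this is not a description of how whatifd's load_runner works internally.

```python
import importlib
from typing import Any


def resolve_python_spec(spec: str) -> Any:
    """Resolve 'python:<module>:<attr>' to the object it names.

    Hypothetical helper for illustration; not part of the whatifd API.
    """
    scheme, sep, rest = spec.partition(":")
    if scheme != "python" or not sep:
        raise ValueError(f"expected 'python:<module>:<attr>', got {spec!r}")
    module_name, sep, attr_path = rest.partition(":")
    if not sep or not module_name or not attr_path:
        raise ValueError(f"expected 'python:<module>:<attr>', got {spec!r}")
    # Import the module, then walk dotted attributes (e.g. 'Class.method').
    obj: Any = importlib.import_module(module_name)
    for part in attr_path.split("."):
        obj = getattr(obj, part)
    return obj


# Demonstrated on a standard-library function so the example runs anywhere.
dumps = resolve_python_spec("python:json:dumps")
print(dumps({"a": 1}))
```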

Quickstart (CLI — stub adapters work today)

# Write a config:
cat > whatifd.config.yaml <<EOF
source:
  adapter: stub
target:
  runner: python:examples.minimal_agent.replay:run
selection:
  failure_cohort: { limit: 5 }
  baseline_cohort: { limit: 5 }
change:
  system_prompt: my new prompt
scorer:
  adapter: stub
decision: {}
reporting: {}
timeouts: {}
EOF

# Run the fork:
uv run whatifd fork --config whatifd.config.yaml

# Exit codes:
#   0 = Ship verdict
#   1 = Don't Ship verdict
#   2 = Inconclusive verdict / setup failure / floor violation

Real Langfuse traces require LANGFUSE_HOST (or LANGFUSE_BASE_URL) + LANGFUSE_PUBLIC_KEY + LANGFUSE_SECRET_KEY in the environment. Real Inspect AI scoring is reachable from YAML via scorer.score_fn: python:<module>:<attr> (Phase B); the v0.1 programmatic-only path is preserved.
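
The exit codes listed in the CLI quickstart can be consumed from a Python CI script. The sketch below is one possible wrapper, not something shipped with whatifd: `Outcome`, `outcome_from_exit_code`, and `run_fork` are hypothetical names, and treating any unexpected exit code as inconclusive is an assumption chosen to fail safe.

```python
import subprocess
from enum import Enum


class Outcome(Enum):
    SHIP = "ship"
    DONT_SHIP = "dont_ship"
    INCONCLUSIVE_OR_ERROR = "inconclusive_or_error"


def outcome_from_exit_code(code: int) -> Outcome:
    """Map the documented exit codes (0, 1, 2) to a coarse outcome."""
    if code == 0:
        return Outcome.SHIP
    if code == 1:
        return Outcome.DONT_SHIP
    # Exit code 2 covers an Inconclusive verdict, a setup failure, or a floor
    # violation. Assumption: any other code is treated the same way.
    return Outcome.INCONCLUSIVE_OR_ERROR


def run_fork(config_path: str) -> Outcome:
    """Run the quickstart command and translate its exit status."""
    completed = subprocess.run(
        ["uv", "run", "whatifd", "fork", "--config", config_path],
        check=False,  # a non-zero exit is a verdict here, not an exception
    )
    return outcome_from_exit_code(completed.returncode)
```

Because exit code 2 bundles three different situations, a gate that needs to tell them apart would have to read the JSON report rather than rely on the exit status alone.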

How it composes

whatifd doesn't replace your tracer or your eval framework — it composes them into an experiment.

  • Tracers (reads from): Langfuse (v0.1); Arize Phoenix / OpenInference (v0.2); LangSmith / OpenTelemetry GenAI (v0.3+).
  • Scorers (wraps): Inspect AI (v0.1, real adapter shipped); pluggable via the scorer registry.
  • Your agent (calls back into): any Python callable matching the runner contract.
  • Downstream of whatifd's decisions: your existing CI (GitHub Actions, GitLab CI), SLO platforms (Nobl9, Sloth, Honeycomb), incident tooling.

What whatifd is not

  • Not a tracer (use Langfuse / Phoenix / LangSmith / OpenTelemetry GenAI).
  • Not an offline eval harness (use Inspect AI / Promptfoo; whatifd wraps them).
  • Not an SLO platform (use Nobl9 / Sloth / Honeycomb downstream of whatifd's decisions).
  • Not an agent runtime — the runner contract is the boundary.
  • Not a UI or dashboard.
  • Not a substitute for production monitoring; not a benchmark suite; not a load test; not a causal estimator beyond replay association; not a judge-quality validator (see docs/concepts.md).

Documentation

Design

The full design — problem framing, prior art, runner contract, report shape, eval target, milestones, risks — lives in DESIGN.md. The doctrine and cardinal rules are in .claude/skills/whatifd-design/SKILL.md.

Contributing

Pre-alpha. Issues and design discussion welcome; pull requests deferred until v0.1 ships.

License

Apache 2.0. See LICENSE.
