
Open experiment runner for LLM behavior changes. Fork production traces, replay with a proposed change, score the diff, emit a PR-ready verdict report.


whatifd

License: Apache 2.0. Requires Python 3.11+.

whatifd's product is the verdict's defensibility. Fork production traces, replay them with a proposed change, score the diff, and ship a Ship / Don't Ship / Inconclusive verdict that a reviewer can read and follow, so they either trust it or know exactly which assumption to challenge.


When you change a prompt, model, or tool in an LLM system, you don't actually know whether it improves behavior — you guess, with a handful of cherry-picked traces and inconsistent evaluation. Every step in the workflow has a tool: Langfuse for traces, Inspect AI for scoring, GitHub for PRs. The experiment doesn't.

whatifd is the experiment runner. Fork production traces (failed cases plus a representative baseline), replay them with your proposed change (original tool outputs cached so side effects don't re-fire), score with the judge of your choice, and produce a Markdown + JSON verdict report you can attach to the PR. You stop shipping changes that fix one failure while silently regressing ten others. You go from "this feels better" to "this improved 14/20, regressed 3 — here's exactly where, and here's the evidence I'd defend in review."
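To make the cached-tool idea concrete, here is a minimal sketch of how recorded tool outputs could be served during replay so that side effects do not re-fire. It is an illustration only, not whatifd's implementation: the `CachedToolReplay` class, its key format, and the `send_email` tool are all hypothetical.

```python
import json
from typing import Any


class CachedToolReplay:
    """Hypothetical helper: serve recorded tool outputs instead of calling live tools."""

    def __init__(self, recorded: dict[str, Any]) -> None:
        self._recorded = recorded

    @staticmethod
    def key(tool_name: str, arguments: dict[str, Any]) -> str:
        # Canonical JSON so that argument ordering does not change the key.
        return f"{tool_name}:{json.dumps(arguments, sort_keys=True)}"

    def call(self, tool_name: str, arguments: dict[str, Any]) -> Any:
        k = self.key(tool_name, arguments)
        if k not in self._recorded:
            # A miss is an error here: falling through to the live tool
            # would re-fire the side effect that the cache exists to prevent.
            raise KeyError(f"no recorded output for {k}")
        return self._recorded[k]


# Outputs captured from the original trace (made-up data for illustration).
recorded = {
    CachedToolReplay.key("send_email", {"to": "a@example.com"}): {"status": "sent"},
}
replay = CachedToolReplay(recorded)
result = replay.call("send_email", {"to": "a@example.com"})
```

Raising on a cache miss is a design choice made for this sketch; a real runner might instead mark the trace as not replayable.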

Stop shipping LLM changes on gut feel.



Status

Pre-alpha; v0.1 release candidate. The library API runs end-to-end against the synthetic stub adapter and against the real whatifd-langfuse + whatifd-inspect-ai adapters; the whatifd fork CLI dispatcher is wired through the full factory → runner-loader → delta_fn → run_pipeline → render path. PyPI publication is pending.

| Version | Target | What it does |
| --- | --- | --- |
| v0.1 | M10 (release candidate) | Langfuse ingest, prompt override, cached-tool replay, Inspect AI scorer, evidence-first Markdown + JSON reports, CI exit codes. |
| v0.2 | M11 | Stratified bootstrap CI, scorer cache wiring, second tracer adapter, model swap, GitHub Action wrapper. |
| v0.3 | M12 | Live-tool replay (opt-in, allowlist), worked CI sample repo. |
| v1.0 | year 2 | The pre-merge regression gate for LLM behavior. |

Install

# Once published to PyPI:
uv pip install whatifd whatifd-langfuse whatifd-inspect-ai

# From source (uv workspace):
git clone https://github.com/victoralfred/whatifd
cd whatifd
uv sync --all-extras --dev --group workspace

Quickstart (programmatic — works today)

The library API is the load-bearing surface. The snippet below is shape-only — it omits RunManifest, MethodologyDisclosure, and CacheSummary construction plus the actual run_pipeline(...) call to keep the README focused. The full runnable end-to-end example lives at docs/getting-started.md. Minimal shape:

from whatifd.adapters.stub import StubTraceSource, StubTraceSpec
from whatifd.adapters.factory import build_scorer
from whatifd.cli_pipeline import build_delta_fn
from whatifd.config import ChangeConfig, ScorerConfig
from whatifd.pipeline import run_pipeline
from whatifd.runner_loader import load_runner

# Your runner satisfies the contract Protocol — see docs/runner-contract.md
loaded_runner = load_runner("python:my_agent.replay:run")

scorer = build_scorer(ScorerConfig(adapter="stub"))  # or wire a real Inspect AI scorer

trace_source = StubTraceSource(specs=[
    StubTraceSpec(trace_id="f-1", user_message="...", original_response="...", cohort="failure"),
    # ...
])

delta_fn = build_delta_fn(
    loaded_runner=loaded_runner,
    scorer=scorer,
    change=ChangeConfig(system_prompt="new prompt"),
    replay_timeout_seconds=60.0,
)

# Construct floor / policy / runtime / methodology / cache_summary,
# then call run_pipeline → ReportV01.
# Full worked example: docs/getting-started.md.
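The runner reference used above, "python:my_agent.replay:run", reads as a scheme, a module path, and an attribute name. As a rough illustration of how a string of that shape can be turned into a callable, here is a sketch built on the standard library's importlib. The `resolve_runner_spec` function is hypothetical and is not whatifd's `load_runner`; the real loader may validate more or behave differently, so treat docs/runner-contract.md as authoritative.

```python
import importlib
from typing import Any, Callable


def resolve_runner_spec(spec: str) -> Callable[..., Any]:
    """Hypothetical resolver for strings shaped like 'python:<module.path>:<attr>'."""
    scheme, _, rest = spec.partition(":")
    module_path, _, attr = rest.rpartition(":")
    if scheme != "python" or not module_path or not attr:
        raise ValueError(f"expected 'python:<module>:<attr>', got {spec!r}")

    module = importlib.import_module(module_path)
    target = getattr(module, attr)
    if not callable(target):
        raise TypeError(f"{spec!r} does not resolve to a callable")
    return target


# Demonstrated against the standard library so the sketch runs anywhere.
dumps = resolve_runner_spec("python:json:dumps")
encoded = dumps([1, 2])
```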

Quickstart (CLI — stub adapters work today)

# Write a config:
cat > whatifd.config.yaml <<EOF
source:
  adapter: stub
target:
  runner: python:examples.minimal_agent.replay:run
selection:
  failure_cohort: { limit: 5 }
  baseline_cohort: { limit: 5 }
change:
  system_prompt: my new prompt
scorer:
  adapter: stub
decision: {}
reporting: {}
timeouts: {}
EOF

# Run the fork:
uv run whatifd fork --config whatifd.config.yaml

# Exit codes:
#   0 = Ship verdict
#   1 = Don't Ship verdict
#   2 = Inconclusive verdict / setup failure / floor violation
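For CI scripts written in Python, the exit codes listed above can be mapped to labels before deciding what to do with the result. The sketch below assumes only the three documented codes and the documented `uv run whatifd fork --config ...` invocation; the `ForkExit`, `describe_exit`, and `run_fork` names are hypothetical helpers, not part of whatifd.

```python
import subprocess
from enum import IntEnum


class ForkExit(IntEnum):
    """Exit codes documented for `whatifd fork` (names are illustrative)."""

    SHIP = 0
    DONT_SHIP = 1
    INCONCLUSIVE_OR_FAILURE = 2  # inconclusive verdict, setup failure, or floor violation


def describe_exit(code: int) -> str:
    """Translate an exit code into a short human-readable label."""
    labels = {
        ForkExit.SHIP: "Ship",
        ForkExit.DONT_SHIP: "Don't Ship",
        ForkExit.INCONCLUSIVE_OR_FAILURE: "Inconclusive, setup failure, or floor violation",
    }
    # Undocumented codes are surfaced as-is rather than guessed at.
    return labels.get(code, f"Unexpected exit code {code}")


def run_fork(config_path: str) -> int:
    """Run the documented CLI command and return its exit code."""
    completed = subprocess.run(
        ["uv", "run", "whatifd", "fork", "--config", config_path],
        check=False,  # non-zero codes carry verdicts, so do not raise on them
    )
    return completed.returncode
```

A caller might then write `print(describe_exit(run_fork("whatifd.config.yaml")))`. Because code 2 covers three different situations, a pipeline that needs to tell them apart would have to read the generated report rather than rely on the exit code alone.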

Real Langfuse traces require LANGFUSE_HOST (or LANGFUSE_BASE_URL), LANGFUSE_PUBLIC_KEY, and LANGFUSE_SECRET_KEY in the environment. In v0.1, real Inspect AI scoring is available only through the programmatic API; a config-loaded score_fn is a v0.2 cascade entry (see phases.md).

How it composes

whatifd doesn't replace your tracer or your eval framework — it composes them into an experiment.

  • Tracers (reads from): Langfuse (v0.1, real adapter shipped); Phoenix / LangSmith / OpenTelemetry GenAI (v0.2+).
  • Scorers (wraps): Inspect AI (v0.1, real adapter shipped); pluggable via the scorer registry.
  • Your agent (calls back into): any Python callable matching the runner contract.
  • Downstream of whatifd's decisions: your existing CI (GitHub Actions, GitLab CI), SLO platforms (Nobl9, Sloth, Honeycomb), incident tooling.

What whatifd is not

  • Not a tracer (use Langfuse / Phoenix / LangSmith / OpenTelemetry GenAI).
  • Not an offline eval harness (use Inspect AI / Promptfoo; whatifd wraps them).
  • Not an SLO platform (use Nobl9 / Sloth / Honeycomb downstream of whatifd's decisions).
  • Not an agent runtime — the runner contract is the boundary.
  • Not a UI or dashboard.
  • Not a substitute for production monitoring; not a benchmark suite; not a load test; not a causal estimator beyond replay association; not a judge-quality validator (see docs/concepts.md).

Documentation

Design

The full design — problem framing, prior art, runner contract, report shape, eval target, milestones, risks — lives in DESIGN.md. The doctrine and cardinal rules are in .claude/skills/whatifd-design/SKILL.md.

Contributing

Pre-alpha. Issues and design discussion welcome; pull requests deferred until v0.1 ships.

License

Apache 2.0. See LICENSE.

