Generate fictional-but-coherent causal operations worlds (executable sim + ground-truth answer-key) from a natural-language description, for benchmarking causal-discovery agents.

causal-worlds

PyPI · CI · License: MIT · Python 3.13+

Turn a plain-language description of an operation into a fictional causal world with a declared, ground-truth causal graph — then benchmark whether a causal-discovery method can recover it.

Because the structure is declared (not learned from data), it's an answer key: run any discovery method on the generated data and score how well it recovered the world. The worlds are fiction-first — plausible and internally consistent, not models of any real system — so there is no data to leak and nothing to memorize, which is exactly what makes a causal benchmark trustworthy.

from causal_worlds import worlds, grade_spec, InterventionalCiDiscoverer

spec = worlds.get("coffee")                          # a hidden confounder + a regime sign-flip
report = grade_spec(spec, InterventionalCiDiscoverer())
print(report)   # directed_shd=0  skeleton_shd=0  f1=1.0  confounded_reported=0
#                ^ swap in YOUR discoverer to benchmark it against a known truth

Status — v0.6, beta. The full loop works: natural language → an admitted causal world, persisted with provenance. A Claude author proposes the world; an independent Gemini judge (a different model family) plus statistical gates admit only worlds that are valid, recoverable, and not guessable from variable names. The deterministic engine (specify → sample → grade → score) and all grading run with no API key; only authoring needs keys. Worlds are currently tabular SCMs with do() interventions — a Gymnasium env, temporal lags, and counterfactual replay are on the roadmap. See the CHANGELOG.

Install

pip install causal-worlds              # or: uv add causal-worlds
pip install 'causal-worlds[discover]'  # + the baseline discovery stack (PC/GES/FCI/GIES)
pip install 'causal-worlds[llm]'       # + natural-language authoring (Claude + Gemini)

The base install (engine, grading, built-in worlds, CLI) needs only typer, pydantic, numpy.

60-second quickstart (no API key)

causal-worlds worlds                     # list built-in worlds: coffee, ecommerce
causal-worlds gate coffee                # run the validity gates -> admitted=True
causal-worlds grade coffee               # grade the reference discoverer -> directed_shd=0 ...
causal-worlds score benchmark/v0.5/world_01   # grade the reference on a shipped benchmark world

New to it? Walk through the getting-started guide or run the examples.

Benchmark your own discoverer

Implement one method — recover(substrate, *, seed) -> set[(src, dst)] — and grade it against any world's answer key:

from causal_worlds import grade_spec, worlds

class MyDiscoverer:
    def recover(self, substrate, *, seed):
        sample = substrate.sample(2000, seed=seed)        # observational data...
        flows = substrate.sample(2000, seed=seed, do={"price": 1.0})  # ...or interventional
        return {("price", "demand")}                       # your recovered edges

print(grade_spec(worlds.get("coffee"), MyDiscoverer()))

Or from the CLI on a persisted world: causal-worlds score <bundle> --discoverer your_pkg:YourClass.
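As a slightly fuller illustration of the recover() contract, here is a minimal sketch of a discoverer that uses do() data. It makes assumptions the text above does not confirm: that sample() returns a mapping from variable name to a NumPy array, and that the substrate exposes a variables attribute. To stay runnable without the package, it uses a hypothetical ToySubstrate stand-in rather than a real world. The rule itself (flag src -> dst when forcing src shifts the mean of dst) finds descendants, not only direct children, so treat it as a teaching baseline rather than the reference method.

```python
import numpy as np


class ToySubstrate:
    """Hypothetical stand-in for a world; NOT the causal_worlds substrate."""

    variables = ("price", "footfall", "demand")  # assumed attribute name

    def sample(self, n, *, seed, do=None):
        do = do or {}
        rng = np.random.default_rng(seed)
        weather = rng.normal(size=n)  # hidden confounder, never returned
        data = {
            "price": rng.normal(size=n),
            "footfall": 0.9 * weather + rng.normal(size=n),
        }
        # do() replaces a variable's mechanism with a constant.
        for name in ("price", "footfall"):
            if name in do:
                data[name] = np.full(n, float(do[name]))
        data["demand"] = -0.8 * data["price"] + 0.9 * weather + rng.normal(size=n)
        if "demand" in do:
            data["demand"] = np.full(n, float(do["demand"]))
        return data


class MeanShiftDiscoverer:
    """Flag src -> dst when do(src=1) vs do(src=0) shifts the mean of dst."""

    def __init__(self, n=4000, threshold=0.2):
        self.n = n
        self.threshold = threshold

    def recover(self, substrate, *, seed):
        edges = set()
        for src in substrate.variables:
            low = substrate.sample(self.n, seed=seed, do={src: 0.0})
            high = substrate.sample(self.n, seed=seed + 1, do={src: 1.0})
            for dst in substrate.variables:
                if dst == src:
                    continue
                shift = high[dst].mean() - low[dst].mean()
                if abs(shift) > self.threshold:
                    edges.add((src, dst))
        return edges


found = MeanShiftDiscoverer().recover(ToySubstrate(), seed=0)
print(found)
```

In the toy world, footfall and demand are correlated only through the hidden weather variable, so intervening on footfall leaves demand unchanged and the pair is not reported as an edge.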

Author a world from a description (needs [llm] + keys)

Set ANTHROPIC_API_KEY and GEMINI_API_KEY (see .env.example; the CLI auto-loads a local .env), then:

causal-worlds generate "a coffee chain with weekend swings and variable lead times" ./my-world

Observability: with the observability extra + Langfuse keys and CAUSAL_WORLDS_LANGFUSE_ENABLED=true, every run is traced (generate → author → gate) in Langfuse.

from causal_worlds import generate
from causal_worlds.author import build_claude_author
from causal_worlds.judge import build_gemini_judge

world = generate(
    "a hospital ED with triage staffing and bed pressure",
    author=build_claude_author(complexity="hard"),   # easy | standard | hard
    judge=build_gemini_judge(),                       # independent model family
)
print(world.report.difficulty, world.report.grade)

What the crossover shows (and what it doesn't)

Across the 35-world benchmark/v0.5 set (3 seeds each):

| method | gets interventions? | latent-aware? | mean skeleton-SHD ↓ | confounded pair kept as causal ↓ |
| --- | --- | --- | --- | --- |
| interventional-ci (reference) | yes | yes | 1.47 | 0 |
| GIES | yes | no | 2.37 | 17 |
| PC | no | no | 2.81 | 13 |
| FCI | no | partly | 2.68 | 8 |

The honest reading: the dividing line is latent-awareness, not interventions alone. GIES gets the same interventional budget as the reference and recovers the skeleton fine — but, assuming causal sufficiency, it still reports the hidden-confounded pair as a causal edge in most worlds; PC/FCI (observational) likewise. Only the latent-aware interventional rule keeps it at zero. So this is best read as an identifiability result (you cannot tell confounding from causation without both interventions and a latent-aware method), not "our method beats the toolbox."
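The identifiability point can be seen in a few lines of NumPy. This is a toy linear-Gaussian illustration with made-up coefficients, not one of the benchmark worlds and not the reference grader: a hidden variable u drives both x and y, and there is no edge between them. Observationally the pair looks strongly dependent; once x is set by intervention (here, randomized independently of u), the dependence disappears.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Observational regime: hidden u drives both x and y; no x -> y edge exists.
u = rng.normal(size=n)
x_obs = u + 0.5 * rng.normal(size=n)
y_obs = u + 0.5 * rng.normal(size=n)
obs_corr = np.corrcoef(x_obs, y_obs)[0, 1]

# Interventional regime: do(x) cuts the u -> x link, so x is set
# independently of u while y keeps its original mechanism.
u_do = rng.normal(size=n)
x_do = rng.normal(size=n)
y_do = u_do + 0.5 * rng.normal(size=n)
do_corr = np.corrcoef(x_do, y_do)[0, 1]

print(f"observational corr(x, y) = {obs_corr:.2f}")
print(f"corr(x, y) under do(x)   = {do_corr:.2f}")
```

A method that sees only the first regime, or that assumes there are no hidden variables, has no basis for rejecting an x-y edge; the second regime is what separates confounding from causation.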

Caveats we're not hiding (see evals/ and the issues): (1) the worlds are currently admitted by the reference grader itself (gate T3), so admission and the headline aren't yet fully decoupled. (2) We measured a real flaw — the worlds leak the causal order through marginal variance (varsortability 0.94; a trivial sort-by-variance baseline scores F1 0.74), the classic synthetic-DAG giveaway; variance standardization (next release) removes it. (3) Structural difficulty correlates with observational error (r≈0.8, partly mechanically) and with the interventional advantage (ΔF1, r≈0.36, n=35, no CIs) — a descriptive axis, not a validated predictor. Fixing (1) and (2), plus a name-only-at-chance baseline, is the next milestone (#9).
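Caveat (2) is easy to reproduce on a toy graph. The sketch below uses a hand-picked four-variable linear SCM with unit noise (the graph and weights are illustrative assumptions, not a benchmark world) and a simplified, edge-level proxy for varsortability: the share of true edges whose child has larger marginal variance than its parent, with ties counted as one half. The published varsortability statistic is path-based, and this is not the project's evaluation code.

```python
import numpy as np

EDGES = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]


def sample(n, seed):
    """Toy linear-Gaussian SCM with unit noise on an illustrative graph."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=n)
    b = 1.2 * a + rng.normal(size=n)
    c = 1.5 * a + rng.normal(size=n)
    d = 0.9 * b + 1.1 * c + rng.normal(size=n)
    return {"a": a, "b": b, "c": c, "d": d}


def edge_variance_score(data, edges):
    """Share of edges whose child variance exceeds the parent's (ties = 0.5)."""
    total = 0.0
    for src, dst in edges:
        v_src, v_dst = data[src].var(), data[dst].var()
        if np.isclose(v_src, v_dst):
            total += 0.5
        elif v_dst > v_src:
            total += 1.0
    return total / len(edges)


def standardize(data):
    """Rescale every variable to zero mean and unit variance."""
    return {k: (v - v.mean()) / v.std() for k, v in data.items()}


raw = sample(20_000, seed=0)
order_by_variance = sorted(raw, key=lambda name: raw[name].var())
raw_score = edge_variance_score(raw, EDGES)
std_score = edge_variance_score(standardize(raw), EDGES)

print(order_by_variance)     # variance order mirrors the causal order here
print(raw_score, std_score)  # the variance signal vanishes after standardizing
```

On raw data, sorting by variance reproduces a valid causal order without looking at any dependence structure, which is the giveaway described above. After standardization every variance is equal, so the same baseline is left with no signal.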

What you get per world

  1. An executable SCM — sample observational data and do()-intervene, deterministically by seed.
  2. A time-series dataset — the observed variables (the input to a discovery method).
  3. An answer key — the declared causal edges + the hidden-confounded pairs, derived from the spec.
  4. A manifest — full provenance (models, grader version, seed, difficulty) and an honesty label.
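Scoring a recovered edge set against an answer key comes down to set comparisons. The sketch below shows common conventions for the metrics named in the grade report (skeleton SHD, directed SHD, F1) plus a count of hidden-confounded pairs reported as edges. The function names and the choice to count a reversed edge once are assumptions for illustration; the package's grader may define these differently.

```python
def skeleton(edges):
    """Undirected version of an edge set."""
    return {frozenset(edge) for edge in edges}


def skeleton_shd(truth, found):
    """Number of adjacencies that are missing or extra, ignoring direction."""
    return len(skeleton(truth) ^ skeleton(found))


def directed_shd(truth, found):
    """Skeleton errors plus shared adjacencies with the wrong orientation.

    Assumes acyclic graphs without two-cycles; a reversed edge counts once.
    """
    shared = skeleton(truth) & skeleton(found)
    flipped = sum(
        1 for edge in truth if frozenset(edge) in shared and edge not in found
    )
    return skeleton_shd(truth, found) + flipped


def f1(truth, found):
    """Harmonic mean of precision and recall over directed edges."""
    true_positives = len(truth & found)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(found)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


def confounded_reported(confounded_pairs, found):
    """Count hidden-confounded pairs that the method reports as an edge."""
    return len(skeleton(confounded_pairs) & skeleton(found))


truth = {("price", "demand"), ("demand", "revenue")}
confounded = {("footfall", "demand")}
found = {("price", "demand"), ("revenue", "demand"), ("footfall", "demand")}

print(skeleton_shd(truth, found))           # one extra adjacency
print(directed_shd(truth, found))           # plus one reversed edge
print(round(f1(truth, found), 3))
print(confounded_reported(confounded, found))
```

A perfect recovery gives zero for both SHD values, an F1 of 1.0, and no confounded pairs reported.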

Concepts

  • Spec / IR — variables (with roles, incl. hidden), linear-Gaussian mechanisms, regime sign-flips.
  • Answer key — directed edges over observed variables + the hidden-confounded pairs; derived from the spec, never stored separately, so they can't disagree.
  • Gates — T1 validity · T2 sample-sanity · T3 non-triviality vs a random-graph null · T4 anti-cliché (the judge can't guess it from names). A world is admitted only if all pass.
  • Reference grader — an interventional-CI discoverer that uses do() data to tell confounding from causation, where PC/GES/GIES/FCI (which assume causal sufficiency) cannot.
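To make "regime sign-flip" concrete, here is one plausible reading, sketched with NumPy: the coefficient on an edge changes sign depending on a discrete regime variable. The variable names and coefficients are hypothetical, and the package's spec may define regimes differently. The point of the sketch is that a real causal edge can be nearly invisible in pooled data while being strong within each regime.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

weekend = rng.integers(0, 2, size=n)        # regime indicator: 0 or 1
promo = rng.normal(size=n)
sign = np.where(weekend == 1, -1.0, 1.0)    # the edge's sign flips by regime
sales = sign * 0.8 * promo + 0.5 * rng.normal(size=n)

pooled_corr = np.corrcoef(promo, sales)[0, 1]
weekday_corr = np.corrcoef(promo[weekend == 0], sales[weekend == 0])[0, 1]
weekend_corr = np.corrcoef(promo[weekend == 1], sales[weekend == 1])[0, 1]

print(f"pooled:  {pooled_corr:+.2f}")
print(f"weekday: {weekday_corr:+.2f}")
print(f"weekend: {weekend_corr:+.2f}")
```

A discovery method that tests dependence on pooled data can miss such an edge entirely, which is one reason sign-flips make a world harder than its graph size suggests.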

Depth: docs/scope.md · docs/hld.md · docs/lld.md · docs/architecture.md · docs/validation.md.

Roadmap

Shipped: NL authoring, independent judge + anti-cliché gate, artifact persistence, the baseline crossover, a structural-difficulty axis, a 35-world benchmark, temporal worlds (lagged edges + autoregression; see the built-in supply), time-series grading (PCMCI+, LPCMCI, VARLiNGAM, Granger, via grade_temporal_spec), and authoring of temporal worlds (an LLM-authored lagged world, admitted through a PCMCI+ temporal gate). Next: a temporal benchmark set (scale + crossover at n>1), a Gymnasium env with perturbations + counterfactual replay, scaling to 100+ worlds, and conversational elicitation. Tracked as issues.

Why this is the unoccupied intersection

Today's tools each own one corner — natural-language authoring × executable causal simulator × ground-truth answer-key for discovery is the gap:

| Tool | Corner it owns | What it lacks (for this job) |
| --- | --- | --- |
| G-Sim | LLM authors a sim + calibrates to data | needs real data; aimed at fidelity, not a declared answer-key |
| DEVS-Gen | NL → executable discrete-event ops sim | no declared causal-graph answer-key |
| SD-SCM | LLM fills mechanisms → counterfactuals | needs a user-supplied DAG; tabular, not an executable sim |
| TimeGraph | known-graph time-series for discovery | parametric/templated; no natural-language authoring |

Built on the shoulders of pgmpy, DoWhy, CausalPlayground, causal-learn, and Gymnasium.

Contributing

Issues and PRs welcome. The bar: make validate green (ruff select=ALL, mypy strict, pytest with a coverage floor) — see docs/engineering.md. Atomic, conventional commits.

License

MIT. An open-source project from Noumenal.
