Generate fictional-but-coherent causal operations worlds (executable sim + ground-truth answer-key) from a natural-language description, for benchmarking causal-discovery agents.
causal-worlds
Turn a plain-language description of an operation into a fictional causal world with a declared, ground-truth causal graph — then benchmark whether a causal-discovery method can recover it.
Because the structure is declared (not learned from data), it's an answer key: run any discovery method on the generated data and score how well it recovered the world. The worlds are fiction-first — plausible and internally consistent, not models of any real system — so there is no data to leak and nothing to memorize, which is exactly what makes a causal benchmark trustworthy.
```python
from causal_worlds import worlds, grade_spec, InterventionalCiDiscoverer

spec = worlds.get("coffee")  # a hidden confounder + a regime sign-flip
report = grade_spec(spec, InterventionalCiDiscoverer())
print(report)  # directed_shd=0 skeleton_shd=0 f1=1.0 confounded_reported=0
# ^ swap in YOUR discoverer to benchmark it against a known truth
```
Status: v0.6, beta. The full loop works: natural language → an admitted causal world, persisted with provenance. A Claude author proposes the world; an independent Gemini judge (a different model family) plus statistical gates admit only worlds that are valid, recoverable, and not guessable from variable names. The deterministic engine (specify → sample → grade → score) and all grading run with no API key; only authoring needs keys. Worlds are currently tabular SCMs with do() interventions; a Gymnasium env, temporal lags, and counterfactual replay are on the roadmap. See the CHANGELOG.
Install
```bash
pip install causal-worlds                # or: uv add causal-worlds
pip install 'causal-worlds[discover]'    # + the baseline discovery stack (PC/GES/FCI/GIES)
pip install 'causal-worlds[llm]'         # + natural-language authoring (Claude + Gemini)
```
The base install (engine, grading, built-in worlds, CLI) needs only typer, pydantic, numpy.
60-second quickstart (no API key)
```bash
causal-worlds worlds                         # list built-in worlds: coffee, ecommerce
causal-worlds gate coffee                    # run the validity gates -> admitted=True
causal-worlds grade coffee                   # grade the reference discoverer -> directed_shd=0 ...
causal-worlds score benchmark/v0.5/world_01  # grade the reference on a shipped benchmark world
```
New to it? Walk through the getting-started guide or run the examples.
Benchmark your own discoverer
Implement one method — recover(substrate, *, seed) -> set[(src, dst)] — and grade it against any
world's answer key:
```python
from causal_worlds import grade_spec, worlds


class MyDiscoverer:
    def recover(self, substrate, *, seed):
        sample = substrate.sample(2000, seed=seed)  # observational data...
        flows = substrate.sample(2000, seed=seed, do={"price": 1.0})  # ...or interventional
        return {("price", "demand")}  # your recovered edges


print(grade_spec(worlds.get("coffee"), MyDiscoverer()))
```
Or from the CLI on a persisted world: causal-worlds score <bundle> --discoverer your_pkg:YourClass.
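
To make the grading vocabulary concrete, here is a minimal sketch of how a recovered edge set could be compared with a declared one. It is not the library's grader: `score_edges`, `Score`, and the `weather` variable are hypothetical, and the SHD convention (a reversed edge counts twice in the directed score) is an assumption that the real implementation may not share.

```python
from typing import NamedTuple

Edge = tuple[str, str]


class Score(NamedTuple):
    directed_shd: int
    skeleton_shd: int
    f1: float


def score_edges(truth: set[Edge], recovered: set[Edge]) -> Score:
    """Compare a recovered directed edge set against a declared one."""
    # Directed SHD (assumed convention): edges present in exactly one set,
    # so a reversed edge contributes 2.
    directed_shd = len(truth ^ recovered)

    # Skeleton SHD ignores orientation: compare unordered pairs.
    skel_truth = {frozenset(e) for e in truth}
    skel_rec = {frozenset(e) for e in recovered}
    skeleton_shd = len(skel_truth ^ skel_rec)

    # F1 over directed edges.
    tp = len(truth & recovered)
    precision = tp / len(recovered) if recovered else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return Score(directed_shd, skeleton_shd, f1)


truth = {("price", "demand"), ("weather", "demand")}
recovered = {("price", "demand"), ("demand", "weather")}  # one edge reversed
print(score_edges(truth, recovered))
# Score(directed_shd=2, skeleton_shd=0, f1=0.5)
```

The reversed edge costs nothing on the skeleton but halves the directed F1, which is why the two SHD figures are worth reading side by side.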
Author a world from a description (needs [llm] + keys)
Set ANTHROPIC_API_KEY and GEMINI_API_KEY (see .env.example; the CLI auto-loads a
local .env), then:
```bash
causal-worlds generate "a coffee chain with weekend swings and variable lead times" ./my-world
```
Or describe a world conversationally. A one-shot prompt is underspecified, so elicit runs a
short dialogue first — it asks the minimal clarifying questions (entities & roles, what drives what,
regimes, hidden causes, the objective), shows the accumulating brief, and authors only once the brief
is complete (or you type go):
```bash
causal-worlds elicit ./my-world  # interactive: answer a few questions, then it generates
```
Observability: with the observability extra + Langfuse keys and
CAUSAL_WORLDS_LANGFUSE_ENABLED=true, every run is traced (generate → author → gate) in Langfuse.
```python
from causal_worlds import generate
from causal_worlds.author import build_claude_author
from causal_worlds.judge import build_gemini_judge

world = generate(
    "a hospital ED with triage staffing and bed pressure",
    author=build_claude_author(complexity="hard"),  # easy | standard | hard | adversarial
    judge=build_gemini_judge(),  # independent model family
)
print(world.report.difficulty, world.report.grade)
```
What the crossover shows (and what it doesn't)
The table below summarizes results across the 35-world benchmark/v0.5 set (3 seeds each). The comparison is
information-fair: the +do methods get the same interventional budget (pooled observational +
per-variable do() environments) as the latent-aware reference, so we compare methods, not data
access (full table + bootstrap CIs):
| method | data | latent-aware? | mean skeleton-SHD ↓ | confounded pair kept as causal ↓ |
|---|---|---|---|---|
| interventional-ci (reference) | interventional | yes | 1.44 | 0 |
| GIES | interventional | no | 4.62 | 17 |
| PC | observational | no | 2.72 | 14.3 |
| PC + interventions | interventional | no | 3.31 | 15.0 |
| FCI | observational | partly | 2.68 | 9.7 |
| FCI + interventions | interventional | partly | 3.29 | 6.7 |
| DAGMA | observational | no | 5.73 | 16.0 |
| DirectLiNGAM | observational | no | 5.64 | 14.7 |
(DAGMA and DirectLiNGAM run at default hyperparameters, and LiNGAM's non-Gaussian assumption is violated by these linear-Gaussian worlds, so their skeleton accuracy is not their best — but the relevant, robust verdict is confounded-kept, and like every causal-sufficiency method they keep it.)
The honest reading: the dividing line is latent-awareness, not interventions. The decisive row is
PC + interventions — given the same interventional budget as the reference, it still keeps the
hidden-confounded pair as a causal edge in ~15 worlds (no better than observational PC's 14.3);
GIES likewise (17). Only the latent-aware interventional rule reaches 0. The interventional
advantage is robust: ΔF1 = F1(reference) − F1(method) is +0.29, 95% CI [0.22, 0.35] for
pc+do (every method's CI excludes 0). So this is an identifiability result (you cannot tell
confounding from causation without both interventions and a latent-aware method), not "our method
beats the toolbox."
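
The identifiability point can be illustrated with a toy simulation, independent of the library. In this sketch a hidden `U` drives both `X` and `Y` and there is no `X → Y` edge. The coefficients, the sample size, and the choice to randomize `X` under the intervention are all assumptions made for illustration.

```python
import random


def corr(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def sample(n, seed, do_x=False):
    """Hidden U drives both X and Y; there is NO X -> Y edge.

    With do_x=True, X is set by the experimenter (randomized),
    which cuts the U -> X mechanism.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        u = rng.gauss(0, 1)  # hidden confounder, never observed
        x = rng.gauss(0, 1) if do_x else 0.8 * u + rng.gauss(0, 0.6)
        y = 0.8 * u + rng.gauss(0, 0.6)
        xs.append(x)
        ys.append(y)
    return xs, ys


obs_r = corr(*sample(20_000, seed=0))
int_r = corr(*sample(20_000, seed=0, do_x=True))
print(f"observational corr(X, Y) = {obs_r:.2f}")
print(f"corr(X, Y) under do(X)   = {int_r:.2f}")
```

Observationally the pair is strongly correlated, while under do(X) the dependence disappears. A method that assumes no hidden variables has to explain the observational dependence with an edge; a latent-aware rule that also sees the interventional data can attribute it to confounding instead.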
Caveats we're not hiding (see evals/ and the issues): (1) Formerly, the worlds were admitted by
the reference grader itself. Fixed in v0.15: admission (gate T3) is now grader-independent.
A world is admitted iff its declared SCM is faithful by construction (every edge induces a
detectable partial correlation; regimes genuinely modulate), computed in closed form from the spec
with no discovery method run. The reference grader's score is reported but never gates admission. (2)
Simulated-DAG leakage — synthetic SCMs can leak the causal order through marginal variance
(varsortability) and through scale-invariant predictability
(R²-sortability). v0.14 generates worlds with internal standardization (iSCM), dropping
varsortability to 0.54 and R²-sortability 0.73 → 0.60; both trivial sorting baselines fall to F1
≈ 0.33–0.37, well under the real methods. The residual R²-sortability (0.60 > 0.5) is disclosed, not
yet fully closed. (3) Difficulty vs skeleton-SHD error is descriptive, not a validated predictor:
with bootstrap CIs (n=35), the observational methods show r≈0.40 (PC [0.07, 0.68], FCI [0.08, 0.68] —
just excluding 0) while the latent-aware reference is flat (r≈0.24, [−0.06, 0.51], includes 0).
(4) The shipped benchmark/v0.5 is still name-guessable — being fixed. A name-only LLM baseline
scores F1 0.71 vs a 0.20 chance floor (names and roles leak). v0.19
hardens the machinery for the next generation: T4 now admits only worlds with difficulty ≥ 0.5
(named-prior F1 < 0.5, down from the old 0.9 bar) plus a blind control (the name+role-anonymized
prior must sit near chance), and an adversarial author tier writes worlds where the obvious
name-based guess is wrong (phantom edges, reversed edges, regime sign-flips — keeping every true
edge detectable). The v0.5 set predates this; regenerating it under the strict gate is the next
scaled run.
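
As a rough illustration of the varsortability leak in caveat (2), the sketch below samples a hypothetical three-variable chain and shows that sorting by marginal variance recovers the causal order from raw data. The chain, the weight 1.5, and the sample size are assumptions. The standardization shown is plain post-hoc z-scoring, which only removes the variance cue; it is not the internal standardization (iSCM) the benchmark uses and it does nothing about R²-sortability.

```python
import random
import statistics


def sample_chain(n, seed, standardize=False):
    """Sample a chain a -> b -> c with edge weight 1.5 and unit noise."""
    rng = random.Random(seed)
    cols = {"a": [], "b": [], "c": []}
    for _ in range(n):
        a = rng.gauss(0, 1)
        b = 1.5 * a + rng.gauss(0, 1)
        c = 1.5 * b + rng.gauss(0, 1)
        for name, value in zip("abc", (a, b, c)):
            cols[name].append(value)
    if standardize:
        # Post-hoc z-scoring of each column (NOT iSCM).
        for name, values in cols.items():
            mu, sd = statistics.fmean(values), statistics.pstdev(values)
            cols[name] = [(v - mu) / sd for v in values]
    return cols


def variances(cols):
    """Population variance of every column."""
    return {name: statistics.pvariance(values) for name, values in cols.items()}


raw = variances(sample_chain(5_000, seed=0))
order = sorted(raw, key=raw.get)  # guess the causal order by sorting on variance
print(order)

std = variances(sample_chain(5_000, seed=0, standardize=True))
print({name: round(v, 3) for name, v in std.items()})
```

On raw data the variances grow along the chain (roughly 1, 3.25, and 8.3 in expectation), so the trivial sort returns the true order without any causal reasoning. After z-scoring, every variance is 1 and that cue is gone.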
What you get per world
- An executable SCM: sample observational data and do()-intervene, deterministically by seed.
- A time-series dataset: the observed variables (the input to a discovery method).
- An answer key: the declared causal edges + the hidden-confounded pairs, derived from the spec.
- A manifest: full provenance (models, grader version, seed, difficulty) and an honesty label.
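
What "deterministically by seed" means for an executable SCM can be sketched with a toy world. `TinyWorld` is hypothetical and is not the library's substrate; it only mirrors the `sample(n, seed=..., do=...)` call shape used earlier in this README, and its mechanisms and coefficients are invented for illustration.

```python
import random


class TinyWorld:
    """Hypothetical two-variable linear-Gaussian world: price -> demand."""

    def sample(self, n, *, seed, do=None):
        do = do or {}
        rng = random.Random(seed)  # all randomness flows from the seed
        rows = []
        for _ in range(n):
            # Draw both noise terms every time so the random stream stays
            # aligned between observational and interventional runs.
            e_price, e_demand = rng.gauss(0, 1), rng.gauss(0, 1)
            # A do() entry overrides the variable's own mechanism.
            price = do.get("price", 2.0 + e_price)
            demand = do.get("demand", 10.0 - 1.5 * price + e_demand)
            rows.append({"price": price, "demand": demand})
        return rows


world = TinyWorld()
same = world.sample(3, seed=7) == world.sample(3, seed=7)
forced = world.sample(3, seed=7, do={"price": 1.0})
print(same, [row["price"] for row in forced])
# True [1.0, 1.0, 1.0]
```

Because the noise draws are consumed identically in both modes, the observational and interventional runs for a given seed differ only through the intervened mechanism, which makes results reproducible and comparable.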
Concepts
- Spec / IR: variables (with roles, incl. hidden), linear-Gaussian mechanisms, regime sign-flips.
- Answer key: directed edges over observed variables + the hidden-confounded pairs; derived from the spec, never stored separately, so they can't disagree.
- Gates: T1 validity · T2 sample-sanity · T3 faithfulness (grader-independent: the declared SCM is faithful & non-trivial by construction) · T4 anti-cliché (the named prior recovers less than half, i.e. difficulty ≥ 0.5, and a name+role-blind prior stays near chance). A world is admitted only if all pass.
- Reference grader: an interventional-CI discoverer that uses do() data to tell confounding from causation, where PC/GES/GIES (which assume causal sufficiency) cannot, and FCI only partly can.
Depth: docs/scope.md · docs/hld.md · docs/lld.md · docs/architecture.md · docs/validation.md.
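
The idea behind gate T3, checking faithfulness in closed form from the spec rather than by running a discoverer, can be sketched for a linear-Gaussian world. This is a sketch under stated assumptions, not the gate's implementation: the three-variable spec, `implied_cov`, and `partial_corr` are hypothetical, and only a first-order partial correlation is handled.

```python
from math import sqrt

# Hypothetical spec: topological order, edge weights, and noise variances.
ORDER = ["z", "x", "y"]
WEIGHTS = {("z", "x"): 0.8, ("z", "y"): 0.5, ("x", "y"): 0.7}
NOISE_VAR = {"z": 1.0, "x": 1.0, "y": 1.0}


def implied_cov(order, weights, noise_var):
    """Covariance implied by a linear SCM, with no sampling.

    total[v][k] is the total effect of noise term k on variable v.
    """
    total = {v: {k: 0.0 for k in order} for v in order}
    for v in order:  # parents are finished before their children
        total[v][v] = 1.0
        for (src, dst), w in weights.items():
            if dst == v:
                for k in order:
                    total[v][k] += w * total[src][k]
    return {
        (a, b): sum(total[a][k] * total[b][k] * noise_var[k] for k in order)
        for a in order
        for b in order
    }


def partial_corr(cov, a, b, given):
    """First-order partial correlation of a and b given one variable."""
    def r(p, q):
        return cov[p, q] / sqrt(cov[p, p] * cov[q, q])

    num = r(a, b) - r(a, given) * r(b, given)
    return num / sqrt((1 - r(a, given) ** 2) * (1 - r(b, given) ** 2))


cov = implied_cov(ORDER, WEIGHTS, NOISE_VAR)
rho = partial_corr(cov, "x", "y", given="z")
print(f"partial corr(x, y | z) = {rho:.3f}")
```

Here the edge x → y leaves a partial correlation of about 0.57 once z is controlled for, so it would clear any reasonable detectability threshold. An edge whose weight happened to cancel against another path would show up as a value near zero, which is the kind of unfaithful world such a gate is meant to reject.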
Roadmap
Shipped: NL authoring, independent judge + anti-cliché gate, artifact persistence, the baseline
crossover, a structural-difficulty axis, a 35-world benchmark, temporal worlds (lagged edges +
autoregression; see the built-in supply), time-series grading (PCMCI+, LPCMCI, VARLiNGAM,
Granger; see grade_temporal_spec), authoring temporal worlds (an LLM-authored lagged world,
admitted through a PCMCI+ temporal gate), and conversational elicitation (causal-worlds elicit,
a dialogue that builds a WorldBrief before authoring).

Next: tightening the anti-cliché gate (the name-only baseline shows worlds are still guessable; see #12),
a control track (objective + optimal-policy answer-key + regret-under-perturbation; see scope §1a),
a temporal benchmark set (n>1), a Gymnasium env, and scaling to 100+ worlds. Tracked as issues.
Why this is the unoccupied intersection
Today's tools each own one corner — natural-language authoring × executable causal simulator × ground-truth answer-key for discovery is the gap:
| Tool | Corner it owns | What it lacks (for this job) |
|---|---|---|
| G-Sim | LLM authors a sim + calibrates to data | needs real data; aimed at fidelity, not a declared answer-key |
| DEVS-Gen | NL → executable discrete-event ops sim | no declared causal-graph answer-key |
| SD-SCM | LLM fills mechanisms → counterfactuals | needs a user-supplied DAG; tabular, not an executable sim |
| TimeGraph | known-graph time-series for discovery | parametric/templated; no natural-language authoring |
Built on the shoulders of pgmpy, DoWhy, CausalPlayground, causal-learn, and Gymnasium.
Contributing
Issues and PRs welcome. The bar: make validate green (ruff select=ALL, mypy strict, pytest with
a coverage floor) — see docs/engineering.md. Atomic, conventional commits.
License