Generate fictional-but-coherent causal operations worlds (executable sim + ground-truth answer-key) from a natural-language description, for benchmarking causal-discovery agents.

These details have not been verified by PyPI

Project links

Project description

causal-worlds

Turn a plain-language description of an operation into a fictional causal world with a declared, ground-truth causal graph — then benchmark whether a causal-discovery method can recover it.

Because the structure is declared (not learned from data), it's an answer key: run any discovery method on the generated data and score how well it recovered the world. The worlds are fiction-first — plausible and internally consistent, not models of any real system — so there is no data to leak and nothing to memorize, which is exactly what makes a causal benchmark trustworthy.

📖 The story: Correlation lies — we built a causal world that can prove it, then tried to break it ourselves (Medium).

Correlation lies — and because the world is declared, you can prove it

In the built-in coffee world, overtime and sales rise together (correlation 0.64) — a naive analyst "discovers" that overtime drives sales. But force overtime with do() and sales doesn't move: the true causal effect is 0.00. The whole association was a hidden confounder the benchmark planted — and it knows, so it can score who gets fooled:

import numpy as np
from causal_worlds import build_substrate, worlds

sub = build_substrate(worlds.get("coffee"), standardize=False)
ov, sa = sub.variables.index("overtime"), sub.variables.index("sales")

seen = sub.sample(40_000, seed=0).data
corr = np.corrcoef(seen[:, ov], seen[:, sa])[0, 1]                     # ≈ 0.64  → looks causal
hi = sub.sample(40_000, seed=1, do={"overtime":  1.0}).data[:, sa].mean()
lo = sub.sample(40_000, seed=1, do={"overtime": -1.0}).data[:, sa].mean()
print(round(corr, 2), round((hi - lo) / 2, 2))   # 0.64  0.00  → strong correlation, ZERO causal effect

That's the whole game: a method that only sees the data keeps the mirage; one that can act (and is latent-aware) doesn't. Score any discovery method against the known truth:

from causal_worlds import worlds, grade_spec, InterventionalCiDiscoverer
print(grade_spec(worlds.get("coffee"), InterventionalCiDiscoverer()))
# directed_shd=0  f1=1.0  confounded_reported=0     ← swap in YOUR discoverer

See the world it builds

An SCM is a DAG, so look at it. to_mermaid(spec) and to_dot(spec) are zero-dependency string renderers (or run causal-worlds viz coffee). The hidden confounder is drawn dashed — it's the latent structure a discovery method never gets to see, and the reason the world is hard:

The declared SCM for the built-in "coffee" world: a hidden confounder (dashed) drives several observed variables

from causal_worlds import worlds, to_mermaid
print(to_mermaid(worlds.get("coffee")))   # paste into a ```mermaid block — GitHub renders it live

Status — on PyPI, beta. The full loop works: natural language → an admitted causal world, persisted with provenance. A Claude author proposes the world; an independent Gemini judge (a different model family) + statistical gates admit only worlds that are valid, recoverable, faithful, and not guessable from variable names. The engine (specify → sample → grade → score), all grading, and the renderers run with no API key; only authoring needs keys. Shipped: temporal (lagged) worlds, a control track, and a Gymnasium env. See the CHANGELOG.

Causality in three rungs (a 2-minute tour)

That picture isn't decoration — it's a causal model, and causality has exactly three levels (Judea Pearl's Ladder of Causation). causal-worlds lets you stand on each rung with real code, on a world whose true answer you already know.

Rung 1 — Association · what goes with what? In the data, footfall and sales rise together. But is that footfall → sales, or is the hidden local_buzz (a street festival, good weather) quietly driving both? From the data alone you cannot tell — two things moving together always have a hidden third suspect. That gap is the whole difficulty of causality.

substrate = build_substrate(worlds.get("coffee"))
data = substrate.sample(2000, seed=0)                  # just watch the world

Rung 2 — Intervention · what if I act? Don't watch a variable — set it. do() is surgery on the graph: it cuts every arrow pointing into the variable and keeps every arrow pointing out, so what then moves sales is the real effect, mirage removed. The data slopes sales on footfall at 1.56 — but do(footfall) reveals the true effect is only 0.90 (the data overstated it by 0.66, the confounder again); and for the pure-mirage pair above, do(overtime) gives exactly 0.00.

hi = sub.sample(40_000, seed=1, do={"footfall":  1.0}).data[:, sa].mean()
lo = sub.sample(40_000, seed=1, do={"footfall": -1.0}).data[:, sa].mean()
print(round((hi - lo) / 2, 2))   # 0.9  — the true causal effect, vs the 1.56 the raw data suggested

You can see the surgery — to_dot(spec, do={"footfall": 1.0}) (or causal-worlds viz coffee --format dot) draws the mutilated graph: every arrow into footfall is gone, its arrows out remain. Our do() is verified genuine surgery, not statistical conditioning (docs/scope.md):

Rung 2 — do(footfall): the arrows into footfall are cut, its effects still flow

Rung 3 — Counterfactual · what would have happened? "We sold what we sold last Saturday — would we have sold more had we set price differently, that same Saturday?" That needs the full model, with that specific day's hidden local_buzz held fixed. Because causal-worlds declares the entire SCM, it's exact — Pearl's abduction → action → prediction:

from causal_worlds import counterfactual
cf = counterfactual(worlds.get("coffee"), do={"price": 3.0}, seed=0)
print(cf.factual["sales"], "->", cf.counterfactual["sales"])   # what happened -> what would have
print(cf.effect["sales"])                                       # the change attributable to the act

(Same noise, same hidden local_buzz, only price changed — so the difference is caused by the act. On a weekend unit the price → demand sign is flipped, so the counterfactual can surprise you.)

Why this makes a benchmark. A method that only sees (Rung 1) keeps the local_buzz mirage as a real overtime → sales edge; only a latent-aware method that does (Rung 2) escapes it — which is exactly the crossover result. New to causality? This is the tour — read the diagrams, run the snippets, and you've climbed all three rungs.

Install

pip install causal-worlds              # or: uv add causal-worlds
pip install 'causal-worlds[discover]'  # + the baseline discovery stack (PC/GES/FCI/GIES)
pip install 'causal-worlds[llm]'       # + natural-language authoring (Claude + Gemini)

The base install (engine, grading, renderers, built-in worlds, CLI) needs only typer, pydantic, numpy.

60-second quickstart (no API key)

causal-worlds worlds                     # list built-in worlds: braking, coffee, ecommerce
causal-worlds viz coffee                 # print the SCM as Mermaid (--format dot for Graphviz)
causal-worlds gate coffee                # run the validity gates -> admitted=True
causal-worlds grade coffee               # grade the reference discoverer -> directed_shd=0 ...
causal-worlds score benchmark/v0.6/world_01   # grade the reference on a shipped benchmark world

New to it? Walk through the getting-started guide or run the examples (each prints its expected output, so you can read without running).

Benchmark your own discoverer

Implement one method — recover(substrate, *, seed) -> set[(src, dst)] — and grade it against any world's answer key:

from causal_worlds import grade_spec, worlds

class MyDiscoverer:
    def recover(self, substrate, *, seed):
        sample = substrate.sample(2000, seed=seed)        # observational data...
        flows = substrate.sample(2000, seed=seed, do={"price": 1.0})  # ...or interventional
        return {("price", "demand")}                       # your recovered edges

print(grade_spec(worlds.get("coffee"), MyDiscoverer()))

Or from the CLI on a persisted world: causal-worlds score <bundle> --discoverer your_pkg:YourClass.

Benchmark a controller (Stage 2 — control)

The same worlds are a control benchmark: pick lever values to maximise an objective. Because the mechanisms are declared, the best the levers can do is computable — a by-construction optimal policy (scope §1a) — so a policy is graded by regret against it, no external data needed.

from causal_worlds import default_objective, grade_control, worlds

spec = worlds.get("coffee")
objective = default_objective(spec)                  # controllables raise the outcome KPI, quadratic cost
report = grade_control(spec, objective, {"price": 3.0}, seed=7)   # your policy here
print(report.regret)                                 # regret vs the declared optimum (0 = optimal play)

A pluggable Controller is graded by grade_controller. Or drive it as a Gymnasium env (pip install 'causal-worlds[gym]') where the regime shifts between steps (a perturbation):

from causal_worlds.gym import ControlEnv
env = ControlEnv(worlds.get("coffee"))               # action = lever values; reward = objective
obs, info = env.reset(seed=0)                          # info["optimal_reward"] / info["regret"] per step

Author a world from a description (needs `[llm]` + keys)

Set ANTHROPIC_API_KEY and GEMINI_API_KEY (see .env.example; the CLI auto-loads a local .env), then:

causal-worlds generate "a coffee chain with weekend swings and variable lead times" ./my-world
causal-worlds viz ./my-world             # ...then look at what it built

Or describe one conversationally — causal-worlds elicit ./my-world asks the minimal clarifying questions (entities & roles, what drives what, regimes, hidden causes, the objective), shows the accumulating brief, and authors once it's complete. In Python:

from causal_worlds import generate
from causal_worlds.author import build_claude_author
from causal_worlds.judge import build_gemini_judge

world = generate(
    "a hospital ED with triage staffing and bed pressure",
    author=build_claude_author(complexity="hard"),   # easy | standard | hard | adversarial
    judge=build_gemini_judge(),                       # independent model family
)
print(world.report.difficulty, world.report.grade)

By default authoring is benchmark mode: a world is rejected if it is guessable from its variable names/roles (the published benchmark must not be name-guessable, so intuitive operations often fail on purpose). To just describe a world and get it, use playground mode — --playground on the CLI, or anti_cliche=False in Python. It keeps the faithfulness check and still reports difficulty (now advisory), but never rejects on guessability:

causal-worlds generate "a regional power grid with rooftop solar and time-of-use pricing" ./grid --playground

(With the observability extra + Langfuse keys, every generate → author → gate run is traced.)

What the benchmark shows

Across the 26-world hardened benchmark/v0.6 set, with an information-fair comparison (the +do methods get the same interventional budget as the latent-aware reference): latent-awareness — not interventions — is the dividing line. PC + interventions still scores the hidden-confounded pair as causal just as often as observational PC — confounded-kept 30 vs 29 (summed over the 26 worlds, seed-averaged); only the latent-aware rule reaches 0 (ΔF1 +0.37, 95% CI [0.33, 0.42]). It's an identifiability result, not "we beat the toolbox." Full table, bootstrap CIs, and the honest caveats (admission circularity, simulated-DAG leakage, difficulty-as-descriptor, anti-cliché role leakage) are in docs/findings.md.

What you get per world

An executable SCM — sample observational data and do()-intervene, deterministically by seed.
A time-series dataset — the observed variables (the input to a discovery method).
An answer key — the declared causal edges + the hidden-confounded pairs, derived from the spec.
A manifest — full provenance (models, grader version, seed, difficulty) and an honesty label.

Concepts

Spec / IR — variables (with roles, incl. hidden), additive mechanisms (linear-Gaussian by default; per-term Transforms give additive-nonlinear forms like square), regime sign-flips.
Answer key — directed edges over observed variables + the hidden-confounded pairs; derived from the spec, never stored separately, so they can't disagree.
Gates — T1 validity · T2 sample-sanity · T3 faithfulness (grader-independent) · T4 anti-cliché (named prior recovers < half and a name+role-blind prior stays near chance). All must pass to admit.
Reference grader — an interventional-CI discoverer that uses do() data to tell confounding from causation, where PC/GES/GIES/FCI (which assume causal sufficiency) cannot.

Depth: docs/foundations.md (are these worlds truly causal? — the three rungs, grounded in Pearl/Wright/Haavelmo) · docs/scope.md · docs/hld.md · docs/lld.md · docs/architecture.md · docs/findings.md · docs/validation.md.

Roadmap

Shipped: NL authoring · independent judge + anti-cliché gate · artifact persistence · the baseline crossover · a structural-difficulty axis · a 26-world hardened benchmark (v0.6) · temporal worlds (lagged edges + autoregression) and time-series grading (PCMCI+, LPCMCI, VARLiNGAM, Granger) · conversational elicitation · the control track (by-construction optimal policy, regret, and regret-under-perturbation) + a Gymnasium env · graph renderers (Mermaid / DOT). additive-nonlinear mechanisms (square/tanh/… transforms — the built-in braking world's stopping distance grows with speed², which linear discovery can't see; #10). Next: nonlinear interactions (products of parents) + a post-nonlinear form, and a temporal benchmark set (n>1). Tracked as issues.

Why this is the unoccupied intersection

Today's tools each own one corner — natural-language authoring × executable causal simulator × ground-truth answer-key for discovery is the gap:

Tool	Corner it owns	What it lacks (for this job)
G-Sim	LLM authors a sim + calibrates to data	needs real data; aimed at fidelity, not a declared answer-key
DEVS-Gen	NL → executable discrete-event ops sim	no declared causal-graph answer-key
SD-SCM	LLM fills mechanisms → counterfactuals	needs a user-supplied DAG; tabular, not an executable sim
TimeGraph	known-graph time-series for discovery	parametric/templated; no natural-language authoring

Built on the shoulders of pgmpy, DoWhy, CausalPlayground, causal-learn, and Gymnasium.

Contributing

Issues and PRs welcome. The bar: make validate green (ruff select=ALL, mypy strict, pytest with a coverage floor) — see docs/engineering.md. Atomic, conventional commits.

License

MIT. An open-source project from Noumenal.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.48.0

Jul 4, 2026

0.47.0

Jul 3, 2026

0.41.0

Jul 3, 2026

0.40.0

Jul 3, 2026

0.39.0

Jul 3, 2026

0.38.0

Jul 3, 2026

0.37.1

Jul 2, 2026

0.36.0

Jun 29, 2026

This version

0.35.0

Jun 27, 2026

0.34.0

Jun 25, 2026

0.33.0

Jun 25, 2026

0.32.0

Jun 25, 2026

0.31.0

Jun 25, 2026

0.30.0

Jun 25, 2026

0.29.0

Jun 25, 2026

0.28.0

Jun 24, 2026

0.27.0

Jun 23, 2026

0.26.0

Jun 23, 2026

0.25.0

Jun 23, 2026

0.24.0

Jun 23, 2026

0.23.0

Jun 23, 2026

0.22.0

Jun 23, 2026

0.21.0

Jun 23, 2026

0.20.0

Jun 23, 2026

0.19.0

Jun 23, 2026

0.18.0

Jun 23, 2026

0.17.0

Jun 23, 2026

0.16.0

Jun 23, 2026

0.15.0

Jun 23, 2026

0.14.0

Jun 23, 2026

0.13.0

Jun 23, 2026

0.12.0

Jun 23, 2026

0.11.1

Jun 23, 2026

0.11.0

Jun 23, 2026

0.10.0

Jun 23, 2026

0.9.0

Jun 23, 2026

0.8.1

Jun 23, 2026

0.8.0

Jun 23, 2026

0.7.0

Jun 23, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

causal_worlds-0.35.0.tar.gz (9.8 MB view details)

Uploaded Jun 27, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

causal_worlds-0.35.0-py3-none-any.whl (93.9 kB view details)

Uploaded Jun 27, 2026 Python 3

File details

Details for the file causal_worlds-0.35.0.tar.gz.

File metadata

Download URL: causal_worlds-0.35.0.tar.gz
Upload date: Jun 27, 2026
Size: 9.8 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.14 {"installer":{"name":"uv","version":"0.11.14","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for causal_worlds-0.35.0.tar.gz
Algorithm	Hash digest
SHA256	`579a1a197ec768b366421b5194d4215b457843b3242cf9ff97a87e80da5cd397`
MD5	`0a19bcac45fec2f8133a1baa5c102a79`
BLAKE2b-256	`cc9817b0f13475c60a8baeb8a8f14eae2c8ce13aeb2388354d77d664a606d04e`

See more details on using hashes here.

File details

Details for the file causal_worlds-0.35.0-py3-none-any.whl.

File metadata

Download URL: causal_worlds-0.35.0-py3-none-any.whl
Upload date: Jun 27, 2026
Size: 93.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.14 {"installer":{"name":"uv","version":"0.11.14","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for causal_worlds-0.35.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c743189cbd7a3ac7d2186b97ae6c5269fba1b29bdce0468fa5548a3eebf40977`
MD5	`159b0e5fe1b623f90890fe7e48bcd75a`
BLAKE2b-256	`e6d2e6594aa9f621b034c8fee2fe2e894a15b0ccd8f0549a1b4665e28b701dab`

See more details on using hashes here.

causal-worlds 0.35.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

causal-worlds

Correlation lies — and because the world is declared, you can prove it

See the world it builds

Causality in three rungs (a 2-minute tour)

Install

60-second quickstart (no API key)

Benchmark your own discoverer

Benchmark a controller (Stage 2 — control)

Author a world from a description (needs [llm] + keys)

What the benchmark shows

What you get per world

Concepts

Roadmap

Why this is the unoccupied intersection

Contributing

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

Author a world from a description (needs `[llm]` + keys)