Synthetic tabular data generator for causal modeling

These details have not been verified by PyPI

Project description

dagzoo

dagzoo generates reproducible synthetic tabular datasets from latent causal structure.

Why dagzoo

Start from a curated recipe catalog instead of reverse-engineering the full internal config surface.
Generate datasets from sampled latent DAGs instead of treating each column as independent noise.
Use the same recipe surface from the packaged CLI and the PyTorch bridge.
Publish portable handoff roots directly to Hugging Face Hub without exposing dagzoo-only sidecars.
Reproduce runs with effective_config.yaml, effective_config_trace.yaml, and stable dataset metadata.

Start

Use the packaged CLI when you want the public workflow without a repo checkout. These are the main dagzoo commands most users start with:

uv tool install dagzoo

# Inspect the curated recipe catalog and see the stable public names.
dagzoo recipe list

# Generate a general-purpose baseline run under data/default_baseline/.
dagzoo generate --config recipe:default-baseline --num-datasets 25 --out data/default_baseline

# Generate a portable handoff root and publish it to Hugging Face Hub.
dagzoo generate --config recipe:default-baseline --num-datasets 25 --handoff-root handoffs/default_baseline
hf auth login
dagzoo publish hub --handoff-root handoffs/default_baseline --repo-id your-name/default-baseline-corpus

Use a repo checkout when you want to edit configs, run docs tooling, or work on the codebase:

./scripts/dev bootstrap
source .venv/bin/activate
.venv/bin/nox -s quick

For in-process training loops, use the same recipe references through the PyTorch bridge. build_dataloader(...) is the in-process equivalent of running dagzoo generate --config recipe:<name> from the CLI:

from dagzoo import build_dataloader

# Load the same baseline recipe directly into a training loop.
loader = build_dataloader(
    "recipe:default-baseline",
    num_datasets=10,
    seed=7,
    device="cpu",
)
sample = next(iter(loader))
print(sample["X_train"].shape)

Large heterogeneous runs can set runtime.layout_mode: stratified, which lets the generator batch compatible (n_rows, n_features) strata without collapsing all datasets onto one shared layout. The public runtime.layout_mode: fixed option is no longer supported.
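
As a rough illustration of what batching by (n_rows, n_features) strata can look like, the sketch below groups dataset layouts by shape so that each group could be processed together while every dataset keeps its own layout. This is not dagzoo's implementation: Layout and group_into_strata are hypothetical names introduced only for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Hypothetical per-dataset layout record; not a dagzoo type."""

    dataset_id: int
    n_rows: int
    n_features: int


def group_into_strata(layouts):
    """Group dataset ids by their shared (n_rows, n_features) shape."""
    strata = defaultdict(list)
    for layout in layouts:
        strata[(layout.n_rows, layout.n_features)].append(layout.dataset_id)
    return dict(strata)


layouts = [
    Layout(0, 256, 8),
    Layout(1, 512, 16),
    Layout(2, 256, 8),
    Layout(3, 512, 16),
    Layout(4, 1024, 4),
]
strata = group_into_strata(layouts)
for shape, dataset_ids in sorted(strata.items()):
    print(shape, dataset_ids)
```

Datasets 0 and 2 land in the same stratum because they share a shape, while dataset 4 stays in a stratum of its own rather than being forced onto a common layout.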

How it works

At a high level, dagzoo resolves a recipe or YAML config, derives deterministic seeds, and samples a latent causal structure together with feature and target assignments. It then executes that latent graph and emits the target from one selected latent node. Only after that does it apply optional missingness, as an observation model over the emitted features.

flowchart LR
    classDef setup fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#01579b
    classDef core fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#e65100
    classDef out fill:#f3e5f5,stroke:#4a148c,stroke-width:2px,color:#4a148c
    classDef post fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px,color:#1b5e20

    Config[Recipe or YAML config] --> Seed[Deterministic seeding]
    Seed --> Layout[Sample layout plus target-node selection]
    Layout --> DAG[Sample latent DAG plus feature and target assignments]
    DAG --> Exec[Execute latent node pipelines plus converters]
    Exec --> XComplete[Assemble complete features X_complete]
    Exec --> TargetConvert[Convert selected latent target node into y]
    XComplete --> Split[Apply split checks and postprocess]
    TargetConvert --> Split
    Split --> Missingness[Optional missingness over emitted features]
    Missingness --> Bundle[[Emit DatasetBundle or shard artifacts]]
    Bundle -. optional later replay .-> Filter[dagzoo filter]

    class Config,Seed setup
    class Layout,DAG,Exec,XComplete,TargetConvert core
    class Split,Missingness,Filter post
    class Bundle out
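
The deterministic seeding step can be pictured with a small sketch. The source does not describe dagzoo's actual seeding scheme, so the hash-based derivation below is an assumption, and derive_seed is a hypothetical helper rather than part of the dagzoo API. It shows one common way to turn a single root seed into independent, reproducible per-dataset seeds.

```python
import hashlib


def derive_seed(root_seed: int, *labels: str) -> int:
    """Derive a 64-bit child seed from a root seed and a path of labels.

    Assumption: SHA-256 over a joined key; dagzoo may use a different scheme.
    """
    key = "/".join([str(root_seed), *labels]).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big")


root = 7
dataset_seeds = [derive_seed(root, "dataset", str(i)) for i in range(3)]

# The same root seed and labels always give the same child seed.
print(dataset_seeds == [derive_seed(root, "dataset", str(i)) for i in range(3)])
```

Because each child seed depends only on the root seed and its labels, one dataset can be regenerated without replaying the datasets that came before it.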

Unlike generators that treat each column as independent noise, dagzoo generates both features and target from a latent causal structure. One node in the sampled graph can branch into multiple observable features. One latent node is selected as the target node during layout and assignment sampling; after latent execution, that node emits the target through its converter stack. Optional missingness can then censor the emitted feature table without changing how y was derived.

flowchart LR
    classDef latent fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#01579b,stroke-dasharray: 5 5
    classDef observable fill:#f5f5f5,stroke:#212121,stroke-width:2px,color:#212121

    subgraph LatentSpace [Latent Causal DAG]
        NodeA((Node A)) --> NodeB((Node B))
    end

    subgraph ObservableSpace [Tabular Dataset Layout]
        Feat1[Feature 1: Numeric]
        Feat2[Feature 2: Categorical]
        Feat3[Feature 3: Numeric]
        Target[Target Variable]
    end

    NodeA -. mapping .-> Feat1
    NodeA -. mapping .-> Feat2
    NodeB -. mapping .-> Feat3
    NodeB -. target mapping .-> Target

    class NodeA,NodeB latent
    class Feat1,Feat2,Feat3,Target observable

    style LatentSpace fill:#f0faff,stroke:#01579b,stroke-dasharray: 5 5
    style ObservableSpace fill:#fafafa,stroke:#212121

In practice, that means target-node selection happens early, target values are emitted later after latent execution, and optional missingness only affects the observed feature values emitted afterward.
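
The ordering described above can be mimicked with a toy generator that uses only the Python standard library. It is a minimal sketch under stated assumptions, not dagzoo's code: sample_dag, execute, and generate are hypothetical names, the tanh-plus-noise node function and the threshold "converter" are placeholders, and the missingness model is plain random masking.

```python
import math
import random


def sample_dag(n_nodes, edge_prob, rng):
    """Return parents[j] per node; edges only run from lower to higher index."""
    return [
        [i for i in range(j) if rng.random() < edge_prob]
        for j in range(n_nodes)
    ]


def execute(parents, n_rows, rng):
    """Compute latent values node by node in topological (index) order."""
    latent = []
    for node_parents in parents:
        column = []
        for row in range(n_rows):
            signal = sum(latent[p][row] for p in node_parents)
            column.append(math.tanh(signal) + rng.gauss(0.0, 0.5))
        latent.append(column)
    return latent


def generate(seed, n_nodes=4, n_rows=6, missing_rate=0.3):
    rng = random.Random(seed)
    parents = sample_dag(n_nodes, 0.5, rng)
    # The target node is chosen before any latent values exist.
    target_node = rng.randrange(n_nodes)
    # Each feature reads one latent node; a node may feed several features.
    feature_nodes = [rng.randrange(n_nodes) for _ in range(n_nodes + 1)]
    latent = execute(parents, n_rows, rng)
    # Toy converters: features copy latent values, the target is thresholded.
    X = [[latent[node][r] for node in feature_nodes] for r in range(n_rows)]
    y = [int(latent[target_node][r] > 0.0) for r in range(n_rows)]
    # Missingness comes last and touches only X, so y is unaffected.
    X_observed = [
        [None if rng.random() < missing_rate else value for value in row]
        for row in X
    ]
    return X_observed, y


X_a, y_a = generate(seed=7)
X_b, y_b = generate(seed=7)
print(X_a == X_b and y_a == y_b)  # same seed, same dataset
```

Changing missing_rate in this sketch changes which feature cells are masked but leaves y identical, because the target is derived before the missingness draws are made.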

Public Surface

If you're new, start with the named recipes. The public surface is small on purpose:

dagzoo recipe list shows the curated recipe catalog.
dagzoo generate --config recipe:<name> generates datasets from one of those published recipes.
dagzoo publish hub --handoff-root ... --repo-id ... publishes a portable handoff root to a Hugging Face dataset repo.
build_dataloader("recipe:<name>", ...) gives you the same recipe surface inside Python.

recipe:<name> is the stable public config handle most users should reach for first. recipes/*.yaml are the published YAML files behind those names, so you can inspect exactly what a recipe contains. Repo-local configs/*.yaml are for custom local authoring and may change more often than the named recipe surface.

For example, this command generates 25 datasets from the baseline recipe:

# recipe:default-baseline is the named public config.
# --out chooses the run directory on disk.
dagzoo generate --config recipe:default-baseline --num-datasets 25 --out data/default_baseline

What Lands on Disk

After that generate command finishes, this is the kind of layout you should expect under the run root:

data/default_baseline/
  effective_config.yaml
  effective_config_trace.yaml
  shard_00000/
    train.parquet
    test.parquet
    dataset_catalog.ndjson
  internal/
    shard_00000/
      replay_catalog.ndjson
      lineage/
        adjacency.bitpack.bin
        adjacency.index.json

The shard_* directories hold the stable public dataset artifacts. The internal/ tree holds dagzoo-only replay and lineage sidecars used by tooling such as dagzoo filter; it is not the stable public contract. effective_config.yaml records the fully resolved config for the run, and effective_config_trace.yaml records where overrides came from so the run is reproducible. The full artifact contract lives in docs/output-format.md. The exhaustive field catalog lives in docs/export-contract-fields.md.
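
To get a quick view of a run root, a short script can walk the public shard_* directories and count the rows in each dataset_catalog.ndjson. The sketch below relies only on the file names shown in the layout above. summarize_run is a hypothetical helper, not a dagzoo function, and it makes no assumptions about the fields inside each catalog row. It deliberately skips the internal/ tree, since that is not part of the stable public contract.

```python
import json
from pathlib import Path


def summarize_run(run_root):
    """Map each public shard directory to its file names and catalog row count."""
    root = Path(run_root)
    if not root.is_dir():
        return {}
    summary = {}
    for shard_dir in sorted(root.glob("shard_*")):
        if not shard_dir.is_dir():
            continue
        n_records = 0
        catalog = shard_dir / "dataset_catalog.ndjson"
        if catalog.exists():
            with catalog.open(encoding="utf-8") as handle:
                for line in handle:
                    if line.strip():
                        json.loads(line)  # raises if a row is not valid JSON
                        n_records += 1
        summary[shard_dir.name] = {
            "files": sorted(p.name for p in shard_dir.iterdir() if p.is_file()),
            "catalog_records": n_records,
        }
    return summary


# Returns an empty dict if the run root does not exist yet.
print(summarize_run("data/default_baseline"))
```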

Docs

Published docs site: bensonlee5.github.io/dagzoo
Start
Reference Packs
Publish to Hugging Face Hub
Advanced Controls
Artifacts & API
Export Contract Fields
How It Works
Feature Guides

Community

Project details

Release history

0.20.0

Apr 14, 2026

0.19.13

Apr 14, 2026

0.19.12

Apr 10, 2026

0.19.11

Apr 10, 2026

0.19.10 (this version)

Apr 7, 2026

0.19.9

Apr 6, 2026

0.19.8

Apr 6, 2026

0.19.7

Apr 6, 2026

0.19.6

Apr 5, 2026

0.19.5

Apr 3, 2026

0.19.4

Apr 3, 2026

0.19.3

Apr 3, 2026

0.19.2

Apr 3, 2026

0.19.1

Apr 2, 2026

0.19.0

Apr 1, 2026

0.18.0

Apr 1, 2026

0.17.0

Apr 1, 2026

0.16.1

Mar 31, 2026

0.16.0

Mar 31, 2026

0.15.2

Mar 31, 2026

0.15.1

Mar 27, 2026

0.15.0

Mar 27, 2026

0.14.5

Mar 27, 2026

0.14.4

Mar 26, 2026

0.14.3

Mar 26, 2026

0.14.2

Mar 26, 2026

0.14.1

Mar 26, 2026

0.14.0

Mar 26, 2026

0.13.0

Mar 25, 2026

0.12.0

Mar 24, 2026

0.11.0

Mar 22, 2026

0.10.3

Mar 19, 2026

0.10.2

Mar 15, 2026

0.10.1

Mar 15, 2026

0.10.0

Mar 15, 2026

0.9.11

Mar 15, 2026

0.9.10

Mar 15, 2026

0.9.9

Mar 15, 2026

0.9.8

Mar 13, 2026

0.9.7

Mar 13, 2026

Download files

Source Distribution

dagzoo-0.19.10.tar.gz (621.8 kB)

Uploaded Apr 7, 2026 Source

Built Distribution

dagzoo-0.19.10-py3-none-any.whl (278.7 kB)

Uploaded Apr 7, 2026 Python 3

File details

Details for the file dagzoo-0.19.10.tar.gz.

File metadata

Download URL: dagzoo-0.19.10.tar.gz
Upload date: Apr 7, 2026
Size: 621.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for dagzoo-0.19.10.tar.gz
Algorithm	Hash digest
SHA256	`a483b38657c2648cf7381de30c009efbd35d1756fabfb8f66113ea68d48b594e`
MD5	`8322d161a5e9e871fd60e06b929829d8`
BLAKE2b-256	`91d4bdb33765e3210507a4307a17b5a96d6c551ca8aace064c1c465b85ce8ad2`

Provenance

The following attestation bundles were made for dagzoo-0.19.10.tar.gz:

Publisher: package.yml on bensonlee5/dagzoo

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dagzoo-0.19.10.tar.gz
- Subject digest: a483b38657c2648cf7381de30c009efbd35d1756fabfb8f66113ea68d48b594e
- Sigstore transparency entry: 1248708849
- Sigstore integration time: Apr 7, 2026
Source repository:
- Permalink: bensonlee5/dagzoo@624ae3fef4f156f7d56366f79ef419ed9d65ab50
- Branch / Tag: refs/heads/main
- Owner: https://github.com/bensonlee5
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: package.yml@624ae3fef4f156f7d56366f79ef419ed9d65ab50
- Trigger Event: push

File details

Details for the file dagzoo-0.19.10-py3-none-any.whl.

File metadata

Download URL: dagzoo-0.19.10-py3-none-any.whl
Upload date: Apr 7, 2026
Size: 278.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for dagzoo-0.19.10-py3-none-any.whl
Algorithm	Hash digest
SHA256	`dd8a4da7306a4254e31beaf4d25e24f61d01b610d443183b20d21a7b7ba63846`
MD5	`4b4949542b2c51cdb8e980d51c64487f`
BLAKE2b-256	`f402ba94522d4e9f5e4ec94865dca47b40e9e3fd03ca61ed92f854d04de8993e`

Provenance

The following attestation bundles were made for dagzoo-0.19.10-py3-none-any.whl:

Publisher: package.yml on bensonlee5/dagzoo

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dagzoo-0.19.10-py3-none-any.whl
- Subject digest: dd8a4da7306a4254e31beaf4d25e24f61d01b610d443183b20d21a7b7ba63846
- Sigstore transparency entry: 1248708917
- Sigstore integration time: Apr 7, 2026
Source repository:
- Permalink: bensonlee5/dagzoo@624ae3fef4f156f7d56366f79ef419ed9d65ab50
- Branch / Tag: refs/heads/main
- Owner: https://github.com/bensonlee5
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: package.yml@624ae3fef4f156f7d56366f79ef419ed9d65ab50
- Trigger Event: push
