
Synthetic tabular data generator for causal modeling


dagzoo

High-throughput synthetic tabular data generation built around causal structure. Use it to generate, benchmark, and stress-test tabular datasets with deterministic seed behavior.

flowchart LR
    %% Class Definitions
    classDef setup fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#01579b
    classDef core fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#e65100
    classDef out fill:#f3e5f5,stroke:#4a148c,stroke-width:2px,color:#4a148c

    Seed([Root Seed]) --> RNG[Deterministic Seeding]
    RNG --> Layout[Layout & DAG Sampling]
    Layout --> Mechanisms[Random Functional Mechanisms]
    Mechanisms --> Converters[Feature/Target Converters]
    Converters --> Bundle[[DatasetBundle: X, y, Metadata]]

    %% Assign Classes
    class Seed,RNG setup
    class Layout,Mechanisms,Converters core
    class Bundle out
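The deterministic-seeding stage above can be sketched with NumPy's SeedSequence, where one root seed deterministically spawns an independent stream per pipeline stage. This is an illustrative stand-in, not dagzoo's actual fan-out scheme:

```python
# Illustrative seed fan-out: one root seed -> independent, reproducible
# RNG streams, one per pipeline stage (dagzoo's internal scheme may differ).
import numpy as np

def fan_out(root_seed, n_stages):
    """Spawn one independent, reproducible RNG per pipeline stage."""
    root = np.random.SeedSequence(root_seed)
    return [np.random.default_rng(child) for child in root.spawn(n_stages)]

# Two fan-outs from the same root seed yield identical streams.
rngs_a = fan_out(42, 3)
rngs_b = fan_out(42, 3)
assert all(a.integers(0, 1000) == b.integers(0, 1000)
           for a, b in zip(rngs_a, rngs_b))
```

Because spawning is hierarchical, changing the number of stages or the root seed changes every downstream stream in a reproducible way, which is what makes runs auditable.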

From Latent DAG to Tabular Data

Unlike many generators that treat each column as an independent noise source, dagzoo generates data from a latent causal structure. A single node in the causal graph can branch into multiple observable features, preserving complex dependency patterns.

flowchart LR
    %% Class Definitions
    classDef latent fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#01579b,stroke-dasharray: 5 5
    classDef observable fill:#f5f5f5,stroke:#212121,stroke-width:2px,color:#212121

    subgraph LatentSpace [Latent Causal DAG]
        NodeA((Node A)) --> NodeB((Node B))
    end

    subgraph ObservableSpace [Tabular Dataset Layout]
        Feat1[Feature 1: Numeric]
        Feat2[Feature 2: Categorical]
        Feat3[Feature 3: Numeric]
        Target[Target Variable]
    end

    %% Mapping connections
    NodeA -. mapping .-> Feat1
    NodeA -. mapping .-> Feat2
    NodeB -. mapping .-> Feat3
    NodeB -. mapping .-> Target

    %% Assign Classes
    class NodeA,NodeB latent
    class Feat1,Feat2,Feat3,Target observable

    style LatentSpace fill:#f0faff,stroke:#01579b,stroke-dasharray: 5 5
    style ObservableSpace fill:#fafafa,stroke:#212121
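A toy version of this latent-to-observable mapping, using made-up mechanisms rather than dagzoo's real ones, looks like this for the two-node DAG A → B shown above:

```python
# Toy illustration (not dagzoo's actual mechanisms) of generating tabular
# data from a 2-node latent DAG A -> B, where each latent node branches
# into multiple observable columns.
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Latent DAG: A is a root node; B is caused by A via a nonlinear mechanism.
A = rng.normal(size=n)
B = np.tanh(2.0 * A) + 0.1 * rng.normal(size=n)

# Converters: one latent node can branch into several observable features.
feat1 = A                                  # numeric view of A
feat2 = np.digitize(A, [-1.0, 0.0, 1.0])   # categorical view of A (binned)
feat3 = B + 0.05 * rng.normal(size=n)      # noisy numeric view of B
target = (B > 0).astype(int)               # binary target derived from B

X = np.column_stack([feat1, feat2, feat3])
y = target
```

Note that feat1 and feat2 are statistically dependent by construction (both derive from A), a pattern that independent per-column noise generators cannot produce.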

Why dagzoo

Researchers need synthetic tabular corpora whose structure, regime, and robustness envelope they can control. The graph structure, functional relationships, noise, shift, and missingness settings chosen at generation time directly shape what downstream models train on.

dagzoo exposes explicit controls for each of these axes: graph structure, mechanism families, noise distributions, distribution shift, and missingness, plus canonical fixed-layout generation semantics for repeatable, high-throughput runs.

dagzoo is for situations where you need synthetic tabular data that is:

  • Causally structured: datasets are generated from a sampled latent DAG, not independent column noise.
  • Reproducible: deterministic seed fan-out and effective-config trace artifacts make runs auditable.
  • Stress-testable: shift, noise, and missingness controls let you probe model robustness under controlled distribution changes.
  • Operationally scalable: canonical fixed-layout generation and benchmark guardrails support repeatable high-throughput workflows.
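The shift and missingness controls in the list above can be sketched generically; these helper names and their behavior are illustrative, not dagzoo's config surface:

```python
# Hedged sketch of two robustness controls described above: covariate
# shift and MCAR missingness. dagzoo's real knobs live in its generation
# config; these standalone helpers just show the idea.
import numpy as np

def apply_mean_shift(X, columns, magnitude):
    """Shift the mean of selected columns to simulate covariate shift."""
    X = X.copy()
    X[:, columns] += magnitude
    return X

def apply_mcar_missingness(X, rate, rng):
    """Mask entries missing-completely-at-random at the given rate."""
    X = X.copy()
    X[rng.random(X.shape) < rate] = np.nan
    return X

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4))
X_shifted = apply_mean_shift(X, columns=[0, 2], magnitude=1.5)
X_missing = apply_mcar_missingness(X, rate=0.1, rng=rng)
```

Sweeping the magnitude or rate parameters across runs is the basic recipe for probing a model's robustness envelope under controlled distribution changes.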

Quick Start

Examples in this README assume a repo checkout (so configs/*.yaml is available):

./scripts/dev bootstrap
source .venv/bin/activate
./scripts/dev doctor all

Install the packaged CLI globally when you do not need repo presets/config files:

uv tool install dagzoo

Generate a default batch from the repo:

dagzoo generate --config configs/default.yaml --num-datasets 10 --out data/run1

Or stream canonical task samples directly into a PyTorch training loop:

from dagzoo import build_dataloader

loader = build_dataloader(
    "configs/default.yaml",
    num_datasets=10,
    seed=7,
    device="cpu",
)
sample = next(iter(loader))
print(sample.keys())

Use build_dataloader(...) as the recommended PyTorch entrypoint for task-sized samples with X_train, y_train, X_test, y_test, feature_types, and metadata. Reach for DagzooDataset only when you need the lower-level iterable dataset interface. The current v1 bridge supports num_workers=0; see the usage guide for the full API contract.
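A downstream consumer of these task-sized samples might look like the sketch below. The stand-in dict mirrors the key names listed above; its contents and shapes are invented for illustration:

```python
# Hedged sketch of consuming one task-sized sample. A toy dict stands in
# for a real dagzoo loader output; only the key names come from the docs.
import numpy as np

sample = {
    "X_train": np.zeros((80, 3)),
    "y_train": np.zeros(80),
    "X_test": np.zeros((20, 3)),
    "y_test": np.zeros(20),
    "feature_types": ["numeric", "categorical", "numeric"],
    "metadata": {"seed": 7},
}

def fit_and_score(sample):
    """Toy consumer: a real model would be fit on train, scored on test."""
    X_train, X_test = sample["X_train"], sample["X_test"]
    assert X_train.shape[1] == X_test.shape[1]  # consistent feature layout
    return len(sample["y_test"])                # stand-in for a test score

n_test = fit_and_score(sample)
```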

Each generate run writes effective_config.yaml and effective_config_trace.yaml in the resolved output directory. dagzoo generate samples one internal fixed-layout plan per run, so all datasets emitted in the same run share one sampled layout/execution plan. Generate configs must not include runtime.worker_count or runtime.worker_index.

Run a downstream handoff workflow from generate:

dagzoo generate --config configs/default.yaml --num-datasets 10 --handoff-root handoffs/run1 --device cpu --hardware-policy none

dagzoo generate --handoff-root writes one stable handoff root with:

  • handoff_manifest.json as the downstream machine-readable entrypoint
  • generated/ for raw shard outputs plus effective-config artifacts
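A downstream consumer of the handoff root can start from the manifest and the generated/ directory. The manifest field names in this sketch are invented; only handoff_manifest.json and generated/ are documented above:

```python
# Hypothetical downstream reader for a dagzoo handoff root. Only the
# handoff_manifest.json entrypoint and generated/ layout come from the
# docs; the manifest schema here is made up for illustration.
import json
import tempfile
from pathlib import Path

def read_handoff(handoff_root):
    """Load the machine-readable manifest and list raw shard outputs."""
    root = Path(handoff_root)
    manifest = json.loads((root / "handoff_manifest.json").read_text())
    shard_paths = sorted((root / "generated").iterdir())
    return manifest, shard_paths

# Build a toy handoff root to exercise the reader:
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "generated").mkdir()
    (root / "generated" / "shard_000.parquet").touch()
    (root / "handoff_manifest.json").write_text(
        json.dumps({"run": "run1", "num_datasets": 10})
    )
    manifest, shards = read_handoff(root)
    assert manifest["num_datasets"] == 10 and len(shards) == 1
```

Treating the manifest as the single machine-readable entrypoint keeps downstream tooling decoupled from the shard file layout.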

Run a smoke benchmark:

dagzoo benchmark --suite smoke --preset cpu --out-dir benchmarks/results/smoke_cpu

--device is a single-preset benchmark override. For multi-preset benchmark runs, set the device in each preset/config instead of passing one shared CLI override.

Inspect detected hardware tier:

dagzoo hardware

Workflow Surfaces

dagzoo is the canonical packaged CLI. Use ./scripts/dev as the fast repo-local path for bootstrap, doctor, review-base, impact, ready, and verify flows.

Surface        Use it for
dagzoo         Canonical packaged CLI for generation, benchmarking, diversity-audit, and hardware workflows.
./scripts/dev  Fast repo-local bootstrap, doctor, review, and verification flows.

Use --help in this order:

  1. dagzoo --help
  2. dagzoo <command> --help

CLI layout:

dagzoo
├── generate
├── filter
├── benchmark
├── diversity-audit
└── hardware

Local repo workflow before review:

./scripts/dev review-base
./scripts/dev ready

For focused local analysis outside the pre-review flow:

./scripts/dev impact
./scripts/dev verify quick

Documentation

Primary docs site:

Start here for end-user workflows and contracts:

If you are integrating dagzoo downstream, treat these as the stable references:


Download files

Download the file for your platform.

Source Distribution

dagzoo-0.14.3.tar.gz (559.9 kB)

Uploaded Source

Built Distribution

dagzoo-0.14.3-py3-none-any.whl (241.2 kB)

Uploaded Python 3

File details

Details for the file dagzoo-0.14.3.tar.gz.

File metadata

  • Download URL: dagzoo-0.14.3.tar.gz
  • Upload date:
  • Size: 559.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dagzoo-0.14.3.tar.gz

Algorithm    Hash digest
SHA256       be882dca771dca5ace0cbcdfb6aedcbe0686a456108cf00260f561270821a428
MD5          ac3a664f58f145e0ecdc37105187360d
BLAKE2b-256  c1b121237b1a440a066979d7d7a28f003f1f8a8cd3458c231269f058d0a47bca

Provenance

The following attestation bundles were made for dagzoo-0.14.3.tar.gz:

Publisher: package.yml on bensonlee5/dagzoo

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file dagzoo-0.14.3-py3-none-any.whl.

File metadata

  • Download URL: dagzoo-0.14.3-py3-none-any.whl
  • Upload date:
  • Size: 241.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dagzoo-0.14.3-py3-none-any.whl

Algorithm    Hash digest
SHA256       a71f6e051e69bf3aed576e42cdc48876930519962ec9af1ccb21131ac772d075
MD5          ddec52287e80ad0e6974a14f3b1a992d
BLAKE2b-256  88b5c2e4b918a3c5856b1406874bfd9bc08fc1525979a6216388d3be87204f66

Provenance

The following attestation bundles were made for dagzoo-0.14.3-py3-none-any.whl:

Publisher: package.yml on bensonlee5/dagzoo

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
