Synthetic tabular data generator for causal modeling
dagzoo
High-throughput synthetic tabular data generation built around causal structure. Use it to generate, benchmark, and stress-test tabular datasets with deterministic seed behavior.
flowchart LR
%% Class Definitions
classDef setup fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#01579b
classDef core fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#e65100
classDef out fill:#f3e5f5,stroke:#4a148c,stroke-width:2px,color:#4a148c
Seed([Root Seed]) --> RNG[Deterministic Seeding]
RNG --> Layout[Layout & DAG Sampling]
Layout --> Mechanisms[Random Functional Mechanisms]
Mechanisms --> Converters[Feature/Target Converters]
Converters --> Bundle[[DatasetBundle: X, y, Metadata]]
%% Assign Classes
class Seed,RNG setup
class Layout,Mechanisms,Converters core
class Bundle out
From Latent DAG to Tabular Data
Unlike many generators that treat each column as an independent noise source, dagzoo generates data from a latent causal structure. A single node in the causal graph can branch into multiple observable features, preserving complex dependency patterns.
flowchart LR
%% Class Definitions
classDef latent fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#01579b,stroke-dasharray: 5 5
classDef observable fill:#f5f5f5,stroke:#212121,stroke-width:2px,color:#212121
subgraph LatentSpace [Latent Causal DAG]
NodeA((Node A)) --> NodeB((Node B))
end
subgraph ObservableSpace [Tabular Dataset Layout]
Feat1[Feature 1: Numeric]
Feat2[Feature 2: Categorical]
Feat3[Feature 3: Numeric]
Target[Target Variable]
end
%% Mapping connections
NodeA -. mapping .-> Feat1
NodeA -. mapping .-> Feat2
NodeB -. mapping .-> Feat3
NodeB -. mapping .-> Target
%% Assign Classes
class NodeA,NodeB latent
class Feat1,Feat2,Feat3,Target observable
style LatentSpace fill:#f0faff,stroke:#01579b,stroke-dasharray: 5 5
style ObservableSpace fill:#fafafa,stroke:#212121
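To make the mapping in the diagram concrete, here is a minimal stdlib-only sketch of the idea. It is not dagzoo's implementation: the two-node graph mirrors the diagram, while the mechanism (tanh of a scaled parent plus Gaussian noise), the converters, and the name sample_rows are all illustrative assumptions.

```python
import math
import random


def sample_rows(n_rows, seed):
    """Generate rows from a toy latent DAG A -> B (illustrative only)."""
    rng = random.Random(seed)  # one seeded RNG makes the output repeatable
    rows = []
    for _ in range(n_rows):
        # Latent nodes: A is a root, B depends on A through a nonlinear mechanism.
        node_a = rng.gauss(0.0, 1.0)
        node_b = math.tanh(1.5 * node_a) + rng.gauss(0.0, 0.1)

        rows.append({
            # Node A fans out into one numeric and one categorical feature.
            "feature_1": node_a + rng.gauss(0.0, 0.05),
            "feature_2": "high" if node_a > 0.0 else "low",
            # Node B fans out into a numeric feature and the target.
            "feature_3": 2.0 * node_b,
            "target": int(node_b > 0.0),
        })
    return rows


rows = sample_rows(n_rows=5, seed=7)
for row in rows:
    print(row)
```

Because feature_1 and feature_2 share Node A as a parent, they are statistically dependent even though neither is computed from the other, which is the property independent per-column generators lose.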
Why dagzoo
Researchers need synthetic tabular corpora whose structure, regime, and robustness envelope they can control. The graph structure, functional relationships, noise, shift, and missingness settings chosen at generation time directly shape what downstream models train on.
dagzoo provides explicit control over graph structure, mechanism families,
noise distributions, distribution shift, missingness, and canonical fixed-layout
generation semantics. It is designed for researchers who need repeatable
synthetic tabular generation with clear control over the main axes of variation
in the resulting corpus.
dagzoo is for situations where you need synthetic tabular data that is:
- Causally structured: datasets are generated from a sampled latent DAG, not independent column noise.
- Reproducible: deterministic seed fan-out and effective-config trace artifacts make runs auditable.
- Stress-testable: shift, noise, and missingness controls let you probe model robustness under controlled distribution changes.
- Operationally scalable: canonical fixed-layout generation and benchmark guardrails support repeatable high-throughput workflows.
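As an illustration of what deterministic seed fan-out can look like, the sketch below derives independent child seeds from one root seed by hashing a labelled path. This README does not describe dagzoo's actual derivation scheme, so the function name derive_seed, the path labels, and the use of SHA-256 are assumptions chosen only to show the general pattern.

```python
import hashlib


def derive_seed(root_seed, *path):
    """Derive a child seed from a root seed and a path of labels or indices."""
    key = "/".join([str(root_seed), *map(str, path)]).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes as an unsigned 64-bit integer seed.
    return int.from_bytes(digest[:8], "big")


root = 7
for index in range(3):
    # Each dataset gets separate, stable seeds for each stage (hypothetical labels).
    layout_seed = derive_seed(root, "dataset", index, "layout")
    noise_seed = derive_seed(root, "dataset", index, "noise")
    print(index, layout_seed, noise_seed)
```

The useful property is that a child seed depends only on the root seed and its path, so regenerating dataset 2 alone yields the same result as generating it as part of a larger run.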
Quick Start
Examples in this README assume a repo checkout, so that the configs/*.yaml presets are available:
./scripts/dev bootstrap
source .venv/bin/activate
./scripts/dev doctor all
Install the packaged CLI globally when you do not need repo presets/config files:
uv tool install dagzoo
Generate a default batch from the repo:
dagzoo generate --config configs/default.yaml --num-datasets 10 --out data/run1
Or stream canonical task samples directly into a PyTorch training loop:
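from dagzoo import build_dataloader

# Build a loader that yields task-sized samples from the default preset.
loader = build_dataloader(
    "configs/default.yaml",
    num_datasets=10,
    seed=7,
    device="cpu",
)

# Pull one sample and list the fields it carries.
sample = next(iter(loader))
print(sample.keys())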
Use build_dataloader(...) as the recommended PyTorch entrypoint for
task-sized samples with X_train, y_train, X_test, y_test,
feature_types, and metadata. Reach for DagzooDataset only when you need
the lower-level iterable dataset interface. The current v1 bridge supports
num_workers=0; see the usage guide for the full API contract.
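A small guard like the following can catch a malformed sample early in a training script. It is a sketch that assumes samples are mapping-like (as the sample.keys() call above suggests) and uses the field names listed in this README; the helper name missing_sample_keys and the stand-in sample are hypothetical.

```python
EXPECTED_KEYS = (
    "X_train", "y_train", "X_test", "y_test", "feature_types", "metadata",
)


def missing_sample_keys(sample):
    """Return the expected keys that are absent from a mapping-like sample."""
    return [key for key in EXPECTED_KEYS if key not in sample]


# Stand-in sample for illustration; a real one would come from next(iter(loader)).
fake_sample = {"X_train": [[0.1]], "y_train": [1], "X_test": [[0.2]], "y_test": [0]}
print(missing_sample_keys(fake_sample))  # ['feature_types', 'metadata']
```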
Each generate run writes effective_config.yaml and effective_config_trace.yaml
in the resolved output directory.
dagzoo generate samples one internal fixed-layout plan per run, so all
datasets emitted in the same run share one sampled layout/execution plan.
Generate configs must not include runtime.worker_count or
runtime.worker_index.
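The restriction on runtime.worker_count and runtime.worker_index can be checked before launching a run. The sketch below assumes the config has already been parsed into a nested dict (for example by a YAML parser) and that the dotted names refer to keys nested under runtime; the helper name forbidden_keys_present is hypothetical and not part of dagzoo.

```python
FORBIDDEN_KEYS = ("runtime.worker_count", "runtime.worker_index")


def forbidden_keys_present(config):
    """Return the dotted keys from FORBIDDEN_KEYS that exist in a nested dict."""
    found = []
    for dotted in FORBIDDEN_KEYS:
        node = config
        for part in dotted.split("."):
            if not isinstance(node, dict) or part not in node:
                break  # path does not exist, so this key is not present
            node = node[part]
        else:
            # The loop finished without a break: every path segment was found.
            found.append(dotted)
    return found


print(forbidden_keys_present({"runtime": {"worker_count": 4}}))  # ['runtime.worker_count']
print(forbidden_keys_present({"runtime": {}}))  # []
```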
Run a downstream handoff workflow from generate:
dagzoo generate --config configs/default.yaml --num-datasets 10 --handoff-root handoffs/run1 --device cpu --hardware-policy none
dagzoo generate --handoff-root writes one stable handoff root with:
- handoff_manifest.json as the downstream machine-readable entrypoint
- generated/ for raw shard outputs plus effective-config artifacts
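A downstream consumer might start by loading the manifest, roughly as sketched here. The sketch assumes handoff_manifest.json sits directly under the handoff root and makes no assumptions about its schema (see the Output Format docs for that); the helper name load_handoff_manifest is hypothetical.

```python
import json
from pathlib import Path


def load_handoff_manifest(handoff_root):
    """Read handoff_manifest.json from a handoff root and return the parsed JSON."""
    manifest_path = Path(handoff_root) / "handoff_manifest.json"
    with manifest_path.open(encoding="utf-8") as handle:
        return json.load(handle)


root = Path("handoffs/run1")
if (root / "handoff_manifest.json").exists():
    manifest = load_handoff_manifest(root)
    # Only inspect the shape here; field names are defined by the manifest schema.
    if isinstance(manifest, dict):
        print(sorted(manifest))
    else:
        print(type(manifest).__name__)
else:
    print(f"No manifest found under {root}")
```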
Run a smoke benchmark:
dagzoo benchmark --suite smoke --preset cpu --out-dir benchmarks/results/smoke_cpu
--device is a single-preset benchmark override. For multi-preset benchmark
runs, set the device in each preset/config instead of passing one shared CLI
override.
Inspect detected hardware tier:
dagzoo hardware
Workflow Surfaces
dagzoo is the canonical packaged CLI. Use ./scripts/dev as the fast
repo-local path for bootstrap, doctor, review-base, impact, ready, and verify
flows.
| Surface | Use it for |
|---|---|
| dagzoo | Canonical packaged CLI for generation, benchmarking, corpus-audit, and hardware workflows. |
| ./scripts/dev | Fast repo-local bootstrap, doctor, review, and verification flows. |
Use --help in this order:
1. dagzoo --help
2. dagzoo <command> --help
CLI layout:
dagzoo
├── generate
├── filter
├── benchmark
├── diversity-audit
└── hardware
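When driving the CLI from a script, building the argument list explicitly avoids shell-quoting problems. The sketch below uses only the generate flags shown earlier in this README (--config, --num-datasets, --out); the helper names build_generate_command and run_generate are hypothetical, and running the command requires dagzoo on PATH.

```python
import shlex
import subprocess


def build_generate_command(config, num_datasets, out_dir):
    """Build the argv list for `dagzoo generate` using flags shown in this README."""
    return [
        "dagzoo", "generate",
        "--config", str(config),
        "--num-datasets", str(num_datasets),
        "--out", str(out_dir),
    ]


def run_generate(command):
    """Execute the command and raise CalledProcessError on a non-zero exit."""
    subprocess.run(command, check=True)


command = build_generate_command("configs/default.yaml", 10, "data/run1")
# Print the shell-equivalent form for logging; call run_generate(command) to execute.
print(shlex.join(command))
```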
Local repo workflow before review:
./scripts/dev review-base
./scripts/dev ready
For focused local analysis outside the pre-review flow:
./scripts/dev impact
./scripts/dev verify quick
Documentation
Primary docs site:
Start here for end-user workflows and contracts:
- How It Works: System flow and terminology.
- Transforms (Math Reference): Formal transform math, notation, and operator definitions.
- Usage Guide: Primary workflow hub.
- Output Format: Output schema and artifacts.
- Feature Guides: Diagnostics, missingness, many-class, shift, noise, and benchmark guardrails.
If you are integrating dagzoo downstream, treat these as the stable
references:
- Handoff workflow and CLI usage: Usage Guide
- Generated artifacts and handoff manifest schema: Output Format