Synthetic tabular data generator for causal modeling
dagzoo
High-throughput synthetic tabular data generation built around causal structure. Use it to generate, benchmark, and stress-test tabular datasets with deterministic seed behavior.
flowchart LR
%% Class Definitions
classDef setup fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#01579b
classDef core fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#e65100
classDef gate fill:#f1f8e9,stroke:#33691e,stroke-width:2px,color:#33691e
classDef out fill:#f3e5f5,stroke:#4a148c,stroke-width:2px,color:#4a148c
Seed([Root Seed]) --> RNG[Deterministic Seeding]
RNG --> Layout[Layout & DAG Sampling]
Layout --> Mechanisms[Random Functional Mechanisms]
Mechanisms --> Converters[Feature/Target Converters]
Converters --> Filter[Learnability Filter]
Filter --> Bundle[[DatasetBundle: X, y, Metadata]]
%% Assign Classes
class Seed,RNG setup
class Layout,Mechanisms,Converters core
class Filter gate
class Bundle out
From Latent DAG to Tabular Data
Unlike many generators that treat each column as an independent noise source, dagzoo generates data from a latent causal structure. A single node in the causal graph can branch into multiple observable features, preserving complex dependency patterns.
flowchart LR
%% Class Definitions
classDef latent fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#01579b,stroke-dasharray: 5 5
classDef observable fill:#f5f5f5,stroke:#212121,stroke-width:2px,color:#212121
subgraph LatentSpace [Latent Causal DAG]
NodeA((Node A)) --> NodeB((Node B))
end
subgraph ObservableSpace [Tabular Dataset Layout]
Feat1[Feature 1: Numeric]
Feat2[Feature 2: Categorical]
Feat3[Feature 3: Numeric]
Target[Target Variable]
end
%% Mapping connections
NodeA -. mapping .-> Feat1
NodeA -. mapping .-> Feat2
NodeB -. mapping .-> Feat3
NodeB -. mapping .-> Target
%% Assign Classes
class NodeA,NodeB latent
class Feat1,Feat2,Feat3,Target observable
style LatentSpace fill:#f0faff,stroke:#01579b,stroke-dasharray: 5 5
style ObservableSpace fill:#fafafa,stroke:#212121
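The mapping in the diagram can be illustrated with a toy sampler. This is a minimal sketch of the idea only, not dagzoo's implementation: the function name `sample_rows`, the mechanisms (`tanh`, squaring), and the noise scales are all assumptions chosen for illustration.

```python
import math
import random


def sample_rows(n_rows: int, seed: int) -> list:
    """Sample rows from a toy two-node latent DAG (Node A -> Node B).

    Hypothetical illustration: the mechanisms and converters below are
    arbitrary choices, not the ones dagzoo samples.
    """
    rng = random.Random(seed)  # a single seeded RNG keeps the run reproducible
    rows = []
    for _ in range(n_rows):
        # Latent layer: B is a noisy nonlinear function of A.
        node_a = rng.gauss(0.0, 1.0)
        node_b = math.tanh(1.5 * node_a) + rng.gauss(0.0, 0.3)
        # Observable layer: each latent node fans out into several columns.
        rows.append(
            {
                "feat1": 2.0 * node_a + rng.gauss(0.0, 0.1),  # numeric, from A
                "feat2": "high" if node_a > 0 else "low",     # categorical, from A
                "feat3": node_b ** 2,                         # numeric, from B
                "target": int(node_b > 0),                    # target, from B
            }
        )
    return rows
```

Because `feat1` and `feat2` share the parent Node A, they are statistically dependent even though neither is computed from the other, which is the property that independent per-column noise cannot reproduce.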
Why dagzoo
dagzoo is for situations where you need synthetic tabular data that is:
- Causally structured: datasets are generated from a sampled latent DAG, not independent column noise.
- Reproducible: deterministic seed fan-out and effective-config trace artifacts make runs auditable.
- Stress-testable: shift, noise, missingness, and deferred filter controls let you probe model robustness under controlled distribution changes.
- Operationally scalable: canonical fixed-layout generation and benchmark guardrails support repeatable high-throughput workflows.
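As a rough illustration of what deterministic seed fan-out can look like, the sketch below derives independent child seeds from one root seed by hashing a label path. The helper names `child_seed` and `dataset_rng`, the label scheme, and the use of SHA-256 are assumptions for this example; dagzoo's actual derivation may differ.

```python
import hashlib
import random


def child_seed(root_seed: int, *path: object) -> int:
    """Derive a 64-bit child seed from a root seed and a label path.

    Hypothetical scheme: hash "root/label/label/..." and keep 8 bytes.
    Labels should not contain "/" or distinct paths could collide.
    """
    key = "/".join([str(root_seed), *map(str, path)]).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big")


def dataset_rng(root_seed: int, dataset_index: int, stage: str) -> random.Random:
    """Return an RNG that depends only on (root seed, dataset index, stage)."""
    return random.Random(child_seed(root_seed, "dataset", dataset_index, stage))
```

With a scheme like this, the stream used for any one dataset and stage does not depend on how many other datasets were generated or in what order, which is what makes individual runs replayable.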
Quick Start
Examples in this README assume a repo checkout (so configs/*.yaml is available):
uv sync --group dev
source .venv/bin/activate
./scripts/dev doctor all
Install the packaged CLI globally when you do not need repo presets/config files:
uv tool install dagzoo
Generate a default batch from the repo:
dagzoo generate --config configs/default.yaml --num-datasets 10 --out data/run1
Each generate run writes effective_config.yaml and effective_config_trace.yaml
in the resolved output directory.
dagzoo generate samples one internal fixed-layout plan per run, so all
datasets emitted in the same run share one sampled layout/execution plan.
Run dagzoo filter as a separate stage for acceptance decisions.
Deferred filtering now replays strictly from embedded shard metadata; generated
artifacts must include metadata.config.dataset.task and metadata.config.filter.
Generate configs must not include runtime.worker_count or
runtime.worker_index.
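The rules above can be turned into a small pre-flight check. This is a sketch under stated assumptions: it takes configs and shard metadata as already-parsed dictionaries, treats the dotted names as nested keys (with `metadata` as the root object), and assumes the effective-config files sit directly in the directory you pass in. The helper names are hypothetical and not part of dagzoo.

```python
from collections.abc import Mapping
from pathlib import Path
from typing import Any

# File and key names are taken from the notes above.
EXPECTED_ARTIFACTS = ("effective_config.yaml", "effective_config_trace.yaml")
REQUIRED_METADATA_PATHS = ("config.dataset.task", "config.filter")
FORBIDDEN_CONFIG_PATHS = ("runtime.worker_count", "runtime.worker_index")


def has_path(doc: Any, dotted: str) -> bool:
    """Return True if every key along a dotted path exists in nested mappings."""
    node = doc
    for key in dotted.split("."):
        if not isinstance(node, Mapping) or key not in node:
            return False
        node = node[key]
    return True


def missing_artifacts(out_dir) -> list:
    """List expected effective-config files that are absent from out_dir."""
    out = Path(out_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (out / name).is_file()]


def missing_metadata_paths(metadata: Mapping) -> list:
    """List metadata paths that deferred filtering needs but are absent."""
    return [p for p in REQUIRED_METADATA_PATHS if not has_path(metadata, p)]


def forbidden_config_paths(config: Mapping) -> list:
    """List keys that a generate config must not contain but does."""
    return [p for p in FORBIDDEN_CONFIG_PATHS if has_path(config, p)]
```

Each helper returns a list of problems, so an empty list means that particular check passed.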
Run deferred filtering on generated shards:
dagzoo filter --in data/run1 --out data/run1_filter
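If you drive the two stages from Python rather than a shell, a thin wrapper like the following may be enough. It is a sketch that uses only the flags shown in the commands above; the function names and the `_filter` output-directory suffix mirror this README's example and are otherwise assumptions.

```python
import subprocess


def generate_cmd(config: str, num_datasets: int, out_dir: str) -> list:
    """Build the argv for the generate stage."""
    return [
        "dagzoo", "generate",
        "--config", config,
        "--num-datasets", str(num_datasets),
        "--out", out_dir,
    ]


def filter_cmd(in_dir: str, out_dir: str) -> list:
    """Build the argv for the deferred filter stage."""
    return ["dagzoo", "filter", "--in", in_dir, "--out", out_dir]


def run_pipeline(config: str, num_datasets: int, run_dir: str) -> None:
    """Run generate, then filter; raise CalledProcessError if a stage fails."""
    subprocess.run(generate_cmd(config, num_datasets, run_dir), check=True)
    subprocess.run(filter_cmd(run_dir, run_dir + "_filter"), check=True)


# Example call (requires the dagzoo CLI on PATH and a repo checkout):
# run_pipeline("configs/default.yaml", 10, "data/run1")
```

Passing `check=True` stops the pipeline at the first failing stage, so the filter never runs against a partial generate output.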
Run a downstream handoff workflow from a concise request file:
dagzoo request --request requests/tab_foundry_smoke.yaml --device cpu --hardware-policy none
dagzoo request writes one stable request-run root with:
- handoff_manifest.json as the downstream machine-readable entrypoint
- generated/ for raw shard outputs plus effective-config artifacts
- filter/ for deferred-filter artifacts
- curated/ for accepted-only shards
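A downstream consumer might start from the request-run root roughly as follows. This sketch assumes only the file and directory names listed above; it loads the manifest as opaque JSON and makes no assumption about its keys, which are defined by the Request File Contract. The function name `inspect_request_run` is hypothetical.

```python
import json
from pathlib import Path

# Directory names as listed for the request-run root.
REQUEST_RUN_SUBDIRS = ("generated", "filter", "curated")


def inspect_request_run(root) -> dict:
    """Load the handoff manifest and report which expected subdirs exist."""
    root = Path(root)
    manifest_path = root / "handoff_manifest.json"
    with manifest_path.open(encoding="utf-8") as fh:
        manifest = json.load(fh)  # schema is not assumed here
    present = {name: (root / name).is_dir() for name in REQUEST_RUN_SUBDIRS}
    return {"manifest": manifest, "subdirs": present}
```

A missing manifest raises `FileNotFoundError`, which is usually the right failure mode for a handoff that did not complete.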
Run a smoke benchmark:
dagzoo benchmark --suite smoke --preset cpu --out-dir benchmarks/results/smoke_cpu
--device is a single-preset benchmark override. For multi-preset benchmark
runs, set the device in each preset/config instead of passing one shared CLI
override.
Inspect detected hardware tier:
dagzoo hardware
View help and available options for commands:
dagzoo --help
dagzoo generate --help
dagzoo filter --help
dagzoo benchmark --help
Local repo workflow before review:
./scripts/dev impact
./scripts/dev verify quick
Documentation
Primary docs site:
Start here for end-user workflows and contracts:
- How It Works: System flow and terminology.
- Transforms (Math Reference): Formal transform math, notation, and operator definitions.
- Usage Guide: Primary workflow hub.
- Output Format: Output schema and artifacts.
- Request File Contract: Public request schema and one-way handoff contract for downstream consumers.
- Feature Guides: Diagnostics, missingness, many-class, shift, noise, and benchmark guardrails.
If you are integrating dagzoo downstream, treat these as the stable
references:
- Request inputs for dagzoo request: Request File Contract
- Generated and request-run artifacts: Output Format
File details
Details for the file dagzoo-0.9.10.tar.gz.
File metadata
- Download URL: dagzoo-0.9.10.tar.gz
- Upload date:
- Size: 470.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5f888263fd511b3705f10b3751d1833d72ac48b7c530dd89ff82a483bbf8ae17 |
| MD5 | 35b0338d3cda4e4f7516a59c823f20bb |
| BLAKE2b-256 | 431b5fd3c1504ec309e426dc8eaa02ddad5e9ac5c763f568338c9161f44730d8 |
Provenance
The following attestation bundles were made for dagzoo-0.9.10.tar.gz:
Publisher: package.yml on bensonlee5/dagzoo
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dagzoo-0.9.10.tar.gz
- Subject digest: 5f888263fd511b3705f10b3751d1833d72ac48b7c530dd89ff82a483bbf8ae17
- Sigstore transparency entry: 1108089436
- Sigstore integration time:
- Permalink: bensonlee5/dagzoo@24c6d7e936e02f610d67a6bd8804e1df779e0c24
- Branch / Tag: refs/heads/codex/hypothesis-testing
- Owner: https://github.com/bensonlee5
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: package.yml@24c6d7e936e02f610d67a6bd8804e1df779e0c24
- Trigger Event: workflow_dispatch
File details
Details for the file dagzoo-0.9.10-py3-none-any.whl.
File metadata
- Download URL: dagzoo-0.9.10-py3-none-any.whl
- Upload date:
- Size: 214.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f7f928647d1e5422d986752b2d390abd2e7d7ef331444db6034d1983cda5dc0e |
| MD5 | 15c632bdbe3013fadbb8e2aa08f3fe1c |
| BLAKE2b-256 | 2ac68285339be4d12fa6d33ad603c704b91a7affc4fe266f6240268a4d42fc6b |
Provenance
The following attestation bundles were made for dagzoo-0.9.10-py3-none-any.whl:
Publisher: package.yml on bensonlee5/dagzoo
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dagzoo-0.9.10-py3-none-any.whl
- Subject digest: f7f928647d1e5422d986752b2d390abd2e7d7ef331444db6034d1983cda5dc0e
- Sigstore transparency entry: 1108089438
- Sigstore integration time:
- Permalink: bensonlee5/dagzoo@24c6d7e936e02f610d67a6bd8804e1df779e0c24
- Branch / Tag: refs/heads/codex/hypothesis-testing
- Owner: https://github.com/bensonlee5
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: package.yml@24c6d7e936e02f610d67a6bd8804e1df779e0c24
- Trigger Event: workflow_dispatch