Local-first, open-source engine for controllable, reproducible synthetic data.
DataDoom
Design the dataset the way you reason about it — distributions, causal relationships, difficulty, and failure modes — and regenerate it identically, forever, from a single spec file.
North star: a synthetic dataset should be as version-controllable, shareable, and reproducible as source code.
📖 Docs: https://santhoshreddy352.github.io/datadoom/ · authoritative design in
docs_v2/ (start at docs_v2/00_README_Index.md).
Why DataDoom
Synthetic data usually forces a trade-off: it's either realistic but a black box (you can't say what relationships or flaws it contains) or controllable but throwaway (you can't regenerate the exact same dataset tomorrow). That makes it hard to teach with, benchmark against, file a bug against, or share.
The goal: make a dataset something you design and version-control like source
code. You declare its structure — distributions, causal relationships, difficulty,
and data-quality failures — in one spec file, and DataDoom regenerates it
byte-for-byte identically from (spec_hash, seed), while honestly reporting how
well the realized data matches what you asked for. No network, no telemetry, no
account: everything runs locally.
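To make the (spec_hash, seed) idea concrete, here is a minimal sketch in plain Python. It is not DataDoom's implementation: the canonical-JSON hashing scheme, the `spec_hash` and `generate` helpers, and the toy spec keys (`rows`, `mean`, `std`) are all assumptions made up for illustration. It only shows why a single RNG seeded from a content hash plus a seed yields identical bytes on every run.

```python
import hashlib
import json
import random


def spec_hash(spec: dict) -> str:
    """Hash a canonical JSON form of the spec (hypothetical scheme, not DataDoom's)."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def generate(spec: dict, seed: int) -> bytes:
    """Render a tiny one-column CSV from one RNG seeded by (spec_hash, seed)."""
    rng = random.Random(f"{spec_hash(spec)}:{seed}")
    lines = ["x"]
    for _ in range(spec["rows"]):
        lines.append(repr(rng.gauss(spec["mean"], spec["std"])))
    return "\n".join(lines).encode("utf-8")


spec = {"rows": 5, "mean": 0.0, "std": 1.0}
reordered = dict(reversed(list(spec.items())))  # same content, different key order

first = generate(spec, seed=42)
second = generate(reordered, seed=42)
other_seed = generate(spec, seed=43)

print(first == second)      # same spec content and seed
print(first == other_seed)  # same spec, different seed
```

Hashing a canonical form rather than the raw file means cosmetic changes such as key order do not change the dataset, while any change to a value or to the seed does.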
Good for: ML teaching & reproducible benchmarks · testing data pipelines on known edge cases · sharing a dataset's recipe instead of PII · hackathon / challenge datasets with a known ground truth.
What it does
- Deterministic by construction — one seeded RNG underpins everything; the same spec + seed yields a bitwise-identical dataset on the pinned path.
- Honest statistics — distributions are sampled correctly and their fit is reported (KS / chi-square goodness-of-fit, compliance score); parameters are never refit to flatter the sample.
- Causal structure — a DAG of structural equations (linear/logistic/polynomial/…) with per-node noise and do() interventions, plus a true-graph + mutual-information report.
- Failure injection — eight mechanisms (MCAR/MAR/MNAR, label & feature noise, drift, covariate shift, leakage) corrupt a copy while the clean baseline is kept, with realized-effect diffs.
- Difficulty targeting — calibrate a binary label to a chosen baseline-model AUROC band, reported with the achieved metric, knobs, and bisection trace.
- Rich feature types — numeric/categorical/boolean/datetime, realistic seeded text (names, emails, addresses), additive time-series, and latent (hidden) features.
- Extensible — distributions, structural functions, failure modes, exporters, and probes all ship as plugins against the engine ABCs, with zero core changes.
- Built to consume — export CSV / JSON / Parquet, load a run straight into pandas / PyTorch / TensorFlow / HuggingFace, and start from built-in domain templates (including ready-made hackathon challenges).
- Two surfaces, one engine — a CLI for automation and a web Canvas for design both call the exact same pipeline, so results never diverge.
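Two of the ideas in the list above, structural equations with do() interventions and failure injection on a copy, can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not DataDoom's engine or API: the `simulate` and `inject_mcar` functions, the two-node graph x -> y, and the coefficients are hypothetical.

```python
import random
from typing import Optional


def simulate(n: int, seed: int, do_x: Optional[float] = None) -> list[dict]:
    """Sample a two-node graph x -> y; do_x replaces x's own equation."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x_natural = rng.gauss(0.0, 1.0)    # always drawn, so the RNG stream stays aligned
        x = x_natural if do_x is None else do_x
        y = 2.0 * x + rng.gauss(0.0, 0.1)  # linear structural equation plus node noise
        rows.append({"x": x, "y": y})
    return rows


def inject_mcar(rows: list[dict], column: str, rate: float, seed: int) -> list[dict]:
    """Return a corrupted copy with values missing completely at random."""
    rng = random.Random(seed)
    corrupted = [dict(row) for row in rows]  # the clean baseline is left untouched
    for row in corrupted:
        if rng.random() < rate:
            row[column] = None
    return corrupted


clean = simulate(1000, seed=7)
intervened = simulate(1000, seed=7, do_x=1.0)
dirty = inject_mcar(clean, "y", rate=0.2, seed=7)

realized = sum(row["y"] is None for row in dirty) / len(dirty)
mean_y = sum(row["y"] for row in intervened) / len(intervened)
print(f"requested missing rate 0.20, realized {realized:.3f}")
print(f"mean of y under do(x=1): {mean_y:.3f}")
```

Reporting the realized missing rate next to the requested one mirrors the "honest statistics" idea: the sample is described as it came out, not as it was asked for.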
Status
Phases 0–5 complete; 1.0 hardening underway. Everything in What it does above
ships today. Remaining for 1.0 is hardening (docs site, release automation, the repro
matrix); see status.md. Optional team mode is a deferred future addon.
Install
pip install datadoom # engine + CLI
pip install "datadoom[server]" # + web Canvas (datadoom serve)
pip install "datadoom[parquet]" # + Parquet export
Quickstart
# generate a dataset from a spec
datadoom run examples/causal-fraud.datadoom.yaml --seed 42 --out out/
# validate a spec
datadoom validate examples/causal-fraud.datadoom.yaml
# verify a run reproduces bitwise from spec + seed
datadoom verify examples/causal-fraud.datadoom.yaml --seed 42 --against out/
# start from a built-in domain template
datadoom template use fraud-detection --out my.datadoom.yaml
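If you want to check bitwise equality of two output directories yourself, a small script along these lines may help. It is a sketch, not how `datadoom verify` works internally: `digest_tree` and `differing_files` are hypothetical helpers, and the script assumes every file in a run directory is meant to be identical between runs, which may not hold if a run also writes logs or timestamps.

```python
import hashlib
from pathlib import Path


def digest_tree(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def differing_files(run_a: Path, run_b: Path) -> list[str]:
    """List files that are missing from one run or differ in content."""
    a, b = digest_tree(run_a), digest_tree(run_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

For example, after generating the same spec and seed into `out/` and a second directory, `differing_files(Path("out"), Path("out2"))` returns an empty list when every file matches byte for byte.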
Web UI (Canvas)
The web Canvas — design schemas, wire causal graphs, configure difficulty/failures, generate with a live tracker, preview/compare/export — ships prebuilt inside the package (no Node toolchain needed). There are two ways to run it.
Option A — pip + datadoom serve
pip install "datadoom[server]" # the [server] extra adds FastAPI/uvicorn
datadoom serve # serves the API + Canvas on http://127.0.0.1:8000
Then open http://127.0.0.1:8000 in your browser. datadoom serve is what starts
the UI — installing the package alone does not run a server.
Hitting "The web server needs extra deps … pip install 'datadoom[server]'" even after installing it? You almost certainly have an older datadoom already installed, so pip reports "already satisfied" and never pulls the [server] dependencies. Force a clean reinstall:
pip install --upgrade --force-reinstall --no-cache-dir "datadoom[server]"
Option B — Docker (UI starts automatically)
The image's entrypoint is datadoom serve, so the Canvas comes up as soon as
the container runs — you do not run any extra command.
Build and run from a clone (works today):
docker build -t datadoom:local .
docker run --rm -p 8000:8000 -v datadoom-data:/data datadoom:local
Or pull the published image (available after a tagged release pushes it to
GHCR — see docs_v2/22 §3):
docker run --rm -p 8000:8000 -v datadoom-data:/data ghcr.io/santhoshreddy352/datadoom:latest
Each docker run is a single line on purpose: it works in PowerShell, CMD, and bash alike. A \ line-continuation is bash-only and breaks in PowerShell.
Open http://localhost:8000. The -v datadoom-data:/data volume persists your
datasets/runs across restarts; the server binds 0.0.0.0:8000 inside the container.
Development
Clone the repo (or fork it first on GitHub and clone your fork if you intend to open a pull request), then set up a project-local virtual environment:
# clone (use your fork's URL if you forked)
git clone https://github.com/SanthoshReddy352/datadoom.git
cd datadoom
# project-local venv (Python 3.11 matches CI's lowest supported version)
python -m venv .venv
.venv\Scripts\activate # Windows
# source .venv/bin/activate # macOS/Linux
pip install -e ".[dev]" # editable install + dev tools
ruff check src tests # lint
lint-imports # architecture boundaries
mypy # type-check
pytest # test suite
Contributions are welcome — please commit with DCO sign-off (git commit -s) and run
the gates above before opening a PR. See CONTRIBUTING.md.
The reproducibility guarantee (scoped)
Given the same spec and seed, on the pinned path (single-threaded BLAS, pinned
library versions, CPU, same OS/arch), DataDoom produces a bitwise-identical dataset.
Across different OS/architectures we guarantee statistical — not bitwise —
equivalence (FP reductions differ). The cross-OS × cross-Python reproducibility matrix
enforces this in CI. See
docs_v2/13_Testing_and_Reproducibility_Strategy.md.
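The reason floating-point reductions can differ is that floating-point addition is not associative, so a library that sums in a different order (for example, a different BLAS build or thread count) can produce a result that differs in the last bits. The snippet below is a generic Python illustration of that effect, not DataDoom code; the tolerance is an arbitrary choice for the example.

```python
import math

values = [0.1, 0.2, 0.3]

# Same three numbers, two summation orders.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

bitwise_equal = left_to_right == right_to_left
close_enough = math.isclose(left_to_right, right_to_left, rel_tol=1e-12)

print(bitwise_equal, close_enough)  # prints: False True
```

This is why the bitwise guarantee is scoped to a pinned path, while comparisons across platforms have to use tolerances or statistical tests instead of exact equality.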
License