Maintainers

santhosh_19x

DataDoom

Local-first, open-source engine for controllable, reproducible synthetic data.

Design the dataset the way you reason about it — distributions, causal relationships, difficulty, and failure modes — and regenerate it identically, forever, from a single spec file.

North star: a synthetic dataset should be as version-controllable, shareable, and reproducible as source code.

📖 Docs: https://santhoshreddy352.github.io/datadoom/ · authoritative design in docs_v2/ (start at docs_v2/00_README_Index.md).

Why DataDoom

Synthetic data usually forces a trade-off: it's either realistic but a black box (you can't say what relationships or flaws it contains) or controllable but throwaway (you can't regenerate the exact same dataset tomorrow). That makes it hard to teach with, benchmark against, file a bug against, or share.

The goal: make a dataset something you design and version-control like source code. You declare its structure — distributions, causal relationships, difficulty, and data-quality failures — in one spec file, and DataDoom regenerates it byte-for-byte identically from (spec_hash, seed), while honestly reporting how well the realized data matches what you asked for. No network, no telemetry, no account: everything runs locally.
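One way to picture regeneration from (spec_hash, seed) is the sketch below. It is not DataDoom's actual implementation: the function names `spec_hash`, `make_rng`, and `generate`, the spec layout, and the way the hash and seed are combined are all assumptions made for illustration. It only shows why hashing a canonical form of the spec and seeding a single RNG from it gives the same values every time.

```python
import hashlib
import json
import random


def spec_hash(spec: dict) -> str:
    """Hash a canonical JSON form so key order and whitespace do not matter."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_rng(spec: dict, seed: int) -> random.Random:
    """Derive one RNG from (spec_hash, seed). Hypothetical derivation scheme."""
    material = f"{spec_hash(spec)}:{seed}".encode("utf-8")
    state = int.from_bytes(hashlib.sha256(material).digest()[:8], "big")
    return random.Random(state)


def generate(spec: dict, seed: int) -> list[float]:
    """Draw one Gaussian column described by a toy, made-up spec layout."""
    rng = make_rng(spec, seed)
    col = spec["columns"]["amount"]
    return [rng.gauss(col["mean"], col["std"]) for _ in range(spec["rows"])]


spec = {"rows": 5, "columns": {"amount": {"mean": 100.0, "std": 15.0}}}
# Same content, different key order: the canonical hash is unchanged.
reordered = {"columns": {"amount": {"std": 15.0, "mean": 100.0}}, "rows": 5}

first = generate(spec, seed=42)
again = generate(reordered, seed=42)
other = generate(spec, seed=43)

print(first == again)  # same spec + seed, same values
print(first == other)  # a different seed gives a different dataset
```

Hashing the canonical form rather than the raw file means cosmetic edits to a spec do not change its identity, while any change to a parameter does.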

Good for: ML teaching & reproducible benchmarks · testing data pipelines on known edge cases · sharing a dataset's recipe instead of PII · hackathon / challenge datasets with a known ground truth.

What it does

Deterministic by construction — one seeded RNG underpins everything; the same spec + seed yields a bitwise-identical dataset on the pinned path.
Honest statistics — distributions are sampled correctly and their fit is reported (KS / chi-square goodness-of-fit, compliance score); parameters are never refit to flatter the sample.
Causal structure — a DAG of structural equations (linear/logistic/polynomial/…) with per-node noise and do() interventions, plus a true-graph + mutual-information report.
Failure injection — eight mechanisms (MCAR/MAR/MNAR, label & feature noise, drift, covariate shift, leakage) corrupt a copy while the clean baseline is kept, with realized-effect diffs.
Difficulty targeting — calibrate a binary label to a chosen baseline-model AUROC band, reported with the achieved metric, knobs, and bisection trace.
Rich feature types — numeric/categorical/boolean/datetime, realistic seeded text (names, emails, addresses), additive time-series, and latent (hidden) features.
Extensible — distributions, structural functions, failure modes, exporters, and probes all ship as plugins against the engine ABCs, with zero core changes.
Built to consume — export CSV / JSON / Parquet, load a run straight into pandas / PyTorch / TensorFlow / HuggingFace, and start from built-in domain templates (including ready-made hackathon challenges).
Two surfaces, one engine — a CLI for automation and a web Canvas for design both call the exact same pipeline, so results never diverge.
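To make the causal-structure and failure-injection items above concrete, here is a minimal standard-library sketch. It does not use DataDoom's API: the toy graph (amount -> risk -> is_fraud), the coefficients, and the helper names `sample_rows` and `inject_mcar` are hypothetical. It shows a seeded set of structural equations, a do() intervention that pins one node, and an MCAR corruption applied to a copy so the clean baseline survives.

```python
import copy
import math
import random


def sample_rows(n, seed, do=None):
    """Sample a toy DAG: amount -> risk -> is_fraud. `do` pins nodes to fixed values."""
    do = do or {}
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Noise is always drawn, so an intervention does not shift the random stream.
        u_amount, u_risk, u_label = rng.gauss(0, 1), rng.gauss(0, 0.5), rng.random()
        amount = do.get("amount", 50.0 + 20.0 * u_amount)
        risk = do.get("risk", 0.04 * amount - 2.0 + u_risk)  # linear equation
        p_fraud = 1.0 / (1.0 + math.exp(-risk))               # logistic link
        rows.append({"amount": amount, "risk": risk, "is_fraud": int(u_label < p_fraud)})
    return rows


def inject_mcar(rows, column, rate, seed):
    """Blank `column` completely at random in a deep copy of the rows."""
    rng = random.Random(seed)
    corrupted = copy.deepcopy(rows)
    for row in corrupted:
        if rng.random() < rate:
            row[column] = None
    return corrupted


clean = sample_rows(1000, seed=42)
intervened = sample_rows(1000, seed=42, do={"amount": 200.0})
corrupted = inject_mcar(clean, "amount", rate=0.1, seed=7)

observed_rate = sum(r["is_fraud"] for r in clean) / len(clean)
intervened_rate = sum(r["is_fraud"] for r in intervened) / len(intervened)
realized_missing = sum(r["amount"] is None for r in corrupted) / len(corrupted)

print(f"fraud rate, observational: {observed_rate:.3f}")
print(f"fraud rate, do(amount=200): {intervened_rate:.3f}")
print(f"missing rate requested 0.100, realized {realized_missing:.3f}")
```

Reporting the realized missing rate next to the requested one mirrors the "realized-effect" idea: the corruption is random, so what was asked for and what was produced are stated separately.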

Status

Phases 0–5 complete; 1.0 hardening underway. Everything listed under "What it does" above ships today. What remains for 1.0 is hardening (docs site, release automation, the repro matrix); see status.md. Optional team mode is deferred as a future add-on.

Install

pip install datadoom              # engine + CLI
pip install "datadoom[server]"    # + web Canvas (datadoom serve)
pip install "datadoom[parquet]"   # + Parquet export

Quickstart

# generate a dataset from a spec
datadoom run examples/causal-fraud.datadoom.yaml --seed 42 --out out/

# validate a spec
datadoom validate examples/causal-fraud.datadoom.yaml

# verify a run reproduces bitwise from spec + seed
datadoom verify examples/causal-fraud.datadoom.yaml --seed 42 --against out/

# start from a built-in domain template
datadoom template use fraud-detection --out my.datadoom.yaml
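As a rough mental model of what a bitwise check between two runs involves, the sketch below hashes every file in two output directories and lists the paths that differ. This is an assumption about the general technique, not a description of how `datadoom verify` works internally; `tree_digest`, `diff_runs`, and the `data.csv` file name are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def tree_digest(root: Path) -> dict[str, str]:
    """Map each file's relative path to the SHA-256 of its bytes."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_runs(expected: Path, actual: Path) -> list[str]:
    """Return relative paths that are missing, extra, or differ in content."""
    a, b = tree_digest(expected), tree_digest(actual)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))


# Two throwaway directories stand in for two runs of the same spec + seed.
with tempfile.TemporaryDirectory() as tmp:
    run_a, run_b = Path(tmp, "a"), Path(tmp, "b")
    for run in (run_a, run_b):
        run.mkdir()
        (run / "data.csv").write_bytes(b"id,amount\n1,10.5\n")
    identical = diff_runs(run_a, run_b)

    # A single changed byte is enough to break bitwise equality.
    (run_b / "data.csv").write_bytes(b"id,amount\n1,10.50\n")
    changed = diff_runs(run_a, run_b)

print(identical)  # []
print(changed)    # ['data.csv']
```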

Web UI (Canvas)

The web Canvas — design schemas, wire causal graphs, configure difficulty/failures, generate with a live tracker, preview/compare/export — ships prebuilt inside the package (no Node toolchain needed). There are two ways to run it.

Option A — pip + `datadoom serve`

pip install "datadoom[server]"   # the [server] extra adds FastAPI/uvicorn
datadoom serve                   # serves the API + Canvas on http://127.0.0.1:8000

Then open http://127.0.0.1:8000 in your browser. datadoom serve is what starts the UI — installing the package alone does not run a server.

Still seeing the error "The web server needs extra deps … pip install 'datadoom[server]'" even after installing the extra? You almost certainly have an older datadoom already installed, so pip reports "already satisfied" and never pulls the [server] dependencies. Force a clean reinstall:
pip install --upgrade --force-reinstall --no-cache-dir "datadoom[server]"

Option B — Docker (UI starts automatically)

The image's entrypoint is datadoom serve, so the Canvas comes up as soon as the container runs — you do not run any extra command.

Build and run from a clone (works today):

docker build -t datadoom:local .
docker run --rm -p 8000:8000 -v datadoom-data:/data datadoom:local

Or pull the published image (available after a tagged release pushes it to GHCR — see docs_v2/22 §3):

docker run --rm -p 8000:8000 -v datadoom-data:/data ghcr.io/santhoshreddy352/datadoom:latest

Each docker run is a single line on purpose: it works in PowerShell, CMD, and bash alike. A \ line continuation works in bash and other POSIX shells but breaks in PowerShell and CMD.

Open http://localhost:8000. The -v datadoom-data:/data volume persists your datasets/runs across restarts; the server binds 0.0.0.0:8000 inside the container.

Development

Clone the repo (or fork it first on GitHub and clone your fork if you intend to open a pull request), then set up a project-local virtual environment:

# clone (use your fork's URL if you forked)
git clone https://github.com/SanthoshReddy352/datadoom.git
cd datadoom

# project-local venv (Python 3.11 matches CI's lowest supported version)
python -m venv .venv
.venv\Scripts\activate          # Windows
# source .venv/bin/activate     # macOS/Linux

pip install -e ".[dev]"         # editable install + dev tools

ruff check src tests            # lint
lint-imports                    # architecture boundaries
mypy                            # type-check
pytest                          # test suite

Contributions are welcome — please commit with DCO sign-off (git commit -s) and run the gates above before opening a PR. See CONTRIBUTING.md.

The reproducibility guarantee (scoped)

Given the same spec and seed, on the pinned path (single-threaded BLAS, pinned library versions, CPU, same OS/arch), DataDoom produces a bitwise-identical dataset. Across different OS/architectures we guarantee statistical, not bitwise, equivalence, because floating-point reductions can be evaluated in a different order and round differently. The cross-OS × cross-Python reproducibility matrix enforces this in CI. See docs_v2/13_Testing_and_Reproducibility_Strategy.md.
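The reason bitwise equality cannot be promised everywhere comes down to floating-point addition not being associative. The small, self-contained illustration below is independent of DataDoom: the same three numbers summed in two orders give results that agree to within rounding but are not the same bits.

```python
import math
import operator
from functools import reduce

values = [0.1, 0.2, 0.3]

# Plain left-to-right accumulation in two different orders.
left_to_right = reduce(operator.add, values)            # (0.1 + 0.2) + 0.3
right_to_left = reduce(operator.add, reversed(values))  # (0.3 + 0.2) + 0.1

bitwise_equal = left_to_right == right_to_left
close = math.isclose(left_to_right, right_to_left, rel_tol=1e-12)

print(left_to_right, right_to_left)  # 0.6000000000000001 0.6
print(bitwise_equal, close)          # False True
```

A BLAS build or CPU that reorders a reduction has the same effect at scale, which is why the pinned path fixes threading and library versions.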

License

Apache-2.0.

Release history

This version: 0.1.1 (Jun 4, 2026)

Download files

Source Distribution

datadoom-0.1.1.tar.gz (1.1 MB)

Uploaded Jun 4, 2026 Source

Built Distribution

datadoom-0.1.1-py3-none-any.whl (710.7 kB)

Uploaded Jun 4, 2026 Python 3

File details

Details for the file datadoom-0.1.1.tar.gz.

File metadata

Download URL: datadoom-0.1.1.tar.gz
Upload date: Jun 4, 2026
Size: 1.1 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for datadoom-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`e1f82c64f9b7fbd3f02cdbb7e80fa4d4915bdd51a7edf1330c46c679e226e51e`
MD5	`490cc97c4d8f6648e916dbd7dc685199`
BLAKE2b-256	`1ec0d3d30518ee13ae9f9ba5f0fd151acbdbfc0ae628f3f953f13d4fdfd9e7a6`
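A downloaded file can be checked against the SHA256 digest in the table with the standard library alone. The sketch below assumes the sdist has been saved to the current working directory under its published file name; the helper names `sha256_of` and `verify` are illustrative.

```python
import hashlib
from pathlib import Path

# Digest copied from the SHA256 row above.
EXPECTED_SHA256 = "e1f82c64f9b7fbd3f02cdbb7e80fa4d4915bdd51a7edf1330c46c679e226e51e"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large archives are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected: str) -> bool:
    """Compare the file's digest with the expected hex string, ignoring case."""
    return sha256_of(path) == expected.lower()


sdist = Path("datadoom-0.1.1.tar.gz")  # assumed location of the download
if sdist.exists():
    print("OK" if verify(sdist, EXPECTED_SHA256) else "MISMATCH")
else:
    print(f"{sdist} not found; download it first")
```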

Provenance

The following attestation bundles were made for datadoom-0.1.1.tar.gz:

Publisher: release.yml on SanthoshReddy352/datadoom

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: datadoom-0.1.1.tar.gz
- Subject digest: e1f82c64f9b7fbd3f02cdbb7e80fa4d4915bdd51a7edf1330c46c679e226e51e
- Sigstore transparency entry: 1720360686
- Sigstore integration time: Jun 4, 2026
Source repository:
- Permalink: SanthoshReddy352/datadoom@76f7a5507e9046074be248b7da3b9f4f9c985424
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/SanthoshReddy352
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@76f7a5507e9046074be248b7da3b9f4f9c985424
- Trigger Event: push

File details

Details for the file datadoom-0.1.1-py3-none-any.whl.

File metadata

Download URL: datadoom-0.1.1-py3-none-any.whl
Upload date: Jun 4, 2026
Size: 710.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for datadoom-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`929ff25d9768d4baecf4b477d383e1fdd12c59e7f384c96287bb5c81807669b2`
MD5	`e19796d91bcdabbbbf2509e7188023d8`
BLAKE2b-256	`2530c4792b8a48ead85f17a98db7b01edfaf002b3b4f44dee1802609372c3720`

Provenance

The following attestation bundles were made for datadoom-0.1.1-py3-none-any.whl:

Publisher: release.yml on SanthoshReddy352/datadoom

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: datadoom-0.1.1-py3-none-any.whl
- Subject digest: 929ff25d9768d4baecf4b477d383e1fdd12c59e7f384c96287bb5c81807669b2
- Sigstore transparency entry: 1720360792
- Sigstore integration time: Jun 4, 2026
Source repository:
- Permalink: SanthoshReddy352/datadoom@76f7a5507e9046074be248b7da3b9f4f9c985424
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/SanthoshReddy352
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@76f7a5507e9046074be248b7da3b9f4f9c985424
- Trigger Event: push
