Synthetic dataset generation library
Blueprint
Blueprint is a pure-Python library for generating realistic synthetic datasets. Define features, population segments, and causal relationships between columns — then emit a reproducible pandas.DataFrame in one call.
from blueprint import Blueprint, Feature, Class, Influence

df = (
    Blueprint(n=1000, seed=42)
    .add_feature(
        Feature("sqft", dtype=int, base=1800, std=400, clip=(500, 5000)),
        Feature("price", dtype=float, base=0, std=0, derived=True),
        Feature("tax", dtype=float, base=0, std=0, derived=True),
    )
    .add_class(Class("luxury", when=("sqft", ">=", 2500)))
    .add_influence(
        Influence("sqft").on("price", effect="+155 per unit"),
        Influence("price").on("tax", effect="+0.012 per unit"),
    )
    .emit()
)
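To make the two-hop chain in this example concrete, here is the arithmetic for a single row, written as plain Python without importing Blueprint. It is a sketch under two assumptions: a `derived=True` column starts at 0, and a `per unit` effect adds `rate * source` to the target. The helper name `apply_per_unit` is hypothetical and not part of the library.

```python
def apply_per_unit(target: float, source: float, rate: float) -> float:
    """Add rate * source to the target (assumed meaning of '+<rate> per unit')."""
    return target + rate * source


sqft = 1800                               # the base value of the sqft feature
price = apply_per_unit(0.0, sqft, 155)    # derived column starts at 0
tax = apply_per_unit(0.0, price, 0.012)   # second hop uses the computed price

print(f"sqft={sqft} price={price:.0f} tax={tax:.0f}")
# sqft=1800 price=279000 tax=3348
```

The second hop only gives a sensible result if `price` is computed before `tax`, which is why evaluation order matters for chained influences.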
Why Blueprint?
Most synthetic data tools generate columns independently. Blueprint lets you specify how one column affects another and preserves those relationships in the output:
- Features: numeric, boolean, categorical, datetime, text/template, computed, and derived columns
- Classes: named population segments that override feature parameters for a subset of rows
- Influences: causal edges (source → target) with rich effect types, optional row-level noise, class-conditional behavior, and gating conditions
- DAG: dependencies are topologically sorted so multi-hop chains always evaluate in the right order
- Reproducibility: every run with the same seed produces identical data; influence noise has its own deterministic sub-seed per edge
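The DAG point can be illustrated with the standard library alone. The sketch below is not Blueprint's internal code; it only shows the general technique of topologically sorting influence edges with `graphlib`, using the edges from the quick-start example. The function name `evaluation_order` is an assumption made for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Influence edges as (source, target) pairs, deliberately listed out of order.
edges = [("price", "tax"), ("sqft", "price")]


def evaluation_order(edges):
    """Return columns ordered so every source comes before its targets."""
    sorter = TopologicalSorter()
    for source, target in edges:
        sorter.add(target, source)  # the target depends on the source
    return list(sorter.static_order())


order = evaluation_order(edges)
print(order)  # ['sqft', 'price', 'tax']

# A circular dependency cannot be ordered and is reported as an error.
try:
    evaluation_order(edges + [("tax", "sqft")])
except CycleError:
    print("cycle detected")
```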
Installation
pip install blueprint-synth
Requires Python 3.10+ and depends only on numpy and pandas.
import blueprint
Feature overview
Features
Feature("age", dtype=int, base=35, std=10, clip=(18, 80))
Feature("active", dtype=bool, p=0.7)
Feature("tier", dtype="category", values=["bronze", "silver", "gold"], weights=[5, 3, 1])
Feature("joined", dtype="datetime", start="2020-01-01", end="2024-12-31")
Feature("user_id", dtype="id", style="uuid")
Feature("score", dtype="computed", formula=lambda df: df["a"] * 2 + df["b"])
Feature("revenue", dtype=float, base=0, std=0, derived=True) # accumulates from influences only
Modifiers chain onto any numeric feature: .trend(), .seasonality(), .noise(), .clip(), .round().
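As a rough mental model of what `base`, `std`, `clip`, and `weights` mean, the following sketch draws values with the standard library. It assumes numeric features are normally distributed around `base` and that `weights` are relative frequencies; Blueprint's actual sampling may differ, and the function names here are hypothetical.

```python
import random


def sample_numeric(rng, base, std, clip, n):
    """Draw n normal values around base, clamp them to clip, and round to int."""
    low, high = clip
    return [int(round(min(max(rng.gauss(base, std), low), high))) for _ in range(n)]


def sample_category(rng, values, weights, n):
    """Draw n labels with probability proportional to the given weights."""
    return rng.choices(values, weights=weights, k=n)


rng = random.Random(42)  # seeded generator, so reruns give the same values
ages = sample_numeric(rng, base=35, std=10, clip=(18, 80), n=1000)
tiers = sample_category(rng, ["bronze", "silver", "gold"], [5, 3, 1], n=1000)

print(min(ages), max(ages), tiers[:5])
```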
Classes (population segments)
Class("high_value", when=("income", ">=", 100000))
Class("churned", when=("days_inactive", ">", 90))
Class("sampled", when=("__random__", 0.2)) # random 20% of rows
Class("custom", when=lambda df: df["x"] > df["y"])
Override any feature parameter for rows that match a class:
Class("vip", when=("tier", "==", "gold")).override("spend", base=5000, std=800)
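One way to picture a `when` tuple and an override is as a per-row test that selects which parameters apply. The sketch below assumes the `(column, operator, value)` form shown above and uses the standard `operator` module; the `matches` helper and the default parameters are hypothetical, not Blueprint's API.

```python
import operator

# Map comparison strings to functions from the standard library.
OPS = {
    ">=": operator.ge, ">": operator.gt,
    "<=": operator.le, "<": operator.lt,
    "==": operator.eq, "!=": operator.ne,
}


def matches(row: dict, when: tuple) -> bool:
    """Evaluate a (column, op, value) condition against one row."""
    column, op, value = when
    return OPS[op](row[column], value)


rows = [
    {"tier": "gold", "income": 120000},
    {"tier": "bronze", "income": 40000},
]
vip_params = {"base": 5000, "std": 800}      # the override from the example above
default_params = {"base": 1000, "std": 200}  # hypothetical defaults for "spend"

for row in rows:
    params = vip_params if matches(row, ("tier", "==", "gold")) else default_params
    print(row["tier"], params)
```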
Influences (causal relationships)
Influence("sqft").on("price", effect="+155 per unit") # per-unit additive
Influence("has_pool").on("price", effect="+8%") # percentage
Influence("is_member").on("fee", effect="-20") # flat additive (boolean source)
Influence("region").on("price", by_class={             # class-conditional
    "urban": "+15%", "suburban": "+5%"
}, effect="+0%")
Influence("distance").on("price",                      # custom function
    fn=lambda src, tgt, df: tgt - src * 250)
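The three effect strings above can be read as a small grammar: a signed number followed by `per unit`, `%`, or nothing. The sketch below parses and applies them for a single row. It is an illustration only: the regular expression, the `apply_effect` name, and the rule that flat and percentage effects apply only when the source is truthy are all assumptions, not Blueprint's documented behavior.

```python
import re

# Assumed grammar: a signed number, then "%", "per unit", or nothing.
EFFECT_RE = re.compile(r"^([+-]?\d+(?:\.\d+)?)\s*(%|per unit)?$")


def apply_effect(effect: str, source, target: float) -> float:
    """Apply one effect string to a single target value."""
    match = EFFECT_RE.match(effect.strip())
    if match is None:
        raise ValueError(f"unrecognised effect: {effect!r}")
    amount, kind = float(match.group(1)), match.group(2)
    if kind == "per unit":
        return target + amount * source     # scales with the source value
    if not source:                          # assumed: no effect for falsy sources
        return target
    if kind == "%":
        return target * (1 + amount / 100)  # relative change
    return target + amount                  # flat change


print(apply_effect("+155 per unit", 1800, 0.0))  # 279000.0
print(apply_effect("-20", True, 100.0))          # 80.0
print(apply_effect("-20", False, 100.0))         # 100.0
```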
Add row-level noise to any numeric effect for more realistic variation:
Influence("sqft").on("price", effect="+155 per unit", noise_std=0.1)
# effective rate ~ N(155, 15.5) per row, fully reproducible
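The combination of per-row noise and a per-edge sub-seed can be sketched as follows. This is one plausible way to get that behavior, not necessarily how Blueprint does it: the sub-seed is assumed to come from hashing the global seed with the edge name, and `noise_std` is assumed to be relative to the rate, which matches the N(155, 15.5) comment above. `edge_seed` and `noisy_rates` are hypothetical names.

```python
import hashlib
import random


def edge_seed(seed: int, source: str, target: str) -> int:
    """Derive a stable per-edge seed from the global seed and the edge name."""
    digest = hashlib.sha256(f"{seed}:{source}->{target}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def noisy_rates(rate, noise_std, n, seed, source, target):
    """Draw one rate per row with mean rate and std abs(rate) * noise_std."""
    rng = random.Random(edge_seed(seed, source, target))
    return [rng.gauss(rate, abs(rate) * noise_std) for _ in range(n)]


first = noisy_rates(155, 0.1, 5, seed=42, source="sqft", target="price")
again = noisy_rates(155, 0.1, 5, seed=42, source="sqft", target="price")
other = noisy_rates(155, 0.1, 5, seed=42, source="price", target="tax")

print(first == again)  # same seed and edge give identical draws
print(first == other)  # a different edge gets its own noise stream
```

Because each edge has its own generator, adding or removing one influence does not shift the noise drawn for the others.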
Output formats
df = bp.emit() # pandas DataFrame
df = bp.emit(describe=True) # prints blueprint summary first
df = bp.emit(manifest="meta.json") # writes a JSON config sidecar
bp.to_csv("data.csv")
bp.to_json("data.json", manifest="meta.json")
Notebook guide
The docs/notebooks/ directory contains a step-by-step notebook series covering every aspect of the library:
| Notebook | Topic |
|---|---|
| 01 — Getting Started | Installation, minimal example, reproducibility |
| 02 — Features Deep Dive | All dtype options, modifiers, computed & derived columns |
| 03 — Classes | Population segments, condition types, presets |
| 04 — Influences | Effect strings, by_class, when=, fn=, noise_std, presets |
| 05 — The Dependency DAG | Topological sort, cycle detection, visualization |
| 06 — Assembly & Emission | Blueprint construction, validate, describe, emit, output formats |
Preset library
from blueprint.presets.classes import RandomClass, HighValueClass, LowValueClass, OutlierClass
from blueprint.presets.influences import ScalesWith, CorrelatedWith, Caps
bp.add_class(HighValueClass("rich", feature="income", top_pct=0.2))
bp.add_influence(CorrelatedWith("income", "spend", correlation=0.75))
bp.add_influence(Caps("experience", "salary", threshold=10, decay=0.05))
License
MIT — see LICENSE.