Synthetic dataset generation library

Project description

Blueprint

Blueprint is a pure-Python library for generating realistic synthetic datasets. Define features, population segments, and causal relationships between columns — then emit a reproducible pandas.DataFrame in one call.

from blueprint import Blueprint, Feature, Class, Influence

df = (
    Blueprint(n=1000, seed=42)
    .add_feature(
        Feature("sqft",  dtype=int,   base=1800, std=400,  clip=(500, 5000)),
        Feature("price", dtype=float, base=0,    std=0,    derived=True),
        Feature("tax",   dtype=float, base=0,    std=0,    derived=True),
    )
    .add_class(Class("luxury", when=("sqft", ">=", 2500)))
    .add_influence(
        Influence("sqft").on("price", effect="+155 per unit"),
        Influence("price").on("tax",  effect="+0.012 per unit"),
    )
    .emit()
)

Why Blueprint?

Most synthetic data tools generate columns independently. Blueprint lets you specify how one column affects another and preserves those relationships in the output:

- Features: numeric, boolean, categorical, datetime, text/template, computed, and derived columns
- Classes: named population segments that override feature parameters for a subset of rows
- Influences: causal edges (source → target) with rich effect types, optional row-level noise, class-conditional behavior, and gating conditions
- DAG: dependencies are topologically sorted so multi-hop chains always evaluate in the right order
- Reproducibility: every run with the same seed produces identical data; influence noise has its own deterministic sub-seed per edge
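
To make the DAG point concrete, here is a minimal sketch of the ordering step using the standard library's graphlib. It is not Blueprint's internal code: the `evaluation_order` helper and the edge-list representation are assumptions made for illustration. It only shows why a multi-hop chain such as sqft → price → tax has to be evaluated source-first, and how a cycle would surface.

```python
from graphlib import TopologicalSorter, CycleError


def evaluation_order(edges):
    """Return column names so that every influence source precedes its target.

    `edges` is a list of (source, target) pairs. Hypothetical helper,
    not part of Blueprint's API.
    """
    sorter = TopologicalSorter()
    for source, target in edges:
        # add(node, *predecessors): the target depends on its source.
        sorter.add(target, source)
    return list(sorter.static_order())


# Edges are given out of order on purpose; the sort recovers the chain.
order = evaluation_order([("price", "tax"), ("sqft", "price")])
print(order)  # ['sqft', 'price', 'tax']

try:
    evaluation_order([("a", "b"), ("b", "a")])
except CycleError:
    print("cycle detected")
```

Because each target is registered with its source as a predecessor, any valid order places sqft before price and price before tax, regardless of the order in which the edges were declared.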

Installation

pip install blueprint-synth

Requires Python 3.10+ and only depends on numpy and pandas.

import blueprint

Feature overview

Features

Feature("age",       dtype=int,        base=35,   std=10,  clip=(18, 80))
Feature("active",    dtype=bool,        p=0.7)
Feature("tier",      dtype="category",  values=["bronze", "silver", "gold"], weights=[5, 3, 1])
Feature("joined",    dtype="datetime",  start="2020-01-01", end="2024-12-31")
Feature("user_id",   dtype="id",        style="uuid")
Feature("score",     dtype="computed",  formula=lambda df: df["a"] * 2 + df["b"])
Feature("revenue",   dtype=float,       base=0, std=0, derived=True)  # accumulates from influences only

Modifiers chain onto any numeric feature: .trend(), .seasonality(), .noise(), .clip(), .round().
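
As a rough mental model for the numeric and categorical declarations above, the following NumPy sketch draws comparable columns by hand. It assumes that `base` and `std` parameterize a normal distribution and that `weights` are relative weights normalized to probabilities; the README does not state either, so treat this as an illustration rather than a description of Blueprint's sampler.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Roughly what Feature("age", dtype=int, base=35, std=10, clip=(18, 80))
# describes, ASSUMING a normal draw: sample, clip to the bounds, cast to int.
age = np.clip(rng.normal(loc=35, scale=10, size=n), 18, 80).round().astype(int)

# Roughly what weights=[5, 3, 1] describes, ASSUMING relative weights:
# normalize to probabilities 5/9, 3/9, 1/9 before sampling.
values = np.array(["bronze", "silver", "gold"])
weights = np.array([5, 3, 1], dtype=float)
probs = weights / weights.sum()
tier = rng.choice(values, size=n, p=probs)

print(age.min(), age.max())       # stays within the clip bounds
print((tier == "bronze").mean())  # close to 5/9
```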

Classes (population segments)

Class("high_value", when=("income", ">=", 100000))
Class("churned",    when=("days_inactive", ">", 90))
Class("sampled",    when=("__random__", 0.2))          # random 20% of rows
Class("custom",     when=lambda df: df["x"] > df["y"])

Override any feature parameter for rows that match a class:

Class("vip", when=("tier", "==", "gold")).override("spend", base=5000, std=800)
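
Conceptually, a `(column, op, value)` condition selects rows with a boolean mask, and an override redraws the feature for those rows only. The pandas sketch below shows that idea under stated assumptions: `OPS` and `class_mask` are hypothetical names, and redrawing from a normal distribution is a guess at what an override does, not Blueprint's documented behavior.

```python
import operator

import numpy as np
import pandas as pd

# Hypothetical mapping from the comparison strings used in `when=` tuples.
OPS = {
    ">=": operator.ge, ">": operator.gt,
    "<=": operator.le, "<": operator.lt,
    "==": operator.eq, "!=": operator.ne,
}


def class_mask(df, when):
    """Boolean mask for a (column, op, value) condition; illustrative only."""
    column, op, value = when
    return OPS[op](df[column], value)


rng = np.random.default_rng(0)
df = pd.DataFrame({
    "tier": rng.choice(["bronze", "silver", "gold"], size=500),
    "spend": rng.normal(1000, 200, size=500),
})

vip = class_mask(df, ("tier", "==", "gold"))
# Override: redraw `spend` for matching rows with class-specific parameters.
df.loc[vip, "spend"] = rng.normal(5000, 800, size=int(vip.sum()))

print(df.groupby("tier")["spend"].mean().round(0))
```

Rows outside the class keep their original draw, so the segment differs from the rest of the population only in the overridden feature.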

Influences (causal relationships)

Influence("sqft").on("price", effect="+155 per unit")   # per-unit additive
Influence("has_pool").on("price", effect="+8%")          # percentage
Influence("is_member").on("fee", effect="-20")           # flat additive (boolean source)
Influence("region").on("price", by_class={              # class-conditional
    "urban": "+15%", "suburban": "+5%"
}, effect="+0%")
Influence("distance").on("price",                        # custom function
    fn=lambda src, tgt, df: tgt - src * 250)
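
The arithmetic behind the first three effect strings can be worked by hand. This NumPy sketch assumes that a percentage effect scales the target's current value and that a boolean source applies its effect only where it is true; the README does not spell out these semantics, and the starting `fee` of 100 is made up for the example.

```python
import numpy as np

sqft = np.array([1200, 1800, 2600])
has_pool = np.array([False, True, True])
is_member = np.array([True, False, True])

# A derived column starts at 0 and accumulates effects.
price = np.zeros(sqft.size)
fee = np.full(sqft.size, 100.0)  # hypothetical starting fee

# "+155 per unit": add rate * source value to the target.
price = price + 155 * sqft
# "+8%": scale the target's current value where the source is true (assumed reading).
price = price * np.where(has_pool, 1.08, 1.0)
# "-20": flat additive shift where the boolean source is true.
fee = fee + np.where(is_member, -20.0, 0.0)

print(price)
print(fee)
```

Under this reading the order of application matters: the percentage is taken on the price after the per-unit contribution has been added, which is one reason a well-defined evaluation order is needed.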

Add row-level noise to any numeric effect for more realistic variation:

Influence("sqft").on("price", effect="+155 per unit", noise_std=0.1)
# effective rate ~ N(155, 15.5) per row, fully reproducible
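
One way to get per-row noise that is still reproducible is to derive a separate random generator for each edge from the global seed. The sketch below is an assumed scheme, not Blueprint's actual one: `edge_rng` is a hypothetical helper that hashes the edge name with hashlib (Python's built-in `hash()` for strings varies between processes) and feeds the result to NumPy's `SeedSequence`.

```python
import hashlib

import numpy as np


def edge_rng(seed, source, target):
    """Deterministic generator per (seed, edge). Hypothetical scheme."""
    digest = hashlib.sha256(f"{source}->{target}".encode()).digest()
    edge_key = int.from_bytes(digest[:4], "big")
    return np.random.default_rng(np.random.SeedSequence([seed, edge_key]))


sqft = np.array([1200.0, 1800.0, 2600.0])
rate, noise_std = 155.0, 0.1

rng = edge_rng(42, "sqft", "price")
# Per-row rate drawn with mean 155 and std noise_std * rate = 15.5.
row_rates = rng.normal(loc=rate, scale=noise_std * rate, size=sqft.size)
price = row_rates * sqft

# Same seed and same edge give identical draws.
again = edge_rng(42, "sqft", "price").normal(rate, noise_std * rate, sqft.size)
assert np.array_equal(row_rates, again)
```

Keying the generator on the edge means that adding or removing an unrelated influence does not shift the noise drawn for this one.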

Output formats

# bp is an already-configured Blueprint instance
df = bp.emit()                         # pandas DataFrame
df = bp.emit(describe=True)            # prints blueprint summary first
df = bp.emit(manifest="meta.json")     # writes a JSON config sidecar
bp.to_csv("data.csv")
bp.to_json("data.json", manifest="meta.json")

Notebook guide

The docs/notebooks/ directory contains a step-by-step notebook series covering every aspect of the library:

Notebook	Topic
01 — Getting Started	Installation, minimal example, reproducibility
02 — Features Deep Dive	All dtype options, modifiers, computed & derived columns
03 — Classes	Population segments, condition types, presets
04 — Influences	Effect strings, by_class, when=, fn=, noise_std, presets
05 — The Dependency DAG	Topological sort, cycle detection, visualization
06 — Assembly & Emission	Blueprint construction, validate, describe, emit, output formats

Preset library

from blueprint.presets.classes import RandomClass, HighValueClass, LowValueClass, OutlierClass
from blueprint.presets.influences import ScalesWith, CorrelatedWith, Caps

bp.add_class(HighValueClass("rich", feature="income", top_pct=0.2))
bp.add_influence(CorrelatedWith("income", "spend", correlation=0.75))
bp.add_influence(Caps("experience", "salary", threshold=10, decay=0.05))
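
For intuition about what a target correlation of 0.75 and a top 20% segment mean numerically, here is a plain NumPy sketch. It does not claim to reproduce how `CorrelatedWith` or `HighValueClass` are implemented; it uses the standard construction z_y = r * z_x + sqrt(1 - r^2) * e with independent standard normal noise e, and a quantile threshold for the segment. The income and spend scales are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000
r = 0.75

income = rng.normal(60_000, 15_000, size=n)

# Standardize the source, then mix it with independent noise so the
# expected Pearson correlation is r.
z_income = (income - income.mean()) / income.std()
z_spend = r * z_income + np.sqrt(1 - r**2) * rng.standard_normal(n)
spend = 2_000 + 500 * z_spend  # rescale to an arbitrary mean and spread

corr = np.corrcoef(income, spend)[0, 1]
print(round(corr, 2))  # close to 0.75, up to sampling error

# Top 20% of rows by income, expressed as a quantile threshold mask.
rich = income >= np.quantile(income, 0.8)
print(rich.mean())
```

Rescaling `z_spend` linearly does not change the correlation, so the target column can take any mean and spread while keeping the requested relationship to the source.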

License

MIT — see LICENSE.

Project details

This version

0.1.0

May 7, 2026

Download files

Source Distribution

blueprint_synth-0.1.0.tar.gz (26.9 kB)

Uploaded May 7, 2026 Source

Built Distribution

blueprint_synth-0.1.0-py3-none-any.whl (21.2 kB)

Uploaded May 7, 2026 Python 3

File details

Details for the file blueprint_synth-0.1.0.tar.gz.

File metadata

Download URL: blueprint_synth-0.1.0.tar.gz
Upload date: May 7, 2026
Size: 26.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for blueprint_synth-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`0f7d859889312fdd1b2602a10b6a74f70d2147dc2d41f12e693dc9c3a93d39be`
MD5	`60f0313d8748dc64deb54f9584741a4f`
BLAKE2b-256	`9cc0d8af6d8310b2ee661d2d083b34bf57686cd5b3e4a8205ba7923409ae748a`

Provenance

The following attestation bundles were made for blueprint_synth-0.1.0.tar.gz:

Publisher: pypi-publish.yml on dpforesi/blueprint-synth

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: blueprint_synth-0.1.0.tar.gz
- Subject digest: 0f7d859889312fdd1b2602a10b6a74f70d2147dc2d41f12e693dc9c3a93d39be
- Sigstore transparency entry: 1462842140
- Sigstore integration time: May 7, 2026
Source repository:
- Permalink: dpforesi/blueprint-synth@21fc442f70ba35d42c63af4cf242fc16c62ab111
- Branch / Tag: refs/heads/main
- Owner: https://github.com/dpforesi
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-publish.yml@21fc442f70ba35d42c63af4cf242fc16c62ab111
- Trigger Event: workflow_dispatch

File details

Details for the file blueprint_synth-0.1.0-py3-none-any.whl.

File metadata

Download URL: blueprint_synth-0.1.0-py3-none-any.whl
Upload date: May 7, 2026
Size: 21.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for blueprint_synth-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b31d438db34086fdfe0ea222175c6d3270907d78f711436e152515b7ccd2a4c5`
MD5	`1d1495025bc93859c5ccf3b066929742`
BLAKE2b-256	`a581b0ab4c8fd13695520171a1b00b8d3d7513e40279b228b9386a22568ac6f6`

Provenance

The following attestation bundles were made for blueprint_synth-0.1.0-py3-none-any.whl:

Publisher: pypi-publish.yml on dpforesi/blueprint-synth

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: blueprint_synth-0.1.0-py3-none-any.whl
- Subject digest: b31d438db34086fdfe0ea222175c6d3270907d78f711436e152515b7ccd2a4c5
- Sigstore transparency entry: 1462842264
- Sigstore integration time: May 7, 2026
Source repository:
- Permalink: dpforesi/blueprint-synth@21fc442f70ba35d42c63af4cf242fc16c62ab111
- Branch / Tag: refs/heads/main
- Owner: https://github.com/dpforesi
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-publish.yml@21fc442f70ba35d42c63af4cf242fc16c62ab111
- Trigger Event: workflow_dispatch
