Synthetic dataset generation library

Blueprint

Blueprint is a pure-Python library for generating realistic synthetic datasets. Define features, population segments, and causal relationships between columns — then emit a reproducible pandas.DataFrame in one call.

from blueprint import Blueprint, Feature, Class, Influence

df = (
    Blueprint(n=1000, seed=42)
    .add_feature(
        Feature("sqft",  dtype=int,   base=1800, std=400,  clip=(500, 5000)),
        Feature("price", dtype=float, base=0,    std=0,    derived=True),
        Feature("tax",   dtype=float, base=0,    std=0,    derived=True),
    )
    .add_class(Class("luxury", when=("sqft", ">=", 2500)))
    .add_influence(
        Influence("sqft").on("price", effect="+155 per unit"),
        Influence("price").on("tax",  effect="+0.012 per unit"),
    )
    .emit()
)
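
To make the chain above concrete, here is a minimal plain-Python sketch of the arithmetic it describes. It assumes a derived column starts at 0 and that each "per unit" influence adds rate × source. It does not use Blueprint and will not reproduce the library's random draws; the helper name `simulate_rows` is hypothetical.

```python
import random

def simulate_rows(n: int, seed: int) -> list[dict]:
    """Hypothetical stand-in for the sqft -> price -> tax chain (not Blueprint's code)."""
    rng = random.Random(seed)  # seeded generator, so repeated calls give the same rows
    rows = []
    for _ in range(n):
        # Draw sqft ~ N(1800, 400), clip to [500, 5000], then cast to int
        sqft = int(min(5000, max(500, rng.gauss(1800, 400))))
        price = 0.0 + 155 * sqft    # derived: starts at 0, "+155 per unit" of sqft
        tax = 0.0 + 0.012 * price   # derived: "+0.012 per unit" of price
        rows.append({"sqft": sqft, "price": price, "tax": tax,
                     "luxury": sqft >= 2500})  # class membership as a boolean flag
    return rows

rows = simulate_rows(n=1000, seed=42)
print(rows[0])
```

Because tax depends on price, which depends on sqft, the columns have to be computed in that order; the library handles this ordering for you.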

Why Blueprint?

Many synthetic data tools generate each column independently. Blueprint lets you declare how one column affects another and preserves those relationships in the output:

  • Features — numeric, boolean, categorical, datetime, text/template, computed, and derived columns
  • Classes — named population segments that override feature parameters for a subset of rows
  • Influences — causal edges (source → target) with rich effect types, optional row-level noise, class-conditional behavior, and gating conditions
  • DAG — dependencies are topologically sorted so multi-hop chains always evaluate in the right order
  • Reproducibility — every run with the same seed produces identical data; influence noise has its own deterministic sub-seed per edge
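
The DAG ordering can be illustrated with the standard library alone. The sketch below is an assumption about the general technique, not Blueprint's internal code: it feeds (source, target) influence edges to `graphlib.TopologicalSorter`, which also raises `CycleError` when the edges form a loop. The helper name `evaluation_order` is hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Influence edges as (source, target) pairs, deliberately listed out of order
edges = [("price", "tax"), ("sqft", "price")]

def evaluation_order(edges):
    """Return column names so that every source precedes its targets."""
    graph = {}
    for source, target in edges:
        graph.setdefault(target, set()).add(source)  # target depends on source
        graph.setdefault(source, set())              # make sure roots are present
    return list(TopologicalSorter(graph).static_order())

order = evaluation_order(edges)
print(order)  # ['sqft', 'price', 'tax']

try:
    evaluation_order([("a", "b"), ("b", "a")])
except CycleError:
    print("cycle detected")
```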

Installation

pip install blueprint-synth

Requires Python 3.10+. The only dependencies are numpy and pandas.

The distribution is named blueprint-synth, but the import name is blueprint:

import blueprint

Feature overview

Features

Feature("age",     dtype=int,        base=35, std=10, clip=(18, 80))
Feature("active",  dtype=bool,       p=0.7)
Feature("tier",    dtype="category", values=["bronze", "silver", "gold"], weights=[5, 3, 1])
Feature("joined",  dtype="datetime", start="2020-01-01", end="2024-12-31")
Feature("user_id", dtype="id",       style="uuid")
Feature("score",   dtype="computed", formula=lambda df: df["a"] * 2 + df["b"])
Feature("revenue", dtype=float,      base=0, std=0, derived=True)  # accumulates from influences only
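
As a rough illustration of what the non-numeric dtypes above describe, the following standard-library sketch draws a weighted category, a boolean with probability p, and a date in a range. It assumes the weights are relative (not normalized) and that dates are drawn uniformly by day; neither assumption is confirmed by the library, and none of this is Blueprint code.

```python
import random
from datetime import date, timedelta

rng = random.Random(42)

# Categorical: weights [5, 3, 1] are relative, so "bronze" is drawn about 5/9 of the time
tiers = rng.choices(["bronze", "silver", "gold"], weights=[5, 3, 1], k=9000)

# Boolean with p=0.7: True when a uniform draw falls below p
active = [rng.random() < 0.7 for _ in range(9000)]

# Datetime: uniform day offset between start and end, both inclusive
start, end = date(2020, 1, 1), date(2024, 12, 31)
joined = [start + timedelta(days=rng.randint(0, (end - start).days)) for _ in range(5)]

print(tiers.count("bronze") / len(tiers))
print(sum(active) / len(active))
print(joined)
```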

Modifiers chain onto any numeric feature: .trend(), .seasonality(), .noise(), .clip(), .round().

Classes (population segments)

Class("high_value", when=("income", ">=", 100000))
Class("churned",    when=("days_inactive", ">", 90))
Class("sampled",    when=("__random__", 0.2))          # random 20% of rows
Class("custom",     when=lambda df: df["x"] > df["y"])

Override any feature parameter for rows that match a class:

Class("vip", when=("tier", "==", "gold")).override("spend", base=5000, std=800)
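
The tuple conditions and the override can be pictured as a row mask plus a parameter swap. The sketch below is a plain-Python approximation under that assumption; the `matches` helper, the `OPS` table, and the default parameters for "spend" are hypothetical and not part of Blueprint.

```python
import operator
import random

# Map the comparison strings used in `when=` tuples to functions
OPS = {"==": operator.eq, "!=": operator.ne, ">": operator.gt,
       ">=": operator.ge, "<": operator.lt, "<=": operator.le}

def matches(row: dict, when: tuple) -> bool:
    """Evaluate a (column, op, value) condition against one row."""
    column, op, value = when
    return OPS[op](row[column], value)

rng = random.Random(0)
rows = [{"tier": t} for t in ["bronze", "gold", "silver", "gold"]]

default = {"base": 1000, "std": 200}   # hypothetical defaults for "spend"
override = {"base": 5000, "std": 800}  # values taken from the vip override

for row in rows:
    is_vip = matches(row, ("tier", "==", "gold"))
    params = override if is_vip else default
    row["vip"] = is_vip
    row["spend"] = rng.gauss(params["base"], params["std"])  # draw with the chosen parameters

print(rows)
```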

Influences (causal relationships)

Influence("sqft").on("price", effect="+155 per unit")   # per-unit additive
Influence("has_pool").on("price", effect="+8%")          # percentage
Influence("is_member").on("fee", effect="-20")           # flat additive (boolean source)
Influence("region").on("price", by_class={              # class-conditional
    "urban": "+15%", "suburban": "+5%"
}, effect="+0%")
Influence("distance").on("price",                        # custom function
    fn=lambda src, tgt, df: tgt - src * 250)
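
To show how the three effect-string forms differ, here is a small parser written for illustration only. The semantics are assumptions inferred from the comments above: "per unit" adds rate × source, a percentage scales the target when the source is truthy, and a bare number is added when the source is truthy. The function name `apply_effect` is hypothetical, and Blueprint's actual grammar may accept more forms.

```python
import re

NUMBER = r"[+-]?\d+(?:\.\d+)?"

def apply_effect(effect: str, source: float, target: float) -> float:
    """Apply one effect string to a single row (assumed semantics, see above)."""
    effect = effect.strip()
    if m := re.fullmatch(rf"({NUMBER})\s+per unit", effect):
        return target + float(m.group(1)) * source           # per-unit additive
    if m := re.fullmatch(rf"({NUMBER})%", effect):
        pct = float(m.group(1)) / 100
        return (target * (1 + pct)) if source else target    # percentage, gated on source
    if m := re.fullmatch(NUMBER, effect):
        return (target + float(m.group(0))) if source else target  # flat additive
    raise ValueError(f"unrecognized effect string: {effect!r}")

price = apply_effect("+155 per unit", source=2000, target=0.0)
price = apply_effect("+8%", source=True, target=price)
fee = apply_effect("-20", source=True, target=100.0)
print(price, fee)
```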

Add row-level noise to any numeric effect for more realistic variation:

Influence("sqft").on("price", effect="+155 per unit", noise_std=0.1)
# effective rate ~ N(155, 15.5) per row, fully reproducible
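
One way to get per-row noise that is reproducible per edge is to derive a sub-seed from the global seed and the edge name. The sketch below assumes that approach and uses a SHA-256 hash for the derivation; the helpers `edge_rng` and `noisy_rates` are hypothetical, and Blueprint's own seeding scheme may differ. With rate 155 and noise_std=0.1, the per-row rate is 155 × (1 + ε) with ε ~ N(0, 0.1), which is N(155, 15.5).

```python
import hashlib
import random
import statistics

def edge_rng(seed: int, source: str, target: str) -> random.Random:
    """Derive a generator that depends only on the global seed and the edge name."""
    digest = hashlib.sha256(f"{seed}:{source}->{target}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

def noisy_rates(rate: float, noise_std: float, n: int, rng: random.Random) -> list[float]:
    """Per-row rate = rate * (1 + eps), with eps ~ N(0, noise_std)."""
    return [rate * (1 + rng.gauss(0, noise_std)) for _ in range(n)]

rates = noisy_rates(155, 0.1, 10_000, edge_rng(42, "sqft", "price"))
print(statistics.mean(rates), statistics.stdev(rates))
```

Because each edge gets its own generator, adding or removing an unrelated influence does not shift the noise drawn for this one.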

Output formats

df = bp.emit()                         # pandas DataFrame
df = bp.emit(describe=True)            # prints blueprint summary first
df = bp.emit(manifest="meta.json")     # writes a JSON config sidecar
bp.to_csv("data.csv")
bp.to_json("data.json", manifest="meta.json")

Notebook guide

The docs/notebooks/ directory contains a step-by-step notebook series covering every aspect of the library:

Notebook                   Topic
01 - Getting Started       Installation, minimal example, reproducibility
02 - Features Deep Dive    All dtype options, modifiers, computed & derived columns
03 - Classes               Population segments, condition types, presets
04 - Influences            Effect strings, by_class, when=, fn=, noise_std, presets
05 - The Dependency DAG    Topological sort, cycle detection, visualization
06 - Assembly & Emission   Blueprint construction, validate, describe, emit, output formats

Preset library

from blueprint.presets.classes import RandomClass, HighValueClass, LowValueClass, OutlierClass
from blueprint.presets.influences import ScalesWith, CorrelatedWith, Caps
bp.add_class(HighValueClass("rich", feature="income", top_pct=0.2))
bp.add_influence(CorrelatedWith("income", "spend", correlation=0.75))
bp.add_influence(Caps("experience", "salary", threshold=10, decay=0.05))

License

MIT — see LICENSE.

Download files

Source Distribution

blueprint_synth-0.1.0.tar.gz (26.9 kB view details)

Uploaded Source

Built Distribution

blueprint_synth-0.1.0-py3-none-any.whl (21.2 kB view details)

Uploaded Python 3

File details

Details for the file blueprint_synth-0.1.0.tar.gz.

File metadata

  • Download URL: blueprint_synth-0.1.0.tar.gz
  • Upload date:
  • Size: 26.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for blueprint_synth-0.1.0.tar.gz
Algorithm     Hash digest
SHA256        0f7d859889312fdd1b2602a10b6a74f70d2147dc2d41f12e693dc9c3a93d39be
MD5           60f0313d8748dc64deb54f9584741a4f
BLAKE2b-256   9cc0d8af6d8310b2ee661d2d083b34bf57686cd5b3e4a8205ba7923409ae748a

Provenance

The following attestation bundles were made for blueprint_synth-0.1.0.tar.gz:

Publisher: pypi-publish.yml on dpforesi/blueprint-synth

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file blueprint_synth-0.1.0-py3-none-any.whl.

File hashes

Hashes for blueprint_synth-0.1.0-py3-none-any.whl
Algorithm     Hash digest
SHA256        b31d438db34086fdfe0ea222175c6d3270907d78f711436e152515b7ccd2a4c5
MD5           1d1495025bc93859c5ccf3b066929742
BLAKE2b-256   a581b0ab4c8fd13695520171a1b00b8d3d7513e40279b228b9386a22568ac6f6

Provenance

The following attestation bundles were made for blueprint_synth-0.1.0-py3-none-any.whl:

Publisher: pypi-publish.yml on dpforesi/blueprint-synth

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
