
Synthetic data generation for test fixtures and demos — single brain primitive made portable. Built on GIGI's DREAM primitive (https://davisgeometric.com).

gigi-dream

Synthetic data generation for test fixtures, dev environments, and privacy-aware demos. Statistically faithful records that aren't real records.

from gigi_dream import dream

real_customers = [
    {"age": 30, "country": "US", "salary": 75000},
    {"age": 45, "country": "CA", "salary": 95000},
    {"age": 28, "country": "US", "salary": 68000},
    # ... 100 more ...
]

result = dream(real_customers, n_samples=1000, temperature=1.0, seed=42)
print(result.records[0])
# {"age": 32.7, "country": "US", "salary": 73210.3}

$ gigi-dream customers.csv -n 1000 -o test_customers.csv
  source:      customers.csv
  output:      test_customers.csv
  backend:     local
  temperature: 1.0
  n_samples:   1000
  columns:     5

What it's for

Anywhere you need data that looks like your real data but isn't your real data:

  • Test fixtures — populate test databases with records that exercise edge cases
  • Dev environments — stop hand-rolling fake data; learn it from prod
  • Staging — anonymized demos with statistically faithful behavior
  • ML augmentation — extra training records sampled from the empirical density
  • Privacy-conscious onboarding — let new hires explore data shape without seeing real PII

gigi-dream is intentionally narrow: per-column distribution sampling, nothing else. Other "DREAM" features (multivariate, correlated, anisotropic, fiber-bundle native) live in the GIGI engine — gigi-dream exposes one specific brain primitive as the smallest possible installable tool.

Install

pip install gigi-dream

Optional: install with GIGI backend (requires requests):

pip install "gigi-dream[gigi]"

Optional: install with Parquet support (requires pandas + pyarrow):

pip install "gigi-dream[parquet]"

Quick start

Library

from gigi_dream import dream

# Learn the distribution from real data
real = [
    {"age": 30, "country": "US", "salary": 75000},
    {"age": 45, "country": "CA", "salary": 95000},
    {"age": 28, "country": "US", "salary": 68000},
    {"age": 51, "country": "UK", "salary": 110000},
    # ... more records ...
]

# Generate 1000 synthetic records at temperature 1.0 (faithful)
result = dream(real, n_samples=1000, temperature=1.0, seed=42)

# Inspect what was learned
for col in result.columns:
    if col.kind == "numeric":
        print(f"  {col.name}: numeric  mean={col.mean:.1f} sigma={col.sigma:.1f}")
    else:
        print(f"  {col.name}: categorical {len(col.values)} values")

# Use the synthetic records anywhere you'd use real ones
for r in result.records[:5]:
    print(r)

CLI

# Generate 1000 synthetic CSV records
gigi-dream customers.csv -n 1000 -o test_customers.csv

# Higher temperature = wider spread, more novel records
gigi-dream customers.csv -n 1000 -T 3.0 -o exotic_customers.csv

# Output to stdout for piping into other tools
gigi-dream customers.csv -n 100 | head

# Output JSON instead of CSV
gigi-dream customers.csv -n 100 --format json -o synth.json

# Reproducible — same seed gives same output
gigi-dream customers.csv -n 100 --seed 42 -o snapshot.csv

# Just inspect the column distributions, don't sample
gigi-dream customers.csv --inspect

Supported input formats: .csv, .json, .jsonl / .ndjson, .parquet (with [parquet] extra). Supported output formats: same.
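
When calling the library rather than the CLI, dream() takes a list of dicts, so a CSV file has to be turned into records first. The following is a minimal sketch using only the standard library. The helper names coerce and records_from_csv are hypothetical and are not part of gigi-dream; how the package's own loader handles types is not documented here.

```python
import csv
import io


def coerce(value):
    """Turn a CSV string into int or float when possible; otherwise keep the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


def records_from_csv(stream):
    """Hypothetical helper: read CSV rows into a list of dicts with numeric coercion."""
    return [{key: coerce(val) for key, val in row.items()} for row in csv.DictReader(stream)]


# An in-memory file stands in for open("customers.csv", newline="").
sample = io.StringIO("age,country,salary\n30,US,75000\n45,CA,95000.5\n")
records = records_from_csv(sample)
print(records)
```

The resulting list has the same shape as the real records in the Library example above, so it can be passed to dream() directly.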

Tuning

Parameter            Default   Effect
--num / -n           100       Number of synthetic records
--temperature / -T   1.0       1.0 = faithful; > 1.0 = wider; < 1.0 = tighter
--seed               none      Reproducibility

Temperature notes:

  • T = 1.0 — synthetic distribution matches the real one (approximately the same variance and range)
  • T = 2.0–4.0 — DREAM mode; ~1.4–2× wider spread; "novel-but-plausible"
  • T = 0.3–0.7 — synthesize tight samples near the mode; useful for "typical case" demos
  • T = 0 — every numeric sample equals the per-column mean (degenerate)
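
The spread figures above follow from the numeric sampling rule given under "How it works": the standard deviation scales with √T, so T = 2.0 gives about 1.41× and T = 4.0 gives 2×. The sketch below checks this numerically with the standard library only. The function spread_at and the values mu=50, sigma=10 are illustrative assumptions, not part of gigi-dream.

```python
import math
import random
import statistics


def spread_at(temperature, mu=50.0, sigma=10.0, n=20000, seed=0):
    """Sample mu + sqrt(T) * sigma * N(0, 1) and return the observed standard deviation."""
    rng = random.Random(seed)
    scale = math.sqrt(temperature) * sigma
    samples = [mu + scale * rng.gauss(0.0, 1.0) for _ in range(n)]
    return statistics.pstdev(samples)


for t in (0.0, 0.5, 1.0, 2.0, 4.0):
    expected = math.sqrt(t) * 10.0
    print(f"T={t}: observed std ~ {spread_at(t):.2f}, sqrt(T) * sigma = {expected:.2f}")
```

Because the seed is fixed, every temperature reuses the same underlying normal draws, which makes the ratio between two temperatures easy to compare.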

How it works (v0)

gigi-dream fits an independent per-column model to your input:

  • Numeric columns → diagonal Gaussian with Welford-streamed mean and variance. Sample: μ + √T × σ × N(0,1).
  • Categorical / string / boolean columns → empirical frequency distribution. Sample: weighted choice from observed values.
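
The two rules above can be sketched in a few lines of standard-library Python. This is an illustration of the technique, not gigi-dream's actual source: the names fit_numeric, sample_column, and sketch_dream are hypothetical, and the use of the n - 1 (sample) variance is an assumption.

```python
import math
import random
from collections import Counter


def fit_numeric(values):
    """Welford's one-pass algorithm; returns (mean, sigma)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    # Assumption: sample variance (n - 1); the package may divide by n instead.
    sigma = math.sqrt(m2 / (n - 1)) if n > 1 else 0.0
    return mean, sigma


def sample_column(values, n_samples, temperature, rng):
    """Sample one column on its own, ignoring every other column."""
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if is_numeric:
        mean, sigma = fit_numeric(values)
        scale = math.sqrt(temperature) * sigma
        return [mean + scale * rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    # Categorical: weighted choice over the observed value frequencies.
    counts = Counter(values)
    return rng.choices(list(counts), weights=list(counts.values()), k=n_samples)


def sketch_dream(records, n_samples, temperature=1.0, seed=None):
    """Hypothetical stand-in for dream(): independent per-column sampling."""
    rng = random.Random(seed)
    columns = {
        name: sample_column([r[name] for r in records], n_samples, temperature, rng)
        for name in records[0]
    }
    return [{name: col[i] for name, col in columns.items()} for i in range(n_samples)]


real = [
    {"age": 30, "country": "US", "salary": 75000},
    {"age": 45, "country": "CA", "salary": 95000},
    {"age": 28, "country": "US", "salary": 68000},
    {"age": 51, "country": "UK", "salary": 110000},
]
for row in sketch_dream(real, n_samples=3, temperature=1.0, seed=42):
    print(row)
```

Welford's update keeps only a running count, mean, and sum of squared deviations, so a column can be fitted in a single pass without holding all values in memory.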

Each column is sampled independently. Correlations between columns are NOT preserved in v0. If your data has strong inter-column structure (e.g., income correlates with age), use GigiBackend instead — GIGI's /brain/dream endpoint uses the engine's full Kähler-aware fit including the L13.3 diagonal-Gaussian variant of the brain primitives.
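
To see what losing correlations means in practice, the sketch below builds a toy dataset where salary depends on age, then resamples each column independently from its own Gaussian fit. The dataset, the coefficients, and the helper names pearson and fit are all made up for illustration and do not come from gigi-dream.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def fit(values):
    """Mean and sample standard deviation of one column."""
    n = len(values)
    mean = sum(values) / n
    sigma = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return mean, sigma


rng = random.Random(0)

# Toy "real" data: salary rises with age, plus noise (illustrative numbers only).
ages = [rng.uniform(20, 65) for _ in range(2000)]
salaries = [30000 + 1500 * a + rng.gauss(0, 5000) for a in ages]

# Independent per-column resampling, as in the v0 model described above.
(mu_a, sd_a), (mu_s, sd_s) = fit(ages), fit(salaries)
synth_ages = [mu_a + sd_a * rng.gauss(0, 1) for _ in range(2000)]
synth_salaries = [mu_s + sd_s * rng.gauss(0, 1) for _ in range(2000)]

real_r = pearson(ages, salaries)
synth_r = pearson(synth_ages, synth_salaries)
print(f"real correlation:      {real_r:.2f}")
print(f"synthetic correlation: {synth_r:.2f}")
```

The per-column means and spreads survive, but the synthetic correlation lands near zero, so a synthetic 25-year-old is as likely to get a high salary as a synthetic 60-year-old.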

Two backends

LocalBackend (default) — pure-numpy, no infrastructure required. Use this 99% of the time.

from gigi_dream import LocalBackend, dream
result = dream(real_records, backend=LocalBackend())

GigiBackend — calls a running GIGI instance's /brain/dream endpoint. Higher-fidelity sampling for anisotropic, correlated, or multivariate data. Useful when your data is already in a GIGI bundle.

from gigi_dream import GigiBackend, dream

backend = GigiBackend(
    url="http://localhost:3142",
    api_key="dev-local",
    bundle="customers",
    fields=["age", "salary"],
)
result = dream(n_samples=1000, backend=backend)

What gigi-dream isn't

  • Not a differential-privacy tool. It provides statistical faithfulness, not formal DP guarantees. If you need ε-differential privacy, use a DP-specific library (e.g., diffprivlib, tumult-analytics).
  • Not a relational data generator. Single tables only; no FK constraints, no schema relationships. (DHOOM supports nested bundles natively, so a future version could.)
  • Not a model-based synthesizer. No GANs, no diffusion. The "model" is the per-column Welford fit. That's intentional — small, fast, transparent.

License

MIT. Free for any use, commercial or otherwise. See LICENSE.

Related

  • GIGI — the fiber-bundle database engine; gigi-dream's GigiBackend calls it. DREAM is one of twelve brain primitives.
  • EpisodeKit — change-point detection using GIGI's EPISODIC primitive. Sibling project.
  • gigi-mind — VS Code extension exposing all twelve brain primitives. Sibling project.

Status

v0.1.0 — stable for the documented surface (CSV/JSON/JSONL + LocalBackend + CLI + GigiBackend skeleton). API may evolve in 0.x; will stabilize at 1.0.
