Synthetic data generation for test fixtures and demos — single brain primitive made portable. Built on GIGI's DREAM primitive (https://davisgeometric.com).
gigi-dream
Synthetic data generation for test fixtures, dev environments, and privacy-aware demos. Statistically faithful records that aren't real records.
from gigi_dream import dream
real_customers = [
    {"age": 30, "country": "US", "salary": 75000},
    {"age": 45, "country": "CA", "salary": 95000},
    {"age": 28, "country": "US", "salary": 68000},
    # ... 100 more ...
]
result = dream(real_customers, n_samples=1000, temperature=1.0, seed=42)
print(result.records[0])
# {"age": 32.7, "country": "US", "salary": 73210.3}
$ gigi-dream customers.csv -n 1000 -o test_customers.csv
source: customers.csv
output: test_customers.csv
backend: local
temperature: 1.0
n_samples: 1000
columns: 5
What it's for
Anywhere you need data that looks like your real data but isn't your real data:
- Test fixtures — populate test databases with records that exercise edge cases
- Dev environments — stop hand-rolling fake data; learn it from prod
- Staging — anonymized demos with statistically faithful behavior
- ML augmentation — extra training records sampled from the empirical density
- Privacy-conscious onboarding — let new hires explore data shape without seeing real PII
gigi-dream is intentionally narrow: per-column distribution sampling, nothing else. Other "DREAM" features (multivariate, correlated, anisotropic, fiber-bundle native) live in the GIGI engine — gigi-dream exposes one specific brain primitive as the smallest possible installable tool.
Install
pip install gigi-dream
Optional: install with GIGI backend (requires requests):
pip install "gigi-dream[gigi]"
Optional: install with Parquet support (requires pandas + pyarrow):
pip install "gigi-dream[parquet]"
Quick start
Library
from gigi_dream import dream
# Learn the distribution from real data
real = [
    {"age": 30, "country": "US", "salary": 75000},
    {"age": 45, "country": "CA", "salary": 95000},
    {"age": 28, "country": "US", "salary": 68000},
    {"age": 51, "country": "UK", "salary": 110000},
    # ... more records ...
]
# Generate 1000 synthetic records at temperature 1.0 (faithful)
result = dream(real, n_samples=1000, temperature=1.0, seed=42)
# Inspect what was learned
for col in result.columns:
    if col.kind == "numeric":
        print(f" {col.name}: numeric mean={col.mean:.1f} sigma={col.sigma:.1f}")
    else:
        print(f" {col.name}: categorical {len(col.values)} values")
# Use the synthetic records anywhere you'd use real ones
for r in result.records[:5]:
    print(r)
CLI
# Generate 1000 synthetic CSV records
gigi-dream customers.csv -n 1000 -o test_customers.csv
# Higher temperature = wider spread, more novel records
gigi-dream customers.csv -n 1000 -T 3.0 -o exotic_customers.csv
# Output to stdout for piping into other tools
gigi-dream customers.csv -n 100 | head
# Output JSON instead of CSV
gigi-dream customers.csv -n 100 --format json -o synth.json
# Reproducible — same seed gives same output
gigi-dream customers.csv -n 100 --seed 42 -o snapshot.csv
# Just inspect the column distributions, don't sample
gigi-dream customers.csv --inspect
Supported input formats: .csv, .json, .jsonl / .ndjson, .parquet (with [parquet] extra).
Supported output formats: same.
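When calling the library directly rather than the CLI, the input is a list of dicts, as in the Library example above. A minimal sketch of getting CSV content into that shape with the standard library follows. `load_records` and `_coerce` are hypothetical helpers, not part of gigi-dream, and the assumption that numeric-looking cells should become floats may differ from how the package's own CSV reader infers types.

```python
import csv
import io


def _coerce(cell):
    """Return a float for numeric-looking text, otherwise the cell unchanged."""
    try:
        return float(cell)
    except (TypeError, ValueError):
        return cell


def load_records(handle):
    """Read CSV rows from a file-like object into a list of dicts."""
    return [
        {name: _coerce(cell) for name, cell in row.items()}
        for row in csv.DictReader(handle)
    ]


# In-memory CSV so the example runs without a file on disk.
sample_csv = "age,country,salary\n30,US,75000\n45,CA,95000\n"
records = load_records(io.StringIO(sample_csv))
print(records[0])
```

With a real file, the same helper would be called as `load_records(open("customers.csv", newline=""))`, and the resulting list passed to `dream(...)`.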
Tuning
| Parameter | Default | Effect |
|---|---|---|
| --num / -n | 100 | Number of synthetic records |
| --temperature / -T | 1.0 | 1.0 = faithful; > 1.0 = wider; < 1.0 = tighter |
| --seed | none | Reproducibility |
Temperature notes:
- T = 1.0: synthetic distribution matches the real one (~variance, ~range)
- T = 2.0–4.0: DREAM mode; ~1.4–2× wider spread; "novel-but-plausible"
- T = 0.3–0.7: synthesize tight samples near the mode; useful for "typical case" demos
- T = 0: every sample equals the per-column mean (degenerate)
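The spread figures in these notes follow from the numeric sampling rule given under "How it works", where the noise term is multiplied by √T. The variance therefore scales with T and the standard deviation with √T. A quick check using only the standard library (`spread_factor` is an illustrative name, not a gigi-dream function):

```python
import math


def spread_factor(temperature):
    """Multiplier applied to a numeric column's standard deviation at temperature T."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    return math.sqrt(temperature)


for t in (0.0, 0.3, 0.7, 1.0, 2.0, 4.0):
    print(f"T={t:<4} std multiplier={spread_factor(t):.2f}")
```

At T = 2.0 the multiplier is about 1.41 and at T = 4.0 it is exactly 2, which matches the "~1.4–2× wider" range quoted for DREAM mode.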
How it works (v0)
gigi-dream fits an independent per-column model to your input:
- Numeric columns → diagonal Gaussian with Welford-streamed mean and variance. Sample: μ + √T × σ × N(0,1).
- Categorical / string / boolean columns → empirical frequency distribution. Sample: weighted choice from observed values.
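The two rules above can be sketched in plain Python. This is an illustration of the described technique, not gigi-dream's actual source: the function names (`fit_columns`, `sample_records`), the type-detection rule, and the choice of population variance are all assumptions made for the example.

```python
import math
import random
from collections import Counter


def fit_columns(records):
    """Fit one independent model per column: Welford mean/sigma or value counts."""
    models = {}
    for name in records[0]:
        values = [r[name] for r in records]
        numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values)
        if numeric:
            count, mean, m2 = 0, 0.0, 0.0
            for x in values:  # Welford's single-pass update
                count += 1
                delta = x - mean
                mean += delta / count
                m2 += delta * (x - mean)
            # Assumption: population variance (divide by n); the package may use n - 1.
            models[name] = ("numeric", mean, math.sqrt(m2 / count))
        else:
            models[name] = ("categorical", Counter(values))
    return models


def sample_records(models, n_samples, temperature=1.0, seed=None):
    """Draw each column independently: mu + sqrt(T) * sigma * N(0, 1), or weighted choice."""
    rng = random.Random(seed)
    scale = math.sqrt(temperature)
    out = []
    for _ in range(n_samples):
        row = {}
        for name, model in models.items():
            if model[0] == "numeric":
                _, mean, sigma = model
                row[name] = mean + scale * sigma * rng.gauss(0.0, 1.0)
            else:
                counts = model[1]
                row[name] = rng.choices(list(counts), weights=list(counts.values()))[0]
        out.append(row)
    return out


real = [
    {"age": 30, "country": "US", "salary": 75000},
    {"age": 45, "country": "CA", "salary": 95000},
    {"age": 28, "country": "US", "salary": 68000},
]
models = fit_columns(real)
synthetic = sample_records(models, n_samples=5, temperature=1.0, seed=42)
print(synthetic[0])
```

Because the random generator is seeded per call, repeating the call with the same seed returns the same rows, and a temperature of 0 collapses every numeric value to the column mean.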
Each column is sampled independently. Correlations between columns are NOT preserved in v0. If your data has strong inter-column structure (e.g., income correlates with age), use GigiBackend instead — GIGI's /brain/dream endpoint uses the engine's full Kähler-aware fit including the L13.3 diagonal-Gaussian variant of the brain primitives.
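To see what losing correlations means in practice, the sketch below builds a toy dataset in which salary tracks age, then resamples each column independently from a fitted Gaussian, as v0 is described as doing. It uses only the standard library. The toy data is made up for illustration, and `pearson` is a local helper rather than anything gigi-dream provides.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


rng = random.Random(0)

# Toy "real" data: salary rises with age, plus noise.
ages = [rng.uniform(22, 65) for _ in range(2000)]
salaries = [30000 + 1500 * a + rng.gauss(0, 5000) for a in ages]

# Independent per-column Gaussian resampling: each column sees only its own mean and sigma.
synth_ages = [rng.gauss(statistics.fmean(ages), statistics.pstdev(ages)) for _ in range(2000)]
synth_salaries = [
    rng.gauss(statistics.fmean(salaries), statistics.pstdev(salaries)) for _ in range(2000)
]

print(f"real correlation:      {pearson(ages, salaries):.2f}")
print(f"synthetic correlation: {pearson(synth_ages, synth_salaries):.2f}")
```

The real columns are strongly correlated while the resampled ones are close to uncorrelated, even though each column's mean and spread are preserved. A test that depends on a relationship between fields should not rely on per-column sampling alone.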
Two backends
LocalBackend (default) — pure-numpy, no infrastructure required. Use this 99% of the time.
from gigi_dream import LocalBackend, dream
result = dream(real_records, backend=LocalBackend())
GigiBackend — calls a running GIGI instance's /brain/dream endpoint. Higher-fidelity sampling for anisotropic, correlated, or multivariate data. Useful when your data is already in a GIGI bundle.
from gigi_dream import GigiBackend, dream
backend = GigiBackend(
    url="http://localhost:3142",
    api_key="dev-local",
    bundle="customers",
    fields=["age", "salary"],
)
result = dream(n_samples=1000, backend=backend)
What gigi-dream isn't
- Not a differential-privacy tool. It provides statistical faithfulness, not formal DP guarantees. If you need ε-differential privacy, use a DP-specific library (e.g., diffprivlib, tumult-analytics).
- Not a relational data generator. Single tables only; no FK constraints, no schema relationships. (DHOOM supports nested bundles natively, so a future version could.)
- Not a model-based synthesizer. No GANs, no diffusion. The "model" is the per-column Welford fit. That's intentional — small, fast, transparent.
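One concrete consequence of the first point: under the categorical rule described in "How it works", synthetic values are drawn from the values observed in the source, so a value that appears only once in the input can reappear verbatim in the output. A small pre-flight check along these lines may help spot such columns before sharing synthetic data. It assumes list-of-dict input, and `rare_values` is a hypothetical helper, not part of gigi-dream.

```python
from collections import Counter


def rare_values(records, column, max_count=1):
    """Return the values in `column` that occur at most `max_count` times."""
    counts = Counter(r[column] for r in records)
    return sorted(v for v, c in counts.items() if c <= max_count)


real = [
    {"age": 30, "country": "US"},
    {"age": 45, "country": "CA"},
    {"age": 28, "country": "US"},
    {"age": 51, "country": "UK"},
]
print(rare_values(real, "country"))
```

Columns that turn up many rare values (names, emails, free text) are candidates for dropping or masking before fitting.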
License
MIT. Free for any use, commercial or otherwise. See LICENSE.
Related
- GIGI: the fiber-bundle database engine; gigi-dream's GigiBackend calls it. DREAM is one of twelve brain primitives.
- EpisodeKit: change-point detection using GIGI's EPISODIC primitive. Sibling project.
- gigi-mind: VS Code extension exposing all twelve brain primitives. Sibling project.
Status
v0.1.0 — stable for the documented surface (CSV/JSON/JSONL + LocalBackend + CLI + GigiBackend skeleton). API may evolve in 0.x; will stabilize at 1.0.