Synthetic relational dataset generator with schema-driven chaos controls.
chaoslake
Generate synthetic relational datasets from a YAML schema in seconds.
chaoslake preserves foreign-key integrity across tables, supports statistical distributions, and can deliberately inject controlled data-quality issues (nulls, duplicates, drift, mixed date formats) — making it ideal for testing data pipelines, ML models, and analytics dashboards.
Install
pip install chaoslake
For the optional DuckDB backend and database introspection:
pip install "chaoslake[dev]"
Quick Start
1. Try it instantly — no schema needed
chaoslake quick
Generates a demo_output/ folder with a users table and a FK-linked transactions table straight away.
2. Create a schema interactively
chaoslake init
Prompts you to either answer a short questionnaire (table name, row count, columns, chaos settings) or start from a ready-made template. Either way, it writes chaoslake.yaml to the current directory.
3. Validate your schema
chaoslake check chaoslake.yaml
4. Generate the dataset
chaoslake generate --schema chaoslake.yaml --output ./output --format csv
Options:
| Flag | Default | Description |
|---|---|---|
| --schema | chaoslake.yaml | Path to YAML schema |
| --output | ./output | Output directory |
| --format | csv | csv or parquet |
| --seed | (none) | Integer seed for reproducible output |
| --engine | auto | polars, duckdb, or auto |
| --verbose | false | Show per-column progress logs |
Schema Reference
tables:
  - name: customers                  # lowercase identifier
    rows: 1000
    columns:
      - name: id
        type: integer
        primary_key: true            # required: exactly one per table
      - name: full_name
        type: name                   # pre-baked numpy name pool (~6M rows/sec)
      - name: email
        type: email
        unique: true
      - name: signup_date
        type: datetime
        range: [2020-01-01, 2024-12-31]
  - name: orders
    rows: 5000
    columns:
      - name: id
        type: integer
        primary_key: true
      - name: customer_id
        type: foreign_key
        references: customers.id     # always table.column
      - name: amount
        type: float
        range: [5, 5000]
        distribution: lognormal      # see distributions below
      - name: order_date
        type: datetime
        range: [2021-01-01, 2024-12-31]
    chaos:
      null_probability: 0.03         # 3% of non-PK/FK values become null
      duplicate_probability: 0.01    # 1% extra duplicate rows appended
      drift:                         # concept drift on a numeric column
        field: amount
        distribution: norm
        start_after_row: 4000
      format_inconsistency:          # randomly mix date formats
        fields: [order_date]
        style: mixed_us_dates
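To make the null_probability and duplicate_probability comments above concrete, here is a minimal Python sketch of what those two settings describe. It is not chaoslake's implementation: the inject_chaos function, its parameters, and the in-memory row format are all hypothetical, and it only follows the wording of the schema comments (nulls skip PK/FK columns, duplicates are appended as extra rows).

```python
import random


def inject_chaos(rows, protected, null_probability, duplicate_probability, seed=None):
    """Return a dirty copy of rows (hypothetical helper, not part of chaoslake)."""
    rng = random.Random(seed)
    dirty = []
    for row in rows:
        # Null out each value with the given probability, except in protected
        # (primary-key / foreign-key) columns, which are left untouched.
        dirty.append({
            key: None if key not in protected and rng.random() < null_probability else value
            for key, value in row.items()
        })
    # Append copies of randomly chosen rows as extra duplicates.
    n_extra = round(len(dirty) * duplicate_probability)
    extras = [dict(rng.choice(dirty)) for _ in range(n_extra)]
    return dirty + extras


orders = [{"id": i, "customer_id": i % 10, "amount": 10.0 + i} for i in range(1000)]
dirty = inject_chaos(
    orders,
    protected={"id", "customer_id"},
    null_probability=0.03,
    duplicate_probability=0.01,
    seed=42,
)
print(len(orders), "->", len(dirty))  # 1000 -> 1010
```

A pipeline under test should tolerate both effects: missing values in non-key columns and repeated primary keys in the appended rows.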
Column types
| Type | Description |
|---|---|
| integer | Random integers. Add primary_key: true for sequential IDs. |
| float | Uniform floats. Combine with distribution for realistic shapes. |
| string | Random 12-char alphanumeric strings. |
| name | Full person names sampled from a pre-baked pool. |
| email | firstname.lastname@domain.tld. Add unique: true for deduplication. |
| datetime | Timestamps within an optional range. |
| foreign_key | Samples from a parent table's primary key. Requires references. |
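Because foreign_key columns sample from the parent's primary key, every child value should exist in the parent table. The sketch below shows one way to check that on CSV output using only the standard library. The orphaned_keys helper is hypothetical, and the in-memory CSV text stands in for generated files whose actual names are not specified here.

```python
import csv
import io


def orphaned_keys(parent_csv, child_csv, parent_key, child_key):
    """Return child foreign-key values with no matching parent key (hypothetical helper)."""
    parent_ids = {row[parent_key] for row in csv.DictReader(parent_csv)}
    child_ids = {row[child_key] for row in csv.DictReader(child_csv)}
    return sorted(child_ids - parent_ids)


# Tiny in-memory stand-ins for a parent and a child table.
customers = io.StringIO("id,full_name\n1,Ada\n2,Grace\n")
orders = io.StringIO("id,customer_id,amount\n1,1,9.5\n2,2,20.0\n3,2,7.25\n")

print(orphaned_keys(customers, orders, "id", "customer_id"))  # []
```

With real output, pass open file handles instead of the StringIO objects. An empty list means no orphaned references were found.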
Distributions
Use distribution on float columns (or integer with drift):
| Value | Shape |
|---|---|
| lognormal / lognorm | Right-skewed; good for prices, revenue |
| normal / norm | Bell curve |
| uniform | Flat |
| expon | Exponential decay |
| poisson | Integer-valued counts |
| Any scipy.stats name | Fallback to scipy, e.g. beta, gamma |
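To see why a right-skewed shape suits prices, the following sketch compares uniform and lognormal draws with Python's standard library. It illustrates the shapes only: the mu and sigma values are arbitrary, and how chaoslake maps a column's range onto a distribution is not covered here.

```python
import random
import statistics

rng = random.Random(0)

# Flat: every value in the range is equally likely.
uniform = [rng.uniform(5, 5000) for _ in range(10_000)]

# Right-skewed: many small values and a long tail of large ones.
# mu and sigma are illustrative parameters of the underlying normal.
lognormal = [rng.lognormvariate(4.0, 1.0) for _ in range(10_000)]


def skew_ratio(values):
    """Mean divided by median: about 1 for symmetric data, above 1 for right skew."""
    return statistics.mean(values) / statistics.median(values)


print(f"uniform   mean/median: {skew_ratio(uniform):.2f}")
print(f"lognormal mean/median: {skew_ratio(lognormal):.2f}")
```

A mean well above the median is the signature of a few large values pulling the average up, which is typical of order amounts and revenue.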
All Commands
chaoslake --help
| Command | What it does |
|---|---|
| generate | Generate tables from a YAML schema |
| init | Create chaoslake.yaml (interactive or static template) |
| check | Validate a schema file without generating data |
| quick | One-command demo; no schema file needed |
| introspect | Reflect a live database and emit a Chaoslake YAML schema |
Introspect a database
chaoslake introspect --db sqlite:///mydb.db
chaoslake introspect --db postgresql://user:pass@localhost/mydb --tables customers,orders
Requires pip install "chaoslake[dev]" (SQLAlchemy).
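For intuition about what reflecting a database involves, here is a small sketch that reads table, primary-key, and foreign-key information from SQLite with the standard-library sqlite3 module. It is not how chaoslake introspects (the note above says that path uses SQLAlchemy). The reflect_tables function and the shape of its output are hypothetical, and SQLite's declared types are passed through without being translated to chaoslake column types.

```python
import sqlite3


def reflect_tables(conn):
    """Describe each user table as a dict of name and columns (hypothetical helper)."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    )]
    tables = []
    for table in names:
        # foreign_key_list rows: id, seq, parent table, from column, to column, ...
        fks = {
            row[3]: f"{row[2]}.{row[4]}"
            for row in conn.execute(f'PRAGMA foreign_key_list("{table}")')
        }
        columns = []
        # table_info rows: cid, name, declared type, notnull, default, pk
        for _cid, col, decl_type, _notnull, _default, pk in conn.execute(
            f'PRAGMA table_info("{table}")'
        ):
            if col in fks:
                columns.append({"name": col, "type": "foreign_key", "references": fks[col]})
                continue
            column = {"name": col, "type": decl_type.lower()}
            if pk:
                column["primary_key"] = True
            columns.append(column)
        tables.append({"name": table, "columns": columns})
    return tables


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    amount REAL
);
""")
schema = reflect_tables(conn)
for table in schema:
    print(table["name"], [c["name"] for c in table["columns"]])
```

The resulting structure mirrors the tables/columns layout of the schema reference, with foreign keys written as table.column.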
Reproducible Output
Pass --seed to get bit-identical datasets every time:
chaoslake generate --schema chaoslake.yaml --output ./output --seed 42
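One way to check the bit-identical claim yourself is to generate twice with the same seed into two directories and compare file hashes. The helper names below are hypothetical, and the directory names in the usage note are placeholders.

```python
import hashlib
from pathlib import Path


def directory_digest(directory):
    """Map each file's relative path to the SHA-256 digest of its bytes."""
    root = Path(directory)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def same_output(dir_a, dir_b):
    """True when both directories hold the same files with identical contents."""
    return directory_digest(dir_a) == directory_digest(dir_b)
```

For example, after running the generate command twice with --seed 42 and outputs ./run_a and ./run_b, same_output("./run_a", "./run_b") should return True if the runs are reproducible.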
Performance
Benchmarked on Apple M-series (1.1M rows, 3 columns):
| Engine | Rows / sec |
|---|---|
| Polars (default) | ~6 000 000 |
| DuckDB | not reported; auto-selected above 500k total rows |
Run the benchmark yourself:
chaoslake bench
Development Install
git clone https://github.com/gouravshokeen/chaoslake
cd chaoslake
python3 -m venv .venv
source .venv/bin/activate
pip install ".[dev]"
pytest tests/ -v
License
MIT © Gourav Shokeen
File details
Details for the file chaoslake-0.1.0-py3-none-any.whl.
File metadata
- Download URL: chaoslake-0.1.0-py3-none-any.whl
- Upload date:
- Size: 23.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9dfab30db2bd900b5ecea4b34b23cdf8b5e9eca921e853f2fb8000f3adb0f7e2 |
| MD5 | 64f1471fee2ea805d5e2903f3fcf072d |
| BLAKE2b-256 | cd279cc892e7ba0c6f55767e3dc1bcb72d617c6d64366ba8679a6c62ad05bf92 |