Synthetic relational dataset generator with schema-driven chaos controls.

chaoslake

Generate synthetic relational datasets from a YAML schema in seconds.

chaoslake preserves foreign-key integrity across tables, supports statistical distributions, and can deliberately inject controlled data-quality issues (nulls, duplicates, drift, mixed date formats) — making it ideal for testing data pipelines, ML models, and analytics dashboards.


Install

pip install chaoslake

For the optional DuckDB backend and database introspection:

pip install "chaoslake[dev]"

Quick Start

1. Try it instantly — no schema needed

chaoslake quick

Generates a demo_output/ folder with a users table and a FK-linked transactions table straight away.

2. Create a schema interactively

chaoslake init

Prompts you to either answer a short questionnaire (table name, row count, columns, chaos settings) or start from a ready-made template. Either way, it writes chaoslake.yaml to the current directory.

3. Validate your schema

chaoslake check chaoslake.yaml
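
The exact rules that chaoslake check enforces are not listed here. As a rough sketch of two rules stated in the Schema Reference comments (exactly one primary key per table, and references written as table.column), the following works on a schema that has already been parsed into a Python dict. The function name check_schema and the error messages are hypothetical and are not part of chaoslake.

```python
def check_schema(schema: dict) -> list[str]:
    """Return a list of problems; an empty list means both rules hold."""
    problems = []
    tables = {t["name"]: t for t in schema.get("tables", [])}
    for name, table in tables.items():
        columns = table.get("columns", [])
        # Rule 1: exactly one column per table is marked primary_key.
        pk_count = sum(1 for c in columns if c.get("primary_key"))
        if pk_count != 1:
            problems.append(f"{name}: expected exactly one primary key, found {pk_count}")
        for col in columns:
            if col.get("type") != "foreign_key":
                continue
            # Rule 2: references has the form table.column and points at a known column.
            parent, _, parent_col = col.get("references", "").partition(".")
            if not parent or not parent_col:
                problems.append(f"{name}.{col['name']}: references must be table.column")
            elif parent not in tables:
                problems.append(f"{name}.{col['name']}: unknown table {parent!r}")
            elif parent_col not in {c["name"] for c in tables[parent].get("columns", [])}:
                problems.append(f"{name}.{col['name']}: unknown column {parent}.{parent_col}")
    return problems


# Hand-written dict equivalents of small YAML schemas (illustrative data only).
good = {"tables": [
    {"name": "customers", "columns": [{"name": "id", "type": "integer", "primary_key": True}]},
    {"name": "orders", "columns": [
        {"name": "id", "type": "integer", "primary_key": True},
        {"name": "customer_id", "type": "foreign_key", "references": "customers.id"},
    ]},
]}
bad = {"tables": [
    {"name": "orders", "columns": [
        {"name": "customer_id", "type": "foreign_key", "references": "customers"},
    ]},
]}

good_problems = check_schema(good)
bad_problems = check_schema(bad)
print(good_problems)
print(bad_problems)
```
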

4. Generate the dataset

chaoslake generate --schema chaoslake.yaml --output ./output --format csv

Options:

Flag        Default          Description
--schema    chaoslake.yaml   Path to YAML schema
--output    ./output         Output directory
--format    csv              csv or parquet
--seed      (none)           Integer seed for reproducible output
--engine    auto             polars, duckdb, or auto
--verbose   false            Show per-column progress logs
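
When generation is driven from a test harness or CI script, it can help to build the command line in one place. The sketch below assembles an argument list from the flags in the table above. The helper name build_generate_cmd is hypothetical, and treating --verbose as a bare switch (rather than a flag that takes a value) is an assumption.

```python
import shlex


def build_generate_cmd(schema="chaoslake.yaml", output="./output", fmt="csv",
                       seed=None, engine="auto", verbose=False):
    """Build an argv list for `chaoslake generate` from the documented flags."""
    if fmt not in ("csv", "parquet"):
        raise ValueError(f"unsupported format: {fmt!r}")
    if engine not in ("polars", "duckdb", "auto"):
        raise ValueError(f"unsupported engine: {engine!r}")
    cmd = ["chaoslake", "generate",
           "--schema", schema,
           "--output", output,
           "--format", fmt,
           "--engine", engine]
    if seed is not None:
        cmd += ["--seed", str(int(seed))]   # the table documents an integer seed
    if verbose:
        cmd.append("--verbose")             # assumption: a switch with no value
    return cmd


cmd = build_generate_cmd(fmt="parquet", seed=42)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where chaoslake is installed.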

Schema Reference

tables:
  - name: customers          # lowercase identifier
    rows: 1000
    columns:
      - name: id
        type: integer
        primary_key: true    # required — exactly one per table

      - name: full_name
        type: name           # pre-baked numpy name pool (~6M rows/sec)

      - name: email
        type: email
        unique: true

      - name: signup_date
        type: datetime
        range: [2020-01-01, 2024-12-31]

  - name: orders
    rows: 5000
    columns:
      - name: id
        type: integer
        primary_key: true

      - name: customer_id
        type: foreign_key
        references: customers.id   # always table.column

      - name: amount
        type: float
        range: [5, 5000]
        distribution: lognormal    # see distributions below

      - name: order_date
        type: datetime
        range: [2021-01-01, 2024-12-31]

    chaos:
      null_probability: 0.03         # 3 % of non-PK/FK values become null
      duplicate_probability: 0.01    # 1 % extra duplicate rows appended
      drift:                         # concept drift on a numeric column
        field: amount
        distribution: norm
        start_after_row: 4000
      format_inconsistency:          # randomly mix date formats
        fields: [order_date]
        style: mixed_us_dates
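
To confirm that the chaos settings above took effect, one option is to measure the null rate per column and count repeated rows in the generated CSV. The sketch below runs on a small inline sample rather than real chaoslake output. It assumes that nulls are written as empty fields and that duplicate rows are exact copies, neither of which is confirmed here; chaos_report is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Illustrative sample, not real chaoslake output.
SAMPLE = """id,customer_id,amount,order_date
1,10,25.50,2021-03-01
2,11,,2021-03-02
3,10,99.00,
3,10,99.00,
"""


def chaos_report(csv_text, null_token=""):
    """Return (null rate per column, number of extra duplicate rows)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}, 0
    total = len(rows)
    # Fraction of rows whose value equals the null token, per column.
    null_rate = {
        col: sum(1 for r in rows if r[col] == null_token) / total
        for col in rows[0]
    }
    # Each row seen n times contributes n - 1 extra copies.
    counts = Counter(tuple(r.values()) for r in rows)
    duplicate_rows = sum(n - 1 for n in counts.values())
    return null_rate, duplicate_rows


null_rate, duplicate_rows = chaos_report(SAMPLE)
print(null_rate)
print(duplicate_rows)
```
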

Column types

Type          Description
integer       Random integers. Add primary_key: true for sequential IDs.
float         Uniform floats. Combine with distribution for realistic shapes.
string        Random 12-char alphanumeric strings.
name          Full person names sampled from a pre-baked pool.
email         Addresses of the form firstname.lastname@domain.tld. Add unique: true for deduplication.
datetime      Timestamps within an optional range.
foreign_key   Samples from a parent table's primary key. Requires references.
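
Because foreign_key values are sampled from the parent's primary key, every child value should appear in the parent table. A minimal way to verify this on exported CSV data is sketched below, using inline sample text with one deliberately broken row. The helper orphaned_keys and the sample data are illustrative and not part of chaoslake.

```python
import csv
import io

# Illustrative data; the last order points at a customer that does not exist.
CUSTOMERS = "id,full_name\n1,Ada Lovelace\n2,Alan Turing\n"
ORDERS = "id,customer_id,amount\n1,1,20.0\n2,2,35.5\n3,9,12.0\n"


def orphaned_keys(parent_csv, parent_col, child_csv, child_col):
    """Return child key values that have no matching parent key."""
    parent_keys = {row[parent_col] for row in csv.DictReader(io.StringIO(parent_csv))}
    return [
        row[child_col]
        for row in csv.DictReader(io.StringIO(child_csv))
        if row[child_col] not in parent_keys
    ]


orphans = orphaned_keys(CUSTOMERS, "id", ORDERS, "customer_id")
print(orphans)
```

For a dataset with intact foreign keys the returned list is empty.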

Distributions

Use distribution on float columns (or integer with drift):

Value                  Shape
lognormal / lognorm    Right-skewed; good for prices, revenue
normal / norm          Bell curve
uniform                Flat
expon                  Exponential decay
poisson                Integer-valued counts
Any scipy.stats name   Fallback to scipy, e.g. beta, gamma
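
To illustrate what "right-skewed" means in practice, the sketch below draws lognormal and uniform samples with Python's standard library and compares mean and median. It does not use chaoslake, and how chaoslake maps a range such as [5, 5000] onto a lognormal shape is not specified here; the parameters below are arbitrary.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the comparison is repeatable

# Lognormal with mu=0, sigma=1: a long right tail pulls the mean above the median.
lognormal = [rng.lognormvariate(0.0, 1.0) for _ in range(10_000)]
# Uniform over the same range used for amount in the schema example.
uniform = [rng.uniform(5, 5000) for _ in range(10_000)]


def mean_minus_median(values):
    """A positive gap indicates right skew; near zero indicates symmetry."""
    return statistics.mean(values) - statistics.median(values)


lognormal_gap = mean_minus_median(lognormal)
uniform_gap = mean_minus_median(uniform)
print(f"lognormal gap: {lognormal_gap:.3f}")
print(f"uniform gap:   {uniform_gap:.3f}")
```
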

All Commands

chaoslake --help

Command      What it does
generate     Generate tables from a YAML schema
init         Create chaoslake.yaml (interactive or static template)
check        Validate a schema file without generating data
quick        One-command demo; no schema file needed
introspect   Reflect a live database and emit a Chaoslake YAML schema

Introspect a database

chaoslake introspect --db sqlite:///mydb.db
chaoslake introspect --db postgresql://user:pass@localhost/mydb --tables customers,orders

Requires pip install "chaoslake[dev]" (SQLAlchemy).
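
To try introspection without a production database, a small SQLite file can be created with Python's built-in sqlite3 module. The table layout below mirrors the customers/orders example from the Schema Reference. The helper make_sample_db is hypothetical, and what chaoslake emits for this database is not shown here.

```python
import os
import sqlite3
import tempfile


def make_sample_db(path):
    """Create a two-table SQLite database with a foreign key; return its table names."""
    con = sqlite3.connect(path)
    con.executescript("""
        CREATE TABLE IF NOT EXISTS customers (
            id        INTEGER PRIMARY KEY,
            full_name TEXT,
            email     TEXT UNIQUE
        );
        CREATE TABLE IF NOT EXISTS orders (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(id),
            amount      REAL,
            order_date  TEXT
        );
    """)
    con.commit()
    tables = [row[0] for row in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    con.close()
    return tables


# Demo in a temporary directory; use a real path such as "mydb.db" to keep the file.
with tempfile.TemporaryDirectory() as tmp:
    tables = make_sample_db(os.path.join(tmp, "mydb.db"))
print(tables)
```

With the file saved as mydb.db in the current directory, the first introspect command above points at it.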


Reproducible Output

Pass --seed to get bit-identical datasets every time:

chaoslake generate --schema chaoslake.yaml --output ./output --seed 42
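
One way to check the bit-identical claim is to generate twice with the same seed into two directories and compare a digest of each. The sketch below hashes every file under a directory in a stable order. The helper dir_digest is hypothetical, and the demo uses hand-written files rather than real chaoslake output.

```python
import hashlib
import tempfile
from pathlib import Path


def dir_digest(root):
    """SHA-256 over relative paths and contents of all files under root."""
    root = Path(root)
    h = hashlib.sha256()
    # Sort so the digest does not depend on filesystem listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        h.update(path.relative_to(root).as_posix().encode())
        h.update(b"\0")
        h.update(path.read_bytes())
    return h.hexdigest()


# Demo: identical directories hash the same; changing one byte changes the digest.
with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    for d in (a, b):
        Path(d, "customers.csv").write_text("id,full_name\n1,Ada\n")
    same = dir_digest(a) == dir_digest(b)
    Path(b, "customers.csv").write_text("id,full_name\n1,Bob\n")
    different = dir_digest(a) != dir_digest(b)

print(same, different)
```

In practice, the two directories would be the --output targets of two runs that share the same --seed.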

Performance

Benchmarked on Apple M-series (1.1M rows, 3 columns):

Engine             Rows / sec
Polars (default)   ~6 000 000
DuckDB             auto-selected above 500k total rows

Run the benchmark yourself:

chaoslake bench

License

MIT © Gourav Shokeen
