
ddl2data

Turn any SQL schema into realistic test data — instantly.

  • PyPI package: ddl2data
  • CLI command: ddl2data
  • Python module: ddl2data

ddl2data turns SQL DDL, a live relational schema, or a live DynamoDB table schema into synthetic but structurally valid data for load tests, end-to-end pipeline checks, local development, and staging-environment seeding.

For relational schemas it parses tables and foreign keys, generates rows in parent-before-child order, applies type-aware defaults plus optional distributions, validates the generated data, and writes it out as SQL, JSON, CSV, or Parquet. It can also insert generated rows directly into a relational database through SQLAlchemy. For DynamoDB it can load key and index metadata from a real table and emit typed DynamoDB JSON payloads.


Why use it

Use ddl2data when you need to:

  • stress a pipeline or service with large synthetic datasets
  • generate relational fixtures from existing DDL
  • seed staging or local environments while preserving foreign-key order
  • inspect a live schema and generate data without hand-writing metadata
  • generate DynamoDB-shaped typed JSON from a real table definition
  • export the same dataset shape in SQL, JSON, CSV, Parquet, or DynamoDB JSON
  • produce validation reports before loading data elsewhere

How it works

Schema input
  -> parse DDL, inspect a live relational DB, or inspect a DynamoDB table
  -> build metadata and dependency order
  -> generate rows with Faker and optional distributions
  -> validate FK, null, unique, and optional CHECK constraints
  -> write SQL/JSON/CSV/Parquet/DynamoDB JSON or insert into a DB

Core modules:

  • ddl2data/parser/ddl.py: SQL DDL parsing via sqlglot
  • ddl2data/parser/introspect.py: relational schema introspection via SQLAlchemy
  • ddl2data/parser/dynamodb.py: DynamoDB schema loading via boto3
  • ddl2data/generator/base.py: main row generation engine
  • ddl2data/generator/dist.py: distribution parsing and sampling
  • ddl2data/writer/: SQL, JSON, CSV, Parquet, and DynamoDB JSON writers
  • ddl2data/validation.py and ddl2data/report.py: validation and reporting

Core capabilities

  • Input sources:
    • SQL DDL via --ddl
    • live relational schema introspection via --schema-from-db --db-url ...
    • live DynamoDB table schema via --schema-from-dynamodb --dynamodb-table ...
  • Relationship-aware generation:
    • foreign-key dependency graph
    • parent tables generated before child tables
    • FK-only tables can use the Polars generation path
  • Output formats:
    • PostgreSQL INSERT
    • MySQL INSERT
    • SQLite INSERT
    • BigQuery INSERT and INSERT ALL
    • JSON
    • CSV
    • Parquet
    • DynamoDB typed JSON
  • Validation and reporting:
    • FK integrity checks
    • non-null checks
    • unique collision checks
    • optional CHECK validation with --strict-checks
    • JSON report output via --report-path
  • Generation controls:
    • per-table row overrides with --table-rows
    • reproducible runs with --seed
    • config-file driven runs with JSON, TOML, or YAML
    • per-column and per-table distribution overrides with --dist
    • optional python or polars engine selection

Install

Recommended for CLI use:

pipx install ddl2data

Or install with pip:

pip install ddl2data

Optional Polars support:

pip install "ddl2data[polars]"

DynamoDB schema loading requires boto3:

pip install boto3

From source:

git clone https://github.com/hyangminj/ddl2data.git
cd ddl2data
python3 -m venv .venv
. .venv/bin/activate
.venv/bin/python -m pip install -e .

Contributor setup with test and Polars extras:

.venv/bin/python -m pip install -e ".[test,polars]"

Sanity check:

ddl2data --help

GitHub Releases also include wheel (.whl) and source (.tar.gz) artifacts.


Quick start

Minimal schema:

CREATE TABLE users (
  id INT PRIMARY KEY,
  email VARCHAR(100) UNIQUE NOT NULL,
  age INT,
  tier VARCHAR(10)
);

CREATE TABLE orders (
  id INT PRIMARY KEY,
  user_id INT NOT NULL,
  amount NUMERIC,
  FOREIGN KEY (user_id) REFERENCES users(id)
);

Generate JSON:

ddl2data --ddl schema.sql --rows 100 --out json --output-path data.json

Generate PostgreSQL inserts:

ddl2data --ddl schema.sql --rows 100 --out postgres --output-path seed.sql

Generate CSV files, one per table:

ddl2data --ddl schema.sql --rows 100 --out csv --output-path ./csv_out

With this schema, users is generated first and orders.user_id is filled from generated users.id values.
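
For illustration, the seed.sql from the PostgreSQL example contains INSERT statements roughly like these; the values are synthetic and the exact statement layout may differ:

INSERT INTO users (id, email, age, tier) VALUES (1, 'amy.smith@example.com', 34, 'A');
INSERT INTO users (id, email, age, tier) VALUES (2, 'bob.lee@example.com', 27, 'B');
INSERT INTO orders (id, user_id, amount) VALUES (1, 2, 48.15);

Every orders.user_id value points at a generated users.id, so the output loads cleanly into a database that enforces the foreign key.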


Common workflows

Generate SQL for different targets

ddl2data --ddl schema.sql --rows 100 --out postgres
ddl2data --ddl schema.sql --rows 100 --out mysql
ddl2data --ddl schema.sql --rows 100 --out sqlite
ddl2data --ddl schema.sql --rows 100 --out bigquery

Read schema directly from a relational database

ddl2data \
  --schema-from-db \
  --db-url postgresql+psycopg://user:pass@localhost:5432/mydb \
  --rows 100 \
  --out postgres

Limit introspection to selected tables:

ddl2data \
  --schema-from-db \
  --db-url postgresql+psycopg://user:pass@localhost:5432/mydb \
  --tables users,orders,events \
  --rows 100 \
  --out json

Insert generated rows directly into a relational database

ddl2data \
  --ddl schema.sql \
  --rows 1000 \
  --insert \
  --db-url postgresql+psycopg://user:pass@localhost:5432/mydb
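
A quick way to sanity-check the insert, assuming you can reach the same database with psql (the URL below is the plain libpq form of the SQLAlchemy URL above):

psql "postgresql://user:pass@localhost:5432/mydb" -c "SELECT count(*) FROM users;"
psql "postgresql://user:pass@localhost:5432/mydb" -c "SELECT count(*) FROM orders;"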

Generate DynamoDB typed JSON from a live table schema

ddl2data \
  --schema-from-dynamodb \
  --dynamodb-table users \
  --dynamodb-region us-east-1 \
  --dynamodb-extra-attr email:string \
  --dynamodb-extra-attr score:int \
  --rows 100 \
  --out dynamodb-json \
  --output-path users.jsonl

Notes:

  • key attributes and GSI or LSI key attributes are inferred from the live table
  • use --dynamodb-extra-attr name:type to add non-key fields to generated output
  • supported extra attribute aliases include string, uuid, date, datetime, numeric, float, int, and boolean
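
DynamoDB typed JSON wraps each value in a type descriptor. A generated item for the command above looks roughly like this, one item per line in users.jsonl (attribute names depend on your table's key schema; the exact layout may differ):

{"id": {"S": "3f1c2d9e-8a47-4f2e-9b0d-1c6a5e7f2d10"}, "email": {"S": "amy.smith@example.com"}, "score": {"N": "42"}}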

Control row counts per table

ddl2data \
  --ddl schema.sql \
  --rows 100 \
  --table-rows users=20,orders=500 \
  --table-rows events=2000 \
  --out json

  • --rows stays the global default
  • --table-rows table=count overrides individual tables
  • config files can use a table_rows map

Generate a validation report

ddl2data \
  --ddl schema.sql \
  --rows 500 \
  --strict-checks \
  --report-path report.json \
  --out json \
  --output-path data.json

The report includes counts and sample issues for:

  • FK violations
  • non-null violations
  • unique collisions
  • supported CHECK violations when --strict-checks is enabled

Use Parquet or the Polars engine

Parquet output:

ddl2data \
  --ddl schema.sql \
  --rows 100 \
  --out parquet \
  --parquet-compression zstd \
  --output-path ./parquet_out

Polars generation and write path:

ddl2data \
  --ddl schema.sql \
  --rows 100000 \
  --out csv \
  --engine polars \
  --output-path ./csv_out

Engine behavior:

  • tables with CHECK constraints fall back to the Python row-wise generator
  • tables with unique or primary-key constraints fall back to the Python row-wise generator
  • FK-only tables can remain on the Polars path
  • --out parquet requires the optional polars dependency regardless of --engine
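
Because --out parquet needs the optional polars dependency even with the default engine, it can be worth confirming the import works before a long run (a generic Python check, not part of ddl2data):

python -c "import polars as pl; print(pl.__version__)"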

BigQuery-specific output

ddl2data --ddl schema.sql --rows 100 --out bigquery --output-path out.sql
ddl2data --ddl schema.sql --rows 100 --out bigquery --bq-insert-all --output-path out_insert_all.sql

BigQuery output supports:

  • dataset-qualified table names like dataset.table
  • INSERT ALL rendering via --bq-insert-all
  • typed DATE and TIMESTAMP literals

Config file example

ddl2data --config ddl2data.toml

Example ddl2data.toml:

ddl = "schema.sql"
rows = 500
out = "postgres"
seed = 42
strict_checks = true
table_rows = { users = 100, orders = 500, events = 2000 }
dist = [
  "users.age:normal,mean=33,std=7",
  "orders.amount:pareto,alpha=1.7,xm=1",
]
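
YAML configs are also accepted; assuming the keys mirror the TOML names above, the same settings in a ddl2data.yaml look like this:

ddl: schema.sql
rows: 500
out: postgres
seed: 42
strict_checks: true
table_rows:
  users: 100
  orders: 500
  events: 2000
dist:
  - "users.age:normal,mean=33,std=7"
  - "orders.amount:pareto,alpha=1.7,xm=1"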

CLI arguments override config file values.


Distribution overrides

Syntax:

--dist <column_or_table.column>:<kind>,k=v,k=v

Supported distribution kinds:

  • normal
  • poisson
  • weighted
  • exponential
  • pareto
  • zipf
  • peak

Examples:

# global column rule
--dist age:normal,mean=35,std=7

# table-qualified rule (higher priority than global)
--dist users.age:normal,mean=30,std=6

# poisson counts
--dist daily_orders:poisson,lambda=3

# weighted categorical values
--dist tier:weighted,A=60%,B=30%,C=10%

# heavy-tail behavior
--dist amount:pareto,alpha=1.5,xm=1
--dist category_rank:zipf,skew=1.8,n=200

# peak-hour timestamps
--dist created_at:peak,hours=9-11,18-20

Priority when both exist:

  1. table.column
  2. column
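
Both kinds of rule can be combined in one run; with the pair below, users.age follows the table-qualified rule while age columns in any other table follow the global one:

ddl2data \
  --ddl schema.sql \
  --rows 1000 \
  --dist age:normal,mean=35,std=7 \
  --dist users.age:normal,mean=30,std=6 \
  --out json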

CHECK-aware generation

ddl2data uses practical heuristics for common CHECK constraints during generation and can optionally validate them again with --strict-checks.

Supported forms include:

  • numeric comparison: age >= 18, qty > 0, score <= 100, x != 0
  • ranges: price BETWEEN 10 AND 20
  • enum lists: status IN ('A', 'B', 'C')
  • regex-like checks: code ~ '^[A-Z]{2}$', REGEXP_LIKE(code, '^[0-9]+$')
  • nested compound forms built from supported expressions with AND and OR

This is intentionally best-effort, not a full SQL expression engine.
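
For example, a table like the following exercises several of the supported forms; generation uses the heuristics above to try to satisfy them, and --strict-checks reports any rows that slip through:

CREATE TABLE products (
  id INT PRIMARY KEY,
  code VARCHAR(2) CHECK (code ~ '^[A-Z]{2}$'),
  price NUMERIC CHECK (price BETWEEN 10 AND 20),
  qty INT CHECK (qty > 0),
  status VARCHAR(1) CHECK (status IN ('A', 'B', 'C'))
);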


Reproducible runs

Use --seed to stabilize Python random and Faker output:

ddl2data --ddl schema.sql --rows 100 --seed 42 --out json
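
Two runs with the same seed, schema, and options should produce identical output, which is easy to confirm with a diff:

ddl2data --ddl schema.sql --rows 100 --seed 42 --out json --output-path run1.json
ddl2data --ddl schema.sql --rows 100 --seed 42 --out json --output-path run2.json
diff run1.json run2.json   # no output means the two runs match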

Development and testing

Local setup

python3 -m venv .venv
. .venv/bin/activate
.venv/bin/python -m pip install -e ".[test,polars]"

If you only need the core package:

.venv/bin/python -m pip install -e .

If you need DynamoDB schema loading outside the test extra:

.venv/bin/python -m pip install boto3

Useful test commands

Run the full suite:

.venv/bin/python -m pytest

Run only non-integration tests:

.venv/bin/python -m pytest -m "not integration"

Run one file:

.venv/bin/python -m pytest tests/test_parser_graph.py

Run one integration target:

.venv/bin/python -m pytest tests/test_integration_postgres.py
.venv/bin/python -m pytest tests/test_integration_dynamodb.py
.venv/bin/python -m pytest tests/test_integration_bigquery.py

Available markers:

  • integration
  • postgres
  • dynamodb
  • bigquery
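
Markers can be combined with pytest's -m expressions, for example:

.venv/bin/python -m pytest -m postgres
.venv/bin/python -m pytest -m "integration and not bigquery"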

Local integration services

The repo includes docker-compose.yml for PostgreSQL and LocalStack DynamoDB:

docker compose up -d postgres localstack
docker compose ps

Service summary:

  • PostgreSQL: localhost:5432
  • LocalStack DynamoDB: http://localhost:4566

Suggested .env.test:

TEST_POSTGRES_URL=postgresql+psycopg2://testuser:testpass@localhost:5432/testdb
AWS_ACCESS_KEY_ID=test
AWS_SECRET_ACCESS_KEY=test
AWS_DEFAULT_REGION=us-east-1
DYNAMODB_ENDPOINT_URL=http://localhost:4566
TEST_BQ_PROJECT=your-gcp-project-id
TEST_BQ_DATASET=ddl2data_integration_test
# GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json

.env.test is ignored by version control, so it is the right place for local, test-only credentials.
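
If your test runner does not pick up .env.test automatically, one way to export its variables into the current shell before running the integration tests (a generic POSIX-shell approach, not specific to ddl2data):

set -a         # auto-export every variable assigned while sourcing
. .env.test
set +a
.venv/bin/python -m pytest -m integration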

BigQuery authentication

Two common options work for the integration tests:

Application Default Credentials:

gcloud auth application-default login

Service-account key file:

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json

If TEST_BQ_PROJECT is missing or credentials are unavailable, the BigQuery integration tests skip automatically.


CLI reference

ddl2data [--ddl schema.sql
        | --schema-from-db --db-url URL [--tables t1,t2]
        | --schema-from-dynamodb --dynamodb-table NAME]
        [--dynamodb-region REGION]
        [--dynamodb-extra-attr name:type]
        [--config config.toml]
        [--rows 100]
        [--table-rows users=20,orders=500] [--table-rows events=2000]
        [--out postgres|mysql|sqlite|bigquery|json|csv|parquet|dynamodb-json]
        [--engine python|polars]
        [--bq-insert-all]
        [--output-path PATH]
        [--insert --db-url URL]
        [--dist ...] [--dist ...]
        [--seed INT]
        [--report-path report.json]
        [--strict-checks]
        [--parquet-compression snappy|zstd|lz4|gzip|none]

Flag summary:

  • --config: load defaults from JSON, TOML, YAML, or YML
  • --ddl: input DDL file
  • --schema-from-db: inspect relational table metadata from a live database
  • --tables: optional comma-separated table filter for relational introspection mode
  • --schema-from-dynamodb: inspect a live DynamoDB table definition
  • --dynamodb-table: DynamoDB table name for --schema-from-dynamodb
  • --dynamodb-region: AWS region for DynamoDB schema loading
  • --dynamodb-extra-attr: add synthetic non-key DynamoDB attributes, repeatable
  • --rows: global default rows per table
  • --table-rows: per-table row overrides
  • --out: output format, default postgres
  • --engine: python or polars
  • --bq-insert-all: render BigQuery output in INSERT ALL ... SELECT 1; form instead of plain INSERT statements
  • --output-path: output file path or output directory depending on format
  • --db-url: SQLAlchemy database URL for introspection or direct insert
  • --insert: insert generated rows directly into the target relational database
  • --dist: distribution overrides
  • --seed: deterministic generation seed
  • --report-path: write a JSON report with validation summary
  • --strict-checks: validate supported CHECK constraints after generation
  • --parquet-compression: Parquet compression codec

Limitations

  • DDL parsing and relational DB introspection cover common schemas well, but advanced dialect-specific features are still partial.
  • DynamoDB schema loading models key and index attributes plus explicitly declared extra attributes; it does not infer full document structure from item samples.
  • CHECK-aware generation and --strict-checks cover a practical subset, not arbitrary SQL expressions.
  • Function-heavy or cross-column CHECK expressions are best-effort rather than fully modeled.
  • --engine polars still falls back to Python for tables with CHECK, unique, or primary-key constraints.

Future work

  • More distribution types for realistic synthetic workloads
  • Additional pipeline edge-case generation such as skew, timestamp boundaries, and duplicate-heavy batches
  • Broader dialect coverage and richer schema inference for advanced database features

License

Apache License 2.0. See LICENSE.
