ddl2data
Turn any SQL schema into realistic test data — instantly.
- PyPI package: ddl2data
- CLI command: ddl2data
- Python module: ddl2data
ddl2data turns SQL DDL, a live relational schema, or a live DynamoDB table schema into fake but structured data that is useful for load tests, end-to-end pipeline checks, local development, and staging environment seeding.
For relational schemas it parses tables and foreign keys, generates rows in parent-before-child order, applies type-aware defaults plus optional distributions, validates the generated data, and writes it out as SQL, JSON, CSV, or Parquet. It can also insert generated rows directly into a relational database through SQLAlchemy. For DynamoDB it can load key and index metadata from a real table and emit typed DynamoDB JSON payloads.
Why use it
Use ddl2data when you need to:
- stress a pipeline or service with large synthetic datasets
- generate relational fixtures from existing DDL
- seed staging or local environments while preserving foreign-key order
- inspect a live schema and generate data without hand-writing metadata
- generate DynamoDB-shaped typed JSON from a real table definition
- export the same dataset shape in SQL, JSON, CSV, Parquet, or DynamoDB JSON
- produce validation reports before loading data elsewhere
How it works
Schema input
-> parse DDL, inspect a live relational DB, or inspect a DynamoDB table
-> build metadata and dependency order
-> generate rows with Faker and optional distributions
-> validate FK, null, unique, and optional CHECK constraints
-> write SQL/JSON/CSV/Parquet/DynamoDB JSON or insert into a DB
Core modules:
- ddl2data/parser/ddl.py: SQL DDL parsing via sqlglot
- ddl2data/parser/introspect.py: relational schema introspection via SQLAlchemy
- ddl2data/parser/dynamodb.py: DynamoDB schema loading via boto3
- ddl2data/generator/base.py: main row generation engine
- ddl2data/generator/dist.py: distribution parsing and sampling
- ddl2data/writer/: SQL, JSON, CSV, Parquet, and DynamoDB JSON writers
- ddl2data/validation.py and ddl2data/report.py: validation and reporting
Core capabilities
- Input sources:
  - SQL DDL via --ddl
  - live relational schema introspection via --schema-from-db --db-url ...
  - live DynamoDB table schema via --schema-from-dynamodb --dynamodb-table ...
- Relationship-aware generation:
  - foreign-key dependency graph
  - parent tables generated before child tables (see the ordering sketch after this list)
  - FK-only tables can use the Polars generation path
- Output formats:
  - PostgreSQL INSERT
  - MySQL INSERT
  - SQLite INSERT
  - BigQuery INSERT and INSERT ALL
  - JSON
  - CSV
  - Parquet
  - DynamoDB typed JSON
- Validation and reporting:
  - FK integrity checks
  - non-null checks
  - unique collision checks
  - optional CHECK validation with --strict-checks
  - JSON report output via --report-path
- Generation controls:
  - per-table row overrides with --table-rows
  - reproducible runs with --seed
  - config-file driven runs with JSON, TOML, or YAML
  - per-column and per-table distribution overrides with --dist
  - optional python or polars engine selection
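To make parent-before-child ordering concrete, the sketch below runs a topological sort over a foreign-key dependency graph. It only illustrates the idea; it is not ddl2data's implementation, and the table names and graph are invented for the example.

from graphlib import TopologicalSorter

# child table -> parent tables it references via foreign keys
fk_parents = {
    "users": set(),
    "products": set(),
    "orders": {"users"},
    "order_items": {"orders", "products"},
}

# static_order() yields every parent before any table that references it,
# so parent rows exist before child rows need their keys
generation_order = list(TopologicalSorter(fk_parents).static_order())
print(generation_order)  # e.g. ['users', 'products', 'orders', 'order_items']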
Install
Recommended for CLI use:
pipx install ddl2data
Or install with pip:
pip install ddl2data
Optional Polars support:
pip install "ddl2data[polars]"
DynamoDB schema loading requires boto3:
pip install boto3
From source:
git clone https://github.com/hyangminj/ddl2data.git
cd ddl2data
python3 -m venv .venv
. .venv/bin/activate
.venv/bin/python -m pip install -e .
Contributor setup with test and Polars extras:
.venv/bin/python -m pip install -e ".[test,polars]"
Sanity check:
ddl2data --help
GitHub Releases also include wheel (.whl) and source (.tar.gz) artifacts.
Quick start
Minimal schema:
CREATE TABLE users (
id INT PRIMARY KEY,
email VARCHAR(100) UNIQUE NOT NULL,
age INT,
tier VARCHAR(10)
);
CREATE TABLE orders (
id INT PRIMARY KEY,
user_id INT NOT NULL,
amount NUMERIC,
FOREIGN KEY (user_id) REFERENCES users(id)
);
Generate JSON:
ddl2data --ddl schema.sql --rows 100 --out json --output-path data.json
Generate PostgreSQL inserts:
ddl2data --ddl schema.sql --rows 100 --out postgres --output-path seed.sql
Generate CSV files, one per table:
ddl2data --ddl schema.sql --rows 100 --out csv --output-path ./csv_out
With this schema, users is generated first and orders.user_id is filled from generated users.id values.
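The same guarantee is easy to picture in plain Python: child foreign keys are drawn from parent keys that were generated first. The snippet below is purely illustrative (the values and helpers are made up; ddl2data's actual generator works from the parsed schema and Faker).

import random

# parent rows first
users = [{"id": i, "email": f"user{i}@example.com"} for i in range(1, 6)]

# each child row picks user_id from ids that already exist in users
orders = [
    {"id": j, "user_id": random.choice(users)["id"], "amount": round(random.uniform(5, 500), 2)}
    for j in range(1, 11)
]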
Common workflows
Generate SQL for different targets
ddl2data --ddl schema.sql --rows 100 --out postgres
ddl2data --ddl schema.sql --rows 100 --out mysql
ddl2data --ddl schema.sql --rows 100 --out sqlite
ddl2data --ddl schema.sql --rows 100 --out bigquery
Read schema directly from a relational database
ddl2data \
--schema-from-db \
--db-url postgresql+psycopg://user:pass@localhost:5432/mydb \
--rows 100 \
--out postgres
Limit introspection to selected tables:
ddl2data \
--schema-from-db \
--db-url postgresql+psycopg://user:pass@localhost:5432/mydb \
--tables users,orders,events \
--rows 100 \
--out json
Insert generated rows directly into a relational database
ddl2data \
--ddl schema.sql \
--rows 1000 \
--insert \
--db-url postgresql+psycopg://user:pass@localhost:5432/mydb
Generate DynamoDB typed JSON from a live table schema
ddl2data \
--schema-from-dynamodb \
--dynamodb-table users \
--dynamodb-region us-east-1 \
--dynamodb-extra-attr email:string \
--dynamodb-extra-attr score:int \
--rows 100 \
--out dynamodb-json \
--output-path users.jsonl
Notes:
- key attributes and GSI or LSI key attributes are inferred from the live table
- use --dynamodb-extra-attr name:type to add non-key fields to generated output
- supported extra attribute aliases include string, uuid, date, datetime, numeric, float, int, and boolean
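For reference, DynamoDB typed JSON wraps every attribute value in a type descriptor such as S, N, or BOOL, and numeric values travel as strings. The item below only illustrates that wire format; the attributes ddl2data actually emits depend on your table's keys and any --dynamodb-extra-attr flags.

# one DynamoDB-typed item expressed as a plain Python dict
item = {
    "pk": {"S": "user#123"},               # partition key (string)
    "email": {"S": "someone@example.com"},
    "score": {"N": "42"},                  # numbers are serialized as strings
    "active": {"BOOL": True},
}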
Control row counts per table
ddl2data \
--ddl schema.sql \
--rows 100 \
--table-rows users=20,orders=500 \
--table-rows events=2000 \
--out json
- --rows stays the global default
- --table-rows table=count overrides individual tables
- config files can use a table_rows map
Generate a validation report
ddl2data \
--ddl schema.sql \
--rows 500 \
--strict-checks \
--report-path report.json \
--out json \
--output-path data.json
The report includes counts and sample issues for:
- FK violations
- non-null violations
- unique collisions
- supported CHECK violations when --strict-checks is enabled
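Conceptually these checks reduce to set lookups over the generated rows. The sketch below shows the idea on hand-written data; it is not ddl2data's validator, and the real report's field names will differ.

users = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": "a@example.com"}]
orders = [{"id": 1, "user_id": 1}, {"id": 2, "user_id": 99}]

user_ids = {u["id"] for u in users}
fk_violations = [o for o in orders if o["user_id"] not in user_ids]   # order 2 points at a missing user
null_violations = [u for u in users if u["email"] is None]            # none here
emails = [u["email"] for u in users]
unique_collisions = {e for e in emails if emails.count(e) > 1}        # duplicate email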
Use Parquet or the Polars engine
Parquet output:
ddl2data \
--ddl schema.sql \
--rows 100 \
--out parquet \
--parquet-compression zstd \
--output-path ./parquet_out
Polars generation and write path:
ddl2data \
--ddl schema.sql \
--rows 100000 \
--out csv \
--engine polars \
--output-path ./csv_out
Engine behavior:
- tables with CHECK constraints fall back to the Python row-wise generator
- tables with unique or primary-key constraints fall back to the Python row-wise generator
- FK-only tables can remain on the Polars path
- --out parquet requires the optional polars dependency regardless of --engine
BigQuery-specific output
ddl2data --ddl schema.sql --rows 100 --out bigquery --output-path out.sql
ddl2data --ddl schema.sql --rows 100 --out bigquery --bq-insert-all --output-path out_insert_all.sql
BigQuery output supports:
- dataset-qualified table names like dataset.table
- INSERT ALL rendering via --bq-insert-all
- typed DATE and TIMESTAMP literals
Config file example
ddl2data --config ddl2data.toml
Example ddl2data.toml:
ddl = "schema.sql"
rows = 500
out = "postgres"
seed = 42
strict_checks = true
table_rows = { users = 100, orders = 500, events = 2000 }
dist = [
"users.age:normal,mean=33,std=7",
"orders.amount:pareto,alpha=1.7,xm=1",
]
CLI arguments override config file values.
Distribution overrides
Syntax:
--dist <column_or_table.column>:<kind>,k=v,k=v
Supported distribution kinds:
normal, poisson, weighted, exponential, pareto, zipf, peak
Examples:
# global column rule
--dist age:normal,mean=35,std=7
# table-qualified rule (higher priority than global)
--dist users.age:normal,mean=30,std=6
# poisson counts
--dist daily_orders:poisson,lambda=3
# weighted categorical values
--dist tier:weighted,A=60%,B=30%,C=10%
# heavy-tail behavior
--dist amount:pareto,alpha=1.5,xm=1
--dist category_rank:zipf,skew=1.8,n=200
# peak-hour timestamps
--dist created_at:peak,hours=9-11,18-20
Priority when both exist:
a table.column rule takes precedence over a bare column rule
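As a rough illustration of what a weighted spec such as tier:weighted,A=60%,B=30%,C=10% means, the standard-library equivalent is a weighted categorical draw. The mapping below is my reading of the syntax, not ddl2data's parser.

import random

values, weights = ["A", "B", "C"], [60, 30, 10]
tiers = random.choices(values, weights=weights, k=1000)
print(tiers.count("A") / len(tiers))  # roughly 0.6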
CHECK-aware generation
ddl2data uses practical heuristics for common CHECK constraints during generation and can optionally validate them again with --strict-checks.
Supported forms include:
- numeric comparison: age >= 18, qty > 0, score <= 100, x != 0
- ranges: price BETWEEN 10 AND 20
- enum lists: status IN ('A', 'B', 'C')
- regex-like checks: code ~ '^[A-Z]{2}$', REGEXP_LIKE(code, '^[0-9]+$')
- nested compound forms built from supported expressions with AND and OR
This is intentionally best-effort, not a full SQL expression engine.
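A heuristic for constraints like these can be as simple as sampling directly from the allowed range or value set instead of fixing rows afterwards. The snippet is a conceptual sketch of that approach, not the library's code.

import random

def draw_age(minimum: int = 18, maximum: int = 90) -> int:
    # satisfies a numeric comparison such as CHECK (age >= 18)
    return random.randint(minimum, maximum)

def draw_status() -> str:
    # satisfies an enum-style CHECK (status IN ('A', 'B', 'C'))
    return random.choice(["A", "B", "C"])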
Reproducible runs
Use --seed to stabilize Python random and Faker output:
ddl2data --ddl schema.sql --rows 100 --seed 42 --out json
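In Python terms, a fixed seed pins both the random module and Faker; the exact wiring inside ddl2data is an assumption here, but the effect is the same kind of determinism shown below.

import random
from faker import Faker

random.seed(42)
Faker.seed(42)  # class-level seed shared by all Faker instances
fake = Faker()

print(fake.name(), random.randint(0, 100))  # identical on every run with the same seed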
Development and testing
Local setup
python3 -m venv .venv
. .venv/bin/activate
.venv/bin/python -m pip install -e ".[test,polars]"
If you only need the core package:
.venv/bin/python -m pip install -e .
If you need DynamoDB schema loading outside the test extra:
.venv/bin/python -m pip install boto3
Useful test commands
Run the full suite:
.venv/bin/python -m pytest
Run only non-integration tests:
.venv/bin/python -m pytest -m "not integration"
Run one file:
.venv/bin/python -m pytest tests/test_parser_graph.py
Run one integration target:
.venv/bin/python -m pytest tests/test_integration_postgres.py
.venv/bin/python -m pytest tests/test_integration_dynamodb.py
.venv/bin/python -m pytest tests/test_integration_bigquery.py
Available markers:
integration, postgres, dynamodb, bigquery
Local integration services
The repo includes docker-compose.yml for PostgreSQL and LocalStack DynamoDB:
docker compose up -d postgres localstack
docker compose ps
Service summary:
- PostgreSQL: localhost:5432
- LocalStack DynamoDB: http://localhost:4566
Suggested .env.test:
TEST_POSTGRES_URL=postgresql+psycopg2://testuser:testpass@localhost:5432/testdb
AWS_ACCESS_KEY_ID=test
AWS_SECRET_ACCESS_KEY=test
AWS_DEFAULT_REGION=us-east-1
DYNAMODB_ENDPOINT_URL=http://localhost:4566
TEST_BQ_PROJECT=your-gcp-project-id
TEST_BQ_DATASET=ddl2data_integration_test
# GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
.env.test is already git-ignored and is the right place for local test-only credentials.
BigQuery authentication
Two common options work for the integration tests:
Application Default Credentials:
gcloud auth application-default login
Service-account key file:
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
If TEST_BQ_PROJECT is missing or credentials are unavailable, the BigQuery integration tests skip automatically.
CLI reference
ddl2data [--ddl schema.sql
| --schema-from-db --db-url URL [--tables t1,t2]
| --schema-from-dynamodb --dynamodb-table NAME]
[--dynamodb-region REGION]
[--dynamodb-extra-attr name:type]
[--config config.toml]
[--rows 100]
[--table-rows users=20,orders=500] [--table-rows events=2000]
[--out postgres|mysql|sqlite|bigquery|json|csv|parquet|dynamodb-json]
[--engine python|polars]
[--bq-insert-all]
[--output-path PATH]
[--insert --db-url URL]
[--dist ...] [--dist ...]
[--seed INT]
[--report-path report.json]
[--strict-checks]
[--parquet-compression snappy|zstd|lz4|gzip|none]
Flag summary:
- --config: load defaults from JSON, TOML, YAML, or YML
- --ddl: input DDL file
- --schema-from-db: inspect relational table metadata from a live database
- --tables: optional comma-separated table filter for relational introspection mode
- --schema-from-dynamodb: inspect a live DynamoDB table definition
- --dynamodb-table: DynamoDB table name for --schema-from-dynamodb
- --dynamodb-region: AWS region for DynamoDB schema loading
- --dynamodb-extra-attr: add synthetic non-key DynamoDB attributes, repeatable
- --rows: global default rows per table
- --table-rows: per-table row overrides
- --out: output format, default postgres
- --engine: python or polars
- --bq-insert-all: BigQuery INSERT ALL ... SELECT 1; mode
- --output-path: output file path or output directory depending on format
- --db-url: SQLAlchemy database URL for introspection or direct insert
- --insert: insert generated rows directly into the target relational database
- --dist: distribution overrides
- --seed: deterministic generation seed
- --report-path: write a JSON report with validation summary
- --strict-checks: validate supported CHECK constraints after generation
- --parquet-compression: Parquet compression codec
Limitations
- DDL parsing and relational DB introspection cover common schemas well, but advanced dialect-specific features are still partial.
- DynamoDB schema loading models key and index attributes plus explicitly declared extra attributes; it does not infer full document structure from item samples.
- CHECK-aware generation and --strict-checks cover a practical subset, not arbitrary SQL expressions.
- Function-heavy or cross-column CHECK expressions are best-effort rather than fully modeled.
- --engine polars still falls back to Python for tables with CHECK, unique, or primary-key constraints.
Future work
- More distribution types for realistic synthetic workloads
- Additional pipeline edge-case generation such as skew, timestamp boundaries, and duplicate-heavy batches
- Broader dialect coverage and richer schema inference for advanced database features
License
Apache License 2.0. See LICENSE.