Generate high-quality synthetic data from ODCS data contracts
datacontract-faker
Synthetic data that actually matches your contract, straight from `schema.yaml`.
datacontract-faker reads an ODCS v3.1.0 data contract and produces realistic, contract-compliant synthetic data — as a CLI tool or a Python library.
datacontract-faker generate orders.yaml -r 50000 -o data/orders.parquet -f parquet
Parsing contract: orders.yaml
✓ Contract valid — 2 model(s): orders, customers
Generating 50,000 row(s) …
✓ customers → data/orders_customers.parquet
✓ orders → data/orders_orders.parquet
Highlights
- Strict contract compliance: `minimum`, `maximum`, `exclusiveMinimum`, `exclusiveMaximum`, `multipleOf`, `minLength`, `maxLength`, `pattern`, `examples`, `format`, and length-constrained types are all enforced on every row.
- Foreign-key integrity: declarations from `relationships:` are honored across models. Referenced tables are generated first; child rows sample real parent keys.
- Nested + array types: recursive `object` fields and `array` items (including arrays of objects) generate properly typed values.
- Reproducible: pin `--seed` for byte-identical output across runs.
- Four output formats: JSON, JSONL, CSV, Parquet (via `polars` + `pyarrow`).
- Locale-aware: any Faker locale (`en_US`, `de_DE`, `ja_JP`, …).
- Library-first: every component (`ContractParser`, `SyntheticGenerator`, `ProviderMapper`, `Exporter`) is independently importable.
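To make the numeric constraints concrete, here is a minimal sketch of how a generator could draw an integer that satisfies `minimum`, `maximum`, and `multipleOf` at the same time. It is an illustration only and not datacontract-faker's actual implementation; `constrained_int` and its parameters are hypothetical names.

```python
import random


def constrained_int(rng: random.Random, minimum: int, maximum: int, multiple_of: int = 1) -> int:
    """Draw an integer in [minimum, maximum] that is a multiple of multiple_of.

    Hypothetical helper for illustration; not part of datacontract-faker.
    """
    if multiple_of <= 0:
        raise ValueError("multiple_of must be positive")
    # Smallest and largest multipliers k such that k * multiple_of is in range.
    lo = -(-minimum // multiple_of)  # ceiling division
    hi = maximum // multiple_of      # floor division
    if lo > hi:
        raise ValueError("no multiple of multiple_of lies within [minimum, maximum]")
    return rng.randint(lo, hi) * multiple_of


rng = random.Random(42)  # a fixed seed makes the draws repeatable
values = [constrained_int(rng, minimum=10, maximum=100, multiple_of=5) for _ in range(1000)]
```

Sampling the multiplier rather than rejecting out-of-range draws means every value is valid on the first try, and an unsatisfiable combination of constraints fails fast with an error.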
Install
pip install datacontract-faker # standard
pipx install datacontract-faker # isolated CLI
poetry add datacontract-faker # poetry projects
Requires Python ≥ 3.11.
CLI
generate
Generate synthetic data from a contract.
datacontract-faker generate CONTRACT [OPTIONS]
| Option | Short | Default | Description |
|---|---|---|---|
| `CONTRACT` | — | required | Path to the ODCS contract (YAML), as a positional argument |
| `--rows INT` | `-r` | `100` | Rows to generate per model |
| `--output PATH` | `-o` | — | Output file. Omit to preview in the terminal |
| `--format` | `-f` | `json` | `csv` \| `json` \| `jsonl` \| `parquet` |
| `--model TEXT` | `-m` | all | Generate only this model |
| `--locale TEXT` | — | `en_US` | Faker locale |
| `--seed INT` | — | — | Seed for reproducible output |
| `--nullable-ratio FLOAT` | — | `0.1` | Probability of null for optional fields (0.0–1.0) |
| `--validate-only` | — | `false` | Only validate the contract; skip generation |
| `--verbose` | `-v` | `false` | Debug logging to stderr |
When a contract has multiple models and --output is set, each model is written to {stem}_{model}{suffix} — e.g. data.parquet becomes data_orders.parquet, data_customers.parquet.
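The `{stem}_{model}{suffix}` rule can be reproduced with `pathlib`, which is handy when a downstream script needs to locate the per-model files. The sketch below is an assumption about how the naming works based on the description above; `per_model_path` is a hypothetical helper, not a function exported by the package.

```python
from pathlib import Path


def per_model_path(output: Path, model: str) -> Path:
    """Apply the {stem}_{model}{suffix} naming rule to an output path.

    Hypothetical helper for illustration; not part of datacontract-faker.
    """
    return output.with_name(f"{output.stem}_{model}{output.suffix}")


paths = [per_model_path(Path("data/orders.parquet"), m) for m in ("customers", "orders")]
```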
inspect
Render the parsed schema as a table — type, format, flags (PK, req, uniq, →fk_target), examples, and constraints.
datacontract-faker inspect CONTRACT [--nullable-ratio FLOAT]
Recipes
# Preview 10 rows in the terminal
datacontract-faker generate contract.yaml -r 10
# Validate without generating
datacontract-faker generate contract.yaml --validate-only
# 1M rows to Parquet, deterministic
datacontract-faker generate contract.yaml -r 1000000 -o out.parquet -f parquet --seed 42
# Single model, German locale, NDJSON
datacontract-faker generate contract.yaml -m customers --locale de_DE -o cust.jsonl -f jsonl
# Stress-test null handling
datacontract-faker generate contract.yaml --nullable-ratio 0.5
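For intuition about what `--nullable-ratio` controls, the sketch below nulls out optional values with a given probability and leaves required values untouched. It only illustrates the idea; `maybe_null` is a hypothetical function, and the tool's internal logic may differ.

```python
import random


def maybe_null(rng: random.Random, value, required: bool, nullable_ratio: float):
    """Return None with probability nullable_ratio, for optional fields only.

    Hypothetical helper for illustration; not part of datacontract-faker.
    """
    if not 0.0 <= nullable_ratio <= 1.0:
        raise ValueError("nullable_ratio must be between 0.0 and 1.0")
    if not required and rng.random() < nullable_ratio:
        return None
    return value


rng = random.Random(7)
optional = [maybe_null(rng, "x", required=False, nullable_ratio=0.5) for _ in range(10_000)]
required = [maybe_null(rng, "x", required=True, nullable_ratio=0.5) for _ in range(10_000)]
null_share = optional.count(None) / len(optional)  # close to 0.5 for a large sample
```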
Python API
Every CLI capability is available as a library. Public exports from the top-level package:
from datacontract_faker import (
    ContractParser, ContractValidationError,
    SyntheticGenerator,
    Exporter, OutputFormat,
    FieldSpec, GenerationSchema, QualityRule,
)
End-to-end
from pathlib import Path
from datacontract_faker import ContractParser, SyntheticGenerator, Exporter, OutputFormat
schema = ContractParser(nullable_ratio=0.05).load_and_validate(Path("contract.yaml"))
gen = SyntheticGenerator(schema, rows=10_000, seed=42, locale="en_US")
dataframes = gen.generate_all() # dict[str, polars.DataFrame], FK-aware order
exporter = Exporter()
for name, df in dataframes.items():
    exporter.export(df, Path(f"output/{name}.parquet"), OutputFormat.PARQUET)
generate_all() honors foreign-key relationships: referenced models are generated first and child models sample real parent keys.
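"Referenced models first" is a topological ordering of the foreign-key graph. The sketch below shows that idea with the standard library's `graphlib`; it is not a claim about how `generate_all()` is implemented, and the `references` map with its model names is invented for the example.

```python
from graphlib import TopologicalSorter

# Hypothetical FK map: each model lists the models it references.
references = {
    "orders": {"customers"},
    "order_items": {"orders", "products"},
    "customers": set(),
    "products": set(),
}

# TopologicalSorter takes node -> predecessors, so referenced (parent)
# models come out before the models that depend on them.
generation_order = list(TopologicalSorter(references).static_order())
```

A circular reference between models would raise `graphlib.CycleError` here, which is one reason to validate relationships before generating.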
Generate a single model
df = gen.generate_model("orders") # polars.DataFrame
df = gen.generate_model("orders", fk_pools={"customer_id": ["c1", "c2", "c3"]})
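A pool like the one passed to `fk_pools` guarantees integrity because child values are only ever drawn from keys that exist. The standalone sketch below illustrates seeded sampling from such a pool; `sample_foreign_keys` is a hypothetical function and does not reflect the library's internals.

```python
import random

parent_keys = ["c1", "c2", "c3"]  # keys already generated for the parent model


def sample_foreign_keys(pool: list[str], rows: int, seed: int) -> list[str]:
    """Pick one parent key per child row, with replacement.

    Hypothetical helper for illustration; not part of datacontract-faker.
    """
    rng = random.Random(seed)
    return [rng.choice(pool) for _ in range(rows)]


child_fk = sample_foreign_keys(parent_keys, rows=8, seed=42)
```

Because the generator is seeded, calling the function again with the same arguments returns the same list.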
Parse without file I/O
schema = ContractParser().parse_string(contract_yaml_str)
Custom Faker providers
Override or extend the type-to-provider resolution:
from datacontract_faker.mapper import ProviderMapper
from datacontract_faker import SyntheticGenerator
mapper = ProviderMapper(
    logical_format_overrides={
        "string:product_sku": lambda f: f.bothify("???-####").upper(),
    },
    logical_overrides={
        "integer": lambda f: f.random_int(min=1000, max=9999),
    },
    physical_overrides={
        "decimal": lambda f: round(f.pyfloat(min_value=1, max_value=500), 2),
    },
)
gen = SyntheticGenerator(schema, rows=500, mapper=mapper)
Output formats
| Format | Flag | Extension | Notes |
|---|---|---|---|
| JSON (array) | `json` | `.json` | Pretty-printed, UTF-8 |
| NDJSON | `jsonl` | `.jsonl` | One record per line, stream-friendly |
| CSV | `csv` | `.csv` | UTF-8, no index |
| Apache Parquet | `parquet` | `.parquet` | Columnar, compressed (recommended for large datasets) |
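The difference between the JSON array and NDJSON layouts is easiest to see side by side. The sketch below uses only the standard library's `json` module with made-up records; the exact indentation and key order the exporter writes are not specified here and may differ.

```python
import json

# Made-up records, including non-ASCII text to show UTF-8 output.
records = [
    {"order_id": 1, "city": "Köln"},
    {"order_id": 2, "city": "大阪"},
]

# JSON (array): a single pretty-printed document holding every record.
json_text = json.dumps(records, indent=2, ensure_ascii=False)

# NDJSON: one compact JSON object per line, so a consumer can process
# the file one line at a time without loading it whole.
jsonl_text = "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"

parsed_lines = [json.loads(line) for line in jsonl_text.splitlines()]
```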
Stack
datacontract-cli for contract validation · Faker for synthetic values · polars + pyarrow for DataFrames and Parquet · rstr for regex-conforming strings · Typer + Rich for the CLI.
License