
Generate high-quality synthetic data from ODCS data contracts


datacontract-faker

Synthetic data that actually matches your contract — straight from schema.yaml.


datacontract-faker reads an ODCS v3.1.0 data contract and produces realistic, contract-compliant synthetic data — as a CLI tool or a Python library.

datacontract-faker generate orders.yaml -r 50000 -o data/orders.parquet -f parquet
Parsing contract: orders.yaml
✓ Contract valid — 2 model(s): orders, customers
Generating 50,000 row(s) …
✓ customers → data/orders_customers.parquet
✓ orders    → data/orders_orders.parquet

Highlights

  • Strict contract compliance: minimum, maximum, exclusiveMinimum, exclusiveMaximum, multipleOf, minLength, maxLength, pattern, examples, format, and length-constrained types are all enforced on every row.
  • Foreign-key integrity: foreign keys declared under relationships: are honored across models. Referenced tables are generated first; child rows sample real parent keys.
  • Nested + array types — recursive object fields and array items (including arrays of objects) generate properly typed values.
  • Reproducible — pin --seed for byte-identical output across runs.
  • Four output formats — JSON, JSONL, CSV, Parquet (via polars + pyarrow).
  • Locale-aware — any Faker locale (en_US, de_DE, ja_JP, …).
  • Library-first — every component (ContractParser, SyntheticGenerator, ProviderMapper, Exporter) is independently importable.
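
To make the compliance claim concrete, the sketch below checks a single value against a few of the keywords listed above. It is illustrative only: `violations` and the plain-dict constraint layout are hypothetical, not the package's validator or API, and treating `pattern` as a whole-string match is an assumption.

```python
import re
from decimal import Decimal


def violations(value, constraints):
    """Return the names of the constraints that `value` breaks.

    Hypothetical helper: `constraints` is a plain dict keyed by the
    keyword names listed above, not a datacontract-faker object.
    """
    broken = []
    if isinstance(value, (int, float)):
        if "minimum" in constraints and value < constraints["minimum"]:
            broken.append("minimum")
        if "maximum" in constraints and value > constraints["maximum"]:
            broken.append("maximum")
        if "exclusiveMinimum" in constraints and value <= constraints["exclusiveMinimum"]:
            broken.append("exclusiveMinimum")
        if "exclusiveMaximum" in constraints and value >= constraints["exclusiveMaximum"]:
            broken.append("exclusiveMaximum")
        if "multipleOf" in constraints:
            # Decimal avoids float remainder noise such as 0.3 % 0.1 != 0.
            if Decimal(str(value)) % Decimal(str(constraints["multipleOf"])) != 0:
                broken.append("multipleOf")
    if isinstance(value, str):
        if "minLength" in constraints and len(value) < constraints["minLength"]:
            broken.append("minLength")
        if "maxLength" in constraints and len(value) > constraints["maxLength"]:
            broken.append("maxLength")
        # Assumption: the pattern must cover the whole string.
        if "pattern" in constraints and not re.fullmatch(constraints["pattern"], value):
            broken.append("pattern")
    return broken
```

A check like this can be run over every generated row in a test suite; an empty list means the value satisfies all constraints it was given.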

Install

pip install datacontract-faker          # standard
pipx install datacontract-faker         # isolated CLI
poetry add datacontract-faker           # poetry projects

Requires Python ≥ 3.11.


CLI

generate

Generate synthetic data from a contract.

datacontract-faker generate CONTRACT [OPTIONS]
| Option | Short | Default | Description |
|---|---|---|---|
| CONTRACT | | required | Path to the ODCS contract (YAML), as a positional argument |
| --rows INT | -r | 100 | Rows to generate per model |
| --output PATH | -o | | Output file. Omit to preview in the terminal |
| --format | -f | json | One of csv, json, jsonl, parquet |
| --model TEXT | -m | all | Generate only this model |
| --locale TEXT | | en_US | Faker locale |
| --seed INT | | | Seed for reproducible output |
| --nullable-ratio FLOAT | | 0.1 | Probability of null for optional fields (0.0–1.0) |
| --validate-only | | false | Only validate the contract; skip generation |
| --verbose | -v | false | Debug logging to stderr |

When a contract has multiple models and --output is set, each model is written to {stem}_{model}{suffix} — e.g. data.parquet becomes data_orders.parquet, data_customers.parquet.
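
The naming rule can be reproduced with pathlib. This is a minimal sketch of the `{stem}_{model}{suffix}` pattern described above; `per_model_path` is a hypothetical helper, not a function exported by the package.

```python
from pathlib import Path


def per_model_path(output: Path, model: str) -> Path:
    """Build {stem}_{model}{suffix} in the same directory as `output`."""
    return output.with_name(f"{output.stem}_{model}{output.suffix}")


paths = [per_model_path(Path("data.parquet"), m) for m in ("orders", "customers")]
print([p.name for p in paths])
```

The directory part of the path is preserved, so `data/orders.parquet` with model `customers` maps to `data/orders_customers.parquet`, matching the sample run at the top of this page.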

inspect

Render the parsed schema as a table — type, format, flags (PK, req, uniq, →fk_target), examples, and constraints.

datacontract-faker inspect CONTRACT [--nullable-ratio FLOAT]

Recipes

# Preview 10 rows in the terminal
datacontract-faker generate contract.yaml -r 10

# Validate without generating
datacontract-faker generate contract.yaml --validate-only

# 1M rows to Parquet, deterministic
datacontract-faker generate contract.yaml -r 1000000 -o out.parquet -f parquet --seed 42

# Single model, German locale, NDJSON
datacontract-faker generate contract.yaml -m customers --locale de_DE -o cust.jsonl -f jsonl

# Stress-test null handling
datacontract-faker generate contract.yaml --nullable-ratio 0.5
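
For intuition about --nullable-ratio and --seed, here is a rough stdlib sketch. It assumes each optional value is nulled by an independent random draw and that a fixed seed drives a private random generator; the README does not describe the package's internals, so `apply_nulls` is purely hypothetical.

```python
import random


def apply_nulls(values, nullable_ratio, seed=None):
    """Replace each value with None with probability `nullable_ratio`.

    Hypothetical helper for optional fields only; required fields would
    skip this step entirely.
    """
    if not 0.0 <= nullable_ratio <= 1.0:
        raise ValueError("nullable_ratio must be between 0.0 and 1.0")
    rng = random.Random(seed)  # private RNG, so a fixed seed repeats exactly
    return [None if rng.random() < nullable_ratio else v for v in values]


column = list(range(10_000))
first = apply_nulls(column, nullable_ratio=0.5, seed=42)
second = apply_nulls(column, nullable_ratio=0.5, seed=42)
null_share = first.count(None) / len(first)  # close to 0.5 for a large column
```

With the same seed, `first` and `second` are identical, which is the property that makes seeded runs usable as test fixtures.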

Python API

Every CLI capability is available as a library. Public exports from the top-level package:

from datacontract_faker import (
    ContractParser, ContractValidationError,
    SyntheticGenerator,
    Exporter, OutputFormat,
    FieldSpec, GenerationSchema, QualityRule,
)

End-to-end

from pathlib import Path
from datacontract_faker import ContractParser, SyntheticGenerator, Exporter, OutputFormat

schema = ContractParser(nullable_ratio=0.05).load_and_validate(Path("contract.yaml"))

gen = SyntheticGenerator(schema, rows=10_000, seed=42, locale="en_US")
dataframes = gen.generate_all()   # dict[str, polars.DataFrame], FK-aware order

exporter = Exporter()
for name, df in dataframes.items():
    exporter.export(df, Path(f"output/{name}.parquet"), OutputFormat.PARQUET)

generate_all() honors foreign-key relationships: referenced models are generated first and child models sample real parent keys.
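
The ordering idea can be sketched with the standard library alone. The example below is not the package's implementation: the `references` map, the model names, and the row shapes are assumptions chosen to mirror the orders/customers contract used elsewhere on this page.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical dependency map: each model lists the models it references.
references = {"orders": {"customers"}, "customers": set()}

# static_order() yields referenced (parent) models before their children.
order = list(TopologicalSorter(references).static_order())

rng = random.Random(42)
tables = {}
for model in order:
    if model == "customers":
        tables[model] = [{"customer_id": f"c{i}"} for i in range(1, 4)]
    else:
        # Child rows draw customer_id from keys that already exist.
        parent_keys = [row["customer_id"] for row in tables["customers"]]
        tables[model] = [
            {"order_id": i, "customer_id": rng.choice(parent_keys)}
            for i in range(1, 11)
        ]
```

Because parents are built first, every `customer_id` in `orders` is guaranteed to resolve to a row in `customers`.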

Generate a single model

df = gen.generate_model("orders")          # polars.DataFrame
df = gen.generate_model("orders", fk_pools={"customer_id": ["c1", "c2", "c3"]})

Parse without file I/O

schema = ContractParser().parse_string(contract_yaml_str)

Custom Faker providers

Override or extend the type-to-provider resolution:

from datacontract_faker.mapper import ProviderMapper
from datacontract_faker import SyntheticGenerator

mapper = ProviderMapper(
    logical_format_overrides={
        "string:product_sku": lambda f: f.bothify("???-####").upper(),
    },
    logical_overrides={
        "integer": lambda f: f.random_int(min=1000, max=9999),
    },
    physical_overrides={
        "decimal": lambda f: round(f.pyfloat(min_value=1, max_value=500), 2),
    },
)

gen = SyntheticGenerator(schema, rows=500, mapper=mapper)
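
The `string:product_sku` override relies on Faker's `bothify`, which replaces each `?` with a random letter and each `#` with a random digit. The stdlib-only sketch below mimics that substitution to show the shape of the resulting SKU; `fake_sku` is a hypothetical stand-in, not part of Faker or datacontract-faker.

```python
import random
import re
import string


def fake_sku(template="???-####", seed=None):
    """Mimic bothify(template).upper() with the standard library only."""
    rng = random.Random(seed)
    out = []
    for ch in template:
        if ch == "?":
            out.append(rng.choice(string.ascii_letters))
        elif ch == "#":
            out.append(rng.choice(string.digits))
        else:
            out.append(ch)  # literal characters pass through unchanged
    return "".join(out).upper()


sku = fake_sku(seed=7)
# Three uppercase letters, a hyphen, then four digits.
assert re.fullmatch(r"[A-Z]{3}-\d{4}", sku)
```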

Output formats

| Format | Flag | Extension | Notes |
|---|---|---|---|
| JSON (array) | json | .json | Pretty-printed, UTF-8 |
| NDJSON | jsonl | .jsonl | One record per line, stream-friendly |
| CSV | csv | .csv | UTF-8, no index |
| Apache Parquet | parquet | .parquet | Columnar, compressed (recommended for large datasets) |
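
The difference between the two JSON flavors is easiest to see side by side. This sketch uses only the standard `json` module; the sample records, the indent width, and the `ensure_ascii=False` setting are assumptions, not the exporter's actual settings.

```python
import json

records = [{"id": 1, "city": "Köln"}, {"id": 2, "city": "東京"}]

# JSON (array): one pretty-printed document holding every record.
as_json = json.dumps(records, indent=2, ensure_ascii=False)

# NDJSON: one compact object per line, readable without loading the whole file.
as_jsonl = "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

# Both encodings carry the same records.
assert json.loads(as_json) == records
assert [json.loads(line) for line in as_jsonl.splitlines()] == records
```

A consumer can process the NDJSON form line by line, which is why it suits streaming pipelines better than a single large array.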

Stack

datacontract-cli for contract validation · Faker for synthetic values · polars + pyarrow for DataFrames and Parquet · rstr for regex-conforming strings · Typer + Rich for the CLI.


License

MIT
