Python synthetic data generator for realistic multi-table test data, database seeding, and scenario simulation
Misata generates consistent, referentially-intact multi-table datasets from a plain-English description, a YAML schema file, or an existing database schema. No machine-learning model is required. No real data is needed.
Built for:
- Database seeding — fill dev and staging environments with production-like data
- Integration tests — relational fixtures with FK integrity across every table
- Demos and prototypes — realistic numbers, names, and distributions, no PII
- BI and dashboard development — data shaped like your real domain before launch
Install
pip install misata
Optional extras:
pip install "misata[llm]" # multi-provider LLM schema generation
pip install "misata[documents]" # PDF output via weasyprint
pip install "misata[advanced]" # SDV/CTGAN statistical synthesis
Quick start
import misata
# One sentence → multi-table DataFrame dict
tables = misata.generate("A SaaS company with 5k users, monthly subscriptions, and 20% churn")
print(tables["users"].head())
print(tables["subscriptions"].head())
# Or from the CLI
misata generate --story "A SaaS company with 5k users and 20% churn" --rows 5000
Six ways to generate data
1. Plain English — no config required
tables = misata.generate("A fintech startup with 10k customers, fraud rate 3%, and IBAN accounts")
Misata reads the story, infers domain (fintech), scale (10 000 rows), and column semantics (fraud flag, IBAN format) — no schema authoring needed.
2. YAML schema-as-code — commit it to git
misata init # scaffolds misata.yaml in the current directory
misata generate # reads misata.yaml automatically
# misata.yaml
name: my-app
seed: 42
tables:
  users:
    rows: 1000
    columns:
      user_id: { type: int, unique: true }
      email: { type: text, text_type: email }
      plan: { type: categorical, choices: [free, pro, enterprise] }
  orders:
    rows: 5000
    columns:
      order_id: { type: int, unique: true }
      user_id: { type: foreign_key }
      amount: { type: float, min: 5.0, max: 500.0 }
      cost: { type: float, min: 1.0, max: 400.0 }  # referenced by the amount_above_cost constraint
relationships:
  - "users.user_id → orders.user_id"
constraints:
  - name: amount_above_cost
    table: orders
    type: inequality
    column_a: amount
    operator: ">"
    column_b: cost
schema = misata.load_yaml_schema("misata.yaml")
tables = misata.generate_from_schema(schema)
3. Seed an existing database directly
from misata import schema_from_db, generate_from_schema, seed_database
# Introspect the live schema — no manual column definitions
schema = schema_from_db("postgresql://user:pass@localhost/myapp")
tables = generate_from_schema(schema)
# Seed it back — insert order respects FK dependencies automatically
report = seed_database(tables, "postgresql://user:pass@localhost/myapp_dev")
# SeedReport: seeded 6 tables, 47,300 rows in 1.2s
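The FK-aware insert order mentioned above can be illustrated with a topological sort. This is a minimal standard-library sketch, not Misata's actual implementation; the `fk_parents` map and the `insert_order` helper are hypothetical names chosen for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical FK map: each table -> the set of parent tables it references.
fk_parents = {
    "users": set(),
    "subscriptions": {"users"},
    "invoices": {"subscriptions", "users"},
}


def insert_order(fk_parents):
    """Return table names ordered so every parent precedes its children."""
    # TopologicalSorter takes {node: predecessors} and emits predecessors first.
    return list(TopologicalSorter(fk_parents).static_order())


order = insert_order(fk_parents)
print(order)
```

Inserting rows in this order means a child row never references a parent row that has not been written yet. A cyclic FK graph would make `static_order()` raise `graphlib.CycleError`, which a real seeder would need to handle separately.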
# One-command workflow
misata init --db postgresql://user:pass@localhost/myapp # writes misata.yaml
misata generate --db-url postgresql://user:pass@localhost/myapp_dev --db-create
SQLAlchemy models are supported too:
from misata import seed_from_sqlalchemy_models
from myapp.models import Base
report = seed_from_sqlalchemy_models(Base, db_url="sqlite:///test.db", row_count=500, create_tables=True)
4. Python dict schema
schema = misata.from_dict_schema({
"customers": {
"id": {"type": "integer", "primary_key": True},
"email": {"type": "email"},
"plan": {"type": "string", "enum": ["free", "pro", "enterprise"]},
},
"orders": {
"id": {"type": "integer", "primary_key": True},
"customer_id": {"type": "integer", "foreign_key": {"table": "customers", "column": "id"}},
"amount": {"type": "float", "min": 1.0, "max": 999.0},
},
}, row_count=5_000)
tables = misata.generate_from_schema(schema)
5. LLM-assisted generation — richer semantics, optional
from misata import LLMSchemaGenerator
gen = LLMSchemaGenerator(provider="groq") # free tier, fast
# gen = LLMSchemaGenerator(provider="anthropic") # Claude
# gen = LLMSchemaGenerator(provider="ollama", model="llama3") # fully local, no API key
schema = gen.generate_from_story(
"A fraud detection dataset — 2% positive rate, FICO scores, transaction velocity features"
)
tables = misata.generate_from_schema(schema)
Requires pip install "misata[llm]" plus one of GROQ_API_KEY, OPENAI_API_KEY, ANTHROPIC_API_KEY, or GOOGLE_API_KEY. The local Ollama provider needs no API key.
6. Incremental generation — grow a dataset without re-seeding
schema = misata.parse("A fintech company with 1000 customers")  # keep the schema so later batches can reuse it
tables = misata.generate_from_schema(schema)
# Add 1 000 more rows: IDs auto-offset, FK integrity maintained across both batches
tables = misata.generate_more(tables, schema, n=1000, seed=2)
print(len(tables["customers"])) # 2000
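To make the idea of ID offsetting concrete, here is a small standalone sketch of appending a second batch so that new primary keys continue after the existing maximum and new foreign keys point at valid parents from either batch. It is an illustration under assumptions, not how `generate_more` is implemented; `append_batch` and the `customer_id` / `account_id` column names are hypothetical.

```python
import random


def append_batch(parents, children, n_parents, n_children, seed):
    """Append new rows; new IDs continue after the current maximum."""
    rng = random.Random(seed)
    next_parent = max((p["customer_id"] for p in parents), default=0) + 1
    next_child = max((c["account_id"] for c in children), default=0) + 1

    new_parents = [{"customer_id": next_parent + i} for i in range(n_parents)]
    all_parents = parents + new_parents

    # New children may reference parents from the old or the new batch.
    parent_ids = [p["customer_id"] for p in all_parents]
    new_children = [
        {"account_id": next_child + i, "customer_id": rng.choice(parent_ids)}
        for i in range(n_children)
    ]
    return all_parents, children + new_children


customers, accounts = append_batch([], [], 3, 5, seed=1)
customers, accounts = append_batch(customers, accounts, 3, 5, seed=2)
print(len(customers), len(accounts))
```

Because the offset is derived from the data already present, the same function works for the first batch and for every later one.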
Localisation
Misata automatically detects the country context from your story and generates statistically accurate data for that locale — the right names, salary distributions, national ID formats, currencies, postcodes, and company naming conventions.
# Locale is detected automatically — no extra flag needed
tables = misata.generate("German SaaS company in Berlin with 2k enterprise customers")
# → names from de_DE Faker pool, salary ~ lognormal(μ=10.71, σ=0.5) ≈ €45k median,
# postcodes are 5-digit, company names end in GmbH/AG/UG
tables = misata.generate("Brazilian fintech with R$ payments and CPF verification, 50k users")
# → pt_BR names, salary median ~BRL 33.6k, national IDs match CPF format ###.###.###-##
tables = misata.generate("Indian startup in Bangalore with ₹ salary bands and Aadhaar KYC")
# → hi_IN names, salary median ~₹350k/yr, national IDs match Aadhaar 12-digit format
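The de_DE comment above quotes lognormal(μ=10.71, σ=0.5) as roughly a €45k median. The arithmetic can be checked with the standard library alone, since the median of a log-normal distribution is exp(μ). This sketch does not call Misata; it only verifies the quoted parameters.

```python
import math
import random
import statistics

MU, SIGMA = 10.71, 0.5  # parameters quoted in the de_DE comment above

# The median of a log-normal distribution is exp(mu); sigma only changes the spread.
analytic_median = math.exp(MU)

# Cross-check by sampling with a fixed seed.
rng = random.Random(42)
samples = [rng.lognormvariate(MU, SIGMA) for _ in range(100_000)]
empirical_median = statistics.median(samples)

print(round(analytic_median), round(empirical_median))
```

Both values land a little under 45 000, which is consistent with the "≈ €45k median" figure.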
A story with no country cue falls back to en_US; pass --locale on the CLI to force a specific locale:
schema = misata.parse("An ecommerce store with 10k orders")
tables = misata.generate_from_schema(schema) # defaults to en_US
# CLI
misata generate --story "Ecommerce store" --locale ja_JP
15 built-in locales
| Locale | Country | Currency | Salary median | National ID |
|---|---|---|---|---|
| en_US | United States | USD / $ | $62 000 | SSN ###-##-#### |
| en_GB | United Kingdom | GBP / £ | £34 000 | NIN AA######A |
| de_DE | Germany | EUR / € | €45 000 | Steuer-IdNr |
| fr_FR | France | EUR / € | €38 000 | NIR |
| pt_BR | Brazil | BRL / R$ | R$33 600 | CPF ###.###.###-## |
| es_ES | Spain | EUR / € | €27 000 | NIE |
| hi_IN | India | INR / ₹ | ₹350 000 | Aadhaar ####-####-#### |
| ja_JP | Japan | JPY / ¥ | ¥4 400 000 | My Number |
| zh_CN | China | CNY / ¥ | ¥90 000 | Resident ID |
| ar_SA | Saudi Arabia | SAR | SAR 96 000 | National ID |
| ko_KR | South Korea | KRW / ₩ | ₩42 000 000 | RRN |
| nl_NL | Netherlands | EUR / € | €42 000 | BSN |
| it_IT | Italy | EUR / € | €29 000 | Codice Fiscale |
| pl_PL | Poland | PLN | PLN 72 000 | PESEL |
| tr_TR | Turkey | TRY | TRY 720 000 | TC Kimlik |
Each pack carries real salary distributions (median and lognormal priors), age distributions, top-ranked cities, phone-number prefixes, postcode patterns, company suffixes, and VAT rates — sourced from OECD, World Bank, ILO, and national statistics offices (2023–24 data).
# Inspect a locale pack directly
pack = misata.get_locale_pack("de_DE")
print(pack.salary_median) # 45000
print(pack.currency_symbol) # €
print(pack.top_cities[:3]) # ['Berlin', 'Hamburg', 'Munich']
print(pack.company_suffixes) # ['GmbH', 'AG', 'UG', 'KG', 'e.K.']
# Auto-detect from a story
locale = misata.detect_locale("South Korean company in Seoul with KRW salaries")
# → "ko_KR"
Constraints
Enforce business rules that hold on every generated row:
from misata.constraints import (
InequalityConstraint, # price > cost on every row
ColumnRangeConstraint, # min_price <= price <= max_price
RatioConstraint, # 70% free / 30% pro
UniqueConstraint, # no duplicate (user_id, date) pairs
SumConstraint, # total_hours per employee per day <= 8
NotNullConstraint, # no nulls in required columns
)
c = InequalityConstraint("price", ">", "cost")
df = c.apply(df)  # df is any DataFrame that has price and cost columns
Constraints can also be declared in misata.yaml — they run at generation time, not as a post-processing step.
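As a rough picture of what an inequality rule does to each row, the sketch below repairs violating rows in plain Python. It is not Misata's constraint engine: `enforce_inequality` is a hypothetical helper, and the repair strategy (swap the two values, then nudge on ties) is an assumption made for this example.

```python
import operator
import random

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def enforce_inequality(rows, col_a, op, col_b):
    """Return copies of rows in which `row[col_a] op row[col_b]` holds."""
    check = OPS[op]
    fixed = []
    for row in rows:
        row = dict(row)  # copy so the input rows stay untouched
        if not check(row[col_a], row[col_b]):
            # Assumed repair strategy: swap the two values.
            row[col_a], row[col_b] = row[col_b], row[col_a]
        if not check(row[col_a], row[col_b]):
            # Still failing means a tie under a strict operator; nudge col_a.
            row[col_a] += 0.01 if op == ">" else -0.01
        fixed.append(row)
    return fixed


rng = random.Random(0)
rows = [
    {"price": round(rng.uniform(1, 100), 2), "cost": round(rng.uniform(1, 100), 2)}
    for _ in range(1000)
]
fixed = enforce_inequality(rows, "price", ">", "cost")
print(sum(r["price"] > r["cost"] for r in fixed), "of", len(fixed), "rows satisfy price > cost")
```

Swapping keeps both values inside their original ranges, which is why it is used here instead of resampling; a generator that enforces the rule at sampling time can avoid the repair step altogether.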
Export
misata.to_parquet(tables, "data/")
misata.to_duckdb(tables, "data/dataset.duckdb")
misata.to_jsonl(tables, "data/")
Document generation
Render one document per row from any table — useful for demo datasets that need to look real end-to-end:
# Built-in templates: invoice, patient_report, transaction_receipt, user_profile
paths = misata.generate_documents(
tables, "invoice", table="orders", output_dir="/tmp/invoices", format="html"
)
# format="pdf" requires: pip install "misata[documents]"
# Custom Jinja2 template
tmpl = "<h1>Order #{{ order_id }}</h1><p>Amount: ${{ amount }}</p>"
paths = misata.generate_documents(tables, tmpl, table="orders", output_dir="/tmp/custom")
Quality and privacy analysis
bundle = misata.analyze_generation(tables, schema)
print(bundle.data_card.summary()) # row counts, null rates, type distribution
print(bundle.fidelity_report.score) # 0–1 statistical fidelity score vs. schema intent
print(bundle.privacy_report.pii_risk) # column-level PII exposure analysis
Supported domains
| Domain | Trigger keywords | Tables generated |
|---|---|---|
| SaaS | saas, subscription, mrr, churn | users, subscriptions |
| Ecommerce | ecommerce, orders, store, retail | customers, orders |
| Fintech | fintech, payments, banking, fraud | customers, accounts, transactions |
| Healthcare | healthcare, patients, doctors, clinic | doctors, patients, appointments |
| Marketplace | marketplace, sellers, buyers, listings | sellers, buyers, listings, orders |
| Logistics | logistics, shipping, drivers, routes | drivers, vehicles, routes, shipments |
No keyword match → generic single-table schema with smart column inference.
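The keyword-to-domain mapping in the table can be sketched as a simple scoring function. This is an illustrative approximation only; Misata's StoryParser may work differently. `DOMAIN_KEYWORDS` copies the trigger keywords from the table, while `detect_domain`, the prefix matching, and the "generic" fallback label are assumptions for this example.

```python
import re

# Trigger keywords copied from the table above.
DOMAIN_KEYWORDS = {
    "saas": ["saas", "subscription", "mrr", "churn"],
    "ecommerce": ["ecommerce", "orders", "store", "retail"],
    "fintech": ["fintech", "payments", "banking", "fraud"],
    "healthcare": ["healthcare", "patients", "doctors", "clinic"],
    "marketplace": ["marketplace", "sellers", "buyers", "listings"],
    "logistics": ["logistics", "shipping", "drivers", "routes"],
}


def detect_domain(story):
    """Return the domain with the most keyword hits, or 'generic' if none match."""
    words = set(re.findall(r"[a-z]+", story.lower()))
    scores = {
        domain: sum(any(w.startswith(k) for w in words) for k in keywords)
        for domain, keywords in DOMAIN_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)  # ties resolve to the first domain listed
    return best if scores[best] > 0 else "generic"


print(detect_domain("A SaaS company with 5k users, monthly subscriptions, and 20% churn"))
print(detect_domain("A fintech startup with 10k customers, fraud rate 3%, and IBAN accounts"))
print(detect_domain("A bakery with daily bread sales"))
```

Prefix matching lets "subscriptions" count as a hit for "subscription"; the last story matches no keyword and falls through to the generic case described above.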
How it works
story / YAML / dict / DB introspection
↓
StoryParser · locale detection · load_yaml_schema · schema_from_db
↓
SchemaConfig ← validate_schema() catches issues before any rows are generated
↓
DataSimulator
├─ topological sort (FK dependency order)
├─ domain priors → locale priors (salary, age, monetary)
├─ constraint engine (inequality, range, ratio, sum, unique)
├─ outcome curves ("revenue rises from 50k in Jan to 200k in Dec")
└─ RealisticTextGenerator (Faker locale + Kaggle vocabulary assets)
↓
{table_name: DataFrame}
↓
seed_database · to_parquet · to_duckdb · generate_documents
Domain priors — monetary columns get log-normal distributions. Categoricals use Zipf sampling. Blood types, country distributions, and salary bands reflect real-world statistics.
Locale priors — salary and age distributions are overridden with country-specific lognormal/normal parameters sourced from national statistics. "Brazilian fintech" in your story means salaries are sampled from the BRL distribution, not the USD one.
Outcome curves — "revenue rises from 50k in Jan to 200k in Dec" becomes exact per-month targets that constrain row-by-row generation.
Realism rules — cost is always less than price. delivered_at is always after shipped_at. Email addresses derive from first and last name columns.
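One way to picture an outcome curve is to turn the sentence into per-month targets and then rescale each month's rows so they sum to the target. The sketch below assumes a linear ramp between the two endpoints and uses hypothetical helper names (`monthly_targets`, `rows_for_target`); it is not Misata's implementation.

```python
import math
import random

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def monthly_targets(start, end, n_months=12):
    """Interpolate targets from start to end (the linear shape is an assumption)."""
    step = (end - start) / (n_months - 1)
    return [start + step * i for i in range(n_months)]


def rows_for_target(target, n_rows, rng):
    """Draw positive amounts, then rescale them so they sum to the target."""
    raw = [rng.lognormvariate(0.0, 0.5) for _ in range(n_rows)]
    scale = target / sum(raw)
    return [x * scale for x in raw]


rng = random.Random(7)
targets = monthly_targets(50_000, 200_000)  # "50k in Jan to 200k in Dec"
revenue = {m: rows_for_target(t, 200, rng) for m, t in zip(MONTHS, targets)}

for month, target in zip(MONTHS, targets):
    assert math.isclose(sum(revenue[month]), target, rel_tol=1e-9)
print(round(targets[0]), round(targets[-1]))
```

Rescaling preserves the relative spread of the sampled amounts while pinning each monthly total, so row-level values still look log-normal but the aggregate follows the stated curve.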
What makes Misata different
| | Faker | Synth | syda | SDV | Misata |
|---|---|---|---|---|---|
| No config, one line to multi-table data | — | — | — | — | Yes |
| Story auto-detects locale + country stats | — | — | — | — | Yes |
| YAML schema committed to git | — | Yes | Yes | — | Yes |
| DB introspection → generate → re-seed | — | Yes | — | Limited | Yes |
| Direct DB seeding (Postgres / MySQL / SQLite) | — | — | — | — | Yes |
| SQLAlchemy model seeding | — | — | — | — | Yes |
| Referential integrity across all FK tables | — | Yes | Yes | Yes | Yes |
| Inequality / range constraints (price > cost) | — | Limited | — | Yes | Yes |
| Aggregate target curves (monthly MRR shape) | — | — | — | — | Yes |
| Domain-realistic distributions | — | — | — | Limited | Yes |
| Multi-provider LLM (Groq / OpenAI / Claude / Gemini / Ollama) | — | — | Yes | — | Yes |
| Fully offline, no LLM required | Yes | Yes | — | Yes | Yes |
| Document generation (HTML / PDF per row) | — | — | — | — | Yes |
| Quality + privacy reports | — | — | — | Limited | Yes |
| Pure Python, no external services | Yes | — | — | Yes | Yes |
Faker generates individual fake values — not relational, no schema, no statistical accuracy.
Synth excels at schema-as-code git workflows; limited distribution control.
syda uses an LLM for every row — semantically rich but expensive, slow, and requires an API key.
SDV learns from real data — a different problem (you need real data first).
Misata generates from intent, offline by default, seeds databases directly, and now brings country-accurate statistics to every column automatically.
Performance
Measured on Apple M-series (single core, no GPU):
| Workload | Rows | Time | Throughput |
|---|---|---|---|
| Single table, lognormal | 1 000 000 | 0.06 s | ~16M rows/s |
| Star schema (5 tables, 4 FKs) | 1 055 030 | 1.54 s | ~687k rows/s |
Contributing
git clone https://github.com/rasinmuhammed/misata
cd misata
pip install -e ".[dev]"
pytest tests/
Issues and PRs welcome — github.com/rasinmuhammed/misata/issues