Python synthetic data generator for realistic multi-table test data, database seeding, and scenario simulation
import misata
tables = misata.generate("A SaaS company with 5k users, monthly subscriptions, and 20% churn")
print(tables["users"].head())
print(tables["subscriptions"].head())
That's it. Misata reads your intent, infers a relational schema, generates linked tables with referential integrity, and applies domain-realistic distributions — all without a config file.
Install
pip install misata
For LLM-assisted generation (optional — pick any provider):
pip install "misata[llm]"
export GROQ_API_KEY=gsk_... # Groq (fast, free tier)
export OPENAI_API_KEY=sk-... # OpenAI
export ANTHROPIC_API_KEY=sk-ant-... # Anthropic / Claude
# Gemini: set GOOGLE_API_KEY
# Ollama: no key needed — runs locally
For PDF document output (optional):
pip install "misata[documents]"
Three examples
SaaS — revenue curve + churn
import misata
tables = misata.generate(
    "A SaaS company with 5k users. Revenue rises from 50k in Jan to 200k in Dec "
    "with a dip in September. 20% churn in Q3.",
    rows=5000,
    seed=42,
)
# users, subscriptions — with exact monthly MRR targets baked in
for name, df in tables.items():
    print(f"{name}: {len(df):,} rows")
Ecommerce — multi-table with FK integrity
tables = misata.generate("An ecommerce store with customers and orders", rows=10_000)
# customers → orders (FK always holds)
assert tables["orders"]["customer_id"].isin(tables["customers"]["customer_id"]).all()
Inspect before generating
schema = misata.parse("A healthcare clinic with patients, doctors, and appointments")
print(schema.summary())
# Schema: Healthcare Dataset
# Domain: healthcare
# Tables: 3 / Total rows: 15,300
#
# Table        Rows     Columns
# ------------ -------- -------
# doctors      765      doctor_id, first_name, last_name, specialty, years_experience
# patients     5,000    patient_id, first_name, last_name, age, gender, blood_type ...
# appointments 10,000   appointment_id, patient_id, doctor_id, appointment_date ...
#
# Relationships (2):
#   patients.patient_id → appointments.patient_id
#   doctors.doctor_id → appointments.doctor_id
tables = misata.generate_from_schema(schema)
Supported domains
| Domain | Trigger keywords | Tables generated |
|---|---|---|
| SaaS | saas, subscription, mrr, churn | users, subscriptions |
| Ecommerce | ecommerce, orders, store, retail | customers, orders |
| Fintech | fintech, payments, banking, fraud | customers, accounts, transactions |
| Healthcare | healthcare, patients, doctors, clinic | doctors, patients, appointments |
| Marketplace | marketplace, sellers, buyers, listings | sellers, buyers, listings, orders |
| Logistics | logistics, shipping, drivers, routes | drivers, vehicles, routes, shipments |
| Pharma | pharma, clinical, trials | research_projects, timesheets |
If no keyword matches, Misata falls back to a generic single-table schema and emits a warning.
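The keyword table above suggests a simple lookup with a fallback. The following is a minimal sketch of how such matching could work, not Misata's actual implementation: `DOMAIN_KEYWORDS` and `detect_domain` are hypothetical names, and only the keyword lists are taken from the table.

```python
import re
import warnings

# Trigger keywords copied from the table above; the dict itself is hypothetical.
DOMAIN_KEYWORDS = {
    "saas": ["saas", "subscription", "mrr", "churn"],
    "ecommerce": ["ecommerce", "orders", "store", "retail"],
    "fintech": ["fintech", "payments", "banking", "fraud"],
    "healthcare": ["healthcare", "patients", "doctors", "clinic"],
    "marketplace": ["marketplace", "sellers", "buyers", "listings"],
    "logistics": ["logistics", "shipping", "drivers", "routes"],
    "pharma": ["pharma", "clinical", "trials"],
}


def detect_domain(story: str) -> str:
    """Return the domain with the most keyword hits, or 'generic' if none match."""
    words = set(re.findall(r"[a-z]+", story.lower()))
    scores = {
        domain: sum(1 for keyword in keywords if keyword in words)
        for domain, keywords in DOMAIN_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        # Mirrors the documented fallback: generic schema plus a warning.
        warnings.warn("No domain keyword matched; using a generic schema.")
        return "generic"
    return best


print(detect_domain("A healthcare clinic with patients, doctors, and appointments"))
```

This sketch matches whole words only, so "subscriptions" does not count as a hit for "subscription", and ties go to whichever domain is listed first. A real parser would likely need stemming and a tie-breaking rule.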
What makes Misata different
| Feature | Faker | SDV | Misata |
|---|---|---|---|
| One-liner API | No | No | Yes |
| Story-driven schema inference | No | No | Yes |
| Exact monthly aggregate targets | No | No | Yes |
| Referential integrity | No | Yes | Yes |
| Domain-realistic distributions | No | Limited | Yes |
| Pre-generation schema validation | No | No | Yes |
| Multi-provider LLM (OpenAI / Groq / Anthropic / Gemini / Ollama) | No | No | Yes |
| Document generation (HTML / PDF / Markdown per row) | No | No | Yes |
| Custom callable generators per column | No | No | Yes |
| Kaggle vocabulary enrichment (zero-token realism) | No | No | Yes |
| Streaming-safe for large datasets | No | No | Yes |
The core difference: Faker generates individual fake values. SDV learns from real data. Misata generates from intent — you describe a business, and it builds a logically consistent world.
How it works
story / intent
↓
StoryParser ←→ domain priors (lognormal for MRR, Zipf for categories…)
↓
SchemaConfig ← validate_schema() catches problems before generation
↓
DataSimulator ← topological sort, FK sampling, realism rules
↓
{table: DataFrame}
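The "topological sort, FK sampling" step in the diagram can be illustrated with the standard library. This is a sketch under assumptions, not the `DataSimulator` code: the `dependencies` mapping, the row counts, and the rule that a key column is the singular table name plus `_id` are all hypothetical, with table names borrowed from the healthcare example.

```python
import random
from graphlib import TopologicalSorter

# Each table maps to the set of parent tables it references via foreign keys.
dependencies = {
    "doctors": set(),
    "patients": set(),
    "appointments": {"patients", "doctors"},
}

# static_order() yields every parent before any table that depends on it.
order = list(TopologicalSorter(dependencies).static_order())

rng = random.Random(42)
row_counts = {"doctors": 5, "patients": 20, "appointments": 50}
tables = {}

for name in order:
    rows = []
    for i in range(1, row_counts[name] + 1):
        # Assumed naming rule: "patients" -> "patient_id".
        row = {f"{name[:-1]}_id": i}
        # Sample each foreign key from the already-generated parent table,
        # so every child row points at a parent that exists.
        for parent in sorted(dependencies[name]):
            parent_key = f"{parent[:-1]}_id"
            row[parent_key] = rng.choice(tables[parent])[parent_key]
        rows.append(row)
    tables[name] = rows

print(order)
print(tables["appointments"][0])
```

Generating parents first is what makes referential integrity hold by construction: a child table never needs a key that has not been created yet.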
Domain priors — monetary columns automatically get log-normal distributions. Categorical columns get Zipf sampling so one value dominates naturally. Blood types get real-world probabilities.
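To make the two priors concrete, here is a small standard-library sketch of log-normal and Zipf-weighted sampling. The parameters (`mu`, `sigma`, the exponent `s`) and the category names are illustrative assumptions, not Misata's defaults.

```python
import random
import statistics
from collections import Counter

rng = random.Random(42)

# Log-normal draws for a monetary column: always positive, long right tail.
mu, sigma = 4.0, 0.8  # illustrative values only
amounts = [rng.lognormvariate(mu, sigma) for _ in range(10_000)]

# Zipf-style weights: the value at rank k gets weight 1 / k**s,
# so the first category dominates and the rest tail off.
categories = ["electronics", "clothing", "home", "toys", "books"]
s = 1.2
weights = [1 / (rank ** s) for rank in range(1, len(categories) + 1)]
samples = rng.choices(categories, weights=weights, k=10_000)
counts = Counter(samples)

print(f"mean={statistics.mean(amounts):.2f} median={statistics.median(amounts):.2f}")
print(counts.most_common())
```

The mean sitting above the median is the signature of the right-skewed shape that makes monetary columns look realistic. The same `rng.choices` call with fixed weights would cover a column such as blood type.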
Outcome curves — "revenue rises from 50k in Jan to 200k in Dec" becomes exact per-month targets that constrain generation row by row.
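One way to hit a monthly total exactly is to draw row amounts freely and then rescale them to the target. The sketch below assumes a plain linear ramp from 50k to 200k (it ignores the September dip) and works in integer cents to avoid float drift; `monthly_targets` and `amounts_for_target` are hypothetical helpers, not Misata functions.

```python
import random


def monthly_targets(start: float, end: float, months: int = 12) -> list[float]:
    """Linearly interpolate monthly totals from start to end."""
    step = (end - start) / (months - 1)
    return [start + step * m for m in range(months)]


def amounts_for_target(target_cents: int, n_rows: int, rng: random.Random) -> list[int]:
    """Draw n_rows positive amounts (in cents) that sum exactly to target_cents."""
    raw = [rng.lognormvariate(0.0, 0.5) for _ in range(n_rows)]
    scale = target_cents / sum(raw)
    cents = [max(1, int(x * scale)) for x in raw]
    # Truncation leaves a small remainder; fold it into the largest row.
    cents[cents.index(max(cents))] += target_cents - sum(cents)
    return cents


rng = random.Random(42)
targets = monthly_targets(50_000, 200_000)
monthly_rows = {
    month: amounts_for_target(round(target * 100), n_rows=400, rng=rng)
    for month, target in enumerate(targets, start=1)
}

for month, target in enumerate(targets, start=1):
    print(month, round(target, 2), sum(monthly_rows[month]) / 100)
```

Rescaling keeps the shape of the per-row distribution while pinning the aggregate, which is why the monthly totals can be exact even though individual rows are random.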
Realism rules — cost is always less than price. delivered_at is always after shipped_at. Email addresses derive from first and last name.
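Rules like these are easiest to guarantee by deriving the dependent column from the one it depends on, rather than generating both independently and repairing violations afterwards. The following sketch assumes that approach; `make_order`, the value ranges, and the `example.com` domain are hypothetical.

```python
import random
from datetime import datetime, timedelta


def make_order(rng: random.Random, first_name: str, last_name: str) -> dict:
    """Build one order row whose columns satisfy the three rules by construction."""
    price = round(rng.uniform(5.0, 500.0), 2)
    # Cost is a fraction of price, so it can never reach or exceed it.
    cost = round(price * rng.uniform(0.3, 0.9), 2)
    shipped_at = datetime(2024, 1, 1) + timedelta(days=rng.randrange(365))
    # Delivery is shipped_at plus a strictly positive delay.
    delivered_at = shipped_at + timedelta(hours=rng.randint(12, 240))
    # Email is derived from the name columns instead of being drawn separately.
    email = f"{first_name}.{last_name}@example.com".lower()
    return {
        "price": price,
        "cost": cost,
        "shipped_at": shipped_at,
        "delivered_at": delivered_at,
        "email": email,
    }


rng = random.Random(0)
orders = [make_order(rng, "Ada", "Lovelace") for _ in range(1000)]
print(orders[0])
```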
Full API
import misata
# ── Core generation ──────────────────────────────────────────────────────────
# One-liner: story → DataFrames
tables = misata.generate(story, rows=10_000, seed=42)
# Two-step: inspect schema first
schema = misata.parse(story, rows=10_000)
print(schema.summary())
tables = misata.generate_from_schema(schema)
# Append more rows to an existing dataset (IDs auto-offset, no collisions)
tables = misata.generate_more(tables, schema, n=5_000)
# Validate a schema before generation
misata.validate_schema(schema) # raises SchemaValidationError with all issues listed
# ── Import your own schema ───────────────────────────────────────────────────
schema = misata.from_dict_schema({
    "customers": {
        "id": {"type": "integer", "primary_key": True},
        "email": {"type": "email"},
        "plan": {"type": "string", "enum": ["free", "pro", "enterprise"]},
    },
    "orders": {
        "id": {"type": "integer", "primary_key": True},
        "customer_id": {"type": "integer",
                        "foreign_key": {"table": "customers", "column": "id"}},
        "amount": {"type": "float", "min": 1.0, "max": 999.0},
    },
}, row_count=5_000)
# Verify referential integrity after generation or manual edits
report = misata.verify_integrity(tables, schema)
report.raise_if_invalid() # raises ValueError if orphaned FK values exist
# ── Custom generators ────────────────────────────────────────────────────────
# Override any column with a Python callable
tables = misata.generate_from_schema(schema, custom_generators={
    "orders": {
        # vectorized: receives the partial DataFrame, returns an array
        "amount": lambda df, ctx: (df["plan"] == "enterprise").map({True: 999, False: 49}),
        # per-row: receives one row dict, returns a scalar
        "note": lambda row, col, ctx: f"Order for plan {row.get('plan', '?')}",
    }
})
# ── Multi-provider LLM ───────────────────────────────────────────────────────
from misata import LLMSchemaGenerator
# Groq (fast, free tier)
gen = LLMSchemaGenerator(provider="groq")
# Anthropic Claude — uses native SDK, no JSON-mode hack needed
gen = LLMSchemaGenerator(provider="anthropic", model="claude-haiku-4-5-20251001")
# Gemini
gen = LLMSchemaGenerator(provider="gemini", model="gemini-2.0-flash")
# Ollama — fully local, no API key
gen = LLMSchemaGenerator(provider="ollama", model="llama3")
schema = gen.generate_from_story("A fraud detection dataset with 2% positive rate")
tables = misata.generate_from_schema(schema)
# ── Document generation ──────────────────────────────────────────────────────
# Built-in templates — no template file needed
paths = misata.generate_documents(tables, "invoice",
                                  table="orders", output_dir="/tmp/invoices")
# Auto-detect template from column names
paths = misata.generate_documents(tables, "auto",
                                  output_dir="/tmp/docs", format="html")
# Custom Jinja2 template string
html_tmpl = "<h1>Order #{{ order_id }}</h1><p>Amount: ${{ amount }}</p>"
paths = misata.generate_documents(tables, html_tmpl,
                                  table="orders", output_dir="/tmp/custom")
# PDF output (requires pip install "misata[documents]")
paths = misata.generate_documents(tables, "invoice",
                                  table="orders", output_dir="/tmp/pdfs",
                                  format="pdf")
# See all available built-in templates
misata.list_document_templates()
# ['generic', 'invoice', 'patient_report', 'transaction_receipt', 'user_profile']
# ── Kaggle vocabulary enrichment ─────────────────────────────────────────────
# One-time: populate real-world vocabulary for a domain (requires pip install kaggle)
result = misata.enrich_from_kaggle("ecommerce")
# EnrichmentResult(domain='ecommerce', datasets_ingested=1, assets_added=3, status='ok')
# All future generate() calls use the enriched vocabulary automatically
tables = misata.generate("An ecommerce store with 5k orders")
# Bring your own CSV — no Kaggle account needed
misata.ingest_csv_vocab("~/data/companies.csv", domain="fintech",
                        column_map={"CompanyName": "company_name", "City": "city"})
# Check what's stored
print(misata.kaggle_status())
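The integrity check in the API listing above is described as failing on orphaned FK values. As a rough illustration of what "orphaned" means, here is a library-free sketch over lists of dicts; `find_orphans` is a hypothetical helper and says nothing about how `verify_integrity` is implemented.

```python
def find_orphans(child_rows, fk_column, parent_rows, pk_column):
    """Return the foreign-key values in child_rows that have no matching parent."""
    parent_keys = {row[pk_column] for row in parent_rows}
    child_keys = {row[fk_column] for row in child_rows}
    return sorted(child_keys - parent_keys)


customers = [{"id": 1}, {"id": 2}, {"id": 3}]
orders = [
    {"id": 10, "customer_id": 1},
    {"id": 11, "customer_id": 3},
    {"id": 12, "customer_id": 7},  # customer 7 does not exist
]

orphans = find_orphans(orders, "customer_id", customers, "id")
print(orphans)  # [7]
```

A check like this is most useful after manual edits, such as deleting parent rows, since freshly generated tables should already be consistent.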
Performance
Measured on Apple M-series (single core, no GPU):
| Workload | Rows | Time | Rows/s |
|---|---|---|---|
| Single table, lognormal | 1,000,000 | 0.06s | ~16M |
| Star schema (5 tables, 4 FKs) | 1,055,030 | 1.54s | ~687k |
Run the examples
pip install misata pandas numpy
# SaaS: all 12 monthly MRR targets hit exactly
python examples/saas_revenue_curve.py
# Fintech: FICO distribution matches real-world, fraud rate = 2.00%
python examples/fintech_fraud_detection.py
# Healthcare: ABO/Rh blood types, 2 FK edges, 0 orphans
python examples/healthcare_multi_table.py
# Ecommerce: seasonal revenue curve, power-law order amounts
python examples/ecommerce_seasonal.py
Contributing
git clone https://github.com/rasinmuhammed/misata
cd misata
pip install -e ".[dev]"
pytest tests/
Issues and PRs are welcome: github.com/rasinmuhammed/misata/issues