Python synthetic data generator for realistic multi-table test data, database seeding, and scenario simulation
import misata
tables = misata.generate("A SaaS company with 5k users, monthly subscriptions, and 20% churn")
print(tables["users"].head())
print(tables["subscriptions"].head())
That's it. Misata reads your intent, infers a relational schema, generates linked tables with referential integrity, and applies domain-realistic distributions — all without a config file.
Install
pip install misata
For LLM-assisted generation (optional):
pip install "misata[llm]"
export GROQ_API_KEY=gsk_... # or OPENAI_API_KEY
Three examples
SaaS — revenue curve + churn
import misata
tables = misata.generate(
    "A SaaS company with 5k users. Revenue rises from 50k in Jan to 200k in Dec "
    "with a dip in September. 20% churn in Q3.",
    rows=5000,
    seed=42,
)
# users, subscriptions — with exact monthly MRR targets baked in
for name, df in tables.items():
    print(f"{name}: {len(df):,} rows")
Ecommerce — multi-table with FK integrity
tables = misata.generate("An ecommerce store with customers and orders", rows=10_000)
# customers → orders (FK always holds)
assert tables["orders"]["customer_id"].isin(tables["customers"]["customer_id"]).all()
Inspect before generating
schema = misata.parse("A healthcare clinic with patients, doctors, and appointments")
print(schema.summary())
# Schema: Healthcare Dataset
# Domain: healthcare
# Tables: 3 / Total rows: 15,300
#
# Table Rows Columns
# ------------ -------- -------
# doctors 765 doctor_id, first_name, last_name, specialty, years_experience
# patients 5,000 patient_id, first_name, last_name, age, gender, blood_type ...
# appointments 10,000 appointment_id, patient_id, doctor_id, appointment_date ...
#
# Relationships (2):
# patients.patient_id → appointments.patient_id
# doctors.doctor_id → appointments.doctor_id
tables = misata.generate_from_schema(schema)
Supported domains
| Domain | Trigger keywords | Tables generated |
|---|---|---|
| SaaS | saas, subscription, mrr, churn | users, subscriptions |
| Ecommerce | ecommerce, orders, store, retail | customers, orders |
| Fintech | fintech, payments, banking, fraud | customers, accounts, transactions |
| Healthcare | healthcare, patients, doctors, clinic | doctors, patients, appointments |
| Marketplace | marketplace, sellers, buyers, listings | sellers, buyers, listings, orders |
| Logistics | logistics, shipping, drivers, routes | drivers, vehicles, routes, shipments |
| Pharma | pharma, clinical, trials | research_projects, timesheets |
If no keyword matches, Misata falls back to a generic single-table schema and emits a warning.
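To make the keyword-to-domain mapping concrete, here is a minimal sketch of how such a lookup could work. It is not Misata's implementation: the scoring rule (count substring hits, pick the highest), the `detect_domain` name, and the `"generic"` label are assumptions for illustration. Only the keyword lists are taken from the table above.

```python
import warnings

# Keyword lists copied from the "Trigger keywords" column above.
DOMAIN_KEYWORDS = {
    "saas": ["saas", "subscription", "mrr", "churn"],
    "ecommerce": ["ecommerce", "orders", "store", "retail"],
    "fintech": ["fintech", "payments", "banking", "fraud"],
    "healthcare": ["healthcare", "patients", "doctors", "clinic"],
    "marketplace": ["marketplace", "sellers", "buyers", "listings"],
    "logistics": ["logistics", "shipping", "drivers", "routes"],
    "pharma": ["pharma", "clinical", "trials"],
}


def detect_domain(story: str) -> str:
    """Hypothetical helper: return the domain with the most keyword hits."""
    text = story.lower()
    scores = {
        domain: sum(keyword in text for keyword in keywords)
        for domain, keywords in DOMAIN_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        # Assumed fallback label; mirrors the "generic schema with a warning" behaviour.
        warnings.warn("No domain keyword matched; using a generic schema.")
        return "generic"
    return best


print(detect_domain("A healthcare clinic with patients, doctors, and appointments"))
```

Simple substring matching is enough for a sketch, but it would also match keywords embedded in longer words, so a real parser would likely tokenize first.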
What makes Misata different
| | Faker | SDV | Misata |
|---|---|---|---|
| One-liner API | No | No | Yes |
| Story-driven schema inference | No | No | Yes |
| Exact monthly aggregate targets | No | No | Yes |
| Referential integrity | No | Yes | Yes |
| Domain-realistic distributions | No | Limited | Yes |
| Pre-generation schema validation | No | No | Yes |
| Streaming-safe for large datasets | No | No | Yes |
The core difference: Faker generates individual fake values. SDV learns from real data. Misata generates from intent — you describe a business, and it builds a logically consistent world.
How it works
story / intent
↓
StoryParser ←→ domain priors (lognormal for MRR, Zipf for categories…)
↓
SchemaConfig ← validate_schema() catches problems before generation
↓
DataSimulator ← topological sort, FK sampling, realism rules
↓
{table: DataFrame}
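The DataSimulator step in the diagram mentions a topological sort and FK sampling. The sketch below shows the general idea with the standard library only: parents are generated before children, and each child row samples its foreign keys from rows that already exist. The table names come from the healthcare example. `generation_order`, `generate_tables`, and the list-of-dicts row layout are illustrative assumptions, not Misata internals.

```python
import random
from graphlib import TopologicalSorter

# Child table -> set of parent tables it references (healthcare example).
DEPENDENCIES = {
    "doctors": set(),
    "patients": set(),
    "appointments": {"patients", "doctors"},
}


def generation_order(deps):
    """Return table names so that every parent comes before its children."""
    return list(TopologicalSorter(deps).static_order())


def generate_tables(row_counts, seed=42):
    """Generate primary keys, sampling each foreign key from the parent table."""
    rng = random.Random(seed)
    tables = {}
    for name in generation_order(DEPENDENCIES):
        rows = []
        for i in range(1, row_counts[name] + 1):
            # "doctors" -> "doctor_id", "appointments" -> "appointment_id"
            row = {f"{name[:-1]}_id": i}
            for parent in sorted(DEPENDENCIES[name]):
                fk = f"{parent[:-1]}_id"
                # The parent table already exists, so the FK always resolves.
                row[fk] = rng.choice(tables[parent])[fk]
            rows.append(row)
        tables[name] = rows
    return tables


tables = generate_tables({"doctors": 5, "patients": 20, "appointments": 50})
print(generation_order(DEPENDENCIES))
print(tables["appointments"][0])
```

Because children only ever sample keys from rows already generated, referential integrity holds by construction rather than by a check after the fact.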
Domain priors — monetary columns automatically get log-normal distributions. Categorical columns get Zipf sampling so one value dominates naturally. Blood types get real-world probabilities.
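As a rough illustration of these two priors, the sketch below draws log-normal amounts and Zipf-weighted categories with Python's `random` module. The function names, the median and sigma values, and the plan names are assumptions chosen for the example; the parameters Misata uses are not stated here.

```python
import math
import random
from collections import Counter


def lognormal_amounts(n, median=50.0, sigma=0.8, seed=42):
    """Draw n positive, right-skewed amounts with the given median."""
    rng = random.Random(seed)
    mu = math.log(median)  # the median of a log-normal distribution is exp(mu)
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]


def zipf_choices(categories, n, exponent=1.0, seed=42):
    """Sample categories with weight 1 / rank**exponent, so the first dominates."""
    rng = random.Random(seed)
    weights = [1 / (rank ** exponent) for rank in range(1, len(categories) + 1)]
    return rng.choices(categories, weights=weights, k=n)


amounts = lognormal_amounts(10_000)
plans = zipf_choices(["basic", "pro", "team", "enterprise"], 10_000)

print(f"min={min(amounts):.2f} max={max(amounts):.2f}")
print(Counter(plans).most_common())
```

With four categories and an exponent of 1, the first category carries weight 1 out of a total of about 2.08, so it accounts for roughly 48% of draws.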
Outcome curves — "revenue rises from 50k in Jan to 200k in Dec" becomes exact per-month targets that constrain generation row by row.
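One way to picture this is in two steps: interpolate a target per month, then split each target across rows so the rows sum to it exactly. The sketch below does that with linear interpolation and integer cents. It ignores the September dip from the earlier example, and `monthly_targets` and `rows_hitting_target` are hypothetical helpers, not part of the Misata API.

```python
import random


def monthly_targets(start, end, months=12):
    """Linearly interpolate a target for each month from start to end."""
    step = (end - start) / (months - 1)
    return [round(start + step * m) for m in range(months)]


def rows_hitting_target(target_cents, n_rows, rng):
    """Split a target (in integer cents) across n_rows so the sum is exact."""
    weights = [rng.lognormvariate(0, 0.5) for _ in range(n_rows)]
    total = sum(weights)
    amounts = [int(target_cents * w / total) for w in weights]
    # Flooring loses a few cents; the last row absorbs the remainder.
    amounts[-1] += target_cents - sum(amounts)
    return amounts


rng = random.Random(42)
targets = monthly_targets(50_000, 200_000)
rows_by_month = [rows_hitting_target(t * 100, 400, rng) for t in targets]

for month, (target, rows) in enumerate(zip(targets, rows_by_month), start=1):
    print(f"month {month:2d}: target={target:,} generated={sum(rows) / 100:,.2f}")
```

Working in integer cents avoids floating-point drift, which is what makes an exact match to the target possible.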
Realism rules — cost is always less than price. delivered_at is always after shipped_at. Email addresses derive from first and last name.
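The three rules above can be enforced by construction rather than by rejecting bad rows. The sketch below derives each dependent value from the one it must respect. The `make_order` helper, the value ranges, the name lists, and the `example.com` domain are assumptions for illustration only.

```python
import random
from datetime import datetime, timedelta

FIRST_NAMES = ["Ada", "Grace", "Alan", "Linus"]
LAST_NAMES = ["Lovelace", "Hopper", "Turing", "Torvalds"]


def make_order(rng, base=datetime(2024, 1, 1)):
    """Build one row in which every realism rule holds by construction."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)

    price = round(rng.uniform(5, 500), 2)
    # Cost is a fraction of price, so it is always lower.
    cost = round(price * rng.uniform(0.3, 0.9), 2)

    shipped_at = base + timedelta(hours=rng.randint(1, 72))
    # Delivery is an offset from shipping, so it is always later.
    delivered_at = shipped_at + timedelta(hours=rng.randint(1, 240))

    return {
        "first_name": first,
        "last_name": last,
        # Email is derived from the name instead of generated independently.
        "email": f"{first.lower()}.{last.lower()}@example.com",
        "price": price,
        "cost": cost,
        "shipped_at": shipped_at,
        "delivered_at": delivered_at,
    }


rng = random.Random(42)
orders = [make_order(rng) for _ in range(1000)]
print(orders[0])
```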
Full API
import misata
story = "A SaaS company with 5k users, monthly subscriptions, and 20% churn"
# One-liner
tables = misata.generate(story, rows=10_000, seed=42)
# Two-step
schema = misata.parse(story, rows=10_000)
print(schema.summary())
tables = misata.generate_from_schema(schema)
# Validate a schema before generation
misata.validate_schema(schema) # raises SchemaValidationError with all issues listed
# LLM-powered (requires misata[llm] + API key)
from misata import LLMSchemaGenerator
gen = LLMSchemaGenerator(provider="groq") # or "openai", "ollama"
schema = gen.generate_from_story("A fraud detection dataset with 2% positive rate")
tables = misata.generate_from_schema(schema)
Performance
Measured on Apple M-series (single core, no GPU):
| Workload | Rows | Time | Rows/s |
|---|---|---|---|
| Single table, lognormal | 1,000,000 | 0.06s | ~16M |
| Star schema (5 tables, 4 FKs) | 1,055,030 | 1.54s | ~687k |
Contributing
git clone https://github.com/rasinmuhammed/misata
cd misata
pip install -e ".[dev]"
pytest tests/
Issues and PRs are welcome: github.com/rasinmuhammed/misata/issues