Python synthetic data generator for realistic multi-table test data, database seeding, and scenario simulation

Misata

Synthetic data from intent — not from config files

import misata

tables = misata.generate("A SaaS company with 5k users, monthly subscriptions, and 20% churn")

print(tables["users"].head())
print(tables["subscriptions"].head())

That's it. Misata reads your intent, infers a relational schema, generates linked tables with referential integrity, and applies domain-realistic distributions — all without a config file.

Install

pip install misata

For LLM-assisted generation (optional — pick any provider):

pip install "misata[llm]"
export GROQ_API_KEY=gsk_...        # Groq (fast, free tier)
export OPENAI_API_KEY=sk-...       # OpenAI
export ANTHROPIC_API_KEY=sk-ant-... # Anthropic / Claude
# Gemini: set GOOGLE_API_KEY
# Ollama: no key needed — runs locally

For PDF document output (optional):

pip install "misata[documents]"

Three examples

SaaS — revenue curve + churn

import misata

tables = misata.generate(
    "A SaaS company with 5k users. Revenue rises from 50k in Jan to 200k in Dec "
    "with a dip in September. 20% churn in Q3.",
    rows=5000,
    seed=42,
)

# users, subscriptions — with exact monthly MRR targets baked in
for name, df in tables.items():
    print(f"{name}: {len(df):,} rows")

Ecommerce — multi-table with FK integrity

tables = misata.generate("An ecommerce store with customers and orders", rows=10_000)

# customers → orders (FK always holds)
assert tables["orders"]["customer_id"].isin(tables["customers"]["customer_id"]).all()
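The assertion above says whether integrity holds, but not which rows break it when it fails. The following is a minimal sketch of an orphan lookup, written in plain Python so it runs without Misata or pandas. The helper name `find_orphans` and the sample rows are hypothetical and are not part of Misata's API or output.

```python
def find_orphans(child_rows, parent_rows, fk, pk):
    """Return the child rows whose foreign key has no matching parent key."""
    parent_keys = {row[pk] for row in parent_rows}
    return [row for row in child_rows if row[fk] not in parent_keys]


# Illustrative rows only, not Misata output.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": 99},  # deliberately orphaned
]

orphans = find_orphans(orders, customers, fk="customer_id", pk="customer_id")
print(orphans)
```

With DataFrames, the equivalent filter is `orders[~orders["customer_id"].isin(customers["customer_id"])]`.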

Inspect before generating

schema = misata.parse("A healthcare clinic with patients, doctors, and appointments")
print(schema.summary())
# Schema: Healthcare Dataset
# Domain: healthcare
# Tables: 3  /  Total rows: 15,765
#
#   Table            Rows  Columns
#   ------------ --------  -------
#   doctors           765  doctor_id, first_name, last_name, specialty, years_experience
#   patients        5,000  patient_id, first_name, last_name, age, gender, blood_type ...
#   appointments   10,000  appointment_id, patient_id, doctor_id, appointment_date ...
#
#   Relationships (2):
#     patients.patient_id → appointments.patient_id
#     doctors.doctor_id → appointments.doctor_id

tables = misata.generate_from_schema(schema)

Supported domains

Domain	Trigger keywords	Tables generated
SaaS	saas, subscription, mrr, churn	users, subscriptions
Ecommerce	ecommerce, orders, store, retail	customers, orders
Fintech	fintech, payments, banking, fraud	customers, accounts, transactions
Healthcare	healthcare, patients, doctors, clinic	doctors, patients, appointments
Marketplace	marketplace, sellers, buyers, listings	sellers, buyers, listings, orders
Logistics	logistics, shipping, drivers, routes	drivers, vehicles, routes, shipments
Pharma	pharma, clinical, trials	research_projects, timesheets

If no keyword matches, Misata falls back to a generic single-table schema and emits a warning.
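To make the keyword-to-domain mapping concrete, here is a minimal sketch of keyword-based detection with a generic fallback. The keyword lists are copied from the table above. The function `detect_domain`, the substring matching, and the "most hits wins" scoring are assumptions for illustration, since the README does not describe how Misata scores matches.

```python
import warnings

# Trigger keywords copied from the "Supported domains" table.
DOMAIN_KEYWORDS = {
    "saas": ["saas", "subscription", "mrr", "churn"],
    "ecommerce": ["ecommerce", "orders", "store", "retail"],
    "fintech": ["fintech", "payments", "banking", "fraud"],
    "healthcare": ["healthcare", "patients", "doctors", "clinic"],
    "marketplace": ["marketplace", "sellers", "buyers", "listings"],
    "logistics": ["logistics", "shipping", "drivers", "routes"],
    "pharma": ["pharma", "clinical", "trials"],
}


def detect_domain(story):
    """Pick the domain with the most keyword hits, or 'generic' if none hit."""
    text = story.lower()
    hits = {
        domain: sum(keyword in text for keyword in keywords)
        for domain, keywords in DOMAIN_KEYWORDS.items()
    }
    best = max(hits, key=hits.get)  # ties resolve to the first domain listed
    if hits[best] == 0:
        warnings.warn("no domain keyword matched; using a generic schema")
        return "generic"
    return best


print(detect_domain("A SaaS company with monthly subscriptions and churn"))
```

Plain substring matching is crude (a keyword can match inside a longer word), so a real parser would likely tokenize the story first.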

What makes Misata different

Feature	Faker	SDV	Misata
One-liner API	No	No	Yes
Story-driven schema inference	No	No	Yes
Exact monthly aggregate targets	No	No	Yes
Referential integrity	No	Yes	Yes
Domain-realistic distributions	No	Limited	Yes
Pre-generation schema validation	No	No	Yes
Multi-provider LLM (OpenAI / Groq / Anthropic / Gemini / Ollama)	No	No	Yes
Document generation (HTML / PDF / Markdown per row)	No	No	Yes
Custom callable generators per column	No	No	Yes
Kaggle vocabulary enrichment (zero-token realism)	No	No	Yes
Streaming-safe for large datasets	No	No	Yes

The core difference: Faker generates individual fake values. SDV learns from real data. Misata generates from intent — you describe a business, and it builds a logically consistent world.

How it works

story / intent
      ↓
 StoryParser  ←→  domain priors (lognormal for MRR, Zipf for categories…)
      ↓
 SchemaConfig    ← validate_schema() catches problems before generation
      ↓
 DataSimulator   ← topological sort, FK sampling, realism rules
      ↓
 {table: DataFrame}
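The DataSimulator step mentions a topological sort and FK sampling. The sketch below shows the general technique with the standard library: parent tables are ordered before the tables that reference them, and child foreign keys are then drawn from keys that already exist. It uses the healthcare relationships from the earlier example. The variable names and row counts are illustrative, and this is not Misata's actual implementation.

```python
import random
from graphlib import TopologicalSorter

# Map each table to the tables it references through foreign keys.
# appointments references both patients and doctors.
depends_on = {
    "doctors": set(),
    "patients": set(),
    "appointments": {"patients", "doctors"},
}

# static_order() yields every table after all of its dependencies.
generation_order = list(TopologicalSorter(depends_on).static_order())
print(generation_order)

# FK sampling: once parent keys exist, child rows pick from them,
# so every foreign key value is valid by construction.
rng = random.Random(0)
patient_ids = list(range(1, 6))
appointment_patient_ids = [rng.choice(patient_ids) for _ in range(10)]
print(appointment_patient_ids)
```

Sampling from existing parent keys, rather than generating child keys independently and repairing them afterwards, means no orphan check is needed to keep the data consistent.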

Domain priors — monetary columns automatically get log-normal distributions. Categorical columns get Zipf sampling so one value dominates naturally. Blood types get real-world probabilities.
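As a rough illustration of the two distribution shapes named above, the following sketch draws log-normal amounts and Zipf-weighted categories using only the standard library. The parameters (4.0, 0.8, and the exponent 1.2) are assumptions chosen for the demo, not Misata's defaults.

```python
import random
import statistics
from collections import Counter

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Log-normal draws: strictly positive and right-skewed, a common shape for
# monetary values. 4.0 and 0.8 are the mean and standard deviation of the
# underlying normal distribution (illustrative values).
amounts = [rng.lognormvariate(4.0, 0.8) for _ in range(10_000)]
mean_amount = statistics.mean(amounts)
median_amount = statistics.median(amounts)

# Zipf-style weights: the value at rank k gets weight 1 / k**s, so the
# first category dominates. The exponent s is an assumption.
plans = ["free", "pro", "enterprise"]
s = 1.2
weights = [1 / rank**s for rank in range(1, len(plans) + 1)]
plan_counts = Counter(rng.choices(plans, weights=weights, k=10_000))

print(f"mean={mean_amount:.1f} median={median_amount:.1f}")
print(plan_counts.most_common())
```

The mean sitting above the median is the signature of the right skew: a few large amounts pull the average up while most values stay small.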

Outcome curves — "revenue rises from 50k in Jan to 200k in Dec" becomes exact per-month targets that constrain generation row by row.
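One way to picture how a sentence like this could turn into row-level constraints is sketched below: interpolate a target per month, then rescale randomly drawn amounts so each month sums to its target. The helper names, the linear interpolation, and the integer-cents rounding are assumptions for illustration. The sketch ignores the September dip from the earlier example and is not Misata's algorithm.

```python
import random


def monthly_targets(start, end, months=12):
    """Linearly interpolate one target per month from start to end."""
    step = (end - start) / (months - 1)
    return [round(start + step * i) for i in range(months)]


def amounts_hitting_target(target_cents, n_rows, rng):
    """Draw skewed raw amounts, then rescale so they sum to target_cents."""
    raw = [rng.lognormvariate(0.0, 0.5) for _ in range(n_rows)]
    scale = target_cents / sum(raw)
    cents = [int(value * scale) for value in raw]  # truncate to whole cents
    cents[-1] += target_cents - sum(cents)  # last row absorbs the remainder
    return cents


targets = monthly_targets(50_000, 200_000)  # Jan .. Dec
rng = random.Random(42)
january_rows = amounts_hitting_target(targets[0] * 100, n_rows=400, rng=rng)

print(targets)
print(sum(january_rows) / 100)
```

Working in integer cents avoids floating-point drift, so the monthly sum matches the target exactly rather than approximately.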

Realism rules — cost is always less than price. delivered_at is always after shipped_at. Email addresses derive from first and last name.
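Rules like these are easiest to guarantee by deriving the dependent field from the one it depends on, instead of generating both independently and filtering. The sketch below shows that idea for the three rules named above. The function `make_order_row`, the value ranges, and the `example.com` domain are hypothetical and do not reflect how Misata implements its rules.

```python
import random
from datetime import datetime, timedelta

rng = random.Random(7)


def make_order_row(first_name, last_name):
    """Build one row in which each dependent field is derived from its source."""
    price = round(rng.uniform(10, 500), 2)
    # cost is a fraction (at most 90%) of price, so it is always lower
    cost = round(price * rng.uniform(0.3, 0.9), 2)
    shipped_at = datetime(2025, 1, 1) + timedelta(days=rng.randint(0, 364))
    # delivery is shipping time plus a positive delay
    delivered_at = shipped_at + timedelta(hours=rng.randint(1, 240))
    # email is built from the name fields
    email = f"{first_name}.{last_name}@example.com".lower()
    return {
        "price": price,
        "cost": cost,
        "shipped_at": shipped_at,
        "delivered_at": delivered_at,
        "email": email,
    }


rows = [make_order_row("Ada", "Lovelace") for _ in range(1_000)]
print(rows[0])
```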

Full API

import misata

# ── Core generation ──────────────────────────────────────────────────────────

# One-liner: story → DataFrames
story = "An ecommerce store with customers and orders"
tables = misata.generate(story, rows=10_000, seed=42)

# Two-step: inspect schema first
schema = misata.parse(story, rows=10_000)
print(schema.summary())
tables = misata.generate_from_schema(schema)

# Append more rows to an existing dataset (IDs auto-offset, no collisions)
tables = misata.generate_more(tables, schema, n=5_000)

# Validate a schema before generation
misata.validate_schema(schema)   # raises SchemaValidationError with all issues listed

# ── Import your own schema ───────────────────────────────────────────────────

schema = misata.from_dict_schema({
    "customers": {
        "id":     {"type": "integer", "primary_key": True},
        "email":  {"type": "email"},
        "plan":   {"type": "string", "enum": ["free", "pro", "enterprise"]},
    },
    "orders": {
        "id":          {"type": "integer", "primary_key": True},
        "customer_id": {"type": "integer",
                        "foreign_key": {"table": "customers", "column": "id"}},
        "amount":      {"type": "float", "min": 1.0, "max": 999.0},
    },
}, row_count=5_000)

# Verify referential integrity after generation or manual edits
report = misata.verify_integrity(tables, schema)
report.raise_if_invalid()   # raises ValueError if orphaned FK values exist

# ── Custom generators ────────────────────────────────────────────────────────

# Override any column with a Python callable
tables = misata.generate_from_schema(schema, custom_generators={
    "orders": {
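        # vectorized: receives the partial DataFrame, returns an array-like
        # (this assumes a "plan" column is present on that DataFrame; the dict
        # schema above defines "plan" on customers, not on orders)
        "amount": lambda df, ctx: (df["plan"] == "enterprise").map({True: 999, False: 49}),
        # per-row: receives one row dict, returns a scalar
        "note":   lambda row, col, ctx: f"Order for plan {row.get('plan', '?')}",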
    }
})

# ── Multi-provider LLM ───────────────────────────────────────────────────────

from misata import LLMSchemaGenerator

# Groq (fast, free tier)
gen = LLMSchemaGenerator(provider="groq")

# Anthropic Claude — uses native SDK, no JSON-mode hack needed
gen = LLMSchemaGenerator(provider="anthropic", model="claude-haiku-4-5-20251001")

# Gemini
gen = LLMSchemaGenerator(provider="gemini", model="gemini-2.0-flash")

# Ollama — fully local, no API key
gen = LLMSchemaGenerator(provider="ollama", model="llama3")

schema = gen.generate_from_story("A fraud detection dataset with 2% positive rate")
tables = misata.generate_from_schema(schema)

# ── Document generation ──────────────────────────────────────────────────────

# Built-in templates — no template file needed
paths = misata.generate_documents(tables, "invoice",
                                  table="orders", output_dir="/tmp/invoices")

# Auto-detect template from column names
paths = misata.generate_documents(tables, "auto",
                                  output_dir="/tmp/docs", format="html")

# Custom Jinja2 template string
html_tmpl = "<h1>Order #{{ order_id }}</h1><p>Amount: ${{ amount }}</p>"
paths = misata.generate_documents(tables, html_tmpl,
                                  table="orders", output_dir="/tmp/custom")

# PDF output (requires pip install "misata[documents]")
paths = misata.generate_documents(tables, "invoice",
                                  table="orders", output_dir="/tmp/pdfs",
                                  format="pdf")

# See all available built-in templates
misata.list_document_templates()
# ['generic', 'invoice', 'patient_report', 'transaction_receipt', 'user_profile']

# ── Kaggle vocabulary enrichment ─────────────────────────────────────────────

# One-time: populate real-world vocabulary for a domain (requires pip install kaggle)
result = misata.enrich_from_kaggle("ecommerce")
# EnrichmentResult(domain='ecommerce', datasets_ingested=1, assets_added=3, status='ok')

# All future generate() calls use the enriched vocabulary automatically
tables = misata.generate("An ecommerce store with 5k orders")

# Bring your own CSV — no Kaggle account needed
misata.ingest_csv_vocab("~/data/companies.csv", domain="fintech",
                        column_map={"CompanyName": "company_name", "City": "city"})

# Check what's stored
print(misata.kaggle_status())
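The dict schema format shown under "Import your own schema" can be sanity-checked in a few lines before it is handed to any generator. The following is a minimal sketch of one such check: that every `foreign_key` points at a table and column that exist. The function `check_foreign_keys` is hypothetical. It is not `misata.validate_schema`, whose internals the README does not describe.

```python
def check_foreign_keys(schema):
    """Return a list of problem descriptions; an empty list means none found."""
    problems = []
    for table, columns in schema.items():
        for column, spec in columns.items():
            fk = spec.get("foreign_key")
            if fk is None:
                continue
            target_table, target_column = fk["table"], fk["column"]
            if target_table not in schema:
                problems.append(f"{table}.{column}: unknown table {target_table!r}")
            elif target_column not in schema[target_table]:
                problems.append(
                    f"{table}.{column}: unknown column {target_table}.{target_column}"
                )
    return problems


# Same shape as the dict passed to from_dict_schema above (trimmed).
good_schema = {
    "customers": {"id": {"type": "integer", "primary_key": True}},
    "orders": {
        "id": {"type": "integer", "primary_key": True},
        "customer_id": {
            "type": "integer",
            "foreign_key": {"table": "customers", "column": "id"},
        },
    },
}

# A broken variant: the foreign key points at a table that is not defined.
bad_schema = {
    "orders": {
        "customer_id": {
            "type": "integer",
            "foreign_key": {"table": "customers", "column": "id"},
        },
    },
}

print(check_foreign_keys(good_schema))
print(check_foreign_keys(bad_schema))
```

Collecting every problem into a list, instead of raising on the first one, mirrors the README's note that validation reports all issues together.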

Performance

Measured on Apple M-series (single core, no GPU):

Workload	Rows	Time	Rows/s
Single table, lognormal	1,000,000	0.06s	~16M
Star schema (5 tables, 4 FKs)	1,055,030	1.54s	~687k

Run the examples

pip install misata pandas numpy

# SaaS: all 12 monthly MRR targets hit exactly
python examples/saas_revenue_curve.py

# Fintech: FICO distribution matches real-world, fraud rate = 2.00%
python examples/fintech_fraud_detection.py

# Healthcare: ABO/Rh blood types, 2 FK edges, 0 orphans
python examples/healthcare_multi_table.py

# Ecommerce: seasonal revenue curve, power-law order amounts
python examples/ecommerce_seasonal.py

Contributing

git clone https://github.com/rasinmuhammed/misata
cd misata
pip install -e ".[dev]"
pytest tests/

Issues and PRs are welcome: github.com/rasinmuhammed/misata/issues

Built by Muhammed Rasin

Release history

Version	Release date
0.8.0.post1	May 10, 2026
0.8.0	May 10, 2026
0.7.1	Apr 16, 2026
0.7.0 (this version)	Apr 14, 2026
0.6.1	Apr 11, 2026
0.6.0	Apr 10, 2026
0.5.3	Mar 28, 2026
0.5.2	Mar 8, 2026
0.5.1	Feb 15, 2026
0.5.0	Feb 3, 2026
0.3.1b0 (pre-release)	Jan 3, 2026
0.3.0b0 (pre-release)	Dec 29, 2025
0.2.0b0 (pre-release)	Dec 28, 2025
0.1.0b0 (pre-release)	Dec 16, 2025

Download files

Source Distribution

misata-0.7.0.tar.gz (277.9 kB)

Uploaded Apr 14, 2026 Source

Built Distribution

misata-0.7.0-py3-none-any.whl (275.0 kB)

Uploaded Apr 14, 2026 Python 3

File details

Details for the file misata-0.7.0.tar.gz.

File metadata

Download URL: misata-0.7.0.tar.gz
Upload date: Apr 14, 2026
Size: 277.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for misata-0.7.0.tar.gz
Algorithm	Hash digest
SHA256	`45033ec896ad8d6f481434f6918a777ebad133044c5636f9d1389c3a9d835efe`
MD5	`5ea9b7b1347b25c285424ea940198029`
BLAKE2b-256	`6ea45b48183ff8f4058339d2da6663be2b20fb4e09b996cdbc727bb5744bd820`

Provenance

The following attestation bundles were made for misata-0.7.0.tar.gz:

Publisher: publish.yml on rasinmuhammed/misata

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: misata-0.7.0.tar.gz
- Subject digest: 45033ec896ad8d6f481434f6918a777ebad133044c5636f9d1389c3a9d835efe
- Sigstore transparency entry: 1292977913
- Sigstore integration time: Apr 14, 2026
Source repository:
- Permalink: rasinmuhammed/misata@b389b57376addb735be8b85689914a99a3fbe94e
- Branch / Tag: refs/tags/v0.7.0
- Owner: https://github.com/rasinmuhammed
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b389b57376addb735be8b85689914a99a3fbe94e
- Trigger Event: release

File details

Details for the file misata-0.7.0-py3-none-any.whl.

File metadata

Download URL: misata-0.7.0-py3-none-any.whl
Upload date: Apr 14, 2026
Size: 275.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for misata-0.7.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`96c366b71e4720e6706c829682cfc75418ccd698dca392fb4173124647e30ae2`
MD5	`45c99559e6d531f70ea1ec40b59c2e80`
BLAKE2b-256	`13ea9bed154b6672ca09a0fea41a0ad9e91dd6a05af4c38f4f309c2feb6a0380`

Provenance

The following attestation bundles were made for misata-0.7.0-py3-none-any.whl:

Publisher: publish.yml on rasinmuhammed/misata

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: misata-0.7.0-py3-none-any.whl
- Subject digest: 96c366b71e4720e6706c829682cfc75418ccd698dca392fb4173124647e30ae2
- Sigstore transparency entry: 1292977966
- Sigstore integration time: Apr 14, 2026
Source repository:
- Permalink: rasinmuhammed/misata@b389b57376addb735be8b85689914a99a3fbe94e
- Branch / Tag: refs/tags/v0.7.0
- Owner: https://github.com/rasinmuhammed
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b389b57376addb735be8b85689914a99a3fbe94e
- Trigger Event: release
