
Python synthetic data generator for realistic multi-table test data, database seeding, and scenario simulation


Misata

Synthetic data from intent — not from config files


import misata

tables = misata.generate("A SaaS company with 5k users, monthly subscriptions, and 20% churn")

print(tables["users"].head())
print(tables["subscriptions"].head())

That's it. Misata reads your intent, infers a relational schema, generates linked tables with referential integrity, and applies domain-realistic distributions — all without a config file.


Install

pip install misata

For LLM-assisted generation (optional):

pip install "misata[llm]"
export GROQ_API_KEY=gsk_...   # or OPENAI_API_KEY

Three examples

SaaS — revenue curve + churn

import misata

tables = misata.generate(
    "A SaaS company with 5k users. Revenue rises from 50k in Jan to 200k in Dec "
    "with a dip in September. 20% churn in Q3.",
    rows=5000,
    seed=42,
)

# users, subscriptions — with exact monthly MRR targets baked in
for name, df in tables.items():
    print(f"{name}: {len(df):,} rows")

Ecommerce — multi-table with FK integrity

import misata

tables = misata.generate("An ecommerce store with customers and orders", rows=10_000)

# customers → orders (FK always holds)
assert tables["orders"]["customer_id"].isin(tables["customers"]["customer_id"]).all()
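
The assertion above checks that every orders.customer_id also appears in customers. When a check like this fails, it helps to see which values are orphaned. A minimal sketch on plain row dictionaries follows; `find_orphans` and the sample rows are hypothetical and written only for this illustration, they are not part of Misata.

```python
def find_orphans(child_rows, parent_rows, fk, pk=None):
    """Return child foreign-key values that have no matching parent key."""
    pk = pk or fk
    parent_keys = {row[pk] for row in parent_rows}
    return sorted({row[fk] for row in child_rows} - parent_keys)


# Hand-made rows for illustration only (not Misata output).
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": 99},  # deliberately broken reference
]

print(find_orphans(orders, customers, "customer_id"))  # [99]
```

An empty list means the foreign key holds for every child row.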

Inspect before generating

import misata

schema = misata.parse("A healthcare clinic with patients, doctors, and appointments")
print(schema.summary())
# Schema: Healthcare Dataset
# Domain: healthcare
# Tables: 3  /  Total rows: 15,300
#
#   Table            Rows  Columns
#   ------------ --------  -------
#   doctors           765  doctor_id, first_name, last_name, specialty, years_experience
#   patients        5,000  patient_id, first_name, last_name, age, gender, blood_type ...
#   appointments   10,000  appointment_id, patient_id, doctor_id, appointment_date ...
#
#   Relationships (2):
#     patients.patient_id → appointments.patient_id
#     doctors.doctor_id → appointments.doctor_id

tables = misata.generate_from_schema(schema)

Supported domains

| Domain | Trigger keywords | Tables generated |
|---|---|---|
| SaaS | saas, subscription, mrr, churn | users, subscriptions |
| Ecommerce | ecommerce, orders, store, retail | customers, orders |
| Fintech | fintech, payments, banking, fraud | customers, accounts, transactions |
| Healthcare | healthcare, patients, doctors, clinic | doctors, patients, appointments |
| Marketplace | marketplace, sellers, buyers, listings | sellers, buyers, listings, orders |
| Logistics | logistics, shipping, drivers, routes | drivers, vehicles, routes, shipments |
| Pharma | pharma, clinical, trials | research_projects, timesheets |

If no keyword matches, Misata falls back to a generic single-table schema and emits a warning.
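
To make the keyword-to-domain mapping concrete, here is a rough sketch of keyword matching with a generic fallback. The keywords come from the table above; the scoring rule (most substring hits wins) and the names `DOMAIN_KEYWORDS` and `detect_domain` are assumptions for illustration, not Misata's actual parser.

```python
import warnings

# Trigger keywords copied from the table above; the matching logic is an assumption.
DOMAIN_KEYWORDS = {
    "saas": ["saas", "subscription", "mrr", "churn"],
    "ecommerce": ["ecommerce", "orders", "store", "retail"],
    "fintech": ["fintech", "payments", "banking", "fraud"],
    "healthcare": ["healthcare", "patients", "doctors", "clinic"],
    "marketplace": ["marketplace", "sellers", "buyers", "listings"],
    "logistics": ["logistics", "shipping", "drivers", "routes"],
    "pharma": ["pharma", "clinical", "trials"],
}


def detect_domain(story):
    """Pick the domain with the most keyword hits; fall back to 'generic'."""
    text = story.lower()
    hits = {
        domain: sum(keyword in text for keyword in keywords)
        for domain, keywords in DOMAIN_KEYWORDS.items()
    }
    best = max(hits, key=hits.get)
    if hits[best] == 0:
        warnings.warn("No domain keyword matched; using a generic schema.")
        return "generic"
    return best


print(detect_domain("A healthcare clinic with patients and doctors"))
print(detect_domain("A fleet of weather balloons"))
```

Plain substring matching is crude (it would also match "store" inside "restore"), so a real parser would likely tokenize the story first.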


What makes Misata different

| Feature | Faker | SDV | Misata |
|---|---|---|---|
| One-liner API | No | No | Yes |
| Story-driven schema inference | No | No | Yes |
| Exact monthly aggregate targets | No | No | Yes |
| Referential integrity | No | Yes | Yes |
| Domain-realistic distributions | No | Limited | Yes |
| Pre-generation schema validation | No | No | Yes |
| Streaming-safe for large datasets | No | No | Yes |

The core difference: Faker generates individual fake values. SDV learns from real data. Misata generates from intent — you describe a business, and it builds a logically consistent world.


How it works

story / intent
      ↓
 StoryParser  ←→  domain priors (lognormal for MRR, Zipf for categories…)
      ↓
 SchemaConfig    ← validate_schema() catches problems before generation
      ↓
 DataSimulator   ← topological sort, FK sampling, realism rules
      ↓
 {table: DataFrame}
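
The DataSimulator step in the diagram mentions a topological sort and FK sampling. The sketch below shows the general idea with the standard library only: parents are generated before children, and each child row draws its foreign keys from parent rows that already exist. The table names follow the healthcare example; the row counts, the `depends_on` mapping, and the ID naming rule are assumptions, and this is not Misata's internal code.

```python
import random
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each table maps to the parent tables it references.
depends_on = {
    "doctors": set(),
    "patients": set(),
    "appointments": {"patients", "doctors"},
}

# Parents come before children, so their keys exist when children are built.
order = list(TopologicalSorter(depends_on).static_order())

rng = random.Random(42)
row_counts = {"doctors": 3, "patients": 5, "appointments": 8}  # toy sizes
tables = {}
for name in order:
    rows = []
    for i in range(1, row_counts[name] + 1):
        row = {f"{name[:-1]}_id": i}  # e.g. "doctors" -> "doctor_id"
        # FK sampling: draw each foreign key from an already-generated parent row.
        for parent in sorted(depends_on[name]):
            key = f"{parent[:-1]}_id"
            row[key] = rng.choice(tables[parent])[key]
        rows.append(row)
    tables[name] = rows

print(order)
print(tables["appointments"][0])
```

Because every foreign key is sampled from existing parent rows, orphaned references cannot occur by construction.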

Domain priors — monetary columns automatically get log-normal distributions. Categorical columns get Zipf sampling so one value dominates naturally. Blood types get real-world probabilities.
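
As a rough illustration of these two priors, the following sketch draws log-normal amounts and Zipf-weighted categories using only the standard library. The parameters (`mu=4.0`, `sigma=0.8`), the plan names, and the 1/rank weighting are assumptions chosen for the example, not values taken from Misata.

```python
import random
from collections import Counter

rng = random.Random(42)

# Log-normal amounts: always positive and right-skewed.
amounts = [rng.lognormvariate(4.0, 0.8) for _ in range(10_000)]

# Zipf-like categories: the weight of rank k is proportional to 1 / k.
categories = ["basic", "pro", "team", "enterprise"]  # hypothetical plan names
weights = [1 / rank for rank in range(1, len(categories) + 1)]
plans = rng.choices(categories, weights=weights, k=10_000)

mean = sum(amounts) / len(amounts)
median = sorted(amounts)[len(amounts) // 2]
print(f"mean={mean:.1f} median={median:.1f}")  # the mean exceeds the median
print(Counter(plans).most_common())
```

The long right tail pulls the mean above the median, which is the shape typically seen in monetary data, and the first-ranked category ends up with roughly twice the share of the second.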

Outcome curves — "revenue rises from 50k in Jan to 200k in Dec" becomes exact per-month targets that constrain generation row by row.
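
One way to picture exact monthly targets is shown below: interpolate a target per month, draw random amounts, then rescale them so each month sums to its target to the cent. This is only a sketch of the idea; the linear interpolation, the rescaling trick, and the helper names `monthly_targets` and `amounts_hitting` are assumptions, and Misata's own constraint mechanism may work differently.

```python
import random


def monthly_targets(start, end, months=12):
    """Linearly interpolate one target per month (a dip would need extra control points)."""
    step = (end - start) / (months - 1)
    return [round(start + i * step, 2) for i in range(months)]


def amounts_hitting(target, n, rng):
    """Draw n positive amounts, then rescale so they sum to target exactly (in cents)."""
    raw = [rng.lognormvariate(0.0, 0.5) for _ in range(n)]
    scale = target / sum(raw)
    cents = [round(x * scale * 100) for x in raw]
    cents[-1] += round(target * 100) - sum(cents)  # absorb rounding drift in the last row
    return [c / 100 for c in cents]


rng = random.Random(42)
targets = monthly_targets(50_000, 200_000)
rows = {month: amounts_hitting(t, 400, rng) for month, t in enumerate(targets, start=1)}

for month, target in enumerate(targets, start=1):
    assert round(sum(rows[month]), 2) == target
print(targets[0], targets[-1])  # 50000.0 200000.0
```

Working in integer cents keeps the monthly sums exact despite floating-point rounding.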

Realism rules: cost is always less than price, delivered_at is always after shipped_at, and email addresses derive from the first and last name.
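
Rules like these are easiest to guarantee by construction rather than by filtering afterwards. The sketch below builds one row at a time so that all three rules hold; `make_order_row`, the sample names, and the value ranges are hypothetical and only illustrate the approach.

```python
import random
from datetime import datetime, timedelta


def make_order_row(rng):
    """Build one row that satisfies all three rules by construction."""
    first = rng.choice(["Ada", "Grace", "Alan"])
    last = rng.choice(["Lovelace", "Hopper", "Turing"])
    price = round(rng.uniform(10, 500), 2)
    cost = round(price * rng.uniform(0.3, 0.9), 2)  # a fraction of price, so cost < price
    shipped_at = datetime(2024, 1, 1) + timedelta(days=rng.randrange(365))
    delivered_at = shipped_at + timedelta(hours=rng.randint(12, 240))  # always later
    return {
        "first_name": first,
        "last_name": last,
        "email": f"{first}.{last}@example.com".lower(),  # derived from the name
        "price": price,
        "cost": cost,
        "shipped_at": shipped_at,
        "delivered_at": delivered_at,
    }


rng = random.Random(42)
rows = [make_order_row(rng) for _ in range(1_000)]
assert all(r["cost"] < r["price"] for r in rows)
assert all(r["delivered_at"] > r["shipped_at"] for r in rows)
print(rows[0]["email"])
```

Deriving dependent columns from the columns they depend on means no row ever has to be rejected and regenerated.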


Full API

import misata

# One-liner
story = "An ecommerce store with customers and orders"
tables = misata.generate(story, rows=10_000, seed=42)

# Two-step
schema = misata.parse(story, rows=10_000)
print(schema.summary())
tables = misata.generate_from_schema(schema)

# Validate a schema before generation
misata.validate_schema(schema)   # raises SchemaValidationError with all issues listed

# LLM-powered (requires misata[llm] + API key)
from misata import LLMSchemaGenerator
gen = LLMSchemaGenerator(provider="groq")   # or "openai", "ollama"
schema = gen.generate_from_story("A fraud detection dataset with 2% positive rate")
tables = misata.generate_from_schema(schema)
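
The comment on validate_schema says the error lists all issues rather than stopping at the first one. The collect-then-raise pattern behind that kind of behavior can be sketched as follows. Everything here is hypothetical: `SchemaError`, `check_schema`, and the dict-based schema are stand-ins for illustration and do not reflect Misata's SchemaConfig or SchemaValidationError.

```python
class SchemaError(ValueError):
    """Hypothetical error type that carries every issue found."""

    def __init__(self, issues):
        super().__init__("; ".join(issues))
        self.issues = issues


def check_schema(tables, relationships):
    """Collect all problems first, then raise once with the full list."""
    issues = []
    for name, columns in tables.items():
        if not columns:
            issues.append(f"table '{name}' has no columns")
    for parent, child, column in relationships:
        for table in (parent, child):
            if table not in tables:
                issues.append(f"relationship references unknown table '{table}'")
        if child in tables and column not in tables[child]:
            issues.append(f"'{child}' is missing foreign-key column '{column}'")
    if issues:
        raise SchemaError(issues)


tables = {"patients": ["patient_id"], "appointments": ["appointment_id"]}
relationships = [
    ("patients", "appointments", "patient_id"),
    ("doctors", "appointments", "doctor_id"),
]

try:
    check_schema(tables, relationships)
except SchemaError as err:
    for issue in err.issues:
        print("-", issue)
```

Reporting every issue in one pass saves the fix-and-rerun loop that first-error validation forces.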

Performance

Measured on Apple M-series (single core, no GPU):

| Workload | Rows | Time | Rows/s |
|---|---|---|---|
| Single table, lognormal | 1,000,000 | 0.06s | ~16M |
| Star schema (5 tables, 4 FKs) | 1,055,030 | 1.54s | ~687k |

Run the examples

pip install misata pandas numpy

# SaaS: all 12 monthly MRR targets hit exactly
python examples/saas_revenue_curve.py

# Fintech: FICO distribution matches real-world, fraud rate = 2.00%
python examples/fintech_fraud_detection.py

# Healthcare: ABO/Rh blood types, 2 FK edges, 0 orphans
python examples/healthcare_multi_table.py

# Ecommerce: seasonal revenue curve, power-law order amounts
python examples/ecommerce_seasonal.py

Contributing

git clone https://github.com/rasinmuhammed/misata
cd misata
pip install -e ".[dev]"
pytest tests/

Issues and PRs are welcome: github.com/rasinmuhammed/misata/issues


Built by Muhammed Rasin

