AI-Powered Synthetic Data Engine - Generate realistic multi-table datasets from natural language
🧠 Misata
The Intelligent Synthetic Data Engine
Stop writing fake data scripts.
Generate production-grade datasets from natural language.
Quick Start • Features • Python API • Enterprise
Why Misata?
Misata isn't just a random data generator. It's an intelligent engine that understands your business logic, relationships, and constraints. Whether you need 50 rows for unit tests or 10 million rows for load testing, Misata delivers statistically realistic data that looks and behaves like the real thing.
| Feature | Faker | SDV | Misata |
|---|---|---|---|
| Natural Language Input | โ | โ | โ |
| Auto Schema Generation | โ | โ | โ |
| Relational Integrity | โ | โ | โ |
| Business Constraints | โ | โ | โ |
| No Training Data Needed | โ | โ | โ |
| Streaming (10M+ rows) | โ | โ | โ |
⚡ Quick Start
1. Install
pip install misata
2. Generate
Describe what you need in plain English. Misata handles the rest.
# Basic generation (Rule-based, instant)
misata generate --story "A SaaS platform with 50K users, monthly subscriptions, and a 20% churn rate in Q3"
# Intelligent generation (LLM-powered)
export GROQ_API_KEY=gsk_...
misata generate --story "E-commerce store with seasonal trends and customer segments" --use-llm
3. Result
Misata creates a relational schema, generates the data, and saves it to ./generated_data.
Schema: SaaS_Platform
Tables: 4 (users, subscriptions, payments, events)
Relationships: 3
Events: 1 (Churn Spike Q3)
Performance: 385,000 rows/second
💾 Data saved to: ./generated_data
🔥 New in v0.5.2: The Realism Engine
Every column is now aware of every other column. Misata generates data that is mathematically consistent, not randomly independent.
What makes this different from Faker?
               Faker/Random               Misata v0.5.2
──────────────────────────────────────────────────────────────────────────────
order.total    $847.23 (random)           $847.23 = $798.50 + $29.99 + $18.74
product.cost   $96.00 (> price!)          $41.20 (43% of price $95.81)
line_total     $3,291.00 (random)         $3,291.00 = 5 × $662.00 − $19.00
user.email     luke.ri@wanadoo.co.uk      emma.chen@gmail.com (from name)
rating         137 (wat?)                 4 ★ (J-curve weighted)
categories     "Hypothyroidism"           "Electronics"
delivered_at   2021-01-03 (before order)  2024-03-15 (+7 days after order)
──────────────────────────────────────────────────────────────────────────────
Row counts     100 × every table          15 categories, 500 order_items
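To make the idea of "mathematically consistent" columns concrete, here is a minimal sketch in plain Python. It is not Misata's implementation: the function names, value ranges, the flat 8% tax rate, and the 30-60% cost margin are all assumptions chosen for illustration. The point is only that dependent columns are computed from the columns they depend on rather than drawn independently.

```python
import random
from datetime import date, timedelta


def make_order(rng: random.Random) -> dict:
    """Build one order whose columns are derived from each other."""
    subtotal = round(rng.uniform(20, 1000), 2)
    shipping = rng.choice([0.0, 4.99, 9.99, 29.99])
    tax = round(subtotal * 0.08, 2)  # hypothetical flat 8% tax rate
    ordered_at = date(2024, 1, 1) + timedelta(days=rng.randrange(365))
    # Delivery is always at least one day after the order date.
    delivered_at = ordered_at + timedelta(days=rng.randint(1, 14))
    return {
        "subtotal": subtotal,
        "shipping": shipping,
        "tax": tax,
        "total": round(subtotal + shipping + tax, 2),  # derived, not random
        "ordered_at": ordered_at,
        "delivered_at": delivered_at,
    }


def make_product(rng: random.Random) -> dict:
    """Build one product whose cost is a fraction of its price."""
    price = round(rng.uniform(5, 500), 2)
    cost = round(price * rng.uniform(0.3, 0.6), 2)  # assumed margin range
    return {"price": price, "cost": cost}


rng = random.Random(42)  # fixed seed so runs are repeatable
print(make_order(rng))
print(make_product(rng))
```

Because `total` and `delivered_at` are computed from their inputs, the invariants hold for every row regardless of the seed.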
Smart Row Proportions
Misata analyzes your FK graph to size tables realistically:
misata generate --db-url sqlite:///shop.db --smart --rows 100
# categories: 15 (reference: fewer, no duplicates)
# users: 100 (entities: your base count)
# products: 250 (entities with variety)
# orders: 250 (transactions: more than users)
# order_items: 500 (line items: most rows)
# reviews: 150 (activity: subset of orders)
Seed Any Existing Database
# PostgreSQL, MySQL, SQLite: just point and seed
misata generate \
  --db-url postgresql://user:pass@localhost:5432/mydb \
  --smart --rows 10000 --db-truncate
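Seeding a relational database means inserting parent rows before the child rows that reference them. The sketch below shows one common way to derive such an order from a foreign-key graph, using the standard library's `graphlib`. The `FOREIGN_KEYS` map is a hypothetical shop schema guessed from the table names in the example above, and `seeding_order` is an illustrative helper, not part of Misata's API.

```python
from graphlib import TopologicalSorter

# Hypothetical FK map: each table -> the set of tables it references.
FOREIGN_KEYS = {
    "categories": set(),
    "users": set(),
    "products": {"categories"},
    "orders": {"users"},
    "order_items": {"orders", "products"},
    "reviews": {"users", "products"},
}


def seeding_order(foreign_keys: dict) -> list:
    """Return table names ordered so referenced tables come first."""
    # TopologicalSorter treats the mapped sets as predecessors, so
    # static_order() yields parents before the tables that depend on them.
    return list(TopologicalSorter(foreign_keys).static_order())


print(seeding_order(FOREIGN_KEYS))
```

`TopologicalSorter` raises `graphlib.CycleError` on circular references, which is a useful early signal that a schema needs nullable foreign keys or a two-pass insert.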
💻 Python API
Seamlessly integrate Misata into your test suites and CI/CD pipelines.
Standard Generation
from misata import DataSimulator
from misata.llm_parser import LLMSchemaGenerator
# 1. Design schema with AI
llm = LLMSchemaGenerator(provider="groq")
config = llm.generate_from_story(
    "Healthcare app with patients, doctors, and appointments"
)
# 2. Generate data
simulator = DataSimulator(config)
for table_name, df in simulator.generate_all():
    print(f"Generated {len(df)} rows for {table_name}")
    df.to_csv(f"{table_name}.csv", index=False)
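When generated CSVs feed a test suite, it can be worth asserting referential integrity before relying on them. The following is a small standard-library sketch of such a check. The helper name `orphaned_keys`, the column names, and the inline sample rows are hypothetical; the sample deliberately contains one dangling reference to show what a failure looks like.

```python
import csv
import io


def orphaned_keys(parent_rows, child_rows, pk="id", fk="parent_id"):
    """Return sorted FK values in child_rows that match no parent PK."""
    parent_ids = {row[pk] for row in parent_rows}
    return sorted({row[fk] for row in child_rows} - parent_ids)


# Hypothetical stand-ins for patients.csv and appointments.csv.
patients_csv = "id,name\n1,Ana\n2,Ben\n"
appointments_csv = "id,patient_id\n10,1\n11,2\n12,3\n"

patients = list(csv.DictReader(io.StringIO(patients_csv)))
appointments = list(csv.DictReader(io.StringIO(appointments_csv)))

# Appointment 12 points at patient 3, which does not exist.
print(orphaned_keys(patients, appointments, pk="id", fk="patient_id"))  # ['3']
```

In a real test, the `io.StringIO` objects would be replaced by the opened CSV files, and the assertion would be that the returned list is empty.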
SQLAlchemy Seeding (Powerful!)
Directly seed your SQLAlchemy models without writing factories.
from misata import seed_from_sqlalchemy_models
from myapp.models import Base, engine
# Automatically analyzes your models and foreign keys
report = seed_from_sqlalchemy_models(
    engine,
    Base,
    default_rows=10_000,
    create=True,
    smart_mode=True,  # Infers realistic values from column names
)
print(f"Seeded {report.total_rows} rows in {report.duration_seconds}s")
🎯 Business Constraints
Define complex rules that simple random generators can't handle.
from misata import Constraint, Table
timesheets = Table(
    name="timesheets",
    row_count=10000,
    constraints=[
        Constraint(
            name="max_daily_hours",
            type="sum_limit",
            group_by=["employee_id", "date"],
            column="hours",
            value=8.0,
            action="redistribute",  # Automatically fixes violations
        )
    ],
)
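To illustrate what a grouped sum limit involves, here is a minimal sketch of one plausible repair strategy: scaling values down proportionally in any group whose total exceeds the limit. This is an assumption about what "redistribute" could mean, not a description of Misata's actual behavior, and `cap_group_sums` is a hypothetical helper.

```python
from collections import defaultdict


def cap_group_sums(rows, group_by, column, limit):
    """Scale `column` down proportionally in groups whose sum exceeds `limit`."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[k] for k in group_by)] += row[column]

    fixed = []
    for row in rows:
        total = totals[tuple(row[k] for k in group_by)]
        scale = limit / total if total > limit else 1.0
        # Rounding to 2 decimals can leave a cent-level drift around the limit.
        fixed.append({**row, column: round(row[column] * scale, 2)})
    return fixed


timesheet_rows = [
    {"employee_id": 1, "date": "2024-01-01", "hours": 6.0},
    {"employee_id": 1, "date": "2024-01-01", "hours": 6.0},  # group sums to 12
    {"employee_id": 2, "date": "2024-01-01", "hours": 5.0},  # already valid
]
fixed = cap_group_sums(timesheet_rows, ["employee_id", "date"], "hours", 8.0)
print([row["hours"] for row in fixed])  # [4.0, 4.0, 5.0]
```

Proportional scaling preserves the relative split between entries in a group; other strategies, such as moving the excess to a neighboring day, would keep the overall total instead.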
Providers
Misata supports multiple LLM providers for schema generation.
| Provider | Env Var | Tier | Best For |
|---|---|---|---|
| Groq | GROQ_API_KEY | Free | Speed (Recommended) |
| OpenAI | OPENAI_API_KEY | Paid | Quality |
| Ollama | None | Free | Privacy (Local) |
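In scripts that run in several environments, the provider can be chosen from whichever key is present. The sketch below follows the table's environment variable names and falls back to the local option. Only the `"groq"` identifier appears in the examples above; the `"openai"` and `"ollama"` strings and the `pick_provider` helper are assumptions, so check them against the installed version before use.

```python
import os


def pick_provider(env=None) -> str:
    """Choose a provider name based on which API key is set."""
    env = os.environ if env is None else env
    if env.get("GROQ_API_KEY"):
        return "groq"
    if env.get("OPENAI_API_KEY"):
        return "openai"  # assumed identifier
    return "ollama"  # assumed identifier; needs no API key


print(pick_provider())
```

The result could then be passed as `LLMSchemaGenerator(provider=pick_provider())`, assuming the constructor accepts those names.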
🏢 Enterprise
Building a platform? Misata Studio is our commercial offering for teams.
- 🖥️ Visual Schema Editor: Drag-and-drop schema design.
- Privacy Filters: PII scanning and masking.
- 📦 One-Click Deploy: Docker & Kubernetes ready.
- 🤝 Support: Dedicated support and custom integration.
Contact Sales for a demo.