AI-Powered Synthetic Data Engine - Generate realistic multi-table datasets from natural language

🧠 Misata

Generate realistic multi-table datasets from natural language.

No schema writing. No training data. Just describe what you need.

✨ What Makes Misata Different

Feature Faker SDV Misata
Natural language input
Auto schema generation
Relational integrity
Business constraints
No training data needed
Streaming (10M+ rows)

🚀 Quick Start

pip install misata

With Groq (Free, Fast)

export GROQ_API_KEY=your_key  # Get a free key at https://console.groq.com
misata generate --story "A SaaS with 50K users, subscriptions, and payments" --use-llm

With OpenAI

export OPENAI_API_KEY=your_key
misata generate --story "E-commerce with products and orders" --use-llm --provider openai

With Ollama (Local, Free, Private)

ollama run llama3  # Start Ollama first
misata generate --story "Fitness app with workouts" --use-llm --provider ollama

📊 Example Output

$ misata generate --story "A fitness app with 50K users" --use-llm

🧠 Using Groq (llama-3.3-70b-versatile) for intelligent parsing...
✅ LLM schema generated successfully!

📋 Schema: FitnessApp
   Tables: 5
   Relationships: 4

🔧 Generating 5 table(s)...

   ✓ exercises     (10 rows)
   ✓ plans         (5 rows)
   ✓ users         (50,000 rows)
   ✓ subscriptions (45,000 rows)
   ✓ workouts      (500,000 rows)

⏱️  Generation time: 2.34 seconds
🚀 Performance: 213,675 rows/second
💾 Data saved to: ./generated_data

💻 Python API

from misata import DataSimulator, SchemaConfig
from misata.llm_parser import LLMSchemaGenerator

# Generate schema from story
llm = LLMSchemaGenerator(provider="groq")  # or "openai", "ollama"
config = llm.generate_from_story(
    "A mobile fitness app with 50K users, workout tracking, "
    "premium subscriptions, and January signup spikes"
)

# Generate data
for table_name, batch in DataSimulator(config).generate_all():
    print(f"Generated {len(batch)} rows for {table_name}")
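The loop above prints one line per yielded item. If a table can arrive in more than one batch (an assumption; the README does not say either way), per-table totals are more useful than per-batch counts. A minimal sketch, where `count_rows` is a hypothetical helper and not part of Misata; it accepts any iterable of `(table_name, batch)` pairs:

```python
from collections import Counter
from typing import Iterable, Sized, Tuple


def count_rows(batches: Iterable[Tuple[str, Sized]]) -> Counter:
    """Sum len(batch) per table across every yielded (table_name, batch) pair."""
    totals: Counter = Counter()
    for table_name, batch in batches:
        totals[table_name] += len(batch)
    return totals


# Stand-in data shaped like the pairs in the loop above (not real Misata output).
fake_batches = [("users", [1, 2, 3]), ("orders", [1, 2]), ("users", [4, 5])]
totals = count_rows(fake_batches)
print(dict(totals))
```

With Misata installed, the same helper should accept `DataSimulator(config).generate_all()` directly, provided each batch supports `len()` as it does in the loop above.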

🔧 CLI Reference

# Basic generation (rule-based, no API key needed)
misata generate --story "SaaS company with users and subscriptions"

# LLM-powered generation
misata generate --story "..." --use-llm

# Specify provider and model
misata generate --story "..." --use-llm --provider ollama --model llama3

# Custom output directory
misata generate --story "..." --use-llm --output-dir ./my_data

# Set row count
misata generate --story "..." --use-llm --rows 100000

# Reproducible with seed
misata generate --story "..." --use-llm --seed 42
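To drive the CLI from a script, the flags listed above can be assembled into an argument list. This is a sketch only: `build_generate_command` is a hypothetical helper, it uses just the flags shown in this section, and it does not check that a given combination is accepted by the installed version.

```python
import shlex
from typing import List, Optional


def build_generate_command(
    story: str,
    use_llm: bool = False,
    provider: Optional[str] = None,
    model: Optional[str] = None,
    output_dir: Optional[str] = None,
    rows: Optional[int] = None,
    seed: Optional[int] = None,
) -> List[str]:
    """Build an argv list for `misata generate` from the documented flags."""
    cmd = ["misata", "generate", "--story", story]
    if use_llm:
        cmd.append("--use-llm")
    if provider is not None:
        cmd += ["--provider", provider]
    if model is not None:
        cmd += ["--model", model]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if rows is not None:
        cmd += ["--rows", str(rows)]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd


cmd = build_generate_command(
    "SaaS company with users and subscriptions", rows=100000, seed=42
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

The list can be passed to `subprocess.run(cmd, check=True)` once `misata` is on the PATH; passing a list rather than a string avoids shell-quoting problems in the story text.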

🎯 Business Rule Constraints

Define rules such as "employees can't log more than 8 hours per day":

from misata import Constraint, Table

timesheets = Table(
    name="timesheets",
    row_count=10000,
    constraints=[
        Constraint(
            name="max_daily_hours",
            type="sum_limit",
            group_by=["employee_id", "date"],
            column="hours",
            value=8.0,
            action="redistribute"
        )
    ]
)
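To make the intent of a `sum_limit` rule concrete, here is one possible reading of it in plain Python: rows are grouped by the `group_by` columns, and any group whose total exceeds the limit is scaled down proportionally. This is an illustrative sketch, not Misata's implementation; `cap_group_sum` is a hypothetical helper, and proportional scaling is only an assumption about what `action="redistribute"` does.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def cap_group_sum(
    rows: List[Dict], group_by: Sequence[str], column: str, limit: float
) -> List[Dict]:
    """Return copies of rows where each group's total of `column` is at most `limit`."""
    # First pass: total the column per group key.
    totals: Dict[tuple, float] = defaultdict(float)
    for row in rows:
        totals[tuple(row[k] for k in group_by)] += row[column]

    # Second pass: scale down only the groups that exceed the limit.
    capped = []
    for row in rows:
        total = totals[tuple(row[k] for k in group_by)]
        scale = limit / total if total > limit else 1.0
        capped.append({**row, column: row[column] * scale})
    return capped


timesheet_rows = [
    {"employee_id": 1, "date": "2024-01-02", "hours": 6.0},
    {"employee_id": 1, "date": "2024-01-02", "hours": 6.0},
    {"employee_id": 2, "date": "2024-01-02", "hours": 3.0},
]
capped = cap_group_sum(timesheet_rows, ["employee_id", "date"], "hours", 8.0)
for row in capped:
    print(row)
```

Employee 1 logged 12 hours on one day, so both entries are scaled so the day totals 8; employee 2 is under the limit and is left unchanged.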

🔑 LLM Providers

Provider | Env Variable | Free Tier | Notes
Groq | GROQ_API_KEY | ✅ 30 req/min | Fastest, recommended
OpenAI | OPENAI_API_KEY | | Best quality
Ollama | None | ✅ Local | Private, no internet

📈 Extending Data Pools

from misata import TextGenerator

# Add custom names
TextGenerator.extend_pool("first_names", ["Arjun", "Priya", "Rahul"])

# Load from file
TextGenerator.load_pools_from_file("custom_pools.json")

# Save for reuse
TextGenerator.save_pools_to_file("expanded_pools.json")

🤖 ML Training Data

Inject noise so that synthetic data shows the imperfections of real-world data (missing values, outliers, typos, duplicates):

from misata import add_noise, NoiseInjector

# Quick noise injection
noisy_df = add_noise(df,
    null_rate=0.05,      # 5% missing values
    outlier_rate=0.02,   # 2% statistical outliers
    typo_rate=0.01,      # 1% typos in text
    duplicate_rate=0.03, # 3% duplicate rows
    seed=42
)

# Advanced: Temporal distribution drift
injector = NoiseInjector(seed=42)
df = injector.apply_temporal_drift(df, 
    date_column="created_at",
    value_column="revenue", 
    drift_rate=0.15,      # 15% increase over time
    drift_direction="up"
)
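The role of `null_rate` and `seed` can be shown with a small stand-alone sketch that uses only the standard library. It is not how `add_noise` is implemented; `inject_nulls` is a hypothetical helper that blanks one column at a given rate and shows that a fixed seed reproduces the same result.

```python
import random
from typing import Dict, List, Optional


def inject_nulls(
    rows: List[Dict], column: str, null_rate: float, seed: Optional[int] = None
) -> List[Dict]:
    """Return copies of rows with `column` set to None in roughly `null_rate` of them."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    return [
        {**row, column: None} if rng.random() < null_rate else dict(row)
        for row in rows
    ]


rows = [{"id": i, "email": f"user{i}@example.com"} for i in range(1000)]
noisy_a = inject_nulls(rows, "email", null_rate=0.05, seed=42)
noisy_b = inject_nulls(rows, "email", null_rate=0.05, seed=42)

missing = sum(r["email"] is None for r in noisy_a)
print(f"{missing} of {len(rows)} emails blanked; same seed repeats: {noisy_a == noisy_b}")
```

The observed share of missing values varies around the requested rate because each row is drawn independently; the input rows are copied and never modified.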

Attribute Customization

from misata import Customizer, ColumnOverride
import numpy as np

customizer = Customizer(seed=42)

# Custom age distribution (realistic, not uniform)
customizer.add_override("users", ColumnOverride(
    name="age",
    generator=lambda n: np.random.normal(35, 12, n).clip(18, 80).astype(int)
))

# Conditional values based on other columns
customizer.add_conditional("orders", "shipping_cost", {
    "country": {"US": 5.99, "UK": 9.99, "DE": 7.99}
})

# Apply to generated data
df = customizer.apply(df, "users")
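The conditional rule above reads as a lookup: the value of `shipping_cost` is chosen from the row's `country`. A plain-Python sketch of that idea follows, assuming a single source column per rule. `apply_conditional` and its `default` parameter are hypothetical and are not part of the Misata API.

```python
from typing import Any, Dict, List


def apply_conditional(
    rows: List[Dict], target: str, rules: Dict[str, Dict[Any, Any]], default: Any = None
) -> List[Dict]:
    """Set `target` on each row by looking up the row's source-column value."""
    # This sketch supports exactly one source column, e.g. {"country": {...}}.
    ((source_column, mapping),) = rules.items()
    return [
        {**row, target: mapping.get(row[source_column], default)} for row in rows
    ]


orders = [
    {"order_id": 1, "country": "US"},
    {"order_id": 2, "country": "DE"},
    {"order_id": 3, "country": "FR"},  # not in the mapping, so it gets the default
]
priced = apply_conditional(
    orders, "shipping_cost", {"country": {"US": 5.99, "UK": 9.99, "DE": 7.99}}
)
for row in priced:
    print(row)
```

How Misata itself treats values missing from the mapping is not stated in this README; the sketch falls back to `default` to keep that case explicit.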

⚡ Performance

Rows | Time | Speed
10K | 0.03s | 333K rows/sec
100K | 0.26s | 385K rows/sec
1M | 2.6s | 390K rows/sec
10M | 26s | 390K rows/sec (streaming)

Try It Now

Try Misata in your browser without installing anything!

💼 Enterprise & Consulting

Need help with complex scenarios?

  • 🏢 Custom enterprise data schemas (10M+ rows)
  • 🔧 Integration with your existing pipelines
  • 📊 Industry-specific realistic data generation
  • 🎓 Training and onboarding for your team

📧 Contact: rasinbinabdulla@gmail.com

📄 License

MIT License

👤 Author

Built by Muhammed Rasin


Misata - From story to synthetic database in one command.
