AI-Powered Synthetic Data Engine - Generate realistic multi-table datasets from natural language

Project description

🧠 Misata

Generate realistic multi-table datasets from natural language.

No schema writing. No training data. Just describe what you need.

✨ What Makes Misata Different

Features compared across Faker, SDV, and Misata:

  • Natural language input
  • Auto schema generation
  • Relational integrity
  • Business constraints
  • No training data needed
  • Streaming (10M+ rows)

🚀 Quick Start

pip install misata

With Groq (Free, Fast)

export GROQ_API_KEY=your_key  # Get free: https://console.groq.com
misata generate --story "A SaaS with 50K users, subscriptions, and payments" --use-llm

With OpenAI

export OPENAI_API_KEY=your_key
misata generate --story "E-commerce with products and orders" --use-llm --provider openai

With Ollama (Local, Free, Private)

ollama run llama3  # Start Ollama first (opens an interactive session, so use a separate terminal)
misata generate --story "Fitness app with workouts" --use-llm --provider ollama

📊 Example Output

$ misata generate --story "A fitness app with 50K users" --use-llm

🧠 Using Groq (llama-3.3-70b-versatile) for intelligent parsing...
✅ LLM schema generated successfully!

📋 Schema: FitnessApp
   Tables: 5
   Relationships: 4

🔧 Generating 5 table(s)...

   ✓ exercises     (10 rows)
   ✓ plans         (5 rows)
   ✓ users         (50,000 rows)
   ✓ subscriptions (45,000 rows)
   ✓ workouts      (500,000 rows)

⏱️  Generation time: 2.34 seconds
🚀 Performance: 213,675 rows/second
💾 Data saved to: ./generated_data
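
The run above reports 4 relationships across 5 tables. One way to sanity-check relational integrity in generated data is to confirm that every foreign key value exists in its parent table. The sketch below is plain Python and does not use Misata; the `id` and `user_id` column names and the sample rows are assumptions made for illustration.

```python
def find_orphans(child_rows, fk_column, parent_rows, pk_column="id"):
    """Return child rows whose foreign key has no matching parent key."""
    parent_keys = {row[pk_column] for row in parent_rows}
    return [row for row in child_rows if row[fk_column] not in parent_keys]


# Tiny hand-written stand-ins for generated tables (hypothetical columns).
users = [{"id": 1}, {"id": 2}, {"id": 3}]
workouts = [
    {"id": 10, "user_id": 1},
    {"id": 11, "user_id": 3},
    {"id": 12, "user_id": 99},  # deliberately broken reference
]

orphans = find_orphans(workouts, "user_id", users)
print(orphans)
```

An empty result means every child row points at an existing parent row.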

💻 Python API

from misata import DataSimulator, SchemaConfig
from misata.llm_parser import LLMSchemaGenerator

# Generate schema from story
llm = LLMSchemaGenerator(provider="groq")  # or "openai", "ollama"
config = llm.generate_from_story(
    "A mobile fitness app with 50K users, workout tracking, "
    "premium subscriptions, and January signup spikes"
)

# Generate data
for table_name, batch in DataSimulator(config).generate_all():
    print(f"Generated {len(batch)} rows for {table_name}")
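
The loop above consumes `(table_name, batch)` pairs. If a large table arrives as several batches (an assumption based on the streaming feature, not confirmed here), totals can be accumulated without holding every batch in memory. In this sketch `fake_generate_all` is a hypothetical stand-in for `DataSimulator(config).generate_all()`, and batches are plain lists rather than whatever type Misata actually yields.

```python
from collections import Counter


def fake_generate_all():
    """Hypothetical stand-in for DataSimulator(config).generate_all()."""
    yield "users", [{"id": 1}, {"id": 2}]
    yield "workouts", [{"id": 10, "user_id": 1}]
    yield "workouts", [{"id": 11, "user_id": 2}, {"id": 12, "user_id": 2}]


def count_rows(batches):
    """Sum rows per table across batches without keeping the batches."""
    totals = Counter()
    for table_name, batch in batches:
        totals[table_name] += len(batch)
    return dict(totals)


totals = count_rows(fake_generate_all())
print(totals)
```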

🔧 CLI Reference

# Basic generation (rule-based, no API key needed)
misata generate --story "SaaS company with users and subscriptions"

# LLM-powered generation
misata generate --story "..." --use-llm

# Specify provider and model
misata generate --story "..." --use-llm --provider ollama --model llama3

# Custom output directory
misata generate --story "..." --use-llm --output-dir ./my_data

# Set row count
misata generate --story "..." --use-llm --rows 100000

# Reproducible with seed
misata generate --story "..." --use-llm --seed 42
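
When scripting several runs, it can help to build the command line in Python. The helper below is a sketch that uses only the flags listed above; `build_generate_command` is a hypothetical name, and whether every flag combination is accepted by the CLI is an assumption.

```python
import shlex


def build_generate_command(story, use_llm=False, provider=None, model=None,
                           output_dir=None, rows=None, seed=None):
    """Build the argv list for `misata generate` from the flags shown above."""
    cmd = ["misata", "generate", "--story", story]
    if use_llm:
        cmd.append("--use-llm")
    for flag, value in (("--provider", provider), ("--model", model),
                        ("--output-dir", output_dir), ("--rows", rows),
                        ("--seed", seed)):
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


cmd = build_generate_command(
    "SaaS company with users and subscriptions",
    use_llm=True, provider="ollama", model="llama3", seed=42,
)
print(shlex.join(cmd))
# To run it (requires misata installed): subprocess.run(cmd, check=True)
```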

🎯 Business Rule Constraints

Define rules such as "employees can't log more than 8 hours per day":

from misata import Constraint, Table

timesheets = Table(
    name="timesheets",
    row_count=10000,
    constraints=[
        Constraint(
            name="max_daily_hours",
            type="sum_limit",
            group_by=["employee_id", "date"],
            column="hours",
            value=8.0,
            action="redistribute"
        )
    ]
)
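
To make the idea of a per-group sum limit concrete, here is a plain-Python sketch that scales values down proportionally in any group whose total exceeds the cap. This is one plausible reading of a `sum_limit` rule, not Misata's implementation; in particular, what `action="redistribute"` does internally is not documented here, and `cap_group_sums` is a hypothetical helper.

```python
from collections import defaultdict


def cap_group_sums(rows, group_by, column, limit):
    """Scale `column` down proportionally in every group whose sum exceeds `limit`."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[k] for k in group_by)] += row[column]

    capped = []
    for row in rows:
        total = totals[tuple(row[k] for k in group_by)]
        factor = limit / total if total > limit else 1.0
        capped.append({**row, column: row[column] * factor})
    return capped


timesheet_rows = [
    {"employee_id": 1, "date": "2024-01-02", "hours": 6.0},
    {"employee_id": 1, "date": "2024-01-02", "hours": 6.0},  # 12h total, over the cap
    {"employee_id": 2, "date": "2024-01-02", "hours": 5.0},  # under the cap, unchanged
]

capped_rows = cap_group_sums(timesheet_rows, ["employee_id", "date"], "hours", 8.0)
print([row["hours"] for row in capped_rows])
```

Proportional scaling keeps the relative split between entries in a group; other strategies, such as moving the excess to another day, would also satisfy the same limit.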

🔑 LLM Providers

Provider | Env Variable | Free Tier | Notes
Groq | GROQ_API_KEY | ✅ 30 req/min | Fastest, recommended
OpenAI | OPENAI_API_KEY | | Best quality
Ollama | None | ✅ Local | Private, no internet

📈 Extending Data Pools

from misata import TextGenerator

# Add custom names
TextGenerator.extend_pool("first_names", ["Arjun", "Priya", "Rahul"])

# Load from file
TextGenerator.load_pools_from_file("custom_pools.json")

# Save for reuse
TextGenerator.save_pools_to_file("expanded_pools.json")

🤖 ML Training Data

Make synthetic data look more like messy real-world data by injecting noise:

from misata import add_noise, NoiseInjector

# Quick noise injection (df is an existing DataFrame of generated rows)
noisy_df = add_noise(df,
    null_rate=0.05,      # 5% missing values
    outlier_rate=0.02,   # 2% statistical outliers
    typo_rate=0.01,      # 1% typos in text
    duplicate_rate=0.03, # 3% duplicate rows
    seed=42
)

# Advanced: Temporal distribution drift
injector = NoiseInjector(seed=42)
df = injector.apply_temporal_drift(df,
    date_column="created_at",
    value_column="revenue",
    drift_rate=0.15,      # 15% increase over time
    drift_direction="up"
)
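
The two ideas above, random nulls and values that drift over time, can be sketched in plain Python. This illustrates the concepts and is not Misata's code: `inject_nulls` and `apply_linear_drift` are hypothetical helpers, and the linear ramp from 1.0 to `1 + drift_rate` is an assumed reading of "15% increase over time".

```python
import random
from datetime import date


def inject_nulls(values, null_rate, seed=None):
    """Replace each value with None independently with probability `null_rate`."""
    rng = random.Random(seed)
    return [None if rng.random() < null_rate else v for v in values]


def apply_linear_drift(rows, date_key, value_key, drift_rate):
    """Scale values by 1.0 at the earliest date up to 1 + drift_rate at the latest."""
    days = [row[date_key].toordinal() for row in rows]
    start, span = min(days), max(days) - min(days)
    drifted = []
    for row, day in zip(rows, days):
        progress = (day - start) / span if span else 0.0
        drifted.append({**row, value_key: row[value_key] * (1 + drift_rate * progress)})
    return drifted


noisy = inject_nulls(list(range(1000)), null_rate=0.05, seed=42)
print(sum(v is None for v in noisy), "of 1000 values nulled")

sales = [
    {"created_at": date(2024, 1, 1), "revenue": 100.0},
    {"created_at": date(2024, 7, 1), "revenue": 100.0},
    {"created_at": date(2024, 12, 31), "revenue": 100.0},
]
drifted = apply_linear_drift(sales, "created_at", "revenue", drift_rate=0.15)
print([round(row["revenue"], 2) for row in drifted])
```

Passing the same seed gives the same null positions on every run, which keeps noisy training sets reproducible.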

Attribute Customization

from misata import Customizer, ColumnOverride
import numpy as np

customizer = Customizer(seed=42)

# Custom age distribution (realistic, not uniform)
customizer.add_override("users", ColumnOverride(
    name="age",
    generator=lambda n: np.random.normal(35, 12, n).clip(18, 80).astype(int)
))

# Conditional values based on other columns
customizer.add_conditional("orders", "shipping_cost", {
    "country": {"US": 5.99, "UK": 9.99, "DE": 7.99}
})

# Apply to generated data
df = customizer.apply(df, "users")
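
For readers who want to see the two override styles without Misata or NumPy, here is a standard-library sketch. `clipped_normal_ages` mirrors the clipped normal age generator and `fill_conditional` mirrors the country-to-shipping-cost lookup; both names are hypothetical, and returning `None` for an unmapped country is an assumption, not documented Misata behavior.

```python
import random


def clipped_normal_ages(n, mean=35, std=12, low=18, high=80, seed=None):
    """Draw n integer ages from a normal distribution, clipped to [low, high]."""
    rng = random.Random(seed)
    return [int(min(max(rng.gauss(mean, std), low), high)) for _ in range(n)]


def fill_conditional(rows, target, source, mapping, default=None):
    """Set row[target] by looking up row[source] in `mapping`."""
    return [{**row, target: mapping.get(row[source], default)} for row in rows]


ages = clipped_normal_ages(1000, seed=42)
print(min(ages), max(ages))

orders = [{"country": "US"}, {"country": "DE"}, {"country": "FR"}]
shipping = {"US": 5.99, "UK": 9.99, "DE": 7.99}
priced = fill_conditional(orders, "shipping_cost", "country", shipping)
print(priced)
```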

⚡ Performance

Rows | Time | Speed
10K | 0.03s | 333K rows/sec
100K | 0.26s | 385K rows/sec
1M | 2.6s | 390K rows/sec
10M | 26s | 390K rows/sec (streaming)

Try It Now

Try Misata in your browser on Google Colab without installing anything!

💼 Enterprise & Consulting

Need help with complex scenarios?

  • 🏢 Custom enterprise data schemas (10M+ rows)
  • 🔧 Integration with your existing pipelines
  • 📊 Industry-specific realistic data generation
  • 🎓 Training and onboarding for your team

📧 Contact: rasinbinabdulla@gmail.com

📄 License

MIT License

👤 Author

Built by Muhammed Rasin


Misata - From story to synthetic database in one command.

