AI-Powered Synthetic Data Engine - Generate realistic multi-table datasets from natural language

🧠 Misata

Generate realistic multi-table datasets from natural language.

No schema writing. No training data. Just describe what you need.

✨ What Makes Misata Different

Feature Faker SDV Misata
Natural language input
Auto schema generation
Relational integrity
Business constraints
No training data needed
Streaming (10M+ rows)

🚀 Quick Start

pip install misata

With Groq (Free, Fast)

export GROQ_API_KEY=your_key  # Get a free key at https://console.groq.com
misata generate --story "A SaaS with 50K users, subscriptions, and payments" --use-llm

With OpenAI

export OPENAI_API_KEY=your_key
misata generate --story "E-commerce with products and orders" --use-llm --provider openai

With Ollama (Local, Free, Private)

ollama run llama3  # Start Ollama first
misata generate --story "Fitness app with workouts" --use-llm --provider ollama

📊 Example Output

$ misata generate --story "A fitness app with 50K users" --use-llm

🧠 Using Groq (llama-3.3-70b-versatile) for intelligent parsing...
✅ LLM schema generated successfully!

📋 Schema: FitnessApp
   Tables: 5
   Relationships: 4

🔧 Generating 5 table(s)...

   ✓ exercises     (10 rows)
   ✓ plans         (5 rows)
   ✓ users         (50,000 rows)
   ✓ subscriptions (45,000 rows)
   ✓ workouts      (500,000 rows)

⏱️  Generation time: 2.34 seconds
🚀 Performance: 213,675 rows/second
💾 Data saved to: ./generated_data

💻 Python API

from misata import DataSimulator, SchemaConfig
from misata.llm_parser import LLMSchemaGenerator

# Generate schema from story
llm = LLMSchemaGenerator(provider="groq")  # or "openai", "ollama"
config = llm.generate_from_story(
    "A mobile fitness app with 50K users, workout tracking, "
    "premium subscriptions, and January signup spikes"
)

# Generate data
for table_name, batch in DataSimulator(config).generate_all():
    print(f"Generated {len(batch)} rows for {table_name}")
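
The loop above consumes generate_all() as a stream of (table_name, batch) pairs. A minimal sketch of tallying rows per table from such a stream, without keeping every batch in memory, might look like the following. fake_generate_all is a hypothetical stand-in for DataSimulator(config).generate_all() so the example runs without Misata installed, and the idea that one table may arrive in several batches is an assumption, not something this README states.

```python
from collections import Counter
from typing import Iterable, Iterator, Sequence, Tuple


def fake_generate_all() -> Iterator[Tuple[str, Sequence[dict]]]:
    """Hypothetical stand-in that yields (table_name, batch) pairs."""
    yield "users", [{"id": 1}, {"id": 2}, {"id": 3}]
    yield "workouts", [{"id": 10, "user_id": 1}, {"id": 11, "user_id": 2}]
    yield "workouts", [{"id": 12, "user_id": 3}]  # second batch, same table


def count_rows(stream: Iterable[Tuple[str, Sequence[dict]]]) -> Counter:
    """Sum len(batch) per table; each batch can be discarded after counting."""
    totals: Counter = Counter()
    for table_name, batch in stream:
        totals[table_name] += len(batch)
    return totals


totals = count_rows(fake_generate_all())
for name, n in totals.items():
    print(f"{name}: {n} rows")
```

The same loop shape works for writing each batch to disk as it arrives instead of counting it.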

🔧 CLI Reference

# Basic generation (rule-based, no API key needed)
misata generate --story "SaaS company with users and subscriptions"

# LLM-powered generation
misata generate --story "..." --use-llm

# Specify provider and model
misata generate --story "..." --use-llm --provider ollama --model llama3

# Custom output directory
misata generate --story "..." --use-llm --output-dir ./my_data

# Set row count
misata generate --story "..." --use-llm --rows 100000

# Reproducible with seed
misata generate --story "..." --use-llm --seed 42

🎯 Business Rule Constraints

Define rules such as "employees cannot log more than 8 hours per day":

from misata import Constraint, Table

timesheets = Table(
    name="timesheets",
    row_count=10000,
    constraints=[
        Constraint(
            name="max_daily_hours",
            type="sum_limit",
            group_by=["employee_id", "date"],
            column="hours",
            value=8.0,
            action="redistribute"
        )
    ]
)
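
One way to read a sum_limit constraint: within each group_by group, the sum of column must not exceed value. The sketch below illustrates that idea in plain Python by scaling a group's values down proportionally when the group total is over the limit. Proportional scaling is only one guess at what action="redistribute" could mean and is not Misata's implementation; enforce_sum_limit is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def enforce_sum_limit(
    rows: List[Dict],
    group_by: Sequence[str],
    column: str,
    limit: float,
) -> List[Dict]:
    """Return copies of rows where each group's total of `column` is <= limit."""
    # First pass: total of `column` per group key.
    totals: Dict[tuple, float] = defaultdict(float)
    for row in rows:
        totals[tuple(row[k] for k in group_by)] += row[column]

    # Second pass: shrink only the groups that exceed the limit.
    result = []
    for row in rows:
        total = totals[tuple(row[k] for k in group_by)]
        scale = limit / total if total > limit else 1.0
        result.append({**row, column: row[column] * scale})
    return result


timesheets = [
    {"employee_id": 1, "date": "2024-01-02", "hours": 6.0},
    {"employee_id": 1, "date": "2024-01-02", "hours": 6.0},  # 12h total, over limit
    {"employee_id": 2, "date": "2024-01-02", "hours": 5.0},  # under limit, unchanged
]
fixed = enforce_sum_limit(timesheets, ["employee_id", "date"], "hours", 8.0)
```

Groups already under the limit pass through untouched, so the rule only alters rows that actually violate it.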

🔑 LLM Providers

Provider   Env Variable     Free Tier       Notes
Groq       GROQ_API_KEY     ✅ 30 req/min   Fastest, recommended
OpenAI     OPENAI_API_KEY                   Best quality
Ollama     None             ✅ Local        Private, no internet

📈 Extending Data Pools

from misata import TextGenerator

# Add custom names
TextGenerator.extend_pool("first_names", ["Arjun", "Priya", "Rahul"])

# Load from file
TextGenerator.load_pools_from_file("custom_pools.json")

# Save for reuse
TextGenerator.save_pools_to_file("expanded_pools.json")

🤖 ML Training Data

Make synthetic data resemble messy real-world data by injecting noise (missing values, outliers, typos, duplicates, and drift):

from misata import add_noise, NoiseInjector

# Quick noise injection (df holds previously generated data)
noisy_df = add_noise(df,
    null_rate=0.05,      # 5% missing values
    outlier_rate=0.02,   # 2% statistical outliers
    typo_rate=0.01,      # 1% typos in text
    duplicate_rate=0.03, # 3% duplicate rows
    seed=42
)

# Advanced: Temporal distribution drift
injector = NoiseInjector(seed=42)
df = injector.apply_temporal_drift(df, 
    date_column="created_at",
    value_column="revenue", 
    drift_rate=0.15,      # 15% increase over time
    drift_direction="up"
)
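
A "15% increase over time" can be read in several ways. One simple reading, sketched below in plain Python, is a linear ramp: the earliest row is left unchanged, the latest row is multiplied by 1 + drift_rate, and rows in between are interpolated by date. This is an illustrative assumption, not a description of what apply_temporal_drift does internally; linear_drift is a hypothetical helper.

```python
from datetime import date
from typing import Dict, List


def linear_drift(
    rows: List[Dict],
    date_column: str,
    value_column: str,
    drift_rate: float,
    direction: str = "up",
) -> List[Dict]:
    """Scale value_column by a factor ramping linearly from 1 to 1 +/- drift_rate."""
    if not rows:
        return []
    sign = 1.0 if direction == "up" else -1.0
    ordinals = [row[date_column].toordinal() for row in rows]
    start = min(ordinals)
    span = max(ordinals) - start

    drifted = []
    for row, ordinal in zip(rows, ordinals):
        # Position in time: 0.0 for the earliest date, 1.0 for the latest.
        t = (ordinal - start) / span if span else 0.0
        factor = 1.0 + sign * drift_rate * t
        drifted.append({**row, value_column: row[value_column] * factor})
    return drifted


rows = [
    {"created_at": date(2024, 1, 1), "revenue": 100.0},
    {"created_at": date(2024, 7, 1), "revenue": 100.0},
    {"created_at": date(2024, 12, 31), "revenue": 100.0},
]
drifted = linear_drift(rows, "created_at", "revenue", drift_rate=0.15)
```

A model trained on early rows and evaluated on late rows of such data sees a shifted distribution, which is the situation drift injection is meant to simulate.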

Attribute Customization

from misata import Customizer, ColumnOverride
import numpy as np

customizer = Customizer(seed=42)

# Custom age distribution (realistic, not uniform)
customizer.add_override("users", ColumnOverride(
    name="age",
    generator=lambda n: np.random.normal(35, 12, n).clip(18, 80).astype(int)
))

# Conditional values based on other columns
customizer.add_conditional("orders", "shipping_cost", {
    "country": {"US": 5.99, "UK": 9.99, "DE": 7.99}
})

# Apply to generated data
df = customizer.apply(df, "users")
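
The two overrides above boil down to a clipped normal sampler and a lookup keyed on another column. A plain-Python sketch of both ideas, using only the standard library, could look like the following. sample_ages and apply_conditional are hypothetical helpers, not Misata functions, and since the README does not say what happens to a country missing from the mapping, the sketch simply leaves the existing value in place.

```python
import random
from typing import Any, Dict, List


def sample_ages(n: int, seed: int, mean: float = 35.0, std: float = 12.0,
                low: int = 18, high: int = 80) -> List[int]:
    """Draw n ages from a normal distribution, clipped to [low, high]."""
    rng = random.Random(seed)  # local generator, so the seed fully controls output
    return [int(min(max(rng.gauss(mean, std), low), high)) for _ in range(n)]


def apply_conditional(
    rows: List[Dict[str, Any]],
    target: str,
    mapping: Dict[str, Dict[Any, Any]],
) -> List[Dict[str, Any]]:
    """Set `target` from a lookup on another column: {source_column: {value: result}}."""
    out = []
    for row in rows:
        new_row = dict(row)
        for source_column, lookup in mapping.items():
            key = row.get(source_column)
            if key in lookup:  # unmapped values keep their original target
                new_row[target] = lookup[key]
        out.append(new_row)
    return out


orders = [
    {"order_id": 1, "country": "US", "shipping_cost": 0.0},
    {"order_id": 2, "country": "DE", "shipping_cost": 0.0},
    {"order_id": 3, "country": "FR", "shipping_cost": 0.0},  # not in the mapping
]
priced = apply_conditional(
    orders, "shipping_cost", {"country": {"US": 5.99, "UK": 9.99, "DE": 7.99}}
)
ages = sample_ages(1000, seed=42)
```

Using a dedicated random.Random(seed) instance rather than the module-level functions keeps the sampler reproducible even when other code draws random numbers in between.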

⚡ Performance

Rows    Time     Speed
10K     0.03s    333K rows/sec
100K    0.26s    385K rows/sec
1M      2.6s     390K rows/sec
10M     26s      390K rows/sec (streaming)
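
Speed in the table is rows divided by elapsed time. A small sketch for measuring the same figure on your own machine follows. make_rows is a hypothetical placeholder workload, not Misata's generator, so the number it prints says nothing about Misata itself.

```python
import time
from typing import Callable, Tuple


def rows_per_second(rows: int, seconds: float) -> float:
    """Throughput as rows divided by elapsed seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return rows / seconds


def measure(generate: Callable[[int], int], rows: int) -> Tuple[float, float]:
    """Time one call to generate(rows) and return (elapsed_seconds, rows_per_second)."""
    start = time.perf_counter()
    produced = generate(rows)
    # Guard against a zero reading on very coarse clocks.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return elapsed, rows_per_second(produced, elapsed)


def make_rows(n: int) -> int:
    """Placeholder workload: build n small dicts and report how many were made."""
    return len([{"id": i} for i in range(n)])


elapsed, speed = measure(make_rows, 10_000)
print(f"{elapsed:.4f}s, {speed:,.0f} rows/sec")

# The first table row, recomputed from its own figures: 10K rows in 0.03s.
first_row_speed = rows_per_second(10_000, 0.03)
```

Swapping make_rows for a function that runs a real generation and returns its row count gives a like-for-like comparison with the table.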

Try It Now

Open In Colab

Try Misata in your browser without installing anything!

💼 Enterprise & Consulting

Need help with complex scenarios?

  • 🏢 Custom enterprise data schemas (10M+ rows)
  • 🔧 Integration with your existing pipelines
  • 📊 Industry-specific realistic data generation
  • 🎓 Training and onboarding for your team

📧 Contact: rasinbinabdulla@gmail.com

📄 License

MIT License

👤 Author

Built by Muhammed Rasin


Misata - From story to synthetic database in one command.

