AI-Powered Synthetic Data Engine - Generate realistic multi-table datasets from natural language

🧠 Misata

Generate realistic multi-table datasets from natural language.

No schema writing. No training data. Just describe what you need.

✨ What Makes Misata Different

Feature Faker SDV Misata
Natural language input
Auto schema generation
Relational integrity
Business constraints
No training data needed
Streaming (10M+ rows)

🚀 Quick Start

pip install misata

With Groq (Free, Fast)

export GROQ_API_KEY=your_key  # Get a free key at https://console.groq.com
misata generate --story "A SaaS with 50K users, subscriptions, and payments" --use-llm

With OpenAI

export OPENAI_API_KEY=your_key
misata generate --story "E-commerce with products and orders" --use-llm --provider openai

With Ollama (Local, Free, Private)

ollama run llama3  # Start Ollama first
misata generate --story "Fitness app with workouts" --use-llm --provider ollama

📊 Example Output

$ misata generate --story "A fitness app with 50K users" --use-llm

🧠 Using Groq (llama-3.3-70b-versatile) for intelligent parsing...
✅ LLM schema generated successfully!

📋 Schema: FitnessApp
   Tables: 5
   Relationships: 4

🔧 Generating 5 table(s)...

   ✓ exercises     (10 rows)
   ✓ plans         (5 rows)
   ✓ users         (50,000 rows)
   ✓ subscriptions (45,000 rows)
   ✓ workouts      (500,000 rows)

⏱️  Generation time: 2.34 seconds
🚀 Performance: 213,675 rows/second
💾 Data saved to: ./generated_data

💻 Python API

from misata import DataSimulator, SchemaConfig
from misata.llm_parser import LLMSchemaGenerator

# Generate schema from story
llm = LLMSchemaGenerator(provider="groq")  # or "openai", "ollama"
config = llm.generate_from_story(
    "A mobile fitness app with 50K users, workout tracking, "
    "premium subscriptions, and January signup spikes"
)

# Generate data
for table_name, batch in DataSimulator(config).generate_all():
    print(f"Generated {len(batch)} rows for {table_name}")
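
If generate_all() yields more than one batch per table, which the streaming feature suggests but this README does not confirm, per-table totals can be accumulated without holding batches in memory. The sketch below uses a hypothetical stand-in generator, fake_generate_all, in place of DataSimulator(config).generate_all(); the stand-in, its batch contents, and the repeated table name are illustrative assumptions.

```python
from collections import Counter
from typing import Iterable, Iterator, Sequence, Tuple

Batch = Sequence[dict]


def fake_generate_all() -> Iterator[Tuple[str, Batch]]:
    """Hypothetical stand-in for DataSimulator(config).generate_all()."""
    yield "users", [{"id": i} for i in range(3)]
    yield "workouts", [{"id": i, "user_id": i % 3} for i in range(4)]
    # Assumption: a large table may arrive as several batches.
    yield "workouts", [{"id": i, "user_id": i % 3} for i in range(4, 6)]


def count_rows(batches: Iterable[Tuple[str, Batch]]) -> Counter:
    """Sum len(batch) per table name; each batch is dropped after counting."""
    totals: Counter = Counter()
    for table_name, batch in batches:
        totals[table_name] += len(batch)
    return totals


totals = count_rows(fake_generate_all())
for table_name, n in totals.items():
    print(f"{table_name}: {n} rows")
```

With the real library, passing DataSimulator(config).generate_all() to count_rows should work the same way as long as each batch supports len(), as the loop above already assumes.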

🔧 CLI Reference

# Basic generation (rule-based, no API key needed)
misata generate --story "SaaS company with users and subscriptions"

# LLM-powered generation
misata generate --story "..." --use-llm

# Specify provider and model
misata generate --story "..." --use-llm --provider ollama --model llama3

# Custom output directory
misata generate --story "..." --use-llm --output-dir ./my_data

# Set row count
misata generate --story "..." --use-llm --rows 100000

# Reproducible with seed
misata generate --story "..." --use-llm --seed 42
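
When the CLI is driven from a script, the flags above can be assembled programmatically. The following is a minimal sketch that uses only the flags listed in this section; the helper names build_misata_command and run_misata are hypothetical, and run_misata assumes misata is installed and on PATH.

```python
import shlex
import subprocess
from typing import List, Optional


def build_misata_command(
    story: str,
    use_llm: bool = False,
    provider: Optional[str] = None,
    model: Optional[str] = None,
    output_dir: Optional[str] = None,
    rows: Optional[int] = None,
    seed: Optional[int] = None,
) -> List[str]:
    """Assemble a `misata generate` argument list from the flags shown above."""
    cmd = ["misata", "generate", "--story", story]
    if use_llm:
        cmd.append("--use-llm")
    if provider is not None:
        cmd += ["--provider", provider]
    if model is not None:
        cmd += ["--model", model]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if rows is not None:
        cmd += ["--rows", str(rows)]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd


def run_misata(cmd: List[str]) -> subprocess.CompletedProcess:
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(cmd, check=True)


cmd = build_misata_command(
    "SaaS company with users and subscriptions",
    use_llm=True,
    rows=100000,
    seed=42,
)
# Print the shell-quoted command for inspection instead of running it here.
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell-quoting problems when the story text contains quotes or spaces.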

🎯 Business Rule Constraints

Define rules such as "employees can't log more than 8 hours per day":

from misata import Constraint, Table

timesheets = Table(
    name="timesheets",
    row_count=10000,
    constraints=[
        Constraint(
            name="max_daily_hours",
            type="sum_limit",
            group_by=["employee_id", "date"],
            column="hours",
            value=8.0,
            action="redistribute"
        )
    ]
)
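
To make the idea of a grouped sum limit concrete, here is a standalone sketch in plain Python. It is not Misata's implementation: this README does not say what action="redistribute" does, so the sketch assumes one plausible reading, scaling each over-limit group down proportionally. The function name cap_group_sums is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def cap_group_sums(
    rows: List[dict],
    group_by: Sequence[str],
    column: str,
    limit: float,
) -> List[dict]:
    """Return copies of rows where each group's total of `column` is <= limit.

    Assumption: groups over the limit are scaled down proportionally.
    Groups at or under the limit are left unchanged.
    """
    totals: Dict[tuple, float] = defaultdict(float)
    for row in rows:
        key = tuple(row[g] for g in group_by)
        totals[key] += row[column]

    capped = []
    for row in rows:
        key = tuple(row[g] for g in group_by)
        total = totals[key]
        scale = limit / total if total > limit else 1.0
        capped.append({**row, column: row[column] * scale})
    return capped


entries = [
    {"employee_id": 1, "date": "2024-01-02", "hours": 6.0},
    {"employee_id": 1, "date": "2024-01-02", "hours": 6.0},
    {"employee_id": 2, "date": "2024-01-02", "hours": 5.0},
]
capped = cap_group_sums(entries, ["employee_id", "date"], "hours", 8.0)
for row in capped:
    print(row)
```

In this example employee 1 logged 12 hours on one day, so both entries are scaled by 8/12; employee 2 is under the limit and is untouched.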

🔑 LLM Providers

Provider | Env Variable   | Free Tier     | Notes
Groq     | GROQ_API_KEY   | ✅ 30 req/min | Fastest, recommended
OpenAI   | OPENAI_API_KEY |               | Best quality
Ollama   | None           | ✅ Local      | Private, no internet

📈 Extending Data Pools

from misata import TextGenerator

# Add custom names
TextGenerator.extend_pool("first_names", ["Arjun", "Priya", "Rahul"])

# Load from file
TextGenerator.load_pools_from_file("custom_pools.json")

# Save for reuse
TextGenerator.save_pools_to_file("expanded_pools.json")

🤖 ML Training Data

Make synthetic data behave more like messy real-world data by injecting noise:

from misata import add_noise, NoiseInjector

# Quick noise injection
noisy_df = add_noise(df,
    null_rate=0.05,      # 5% missing values
    outlier_rate=0.02,   # 2% statistical outliers
    typo_rate=0.01,      # 1% typos in text
    duplicate_rate=0.03, # 3% duplicate rows
    seed=42
)

# Advanced: Temporal distribution drift
injector = NoiseInjector(seed=42)
df = injector.apply_temporal_drift(df, 
    date_column="created_at",
    value_column="revenue", 
    drift_rate=0.15,      # 15% increase over time
    drift_direction="up"
)
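
For readers who want to see what these two kinds of noise amount to, here is a standalone sketch in plain Python. It does not reproduce add_noise or NoiseInjector: the helper names inject_nulls and apply_linear_drift are hypothetical, and the linear drift shape (factor 1.0 at the earliest date, 1 + drift_rate at the latest) is an assumption about what "15% increase over time" means.

```python
import random
from datetime import date, timedelta
from typing import List, Sequence


def inject_nulls(
    rows: List[dict], columns: Sequence[str], null_rate: float, seed: int
) -> List[dict]:
    """Return copies of rows with each listed cell set to None with probability null_rate."""
    rng = random.Random(seed)  # local generator keeps the result reproducible
    out = []
    for row in rows:
        new_row = dict(row)
        for col in columns:
            if rng.random() < null_rate:
                new_row[col] = None
        out.append(new_row)
    return out


def apply_linear_drift(
    rows: List[dict], date_column: str, value_column: str, drift_rate: float
) -> List[dict]:
    """Scale values linearly from x1.0 at the earliest date to x(1 + drift_rate) at the latest."""
    if not rows:
        return []
    ordinals = [row[date_column].toordinal() for row in rows]
    start, end = min(ordinals), max(ordinals)
    span = max(end - start, 1)  # avoid division by zero when all dates match
    out = []
    for row, ordinal in zip(rows, ordinals):
        progress = (ordinal - start) / span
        out.append({**row, value_column: row[value_column] * (1.0 + drift_rate * progress)})
    return out


rows = [
    {"created_at": date(2024, 1, 1) + timedelta(days=i), "revenue": 100.0}
    for i in range(11)
]
drifted = apply_linear_drift(rows, "created_at", "revenue", drift_rate=0.15)
noisy = inject_nulls(drifted, ["revenue"], null_rate=0.05, seed=42)
print(drifted[0]["revenue"], drifted[-1]["revenue"])
```

Drift is applied before nulls here so that the scaling step never meets a missing value.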

Attribute Customization

from misata import Customizer, ColumnOverride
import numpy as np

customizer = Customizer(seed=42)

# Custom age distribution (realistic, not uniform)
customizer.add_override("users", ColumnOverride(
    name="age",
    generator=lambda n: np.random.normal(35, 12, n).clip(18, 80).astype(int)
))

# Conditional values based on other columns
customizer.add_conditional("orders", "shipping_cost", {
    "country": {"US": 5.99, "UK": 9.99, "DE": 7.99}
})

# Apply to generated data
df = customizer.apply(df, "users")
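
The two ideas above, a clipped normal distribution and a value looked up from another column, can be illustrated without Misata or NumPy. The sketch below is a plain-Python approximation under stated assumptions: the helper names clipped_normal_ages and fill_conditional are hypothetical, and the default argument for unmapped keys is an illustrative choice, not documented Customizer behavior.

```python
import random
from typing import Any, Dict, List


def clipped_normal_ages(
    n: int, mean: float = 35, std: float = 12, low: int = 18, high: int = 80, seed: int = 42
) -> List[int]:
    """Draw n ages from a normal distribution, clip to [low, high], truncate to int."""
    rng = random.Random(seed)  # seeded so repeated calls give the same ages
    return [int(min(max(rng.gauss(mean, std), low), high)) for _ in range(n)]


def fill_conditional(
    rows: List[dict], target: str, source: str, mapping: Dict[Any, Any], default: Any = None
) -> List[dict]:
    """Return copies of rows with `target` set from `mapping` keyed by the `source` column."""
    return [{**row, target: mapping.get(row[source], default)} for row in rows]


ages = clipped_normal_ages(1000)
orders = fill_conditional(
    [{"order_id": 1, "country": "US"}, {"order_id": 2, "country": "DE"}, {"order_id": 3, "country": "FR"}],
    target="shipping_cost",
    source="country",
    mapping={"US": 5.99, "UK": 9.99, "DE": 7.99},
)
print(min(ages), max(ages))
print(orders)
```

The lambda in the snippet above draws from NumPy's global random state. Whether Customizer(seed=42) also seeds that state is not stated here; if reproducibility matters, a seeded generator such as np.random.default_rng(42) inside the override is one way to be sure.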

⚡ Performance

Rows | Time  | Speed
10K  | 0.03s | 333K rows/sec
100K | 0.26s | 385K rows/sec
1M   | 2.6s  | 390K rows/sec
10M  | 26s   | 390K rows/sec (streaming)

Try It Now

Try Misata in your browser with Google Colab, without installing anything!

💼 Enterprise & Consulting

Need help with complex scenarios?

  • 🏢 Custom enterprise data schemas (10M+ rows)
  • 🔧 Integration with your existing pipelines
  • 📊 Industry-specific realistic data generation
  • 🎓 Training and onboarding for your team

📧 Contact: rasinbinabdulla@gmail.com

📄 License

MIT License

👤 Author

Built by Muhammed Rasin


Misata - From story to synthetic database in one command.
