Generate realistic mock data from YAML schema definitions

MockMySchema

🎯 Generate realistic mock data from YAML schema definitions at scale

MockMySchema is a Python CLI tool that turns YAML schema definitions into realistic CSV or SQL datasets with foreign key relationships, unique constraints, and statistical distributions. It is built for developers who need test data that scales to millions of rows.

✨ One-Liner Demo

# Generate 1M customers + 5M orders with realistic relationships
mockmyschema generate ecommerce.yaml -o ./data --seed 42

🚀 Quick Start

Installation

pip install mockmyschema

Create Your First Schema

# Generate a template
mockmyschema create-template simple -o my_schema.yaml

# Edit the schema (or use as-is)
# Generate data
mockmyschema generate my_schema.yaml -o ./output

Example Schema

version: "1.0"
locale: en_US

tables:
  customers:
    rows: 100_000
    columns:
      customer_id:
        type: uuid
        primary_key: true
      name:
        type: name
      email:
        type: email
        unique: true
      tier:
        type: enum
        values: [bronze, silver, gold, platinum]
        weights: [50, 30, 15, 5]
      signup_date:
        type: datetime
        start: "2023-01-01"
        end: "2024-12-31"

  orders:
    rows: 500_000
    columns:
      order_id:
        type: sequence
        primary_key: true
      customer_id:
        type: ref
        table: customers
        column: customer_id
        distribution: zipf
      order_date:
        type: datetime
        after: customers.signup_date
        end: "2024-12-31"
      total:
        type: decimal
        min_value: 10.00
        max_value: 5000.00
        precision: 10
        scale: 2
        distribution: lognormal
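
The `values` and `weights` pair on the `tier` column describes a weighted categorical draw. How MockMySchema implements it is not shown here; the following is a minimal sketch using only the Python standard library, and the helper name `sample_enum` is hypothetical.

```python
import random
from collections import Counter


def sample_enum(values, weights, n, seed=None):
    """Draw n values with probability proportional to the given weights."""
    rng = random.Random(seed)  # private RNG so the global state is untouched
    return rng.choices(values, weights=weights, k=n)


# Same values and weights as the `tier` column in the schema above.
tiers = sample_enum(["bronze", "silver", "gold", "platinum"], [50, 30, 15, 5], 10_000, seed=42)
counts = Counter(tiers)
print(counts.most_common())
```

The weights do not need to sum to 100; `random.choices` normalizes them, so `[50, 30, 15, 5]` and `[10, 6, 3, 1]` describe the same distribution.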

Generated Output

# CSV (default)
mockmyschema generate schema.yaml -o ./data

# SQL INSERT statements
mockmyschema generate schema.yaml -o ./data --format sql

# Both CSV + SQL
mockmyschema generate schema.yaml -o ./data --format both

data/                # layout after running with --format both
├── customers.csv    # 100K realistic customers with emails, names, tiers
├── customers.sql    # SQL INSERT statements (batched, 1000 rows per INSERT)
├── orders.csv       # 500K orders with valid foreign keys and temporal ordering
└── orders.sql       # Ready to run in any SQL database
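
One way to spot-check the foreign keys in the generated CSV files is to load the parent keys into a set and scan the child file for values outside it. This is a sketch under assumptions: the helper name `missing_references` is hypothetical, and in-memory strings stand in for `customers.csv` and `orders.csv`.

```python
import csv
import io


def missing_references(parent_csv, child_csv, parent_key, child_key):
    """Return child key values that do not appear in the parent key column."""
    parent_keys = {row[parent_key] for row in csv.DictReader(parent_csv)}
    return [
        row[child_key]
        for row in csv.DictReader(child_csv)
        if row[child_key] not in parent_keys
    ]


# Tiny stand-ins for the generated files; real use would pass open file handles.
customers = io.StringIO("customer_id,name\nc1,Ann\nc2,Bob\n")
orders = io.StringIO("order_id,customer_id\n1,c1\n2,c2\n3,c9\n")
dangling = missing_references(customers, orders, "customer_id", "customer_id")
print(dangling)
```

An empty result means every child row points at an existing parent row.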

SQL output example (illustrative; the columns differ from the example schema above):

-- Generated by MockMySchema
-- Table: customers (5 rows)

INSERT INTO customers (id, name, email, age) VALUES
(1, 'Allison Hill', 'allison@example.com', 22),
(2, 'Noah Rhodes', 'noah@example.com', 55),
(3, 'Angie Henderson', 'angie@example.com', 49),
(4, 'Daniel Wagner', 'daniel@example.com', 39),
(5, 'Cristian Santos', 'cristian@example.com', 38);
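
The batching described above (one multi-row `INSERT` per 1000 rows) can be approximated as follows. This is an illustrative sketch, not the tool's actual writer; `sql_literal` and `batched_inserts` are hypothetical names, and the quoting covers only simple literal types.

```python
def sql_literal(value):
    """Render a Python value as a SQL literal (simple types only)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    # Escape single quotes by doubling them, as standard SQL requires.
    return "'" + str(value).replace("'", "''") + "'"


def batched_inserts(table, columns, rows, batch_size=1000):
    """Yield one multi-row INSERT statement per batch of rows."""
    col_list = ", ".join(columns)
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        values = ",\n".join(
            "(" + ", ".join(sql_literal(v) for v in row) + ")" for row in batch
        )
        yield f"INSERT INTO {table} ({col_list}) VALUES\n{values};"


rows = [(1, "Allison Hill"), (2, "O'Brien"), (3, "Noah Rhodes")]
statements = list(batched_inserts("customers", ["id", "name"], rows, batch_size=2))
print("\n\n".join(statements))
```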

🎪 Key Features

🔗 Smart Relationships

Foreign Keys: Automatic reference pools with distribution control (uniform, zipf, normal)
Temporal Ordering: after constraints ensure logical time sequences
Referential Integrity: All foreign keys point to valid primary keys
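
To illustrate the idea of a reference pool with distribution control, the sketch below samples child keys from a list of parent keys, either uniformly or with Zipf-like weights. The function names and the rank-based weighting (`1 / rank`) are assumptions made for illustration; the library's own pooling logic may differ.

```python
import random
from collections import Counter


def zipf_weights(n, exponent=1.0):
    """Weights proportional to 1 / rank**exponent for ranks 1..n."""
    return [1.0 / (rank ** exponent) for rank in range(1, n + 1)]


def sample_refs(parent_keys, n, distribution="uniform", seed=None):
    """Pick n foreign key values from an existing pool of parent keys."""
    rng = random.Random(seed)
    if distribution == "uniform":
        return rng.choices(parent_keys, k=n)
    if distribution == "zipf":
        return rng.choices(parent_keys, weights=zipf_weights(len(parent_keys)), k=n)
    raise ValueError(f"unsupported distribution: {distribution}")


customer_ids = [f"c{i}" for i in range(100)]
order_refs = sample_refs(customer_ids, 5000, "zipf", seed=42)
ref_counts = Counter(order_refs)
print(ref_counts.most_common(3))
```

Because every value is drawn from the parent pool, referential integrity holds by construction; the distribution only changes how often each parent is chosen.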

📊 Statistical Distributions

Uniform: Equal probability for all values
Normal: Bell curve distribution with mean/std
Log-normal: For realistic price, income, size data
Zipf: Power law for popularity, frequency data
Exponential: For time intervals, queue lengths

🌍 Realistic Data Types

Primitive: sequence, uuid, int, float, decimal, string, bool, enum
Semantic: name, email, phone, address, city, company (Faker-powered)
Temporal: datetime, date with range and ordering constraints
Reference: ref for foreign key relationships

⚡ Production Ready

Memory Efficient: Chunked generation for millions of rows
Deterministic: Seed support for reproducible datasets
Fast: Numpy-vectorized generation
Scalable: Handles complex schemas with deep dependencies
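
Seeded determinism usually comes down to routing every random draw through one generator created from the seed. The sketch below shows that property with the standard library `random` module instead of NumPy; `generate_rows` is a hypothetical helper, not part of MockMySchema.

```python
import random


def generate_rows(n, seed):
    """Generate n (id, age, total) rows from a private, seeded RNG."""
    rng = random.Random(seed)
    return [
        (i + 1, rng.randint(18, 80), round(rng.uniform(10.0, 5000.0), 2))
        for i in range(n)
    ]


run_a = generate_rows(1000, seed=42)
run_b = generate_rows(1000, seed=42)
run_c = generate_rows(1000, seed=7)
print(run_a == run_b, run_a == run_c)
```

Using a dedicated `random.Random` instance rather than the module-level functions keeps results reproducible even when other code also draws random numbers.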

🎨 Developer Experience

YAML First: Clean, readable schema definitions
CLI Focused: Simple commands, rich output
Template Library: Pre-built schemas for common domains
Validation: Comprehensive schema validation with helpful errors
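
As an example of the kind of check schema validation might perform, the sketch below verifies that every `ref` column points at an existing table and column. It assumes the YAML has already been parsed into a plain dict (for example with PyYAML's `yaml.safe_load`); `validate_refs` and its error wording are hypothetical.

```python
def validate_refs(schema):
    """Return a list of error messages for ref columns with bad targets."""
    errors = []
    tables = schema.get("tables", {})
    for table_name, table in tables.items():
        for col_name, col in table.get("columns", {}).items():
            if col.get("type") != "ref":
                continue
            target_table = col.get("table")
            target_col = col.get("column")
            if target_table not in tables:
                errors.append(f"{table_name}.{col_name}: unknown table '{target_table}'")
            elif target_col not in tables[target_table].get("columns", {}):
                errors.append(
                    f"{table_name}.{col_name}: unknown column '{target_table}.{target_col}'"
                )
    return errors


schema = {
    "tables": {
        "customers": {"columns": {"customer_id": {"type": "uuid"}}},
        "orders": {
            "columns": {
                "customer_id": {"type": "ref", "table": "customers", "column": "customer_id"},
                "product_id": {"type": "ref", "table": "products", "column": "product_id"},
            }
        },
    }
}
errors = validate_refs(schema)
print(errors)
```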

📋 Column Types Reference

Primitive Types

Type	Description	Parameters
`sequence`	Auto-increment integers	`start`, `step`
`uuid`	UUID v4 strings	None
`int`	Random integers	`min_value`, `max_value`, `distribution`
`float`	Random floats	`min_value`, `max_value`, `precision`, `distribution`
`decimal`	High-precision decimals	`min_value`, `max_value`, `precision`, `scale`, `distribution`
`string`	Random strings	`min_length`, `max_length`, `prefix`, `suffix`
`bool`	Boolean values	`true_pct`
`enum`	Enumerated values	`values`, `weights`

Semantic Types (Faker-powered)

Type	Description	Locale Support
`name`	Full names	✅
`first_name`	First names	✅
`last_name`	Last names	✅
`email`	Email addresses	✅
`phone`	Phone numbers	✅
`address`	Street addresses	✅
`city`	City names	✅
`country`	Country names	✅
`company`	Company names	✅
`text`	Lorem ipsum text	✅

Temporal Types

Type	Description	Parameters
`datetime`	Date and time	`start`, `end`, `after`, `distribution`
`date`	Date only	`start`, `end`, `after`, `distribution`
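
The `after` parameter can be read as raising the lower bound of the sampling window for each row. A minimal sketch of that interpretation, assuming a uniform draw between the referenced timestamp and `end` (the helper `datetime_after` is hypothetical):

```python
import random
from datetime import datetime, timedelta


def datetime_after(lower, end, rng):
    """Return a uniformly sampled datetime in the closed range [lower, end]."""
    if lower > end:
        raise ValueError("lower bound is later than end")
    span_seconds = (end - lower).total_seconds()
    return lower + timedelta(seconds=rng.uniform(0, span_seconds))


rng = random.Random(42)
signup_date = datetime(2023, 6, 1)   # value taken from the parent row
end = datetime(2024, 12, 31)         # `end` from the column definition
order_dates = [datetime_after(signup_date, end, rng) for _ in range(1000)]
print(min(order_dates), max(order_dates))
```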

Reference Types

Type	Description	Parameters
`ref`	Foreign key reference	`table`, `column`, `distribution`

🎯 Distribution Types

uniform: Equal probability (default)
normal: Bell curve (specify mean, std)
lognormal: Right-skewed for prices, sizes
zipf: Power law for popularity rankings
exponential: For time intervals, queue lengths
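
To show how these shapes differ when values must stay inside `min_value` and `max_value`, the sketch below draws from each continuous distribution with the standard library and clamps the result to the range. The scale factors and the clamping strategy are arbitrary assumptions for illustration; the tool itself may truncate or rescale differently. Zipf is omitted because it applies to ranked choices rather than a continuous range.

```python
import random
import statistics


def sample_bounded(distribution, low, high, rng, mean=None, std=None):
    """Draw one value from the named distribution, clamped to [low, high]."""
    width = high - low
    if distribution == "uniform":
        return rng.uniform(low, high)
    if distribution == "normal":
        mu = (low + high) / 2 if mean is None else mean
        sigma = width / 6 if std is None else std
        value = rng.gauss(mu, sigma)
    elif distribution == "lognormal":
        value = low + rng.lognormvariate(0.0, 1.0) * width / 10  # right-skewed
    elif distribution == "exponential":
        value = low + rng.expovariate(1.0) * width / 5
    else:
        raise ValueError(f"unsupported distribution: {distribution}")
    return min(max(value, low), high)  # clamp outliers to the bounds


rng = random.Random(42)
samples = {
    name: [sample_bounded(name, 10.0, 5000.0, rng) for _ in range(1000)]
    for name in ("uniform", "normal", "lognormal", "exponential")
}
for name, values in samples.items():
    print(name, round(statistics.median(values), 2))
```

Clamping piles a little probability mass onto the bounds; rejection sampling (redrawing until the value falls in range) avoids that at the cost of extra draws.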

📚 Examples & Templates

Built-in Templates

# E-commerce with customers, products, orders
mockmyschema create-template ecommerce -o ecommerce.yaml

# Banking with accounts, transactions
mockmyschema create-template banking -o banking.yaml

# SaaS with organizations, users, projects
mockmyschema create-template saas -o saas.yaml

# Simple blog with users, posts
mockmyschema create-template simple -o blog.yaml

Real-World Examples

# Generate 10M row e-commerce dataset
mockmyschema generate ecommerce.yaml -o ./big_data --chunk-size 50000

# Compressed output for storage efficiency
mockmyschema generate banking.yaml -o ./bank_data --compress

# Reproducible datasets with seeds
mockmyschema generate saas.yaml -o ./test_data --seed 12345

# Validation without generation
mockmyschema validate my_schema.yaml
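
Output written with `--compress` is gzip-compressed, so it can be read back from Python without unpacking it first. The sketch below round-trips a throwaway file; the `.csv.gz` file name is an assumption, since the exact naming of compressed output is not documented here.

```python
import csv
import gzip
import os
import tempfile


def read_gzip_csv(path):
    """Yield each row of a gzip-compressed CSV file as a dict."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        yield from csv.DictReader(handle)


# Round trip on a temporary file so the sketch runs without generated data.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "customers.csv.gz")  # hypothetical file name
    with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows([["id", "tier"], [1, "gold"], [2, "bronze"]])
    rows = list(read_gzip_csv(path))

print(rows)
```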

🏁 CLI Commands

Generate Data

mockmyschema generate schema.yaml [OPTIONS]

Options:
  -o, --output DIR          Output directory (default: ./output)
  --format [csv|sql|both]  Output format (default: csv)
  --seed INT               Random seed for reproducible generation
  --chunk-size INT         Chunk size for memory-efficient generation
  --compress              Compress output files with gzip
  --quiet                 Suppress progress output
  --validate-only         Only validate schema without generating data
  --stats                 Show generation statistics

Validate Schema

mockmyschema validate schema.yaml

Create Templates

mockmyschema create-template {simple,ecommerce,banking,saas} -o output.yaml

System Info

mockmyschema info

🆚 MockMySchema vs Alternatives

Feature	MockMySchema	Faker	Mockaroo
Schema-driven	✅ YAML	❌ Code only	✅ Web UI
Foreign Keys	✅ Smart pools	❌ Manual	✅ Limited
Distributions	✅ 5 types	❌ Limited	✅ Some
Scale	✅ Millions	❌ Memory bound	✅ Paid tiers
Temporal Logic	✅ `after` constraints	❌ None	❌ None
Reproducible	✅ Seed support	✅ Basic	❌ No
CLI First	✅ Rich CLI	❌ Library only	❌ Web only
Open Source	✅ MIT	✅ MIT	❌ Freemium

🛠 Advanced Usage

Complex Relationships

# Multi-level dependencies with temporal ordering
users → accounts → transactions
  ↓       ↓           ↓
signup  opened     after_opened
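
A dependency chain like this implies that parent tables must be generated before their children. One common way to derive such an order is a topological sort; the sketch below uses the standard library `graphlib` module (Python 3.9+). The dependency map is written by hand here, and whether MockMySchema resolves order this way is an assumption.

```python
from graphlib import TopologicalSorter


def generation_order(dependencies):
    """Return table names ordered so that every parent precedes its children.

    `dependencies` maps each table to the set of tables it references.
    """
    return list(TopologicalSorter(dependencies).static_order())


dependencies = {
    "users": set(),
    "accounts": {"users"},
    "transactions": {"accounts"},
}
order = generation_order(dependencies)
print(order)
```

A circular reference between tables makes `static_order()` raise `graphlib.CycleError`, which is a convenient point to report an invalid schema.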

Distribution Examples

# Realistic price distribution (most items cheap, few expensive)
price:
  type: decimal
  min_value: 5.99
  max_value: 999.99
  distribution: lognormal
  
# Popular items get more orders (80/20 rule)
product_id:
  type: ref
  table: products
  column: product_id
  distribution: zipf
  
# Normal age distribution
age:
  type: int
  min_value: 18
  max_value: 80
  distribution: normal

Memory Optimization

# For large tables, use smaller chunks
large_table:
  rows: 10_000_000
  chunk_size: 100_000  # Process in 100K chunks
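
Chunking keeps memory use bounded because only one chunk of rows exists at a time before it is written out. A minimal sketch of the pattern, with hypothetical helper names and an in-memory buffer standing in for the output file:

```python
import csv
import io
import random


def iter_chunks(total_rows, chunk_size):
    """Yield (start, stop) row index ranges that cover total_rows."""
    for start in range(0, total_rows, chunk_size):
        yield start, min(start + chunk_size, total_rows)


def write_table(out, total_rows, chunk_size, seed=None):
    """Generate and write rows one chunk at a time."""
    rng = random.Random(seed)
    writer = csv.writer(out)
    writer.writerow(["id", "value"])
    for start, stop in iter_chunks(total_rows, chunk_size):
        # Only this chunk is held in memory before being flushed to `out`.
        chunk = [(i + 1, rng.randint(0, 999)) for i in range(start, stop)]
        writer.writerows(chunk)


buffer = io.StringIO()
write_table(buffer, total_rows=2500, chunk_size=1000, seed=1)
lines = buffer.getvalue().splitlines()
print(len(lines), lines[0], lines[-1].split(",")[0])
```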

🗺 Roadmap

🔄 Multiple Output Formats: Parquet, JSON, Delta Lake
🚀 Parallel Generation: Multi-core processing for massive datasets
🌊 Streaming Output: Kafka, database connectors
🌐 Web UI: Visual schema builder and preview
📊 Data Profiling: Statistics and quality metrics
🔌 Plugin System: Custom generators and formats
☁️ Cloud Integration: S3, BigQuery, Snowflake outputs

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Development Setup

git clone https://github.com/radha9887/mockmyschema.git
cd mockmyschema
pip install -e ".[dev]"
pytest

📄 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

Faker: For the excellent semantic data generation library
Click: For the beautiful CLI framework
NumPy: For fast numerical computing
PyYAML: For clean configuration parsing

Made with ❤️ for developers who need realistic test data

⭐ Star us on GitHub if MockMySchema helps your project!

This version

1.0.0

Apr 15, 2026

Download files

Source Distribution

mockmyschema-1.0.0.tar.gz (42.2 kB view details)

Uploaded Apr 15, 2026 Source

Built Distribution

mockmyschema-1.0.0-py3-none-any.whl (39.4 kB view details)

Uploaded Apr 15, 2026 Python 3

File details

Details for the file mockmyschema-1.0.0.tar.gz.

File metadata

Download URL: mockmyschema-1.0.0.tar.gz
Upload date: Apr 15, 2026
Size: 42.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.3

File hashes

Hashes for mockmyschema-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`f4036875932b91021c33e3643c350e41261bb755bb0c451bc8dfdd7d773b943c`
MD5	`5551fee5efbdd6e923f65a3697ae7958`
BLAKE2b-256	`e159255c99eae0ff540bae9667d8f4212441980eaa12d64ddd758b712badf4cc`

File details

Details for the file mockmyschema-1.0.0-py3-none-any.whl.

File metadata

Download URL: mockmyschema-1.0.0-py3-none-any.whl
Upload date: Apr 15, 2026
Size: 39.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.3

File hashes

Hashes for mockmyschema-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`764dc1ba7ae9e402b070928b44885ed75442c9604daa415b8f91c0fc5aa58543`
MD5	`a5aa719ea16a6e780db9a80bce55761e`
BLAKE2b-256	`f446be1ca464286047b48ac9ea033677ff6628e51683ea24c5f6ad0439737417`
