
Generate realistic mock data from YAML schema definitions


MockMySchema

License: MIT | Python 3.8+

🎯 Generate realistic mock data from YAML schema definitions at scale

MockMySchema is a powerful Python CLI tool that transforms simple YAML schema definitions into realistic CSV datasets with proper foreign key relationships, unique constraints, and statistical distributions. Built for developers who need realistic test data that scales to millions of rows.

✨ One-Liner Demo

# Generate 1M customers + 5M orders with realistic relationships
mockmyschema generate ecommerce.yaml -o ./data --seed 42

🚀 Quick Start

Installation

pip install mockmyschema

Create Your First Schema

# Generate a template
mockmyschema create-template simple -o my_schema.yaml

# Edit the schema (or use as-is)
# Generate data
mockmyschema generate my_schema.yaml -o ./output

Example Schema

version: "1.0"
locale: en_US

tables:
  customers:
    rows: 100_000
    columns:
      customer_id:
        type: uuid
        primary_key: true
      name:
        type: name
      email:
        type: email
        unique: true
      tier:
        type: enum
        values: [bronze, silver, gold, platinum]
        weights: [50, 30, 15, 5]
      signup_date:
        type: datetime
        start: "2023-01-01"
        end: "2024-12-31"

  orders:
    rows: 500_000
    columns:
      order_id:
        type: sequence
        primary_key: true
      customer_id:
        type: ref
        table: customers
        column: customer_id
        distribution: zipf
      order_date:
        type: datetime
        after: customers.signup_date
        end: "2024-12-31"
      total:
        type: decimal
        min_value: 10.00
        max_value: 5000.00
        precision: 10
        scale: 2
        distribution: lognormal

Generated Output

# CSV (default)
mockmyschema generate schema.yaml -o ./data

# SQL INSERT statements
mockmyschema generate schema.yaml -o ./data --format sql

# Both CSV + SQL
mockmyschema generate schema.yaml -o ./data --format both

data/
├── customers.csv    # 100K realistic customers with emails, names, tiers
├── customers.sql    # SQL INSERT statements (batched, 1000 rows per INSERT)
├── orders.csv       # 500K orders with valid foreign keys and temporal ordering
└── orders.sql       # Ready to run in any SQL database

SQL output example (from a simple four-column customers table, not the schema above):

-- Generated by MockMySchema
-- Table: customers (5 rows)

INSERT INTO customers (id, name, email, age) VALUES
(1, 'Allison Hill', 'allison@example.com', 22),
(2, 'Noah Rhodes', 'noah@example.com', 55),
(3, 'Angie Henderson', 'angie@example.com', 49),
(4, 'Daniel Wagner', 'daniel@example.com', 39),
(5, 'Cristian Santos', 'cristian@example.com', 38);
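
The batching shown above (many rows per INSERT) can be sketched in plain Python. This is an illustration only, not MockMySchema's own writer: the helper names `sql_literal` and `batched_inserts` are hypothetical, the third row is made up to show quote escaping, and the `TRUE`/`FALSE` rendering for booleans is an assumption that not every database accepts.

```python
def sql_literal(value):
    """Render a Python value as a SQL literal (strings quoted, single quotes doubled)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"  # assumption: dialect accepts boolean keywords
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def batched_inserts(table, columns, rows, batch_size=1000):
    """Yield one multi-row INSERT statement per batch of rows."""
    header = f"INSERT INTO {table} ({', '.join(columns)}) VALUES\n"
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        lines = ["(" + ", ".join(sql_literal(v) for v in row) + ")" for row in batch]
        yield header + ",\n".join(lines) + ";"


rows = [
    (1, "Allison Hill", "allison@example.com", 22),
    (2, "Noah Rhodes", "noah@example.com", 55),
    (3, "Miles O'Brien", "miles@example.com", 41),  # made-up row with an apostrophe
]
statements = list(
    batched_inserts("customers", ["id", "name", "email", "age"], rows, batch_size=2)
)
```

With `batch_size=2`, the three rows produce two statements. Doubling single quotes keeps names such as O'Brien from breaking the statement.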

🎪 Key Features

🔗 Smart Relationships

  • Foreign Keys: Automatic reference pools with distribution control (uniform, zipf, normal)
  • Temporal Ordering: after constraints ensure logical time sequences
  • Referential Integrity: All foreign keys point to valid primary keys
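
To make the reference-pool idea concrete, here is a minimal sketch of how a child table could draw foreign keys from a parent key pool with a Zipf-like skew. It is not MockMySchema's implementation: the function name `sample_foreign_keys` and the exponent `s` are assumptions chosen for illustration.

```python
import random
import uuid


def sample_foreign_keys(pool, n, rng, s=1.0):
    """Draw n keys from pool with Zipf-like weights (rank r gets weight 1 / r**s)."""
    weights = [1.0 / (rank ** s) for rank in range(1, len(pool) + 1)]
    return rng.choices(pool, weights=weights, k=n)


rng = random.Random(42)

# Parent primary keys form the reference pool (UUIDs built from the seeded RNG).
customer_ids = [
    str(uuid.UUID(int=rng.getrandbits(128), version=4)) for _ in range(100)
]

# Every child value is drawn from the pool, so referential integrity holds by construction.
order_customer_ids = sample_foreign_keys(customer_ids, 1000, rng)
```

Because the child column only ever samples from existing parent keys, no separate integrity check is needed; the skew means a few customers receive most of the orders.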

📊 Statistical Distributions

  • Uniform: Equal probability for all values
  • Normal: Bell curve distribution with mean/std
  • Log-normal: For realistic price, income, size data
  • Zipf: Power law for popularity, frequency data
  • Exponential: For time intervals, queue lengths

Realistic Data Types

  • Primitive: sequence, uuid, int, float, decimal, string, bool, enum
  • Semantic: name, email, phone, address, city, company (Faker-powered)
  • Temporal: datetime, date with range and ordering constraints
  • Reference: ref for foreign key relationships

⚡ Production Ready

  • Memory Efficient: Chunked generation for millions of rows
  • Deterministic: Seed support for reproducible datasets
  • Fast: Numpy-vectorized generation
  • Scalable: Handles complex schemas with deep dependencies
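
As a rough illustration of how chunked generation and seeding fit together, the sketch below streams rows to CSV one chunk at a time from a single seeded RNG. The function names and the two-column layout are hypothetical, and it uses the standard library rather than NumPy to stay self-contained.

```python
import csv
import io
import random


def generate_chunks(total_rows, chunk_size, seed):
    """Yield lists of rows, never holding more than chunk_size rows in memory."""
    rng = random.Random(seed)  # one seeded RNG makes the whole stream reproducible
    for start in range(0, total_rows, chunk_size):
        size = min(chunk_size, total_rows - start)
        yield [(start + i + 1, rng.randint(18, 80)) for i in range(size)]


def write_csv(stream, total_rows, chunk_size, seed):
    """Write the header once, then append each chunk as it is produced."""
    writer = csv.writer(stream)
    writer.writerow(["id", "age"])
    for chunk in generate_chunks(total_rows, chunk_size, seed):
        writer.writerows(chunk)


first, second = io.StringIO(), io.StringIO()
write_csv(first, total_rows=2500, chunk_size=1000, seed=42)
write_csv(second, total_rows=2500, chunk_size=1000, seed=42)
```

Both buffers end up byte-for-byte identical. In this sketch the output also does not depend on the chunk size, because the RNG is consumed in the same order either way.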

🎨 Developer Experience

  • YAML First: Clean, readable schema definitions
  • CLI Focused: Simple commands, rich output
  • Template Library: Pre-built schemas for common domains
  • Validation: Comprehensive schema validation with helpful errors

📋 Column Types Reference

Primitive Types

| Type | Description | Parameters |
|---|---|---|
| sequence | Auto-increment integers | start, step |
| uuid | UUID v4 strings | None |
| int | Random integers | min_value, max_value, distribution |
| float | Random floats | min_value, max_value, precision, distribution |
| decimal | High-precision decimals | min_value, max_value, precision, scale, distribution |
| string | Random strings | min_length, max_length, prefix, suffix |
| bool | Boolean values | true_pct |
| enum | Enumerated values | values, weights |
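
A few of these parameters map directly onto standard-library calls. The sketch below shows one plausible reading of `start`/`step`, `values`/`weights`, and `true_pct`; treating `true_pct` as a percentage of rows that are true is an assumption, not something the table states.

```python
import itertools
import random

rng = random.Random(42)

# sequence: auto-incrementing integers controlled by start and step.
order_ids = list(itertools.islice(itertools.count(start=1, step=1), 5))

# enum: weighted choice; random.choices treats the weights as relative.
tiers = rng.choices(
    ["bronze", "silver", "gold", "platinum"], weights=[50, 30, 15, 5], k=10_000
)

# bool: true_pct read here as the percentage of rows that should be True (assumption).
true_pct = 25
flags = [rng.random() * 100 < true_pct for _ in range(10_000)]
```

With 10,000 draws, roughly half of `tiers` comes out as bronze and about a quarter of `flags` is true.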

Semantic Types (Faker-powered)

| Type | Description | Locale Support |
|---|---|---|
| name | Full names | ✅ |
| first_name | First names | ✅ |
| last_name | Last names | ✅ |
| email | Email addresses | ✅ |
| phone | Phone numbers | ✅ |
| address | Street addresses | ✅ |
| city | City names | ✅ |
| country | Country names | ✅ |
| company | Company names | ✅ |
| text | Lorem ipsum text | ✅ |

Temporal Types

| Type | Description | Parameters |
|---|---|---|
| datetime | Date and time | start, end, after, distribution |
| date | Date only | start, end, after, distribution |
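
The interaction between `start`, `end`, and `after` can be sketched as follows. This is an assumed interpretation (a uniform draw whose lower bound is raised to the referenced value), and `random_datetime` is a hypothetical helper rather than part of the tool.

```python
import random
from datetime import datetime, timedelta


def random_datetime(rng, start, end, after=None):
    """Pick a uniform datetime in [start, end], no earlier than `after` if given."""
    lower = max(start, after) if after is not None else start
    if lower > end:
        raise ValueError("after/start is later than end")
    seconds = int((end - lower).total_seconds())
    return lower + timedelta(seconds=rng.randint(0, seconds))


rng = random.Random(42)
start, end = datetime(2023, 1, 1), datetime(2024, 12, 31)

signup_date = random_datetime(rng, start, end)
# The order is constrained to fall on or after the referenced signup_date.
order_date = random_datetime(rng, start, end, after=signup_date)
```

Raising the lower bound per row is what keeps an order from predating the signup of the customer it references.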

Reference Types

| Type | Description | Parameters |
|---|---|---|
| ref | Foreign key reference | table, column, distribution |

🎯 Distribution Types

  • uniform: Equal probability (default)
  • normal: Bell curve (specify mean, std)
  • lognormal: Right-skewed for prices, sizes
  • zipf: Power law for popularity rankings
  • exponential: For time intervals, queue lengths
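
For intuition about how these shapes behave inside a `min_value`/`max_value` range, here is a small standard-library sketch. All defaults in it are assumptions: the normal case centers on the middle of the range, the lognormal and exponential parameters are arbitrary, and out-of-range draws are clamped. Zipf is left out because it is usually applied to ranks (as with foreign keys) rather than to a numeric range.

```python
import random


def sample_bounded(rng, distribution, min_value, max_value, mean=None, std=None):
    """Draw one value in [min_value, max_value] from the named distribution."""
    span = max_value - min_value
    if distribution == "uniform":
        return rng.uniform(min_value, max_value)
    if distribution == "normal":
        # Default to the middle of the range, with about 99.7% of draws inside it.
        mu = (min_value + max_value) / 2 if mean is None else mean
        sigma = span / 6 if std is None else std
        value = rng.gauss(mu, sigma)
    elif distribution == "lognormal":
        # Right-skewed: the median sits at roughly 13.5% of the range above min_value.
        value = min_value + span * rng.lognormvariate(-2.0, 1.0)
    elif distribution == "exponential":
        # Mean offset of span / 5 above min_value.
        value = min_value + rng.expovariate(5.0 / span)
    else:
        raise ValueError(f"unknown distribution: {distribution}")
    # Clamp the unbounded tails into the requested range.
    return min(max(value, min_value), max_value)


rng = random.Random(7)
prices = [round(sample_bounded(rng, "lognormal", 5.99, 999.99), 2) for _ in range(1000)]
ages = [int(sample_bounded(rng, "normal", 18, 80)) for _ in range(1000)]
```

Clamping is the simplest way to respect the bounds, but it piles a little extra mass on the edges; redrawing out-of-range values is an alternative if that matters.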

📚 Examples & Templates

Built-in Templates

# E-commerce with customers, products, orders
mockmyschema create-template ecommerce -o ecommerce.yaml

# Banking with accounts, transactions
mockmyschema create-template banking -o banking.yaml

# SaaS with organizations, users, projects
mockmyschema create-template saas -o saas.yaml

# Simple blog with users, posts
mockmyschema create-template simple -o blog.yaml

Real-World Examples

# Generate a 10M-row e-commerce dataset
mockmyschema generate ecommerce.yaml -o ./big_data --chunk-size 50000

# Compressed output for storage efficiency
mockmyschema generate banking.yaml -o ./bank_data --compress

# Reproducible datasets with seeds
mockmyschema generate saas.yaml -o ./test_data --seed 12345

# Validation without generation
mockmyschema validate my_schema.yaml

CLI Commands

Generate Data

mockmyschema generate schema.yaml [OPTIONS]

Options:
  -o, --output DIR          Output directory (default: ./output)
  --format [csv|sql|both]  Output format (default: csv)
  --seed INT               Random seed for reproducible generation
  --chunk-size INT         Chunk size for memory-efficient generation
  --compress              Compress output files with gzip
  --quiet                 Suppress progress output
  --validate-only         Only validate schema without generating data
  --stats                 Show generation statistics
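
Gzip-compressed CSV can be read back in Python without unpacking it to disk first. The sketch below writes a tiny stand-in file and reads it again; the file name `customers.csv.gz` is an assumption about how compressed output might be named, not something the options above specify.

```python
import csv
import gzip
import os
import tempfile

# Write a small gzip-compressed CSV as a stand-in for generated output.
path = os.path.join(tempfile.mkdtemp(), "customers.csv.gz")
with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    writer.writerow(["id", "name"])
    writer.writerow([1, "Allison Hill"])

# Read it back in text mode; gzip decompresses on the fly.
with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
    rows = list(csv.DictReader(handle))
```

Note that CSV values come back as strings, so numeric columns need converting after the read.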

Validate Schema

mockmyschema validate schema.yaml

Create Templates

mockmyschema create-template {simple,ecommerce,banking,saas} -o output.yaml

System Info

mockmyschema info

🆚 MockMySchema vs Alternatives

| Feature | MockMySchema | Faker | Mockaroo |
|---|---|---|---|
| Schema-driven | ✅ YAML | ❌ Code only | ✅ Web UI |
| Foreign Keys | ✅ Smart pools | ❌ Manual | ✅ Limited |
| Distributions | ✅ 7 types | ❌ Limited | ✅ Some |
| Scale | ✅ Millions | ❌ Memory bound | ✅ Paid tiers |
| Temporal Logic | ✅ after constraints | ❌ None | ❌ None |
| Reproducible | ✅ Seed support | ✅ Basic | ❌ No |
| CLI First | ✅ Rich CLI | ❌ Library only | ❌ Web only |
| Open Source | ✅ MIT | ✅ MIT | ❌ Freemium |

🛠 Advanced Usage

Complex Relationships

# Multi-level dependencies with temporal ordering
users → accounts → transactions
  ↓       ↓           ↓
signup  opened     after_opened
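
A chain like this implies that parent tables must be generated before the tables that reference them. One common way to derive that order is a topological sort; the sketch below uses Kahn's algorithm and is an illustration under that assumption, with `generation_order` being a hypothetical helper name.

```python
from collections import deque


def generation_order(dependencies):
    """Order tables so each one comes after every table it references."""
    indegree = {table: len(parents) for table, parents in dependencies.items()}
    children = {table: [] for table in dependencies}
    for table, parents in dependencies.items():
        for parent in parents:
            children[parent].append(table)

    # Start with tables that reference nothing.
    ready = deque(sorted(t for t, degree in indegree.items() if degree == 0))
    order = []
    while ready:
        table = ready.popleft()
        order.append(table)
        for child in children[table]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    if len(order) != len(dependencies):
        raise ValueError("circular reference between tables")
    return order


order = generation_order(
    {"transactions": {"accounts"}, "accounts": {"users"}, "users": set()}
)
```

Every referenced table must appear as a key in the mapping. A cycle (two tables referencing each other) leaves some tables unprocessed and is reported as an error.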

Distribution Examples

# Realistic price distribution (most items cheap, few expensive)
price:
  type: decimal
  min_value: 5.99
  max_value: 999.99
  distribution: lognormal
  
# Popular items get more orders (80/20 rule)
product_id:
  type: ref
  table: products
  column: product_id
  distribution: zipf
  
# Normal age distribution
age:
  type: int
  min_value: 18
  max_value: 80
  distribution: normal

Memory Optimization

# For large tables, use smaller chunks
large_table:
  rows: 10_000_000
  chunk_size: 100_000  # Process in 100K chunks

🗺 Roadmap

  • 🔄 Multiple Output Formats: Parquet, JSON, Delta Lake
  • 🚀 Parallel Generation: Multi-core processing for massive datasets
  • 🌊 Streaming Output: Kafka, database connectors
  • 🌐 Web UI: Visual schema builder and preview
  • 📊 Data Profiling: Statistics and quality metrics
  • 🔌 Plugin System: Custom generators and formats
  • ☁️ Cloud Integration: S3, BigQuery, Snowflake outputs

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Development Setup

git clone https://github.com/radha9887/mockmyschema.git
cd mockmyschema
pip install -e ".[dev]"
pytest

📄 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

  • Faker: For the excellent semantic data generation library
  • Click: For the beautiful CLI framework
  • NumPy: For fast numerical computing
  • PyYAML: For clean configuration parsing

Made with ❤️ for developers who need realistic test data

⭐ Star us on GitHub if MockMySchema helps your project!
