Generate realistic mock data from YAML schema definitions
MockMySchema
Generate realistic mock data from YAML schema definitions at scale
MockMySchema is a Python CLI tool that turns YAML schema definitions into realistic CSV or SQL datasets with valid foreign key relationships, unique constraints, and statistical distributions. It is built for developers who need test data that scales to millions of rows.
One-Liner Demo
# Generate 1M customers + 5M orders with realistic relationships
mockmyschema generate ecommerce.yaml -o ./data --seed 42
Quick Start
Installation
pip install mockmyschema
Create Your First Schema
# Generate a template
mockmyschema create-template simple -o my_schema.yaml
# Edit the schema (or use as-is)
# Generate data
mockmyschema generate my_schema.yaml -o ./output
Example Schema
version: "1.0"
locale: en_US
tables:
  customers:
    rows: 100_000
    columns:
      customer_id:
        type: uuid
        primary_key: true
      name:
        type: name
      email:
        type: email
        unique: true
      tier:
        type: enum
        values: [bronze, silver, gold, platinum]
        weights: [50, 30, 15, 5]
      signup_date:
        type: datetime
        start: "2023-01-01"
        end: "2024-12-31"
  orders:
    rows: 500_000
    columns:
      order_id:
        type: sequence
        primary_key: true
      customer_id:
        type: ref
        table: customers
        column: customer_id
        distribution: zipf
      order_date:
        type: datetime
        after: customers.signup_date
        end: "2024-12-31"
      total:
        type: decimal
        min_value: 10.00
        max_value: 5000.00
        precision: 10
        scale: 2
        distribution: lognormal
Generated Output
# CSV (default)
mockmyschema generate schema.yaml -o ./data
# SQL INSERT statements
mockmyschema generate schema.yaml -o ./data --format sql
# Both CSV + SQL
mockmyschema generate schema.yaml -o ./data --format both
output/
├── customers.csv   # 100K realistic customers with emails, names, tiers
├── customers.sql   # SQL INSERT statements (batched, 1000 rows per INSERT)
├── orders.csv      # 500K orders with valid foreign keys and temporal ordering
└── orders.sql      # Ready to run in any SQL database
SQL output example:
-- Generated by MockMySchema
-- Table: customers (5 rows)
INSERT INTO customers (id, name, email, age) VALUES
(1, 'Allison Hill', 'allison@example.com', 22),
(2, 'Noah Rhodes', 'noah@example.com', 55),
(3, 'Angie Henderson', 'angie@example.com', 49),
(4, 'Daniel Wagner', 'daniel@example.com', 39),
(5, 'Cristian Santos', 'cristian@example.com', 38);
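The batching described above (one multi-row INSERT per 1000 rows) can be sketched as follows. This is a minimal illustration, not MockMySchema's actual writer: the function names `sql_literal` and `batched_inserts` are hypothetical, the quoting rules are a simplifying assumption, and the third sample row is invented to show quote escaping.

```python
from typing import Iterable, Iterator, Sequence


def sql_literal(value: object) -> str:
    """Render a Python value as a SQL literal (assumed, simplified rules)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    # Escape embedded single quotes by doubling them.
    return "'" + str(value).replace("'", "''") + "'"


def batched_inserts(table: str, columns: Sequence[str],
                    rows: Iterable[Sequence[object]],
                    batch_size: int = 1000) -> Iterator[str]:
    """Yield one multi-row INSERT statement per batch of rows."""
    header = f"INSERT INTO {table} ({', '.join(columns)}) VALUES\n"
    batch: list[str] = []
    for row in rows:
        batch.append("(" + ", ".join(sql_literal(v) for v in row) + ")")
        if len(batch) == batch_size:
            yield header + ",\n".join(batch) + ";"
            batch = []
    if batch:  # flush the final partial batch
        yield header + ",\n".join(batch) + ";"


# Hypothetical rows; batch_size=2 only to keep the demo small.
rows = [
    (1, "Allison Hill", "allison@example.com", 22),
    (2, "Noah Rhodes", "noah@example.com", 55),
    (3, "Dan O'Neil", "dan@example.com", 39),
]
statements = list(
    batched_inserts("customers", ["id", "name", "email", "age"], rows, batch_size=2)
)
for statement in statements:
    print(statement)
```

Because the function is a generator, only one batch of rendered rows is held in memory at a time, which fits the chunked generation the tool advertises.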
Key Features
Smart Relationships
- Foreign Keys: Automatic reference pools with distribution control (uniform, zipf, normal)
- Temporal Ordering: `after` constraints ensure logical time sequences
- Referential Integrity: All foreign keys point to valid primary keys
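To make the idea of a reference pool with distribution control concrete, here is a small sketch using only the Python standard library. It is an assumption about how such sampling could work, not the tool's implementation: `zipf_weights` and `sample_refs` are hypothetical names, and the Zipf exponent of 1.0 is an arbitrary choice.

```python
import random
from itertools import accumulate
from typing import Optional


def zipf_weights(n: int, exponent: float = 1.0) -> list[float]:
    """Weight for the item at 1-based rank r is 1 / r**exponent."""
    return [1.0 / (rank ** exponent) for rank in range(1, n + 1)]


def sample_refs(pool: list, count: int, distribution: str = "uniform",
                seed: Optional[int] = None) -> list:
    """Draw `count` foreign key values from a pool of existing primary keys."""
    rng = random.Random(seed)
    if distribution == "uniform":
        return rng.choices(pool, k=count)
    if distribution == "zipf":
        # Earlier pool entries are treated as "more popular".
        cumulative = list(accumulate(zipf_weights(len(pool))))
        return rng.choices(pool, cum_weights=cumulative, k=count)
    raise ValueError(f"unsupported distribution: {distribution}")


# Hypothetical parent keys; every sampled value is a valid reference by construction.
customer_ids = [f"cust-{i}" for i in range(100)]
order_customer_ids = sample_refs(customer_ids, 10_000, "zipf", seed=42)
```

Sampling only from the parent table's generated keys is what guarantees referential integrity; the distribution only changes how often each key is picked.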
Statistical Distributions
- Uniform: Equal probability for all values
- Normal: Bell curve distribution with mean/std
- Log-normal: For realistic price, income, size data
- Zipf: Power law for popularity, frequency data
- Exponential: For time intervals, queue lengths
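As an illustration of how a bounded column such as `min_value`/`max_value` can be combined with a distribution, the sketch below uses rejection sampling from the standard library. The helper name `sample_bounded` and all shape parameters (standard deviation of one sixth of the range, log-normal scale of one tenth of the range, and so on) are assumptions chosen for the demo, not values taken from MockMySchema. Zipf is left out because it applies to ranks rather than continuous ranges.

```python
import random


def sample_bounded(rng: random.Random, min_value: float, max_value: float,
                   distribution: str = "uniform") -> float:
    """Draw one value in [min_value, max_value] by rejection sampling."""
    span = max_value - min_value
    draw = {
        "uniform": lambda: rng.uniform(min_value, max_value),
        # Bell curve centred on the midpoint of the range.
        "normal": lambda: rng.gauss(min_value + span / 2, span / 6),
        # Right-skewed: most values near the low end, a long tail upward.
        "lognormal": lambda: min_value + rng.lognormvariate(0.0, 1.0) * span / 10,
        "exponential": lambda: min_value + rng.expovariate(1.0) * span / 5,
    }.get(distribution)
    if draw is None:
        raise ValueError(f"unsupported distribution: {distribution}")
    while True:
        value = draw()
        if min_value <= value <= max_value:  # reject out-of-range draws
            return value


rng = random.Random(7)
prices = [round(sample_bounded(rng, 5.99, 999.99, "lognormal"), 2) for _ in range(2000)]
ages = [sample_bounded(rng, 18, 80, "normal") for _ in range(2000)]
```

Rejection sampling keeps the shape of the distribution inside the bounds instead of piling clipped values up at the edges.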
Realistic Data Types
- Primitive: `sequence`, `uuid`, `int`, `float`, `decimal`, `string`, `bool`, `enum`
- Semantic: `name`, `email`, `phone`, `address`, `city`, `company` (Faker-powered)
- Temporal: `datetime`, `date` with range and ordering constraints
- Reference: `ref` for foreign key relationships
Production Ready
- Memory Efficient: Chunked generation for millions of rows
- Deterministic: Seed support for reproducible datasets
- Fast: Numpy-vectorized generation
- Scalable: Handles complex schemas with deep dependencies
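The combination of chunked generation and seed-based determinism can be sketched as below. This is a simplified stand-in that uses the standard library instead of NumPy; the names `generate_chunks`, `write_csv`, and `render`, and the two-column `id`/`age` layout, are hypothetical.

```python
import csv
import io
import random
from typing import Iterator, TextIO


def generate_chunks(total_rows: int, chunk_size: int, seed: int) -> Iterator[list[tuple]]:
    """Yield rows in chunks so only `chunk_size` rows are in memory at once."""
    rng = random.Random(seed)  # a single seeded stream drives all chunks
    next_id = 1
    while next_id <= total_rows:
        n = min(chunk_size, total_rows - next_id + 1)
        chunk = [(next_id + i, rng.randint(18, 80)) for i in range(n)]
        next_id += n
        yield chunk


def write_csv(out: TextIO, total_rows: int, chunk_size: int, seed: int) -> None:
    """Stream chunks to a CSV writer, one chunk at a time."""
    writer = csv.writer(out)
    writer.writerow(["id", "age"])
    for chunk in generate_chunks(total_rows, chunk_size, seed):
        writer.writerows(chunk)


def render(total_rows: int, chunk_size: int, seed: int) -> str:
    """Convenience wrapper that returns the CSV text (for small demos only)."""
    buffer = io.StringIO()
    write_csv(buffer, total_rows, chunk_size, seed)
    return buffer.getvalue()


# Same seed gives the same data, regardless of the chunk size used.
same = render(1000, 100, seed=42) == render(1000, 333, seed=42)
```

In this sketch the output does not depend on the chunk size because the random stream is consumed in row order; whether the real tool offers the same guarantee across different `--chunk-size` values is not stated in this document.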
Developer Experience
- YAML First: Clean, readable schema definitions
- CLI Focused: Simple commands, rich output
- Template Library: Pre-built schemas for common domains
- Validation: Comprehensive schema validation with helpful errors
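The kind of checks a schema validator might perform can be sketched as follows. The rules and error messages here are illustrative assumptions (the real `mockmyschema validate` output is not documented in this README), and `validate_schema` is a hypothetical function operating on an already-parsed schema dictionary.

```python
def validate_schema(schema: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors: list[str] = []
    tables = schema.get("tables", {})
    for table_name, table in tables.items():
        rows = table.get("rows")
        if not isinstance(rows, int) or rows <= 0:
            errors.append(f"{table_name}: 'rows' must be a positive integer")
        for column_name, column in table.get("columns", {}).items():
            where = f"{table_name}.{column_name}"
            column_type = column.get("type")
            if column_type is None:
                errors.append(f"{where}: missing 'type'")
            elif column_type == "ref":
                # A ref must point at an existing table and column.
                target = tables.get(column.get("table"))
                if target is None:
                    errors.append(f"{where}: unknown table {column.get('table')!r}")
                elif column.get("column") not in target.get("columns", {}):
                    errors.append(f"{where}: unknown column {column.get('column')!r}")
            elif column_type == "enum" and "weights" in column:
                if len(column["weights"]) != len(column.get("values", [])):
                    errors.append(f"{where}: 'weights' and 'values' differ in length")
    return errors


good = {"tables": {
    "customers": {"rows": 10, "columns": {"customer_id": {"type": "uuid"}}},
    "orders": {"rows": 50, "columns": {
        "customer_id": {"type": "ref", "table": "customers", "column": "customer_id"}}},
}}
bad = {"tables": {
    "orders": {"rows": 0, "columns": {
        "customer_id": {"type": "ref", "table": "customers", "column": "customer_id"},
        "status": {"type": "enum", "values": ["a", "b"], "weights": [1]}}},
}}
print(validate_schema(good))
print(validate_schema(bad))
```

Collecting all problems in one pass, rather than stopping at the first, is what makes it possible to report every issue in a schema at once.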
Column Types Reference
Primitive Types
| Type | Description | Parameters |
|---|---|---|
| `sequence` | Auto-increment integers | start, step |
| `uuid` | UUID v4 strings | None |
| `int` | Random integers | min_value, max_value, distribution |
| `float` | Random floats | min_value, max_value, precision, distribution |
| `decimal` | High-precision decimals | min_value, max_value, precision, scale, distribution |
| `string` | Random strings | min_length, max_length, prefix, suffix |
| `bool` | Boolean values | true_pct |
| `enum` | Enumerated values | values, weights |
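A few of the primitive types in this table map directly onto standard library calls. The sketch below shows one plausible reading of `sequence` (start, step), `enum` (values, weights), and `bool` (true_pct); the helper names are hypothetical, and treating `true_pct` as a percentage between 0 and 100 is an assumption.

```python
import random
from itertools import count, islice
from typing import Iterator


def sequence(start: int = 1, step: int = 1) -> Iterator[int]:
    """Auto-increment integers: start, start + step, start + 2*step, ..."""
    return count(start, step)


def weighted_enum(rng: random.Random, values: list, weights: list, k: int) -> list:
    """Pick k values; weights are relative and need not sum to 100."""
    return rng.choices(values, weights=weights, k=k)


def bools(rng: random.Random, true_pct: float, k: int) -> list[bool]:
    """Each value is True with probability true_pct / 100 (assumed meaning)."""
    return [rng.random() * 100 < true_pct for _ in range(k)]


rng = random.Random(42)
order_ids = list(islice(sequence(start=100, step=5), 3))
tiers = weighted_enum(rng, ["bronze", "silver", "gold", "platinum"], [50, 30, 15, 5], 10_000)
flags = bools(rng, true_pct=25, k=10_000)
```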
Semantic Types (Faker-powered)
| Type | Description | Locale Support |
|---|---|---|
| `name` | Full names | Yes |
| `first_name` | First names | Yes |
| `last_name` | Last names | Yes |
| `email` | Email addresses | Yes |
| `phone` | Phone numbers | Yes |
| `address` | Street addresses | Yes |
| `city` | City names | Yes |
| `country` | Country names | Yes |
| `company` | Company names | Yes |
| `text` | Lorem ipsum text | Yes |
Temporal Types
| Type | Description | Parameters |
|---|---|---|
| `datetime` | Date and time | start, end, after, distribution |
| `date` | Date only | start, end, after, distribution |
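The `after` parameter ties a child timestamp to a value in the referenced parent row. A minimal sketch of that idea follows, assuming uniform sampling between the parent's value and `end`, and assuming that "after" allows equality at the boundary. The function names and sample data are hypothetical.

```python
import random
from datetime import datetime, timedelta


def random_datetime(rng: random.Random, start: datetime, end: datetime) -> datetime:
    """Uniformly pick a datetime between start and end (inclusive)."""
    span_seconds = (end - start).total_seconds()
    return start + timedelta(seconds=rng.uniform(0, span_seconds))


def datetime_after(rng: random.Random, parent_value: datetime, end: datetime) -> datetime:
    """Pick a datetime that is not earlier than the parent row's value."""
    if parent_value > end:
        raise ValueError("parent value is later than the allowed end")
    return random_datetime(rng, parent_value, end)


rng = random.Random(42)
range_start, range_end = datetime(2023, 1, 1), datetime(2024, 12, 31)

# Parent rows: customer_id -> signup_date
signups = {cid: random_datetime(rng, range_start, range_end) for cid in range(1, 6)}

# Child rows: each order picks a customer, then a date after that customer's signup.
orders = []
for order_id in range(1, 101):
    customer_id = rng.choice(list(signups))
    order_date = datetime_after(rng, signups[customer_id], range_end)
    orders.append((order_id, customer_id, order_date))
```

The child date has to be generated after the foreign key is chosen, since the lower bound depends on which parent row the child points to.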
Reference Types
| Type | Description | Parameters |
|---|---|---|
| `ref` | Foreign key reference | table, column, distribution |
Distribution Types
- `uniform`: Equal probability (default)
- `normal`: Bell curve (specify `mean`, `std`)
- `lognormal`: Right-skewed for prices, sizes
- `zipf`: Power law for popularity rankings
- `exponential`: For time intervals, queue lengths
Examples & Templates
Built-in Templates
# E-commerce with customers, products, orders
mockmyschema create-template ecommerce -o ecommerce.yaml
# Banking with accounts, transactions
mockmyschema create-template banking -o banking.yaml
# SaaS with organizations, users, projects
mockmyschema create-template saas -o saas.yaml
# Simple blog with users, posts
mockmyschema create-template simple -o blog.yaml
Real-World Examples
# Generate 10M row e-commerce dataset
mockmyschema generate ecommerce.yaml -o ./big_data --chunk-size 50000
# Compressed output for storage efficiency
mockmyschema generate banking.yaml -o ./bank_data --compress
# Reproducible datasets with seeds
mockmyschema generate saas.yaml -o ./test_data --seed 12345
# Validation without generation
mockmyschema validate my_schema.yaml
CLI Commands
Generate Data
mockmyschema generate schema.yaml [OPTIONS]
Options:
-o, --output DIR Output directory (default: ./output)
--format [csv|sql|both] Output format (default: csv)
--seed INT Random seed for reproducible generation
--chunk-size INT Chunk size for memory-efficient generation
--compress Compress output files with gzip
--quiet Suppress progress output
--validate-only Only validate schema without generating data
--stats Show generation statistics
Validate Schema
mockmyschema validate schema.yaml
Create Templates
mockmyschema create-template {simple,ecommerce,banking,saas} -o output.yaml
System Info
mockmyschema info
MockMySchema vs Alternatives
| Feature | MockMySchema | Faker | Mockaroo |
|---|---|---|---|
| Schema-driven | YAML | Code only | Web UI |
| Foreign Keys | Smart pools | Manual | Limited |
| Distributions | 7 types | Limited | Some |
| Scale | Millions | Memory bound | Paid tiers |
| Temporal Logic | `after` constraints | None | None |
| Reproducible | Seed support | Basic | No |
| CLI First | Rich CLI | Library only | Web only |
| Open Source | MIT | MIT | Freemium |
Advanced Usage
Complex Relationships
# Multi-level dependencies with temporal ordering
users  →  accounts  →  transactions
  ↓          ↓              ↓
signup     opened      after_opened
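A chain like this implies that parent tables must be generated before the tables that reference them. One way to derive such an order is a topological sort over `ref` and `after` dependencies, sketched here with the standard library's `graphlib`. The function `generation_order` and the column names in the sample schema are hypothetical, and this is not necessarily how MockMySchema resolves dependencies.

```python
from graphlib import CycleError, TopologicalSorter


def generation_order(schema: dict) -> list[str]:
    """Order tables so every referenced table comes before its dependants."""
    sorter: TopologicalSorter = TopologicalSorter()
    for table_name, table in schema["tables"].items():
        sorter.add(table_name)  # register tables that have no dependencies
        for column in table["columns"].values():
            if column.get("type") == "ref" and column["table"] != table_name:
                sorter.add(table_name, column["table"])
            after = column.get("after")  # e.g. "users.signup"
            if after:
                parent_table = after.split(".", 1)[0]
                if parent_table != table_name:
                    sorter.add(table_name, parent_table)
    # static_order() raises CycleError if the tables depend on each other in a loop.
    return list(sorter.static_order())


schema = {"tables": {
    "transactions": {"columns": {
        "account_id": {"type": "ref", "table": "accounts", "column": "account_id"},
        "created": {"type": "datetime", "after": "accounts.opened"}}},
    "accounts": {"columns": {
        "account_id": {"type": "uuid"},
        "user_id": {"type": "ref", "table": "users", "column": "user_id"},
        "opened": {"type": "datetime", "after": "users.signup"}}},
    "users": {"columns": {
        "user_id": {"type": "uuid"},
        "signup": {"type": "datetime"}}},
}}
order = generation_order(schema)
print(order)
```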
Distribution Examples
# Realistic price distribution (most items cheap, few expensive)
price:
  type: decimal
  min_value: 5.99
  max_value: 999.99
  distribution: lognormal
# Popular items get more orders (80/20 rule)
product_id:
  type: ref
  table: products
  column: product_id
  distribution: zipf
# Normal age distribution
age:
  type: int
  min_value: 18
  max_value: 80
  distribution: normal
Memory Optimization
# For large tables, use smaller chunks
large_table:
  rows: 10_000_000
  chunk_size: 100_000  # Process in 100K chunks
Roadmap
- Multiple Output Formats: Parquet, JSON, Delta Lake
- Parallel Generation: Multi-core processing for massive datasets
- Streaming Output: Kafka, database connectors
- Web UI: Visual schema builder and preview
- Data Profiling: Statistics and quality metrics
- Plugin System: Custom generators and formats
- Cloud Integration: S3, BigQuery, Snowflake outputs
Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
Development Setup
git clone https://github.com/radha9887/mockmyschema.git
cd mockmyschema
pip install -e ".[dev]"
pytest
License
MIT License - see LICENSE file for details.
Acknowledgments
- Faker: For the excellent semantic data generation library
- Click: For the beautiful CLI framework
- NumPy: For fast numerical computing
- PyYAML: For clean configuration parsing
Made with ❤️ for developers who need realistic test data
⭐ Star us on GitHub if MockMySchema helps your project!