Rand Engine v2. Package with some methods to generate random data in different formats. Great to mock data while testing or developing.
Project description
Rand Engine
High-performance synthetic data generation for testing, development, and prototyping.
A Python library for generating millions of rows of realistic synthetic data through declarative specifications. Built on NumPy and Pandas for maximum performance.
๐ฅ What's New in v0.6.1
- โ Constraints System: Primary Keys (PK) and Foreign Keys (FK) for referential integrity between specs
- โ Composite Keys: Support for multi-column primary and foreign keys
- โ Watermarks: Temporal windows for realistic time-based relationships
- โ Enhanced Validation: Educational error messages with examples
- โ Logging System: Transparent logging with Python's built-in logger
- โ Windows Support: Full cross-platform compatibility (Linux, macOS, Windows)
๐ Complete documentation: CONSTRAINTS.md | EXAMPLES.md
๐ฆ Installation
pip install rand-engine
๐ฏ Who Is This For?
- Data Engineers: Test ETL/ELT pipelines without production data dependencies
- QA Engineers: Generate realistic datasets for load and integration testing
- Data Scientists: Mock data during model development and validation
- Backend Developers: Populate development and staging environments
- BI Professionals: Create demos and POCs without exposing sensitive data
๐ Quick Start
1. Pre-Built Examples (Fastest Way to Start)
DataGenerator produces synthetic datasets in seconds. It leverages NumPy and Pandas for blazing-fast random data generation.
Creating 1 million rows is as simple as:
- Choose a built-in RandSpec (e.g.,
customers,orders,transactions) - Set the size (number of rows)
- Optionally set a seed for reproducibility
- Call
.get_df()to obtain a pandas DataFrame
What's a RandSpec? A declarative specification dictionary that defines your dataset's structure and generation rules.
Rand Engine includes 10+ ready-to-use RandSpecs covering common business domainsโno configuration needed.
from rand_engine import DataGenerator, RandSpecs
# Generate 1 million customer records in seconds
customers_spec = RandSpecs.customers() # Pre-built specification
df_customers = DataGenerator(customers_spec, seed=42).size(1_000_000).get_df()
print(df_customers.head())
Output:
customer_id name age email is_active account_balance
0 C00000001 John Smith 42 john.smith@email.com True 15432.50
1 C00000002 Jane Brown 28 jane.brown@email.com True 8721.33
2 C00000003 Bob Wilson 56 bob.wilson@email.com False 42156.89
3 C00000004 Alice Davis 33 alice.davis@email.com True 23400.12
4 C00000005 Tom Miller 49 tom.miller@email.com True 31245.67
Size parameter can be an integer or a callable function that returns an integer.
from rand_engine import DataGenerator, RandSpecs
from random import randint
lambda_size = lambda: randint(500_000, 2_000_000)
# Generate 1 million customer records in seconds
customers_spec = RandSpecs.customers() # Pre-built specification
df_customers = DataGenerator(customers_spec, seed=42).size(lambda_size).get_df()
print(df_customers.shape)
2. Databricks Integration
Seamlessly integrate with Databricks and other Spark environments. Generate synthetic data and convert to Spark DataFrames with zero friction.
from rand_engine import DataGenerator, RandSpecs
# Generate synthetic data for Databricks
transactions_spec = RandSpecs.transactions()
df_pandas = DataGenerator(transactions_spec, seed=42).size(1_000_000).get_df()
# Option 1: Display pandas DataFrame
display(df_pandas)
# Option 2: Convert to Spark DataFrame for distributed processing
df_spark = spark.createDataFrame(df_pandas)
display(df_spark)
# Write directly to Delta Lake, Parquet, or any Spark-supported format
df_spark.write.format("delta").mode("overwrite").save("/path/to/delta/table")
3. Explore Built-In RandSpecs
Rand Engine provides 10+ production-ready specifications across multiple business domains. Each spec generates realistic, correlated data with 6+ fields.
from rand_engine import DataGenerator, RandSpecs
# ๐๏ธ E-Commerce & Retail
builtin_specs = [
RandSpecs.customers(), # Customer profiles with contact info
RandSpecs.products(), # Product catalog with pricing
RandSpecs.orders(), # Orders with currency/country correlation
RandSpecs.invoices(), # Invoice records with payment details
RandSpecs.shipments(), # Shipping data with carrier/destination
]
# ๐ฐ Financial Services
builtin_specs += [
RandSpecs.transactions(), # Financial transactions with amounts
]
# ๐ฅ HR & User Management
builtin_specs += [
RandSpecs.employees(), # Employee records (dept/level/role)
RandSpecs.users(), # Application users with auth data
]
# ๐ง IoT & System Monitoring
builtin_specs += [
RandSpecs.devices(), # IoT device telemetry
RandSpecs.events(), # System event logs
]
# Generate millions of rows from any spec
for spec in builtin_specs:
df = DataGenerator(spec, seed=42).size(1_000_000).get_df()
print(f"\n{spec['__meta__']['name']}:")
print(df.head())
๐ก Pro Tip: These specs include realistic correlations (e.g., orders.currency matches orders.country, employees.level correlates with salary). Perfect for testing real-world scenarios!
4. File Writing Capabilities
Generate and write synthetic data directly to filesโno intermediate DataFrames needed. Supports CSV, Parquet, and JSON with advanced options.
4.1. Supported Formats & Compression
| Format | Compression Options | Use Case |
|---|---|---|
| CSV | None, gzip, zip, bz2 | Human-readable, spreadsheet imports |
| Parquet | None, snappy, gzip | Columnar analytics, data lakes |
| JSON | None, gzip, zip, bz2 | APIs, document stores, config files |
4.2. Batch Writing Mode
Write synthetic data in single or multiple files with full control over format, compression, and write modes.
- As rand-engine generates pandas DataFrames under the hood, data can be written efficiently using Pandas' built-in I/O capabilities.
- Although users can easily obtain DataFrames via
.get_df()and write them manually, the built-in.writeinterface simplifies the process significantly.
๐ Single File Writing
Write an entire dataset to a single fileโideal for small to medium datasets or when you need one consolidated output.
from rand_engine import DataGenerator, RandSpecs
# Write 10,000 customer records to a single CSV file
local_path = "./data/customers"
# Or Databricks: "/Volumes/prd/demo_volumes/rand_engine_data/customers"
(
DataGenerator(RandSpecs.customers())
.write
.size(10_000)
.format("csv") # Format: csv, json, or parquet
.mode("overwrite") # Mode: overwrite or append
.option("compression", None) # Optional: gzip, zip, bz2
.save(local_path)
)
Key Features:
- โ
Auto file extension:
customersโcustomers.csvautomatically - โ Overwrite mode: Replaces existing file
- โ Append mode: Adds new records to existing file
- โ Full path control: Specify exact output location
Result: Single file created at ./data/customers.csv with 10,000 rows.
๐ฆ Multiple Files Writing
- Write large datasets using multiple batches.
- Perfect for generating massive datasets on disk without overwhelming memory.
from rand_engine import DataGenerator, RandSpecs
# Write 500,000 records split across 5 CSV files (100,000 rows each)
(
DataGenerator(RandSpecs.customers())
.write
.size(100_000)
.format("csv")
.mode("overwrite")
.option("numFiles", 5) # Split into 5 files
.option("compression", "gzip") # Compress each file
.save("./data/customers_multi")
)
Output Structure:
data/
โโโ customers_multi/
โโโ part_a3f2c91e.csv.gz (10,000 rows)
โโโ part_b7e4d23a.csv.gz (10,000 rows)
โโโ part_c1f8e45b.csv.gz (10,000 rows)
โโโ part_d9a2f67c.csv.gz (10,000 rows)
โโโ part_e5b3g89d.csv.gz (10,000 rows)
Important Behaviors:
- ๐๏ธ Folder creation:
numFiles > 1automatically creates a directory - โ Append mode: Adds new files to existing folder without removing old ones
- ๐ Overwrite mode: Clears folder contents before writing new files
- ๐ฒ Random names: Files get unique identifiers (
part_<hash>.<ext>)
๐ฏ Advanced Example: Testing All Format/Compression Combinations
Easy for Data Engineers to generate synthetic data for learning, development, and testing.
- Test your entire data pipeline with different format and compression options.
- Users can quickly understand different performance and storage trade-offs.
- Great for benchmarking read/write speeds and compression ratios.
from rand_engine import DataGenerator, RandSpecs
base_path = "./data/batch_tests"
# Define all format and compression combinations to test
test_configs = [
# CSV variations
{"format": "csv", "compression": None, "path": f"{base_path}/csv/default/customers"},
{"format": "csv", "compression": "gzip", "path": f"{base_path}/csv/gzip/customers"},
{"format": "csv", "compression": "zip", "path": f"{base_path}/csv/zip/customers"},
{"format": "csv", "compression": "bz2", "path": f"{base_path}/csv/bz2/customers"},
# JSON variations
{"format": "json", "compression": None, "path": f"{base_path}/json/default/customers"},
{"format": "json", "compression": "gzip", "path": f"{base_path}/json/gzip/customers"},
{"format": "json", "compression": "zip", "path": f"{base_path}/json/zip/customers"},
{"format": "json", "compression": "bz2", "path": f"{base_path}/json/bz2/customers"},
# Parquet variations
{"format": "parquet", "compression": None, "path": f"{base_path}/parquet/default/customers"},
{"format": "parquet", "compression": "snappy", "path": f"{base_path}/parquet/snappy/customers"},
{"format": "parquet", "compression": "gzip", "path": f"{base_path}/parquet/gzip/customers"},
]
# Test 1: Write single files (10,000 rows each)
print("๐ Writing single files...")
for config in test_configs:
(
DataGenerator(RandSpecs.customers())
.write
.size(10_000)
.format(config["format"])
.mode("overwrite")
.option("compression", config["compression"])
.save(config["path"])
)
print(f" โ
{config['format']} ({config['compression'] or 'none'}) โ {config['path']}")
# Test 2: Write multiple files (5 files ร 10,000 rows = 50,000 total)
print("\n๐ฆ Writing multiple files...")
for config in test_configs:
multi_path = config["path"].replace("/batch_tests/", "/batch_tests/multi_")
(
DataGenerator(RandSpecs.customers())
.write
.size(50_000)
.format(config["format"])
.mode("overwrite")
.option("numFiles", 5)
.option("compression", config["compression"])
.save(multi_path)
)
print(f" โ
{config['format']} ({config['compression'] or 'none'}) โ {multi_path}/ (5 files)")
Use Cases:
- ๐งช Testing data pipelines with different file formats
- ๐ Benchmarking compression ratios and read/write performance
- ๐ CI/CD validation of file processing workflows
- ๐ Data lake ingestion testing with various formats
4.3. Streaming Write Mode
Generate and write data continuously with controlled throughputโperfect for testing real-time pipelines, Kafka producers, or event-driven systems.
from rand_engine import DataGenerator, RandSpecs
# Stream customer records at 20k records/second
(
DataGenerator(RandSpecs.customers())
.size(10**5)
.writeStream
.format("json")
.mode("overwrite")
.option("compression", "gzip")
.option("timeout", 100)
.trigger(frequency=5)
.start(path="./data/stream/customers")
)
Key Features:
- ๐ Controlled throughput: Simulate realistic event rates (20k records/sec)
- โพ๏ธ Continuous generation: Runs indefinitely until stopped or timeout reached
- โฑ๏ธ Auto timestamps: Each record includes
timestamp_createdfield - ๐ Append mode: New files created every N records (configurable)
- ๐ Databricks integration: Perfect to feed Auto loader on Databricks
Output Structure:
data/stream/customers/
โโโ stream_2025-10-25_14-30-05.json.gz (1,000 records)
โโโ stream_2025-10-25_14-31-12.json.gz (1,000 records)
โโโ stream_2025-10-25_14-32-18.json.gz (1,000 records)
โโโ ... (continues streaming)
Use Cases:
- ๐ Kafka testing: Simulate producers with realistic data
- ๐ CDC pipelines: Test change data capture workflows
- ๐ Real-time analytics: Feed data to streaming platforms (Spark Streaming, Flink)
- ๐งช Load testing: Stress-test event ingestion systems
5. Build Custom Specifications
Ready to create your own specs? Define custom data structures with full control over generation logic.
from rand_engine import DataGenerator
# Define your custom specification
custom_spec = {
"user_id": {
"method": "unique_ids",
"kwargs": {"strategy": "zint", "length": 8} # Zero-padded integers: 00000001
},
"age": {
"method": "integers",
"kwargs": {"min": 18, "max": 65} # Random integers
},
"salary": {
"method": "floats",
"kwargs": {"min": 30_000.0, "max": 150_000.0, "round": 2} # Decimals
},
"is_premium": {
"method": "booleans",
"kwargs": {"true_prob": 0.15} # 15% will be True
},
"department": {
"method": "distincts",
"kwargs": {"distincts": ["Engineering", "Sales", "Marketing", "HR"]}
}
}
# Generate 10 million rows
df = DataGenerator(custom_spec, seed=42).size(10_000_000).get_df()
print(df.head())
Spec Anatomy:
"method": Core generation function (see table below)"kwargs": Method-specific parameters- Declarative: Define what you want, not how to generate it
๐ Core Generation Methods Reference
Complete API for building custom specifications:
| Method | Description | Parameters | Example |
|---|---|---|---|
unique_ids |
Unique identifiers | strategy: "zint", "uuid4", "sequence"length: digits (zint only) |
User IDs, order numbers, SKUs |
integers |
Random integers | min: minimum valuemax: maximum value |
Ages, quantities, counts |
floats |
Random decimals | min: minimum valuemax: maximum valueround: decimal places |
Prices, weights, percentages |
floats_normal |
Normal distribution | mean: center valuestd: spreadround: decimals |
Heights, test scores, temperatures |
booleans |
True/False flags | true_prob: probability of True (0.0-1.0) |
Active flags, feature toggles |
distincts |
Random selection | distincts: list of values |
Categories, statuses, types |
distincts_prop |
Weighted selection | distincts: {value: weight, ...} |
Product mix (70% A, 30% B) |
unix_timestamps |
Date/time values | start: start date (YYYY-MM-DD)end: end dateformato: output format |
Created dates, event times |
Quick Example:
# Product catalog with realistic distributions
product_spec = {
"product_id": {
"method": "unique_ids",
"kwargs": {"strategy": "zint", "length": 10}
},
"price": {
"method": "floats",
"kwargs": {"min": 9.99, "max": 999.99, "round": 2}
},
"category": {
"method": "distincts",
"kwargs": {"distincts": ["Electronics", "Clothing", "Food", "Books"]}
},
"in_stock": {
"method": "booleans",
"kwargs": {"true_prob": 0.85} # 85% in stock
},
"rating": {
"method": "floats_normal",
"kwargs": {"mean": 4.2, "std": 0.8, "round": 1} # Bell curve around 4.2โ
}
}
df_products = DataGenerator(product_spec, seed=123).size(1_000_000).get_df()
print(df_products)
๐ For advanced examples: See EXAMPLES.md for correlated columns, composite keys, and more.
๐จ Real-World Use Cases
๐ E-Commerce with Referential Integrity
Create realistic multi-level datasets with proper Primary Key (PK) and Foreign Key (FK) relationships. Rand Engine uses an internal checkpoint database (DuckDB/SQLite) to ensure 100% referential integrity.
from rand_engine import DataGenerator
# Level 1: Categories (Primary Key)
spec_categories = {
"category_id": {
"method": "unique_ids",
"kwargs": {"strategy": "zint", "length": 4}
},
"category_name": {
"method": "distincts",
"kwargs": {"distincts": ["Electronics", "Books", "Clothing", "Home"]}
},
"constraints": {
"category_pk": {
"name": "category_pk",
"tipo": "PK",
"fields": ["category_id VARCHAR(4)"]
}
}
}
# Level 2: Products (Foreign Key โ Categories)
spec_products = {
"product_id": {
"method": "unique_ids",
"kwargs": {"strategy": "zint", "length": 8}
},
"product_name": {
"method": "distincts",
"kwargs": {"distincts": [f"Product {i:03d}" for i in range(100)]}
},
"price": {
"method": "floats",
"kwargs": {"min": 10.0, "max": 1000.0, "round": 2}
},
"constraints": {
"product_pk": {
"name": "product_pk",
"tipo": "PK",
"fields": ["product_id VARCHAR(8)"]
},
"category_fk": {
"name": "category_pk", # References category_pk constraint
"tipo": "FK",
"fields": ["category_id"],
"watermark": 60 # Only reference categories created in last 60 records
}
}
}
# Level 3: Orders (Foreign Key โ Products)
spec_orders = {
"order_id": {
"method": "unique_ids",
"kwargs": {"strategy": "uuid4"}
},
"quantity": {
"method": "integers",
"kwargs": {"min": 1, "max": 10}
},
"total": {
"method": "floats",
"kwargs": {"min": 10.0, "max": 5000.0, "round": 2}
},
"constraints": {
"product_fk": {
"name": "product_pk", # References product_pk constraint
"tipo": "FK",
"fields": ["product_id"],
"watermark": 120
}
}
}
# Generate datasets (order matters: Categories โ Products โ Orders)
df_categories = DataGenerator(spec_categories).size(10).get_df()
df_products = DataGenerator(spec_products).size(100).get_df()
df_orders = DataGenerator(spec_orders).size(1_000).get_df()
# Verify referential integrity
print(f"โ
All products reference valid categories: {set(df_products['category_id']).issubset(set(df_categories['category_id']))}")
print(f"โ
All orders reference valid products: {set(df_orders['product_id']).issubset(set(df_products['product_id']))}")
๐ Complete constraints guide: CONSTRAINTS.md
๐๏ธ Architecture
Design Philosophy
- Declarative: Specify what you want, not how to generate it
- Performance: Built on NumPy for vectorized operations (millions of rows/second)
- Simplicity: Pre-built examples for immediate use
- Extensibility: Easy to create custom specifications
Public API
from rand_engine import DataGenerator, RandSpecs
# That's it! Simple and clean.
All internal modules (prefixed with _) are implementation details.
๐งช Quality & Testing
- 236 tests passing (20 new constraint tests in v0.6.1)
- Comprehensive coverage of all generation methods
- Validated on millions of generated records
- Battle-tested in production ETL pipelines
- Constraint validation with 100% integrity checks
# Run tests
pytest
# Run constraint tests only
pytest tests/test_8_consistency.py -v
# With coverage report
pytest --cov=rand_engine --cov-report=html
๐ก Tips & Best Practices
For Data Engineers
- Use
seedparameter for reproducible test data - Export to Parquet with compression for large datasets
- Use streaming mode for continuous data generation
- Leverage constraints for multi-table data generation with referential integrity
- Use
.checkpoint(":memory:")for in-memory databases or.checkpoint("path/to/db.duckdb")for persistence
For QA Engineers
- Start with pre-built specs (RandSpecs)
- Use validation mode (
validate=True) during development - Generate edge cases with low probability booleans
- Create multiple test datasets with different seeds
- Test PK/FK relationships with constraints for realistic scenarios
Performance Tips
- Generate data in batches for optimal memory usage
- Use Parquet format for large datasets (10x smaller than CSV)
- Enable compression for file exports
- Reuse DataGenerator instances when generating multiple datasets
- Use watermarks to control FK relationship size (avoid loading entire checkpoint tables)
Constraints Best Practices
- Use composite keys for complex relationships (e.g.,
client_id + client_type) - Set appropriate watermarks (60-3600 seconds) based on data freshness requirements
- Use in-memory databases (
:memory:) for testing, disk-based for production - Generate PK specs before FK specs to ensure checkpoint tables exist
- Validate integrity with set operations:
set(fk_values).issubset(set(pk_values))
๐ 50+ production-ready examples: EXAMPLES.md
๐ Requirements
- Python: >= 3.10
- numpy: >= 2.1.1
- pandas: >= 2.2.2
- faker: >= 28.4.1 (optional, for realistic names/addresses)
- duckdb: >= 1.1.0 (optional, for constraints with DuckDB)
- sqlite3: Built-in Python (for constraints with SQLite)
๐ Documentation
- EXAMPLES.md: 50+ production-ready examples (1,600+ lines)
- CONSTRAINTS.md: Complete guide to PK/FK system (900+ lines)
- API_REFERENCE.md: Full method reference
- LOGGING.md: Logging configuration guide
๐ Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: marcourelioreislima@gmail.com
๐ License
MIT License - see LICENSE file for details.
๐ Star History
If you find this project useful, consider giving it a โญ on GitHub!
Built with โค๏ธ for Data Engineers, QA Engineers, and the entire data community
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file rand_engine-0.6.3rc1.tar.gz.
File metadata
- Download URL: rand_engine-0.6.3rc1.tar.gz
- Upload date:
- Size: 44.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
15ea9f67607dfac89d5a10e5da52a25811bb4fa2d5b494ae6c998e5137c4b6c7
|
|
| MD5 |
1a2b01f5d9f00fd5ef667066a45f229e
|
|
| BLAKE2b-256 |
5a404424101f5f272ab810661aab50a345cf46d5630048b4105c80bacb90cb7c
|
Provenance
The following attestation bundles were made for rand_engine-0.6.3rc1.tar.gz:
Publisher:
auto_tag_publish_development.yml on marcoaureliomenezes/rand_engine
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
rand_engine-0.6.3rc1.tar.gz -
Subject digest:
15ea9f67607dfac89d5a10e5da52a25811bb4fa2d5b494ae6c998e5137c4b6c7 - Sigstore transparency entry: 653671926
- Sigstore integration time:
-
Permalink:
marcoaureliomenezes/rand_engine@b9595d76c82d1a15122e8e12fda02e9e3601ac91 -
Branch / Tag:
refs/heads/development - Owner: https://github.com/marcoaureliomenezes
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
auto_tag_publish_development.yml@b9595d76c82d1a15122e8e12fda02e9e3601ac91 -
Trigger Event:
pull_request
-
Statement type:
File details
Details for the file rand_engine-0.6.3rc1-py3-none-any.whl.
File metadata
- Download URL: rand_engine-0.6.3rc1-py3-none-any.whl
- Upload date:
- Size: 49.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
7aed5cf2bd882274e000b494725eecfa48300fab1ec24b760e8aaf722692a07f
|
|
| MD5 |
0a268e3282a2cc3ef8e35207ff734b9e
|
|
| BLAKE2b-256 |
2c9891dd4e7e6af3eec78c9d7a8c445594d6773d7cfd5e67ca71300b6a0fe811
|
Provenance
The following attestation bundles were made for rand_engine-0.6.3rc1-py3-none-any.whl:
Publisher:
auto_tag_publish_development.yml on marcoaureliomenezes/rand_engine
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
rand_engine-0.6.3rc1-py3-none-any.whl -
Subject digest:
7aed5cf2bd882274e000b494725eecfa48300fab1ec24b760e8aaf722692a07f - Sigstore transparency entry: 653671941
- Sigstore integration time:
-
Permalink:
marcoaureliomenezes/rand_engine@b9595d76c82d1a15122e8e12fda02e9e3601ac91 -
Branch / Tag:
refs/heads/development - Owner: https://github.com/marcoaureliomenezes
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
auto_tag_publish_development.yml@b9595d76c82d1a15122e8e12fda02e9e3601ac91 -
Trigger Event:
pull_request
-
Statement type: