Rand Engine
High-performance synthetic data generation for testing, development, and prototyping.
A Python library for generating millions of rows of realistic synthetic data through declarative specifications. Built on NumPy and Pandas for maximum performance.
📦 Installation
```shell
pip install rand-engine
```
✅ Requirements
- Python: >= 3.10
- numpy: >= 2.1.1
- pandas: >= 2.2.2
- faker: >= 28.4.1 (optional, for realistic data)
- duckdb: >= 1.1.0 (optional, for database integrations)
🎯 Who Is This For?
- Data Engineers: Test ETL/ELT pipelines without production data dependencies
- QA Engineers: Generate realistic datasets for load and integration testing
- Data Scientists: Mock data during model development and validation
- Backend Developers: Populate development and staging environments
- BI Professionals: Create demos and POCs without exposing sensitive data
🚀 Quick Start
1. Simple Data Generation
```python
from rand_engine import DataGenerator

# Declarative specification
spec = {
    "user_id": {
        "method": "unique_ids",
        "kwargs": {"strategy": "zint", "length": 8}
    },
    "age": {
        "method": "integers",
        "kwargs": {"min": 18, "max": 65}
    },
    "salary": {
        "method": "floats",
        "kwargs": {"min": 30000.0, "max": 150000.0, "round": 2}
    },
    "is_active": {
        "method": "booleans",
        "kwargs": {"true_prob": 0.8}
    },
    "plan": {
        "method": "distincts",
        "kwargs": {"distincts": ["free", "basic", "premium", "enterprise"]}
    }
}

# Generate a DataFrame
generator = DataGenerator(spec, seed=42)
df = generator.size(10000).get_df()
print(df.head())
```

Output:

```
    user_id  age     salary  is_active        plan
0  00000001   42   87543.21       True     premium
1  00000002   28   45621.89       True        free
2  00000003   56  132041.50      False       basic
3  00000004   33   62789.12       True     premium
4  00000005   49   98234.77       True  enterprise
```
2. Export to Multiple Formats
```python
# CSV with gzip compression
generator.write.size(100000).format("csv").option("compression", "gzip").save("users.csv")

# Parquet with snappy compression
generator.write.size(1000000).format("parquet").option("compression", "snappy").save("users.parquet")

# JSON
generator.write.size(50000).format("json").save("users.json")
```
3. Streaming Data Generation
```python
# Generate a continuous stream of records
stream = generator.stream_dict(min_throughput=5, max_throughput=15)

for record in stream:
    # Each record includes an automatic timestamp_created field
    print(record)
    # Send to Kafka, an API, a database, etc.
```
4. Reproducible Data with Seeds
```python
# Same seed = identical data
df1 = DataGenerator(spec, seed=42).size(1000).get_df()
df2 = DataGenerator(spec, seed=42).size(1000).get_df()

assert df1.equals(df2)  # True
```
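Since the library is built on NumPy, this determinism ultimately comes down to seeding NumPy's random generator. A minimal sketch of the same guarantee at the NumPy level (illustrative only, not the library's internals):

```python
import numpy as np

# Two generators constructed with the same seed produce identical draws,
# which is why spec + seed pins down an entire dataset.
a = np.random.default_rng(42).integers(18, 66, size=1000)
b = np.random.default_rng(42).integers(18, 66, size=1000)
assert (a == b).all()
```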
📚 Available Generation Methods
Core Methods
| Method | Description | Example |
|---|---|---|
| `integers` | Random integers within a range | `{"method": "integers", "kwargs": {"min": 0, "max": 100}}` |
| `int_zfilled` | Zero-padded numeric strings | `{"method": "int_zfilled", "kwargs": {"length": 8}}` |
| `floats` | Random floats with precision | `{"method": "floats", "kwargs": {"min": 0.0, "max": 100.0, "round": 2}}` |
| `floats_normal` | Normally distributed floats | `{"method": "floats_normal", "kwargs": {"mean": 50, "std": 10, "round": 2}}` |
| `booleans` | Boolean values with probability | `{"method": "booleans", "kwargs": {"true_prob": 0.7}}` |
| `distincts` | Random selection from a list | `{"method": "distincts", "kwargs": {"distincts": ["A", "B", "C"]}}` |
| `distincts_prop` | Weighted random selection | `{"method": "distincts_prop", "kwargs": {"distincts": {"mobile": 70, "desktop": 30}}}` |
| `unix_timestamps` | Unix timestamps in a range | `{"method": "unix_timestamps", "kwargs": {"start": "01-01-2020", "end": "31-12-2023", "format": "%d-%m-%Y"}}` |
| `unique_ids` | Unique identifiers | `{"method": "unique_ids", "kwargs": {"strategy": "zint", "length": 10}}` |
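As a mental model, each spec entry dispatches to one vectorized NumPy call. The sketch below is hypothetical (`generate_column` is not the library's API, and details such as inclusive bounds are assumptions), but it illustrates how a declarative entry can map to vectorized generation:

```python
import numpy as np

def generate_column(rng, size, method, kwargs):
    # Hypothetical dispatcher mirroring the spec entries in the table above;
    # an illustration of the idea, not the library's internals.
    if method == "integers":
        # Assumes min/max are inclusive, as in the docs' examples
        return rng.integers(kwargs["min"], kwargs["max"] + 1, size=size)
    if method == "floats":
        vals = rng.uniform(kwargs["min"], kwargs["max"], size=size)
        return np.round(vals, kwargs.get("round", 2))
    if method == "booleans":
        return rng.random(size) < kwargs.get("true_prob", 0.5)
    if method == "distincts":
        return rng.choice(kwargs["distincts"], size=size)
    raise ValueError(f"unknown method: {method}")

rng = np.random.default_rng(0)
ages = generate_column(rng, 5, "integers", {"min": 18, "max": 65})
# ages is a length-5 array of values in [18, 65]
```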
Advanced Methods
| Method | Description | Use Case |
|---|---|---|
| `distincts_map` | Correlated 2-column pairs | Device → OS mapping |
| `distincts_map_prop` | Weighted correlated pairs | Product → Status with weights |
| `distincts_multi_map` | N-column Cartesian products | Company → Sector → Size |
| `complex_distincts` | Pattern-based generation | IP addresses, URLs, codes |
🔧 Advanced Features
1. Correlated Columns (2-Column Mapping)
Generate correlated data where one column determines another:
```python
spec = {
    "device_os": {
        "method": "distincts_map",
        "cols": ["device_type", "os"],
        "kwargs": {
            "distincts": {
                "smartphone": ["Android", "iOS"],
                "tablet": ["Android", "iOS", "iPadOS"],
                "desktop": ["Windows", "macOS", "Linux"]
            }
        }
    }
}

df = DataGenerator(spec).size(1000).get_df()
# Result: 2 columns (device_type, os) with only valid combinations
```
2. Weighted Correlated Data
```python
spec = {
    "product_status": {
        "method": "distincts_map_prop",
        "cols": ["product", "status"],
        "kwargs": {
            "distincts": {
                "laptop": [("new", 90), ("refurbished", 10)],
                "phone": [("new", 95), ("refurbished", 5)],
                "tablet": [("new", 85), ("refurbished", 15)]
            }
        }
    }
}

df = DataGenerator(spec).size(10000).get_df()
# 90% of laptops will be "new", 10% "refurbished"
```
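Conceptually, weighted correlated sampling amounts to drawing from a per-key weighted list. The helper below is purely illustrative (it is not how the library implements `distincts_map_prop`, which is vectorized), but it shows the semantics:

```python
import random

# Per-key weighted choices, mirroring the spec above
mapping = {
    "laptop": [("new", 90), ("refurbished", 10)],
    "phone": [("new", 95), ("refurbished", 5)],
}

def sample_status(rng, product):
    # Unzip (value, weight) pairs and draw one weighted sample
    statuses, weights = zip(*mapping[product])
    return rng.choices(statuses, weights=weights, k=1)[0]

rng = random.Random(42)
rows = [("laptop", sample_status(rng, "laptop")) for _ in range(1000)]
share_new = sum(s == "new" for _, s in rows) / len(rows)
# share_new is close to 0.90
```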
3. Complex Patterns (IP Addresses, URLs)
```python
spec = {
    "ip_address": {
        "method": "complex_distincts",
        "kwargs": {
            "pattern": "x.x.x.x",
            "replacement": "x",
            "templates": [
                {"method": "distincts", "kwargs": {"distincts": ["192", "10", "172"]}},
                {"method": "integers", "kwargs": {"min": 0, "max": 255}},
                {"method": "integers", "kwargs": {"min": 0, "max": 255}},
                {"method": "integers", "kwargs": {"min": 1, "max": 254}}
            ]
        }
    }
}

df = DataGenerator(spec).size(100).get_df()
# Output: 192.168.1.45, 10.0.52.231, 172.24.133.89, etc.
```
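The pattern mechanism can be pictured as filling each occurrence of the replacement token left to right from its own template. The `fill_pattern` helper below is a hypothetical re-creation of that idea, not the library's implementation:

```python
import random

def fill_pattern(rng, pattern, replacement, generators):
    # Split on the token; each generator fills one gap, in order
    parts = pattern.split(replacement)
    out = [parts[0]]
    for gen, tail in zip(generators, parts[1:]):
        out.append(str(gen(rng)))
        out.append(tail)
    return "".join(out)

rng = random.Random(7)
ip = fill_pattern(
    rng, "x.x.x.x", "x",
    [lambda r: r.choice(["192", "10", "172"]),
     lambda r: r.randint(0, 255),
     lambda r: r.randint(0, 255),
     lambda r: r.randint(1, 254)],
)
# ip is a dotted quad such as "10.213.44.17"
```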
4. Data Transformers
Apply transformations to generated data:
```python
from datetime import datetime

spec = {
    "timestamp": {
        "method": "unix_timestamps",
        "kwargs": {"start": "01-01-2023", "end": "31-12-2023", "format": "%d-%m-%Y"},
        # Column-level transformer
        "transformers": [
            lambda ts: datetime.fromtimestamp(ts).strftime("%Y-%m-%d %H:%M:%S")
        ]
    },
    "value": {
        "method": "integers",
        "kwargs": {"min": 100, "max": 1000}
    }
}

# DataFrame-level transformer
def add_year_column(df):
    df["year"] = df["timestamp"].str[:4]
    return df

df = (DataGenerator(spec)
      .transformers([add_year_column])
      .size(1000)
      .get_df())
```
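The column-level transformer semantics boil down to applying each function, in order, to every generated value. A minimal stand-alone illustration (the `apply_transformers` name is ours, and UTC is fixed here to keep the output deterministic):

```python
from datetime import datetime, timezone

def apply_transformers(values, transformers):
    # Apply each transformer in sequence over the whole column
    for fn in transformers:
        values = [fn(v) for v in values]
    return values

stamps = apply_transformers(
    [1672531200, 1700000000],
    [lambda ts: datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")],
)
# stamps == ["2023-01-01", "2023-11-14"]
```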
5. Spec Validation
Enable validation to catch errors early:
```python
invalid_spec = {
    "age": {
        "method": "integers"  # Missing required "min" and "max"
    }
}

try:
    generator = DataGenerator(invalid_spec, validate=True)
except Exception as e:
    print(e)
    # ❌ Column 'age': Missing required parameter 'min'
    # Correct example:
    # {
    #   "age": {
    #     "method": "integers",
    #     "kwargs": {"min": 18, "max": 65}
    #   }
    # }
```
🎨 Real-World Examples
E-commerce Orders
```python
spec = {
    "order_id": {
        "method": "unique_ids",
        "kwargs": {"strategy": "zint", "length": 10}
    },
    "customer_id": {
        "method": "integers",
        "kwargs": {"min": 1000, "max": 50000}
    },
    "product_category": {
        "method": "distincts_prop",
        "kwargs": {
            "distincts": {
                "electronics": 40,
                "clothing": 30,
                "home": 20,
                "sports": 10
            }
        }
    },
    "amount": {
        "method": "floats",
        "kwargs": {"min": 10.0, "max": 5000.0, "round": 2}
    },
    "payment_status": {
        "method": "distincts_prop",
        "kwargs": {
            "distincts": {
                "paid": 85,
                "pending": 10,
                "failed": 5
            }
        }
    },
    "created_at": {
        "method": "unix_timestamps",
        "kwargs": {"start": "01-01-2024", "end": "31-12-2024", "format": "%d-%m-%Y"}
    }
}

# Generate 1 million orders
orders = DataGenerator(spec, seed=42).size(1000000).get_df()
orders.to_parquet("orders.parquet", compression="snappy")
```
IoT Sensor Data
```python
spec = {
    "sensor_id": {
        "method": "distincts",
        "kwargs": {"distincts": [f"SENSOR_{i:03d}" for i in range(1, 101)]}
    },
    "temperature": {
        "method": "floats_normal",
        "kwargs": {"mean": 22.0, "std": 3.5, "round": 2}
    },
    "humidity": {
        "method": "floats_normal",
        "kwargs": {"mean": 60.0, "std": 10.0, "round": 1}
    },
    "battery_level": {
        "method": "integers",
        "kwargs": {"min": 0, "max": 100}
    },
    "status": {
        "method": "distincts_prop",
        "kwargs": {
            "distincts": {
                "active": 95,
                "warning": 4,
                "error": 1
            }
        }
    }
}

# Stream sensor readings
stream = DataGenerator(spec).stream_dict(min_throughput=10, max_throughput=50)
for reading in stream:
    # Send to a time-series database
    print(f"Sensor {reading['sensor_id']}: {reading['temperature']}°C")
```
User Behavior Logs
```python
spec = {
    "session_id": {
        "method": "unique_ids",
        "kwargs": {"strategy": "zint", "length": 12}
    },
    "device_os": {
        "method": "distincts_map",
        "cols": ["device", "os"],
        "kwargs": {
            "distincts": {
                "mobile": ["Android", "iOS"],
                "tablet": ["Android", "iOS"],
                "desktop": ["Windows", "macOS", "Linux"]
            }
        }
    },
    "page_views": {
        "method": "integers",
        "kwargs": {"min": 1, "max": 50}
    },
    "duration_seconds": {
        "method": "integers",
        "kwargs": {"min": 10, "max": 3600}
    },
    "converted": {
        "method": "booleans",
        "kwargs": {"true_prob": 0.03}  # 3% conversion rate
    }
}

logs = DataGenerator(spec, seed=123).size(500000).get_df()
```
🗂️ File Export Options
Batch Writing
```python
from rand_engine import DataGenerator

spec = {...}  # Your spec here

# CSV with compression
(DataGenerator(spec)
    .write
    .size(100000)
    .format("csv")
    .option("compression", "gzip")
    .option("index", False)
    .mode("overwrite")
    .save("output/data.csv"))

# Parquet split across multiple files
(DataGenerator(spec)
    .write
    .size(5000000)
    .format("parquet")
    .option("compression", "snappy")
    .option("numFiles", 10)  # Split into 10 files
    .save("output/data.parquet"))

# JSON with pretty printing
(DataGenerator(spec)
    .write
    .size(10000)
    .format("json")
    .option("indent", 2)
    .save("output/data.json"))
```
Streaming Writing
```python
# Write data in micro-batches
(DataGenerator(spec)
    .writeStream
    .microbatch_size(1000)
    .max_microbatches(100)
    .format("csv")
    .option("compression", "gzip")
    .save("output/stream/"))
```
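The micro-batch semantics can be pictured as: pull records in fixed-size chunks and emit one output per chunk until a batch cap is hit. A stand-alone sketch of that loop under assumed semantics (the `write_microbatches` helper is ours, writing CSV to in-memory buffers instead of files):

```python
import csv
import io

def write_microbatches(records, batch_size, max_batches):
    # Consume the record iterator chunk by chunk; one CSV "file" per chunk
    files = []
    it = iter(records)
    for _ in range(max_batches):
        batch = [row for _, row in zip(range(batch_size), it)]
        if not batch:
            break  # source exhausted before reaching the cap
        buf = io.StringIO()
        csv.writer(buf).writerows(batch)
        files.append(buf.getvalue())
    return files

chunks = write_microbatches(([i, i * i] for i in range(10)), 4, 100)
# 10 records with batch_size=4 -> 3 chunks (4 + 4 + 2 rows)
```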
🔌 Database Integrations
DuckDB Integration
```python
from rand_engine.integrations._duckdb_handler import DuckDBHandler

# Generate data
spec = {...}
df = DataGenerator(spec).size(100000).get_df()

# Create a handler (in-memory or file-based)
handler = DuckDBHandler(":memory:")  # or DuckDBHandler("mydb.duckdb")

# Create a table
handler.create_table("users", "user_id VARCHAR(10)")

# Insert data
handler.insert_df("users", df, pk_cols=["user_id"])

# Query data
result = handler.select_all("users")
print(result.head())

# Cleanup
handler.close()
```
SQLite Integration
```python
from rand_engine.integrations._sqlite_handler import SQLiteHandler

handler = SQLiteHandler("test.db")
handler.create_table("events", "event_id VARCHAR(10)")
handler.insert_df("events", df, pk_cols=["event_id"])

# Query with column selection
result = handler.select_all("events", columns=["event_id", "timestamp"])
handler.close()
```
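What such a handler does can be approximated with the standard library's `sqlite3` module directly; the table and column names below are illustrative, not the library's:

```python
import sqlite3

# In-memory database, like the handler's ":memory:" mode
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY, value INTEGER)")

# Bulk insert, roughly what insert_df does with DataFrame rows
rows = [("E001", 10), ("E002", 20)]
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
conn.close()
# count == 2
```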
🏗️ Architecture
Design Principles
- Declarative Specifications: Define what you want, not how to generate it
- High Performance: Built on NumPy for vectorized operations
- Type Safety: Full type hints and validation
- Composability: Chain methods for fluent API
- Extensibility: Easy to add custom generators and transformers
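The composability principle shows up in the chained `.write.size(...).format(...).option(...)` calls seen above. A toy builder illustrating that style (the class itself is a sketch; only the method names mirror the docs):

```python
class Builder:
    """Minimal fluent builder: each setter returns self to allow chaining."""

    def __init__(self):
        self.opts = {}

    def size(self, n):
        self.opts["size"] = n
        return self  # returning self is what makes chaining work

    def format(self, fmt):
        self.opts["format"] = fmt
        return self

    def option(self, key, value):
        self.opts[key] = value
        return self

b = Builder().size(1000).format("csv").option("compression", "gzip")
# b.opts == {"size": 1000, "format": "csv", "compression": "gzip"}
```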
Public API
The library exposes a single entry point:
```python
from rand_engine import DataGenerator
```
All internal modules (prefixed with `_`) are implementation details and may change.
Key Components
- DataGenerator: Main class for data generation
- SpecValidator: Educational validator with helpful error messages
- File Writers: Batch and stream writers for multiple formats
- Database Handlers: DuckDB and SQLite integrations with connection pooling
- Core Generators: Stateless NumPy-based generation methods
🧪 Testing
The library has comprehensive test coverage:
- 189 tests across all components
- 82% code coverage
- Unit tests: Core generation methods
- Integration tests: File writers, database handlers
- API tests: Public interface validation
Run tests:
```shell
# All tests
pytest

# With coverage
pytest --cov=rand_engine --cov-report=html

# Specific module
pytest tests/integrations/
```
📖 Documentation
Method Reference
All 13 generation methods are documented in the validator:
```python
from rand_engine.validators.spec_validator import SpecValidator

# See all available methods and their parameters
print(SpecValidator.METHOD_SPECS.keys())
# dict_keys(['integers', 'int_zfilled', 'floats', 'floats_normal', 'booleans',
#            'distincts', 'distincts_prop', 'distincts_map', 'distincts_map_prop',
#            'distincts_multi_map', 'complex_distincts', 'unix_timestamps', 'unique_ids'])
```
Getting Help
Enable validation for helpful error messages:
```python
spec = {
    "age": {
        "method": "unknown_method"  # Typo!
    }
}

try:
    DataGenerator(spec, validate=True)
except Exception as e:
    print(e)  # Shows correct method names and examples
```
🤝 Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Write tests for your changes
- Ensure all tests pass (`pytest`)
- Submit a pull request
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🌟 Acknowledgments
- Built with NumPy and Pandas
- Inspired by modern data engineering practices
- Community feedback and contributions
📞 Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: marco.a.menezes@gmail.com
🗺️ Roadmap
- PostgreSQL integration
- MySQL/MariaDB support
- Apache Arrow format support
- Distributed generation with Dask
- Web UI for spec building
- More pre-built templates
Made with ❤️ for the data community
File details
Details for the file rand_engine-0.6.0rc1.tar.gz.
File metadata
- Size: 28.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c68187ad184a302175d2c88e1f9ff9b8c971ad2cf76dd3bcc6d76935bce2315a |
| MD5 | 80ceb71ef3be7a78a3a1957709b8ad73 |
| BLAKE2b-256 | f601558ee8fe1faa54c833e37df3765fb725c2c78bf0405f45f8584f939c9975 |
Provenance
The following attestation bundle was made for rand_engine-0.6.0rc1.tar.gz:
- Publisher: auto_tag_publish_development.yml on marcoaureliomenezes/rand_engine
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rand_engine-0.6.0rc1.tar.gz
- Subject digest: c68187ad184a302175d2c88e1f9ff9b8c971ad2cf76dd3bcc6d76935bce2315a
- Sigstore transparency entry: 621760746
- Permalink: marcoaureliomenezes/rand_engine@76135c8d9426fce10907853fd8c2a3c7e23c4037
- Branch / Tag: refs/heads/development
- Owner: https://github.com/marcoaureliomenezes
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: auto_tag_publish_development.yml@76135c8d9426fce10907853fd8c2a3c7e23c4037
- Trigger Event: pull_request
File details
Details for the file rand_engine-0.6.0rc1-py3-none-any.whl.
File metadata
- Size: 34.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 99a32fe381c9956dcaec338d9ce951a6bd5f673a1ad2731b1a1ab251d4ba3b82 |
| MD5 | 50dc485f89f5bbb053670fd5d3cc5022 |
| BLAKE2b-256 | a2669542acb2576801e6c21c801be03d65a6342af83c761d407e768389b7f343 |
Provenance
The following attestation bundle was made for rand_engine-0.6.0rc1-py3-none-any.whl:
- Publisher: auto_tag_publish_development.yml on marcoaureliomenezes/rand_engine
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rand_engine-0.6.0rc1-py3-none-any.whl
- Subject digest: 99a32fe381c9956dcaec338d9ce951a6bd5f673a1ad2731b1a1ab251d4ba3b82
- Sigstore transparency entry: 621760748
- Permalink: marcoaureliomenezes/rand_engine@76135c8d9426fce10907853fd8c2a3c7e23c4037
- Branch / Tag: refs/heads/development
- Owner: https://github.com/marcoaureliomenezes
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: auto_tag_publish_development.yml@76135c8d9426fce10907853fd8c2a3c7e23c4037
- Trigger Event: pull_request