Rand Engine v2: a package for generating random data in different formats. Useful for mocking data while testing or developing.

These details have been verified by PyPI

Project links

Repository

GitHub Statistics

Maintainers

marco_menezes

These details have not been verified by PyPI

Project description

Rand Engine

High-performance synthetic data generation for testing, development, and prototyping.

A Python library for generating millions of rows of realistic synthetic data through declarative specifications. Built on NumPy and Pandas for maximum performance.

📦 Installation

pip install rand-engine

✅ Requirements

Python: >= 3.10
numpy: >= 2.1.1
pandas: >= 2.2.2
faker: >= 28.4.1 (optional, for realistic data)
duckdb: >= 1.1.0 (optional, for database integrations)
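
The minimum versions above can be checked before installing or running anything. The following is a minimal sketch using only the standard library; the helper names `parse_version` and `check_environment` are hypothetical and not part of rand-engine, and the version parsing is deliberately simplistic (it keeps only the leading digits of each component).

```python
import re
import sys
from importlib import metadata

# Minimums copied from the requirements list above (required packages only).
MINIMUMS = {"numpy": (2, 1, 1), "pandas": (2, 2, 2)}


def parse_version(text):
    """Turn '2.1.1' or '2.2.0rc1' into a 3-tuple of ints such as (2, 1, 1)."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_environment():
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python >= 3.10 is required")
    for name, minimum in MINIMUMS.items():
        try:
            installed = parse_version(metadata.version(name))
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        if installed < minimum:
            problems.append(f"{name} is older than {'.'.join(map(str, minimum))}")
    return problems


for problem in check_environment():
    print(problem)
```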

🎯 Who Is This For?

Data Engineers: Test ETL/ELT pipelines without production data dependencies
QA Engineers: Generate realistic datasets for load and integration testing
Data Scientists: Mock data during model development and validation
Backend Developers: Populate development and staging environments
BI Professionals: Create demos and POCs without exposing sensitive data

🚀 Quick Start

1. Simple Data Generation

from rand_engine import DataGenerator

# Declarative specification
spec = {
    "user_id": {
        "method": "unique_ids",
        "kwargs": {"strategy": "zint", "length": 8}
    },
    "age": {
        "method": "integers",
        "kwargs": {"min": 18, "max": 65}
    },
    "salary": {
        "method": "floats",
        "kwargs": {"min": 30000.0, "max": 150000.0, "round": 2}
    },
    "is_active": {
        "method": "booleans",
        "kwargs": {"true_prob": 0.8}
    },
    "plan": {
        "method": "distincts",
        "kwargs": {"distincts": ["free", "basic", "premium", "enterprise"]}
    }
}

# Generate DataFrame
generator = DataGenerator(spec, seed=42)
df = generator.size(10000).get_df()
print(df.head())

Output:

    user_id  age     salary  is_active        plan
0  00000001   42   87543.21       True     premium
1  00000002   28   45621.89       True        free
2  00000003   56  132041.50      False       basic
3  00000004   33   62789.12       True     premium
4  00000005   49   98234.77       True  enterprise
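
To make the idea of a declarative specification concrete, here is a minimal sketch of how such a spec could be interpreted: each `method` name is looked up in a dispatch table and called with the column's `kwargs`. This is not rand-engine's implementation (which is built on NumPy); the functions `gen_integers`, `gen_distincts`, `gen_booleans`, and `generate` are hypothetical, use only the standard library, and assume that `max` is inclusive.

```python
import random


# Parameter names mirror the spec keys, so "min" and "max" shadow the builtins here.
def gen_integers(rng, size, min, max):
    return [rng.randint(min, max) for _ in range(size)]  # inclusive bounds (assumption)


def gen_distincts(rng, size, distincts):
    return [rng.choice(distincts) for _ in range(size)]


def gen_booleans(rng, size, true_prob):
    return [rng.random() < true_prob for _ in range(size)]


# Dispatch table: method name in the spec -> generator function.
METHODS = {
    "integers": gen_integers,
    "distincts": gen_distincts,
    "booleans": gen_booleans,
}


def generate(spec, size, seed=None):
    """Build one list of values per column described in the spec."""
    rng = random.Random(seed)
    columns = {}
    for name, column_spec in spec.items():
        method = METHODS[column_spec["method"]]
        columns[name] = method(rng, size, **column_spec.get("kwargs", {}))
    return columns


demo_spec = {
    "age": {"method": "integers", "kwargs": {"min": 18, "max": 65}},
    "is_active": {"method": "booleans", "kwargs": {"true_prob": 0.8}},
    "plan": {"method": "distincts", "kwargs": {"distincts": ["free", "basic"]}},
}
columns = generate(demo_spec, size=5, seed=42)
print(columns)
```

The spec stays plain data, so it can be stored, versioned, or validated separately from the code that interprets it.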

2. Export to Multiple Formats

# CSV with gzip compression
generator.write.size(100000).format("csv").option("compression", "gzip").save("users.csv")

# Parquet with snappy compression
generator.write.size(1000000).format("parquet").option("compression", "snappy").save("users.parquet")

# JSON
generator.write.size(50000).format("json").save("users.json")

3. Streaming Data Generation

# Generate continuous stream of records
stream = generator.stream_dict(min_throughput=5, max_throughput=15)

for record in stream:
    # Each record includes automatic timestamp_created field
    print(record)
    # Send to Kafka, API, database, etc.
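
For intuition, a throttled record stream can be sketched as a plain Python generator. The sketch below is not the library's code: `stream_records` is a hypothetical helper, and it assumes the throughput bounds mean records per second, which the text above does not state. The `sleep` parameter is injectable so the pacing can be tested without waiting.

```python
import random
import time


def stream_records(make_record, min_throughput, max_throughput,
                   limit=None, sleep=time.sleep, rng=None):
    """Yield records forever (or up to `limit`), pausing between records.

    The pause is 1 / rate, with rate drawn uniformly between the two bounds
    (assumed to be records per second).
    """
    rng = rng or random.Random()
    emitted = 0
    while limit is None or emitted < limit:
        record = make_record()
        record["timestamp_created"] = time.time()  # stamp each record on creation
        yield record
        emitted += 1
        rate = rng.uniform(min_throughput, max_throughput)
        sleep(1.0 / rate)


# Usage: collect the pauses instead of really sleeping.
pauses = []
records = list(stream_records(lambda: {"value": 1}, 5, 15, limit=3, sleep=pauses.append))
print(len(records), pauses)
```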

4. Reproducible Data with Seeds

# Same seed = identical data
df1 = DataGenerator(spec, seed=42).size(1000).get_df()
df2 = DataGenerator(spec, seed=42).size(1000).get_df()

assert df1.equals(df2)  # True
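
Seeded reproducibility generally relies on each generator owning its random state rather than sharing a global one. The sketch below illustrates the principle with the standard library only; `sample_ages` is a hypothetical helper, and how rand-engine manages its NumPy random state internally is not described here.

```python
import random


def sample_ages(size, seed):
    rng = random.Random(seed)  # private generator, independent of the global one
    return [rng.randint(18, 65) for _ in range(size)]


first = sample_ages(1000, seed=42)
random.random()  # unrelated code touching the global generator in between
second = sample_ages(1000, seed=42)

assert first == second  # same seed, same data, regardless of global state
```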

📚 Available Generation Methods

Core Methods

Method	Description	Example
integers	Random integers within range	`{"method": "integers", "kwargs": {"min": 0, "max": 100}}`
int_zfilled	Zero-padded numeric strings	`{"method": "int_zfilled", "kwargs": {"length": 8}}`
floats	Random floats with precision	`{"method": "floats", "kwargs": {"min": 0.0, "max": 100.0, "round": 2}}`
floats_normal	Normally distributed floats	`{"method": "floats_normal", "kwargs": {"mean": 50, "std": 10, "round": 2}}`
booleans	Boolean values with probability	`{"method": "booleans", "kwargs": {"true_prob": 0.7}}`
distincts	Random selection from list	`{"method": "distincts", "kwargs": {"distincts": ["A", "B", "C"]}}`
distincts_prop	Weighted random selection	`{"method": "distincts_prop", "kwargs": {"distincts": {"mobile": 70, "desktop": 30}}}`
unix_timestamps	Unix timestamps in range	`{"method": "unix_timestamps", "kwargs": {"start": "01-01-2020", "end": "31-12-2023", "format": "%d-%m-%Y"}}`
unique_ids	Unique identifiers	`{"method": "unique_ids", "kwargs": {"strategy": "zint", "length": 10}}`

Advanced Methods

Method	Description	Use Case
distincts_map	Correlated 2-column pairs	Device → OS mapping
distincts_map_prop	Weighted correlated pairs	Product → Status with weights
distincts_multi_map	N-column Cartesian products	Company → Sector → Size
complex_distincts	Pattern-based generation	IP addresses, URLs, codes

Advanced Features

1. Correlated Columns (2-Column Mapping)

Generate correlated data where one column determines another:

spec = {
    "device_os": {
        "method": "distincts_map",
        "cols": ["device_type", "os"],
        "kwargs": {
            "distincts": {
                "smartphone": ["Android", "iOS"],
                "tablet": ["Android", "iOS", "iPadOS"],
                "desktop": ["Windows", "macOS", "Linux"]
            }
        }
    }
}

df = DataGenerator(spec).size(1000).get_df()
# Result: 2 columns (device_type, os) with valid combinations
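
One way to picture correlated pairs is a two-step draw: pick a key, then pick a value from that key's list. The sketch below is an assumption about the general technique, not the library's algorithm; in particular, it draws keys uniformly, whereas the library might draw uniformly over all (key, value) pairs. `sample_pairs` is a hypothetical name.

```python
import random

DEVICE_OS = {
    "smartphone": ["Android", "iOS"],
    "tablet": ["Android", "iOS", "iPadOS"],
    "desktop": ["Windows", "macOS", "Linux"],
}


def sample_pairs(mapping, size, seed=None):
    """Return (key, value) tuples where value always belongs to key's list."""
    rng = random.Random(seed)
    keys = list(mapping)
    pairs = []
    for _ in range(size):
        key = rng.choice(keys)            # step 1: first column
        value = rng.choice(mapping[key])  # step 2: second column, constrained by key
        pairs.append((key, value))
    return pairs


pairs = sample_pairs(DEVICE_OS, size=1000, seed=0)
print(pairs[:3])
```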

2. Weighted Correlated Data

spec = {
    "product_status": {
        "method": "distincts_map_prop",
        "cols": ["product", "status"],
        "kwargs": {
            "distincts": {
                "laptop": [("new", 90), ("refurbished", 10)],
                "phone": [("new", 95), ("refurbished", 5)],
                "tablet": [("new", 85), ("refurbished", 15)]
            }
        }
    }
}

df = DataGenerator(spec).size(10000).get_df()
# About 90% of laptops will be "new" and about 10% "refurbished"
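
Weights such as 90 and 10 are relative frequencies, so observed proportions only approximate them for any finite sample. A minimal standard-library sketch of weighted selection follows; `sample_weighted` is a hypothetical helper, not a rand-engine function.

```python
import random
from collections import Counter


def sample_weighted(options, size, seed=None):
    """options is a list of (value, weight) tuples, as in the spec above."""
    rng = random.Random(seed)
    values = [value for value, _ in options]
    weights = [weight for _, weight in options]
    return rng.choices(values, weights=weights, k=size)


statuses = sample_weighted([("new", 90), ("refurbished", 10)], size=10000, seed=1)
counts = Counter(statuses)
share_new = counts["new"] / len(statuses)
print(f"share of 'new': {share_new:.3f}")  # close to 0.90, not exactly 0.90
```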

3. Complex Patterns (IP Addresses, URLs)

spec = {
    "ip_address": {
        "method": "complex_distincts",
        "kwargs": {
            "pattern": "x.x.x.x",
            "replacement": "x",
            "templates": [
                {"method": "distincts", "kwargs": {"distincts": ["192", "10", "172"]}},
                {"method": "integers", "kwargs": {"min": 0, "max": 255}},
                {"method": "integers", "kwargs": {"min": 0, "max": 255}},
                {"method": "integers", "kwargs": {"min": 1, "max": 254}}
            ]
        }
    }
}

df = DataGenerator(spec).size(100).get_df()
# Output: 192.168.1.45, 10.0.52.231, 172.24.133.89, etc.
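
The pattern mechanism can be read as "replace the n-th placeholder with a value from the n-th template". The sketch below shows that substitution with the standard library; `fill_pattern` and the lambda-based part generators are hypothetical stand-ins for the library's nested method specs.

```python
import random


def fill_pattern(pattern, placeholder, part_generators, rng):
    """Replace each placeholder occurrence, in order, with a generated part."""
    pieces = pattern.split(placeholder)
    if len(pieces) - 1 != len(part_generators):
        raise ValueError("one generator is needed per placeholder")
    result = [pieces[0]]
    for make_part, tail in zip(part_generators, pieces[1:]):
        result.append(str(make_part(rng)))
        result.append(tail)
    return "".join(result)


parts = [
    lambda r: r.choice(["192", "10", "172"]),
    lambda r: r.randint(0, 255),
    lambda r: r.randint(0, 255),
    lambda r: r.randint(1, 254),
]
rng = random.Random(7)
ip = fill_pattern("x.x.x.x", "x", parts, rng)
print(ip)
```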

4. Data Transformers

Apply transformations to generated data:

from datetime import datetime

spec = {
    "timestamp": {
        "method": "unix_timestamps",
        "kwargs": {"start": "01-01-2023", "end": "31-12-2023", "format": "%d-%m-%Y"},
        # Column-level transformer
        "transformers": [
            lambda ts: datetime.fromtimestamp(ts).strftime("%Y-%m-%d %H:%M:%S")
        ]
    },
    "value": {
        "method": "integers",
        "kwargs": {"min": 100, "max": 1000}
    }
}

# DataFrame-level transformer
def add_year_column(df):
    df['year'] = df['timestamp'].str[:4]
    return df

df = (DataGenerator(spec)
    .transformers([add_year_column])
    .size(1000)
    .get_df())
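
One detail worth knowing about the column-level transformer above: `datetime.fromtimestamp(ts)` without a `tz` argument converts to the machine's local time zone, so the same seed can yield different strings on machines in different zones. A sketch of a zone-independent variant follows; `format_utc` is a hypothetical name, and whether UTC is the right zone depends on the use case.

```python
from datetime import datetime, timezone


def format_utc(ts):
    """Format a Unix timestamp as 'YYYY-MM-DD HH:MM:SS' in UTC."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


print(format_utc(0))           # 1970-01-01 00:00:00
print(format_utc(1704067200))  # 2024-01-01 00:00:00
```

A function like this could be passed in the `transformers` list in place of the lambda.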

5. Spec Validation

Enable validation to catch errors early:

invalid_spec = {
    "age": {
        "method": "integers"  # Missing required "min" and "max"
    }
}

try:
    generator = DataGenerator(invalid_spec, validate=True)
except Exception as e:
    print(e)
    # ❌ Column 'age': Missing required parameter 'min'
    #    Correct example:
    #    {
    #        "age": {
    #            "method": "integers",
    #            "kwargs": {"min": 18, "max": 65}
    #        }
    #    }
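
The kind of check that produces such a message can be sketched in a few lines. This is an illustration only, not the library's `SpecValidator`: the `REQUIRED_KWARGS` table is a hypothetical subset, and the real required parameters may differ.

```python
# Hypothetical subset of required parameters per method (assumption).
REQUIRED_KWARGS = {
    "integers": ("min", "max"),
    "distincts": ("distincts",),
}


def validate_spec(spec):
    """Return a list of error strings; an empty list means the spec passed."""
    errors = []
    for column, column_spec in spec.items():
        method = column_spec.get("method")
        if method not in REQUIRED_KWARGS:
            errors.append(f"Column '{column}': unknown method {method!r}")
            continue
        kwargs = column_spec.get("kwargs", {})
        for param in REQUIRED_KWARGS[method]:
            if param not in kwargs:
                errors.append(f"Column '{column}': Missing required parameter '{param}'")
    return errors


errors = validate_spec({"age": {"method": "integers"}})
for error in errors:
    print(error)
```

Collecting all errors before reporting lets a user fix a spec in one pass instead of one failure at a time.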

🎨 Real-World Examples

E-commerce Orders

spec = {
    "order_id": {
        "method": "unique_ids",
        "kwargs": {"strategy": "zint", "length": 10}
    },
    "customer_id": {
        "method": "integers",
        "kwargs": {"min": 1000, "max": 50000}
    },
    "product_category": {
        "method": "distincts_prop",
        "kwargs": {
            "distincts": {
                "electronics": 40,
                "clothing": 30,
                "home": 20,
                "sports": 10
            }
        }
    },
    "amount": {
        "method": "floats",
        "kwargs": {"min": 10.0, "max": 5000.0, "round": 2}
    },
    "payment_status": {
        "method": "distincts_prop",
        "kwargs": {
            "distincts": {
                "paid": 85,
                "pending": 10,
                "failed": 5
            }
        }
    },
    "created_at": {
        "method": "unix_timestamps",
        "kwargs": {"start": "01-01-2024", "end": "31-12-2024", "format": "%d-%m-%Y"}
    }
}

# Generate 1 million orders
orders = DataGenerator(spec, seed=42).size(1000000).get_df()
orders.to_parquet("orders.parquet", compression="snappy")
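
The `start`, `end`, and `format` values in the `created_at` column describe date bounds that must become epoch seconds at some point. The sketch below shows one such conversion, assuming the dates are interpreted as UTC midnight; the library may interpret them in local time instead. `to_epoch_seconds` is a hypothetical helper.

```python
from datetime import datetime, timezone


def to_epoch_seconds(text, fmt):
    """Parse a date string and return Unix seconds, treating it as UTC."""
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


start = to_epoch_seconds("01-01-2024", "%d-%m-%Y")
end = to_epoch_seconds("31-12-2024", "%d-%m-%Y")
print(start, end)  # 1704067200 1735603200
```

Under this interpretation the end bound is midnight at the start of 31 December, so timestamps later on that day would fall outside the range.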

IoT Sensor Data

spec = {
    "sensor_id": {
        "method": "distincts",
        "kwargs": {"distincts": [f"SENSOR_{i:03d}" for i in range(1, 101)]}
    },
    "temperature": {
        "method": "floats_normal",
        "kwargs": {"mean": 22.0, "std": 3.5, "round": 2}
    },
    "humidity": {
        "method": "floats_normal",
        "kwargs": {"mean": 60.0, "std": 10.0, "round": 1}
    },
    "battery_level": {
        "method": "integers",
        "kwargs": {"min": 0, "max": 100}
    },
    "status": {
        "method": "distincts_prop",
        "kwargs": {
            "distincts": {
                "active": 95,
                "warning": 4,
                "error": 1
            }
        }
    }
}

# Stream sensor readings
stream = DataGenerator(spec).stream_dict(min_throughput=10, max_throughput=50)

for reading in stream:
    # Send to time-series database
    print(f"Sensor {reading['sensor_id']}: {reading['temperature']}°C")
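
A normal distribution is unbounded, so a column such as `humidity` (mean 60, std 10) can in principle produce values outside a physically meaningful range. Whether `floats_normal` clips values is not stated above; if it does not, a clamping step like the following sketch could be applied as a transformer. `clamp` and `sample_humidity` are hypothetical helpers using the standard library.

```python
import random


def clamp(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))


def sample_humidity(size, mean=60.0, std=10.0, seed=None):
    rng = random.Random(seed)
    return [round(clamp(rng.gauss(mean, std), 0.0, 100.0), 1) for _ in range(size)]


readings = sample_humidity(10000, seed=3)
print(min(readings), max(readings))
```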

User Behavior Logs

spec = {
    "session_id": {
        "method": "unique_ids",
        "kwargs": {"strategy": "zint", "length": 12}
    },
    "device_os": {
        "method": "distincts_map",
        "cols": ["device", "os"],
        "kwargs": {
            "distincts": {
                "mobile": ["Android", "iOS"],
                "tablet": ["Android", "iOS"],
                "desktop": ["Windows", "macOS", "Linux"]
            }
        }
    },
    "page_views": {
        "method": "integers",
        "kwargs": {"min": 1, "max": 50}
    },
    "duration_seconds": {
        "method": "integers",
        "kwargs": {"min": 10, "max": 3600}
    },
    "converted": {
        "method": "booleans",
        "kwargs": {"true_prob": 0.03}  # 3% conversion rate
    }
}

logs = DataGenerator(spec, seed=123).size(500000).get_df()

🗂️ File Export Options

Batch Writing

from rand_engine import DataGenerator

spec = {...}  # Your spec here

# CSV with compression
(DataGenerator(spec)
    .write
    .size(100000)
    .format("csv")
    .option("compression", "gzip")
    .option("index", False)
    .mode("overwrite")
    .save("output/data.csv"))

# Parquet with multiple files
(DataGenerator(spec)
    .write
    .size(5000000)
    .format("parquet")
    .option("compression", "snappy")
    .option("numFiles", 10)  # Split into 10 files
    .save("output/data.parquet"))

# JSON with pretty print
(DataGenerator(spec)
    .write
    .size(10000)
    .format("json")
    .option("indent", 2)
    .save("output/data.json"))
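
Splitting a row count across several files comes down to simple integer arithmetic. The sketch below shows one even split; how the `numFiles` option actually distributes rows is not documented here, so treat `split_sizes` as a hypothetical illustration.

```python
def split_sizes(total_rows, num_files):
    """Distribute total_rows over num_files so sizes differ by at most one."""
    base, extra = divmod(total_rows, num_files)
    return [base + 1 if index < extra else base for index in range(num_files)]


print(split_sizes(5_000_000, 10))  # ten files of 500000 rows each
print(split_sizes(10, 3))          # [4, 3, 3]
```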

Streaming Writing

# Write data in micro-batches
(DataGenerator(spec)
    .writeStream
    .microbatch_size(1000)
    .max_microbatches(100)
    .format("csv")
    .option("compression", "gzip")
    .save("output/stream/"))
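
Micro-batch writing means emitting many small files rather than one large one. The sketch below writes gzip-compressed CSV batches with the standard library only; `write_microbatches` and the `part-00000.csv.gz` naming scheme are hypothetical and do not reflect the files rand-engine produces.

```python
import csv
import gzip
import itertools
import os
import tempfile


def write_microbatches(out_dir, make_row, header, microbatch_size, max_microbatches):
    """Write max_microbatches gzip CSV files of microbatch_size rows each."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for batch in range(max_microbatches):
        path = os.path.join(out_dir, f"part-{batch:05d}.csv.gz")  # hypothetical naming
        with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            for _ in range(microbatch_size):
                writer.writerow(make_row())
        paths.append(path)
    return paths


# Usage: two batches of three rows each, written to a temporary directory.
counter = itertools.count(1)
with tempfile.TemporaryDirectory() as tmp:
    written = write_microbatches(tmp, lambda: [next(counter), "x"], ["id", "value"], 3, 2)
    file_names = [os.path.basename(path) for path in written]
    with gzip.open(written[0], "rt", newline="", encoding="utf-8") as handle:
        first_batch = list(csv.reader(handle))

print(file_names)
print(first_batch)
```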

🔌 Database Integrations

DuckDB Integration

from rand_engine import DataGenerator
from rand_engine.integrations._duckdb_handler import DuckDBHandler

# Generate and insert data
spec = {...}
df = DataGenerator(spec).size(100000).get_df()

# Create handler (in-memory or file-based)
handler = DuckDBHandler(":memory:")  # or DuckDBHandler("mydb.duckdb")

# Create table
handler.create_table("users", "user_id VARCHAR(10)")

# Insert data
handler.insert_df("users", df, pk_cols=["user_id"])

# Query data
result = handler.select_all("users")
print(result.head())

# Cleanup
handler.close()

SQLite Integration

from rand_engine.integrations._sqlite_handler import SQLiteHandler

handler = SQLiteHandler("test.db")
handler.create_table("events", "event_id VARCHAR(10)")
# df is a DataFrame generated beforehand that contains an event_id column
handler.insert_df("events", df, pk_cols=["event_id"])

# Query with column selection
result = handler.select_all("events", columns=["event_id", "timestamp"])

handler.close()
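
For readers who want to see what a primary-key-aware insert looks like without the handler, here is a sketch using Python's built-in `sqlite3` module directly. It is not the handler's implementation: `insert_unique` is a hypothetical helper, and ignoring duplicate keys is only one possible policy (how `pk_cols` treats duplicates is not described above).

```python
import sqlite3


def insert_unique(conn, event_ids):
    """Insert ids, silently skipping any that already exist (assumed policy)."""
    conn.executemany(
        "INSERT OR IGNORE INTO events (event_id) VALUES (?)",
        [(event_id,) for event_id in event_ids],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id VARCHAR(10) PRIMARY KEY)")

ids = [f"{number:010d}" for number in range(1, 6)]
insert_unique(conn, ids)
insert_unique(conn, ids)  # second pass adds nothing: keys already present

row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
first_id = conn.execute("SELECT MIN(event_id) FROM events").fetchone()[0]
conn.close()
print(row_count, first_id)
```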

🏗️ Architecture

Design Principles

Declarative Specifications: Define what you want, not how to generate it
High Performance: Built on NumPy for vectorized operations
Type Safety: Full type hints and validation
Composability: Chain methods for fluent API
Extensibility: Easy to add custom generators and transformers

Public API

The library exposes a single entry point:

from rand_engine import DataGenerator

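All internal modules (prefixed with _) are implementation details and may change. This includes the database handlers shown above, which are imported from underscore-prefixed modules.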

Key Components

DataGenerator: Main class for data generation
SpecValidator: Educational validator with helpful error messages
File Writers: Batch and stream writers for multiple formats
Database Handlers: DuckDB and SQLite integrations with connection pooling
Core Generators: Stateless NumPy-based generation methods

🧪 Testing

The library has comprehensive test coverage:

189 tests across all components
82% code coverage
Unit tests: Core generation methods
Integration tests: File writers, database handlers
API tests: Public interface validation

Run tests:

# All tests
pytest

# With coverage
pytest --cov=rand_engine --cov-report=html

# Specific module
pytest tests/integrations/

📖 Documentation

Method Reference

All 13 generation methods are documented in the validator:

from rand_engine.validators.spec_validator import SpecValidator

# See all available methods and their parameters
print(SpecValidator.METHOD_SPECS.keys())
# dict_keys(['integers', 'int_zfilled', 'floats', 'floats_normal', 'booleans', 
#            'distincts', 'distincts_prop', 'distincts_map', 'distincts_map_prop',
#            'distincts_multi_map', 'complex_distincts', 'unix_timestamps', 'unique_ids'])

Getting Help

Enable validation for helpful error messages:

spec = {
    "age": {
        "method": "unknown_method"  # Typo!
    }
}

try:
    DataGenerator(spec, validate=True)
except Exception as e:
    print(e)
    # Shows correct method names and examples

🤝 Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Write tests for your changes
Ensure all tests pass (pytest)
Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🌟 Acknowledgments

Built with NumPy and Pandas
Inspired by modern data engineering practices
Community feedback and contributions

📞 Support

Issues: GitHub Issues
Discussions: GitHub Discussions
Email: marco.a.menezes@gmail.com

🗺️ Roadmap

PostgreSQL integration
MySQL/MariaDB support
Apache Arrow format support
Distributed generation with Dask
Web UI for spec building
More pre-built templates

Made with ❤️ for the data community


Release history

Version	Released	Notes
0.6.4rc1	Nov 5, 2025	pre-release
0.6.3	Nov 1, 2025	
0.6.3rc3	Nov 1, 2025	pre-release
0.6.3rc2	Nov 1, 2025	pre-release
0.6.3rc1	Oct 30, 2025	pre-release
0.6.2	Oct 26, 2025	
0.6.2rc1	Oct 26, 2025	pre-release
0.6.1	Oct 24, 2025	
0.6.1rc4	Oct 24, 2025	pre-release
0.6.1rc3	Oct 24, 2025	pre-release
0.6.1rc2	Oct 23, 2025	pre-release
0.6.1rc1	Oct 22, 2025	pre-release
0.6.0	Oct 19, 2025	
0.6.0rc2	Oct 19, 2025	pre-release
0.6.0rc1	Oct 19, 2025	pre-release (this version)
0.5.5	Oct 17, 2025	
0.5.5rc2	Oct 17, 2025	pre-release
0.5.5rc1	Oct 17, 2025	pre-release
0.5.4rc1	Oct 13, 2025	pre-release
0.5.3	Oct 12, 2025	
0.5.2rc1	Oct 12, 2025	pre-release
0.5.1rc1	Oct 11, 2025	pre-release
0.4.7	Oct 11, 2025	
0.4.5	Sep 23, 2025	
0.4.4	Sep 23, 2025	
0.4.3	Sep 23, 2025	
0.4.2	Sep 23, 2025	
0.4.1	Sep 23, 2025	
0.4.0	Sep 23, 2025	
0.3.14	Sep 18, 2025	
0.3.13	Sep 18, 2025	
0.3.12	Sep 18, 2025	
0.3.11	Sep 18, 2025	
0.3.9	Sep 18, 2025	
0.3.8	Sep 18, 2025	
0.3.7	Sep 9, 2025	
0.3.5	Feb 2, 2025	
0.3.3	Dec 1, 2024	
0.2.0	Oct 27, 2024	
0.1.1	Sep 24, 2024	
0.0.3	Jun 23, 2022	
0.0.2	Apr 11, 2022	

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rand_engine-0.6.0rc1.tar.gz (28.1 kB)

Uploaded Oct 19, 2025 Source

Built Distribution

rand_engine-0.6.0rc1-py3-none-any.whl (34.0 kB)

Uploaded Oct 19, 2025 Python 3

File details

Details for the file rand_engine-0.6.0rc1.tar.gz.

File metadata

Download URL: rand_engine-0.6.0rc1.tar.gz
Upload date: Oct 19, 2025
Size: 28.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for rand_engine-0.6.0rc1.tar.gz
Algorithm	Hash digest
SHA256	`c68187ad184a302175d2c88e1f9ff9b8c971ad2cf76dd3bcc6d76935bce2315a`
MD5	`80ceb71ef3be7a78a3a1957709b8ad73`
BLAKE2b-256	`f601558ee8fe1faa54c833e37df3765fb725c2c78bf0405f45f8584f939c9975`
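
A downloaded file can be checked against the SHA256 digest in the table above with Python's `hashlib`. The sketch below is a generic verification routine; `sha256_of_file` is a hypothetical helper, and the demo hashes a small temporary file rather than the real archive so that it runs anywhere.

```python
import hashlib
import os
import tempfile

# Digest published above for rand_engine-0.6.0rc1.tar.gz.
PUBLISHED_SDIST_SHA256 = "c68187ad184a302175d2c88e1f9ff9b8c971ad2cf76dd3bcc6d76935bce2315a"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file; for the real archive, compare
# sha256_of_file("rand_engine-0.6.0rc1.tar.gz") with PUBLISHED_SDIST_SHA256.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.bin")
    with open(sample_path, "wb") as handle:
        handle.write(b"abc")
    sample_digest = sha256_of_file(sample_path)

print(sample_digest)
```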

See more details on using hashes here.

Provenance

The following attestation bundles were made for rand_engine-0.6.0rc1.tar.gz:

Publisher: auto_tag_publish_development.yml on marcoaureliomenezes/rand_engine

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rand_engine-0.6.0rc1.tar.gz
- Subject digest: c68187ad184a302175d2c88e1f9ff9b8c971ad2cf76dd3bcc6d76935bce2315a
- Sigstore transparency entry: 621760746
- Sigstore integration time: Oct 19, 2025
Source repository:
- Permalink: marcoaureliomenezes/rand_engine@76135c8d9426fce10907853fd8c2a3c7e23c4037
- Branch / Tag: refs/heads/development
- Owner: https://github.com/marcoaureliomenezes
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: auto_tag_publish_development.yml@76135c8d9426fce10907853fd8c2a3c7e23c4037
- Trigger Event: pull_request

File details

Details for the file rand_engine-0.6.0rc1-py3-none-any.whl.

File metadata

Download URL: rand_engine-0.6.0rc1-py3-none-any.whl
Upload date: Oct 19, 2025
Size: 34.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for rand_engine-0.6.0rc1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`99a32fe381c9956dcaec338d9ce951a6bd5f673a1ad2731b1a1ab251d4ba3b82`
MD5	`50dc485f89f5bbb053670fd5d3cc5022`
BLAKE2b-256	`a2669542acb2576801e6c21c801be03d65a6342af83c761d407e768389b7f343`

See more details on using hashes here.

Provenance

The following attestation bundles were made for rand_engine-0.6.0rc1-py3-none-any.whl:

Publisher: auto_tag_publish_development.yml on marcoaureliomenezes/rand_engine

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rand_engine-0.6.0rc1-py3-none-any.whl
- Subject digest: 99a32fe381c9956dcaec338d9ce951a6bd5f673a1ad2731b1a1ab251d4ba3b82
- Sigstore transparency entry: 621760748
- Sigstore integration time: Oct 19, 2025
Source repository:
- Permalink: marcoaureliomenezes/rand_engine@76135c8d9426fce10907853fd8c2a3c7e23c4037
- Branch / Tag: refs/heads/development
- Owner: https://github.com/marcoaureliomenezes
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: auto_tag_publish_development.yml@76135c8d9426fce10907853fd8c2a3c7e23c4037
- Trigger Event: pull_request

rand-engine 0.6.0rc1
