dgen-py

The world's fastest Python random data generation - with NUMA optimization and zero-copy interface

Features

  • 🚀 Blazing Fast: 10 GB/s per core, up to 300 GB/s verified
  • Ultra-Fast Allocation: create_bytearrays() pre-allocates buffers 1,280x faster than a Python list comprehension (NEW in v0.2.0)
  • 🎯 Controllable Characteristics: Configurable deduplication and compression ratios
  • 🔄 Reproducible Data: Seed parameter for identical data generation (v0.1.6) with dynamic reseeding (v0.1.7)
  • 🔬 Multi-Process NUMA: One Python process per NUMA node for maximum throughput
  • 🐍 True Zero-Copy: Python buffer protocol with direct memory access (no data copying)
  • 📦 Streaming API: Generate terabytes of data with constant 32 MB memory usage
  • 🧵 Thread Pool Reuse: Created once, reused across all operations
  • 🛠️ Built with Rust: Memory-safe, production-quality implementation

Performance

Streaming Benchmark - 100 GB Test

Comparison of streaming random data generation methods on a 12-core system:

| Method | Throughput | Speedup vs Baseline | Memory Required |
|---|---|---|---|
| os.urandom() (baseline) | 0.34 GB/s | 1.0x | Minimal |
| NumPy Multi-Thread | 1.06 GB/s | 3.1x | 100 GB RAM* |
| Numba JIT Xoshiro256++ (streaming) | 57.11 GB/s | 165.7x | 32 MB RAM |
| dgen-py v0.1.5 (streaming) | 58.46 GB/s | 169.6x | 32 MB RAM |

* NumPy requires full dataset in memory (10 GB tested, would need 100 GB for 100 GB dataset)

Key Findings:

  • dgen-py matches Numba's streaming performance (58.46 vs 57.11 GB/s)
  • 55x faster than NumPy while using 3,000x less memory (32 MB vs 100 GB)
  • Streaming architecture: Can generate unlimited data with only 32 MB RAM
  • Per-core throughput: 4.87 GB/s (12 cores)
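The derived figures in these findings follow from the table by simple arithmetic. A quick check, using only the numbers reported in this section (the variable names are illustrative):

```python
# Figures reported in the streaming benchmark table above.
dgen_gbps = 58.46      # dgen-py v0.1.5, streaming
numpy_gbps = 1.06      # NumPy multi-thread
cores = 12

GIB = 1024**3
MIB = 1024**2

per_core = dgen_gbps / cores                 # GB/s per core
speedup_vs_numpy = dgen_gbps / numpy_gbps    # throughput ratio
memory_ratio = (100 * GIB) / (32 * MIB)      # 100 GB dataset vs 32 MB buffer

print(f"Per-core throughput: {per_core:.2f} GB/s")
print(f"Speedup vs NumPy:    {speedup_vs_numpy:.0f}x")
print(f"Memory ratio:        {memory_ratio:.0f}x")
```

The memory ratio works out to 3,200x, which the text rounds to 3,000x.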

⚠️ Critical for Storage Testing: ONLY dgen-py supports configurable deduplication and compression ratios. All other methods (os.urandom, NumPy, Numba) generate purely random data with maximum entropy, making them unsuitable for realistic storage system testing. Real-world storage workloads require controllable data characteristics to test deduplication engines, compression algorithms, and storage efficiency—capabilities unique to dgen-py.

Multi-NUMA Scalability - GCP Emerald Rapids

Scalability testing on Google Cloud Platform Intel Emerald Rapids systems (1024 GB workload, compress=1.0):

| Instance | Physical Cores | NUMA Nodes | Aggregate Throughput | Per-Core Throughput | Scaling Efficiency |
|---|---|---|---|---|---|
| C4-8 | 4 | 1 (UMA) | 36.26 GB/s | 9.07 GB/s | Baseline |
| C4-16 | 8 | 1 (UMA) | 86.41 GB/s | 10.80 GB/s | 119% |
| C4-32 | 16 | 1 (UMA) | 162.78 GB/s | 10.17 GB/s | 112% |
| C4-96 | 48 | 2 (NUMA) | 248.53 GB/s | 5.18 GB/s | 51%* |

* NUMA penalty: per-core throughput on the multi-socket C4-96 is 49% lower than on the single-socket C4-32 (5.18 vs 10.17 GB/s), but C4-96 still achieves the highest absolute throughput

Key Findings:

  • Excellent UMA scaling: 112-119% efficiency on single-NUMA systems (super-linear due to larger L3 cache)
  • Per-core performance: 10.80 GB/s on C4-16 (3.0x improvement vs dgen-py v0.1.3's 3.60 GB/s)
  • Compression tradeoff: compress=2.0 provides 1.3-1.5x speedup, but makes data compressible (choose based on your test requirements, not performance)
  • Storage headroom: Even modest 8-core systems exceed 86 GB/s (far beyond typical storage requirements)
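The per-core and scaling-efficiency columns can be recomputed from the core counts and aggregate throughput in the table. The sketch below uses only the reported figures; the helper name `efficiency` is illustrative. It also shows that the 51% entry for C4-96 corresponds to a comparison against C4-32, while against the C4-8 baseline the same instance comes out at about 57%.

```python
# (instance, physical cores, aggregate GB/s) as reported in the table above.
results = [
    ("C4-8", 4, 36.26),
    ("C4-16", 8, 86.41),
    ("C4-32", 16, 162.78),
    ("C4-96", 48, 248.53),
]

per_core = {name: agg / cores for name, cores, agg in results}

def efficiency(name, baseline):
    """Per-core throughput of `name` as a percentage of `baseline`'s."""
    return 100.0 * per_core[name] / per_core[baseline]

for name, cores, agg in results:
    print(f"{name:6s} {per_core[name]:6.2f} GB/s/core  "
          f"{efficiency(name, 'C4-8'):4.0f}% of C4-8")

print(f"C4-96 vs C4-32: {efficiency('C4-96', 'C4-32'):.0f}%")
```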

See docs/BENCHMARK_RESULTS_V0.1.5.md for complete analysis

Installation

From PyPI (Recommended)

pip install dgen-py

System Requirements

For NUMA support (Linux only):

# Ubuntu/Debian
sudo apt-get install libudev-dev libhwloc-dev

# RHEL/CentOS/Fedora
sudo yum install systemd-devel hwloc-devel

Note: NUMA support is optional. Without these libraries, the package works perfectly on single-NUMA systems (workstations, cloud VMs).

Quick Start

Version 0.2.0: Ultra-Fast Bulk Buffer Allocation 🎉

For scenarios where you need to pre-generate all data in memory before writing, use create_bytearrays(), which allocates buffers 1,280x faster than a Python list comprehension:

import dgen_py
import time

# Pre-generate 24 GB in 32 MB chunks 
total_size = 24 * 1024**3  # 24 GB
chunk_size = 32 * 1024**2  # 32 MB chunks
num_chunks = total_size // chunk_size  # 768 chunks

# ✅ FAST: Rust-optimized allocation (7-11 ms for 24 GB!)
start = time.perf_counter()
chunks = dgen_py.create_bytearrays(count=num_chunks, size=chunk_size)
alloc_time = time.perf_counter() - start
print(f"Allocation: {alloc_time*1000:.1f} ms @ {(total_size/(1024**3))/alloc_time:.0f} GB/s")

# Fill buffers with high-performance generation
gen = dgen_py.Generator(size=total_size, numa_mode="auto", max_threads=None)

start = time.perf_counter()
for buf in chunks:
    gen.fill_chunk(buf)
gen_time = time.perf_counter() - start
print(f"Generation: {gen_time:.2f}s @ {(total_size/(1024**3))/gen_time:.1f} GB/s")

# Now write to storage...
# for buf in chunks:
#     f.write(buf)

Performance (12-core system):

Allocation: 10.9 ms @ 2204 GB/s  # 1,280x faster than Python!
Generation: 1.59s @ 15.1 GB/s

Performance comparison:

| Method | Allocation Time (24 GB) | Speedup |
|---|---|---|
| Python [bytearray(size) for _ in ...] | 12-14 seconds | 1x (baseline) |
| dgen_py.create_bytearrays() | 7-11 ms | 1,280x faster |

When to use:

  • ✅ Pre-generation pattern (DLIO benchmark, batch data loading)
  • ✅ Need all data in RAM before writing
  • ❌ Streaming - use Generator.fill_chunk() with reusable buffer instead (see below)

Why it's fast:

  • Uses Python C API (PyByteArray_Resize) directly from Rust
  • For 32 MB chunks, glibc automatically uses mmap (≥128 KB threshold)
  • Zero-copy kernel page allocation, no heap fragmentation
  • Bypasses Python interpreter overhead
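To reproduce the Python list-comprehension baseline on your own machine, a minimal timing sketch follows. It uses only the standard library, and the sizes are deliberately small so it runs anywhere; `allocate_python` is an illustrative helper name, not part of dgen-py.

```python
import time

def allocate_python(count, size):
    """Baseline: allocate `count` zero-filled bytearrays with a list comprehension."""
    return [bytearray(size) for _ in range(count)]

# Deliberately small (64 x 1 MiB) so the sketch runs anywhere;
# the measurement quoted above used 768 chunks of 32 MB each.
count, size = 64, 1024**2

start = time.perf_counter()
chunks = allocate_python(count, size)
elapsed = time.perf_counter() - start

total_gib = count * size / 1024**3
print(f"Allocated {total_gib:.3f} GiB in {elapsed * 1000:.2f} ms")
```

Scaling `count` and `size` up to the 24 GB case gives a baseline to compare against the create_bytearrays() timing on the same hardware.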

Version 0.1.7: Dynamic Seed Changes

Dynamically change the random seed to reset the data stream or create alternating patterns without recreating the Generator:

import dgen_py

gen = dgen_py.Generator(size=100 * 1024**3, seed=1111)
buffer = bytearray(10 * 1024**2)

# Generate data with seed A
gen.set_seed(1111)
gen.fill_chunk(buffer)  # Pattern A

# Switch to seed B
gen.set_seed(2222)
gen.fill_chunk(buffer)  # Pattern B

# Back to seed A - resets the stream!
gen.set_seed(1111)
gen.fill_chunk(buffer)  # SAME as first chunk (pattern A)

Use cases:

  • RAID stripe testing with alternating patterns per drive
  • Multi-phase AI/ML workloads (different patterns for metadata/payload/footer)
  • Complex reproducible benchmark scenarios
  • Low-overhead stream reset (no Generator recreation)
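The reset behaviour described above is a general property of seeded pseudo-random streams. The sketch below illustrates it with the standard library's `random.Random` as a stand-in; `SeededStream` is a hypothetical class that borrows the `set_seed` and `fill_chunk` names for readability and is not dgen-py's implementation.

```python
import random

class SeededStream:
    """Hypothetical stand-in for a reseedable generator (NOT dgen-py's implementation)."""

    def __init__(self, seed):
        self._rng = random.Random(seed)

    def set_seed(self, seed):
        # Reseeding restarts the pseudo-random sequence from its beginning.
        self._rng.seed(seed)

    def fill_chunk(self, buffer):
        # Overwrite the caller's buffer in place and report bytes written.
        buffer[:] = self._rng.randbytes(len(buffer))
        return len(buffer)

stream = SeededStream(seed=1111)
buffer = bytearray(4096)

stream.fill_chunk(buffer)
pattern_a = bytes(buffer)

stream.set_seed(2222)
stream.fill_chunk(buffer)
pattern_b = bytes(buffer)

stream.set_seed(1111)          # back to seed A: the stream restarts
stream.fill_chunk(buffer)
pattern_a_again = bytes(buffer)

print(pattern_a == pattern_a_again)   # True
print(pattern_a == pattern_b)         # False
```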

Version 0.1.6: Reproducible Data Generation

Generate identical data across runs for reproducible benchmarking and testing:

import dgen_py

# Reproducible mode - same seed produces identical data
gen1 = dgen_py.Generator(size=10 * 1024**3, seed=12345)
gen2 = dgen_py.Generator(size=10 * 1024**3, seed=12345)
# ⇒ gen1 and gen2 produce IDENTICAL data streams

# Non-deterministic mode (default) - different data each run  
gen3 = dgen_py.Generator(size=10 * 1024**3)  # seed=None (default)

Use cases:

  • 🔬 Reproducible benchmarking: Compare storage systems with identical workloads
  • ✅ Consistent testing: Same test data across CI/CD pipeline runs
  • 🐛 Debugging: Regenerate exact data streams for issue investigation
  • 📊 Compliance: Verifiable data generation for audits
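One way to confirm that two runs produced identical data is to compare a digest of each stream rather than the data itself. The sketch below hashes a stream chunk by chunk with SHA-256. `stream_digest` and `make_filler` are hypothetical helpers; the filler uses `random.Random` as a stand-in for a seeded generator's `fill_chunk`, so the example runs without dgen-py installed.

```python
import hashlib
import random

def stream_digest(fill_chunk, total_bytes, chunk_size=64 * 1024):
    """SHA-256 over a stream produced by repeatedly calling fill_chunk(buffer)."""
    digest = hashlib.sha256()
    buffer = bytearray(chunk_size)
    remaining = total_bytes
    while remaining > 0:
        nbytes = fill_chunk(buffer)
        if nbytes == 0:
            break
        nbytes = min(nbytes, remaining)
        # Hash through a memoryview slice so the chunk is not copied.
        digest.update(memoryview(buffer)[:nbytes])
        remaining -= nbytes
    return digest.hexdigest()

def make_filler(seed):
    """Hypothetical seeded source standing in for a generator's fill_chunk."""
    rng = random.Random(seed)

    def fill_chunk(buffer):
        buffer[:] = rng.randbytes(len(buffer))
        return len(buffer)

    return fill_chunk

run_1 = stream_digest(make_filler(12345), total_bytes=1024 * 1024)
run_2 = stream_digest(make_filler(12345), total_bytes=1024 * 1024)
other = stream_digest(make_filler(54321), total_bytes=1024 * 1024)

print(run_1 == run_2)   # True: same seed, same stream
print(run_1 == other)   # False: different seed
```

Storing the digest alongside benchmark results gives a compact record of which data set a run used.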

Streaming API (Basic Usage)

For unlimited data generation with constant memory usage, use the streaming API:

import dgen_py
import time

# Generate 100 GB with streaming (only 32 MB in memory at a time)
gen = dgen_py.Generator(
    size=100 * 1024**3,      # 100 GB total
    dedup_ratio=1.0,         # No deduplication 
    compress_ratio=1.0,      # Incompressible data
    numa_mode="auto",        # Auto-detect NUMA topology
    max_threads=None         # Use all available cores
)

# Create single reusable buffer
buffer = bytearray(gen.chunk_size)

# Stream data in chunks (zero-copy, parallel generation)
start = time.perf_counter()
while not gen.is_complete():
    nbytes = gen.fill_chunk(buffer)
    if nbytes == 0:
        break
    # Write to file/network: buffer[:nbytes]

duration = time.perf_counter() - start
print(f"Throughput: {(100 / duration):.2f} GB/s")

Example output (8-core system):

Throughput: 86.41 GB/s

When to use:

  • ✅ Generating very large datasets (> available RAM)
  • ✅ Consistent low memory footprint (32 MB)
  • ✅ Network streaming, continuous data generation
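When writing each chunk out, note that slicing a bytearray (`buffer[:nbytes]`) creates a copy, whereas slicing a memoryview of it does not. The sketch below shows the write loop with a memoryview. `FakeGenerator` is a hypothetical stand-in that mirrors the `chunk_size`, `is_complete()` and `fill_chunk()` names used in this section so the example runs without dgen-py; it is not the library's implementation.

```python
import os
import tempfile

class FakeGenerator:
    """Hypothetical stand-in exposing the chunk_size / is_complete() /
    fill_chunk() names used above; it is NOT dgen-py's implementation."""

    def __init__(self, size, chunk_size=1024 * 1024):
        self.chunk_size = chunk_size
        self._remaining = size

    def is_complete(self):
        return self._remaining == 0

    def fill_chunk(self, buffer):
        nbytes = min(len(buffer), self._remaining)
        memoryview(buffer)[:nbytes] = os.urandom(nbytes)  # fill in place
        self._remaining -= nbytes
        return nbytes

def stream_to_file(gen, path):
    """Write the whole stream to `path` using one reusable buffer."""
    buffer = bytearray(gen.chunk_size)
    view = memoryview(buffer)
    written = 0
    with open(path, "wb") as f:
        while not gen.is_complete():
            nbytes = gen.fill_chunk(buffer)
            if nbytes == 0:
                break
            # view[:nbytes] is a zero-copy slice; buffer[:nbytes] would copy.
            f.write(view[:nbytes])
            written += nbytes
    return written

expected = 5 * 1024 * 1024 + 123   # not a multiple of the chunk size
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stream.bin")
    total = stream_to_file(FakeGenerator(size=expected), path)
    size_on_disk = os.path.getsize(path)

print(total, size_on_disk)
```

The final chunk is shorter than the buffer, which is why the loop writes only the first `nbytes` bytes.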

System Information

import dgen_py

info = dgen_py.get_system_info()
if info:
    print(f"NUMA nodes: {info['num_nodes']}")
    print(f"Physical cores: {info['physical_cores']}")
    print(f"Deployment: {info['deployment_type']}")

Advanced Usage

Multi-Process NUMA (For Multi-NUMA Systems)

For maximum throughput on multi-socket systems, use one Python process per NUMA node with process affinity pinning.

See python/examples/benchmark_numa_multiprocess_v2.py for complete implementation.

Key architecture:

  • One Python process per NUMA node
  • Process pinning via os.sched_setaffinity() to local cores
  • Local memory allocation on each NUMA node
  • Synchronized start with multiprocessing.Barrier
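The four points above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the referenced example script: the CPU layout is hypothetical (a real run would take the per-node core lists from the system topology), the `fork` start method is assumed to be available, and the worker only reports its affinity where a real one would create a generator and fill buffers.

```python
import multiprocessing as mp
import os

def worker(node_id, cpus, barrier, results):
    """Pin to `cpus`, wait for all peers, then report the effective affinity."""
    if hasattr(os, "sched_setaffinity"):          # Linux only
        os.sched_setaffinity(0, cpus)
    barrier.wait(timeout=30)                      # synchronized start
    # A real worker would create its generator and fill buffers here.
    pinned = sorted(os.sched_getaffinity(0)) if hasattr(os, "sched_getaffinity") else None
    results.put((node_id, pinned))

def run_per_node(node_cpus):
    """Start one process per entry in `node_cpus` (node id -> set of CPU ids)."""
    ctx = mp.get_context("fork")                  # assumption: POSIX fork is available
    barrier = ctx.Barrier(len(node_cpus))
    results = ctx.Queue()
    procs = [
        ctx.Process(target=worker, args=(node_id, cpus, barrier, results), daemon=True)
        for node_id, cpus in node_cpus.items()
    ]
    for p in procs:
        p.start()
    collected = dict(results.get(timeout=30) for _ in procs)  # drain before join
    for p in procs:
        p.join()
    return collected

if __name__ == "__main__":
    available = sorted(os.sched_getaffinity(0)) if hasattr(os, "sched_getaffinity") else [0]
    # Hypothetical layout: two "nodes" mapped onto the first and last usable CPU.
    layout = {0: {available[0]}, 1: {available[-1]}}
    print(run_per_node(layout))
```

The barrier ensures that throughput is measured only while every process is running, so the aggregate figure is not inflated by staggered starts.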

Results:

  • C4-96 (48 cores, 2 NUMA nodes): 248.53 GB/s aggregate
  • C4-32 (16 cores, 1 NUMA node): 162.78 GB/s with 112% scaling efficiency

Chunk Size Optimization

Default chunk size is automatically optimized for your system. You can override if needed:

gen = dgen_py.Generator(
    size=100 * 1024**3,
    chunk_size=64 * 1024**2  # Override to 64 MB
)

Newer CPUs (Emerald Rapids, Sapphire Rapids) with larger L3 cache benefit from 64 MB chunks.

Deduplication and Compression Ratios

Performance vs Test Accuracy Tradeoff:

# FAST: Incompressible data (1.0x baseline)
gen = dgen_py.Generator(
    size=100 * 1024**3,
    dedup_ratio=1.0,      # No dedup (no performance impact)
    compress_ratio=1.0    # Incompressible data
)

# FASTER: More compressible (1.3-1.5x speedup)
gen = dgen_py.Generator(
    size=100 * 1024**3,
    dedup_ratio=1.0,      # No dedup (no performance impact)
    compress_ratio=2.0    # 2:1 compressible data
)

Important: Higher compress_ratio values improve generation performance (1.3-1.5x faster) BUT make the data more compressible, which may not represent your actual workload:

  • compress_ratio=1.0: Incompressible data (realistic for encrypted files, compressed archives)
  • compress_ratio=2.0: 2:1 compressible data (realistic for text, logs, uncompressed images)
  • compress_ratio=3.0+: Highly compressible data (may not be realistic)

Choose based on YOUR test requirements, not performance numbers. If testing storage with compression enabled, use compress_ratio=1.0 to avoid inflating storage efficiency metrics.

Note: dedup_ratio has negligible performance impact (< 1% variance)
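To check what a given buffer actually looks like to a storage system, you can measure its compressibility and duplicate-block rate directly. The sketch below is one simple way to do that with the standard library. The choice of zlib, the 4 KiB block size, and the helper names are assumptions for illustration; a real storage engine may use a different compressor and chunking scheme, so treat the results as indicative only.

```python
import hashlib
import os
import zlib

def compression_ratio(data, level=6):
    """Original size divided by zlib-compressed size (about 1.0 for random data)."""
    return len(data) / len(zlib.compress(data, level))

def dedup_ratio(data, block_size=4096):
    """Total fixed-size blocks divided by unique blocks (1.0 means no duplicates)."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    unique = {hashlib.sha256(block).digest() for block in blocks}
    return len(blocks) / len(unique)

MIB = 1024**2
random_data = os.urandom(MIB)                # incompressible, no duplicate blocks
half_zero = os.urandom(MIB) + bytes(MIB)     # roughly 2:1 compressible
repeated = os.urandom(4 * 4096) * 2          # every block appears twice

for name, data in [("random", random_data), ("half-zero", half_zero), ("repeated", repeated)]:
    print(f"{name:10s} compress {compression_ratio(data):.2f}  dedup {dedup_ratio(data):.2f}")
```

Running the same two functions over a buffer filled by the generator shows how closely the output tracks the requested ratios under this particular compressor and block size.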

NUMA Modes

# Auto-detect topology (recommended)
gen = dgen_py.Generator(size=100 * 1024**3, numa_mode="auto")

# Force UMA (single-socket)
gen = dgen_py.Generator(size=100 * 1024**3, numa_mode="uma")

# Manual NUMA node binding (multi-process only)
gen = dgen_py.Generator(size=100 * 1024**3, numa_node=0)  # Bind to node 0

Architecture

Zero-Copy Implementation

Python buffer protocol with direct memory access:

  • No data copying between Rust and Python
  • GIL released during generation (true parallelism)
  • Memoryview creation < 0.001 ms (verified zero-copy)
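The buffer protocol behaviour that zero-copy relies on can be seen with plain Python objects. This short sketch uses only built-ins and shows that a memoryview shares memory with its bytearray, while slicing the bytearray itself produces an independent copy.

```python
buffer = bytearray(8)

view = memoryview(buffer)        # shares memory with `buffer`, no copy
view[0:4] = b"\x01\x02\x03\x04"  # writes land directly in `buffer`

copied = buffer[0:4]             # slicing a bytearray makes a new object
copied[0] = 0xFF                 # does not affect `buffer`

window = view[0:4]               # slicing a memoryview stays zero-copy
buffer[1] = 0xAA                 # visible through `window` immediately

print(bytes(buffer[:4]))   # b'\x01\xaa\x03\x04'
print(window[1] == 0xAA)   # True
print(view.obj is buffer)  # True
```

Any extension that accepts a writable buffer can fill a bytearray in place the same way, which is what makes it possible to reuse one buffer across chunks.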

Parallel Generation

  • 4 MiB internal blocks distributed across all cores
  • Thread pool created once, reused for all operations
  • Xoshiro256++ RNG (5-10x faster than ChaCha20)
  • Optimal for L3 cache performance
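The block-partitioning idea can be illustrated in pure Python. In the sketch below a chunk is split into 4 MiB blocks, each block is filled by its own seeded RNG, and a single thread pool is reused across chunks. Two things are assumptions made for the example rather than facts about dgen-py: the per-block seeding scheme (base seed combined with the block index) and the use of `random.Random` in place of Xoshiro256++. Because of the GIL this Python version will not show the parallel speedup of the Rust implementation; it only demonstrates the structure.

```python
import random
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4 * 1024 * 1024          # 4 MiB, as stated above

def fill_block(view, seed, index):
    """Fill one block from an RNG seeded by (seed, block index)."""
    rng = random.Random(seed * 1_000_003 + index)   # assumed seeding scheme
    view[:] = rng.randbytes(len(view))

def fill_parallel(buffer, seed, pool, block_size=BLOCK_SIZE):
    """Split `buffer` into blocks and fill them on a shared thread pool."""
    view = memoryview(buffer)
    futures = [
        pool.submit(fill_block, view[start:start + block_size], seed, index)
        for index, start in enumerate(range(0, len(buffer), block_size))
    ]
    for future in futures:
        future.result()                # propagate any worker exception
    return len(futures)

# The pool is created once and reused for every chunk.
with ThreadPoolExecutor(max_workers=4) as pool:
    chunk_a = bytearray(32 * 1024 * 1024)
    chunk_b = bytearray(32 * 1024 * 1024)
    blocks = fill_parallel(chunk_a, seed=42, pool=pool)
    fill_parallel(chunk_b, seed=42, pool=pool)

print(blocks, chunk_a == chunk_b)   # 8 True
```

Seeding each block independently makes the output independent of which thread handles which block, so the result stays reproducible regardless of scheduling.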

NUMA Optimization

  • Multi-process architecture (one process per NUMA node)
  • Local memory allocation on each node
  • Local core affinity (no cross-node traffic)
  • Automatic topology detection via hwloc

Use Cases

  • Storage benchmarking: Generate realistic test data at 40-188 GB/s
  • Network testing: High-throughput data sources
  • AI/ML profiling: Simulate data loading pipelines
  • Compression testing: Validate compressor behavior with controlled ratios
  • Deduplication testing: Test dedup systems with known ratios

License

Dual-licensed under MIT OR Apache-2.0
