dgen-py
High-performance random data generation with NUMA optimization and zero-copy Python interface
Features
- 🚀 Blazing Fast: 40+ GB/s on 12 cores, 126 GB/s on 96 cores, 188 GB/s on 368 cores
- 🎯 Controllable Characteristics: Configurable deduplication and compression ratios
- 🔬 Multi-Process NUMA: One Python process per NUMA node for maximum throughput
- 🐍 True Zero-Copy: Python buffer protocol with direct memory access (no data copying)
- 📦 Streaming API: Generate terabytes of data with constant memory usage
- 🧵 Thread Pool Reuse: Created once, reused across all operations
- 🛠️ Built with Rust: Memory-safe, production-quality implementation
Performance
Real-World Benchmarks (v0.1.3)
Multi-NUMA Systems (one Python process per NUMA node):
| System | Cores | NUMA Nodes | Throughput | Per-Core | Efficiency |
|---|---|---|---|---|---|
| GCP C4-16 | 16 | 1 (UMA) | 39.87 GB/s | 2.49 GB/s | 100% (baseline) |
| GCP C4-96 | 96 | 4 | 126.96 GB/s | 1.32 GB/s | 53% |
| Azure HBv5 | 368 | 16 | 188.24 GB/s | 0.51 GB/s | 20% |
Single-NUMA Systems (one Python process):
| System | Cores | Throughput | Per-Core | Notes |
|---|---|---|---|---|
| Workstation | 12 | 41.23 GB/s | 3.44 GB/s | Development system, UMA |
Key Findings:
- Sub-linear scaling is expected for memory-intensive workloads (memory bandwidth bottleneck)
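- The multi-NUMA systems (126.96 GB/s and 188.24 GB/s) exceed the 80 GB/s storage testing requirement; the single-NUMA systems reach about 40 GB/s
- Maximum throughput: 188 GB/s on 368-core HBv5 system
- Single-node performance: about 40 GB/s on commodity hardware (39.87 GB/s on the 16-core C4-16, 41.23 GB/s on the 12-core workstation)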
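The Per-Core and Efficiency columns follow from simple arithmetic: throughput divided by core count, then normalized to the C4-16 baseline. A small sketch using only the figures from the multi-NUMA table (the helper name `scaling_table` is illustrative, not part of dgen-py):

```python
# Figures copied from the multi-NUMA benchmark table above.
SYSTEMS = [
    ("GCP C4-16", 16, 39.87),
    ("GCP C4-96", 96, 126.96),
    ("Azure HBv5", 368, 188.24),
]


def scaling_table(systems):
    """Return (name, per_core_gb_s, efficiency) rows; the first row is the baseline."""
    baseline_per_core = systems[0][2] / systems[0][1]
    rows = []
    for name, cores, throughput in systems:
        per_core = throughput / cores
        rows.append((name, per_core, per_core / baseline_per_core))
    return rows


for name, per_core, efficiency in scaling_table(SYSTEMS):
    print(f"{name:<11} {per_core:5.2f} GB/s per core  {efficiency:6.1%} of baseline")
```

Computed from unrounded values, the HBv5 efficiency comes out slightly above 20%; the table's figure appears to be derived from the rounded per-core numbers.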
Installation
From PyPI (Recommended)
pip install dgen-py
System Requirements
For NUMA support (Linux only):
# Ubuntu/Debian
sudo apt-get install libudev-dev libhwloc-dev
# RHEL/CentOS/Fedora
sudo yum install systemd-devel hwloc-devel
Note: NUMA support is optional. Without these libraries, the package still works on single-NUMA systems (workstations, cloud VMs).
Quick Start
Basic Usage (Fastest - No Dedup/Compression)
import dgen_py

# Generate 100 GB of random data (incompressible, no dedup)
gen = dgen_py.Generator(
    size=100 * 1024**3,      # 100 GB
    dedup_ratio=1.0,         # No deduplication (fastest)
    compress_ratio=1.0,      # Incompressible (fastest)
    numa_mode="auto",        # Auto-detect NUMA topology
    max_threads=None         # Use all available cores
)

# Create buffer (uses optimal 32 MB chunk size)
buffer = bytearray(gen.chunk_size)

# Stream data in chunks (zero-copy, parallel generation)
while not gen.is_complete():
    nbytes = gen.fill_chunk(buffer)
    if nbytes == 0:
        break
    # Write to file/network: buffer[:nbytes]
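A sketch of the write step left as a comment in the loop above. It writes through a `memoryview` so that taking the first `nbytes` of the buffer does not copy the data. It assumes only the three generator members used above (`chunk_size`, `is_complete()`, `fill_chunk()`); `stream_to_file` and the `FakeGenerator` stand-in are hypothetical helpers, included so the loop can run without dgen-py installed.

```python
import os
import tempfile


class FakeGenerator:
    """Hypothetical stand-in that mimics the Generator members used above."""

    def __init__(self, size, chunk_size=1024 * 1024):
        self.chunk_size = chunk_size
        self._remaining = size

    def is_complete(self):
        return self._remaining == 0

    def fill_chunk(self, buffer):
        nbytes = min(len(buffer), self._remaining)
        buffer[:nbytes] = os.urandom(nbytes)  # stand-in for in-place generation
        self._remaining -= nbytes
        return nbytes


def stream_to_file(gen, path):
    """Drain `gen` into the file at `path`; return the number of bytes written."""
    buffer = bytearray(gen.chunk_size)
    view = memoryview(buffer)  # slicing a memoryview does not copy
    written = 0
    with open(path, "wb") as f:
        while not gen.is_complete():
            nbytes = gen.fill_chunk(buffer)
            if nbytes == 0:
                break
            f.write(view[:nbytes])
            written += nbytes
    return written


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data.bin")
    total = stream_to_file(FakeGenerator(size=3 * 1024 * 1024 + 123), target)
    print(total, os.path.getsize(target))
```

With the real library, a `dgen_py.Generator` instance would presumably be passed in place of `FakeGenerator`.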
Performance Example (Actual Results)
import dgen_py
import time

# 100 GB incompressible test
TEST_SIZE = 100 * 1024**3

gen = dgen_py.Generator(
    size=TEST_SIZE,
    dedup_ratio=1.0,         # No deduplication
    compress_ratio=1.0,      # Incompressible
    numa_mode="auto",
    max_threads=None
)

buffer = bytearray(gen.chunk_size)

start = time.perf_counter()
while not gen.is_complete():
    nbytes = gen.fill_chunk(buffer)
    if nbytes == 0:
        break
duration = time.perf_counter() - start

throughput = (TEST_SIZE / 1024**3) / duration
print(f"Duration: {duration:.2f} seconds")
print(f"Throughput: {throughput:.2f} GB/s")
Complete benchmark output (12-core workstation):
NUMA nodes: 1
Physical cores: 12
Deployment: UMA (single NUMA node - cloud VM or workstation)
Starting Benchmark: 3 runs of 100 GB each
Using ZERO-COPY PARALLEL STREAMING
============================================================
TEST 1: DEFAULT CHUNK SIZE (should use optimal 32 MB)
============================================================
Using chunk size: 32 MB
------------------------------------------------------------
Run 01: 3.0401 seconds | 32.89 GB/s
Run 02: 2.1536 seconds | 46.43 GB/s
Run 03: 2.0826 seconds | 48.02 GB/s
------------------------------------------------------------
AVERAGE DURATION: 2.4254 seconds
AVERAGE THROUGHPUT: 41.23 GB/s
PER-CORE THROUGHPUT: 3.44 GB/s
============================================================
TEST 2: OVERRIDE CHUNK SIZE TO 64 MB
============================================================
Using chunk size: 64 MB
------------------------------------------------------------
Run 01: 2.2696 seconds | 44.06 GB/s
Run 02: 2.2647 seconds | 44.16 GB/s
Run 03: 2.2709 seconds | 44.04 GB/s
------------------------------------------------------------
AVERAGE DURATION: 2.2684 seconds
AVERAGE THROUGHPUT: 44.08 GB/s
PER-CORE THROUGHPUT: 3.67 GB/s
============================================================
COMPARISON
============================================================
32 MB (default): 41.23 GB/s
64 MB (override): 44.08 GB/s
64 MB is 6.5% faster than 32 MB
OPTIMIZATION NOTES:
- Thread pool created ONCE and reused
- ZERO-COPY: Generates directly into output buffer
- Internal parallelization: 4 MiB blocks (optimal for L3 cache)
- Parallel generation distributes blocks across all available cores
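The AVERAGE THROUGHPUT lines in the output above correspond to the test size divided by the mean run duration, which is not the same as averaging the per-run GB/s figures. A short check using the run times from the output (the function name `summarize` is illustrative):

```python
from statistics import mean

TEST_SIZE_GB = 100
CORES = 12

# Run durations in seconds, copied from the benchmark output above.
RUNS_32MB = [3.0401, 2.1536, 2.0826]
RUNS_64MB = [2.2696, 2.2647, 2.2709]


def summarize(durations, size_gb=TEST_SIZE_GB, cores=CORES):
    """Return (mean duration, throughput from mean duration, per-core throughput)."""
    avg_duration = mean(durations)
    throughput = size_gb / avg_duration
    return avg_duration, throughput, throughput / cores


for label, runs in (("32 MB", RUNS_32MB), ("64 MB", RUNS_64MB)):
    avg, gbps, per_core = summarize(runs)
    naive = mean(TEST_SIZE_GB / d for d in runs)  # mean of per-run rates, for contrast
    print(f"{label}: {avg:.4f} s | {gbps:.2f} GB/s | {per_core:.2f} GB/s per core"
          f" | mean of per-run rates {naive:.2f} GB/s")
```

The slow first run in the 32 MB test weighs more heavily on the duration-based average than it would on a mean of per-run rates, which is worth keeping in mind when comparing the two chunk sizes.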
System Information
import dgen_py

info = dgen_py.get_system_info()
if info:
    print(f"NUMA nodes: {info['num_nodes']}")
    print(f"Physical cores: {info['physical_cores']}")
    print(f"Deployment: {info['deployment_type']}")

# Example output (12-core workstation):
# NUMA nodes: 1
# Physical cores: 12
# Deployment: UMA (single NUMA node - cloud VM or workstation)
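The `if info:` guard suggests that `get_system_info()` may return nothing on some systems. In that case a rough node count can still be read from Linux sysfs. This is a stdlib-only sketch, not how dgen-py detects topology (it uses hwloc); `count_numa_nodes_sysfs` is a hypothetical helper, and it assumes the standard `/sys/devices/system/node/nodeN` layout.

```python
import os
import re


def count_numa_nodes_sysfs(root="/sys/devices/system/node"):
    """Count nodeN directories under `root`; report 1 if none are visible."""
    try:
        entries = os.listdir(root)
    except OSError:
        return 1  # non-Linux or sysfs not available: treat as UMA
    nodes = [e for e in entries if re.fullmatch(r"node\d+", e)]
    return max(1, len(nodes))


print(f"NUMA nodes (sysfs): {count_numa_nodes_sysfs()}")
```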
Advanced Usage
Multi-Process NUMA (For Multi-NUMA Systems)
For maximum throughput on multi-socket systems, use **one Python process per NUMA node**:
```python
import time
from multiprocessing import Process, Queue, Barrier

import dgen_py


def worker_process(numa_node: int, barrier: Barrier, result_queue: Queue):
    """One process per NUMA node for maximum performance"""
    gen = dgen_py.Generator(
        size=100 * 1024**3,      # 100 GB per process
        dedup_ratio=1.0,         # No deduplication
        compress_ratio=1.0,      # Incompressible
        numa_node=numa_node,     # Bind to specific NUMA node
        max_threads=None
    )
    buffer = bytearray(gen.chunk_size)
    barrier.wait()  # Synchronized start
    start = time.perf_counter()
    while not gen.is_complete():
        nbytes = gen.fill_chunk(buffer)
        if nbytes == 0:
            break
        # Write buffer[:nbytes] to storage
    duration = time.perf_counter() - start
    result_queue.put({'numa_node': numa_node, 'duration': duration})


if __name__ == "__main__":  # Needed when processes are started with "spawn"
    # Detect NUMA topology
    num_numa_nodes = dgen_py.detect_numa_nodes()

    # Spawn one process per NUMA node
    barrier = Barrier(num_numa_nodes)
    result_queue = Queue()
    processes = [
        Process(target=worker_process, args=(i, barrier, result_queue))
        for i in range(num_numa_nodes)
    ]
    for p in processes:
        p.start()

    # Collect results (one dict per process), then wait for the workers to exit
    results = [result_queue.get() for _ in processes]
    for p in processes:
        p.join()

    # On C4-96 (4 NUMA nodes): 126.96 GB/s aggregate
    # On HBv5 (16 NUMA nodes): 188.24 GB/s aggregate
```
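Once the per-process results have been read from the queue, one way to turn them into an aggregate figure is to divide the total bytes generated by the slowest process's duration. This assumes every process generates the same number of bytes and starts at the barrier. The helper name and the sample durations below are made up for illustration.

```python
def aggregate_throughput(results, bytes_per_process):
    """Aggregate throughput (size in units of 1024**3 bytes per second).

    `results` is a list of dicts with a 'duration' key (seconds), in the same
    shape the workers put on the queue. Wall-clock time is taken as the
    slowest worker's duration.
    """
    if not results:
        raise ValueError("no results collected")
    wall_clock = max(r["duration"] for r in results)
    total_bytes = bytes_per_process * len(results)
    return total_bytes / 1024**3 / wall_clock


# Illustrative durations only, not measured values.
sample = [
    {"numa_node": 0, "duration": 3.10},
    {"numa_node": 1, "duration": 3.25},
    {"numa_node": 2, "duration": 3.05},
    {"numa_node": 3, "duration": 3.20},
]
print(f"{aggregate_throughput(sample, 100 * 1024**3):.2f} GB/s aggregate")
```

Using the slowest duration rather than the mean gives a conservative figure, since the run is only finished when the last process completes.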
Performance Notes
Chunk Size Optimization
The default chunk size is 32 MB. You can override it, and a larger chunk may be slightly faster on some systems:
gen = dgen_py.Generator(
    size=100 * 1024**3,
    dedup_ratio=1.0,
    compress_ratio=1.0,
    chunk_size=64 * 1024**2  # Override to 64 MB
)
Benchmark results (12-core workstation, 100 GB test):
- 32 MB chunks: 41.23 GB/s (3.44 GB/s per core)
- 64 MB chunks: 44.08 GB/s (3.67 GB/s per core)
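- Difference: 64 MB is about 6.9% faster than the 32 MB default on this system (the 6.5% in the benchmark output is the same 2.85 GB/s gap measured against the 64 MB figure)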
Deduplication and Compression
For maximum performance, use dedup_ratio=1.0 and compress_ratio=1.0:
# FASTEST: No deduplication, incompressible
gen = dgen_py.Generator(
    size=100 * 1024**3,
    dedup_ratio=1.0,         # No dedup (fastest)
    compress_ratio=1.0       # Incompressible (fastest)
)
Higher ratios reduce throughput:
# SLOWER: With dedup and compression
gen = dgen_py.Generator(
    size=100 * 1024**3,
    dedup_ratio=2.0,         # 2:1 deduplication
    compress_ratio=3.0       # 3:1 compression
)
# Throughput will be lower due to processing overhead
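When using non-default ratios, it can be useful to check how a sample of the output actually looks to a compressor or a dedup engine. The sketch below is a stdlib-only approximation: it uses zlib as the compressor and fixed-size block hashing for dedup. Both are assumptions, since real storage systems use their own algorithms and block sizes and nothing above states which compressor the ratio targets. The function names are illustrative.

```python
import hashlib
import os
import zlib


def measured_compress_ratio(data, level=6):
    """Original size divided by zlib-compressed size (about 1.0 = incompressible)."""
    if not data:
        raise ValueError("empty sample")
    return len(data) / len(zlib.compress(data, level))


def measured_dedup_ratio(data, block_size=4096):
    """Total fixed-size blocks divided by unique blocks (1.0 = no duplicates)."""
    if not data:
        raise ValueError("empty sample")
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    unique = {hashlib.sha256(b).digest() for b in blocks}
    return len(blocks) / len(unique)


# Synthetic samples standing in for generator output.
random_part = os.urandom(512 * 1024)
incompressible = os.urandom(1024 * 1024)
half_zero = random_part + bytes(512 * 1024)  # roughly 2:1 compressible
duplicated = random_part + random_part       # every block appears twice

print(f"compress, random:    {measured_compress_ratio(incompressible):.2f}")
print(f"compress, half zero: {measured_compress_ratio(half_zero):.2f}")
print(f"dedup, random:       {measured_dedup_ratio(incompressible):.2f}")
print(f"dedup, duplicated:   {measured_dedup_ratio(duplicated):.2f}")
```

To check real output, the synthetic samples would be replaced with bytes taken from a filled buffer; the measured values will only approximate the configured ratios if the target system's algorithm differs from the ones assumed here.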
NUMA Modes
# Auto-detect topology (recommended)
gen = dgen_py.Generator(..., numa_mode="auto")
# Force UMA (single-socket)
gen = dgen_py.Generator(..., numa_mode="uma")
# Manual NUMA node binding (multi-process only)
gen = dgen_py.Generator(..., numa_node=0) # Bind to node 0
Architecture
Zero-Copy Implementation
Python buffer protocol with direct memory access:
- No data copying between Rust and Python
- GIL released during generation (true parallelism)
- Memoryview creation < 0.001 ms (verified zero-copy)
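The difference between a view and a copy can be checked from pure Python. The sketch below uses a plain `bytearray` rather than a dgen-py object, so it illustrates the buffer protocol itself, not the library's internals.

```python
buffer = bytearray(8)

view = memoryview(buffer)   # shares memory with `buffer`
copy = bytes(buffer)        # independent copy
sliced = buffer[:4]         # slicing a bytearray also copies
view_slice = view[:4]       # slicing a memoryview does not

buffer[0] = 0xFF            # mutate the original in place

print(view[0], view_slice[0])   # both views see the write
print(copy[0], sliced[0])       # neither copy sees the write
print(view.obj is buffer)       # the view is backed by the original object
```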
Parallel Generation
- 4 MiB internal blocks, sized for L3 cache, distributed across all cores
- Thread pool created once, reused for all operations
- Xoshiro256++ RNG (5-10x faster than ChaCha20)
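The block-splitting idea can be sketched in pure Python. This illustrates the structure only (a chunk split into fixed-size blocks, each filled by a worker from a pool that is created once); it is not the Rust implementation. Python's `random.Random` stands in for Xoshiro256++, the per-block seeding scheme is an assumption, and Python threads will not show the same speedup because of the GIL.

```python
import random
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB, matching the block size described above

# Created once at import time and reused for every chunk.
_POOL = ThreadPoolExecutor(max_workers=4)


def _fill_block(view, start, end, seed):
    """Fill view[start:end] from an RNG seeded for this block."""
    rng = random.Random(seed)
    view[start:end] = rng.randbytes(end - start)  # requires Python 3.9+


def fill_chunk_parallel(buffer, base_seed=0, block_size=BLOCK_SIZE):
    """Fill `buffer` in place, one pool task per block; return bytes written."""
    view = memoryview(buffer)
    futures = []
    for index, start in enumerate(range(0, len(buffer), block_size)):
        end = min(start + block_size, len(buffer))
        # Hypothetical seeding: a distinct, reproducible seed per block.
        seed = f"{base_seed}:{index}"
        futures.append(_POOL.submit(_fill_block, view, start, end, seed))
    for future in futures:
        future.result()  # propagate any worker exception
    return len(buffer)


chunk = bytearray(32 * 1024 * 1024)  # one 32 MB chunk, i.e. eight 4 MiB blocks
print(fill_chunk_parallel(chunk, base_seed=42))
```

Because each block gets its own seed, the result does not depend on which thread fills which block, so the output is reproducible regardless of scheduling.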
NUMA Optimization
- Multi-process architecture (one process per NUMA node)
- Local memory allocation on each node
- Local core affinity (no cross-node traffic)
- Automatic topology detection via hwloc
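For reference, core affinity on Linux can be approximated from Python with sysfs and `os.sched_setaffinity`. This is a stdlib sketch of the general technique, not dgen-py's mechanism, and it does not handle memory placement. `parse_cpulist` and `pin_to_numa_node` are hypothetical helpers, and the sysfs path is the standard Linux location assumed here.

```python
import os


def parse_cpulist(text):
    """Parse a sysfs cpulist such as '0-3,8-11' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_numa_node(node, root="/sys/devices/system/node"):
    """Restrict the current process to the CPUs of `node` (Linux only).

    Returns the CPU set applied, or None if pinning is unavailable or fails.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # e.g. macOS or Windows
    try:
        with open(os.path.join(root, f"node{node}", "cpulist")) as f:
            cpus = parse_cpulist(f.read())
        if not cpus:
            return None
        os.sched_setaffinity(0, cpus)  # 0 means the calling process
    except OSError:
        return None
    return cpus


print(sorted(parse_cpulist("0-3,8-11")))
```

In a multi-process setup, each worker would call something like `pin_to_numa_node(numa_node)` before allocating its buffer, so that first-touch allocation tends to land on the local node.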
Use Cases
- Storage benchmarking: Generate realistic test data at 40-188 GB/s
- Network testing: High-throughput data sources
- AI/ML profiling: Simulate data loading pipelines
- Compression testing: Validate compressor behavior with controlled ratios
- Deduplication testing: Test dedup systems with known ratios
License
Dual-licensed under MIT OR Apache-2.0
Credits
- Built with PyO3 and Maturin
- Uses hwlocality for NUMA topology detection
- Xoshiro256++ RNG from rand crate