High-performance random data generation with NUMA optimization and zero-copy Python interface
dgen-rs / dgen-py
High-performance random data generation with controllable deduplication, compression, and NUMA optimization
Features
- 🚀 Blazing Fast: 5-15 GB/s per core using Xoshiro256++ RNG
- 🎯 Controllable Characteristics:
  - Deduplication ratios (1:1 to N:1)
  - Compression ratios (1:1 to N:1)
- 🔬 NUMA-Aware: Automatic topology detection and optimization on multi-socket systems
- 🐍 Zero-Copy Python API: Direct buffer writes, no unnecessary copies
- 📦 Both Simple and Streaming: Single-call or incremental generation
- 🛠️ Built with Rust: Memory-safe, production-quality code
Quick Start
Python Installation
# Install from PyPI
pip install dgen-py
# Or build from source
cd dgen-rs
maturin develop --release
Python Usage
Simple API (generate all at once):
import dgen_py
# Generate 100 MiB incompressible data
data = dgen_py.generate_data(100 * 1024 * 1024)
print(f"Generated {len(data)} bytes")
# Generate with 2:1 dedup and 3:1 compression
data = dgen_py.generate_data(
    size=100 * 1024 * 1024,
    dedup_ratio=2.0,
    compress_ratio=3.0,
)
Zero-Copy API (write into existing buffer):
import dgen_py
import numpy as np
# Pre-allocate buffer
buf = bytearray(1024 * 1024)
# Generate directly into buffer (zero-copy!)
nbytes = dgen_py.fill_buffer(buf, compress_ratio=2.0)
print(f"Wrote {nbytes} bytes")
# Works with NumPy arrays
arr = np.zeros(100 * 1024 * 1024, dtype=np.uint8)
dgen_py.fill_buffer(arr, dedup_ratio=2.0, compress_ratio=3.0)
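To illustrate what "zero-copy" means on the Python side, here is a minimal sketch using only the standard library. It does not call dgen_py; `fill_in_place` is a hypothetical stand-in that writes through the buffer protocol the same way a native extension could, so the caller's object is modified directly and no intermediate `bytes` result is returned.

```python
import array


def fill_in_place(target, value=0xAB):
    """Write `value` into every byte of a writable buffer, in place.

    Hypothetical helper for illustration only; it is not part of dgen_py.
    """
    view = memoryview(target).cast("B")  # flat byte view of the caller's memory
    if view.readonly:
        raise TypeError("target must be a writable buffer (e.g. bytearray)")
    view[:] = bytes([value]) * len(view)
    return len(view)


buf = bytearray(8)
written = fill_in_place(buf)
print(written, buf == b"\xab" * 8)  # 8 True

# Any object exposing a writable buffer works, not only bytearray.
words = array.array("I", [0] * 4)
print(fill_in_place(words) == 4 * words.itemsize)  # True
```

Immutable objects such as `bytes` expose a read-only buffer, so a call like `fill_in_place(b"abc")` raises `TypeError` in this sketch.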
Streaming API (incremental generation):
import dgen_py
# Create generator for 1 GiB
gen = dgen_py.Generator(
    size=1024 * 1024 * 1024,
    dedup_ratio=2.0,
    compress_ratio=3.0,
    numa_aware=True,  # Auto-detected by default
)
# Generate in chunks
chunk_size = 8192
buf = bytearray(chunk_size)
total = 0
while not gen.is_complete():
    nbytes = gen.fill_chunk(buf)
    if nbytes == 0:
        break
    total += nbytes
    # Process chunk (write to file, network, etc.)
    # ...
print(f"Generated {total} bytes")
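The "process chunk" step in the loop above could, for example, write each chunk to a file. The sketch below assumes only the two methods shown above (`is_complete()` and `fill_chunk(buf)`, which returns a byte count). `FakeGenerator` and `stream_to_file` are hypothetical names introduced so the example runs without dgen_py installed; the assumption that the final chunk may be shorter than the buffer is why only the first `nbytes` bytes are written.

```python
import os
import tempfile


class FakeGenerator:
    """Hypothetical stand-in exposing the same two methods as the loop above."""

    def __init__(self, size):
        self.remaining = size

    def is_complete(self):
        return self.remaining == 0

    def fill_chunk(self, buf):
        n = min(len(buf), self.remaining)
        buf[:n] = os.urandom(n)  # same-length slice assignment, no resize
        self.remaining -= n
        return n


def stream_to_file(gen, path, chunk_size=1 << 20):
    """Drain `gen` into `path`, reusing a single chunk buffer."""
    buf = bytearray(chunk_size)
    total = 0
    with open(path, "wb") as f:
        while not gen.is_complete():
            nbytes = gen.fill_chunk(buf)
            if nbytes == 0:
                break
            # Write only the bytes that were filled; the last chunk may be short.
            f.write(memoryview(buf)[:nbytes])
            total += nbytes
    return total


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    written = stream_to_file(FakeGenerator(2_500_000), path)
    print(written, os.path.getsize(path))  # 2500000 2500000
```

With dgen_py installed, a `dgen_py.Generator` instance could presumably be passed in place of `FakeGenerator`, since the function relies only on the two methods used above.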
NUMA Information:
import dgen_py
info = dgen_py.get_system_info()
if info:
    print(f"NUMA nodes: {info['num_nodes']}")
    print(f"Physical cores: {info['physical_cores']}")
    print(f"Deployment: {info['deployment_type']}")
Rust Usage
use dgen_rs::{generate_data_simple, GeneratorConfig, DataGenerator};
// Simple API
let data = generate_data_simple(100 * 1024 * 1024, 1, 1);
// Full configuration
let config = GeneratorConfig {
    size: 100 * 1024 * 1024,
    dedup_factor: 2,
    compress_factor: 3,
    numa_aware: true,
};
// Clone so `config` can be reused for streaming below (assumes GeneratorConfig: Clone)
let data = dgen_rs::generate_data(config.clone());
// Streaming
let mut gen = DataGenerator::new(config);
let mut chunk = vec![0u8; 8192];
while !gen.is_complete() {
    let written = gen.fill_chunk(&mut chunk);
    if written == 0 {
        break;
    }
    // Process chunk...
}
How It Works
Deduplication
Deduplication ratio N means:
- Generate total_blocks / N unique blocks
- Reuse blocks in round-robin fashion
- Example: 100 blocks, dedup=2 → 50 unique blocks, repeated 2x each
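The round-robin mapping described above can be sketched in a few lines. This is an illustration of the idea, not the library's code: `block_sources` is a hypothetical helper, and rounding the unique-block count down (with a minimum of one) is an assumption.

```python
from collections import Counter


def block_sources(total_blocks, dedup_ratio):
    """Return, for each output block, the index of the unique block it repeats."""
    if total_blocks < 0 or dedup_ratio < 1:
        raise ValueError("total_blocks must be >= 0 and dedup_ratio >= 1")
    if total_blocks == 0:
        return []
    # Assumption: round down, but always keep at least one unique block.
    unique = max(1, total_blocks // dedup_ratio)
    # Round-robin reuse: block i is a copy of unique block (i mod unique).
    return [i % unique for i in range(total_blocks)]


counts = Counter(block_sources(100, 2))
print(len(counts), set(counts.values()))  # 50 {2}
```

With `dedup_ratio=1` every block maps to itself, so no block is repeated.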
Compression
Compression ratio N means:
- Fill block with high-entropy Xoshiro256++ keystream
- Add local back-references to achieve N:1 compressibility
- Example: compress=3 → zstd will compress to ~33% of original size
compress=1: Truly incompressible (zstd ratio ~1.00-1.02)
compress>1: Target ratio via local back-refs, evenly distributed
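One way to picture the back-reference idea is the sketch below. It is not the library's algorithm: `make_block` is a hypothetical helper, Python's `random` module stands in for Xoshiro256++, zlib stands in for zstd (so the example needs only the standard library), and the 512-byte run length is arbitrary. Each random run is followed by N-1 copies of itself, so roughly 1/N of the bytes are new entropy and the rest can be encoded as short-distance matches.

```python
import random
import zlib


def make_block(size, compress_ratio, seed=0, literal_len=512):
    """Build `size` bytes that should compress to roughly size / compress_ratio."""
    rng = random.Random(seed)  # stand-in RNG, not Xoshiro256++
    out = bytearray()
    while len(out) < size:
        literal = rng.randbytes(literal_len)  # high-entropy run
        out += literal
        # Local back-references: repeat the run just written (N - 1) times.
        out += literal * (compress_ratio - 1)
    return bytes(out[:size])


for n in (1, 3):
    block = make_block(1 << 20, n)
    ratio = len(block) / len(zlib.compress(block, 6))
    print(f"compress={n}: zlib ratio {ratio:.2f}")
```

The measured ratio lands slightly below the target because the compressor still spends a few bytes encoding each match. Whether and how the library compensates for that overhead is not stated here.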
NUMA Optimization
On multi-socket systems (NUMA nodes > 1):
- Detects topology via /sys/devices/system/node (Linux)
- Can pin rayon threads to specific NUMA nodes (optional)
- Ensures memory locality for maximum bandwidth
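As a rough illustration of the detection step, the sketch below counts NUMA nodes by listing the `nodeN` directories that Linux exposes under /sys/devices/system/node. `numa_node_ids` is a hypothetical helper, not part of dgen_py, and it returns an empty list on systems without that directory.

```python
import re
from pathlib import Path


def numa_node_ids(sysfs_root="/sys/devices/system/node"):
    """Return the sorted NUMA node ids found under `sysfs_root`."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return []  # not Linux, or sysfs not mounted
    ids = []
    for entry in root.iterdir():
        match = re.fullmatch(r"node(\d+)", entry.name)
        if match and entry.is_dir():
            ids.append(int(match.group(1)))
    return sorted(ids)


nodes = numa_node_ids()
if len(nodes) > 1:
    print(f"{len(nodes)} NUMA nodes: {nodes}")
else:
    print("single NUMA node (or topology not available)")
```

For the values the library itself reports, use `dgen_py.get_system_info()` as shown under NUMA Information.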
Performance
Typical throughput on modern CPUs:
- Incompressible (compress=1): 5-15 GB/s per core
- Compressible (compress=3): 1-4 GB/s per core
- Multi-core: Near-linear scaling with rayon
Benchmark on AMD EPYC 7742 (64 cores):
Incompressible: ~500 GB/s (all cores)
Compress 3:1: ~150 GB/s (all cores)
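Throughput depends heavily on hardware, so it is worth measuring on your own machine. The sketch below is one simple way to do that. `measure_gbps` and `urandom_fill` are hypothetical names; `urandom_fill` is a stand-in so the example runs without dgen_py, and it is far slower than the figures quoted above.

```python
import os
import time


def measure_gbps(fill, buf, repeats=5):
    """Return the best observed throughput of `fill(buf)` in GB/s (1e9 bytes/s)."""
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        nbytes = fill(buf)  # expected to return the number of bytes written
        elapsed = max(time.perf_counter() - start, 1e-9)
        best = max(best, nbytes / elapsed / 1e9)
    return best


def urandom_fill(buf):
    """Stand-in filler backed by os.urandom (copies, so not zero-copy)."""
    buf[:] = os.urandom(len(buf))
    return len(buf)


buf = bytearray(16 * 1024 * 1024)
print(f"os.urandom stand-in: {measure_gbps(urandom_fill, buf):.2f} GB/s")
```

With dgen_py installed, passing `dgen_py.fill_buffer` as `fill` should work the same way, since the zero-copy example above shows it returning the number of bytes written.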
Algorithm Details
Based on s3dlio's data_gen_alt.rs:
- Block-level generation: 4 MiB blocks processed in parallel
- Xoshiro256++: 5-10x faster than ChaCha20; statistically strong but not cryptographically secure
- Integer error accumulation: Even compression distribution
- No cross-block compression: Realistic compressor behavior
- Per-call entropy: Unique data across distributed nodes
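Integer error accumulation is a general technique for spreading a total evenly over several parts without floating point (the same idea as Bresenham's line algorithm). The sketch below shows the technique in isolation. `spread_evenly` is a hypothetical helper, and how the library applies it to compression targets is not specified here.

```python
def spread_evenly(total, parts):
    """Split `total` into `parts` integers that differ by at most 1 and sum to `total`."""
    if parts <= 0:
        raise ValueError("parts must be positive")
    base, remainder = divmod(total, parts)
    shares = []
    error = 0
    for _ in range(parts):
        error += remainder  # accumulate the fractional part as an integer
        if error >= parts:  # a whole unit has built up: hand it out now
            error -= parts
            shares.append(base + 1)
        else:
            shares.append(base)
    return shares


print(spread_evenly(10, 4))  # [2, 3, 2, 3]
```

Because the leftover units are handed out as the error crosses the threshold, they end up interleaved across the parts rather than bunched at the start or end.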
Use Cases
- Storage benchmarking: Generate realistic test data
- Network testing: High-throughput data sources
- AI/ML profiling: Simulate data loading pipelines
- Compression testing: Validate compressor behavior
- Deduplication testing: Test dedup ratios
Building from Source
# Clone repository
git clone https://github.com/russfellows/dgen-rs.git
cd dgen-rs
# Build Rust library
cargo build --release
# Build Python wheel
maturin build --release
# Install locally
maturin develop --release
# Run tests
cargo test
python -m pytest python/tests/
Requirements
- Rust: 1.90+ (edition 2021)
- Python: 3.10+ (for Python bindings)
- Platform: Linux (NUMA detection required)
License
Dual-licensed under MIT OR Apache-2.0
Credits
See Also
- s3dlio: High-performance multi-protocol storage I/O
- sai3-bench: Multi-protocol I/O benchmarking suite
- kv-cache-bench: LLM KV cache storage benchmarking