
High-performance random data generation with NUMA optimization and zero-copy Python interface


dgen-rs / dgen-py

High-performance random data generation with controllable deduplication, compression, and NUMA optimization


Features

  • 🚀 Blazing Fast: 5-15 GB/s per core using Xoshiro256++ RNG
  • 🎯 Controllable Characteristics:
    • Deduplication ratios (1:1 to N:1)
    • Compression ratios (1:1 to N:1)
  • 🔬 NUMA-Aware: Automatic topology detection and optimization on multi-socket systems
  • 🐍 Zero-Copy Python API: Direct buffer writes, no unnecessary copies
  • 📦 Both Simple and Streaming: Single-call or incremental generation
  • 🛠️ Built with Rust: Memory-safe, production-quality code

Quick Start

Python Installation

# Install from PyPI
pip install dgen-py

# Or build from source
cd dgen-rs
maturin develop --release

Python Usage

Simple API (generate all at once):

import dgen_py

# Generate 100 MiB incompressible data
data = dgen_py.generate_data(100 * 1024 * 1024)
print(f"Generated {len(data)} bytes")

# Generate with 2:1 dedup and 3:1 compression
data = dgen_py.generate_data(
    size=100 * 1024 * 1024,
    dedup_ratio=2.0,
    compress_ratio=3.0
)

Zero-Copy API (write into existing buffer):

import dgen_py
import numpy as np

# Pre-allocate buffer
buf = bytearray(1024 * 1024)

# Generate directly into buffer (zero-copy!)
nbytes = dgen_py.fill_buffer(buf, compress_ratio=2.0)
print(f"Wrote {nbytes} bytes")

# Works with NumPy arrays
arr = np.zeros(100 * 1024 * 1024, dtype=np.uint8)
dgen_py.fill_buffer(arr, dedup_ratio=2.0, compress_ratio=3.0)

Streaming API (incremental generation):

import dgen_py

# Create generator for 1 GiB
gen = dgen_py.Generator(
    size=1024 * 1024 * 1024,
    dedup_ratio=2.0,
    compress_ratio=3.0,
    numa_aware=True  # Auto-detected by default
)

# Generate in chunks
chunk_size = 8192
buf = bytearray(chunk_size)
total = 0

while not gen.is_complete():
    nbytes = gen.fill_chunk(buf)
    if nbytes == 0:
        break
    
    total += nbytes
    # Process chunk (write to file, network, etc.)
    # ... 

print(f"Generated {total} bytes")

NUMA Information:

import dgen_py

info = dgen_py.get_system_info()
if info:
    print(f"NUMA nodes: {info['num_nodes']}")
    print(f"Physical cores: {info['physical_cores']}")
    print(f"Deployment: {info['deployment_type']}")

Rust Usage

use dgen_rs::{generate_data_simple, GeneratorConfig, DataGenerator};

// Simple API
let data = generate_data_simple(100 * 1024 * 1024, 1, 1);

// Full configuration
let config = GeneratorConfig {
    size: 100 * 1024 * 1024,
    dedup_factor: 2,
    compress_factor: 3,
    numa_aware: true,
};
// Clone so that `config` can be reused below (assumes GeneratorConfig implements Clone)
let data = dgen_rs::generate_data(config.clone());

// Streaming
let mut gen = DataGenerator::new(config);
let mut chunk = vec![0u8; 8192];
while !gen.is_complete() {
    let written = gen.fill_chunk(&mut chunk);
    if written == 0 {
        break;
    }
    // Process chunk...
}

How It Works

Deduplication

Deduplication ratio N means:

  • Generate total_blocks / N unique blocks
  • Reuse blocks in round-robin fashion
  • Example: 100 blocks, dedup=2 → 50 unique blocks, each appearing twice
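
The round-robin reuse described above can be sketched in a few lines. This is an illustration only, not the library's code: `dedup_layout` is a hypothetical helper, and rounding down when the block count is not evenly divisible is an assumption.

```python
def dedup_layout(total_blocks, dedup_ratio):
    """Map each output block index to the index of the unique block it reuses."""
    # Assumption: round down, but always keep at least one unique block.
    unique = max(1, total_blocks // dedup_ratio)
    # Round-robin: output block i reuses unique block (i mod unique).
    return [i % unique for i in range(total_blocks)]


layout = dedup_layout(100, 2)
print(len(set(layout)))   # number of distinct blocks
print(layout.count(0))    # occurrences of unique block 0
```

With 100 blocks and a ratio of 2, the layout contains 50 distinct indices and each one appears twice, matching the example in the list.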

Compression

Compression ratio N means:

  • Fill block with high-entropy Xoshiro256++ keystream
  • Add local back-references to achieve N:1 compressibility
  • Example: compress=3 → zstd will compress to ~33% of original size

compress=1: Truly incompressible (zstd ratio ~1.00-1.02)
compress>1: Target ratio via local back-refs, evenly distributed
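
To make the back-reference idea concrete, here is a minimal sketch under several assumptions: it uses Python's `random` module instead of Xoshiro256++, checks the result with the standard-library `zlib` instead of zstd, and picks an arbitrary run length (64 bytes) and maximum copy distance (4096 bytes). `make_block` is a hypothetical name and `compress_ratio` is assumed to be a positive integer; the actual generator may work differently.

```python
import random
import zlib


def make_block(size, compress_ratio, seed=0, run=64, max_dist=4096):
    """Return `size` bytes in which about 1/compress_ratio of the runs stay random."""
    rng = random.Random(seed)
    block = bytearray(rng.getrandbits(8 * size).to_bytes(size, "little"))
    err = compress_ratio - 1  # start value forces the first run to stay random
    for pos in range(0, size, run):
        err += 1
        if err >= compress_ratio:
            err -= compress_ratio  # keep this run as high-entropy data
            continue
        length = min(run, size - pos)
        # Back-reference: overwrite the run with a copy of earlier bytes in the block.
        dist = rng.randint(run, min(pos, max_dist))
        block[pos:pos + length] = block[pos - dist:pos - dist + length]
    return bytes(block)


for ratio in (1, 3):
    data = make_block(64 * 1024, ratio)
    packed = len(zlib.compress(data))
    print(f"compress={ratio}: zlib output is {packed / len(data):.2f}x the input size")
```

The integer accumulator `err` spreads the copied runs evenly through the block, so no region is much more compressible than another. Exact ratios depend on the compressor and its settings.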

NUMA Optimization

On multi-socket systems (NUMA nodes > 1):

  • Detects topology via /sys/devices/system/node (Linux)
  • Can pin rayon threads to specific NUMA nodes (optional)
  • Ensures memory locality for maximum bandwidth
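
For readers unfamiliar with the sysfs layout, the following sketch shows one way to read NUMA topology from `/sys/devices/system/node`. It is not the library's detection code. `read_numa_topology` and `parse_cpulist` are hypothetical helpers, and the sketch assumes the usual Linux layout in which each `nodeN` directory contains a `cpulist` file.

```python
import os
import re


def parse_cpulist(text):
    """Expand a Linux cpulist string such as '0-3,8-11' into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # memory-only nodes have an empty cpulist
        lo, _, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus


def read_numa_topology(base="/sys/devices/system/node"):
    """Return {node_id: [cpu ids]}; an empty dict if the sysfs tree is absent."""
    topology = {}
    if not os.path.isdir(base):
        return topology
    for name in os.listdir(base):
        match = re.fullmatch(r"node(\d+)", name)
        if not match:
            continue
        try:
            with open(os.path.join(base, name, "cpulist")) as f:
                topology[int(match.group(1))] = parse_cpulist(f.read())
        except OSError:
            continue  # skip nodes whose cpulist cannot be read
    return topology


print(parse_cpulist("0-3,8-11"))
print(read_numa_topology())
```

On a single-socket machine the result typically has one entry; on non-Linux systems it is empty.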

Performance

Typical throughput on modern CPUs:

  • Incompressible (compress=1): 5-15 GB/s per core
  • Compressible (compress=3): 1-4 GB/s per core
  • Multi-core: Near-linear scaling with rayon

Benchmark on AMD EPYC 7742 (64 cores):

Incompressible:  ~500 GB/s (all cores)
Compress 3:1:    ~150 GB/s (all cores)
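
Throughput figures like these vary with hardware, so it can be useful to measure on your own machine. The harness below is a single-threaded sketch. `measure_throughput` and `baseline_fill` are hypothetical names, and the baseline only zeroes a buffer, so it measures memory-write speed rather than data generation.

```python
import time


def measure_throughput(fill, buf_size=16 * 1024 * 1024, iterations=8):
    """Call fill(buf) repeatedly and return throughput in GB/s (10**9 bytes per second)."""
    buf = bytearray(buf_size)
    fill(buf)  # warm-up call, excluded from timing
    start = time.perf_counter()
    for _ in range(iterations):
        fill(buf)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return buf_size * iterations / elapsed / 1e9


def baseline_fill(buf):
    """Stand-in workload: zero the buffer (a memory-write baseline)."""
    buf[:] = bytes(len(buf))


print(f"baseline: {measure_throughput(baseline_fill):.2f} GB/s")
# With dgen_py installed, the same harness could time the documented call:
#   measure_throughput(lambda b: dgen_py.fill_buffer(b, compress_ratio=3.0))
```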

Algorithm Details

Based on s3dlio's data_gen_alt.rs:

  1. Block-level generation: 4 MiB blocks processed in parallel
  2. Xoshiro256++: 5-10x faster than ChaCha20; statistically strong, but not cryptographically secure
  3. Integer error accumulation: Even compression distribution
  4. No cross-block compression: Realistic compressor behavior
  5. Per-call entropy: Unique data across distributed nodes
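
Points 1 and 5 can be illustrated together: split the request into fixed-size blocks, derive an independent seed for each block from per-call entropy, and fill the blocks concurrently. Everything specific here is an assumption made for the sketch, including the BLAKE2b seed derivation, the use of Python's `random` module and a thread pool, and the names `block_seed`, `fill_block`, and `generate`. It does not reproduce s3dlio's implementation.

```python
import hashlib
import os
import random
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB, as stated in the list above


def block_seed(call_entropy, index):
    """Derive an independent seed for one block from the per-call entropy."""
    digest = hashlib.blake2b(call_entropy + index.to_bytes(8, "little"), digest_size=8)
    return int.from_bytes(digest.digest(), "little")


def fill_block(call_entropy, index, length):
    """Produce one block of pseudo-random bytes from its own seed."""
    rng = random.Random(block_seed(call_entropy, index))
    return rng.getrandbits(8 * length).to_bytes(length, "little")


def generate(size, call_entropy=None, block_size=BLOCK_SIZE):
    """Generate `size` bytes block by block; each block depends only on its own seed."""
    if call_entropy is None:
        call_entropy = os.urandom(16)  # fresh entropy per call
    lengths = [min(block_size, size - off) for off in range(0, size, block_size)]
    with ThreadPoolExecutor() as pool:
        blocks = pool.map(lambda args: fill_block(call_entropy, *args), enumerate(lengths))
        return b"".join(blocks)


a = generate(10_000, block_size=4096)
b = generate(10_000, block_size=4096)
print(len(a), a != b)
```

Because each block depends only on the call entropy and its index, blocks can be produced in any order. In CPython the thread pool mainly demonstrates the structure; real parallel speedup needs native threads, as in the Rust implementation with rayon.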

Use Cases

  • Storage benchmarking: Generate realistic test data
  • Network testing: High-throughput data sources
  • AI/ML profiling: Simulate data loading pipelines
  • Compression testing: Validate compressor behavior
  • Deduplication testing: Test dedup ratios

Building from Source

# Clone repository
git clone https://github.com/russfellows/dgen-rs.git
cd dgen-rs

# Build Rust library
cargo build --release

# Build Python wheel
maturin build --release

# Install locally
maturin develop --release

# Run tests
cargo test
python -m pytest python/tests/

Requirements

  • Rust: 1.90+ (edition 2021)
  • Python: 3.10+ (for Python bindings)
  • Platform: Linux (required for NUMA topology detection)

License

Dual-licensed under MIT OR Apache-2.0

Credits

See Also

  • s3dlio: High-performance multi-protocol storage I/O
  • sai3-bench: Multi-protocol I/O benchmarking suite
  • kv-cache-bench: LLM KV cache storage benchmarking


