High-performance random data generation with NUMA optimization and zero-copy Python interface

dgen-rs / dgen-py

High-performance random data generation with controllable deduplication, compression, and NUMA optimization

License: MIT OR Apache-2.0

Features

  • 🚀 Blazing Fast: 5-15 GB/s per core using Xoshiro256++ RNG
  • 🎯 Controllable Characteristics:
    • Deduplication ratios (1:1 to N:1)
    • Compression ratios (1:1 to N:1)
  • 🔬 NUMA-Aware: Automatic topology detection and optimization on multi-socket systems
  • 🐍 Zero-Copy Python API: Direct buffer writes, no unnecessary copies
  • 📦 Both Simple and Streaming: Single-call or incremental generation
  • 🛠️ Built with Rust: Memory-safe, production-quality code

Quick Start

Python Installation

# Install from PyPI
pip install dgen-py

# Or build from source
cd dgen-rs
maturin develop --release

Python Usage

Simple API (generate all at once):

import dgen_py

# Generate 100 MiB incompressible data
data = dgen_py.generate_data(100 * 1024 * 1024)
print(f"Generated {len(data)} bytes")

# Generate with 2:1 dedup and 3:1 compression
data = dgen_py.generate_data(
    size=100 * 1024 * 1024,
    dedup_ratio=2.0,
    compress_ratio=3.0
)

Zero-Copy API (write into existing buffer):

import dgen_py
import numpy as np

# Pre-allocate buffer
buf = bytearray(1024 * 1024)

# Generate directly into buffer (zero-copy!)
nbytes = dgen_py.fill_buffer(buf, compress_ratio=2.0)
print(f"Wrote {nbytes} bytes")

# Works with NumPy arrays
arr = np.zeros(100 * 1024 * 1024, dtype=np.uint8)
dgen_py.fill_buffer(arr, dedup_ratio=2.0, compress_ratio=3.0)

Streaming API (incremental generation):

import dgen_py

# Create generator for 1 GiB
gen = dgen_py.Generator(
    size=1024 * 1024 * 1024,
    dedup_ratio=2.0,
    compress_ratio=3.0,
    numa_aware=True  # Auto-detected by default
)

# Generate in chunks
chunk_size = 8192
buf = bytearray(chunk_size)
total = 0

while not gen.is_complete():
    nbytes = gen.fill_chunk(buf)
    if nbytes == 0:
        break
    
    total += nbytes
    # Process chunk (write to file, network, etc.)
    # ... 

print(f"Generated {total} bytes")
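The loop above leaves the "process chunk" step open. A minimal sketch of one way to fill it in, writing each chunk to a file-like sink, is shown below. It relies only on the `is_complete()` and `fill_chunk(buf)` methods shown above; `stream_to` and `FakeGenerator` are hypothetical names introduced here for illustration and are not part of dgen_py. The detail worth noting is that only the first `nbytes` bytes of the buffer are valid after each call, so the final (possibly short) chunk must be sliced before writing.

```python
import io
from typing import BinaryIO


def stream_to(gen, sink: BinaryIO, chunk_size: int = 4 * 1024 * 1024) -> int:
    """Drain any object exposing is_complete()/fill_chunk(buf) into a sink."""
    buf = bytearray(chunk_size)
    view = memoryview(buf)  # slicing a memoryview does not copy the data
    total = 0
    while not gen.is_complete():
        nbytes = gen.fill_chunk(buf)
        if nbytes == 0:
            break
        sink.write(view[:nbytes])  # write only the bytes that were filled
        total += nbytes
    return total


class FakeGenerator:
    """Hypothetical stand-in with the same two methods, used so this
    sketch runs without dgen_py installed. It emits zero bytes."""

    def __init__(self, size: int) -> None:
        self.remaining = size

    def is_complete(self) -> bool:
        return self.remaining == 0

    def fill_chunk(self, buf: bytearray) -> int:
        n = min(len(buf), self.remaining)
        buf[:n] = bytes(n)
        self.remaining -= n
        return n


sink = io.BytesIO()
written = stream_to(FakeGenerator(10_000), sink, chunk_size=4096)
print(f"Wrote {written} bytes")
```

With dgen_py installed, a `dgen_py.Generator` instance would presumably be passed in place of `FakeGenerator`, and an open binary file in place of `io.BytesIO()`.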

NUMA Information:

import dgen_py

info = dgen_py.get_system_info()
if info:
    print(f"NUMA nodes: {info['num_nodes']}")
    print(f"Physical cores: {info['physical_cores']}")
    print(f"Deployment: {info['deployment_type']}")

Rust Usage

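use dgen_rs::{generate_data_simple, DataGenerator, GeneratorConfig};

fn main() {
    // Simple API: 100 MiB with both factors set to 1
    let data = generate_data_simple(100 * 1024 * 1024, 1, 1);

    // Full configuration. A closure builds a fresh config for each consumer,
    // because passing `config` by value moves it and it cannot be reused.
    let make_config = || GeneratorConfig {
        size: 100 * 1024 * 1024,
        dedup_factor: 2,
        compress_factor: 3,
        numa_aware: true,
    };
    let data = dgen_rs::generate_data(make_config());

    // Streaming: fill a reusable chunk until the generator is exhausted
    let mut gen = DataGenerator::new(make_config());
    let mut chunk = vec![0u8; 8192];
    while !gen.is_complete() {
        let written = gen.fill_chunk(&mut chunk);
        if written == 0 {
            break;
        }
        // Process &chunk[..written]...
    }
}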

How It Works

Deduplication

Deduplication ratio N means:

  • Generate total_blocks / N unique blocks
  • Reuse blocks in round-robin fashion
  • Example: 100 blocks, dedup=2 → 50 unique blocks, repeated 2x each
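The round-robin reuse described above can be sketched in a few lines of pure Python. This is an illustration of the idea only, not the library's implementation: `build_blocks` and `measured_dedup_ratio` are hypothetical helper names, the 4 KiB block size is chosen to keep the example small, and Python's `random` module stands in for the real generator.

```python
import hashlib
import random


def build_blocks(total_blocks: int, dedup: int,
                 block_size: int = 4096, seed: int = 0) -> list[bytes]:
    """Create total_blocks blocks drawn round-robin from a pool of
    total_blocks // dedup unique random blocks."""
    rng = random.Random(seed)
    unique = max(1, total_blocks // dedup)
    pool = [rng.randbytes(block_size) for _ in range(unique)]
    # Block i reuses pool entry i % unique (round-robin).
    return [pool[i % unique] for i in range(total_blocks)]


def measured_dedup_ratio(blocks: list[bytes]) -> float:
    """Ratio of total blocks to distinct blocks, judged by SHA-256 digest."""
    distinct = {hashlib.sha256(b).digest() for b in blocks}
    return len(blocks) / len(distinct)


blocks = build_blocks(total_blocks=100, dedup=2)
print(f"{len(blocks)} blocks, dedup ratio {measured_dedup_ratio(blocks):.1f}")
```

Hashing whole blocks is roughly how a fixed-block deduplicating store would see the data, which makes the same helper usable as a quick check on generated output.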

Compression

Compression ratio N means:

  • Fill block with high-entropy Xoshiro256++ keystream
  • Add local back-references to achieve N:1 compressibility
  • Example: compress=3 → zstd will compress to ~33% of original size

compress=1: Truly incompressible (zstd ratio ~1.00-1.02)
compress>1: Target ratio via local back-refs, evenly distributed
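To make the back-reference idea concrete, here is a small sketch under stated assumptions: Python's `random` module stands in for Xoshiro256++, `zlib` stands in for zstd, and the 512-byte unit size is arbitrary. `make_block` and `measured_ratio` are hypothetical names, and the real generator may place its back-references quite differently. In each unit, the first `unit // compress` bytes stay random and the rest are copied from a short distance back, so a compressor can encode them as matches.

```python
import random
import zlib


def make_block(size: int, compress: int, seed: int = 0, unit: int = 512) -> bytes:
    """Random block in which roughly (compress - 1) / compress of the
    bytes are short-distance copies of earlier bytes."""
    rng = random.Random(seed)
    out = bytearray(rng.randbytes(size))
    if compress <= 1:
        return bytes(out)  # leave the block fully high-entropy
    keep = unit // compress  # random bytes retained at the start of each unit
    for start in range(0, size, unit):
        end = min(start + unit, size)
        for pos in range(start + keep, end):
            out[pos] = out[pos - keep]  # local back-reference at distance `keep`
    return bytes(out)


def measured_ratio(block: bytes) -> float:
    """Original size divided by zlib-compressed size."""
    return len(block) / len(zlib.compress(block, 6))


for n in (1, 2, 3):
    block = make_block(256 * 1024, n)
    print(f"compress={n}: measured ratio {measured_ratio(block):.2f}")
```

The measured ratio lands near the target rather than exactly on it, because the compressor spends a few bytes encoding each match and the random literals do not shrink at all.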

NUMA Optimization

On multi-socket systems (NUMA nodes > 1):

  • Detects topology via /sys/devices/system/node (Linux)
  • Can pin rayon threads to specific NUMA nodes (optional)
  • Ensures memory locality for maximum bandwidth
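For readers who want to inspect the topology themselves, the sysfs layout mentioned above can be read directly. The sketch below lists NUMA node IDs by scanning for `nodeN` directories; `numa_node_ids` is a hypothetical helper written for this example, not a dgen_py function, and it returns an empty list on platforms where the directory does not exist.

```python
import re
from pathlib import Path

_NODE_NAME = re.compile(r"node(\d+)")


def numa_node_ids(root: str = "/sys/devices/system/node") -> list[int]:
    """Return sorted NUMA node IDs found under the sysfs node directory."""
    base = Path(root)
    if not base.is_dir():
        return []  # non-Linux platform, or sysfs not mounted
    ids = []
    for entry in base.iterdir():
        match = _NODE_NAME.fullmatch(entry.name)
        if match and entry.is_dir():
            ids.append(int(match.group(1)))
    return sorted(ids)


nodes = numa_node_ids()
print(f"NUMA nodes found: {nodes if nodes else 'none detected'}")
```

The `root` parameter exists mainly so the function can be pointed at a fake directory tree in tests.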

Performance

Typical throughput on modern CPUs:

  • Incompressible (compress=1): 5-15 GB/s per core
  • Compressible (compress=3): 1-4 GB/s per core
  • Multi-core: Near-linear scaling with rayon

Benchmark on AMD EPYC 7742 (64 cores):

Incompressible:  ~500 GB/s (all cores)
Compress 3:1:    ~150 GB/s (all cores)
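Throughput figures like these depend heavily on hardware and buffer size, so it is worth measuring on the target machine. Below is a minimal timing harness, assuming only a callable that fills a buffer and returns the number of bytes written, which is the shape of `fill_buffer` shown in the usage section. `measure_gbps` and `urandom_fill` are hypothetical names; `urandom_fill` is a slow stand-in so the sketch runs without dgen_py.

```python
import os
import time
from typing import Callable


def measure_gbps(fill: Callable[[bytearray], int],
                 buf_size: int = 1 << 20, rounds: int = 32) -> float:
    """Call fill(buf) repeatedly and return throughput in GB/s (10^9 bytes)."""
    buf = bytearray(buf_size)
    total = 0
    start = time.perf_counter()
    for _ in range(rounds):
        total += fill(buf)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
    return total / elapsed / 1e9


def urandom_fill(buf: bytearray) -> int:
    """Stand-in filler based on os.urandom, used only for demonstration."""
    buf[:] = os.urandom(len(buf))
    return len(buf)


print(f"os.urandom baseline: {measure_gbps(urandom_fill):.2f} GB/s")
# With dgen_py installed, one would presumably pass something like:
#   measure_gbps(lambda b: dgen_py.fill_buffer(b, compress_ratio=3.0))
```

This measures a single thread from Python, so it reflects per-call overhead as well as raw generation speed.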

Algorithm Details

Based on s3dlio's data_gen_alt.rs:

  1. Block-level generation: 4 MiB blocks processed in parallel
  2. Xoshiro256++: 5-10x faster than ChaCha20; statistically strong, but not cryptographically secure
  3. Integer error accumulation: Even compression distribution
  4. No cross-block compression: Realistic compressor behavior
  5. Per-call entropy: Unique data across distributed nodes
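Item 3 is terse, so here is one plausible reading of it as a sketch: when a byte budget (for example, the total number of compressible bytes) does not divide evenly across blocks, an integer accumulator carries the remainder forward so that the extra bytes are spread evenly instead of piling up at one end. `distribute` is a hypothetical function written for this illustration; the source's actual bookkeeping is not shown in this README and may differ.

```python
def distribute(total: int, parts: int) -> list[int]:
    """Split total into `parts` integer shares that differ by at most 1,
    using an integer error accumulator (no floating point)."""
    base, rem = divmod(total, parts)
    acc = 0
    shares = []
    for _ in range(parts):
        acc += rem              # accumulate the per-part fractional error
        if acc >= parts:        # error reached one whole unit: pay it out
            acc -= parts
            shares.append(base + 1)
        else:
            shares.append(base)
    return shares


shares = distribute(total=10, parts=4)
print(shares, "sum =", sum(shares))
```

Because the accumulator gains exactly `rem * parts` over the loop, it pays out exactly `rem` extra units, so the shares always sum to `total`.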

Use Cases

  • Storage benchmarking: Generate realistic test data
  • Network testing: High-throughput data sources
  • AI/ML profiling: Simulate data loading pipelines
  • Compression testing: Validate compressor behavior
  • Deduplication testing: Test dedup ratios

Building from Source

# Clone repository
git clone https://github.com/russfellows/dgen-rs.git
cd dgen-rs

# Build Rust library
cargo build --release

# Build Python wheel
maturin build --release

# Install locally
maturin develop --release

# Run tests
cargo test
python -m pytest python/tests/

Requirements

  • Rust: 1.70+ (edition 2021)
  • Python: 3.12+ (for Python bindings)
  • Platform: Linux (NUMA detection), macOS, Windows

License

Dual-licensed under MIT OR Apache-2.0

See Also

  • s3dlio: High-performance multi-protocol storage I/O
  • sai3-bench: Multi-protocol I/O benchmarking suite
  • kv-cache-bench: LLM KV cache storage benchmarking
