High-performance snapshot storage library with compression and encryption

hexz-loader

Python bindings for Hexz - high-performance ML data loading with zero-copy reads and background prefetching.

Overview

hexz-loader provides Python bindings to the Hexz engine via PyO3. It's designed for AI/ML training workflows where you need to stream massive datasets directly from compressed storage (local files, S3, HTTP) into GPU memory without Python GIL overhead.

The loader handles prefetching in lightweight Rust threads instead of relying on Python's multiprocessing. The goal is to avoid "GPU starvation" during training, where the GPU sits idle while it waits for the next batch.
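
To make the prefetching idea concrete, here is a minimal pure-Python sketch of a bounded read-ahead buffer fed by a background thread. It is only an analogue: Hexz does this work in Rust threads, and the names prefetch and depth are hypothetical, not part of the hexz API. A Python thread like this one still contends for the GIL, which is the overhead the Rust implementation is meant to sidestep.

```python
import queue
import threading
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking the end of the stream


def prefetch(source: Iterable[T], depth: int = 4) -> Iterator[T]:
    """Yield items from `source`, loading up to `depth` items ahead in a thread.

    Hypothetical helper for illustration; not part of the hexz API.
    """
    buffer: "queue.Queue[object]" = queue.Queue(maxsize=depth)

    def worker() -> None:
        try:
            for item in source:
                buffer.put(item)  # blocks while the buffer is full
        finally:
            buffer.put(_DONE)  # always signal completion to the consumer

    # Daemon thread: it will not keep the process alive if the consumer stops early.
    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = buffer.get()  # blocks until the worker has produced something
        if item is _DONE:
            return
        yield item


batches = list(prefetch(range(5), depth=2))
```

The bounded queue is the important design choice: it caps memory use while letting I/O for the next few items overlap with work on the current one.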

Installation

From PyPI

# Minimal installation (core features only, ~5MB)
pip install hexz

# With PyTorch support
pip install hexz[torch]

# With TensorFlow support
pip install hexz[tensorflow]

# With NumPy arrays
pip install hexz[numpy]

# ML bundle (PyTorch + NumPy)
pip install hexz[ml]

# Everything
pip install hexz[full]

# Development
pip install hexz[dev]

From Source (Development)

Build and install from the repository root using the Makefile:

# One-time setup (creates venv, installs tools)
make setup

# Install in editable mode (recommended for development)
make develop

# Or build a wheel for distribution
make python
pip install target/wheels/*.whl

Note: Requires Rust toolchain and Python 3.8+. Run make setup-check to verify dependencies.

Custom Feature Selection

For advanced users who want to control binary size and compile-time features:

# Build without S3; adds zstd on top of the always-included LZ4
maturin build --release --no-default-features --features compression-zstd

# Build with S3 but without zstd (LZ4 compression only)
maturin build --release --no-default-features --features s3

# Build with all features
maturin build --release --features full

# Install custom build
pip install target/wheels/*.whl

Binary Size Comparison (release, stripped):

  • Minimal build (no default features): 12MB
  • Default (S3 + zstd + signing): 12MB
  • Full features: 12MB

Note: Binary size is dominated by the PyO3 runtime and the Tokio async runtime, which is why all three builds above come out at the same size. The main benefits of feature gates are:

  • Reduced dependency complexity and faster compilation
  • Smaller dependency tree (fewer crates to audit/update)
  • Cleaner runtime without unused functionality

Quick Start

PyTorch Integration

Drop-in replacement for standard PyTorch datasets:

import torch
from hexz import Loader

# Open a compressed dataset (local or remote)
dataset = Loader("s3://my-bucket/imagenet.hxz")

# Standard PyTorch DataLoader
# Hexz handles prefetching in Rust background threads
loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=64,
    num_workers=4
)

for batch in loader:
    # Batches are prefetched by Rust background threads while the previous step runs
    train_step(batch)  # train_step is your own training function (not defined here)
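
For readers unfamiliar with what a "drop-in" dataset has to provide, the sketch below shows the map-style protocol that PyTorch's DataLoader relies on: __len__ and __getitem__. The class FixedRecordDataset and its fixed-size record layout are assumptions made for illustration; they say nothing about how Hexz lays out or indexes samples internally.

```python
class FixedRecordDataset:
    """Hypothetical map-style dataset over fixed-size records in a byte buffer."""

    def __init__(self, payload: bytes, record_size: int) -> None:
        if record_size <= 0:
            raise ValueError("record_size must be positive")
        self._payload = memoryview(payload)  # view over the data, no copy yet
        self._record_size = record_size

    def __len__(self) -> int:
        # Number of complete records; trailing partial bytes are ignored.
        return len(self._payload) // self._record_size

    def __getitem__(self, index: int) -> bytes:
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self._record_size
        # Copy just this record out of the shared buffer.
        return bytes(self._payload[start:start + self._record_size])


dataset = FixedRecordDataset(b"aaaabbbbcccc", record_size=4)
samples = [dataset[i] for i in range(len(dataset))]
```

Any object with these two methods can be passed to torch.utils.data.DataLoader, which is what makes swapping the storage backend possible without touching the training loop.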

Reading Snapshots

Simple file-like interface for reading Hexz files:

import hexz

# Open a snapshot
reader = hexz.open("path/to/snapshot.hxz")

# Read entire file
data = reader.read()

# Read specific range
chunk = reader.read_at(offset=1024, length=512)

# File-like seek/read
reader.seek(0)
header = reader.read(100)
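
The difference between cursor-based reads (seek/read) and positioned reads (read_at) can be shown with an in-memory stand-in. RangeReader below is hypothetical and is not the hexz reader; in particular, whether the real read_at leaves the cursor untouched is an assumption here, modelled on how positioned reads usually behave.

```python
import io


class RangeReader:
    """Hypothetical stand-in for a snapshot reader, backed by an in-memory buffer."""

    def __init__(self, payload: bytes) -> None:
        self._buf = io.BytesIO(payload)

    def read(self, size: int = -1) -> bytes:
        """Read from the current position and advance the cursor."""
        return self._buf.read(size)

    def seek(self, offset: int) -> int:
        """Move the cursor to an absolute offset."""
        return self._buf.seek(offset)

    def read_at(self, offset: int, length: int) -> bytes:
        """Positioned read that restores the cursor afterwards (an assumption)."""
        saved = self._buf.tell()
        try:
            self._buf.seek(offset)
            return self._buf.read(length)
        finally:
            self._buf.seek(saved)


reader = RangeReader(bytes(range(256)) * 8)  # 2048 bytes of sample data
reader.seek(0)
header = reader.read(100)                        # cursor is now at 100
chunk = reader.read_at(offset=1024, length=512)  # cursor stays at 100
next_byte = reader.read(1)                       # continues from offset 100
```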

Async I/O

Async context manager for asyncio integration:

import asyncio
import hexz

async def main():
    async with hexz.AsyncReader("path/to/snapshot.hxz") as reader:
        data = await reader.read_at(0, 1024)

asyncio.run(main())
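
One reason to use an async reader is to issue several range reads at once. The sketch below does that with asyncio.gather against a hypothetical in-memory stand-in, InMemoryAsyncReader. Whether hexz.AsyncReader supports concurrent read_at calls on a single instance is not stated above, so treat that as an assumption to verify.

```python
import asyncio
from typing import Iterable, List, Tuple


class InMemoryAsyncReader:
    """Hypothetical stand-in; not the hexz.AsyncReader implementation."""

    def __init__(self, payload: bytes) -> None:
        self._payload = payload

    async def __aenter__(self) -> "InMemoryAsyncReader":
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        return None  # nothing to release for an in-memory buffer

    async def read_at(self, offset: int, length: int) -> bytes:
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        return self._payload[offset:offset + length]


async def read_ranges(payload: bytes, ranges: Iterable[Tuple[int, int]]) -> List[bytes]:
    async with InMemoryAsyncReader(payload) as reader:
        # gather() schedules the reads concurrently and keeps results in input order.
        return await asyncio.gather(*(reader.read_at(o, n) for o, n in ranges))


chunks = asyncio.run(read_ranges(b"abcdefghij", [(0, 3), (5, 2), (8, 10)]))
```

The last range deliberately runs past the end of the buffer; the stand-in returns a short read in that case, which is another behaviour worth checking against the real reader.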

Key Features

  • Zero-Copy Reads: Direct memory access without Python overhead
  • Background Prefetching: Rust threads handle I/O while Python/GPU computes
  • PyTorch Integration: Dataset implements PyTorch's Dataset interface
  • Remote Streaming: Stream from S3/HTTP without downloading entire files
  • NumPy Integration: Read directly into NumPy arrays
  • Encryption Support: Transparent decryption of encrypted snapshots
  • GIL-Free: Critical paths run in Rust without Python GIL contention

Feature Matrix

Hexz is designed with modularity in mind. Install only what you need:

Feature            Default   Description                                Size Impact
LZ4 Compression    Always    Fast compression (always included)         ~1MB
S3 Storage         Yes       Stream from AWS S3, MinIO, Cloudflare R2   ~3MB
Zstd Compression   Yes       High-ratio compression                     ~2MB
Encryption         No        AES-GCM encryption for snapshots           ~1MB
Signing            Yes       Ed25519 cryptographic signatures           ~500KB

Python Extras

Extra          Includes                  Use Case
[torch]        PyTorch ≥2.0              ML training with PyTorch DataLoader
[tensorflow]   TensorFlow ≥2.13          ML training with TensorFlow Dataset
[numpy]        NumPy ≥1.20               Scientific computing, array operations
[ml]           NumPy + PyTorch           Common ML stack
[full]         All ML frameworks         Everything for ML workflows
[dev]          Testing + linting tools   Development and contribution

Compile-Time Features

Control Rust features at build time for minimal deployments:

# Minimal: local files only, LZ4 compression
maturin build --no-default-features

# Add S3 support
maturin build --no-default-features --features s3

# Add encryption
maturin build --no-default-features --features encryption,s3

# Everything
maturin build --features full

Use Cases:

  • Edge Deployments: Disable S3 to reduce binary size for IoT/embedded
  • Air-Gapped Systems: Build without network features for secure environments
  • Size-Constrained Containers: Minimal builds for Lambda/Cloud Run

Architecture

hexz-loader/
├── src/                    # Rust source (PyO3 bindings)
│   ├── lib.rs             # Main Python module
│   ├── reader.rs          # Reader bindings
│   ├── writer.rs          # Writer bindings
│   └── utils.rs           # Helper functions
├── python/hexz/         # Python wrapper code
│   ├── __init__.py        # Public API
│   ├── dataset.py         # PyTorch Dataset integration
│   ├── reader.py          # High-level reader interface
│   ├── writer.py          # High-level writer interface
│   ├── array.py           # NumPy integration
│   ├── torch/             # PyTorch utilities
│   └── ml/                # ML-specific helpers
├── tests/                 # Python tests (pytest)
└── examples/              # Usage examples

Usage Examples

Creating Snapshots

Create snapshots from Python:

import hexz

# From a file
with hexz.open("output.hxz", mode="w", compression="lz4") as w:
    w.add("source_disk.raw")

# Or use Writer directly
with hexz.Writer("output.hxz", compression="lz4") as w:
    w.add_file("source_disk.raw")
    w.add_bytes(b"additional data")

NumPy Integration

Read data directly into NumPy arrays without extra copies:

import hexz
import numpy as np

reader = hexz.open("data.hxz")

# Zero-copy read into NumPy array
array = hexz.read_array(
    reader,
    offset=0,
    shape=(100, 100),
    dtype=np.float32
)
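
What "zero-copy" means for NumPy can be demonstrated without Hexz at all: np.frombuffer wraps an existing buffer instead of copying it. The sketch below uses a plain bytes object as a stand-in for data read from a snapshot; the 16-byte header and the 2x3 shape are made up for the example, and this is not a description of how hexz.read_array is implemented.

```python
import numpy as np

# Stand-in for bytes read from a snapshot: a 16-byte header, then 6 float32 values.
payload = bytes(16) + np.arange(6, dtype=np.float32).tobytes()

# frombuffer() wraps the existing memory; no element data is copied.
# offset is in bytes, count is in elements.
array = np.frombuffer(payload, dtype=np.float32, count=6, offset=16).reshape(2, 3)

owns_data = bool(array.flags.owndata)    # False: the array is a view
writeable = bool(array.flags.writeable)  # False: bytes objects are immutable
```

Because the array is only a view, it is read-only here and stays valid only as long as the underlying buffer does. Call array.copy() if you need to modify the data or outlive the source.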

Mounting Snapshots

Mount as a read-only filesystem (requires FUSE):

import hexz

with hexz.mount("snapshot.hxz") as mp:
    print(f"Mounted at {mp.path}")
    # Access files in mp.path/disk

Remote Streaming

Stream from S3 or HTTP:

import hexz

# S3 streaming
dataset = hexz.open("s3://bucket/dataset.hxz")

# HTTP streaming
dataset = hexz.open("https://example.com/data.hxz")

# Read on-demand (only fetches needed blocks)
chunk = dataset.read_at(1024 * 1024, 4096)
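
The "only fetches needed blocks" behaviour comes down to mapping a byte range onto block indices. A minimal sketch of that arithmetic follows, assuming fixed-size blocks; the 64 KiB block size and the helper name blocks_for_range are hypothetical, since the actual Hexz block size is not stated here.

```python
BLOCK_SIZE = 64 * 1024  # hypothetical block size; the real value may differ


def blocks_for_range(offset: int, length: int, block_size: int = BLOCK_SIZE) -> range:
    """Return the indices of the blocks that overlap [offset, offset + length)."""
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    if length == 0:
        return range(0)  # an empty read touches no blocks
    first = offset // block_size
    last = (offset + length - 1) // block_size  # index of the last byte's block
    return range(first, last + 1)


# The 4 KiB read at the 1 MiB mark from the example above touches one block.
needed = list(blocks_for_range(1024 * 1024, 4096))

# A read that straddles a block boundary touches two.
straddling = list(blocks_for_range(BLOCK_SIZE - 10, 20))
```

Under these assumptions a remote reader would issue one ranged request per missing block, so small random reads cost at most a block or two of transfer rather than the whole file.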

Development

All development commands use the project Makefile from the repository root.

Building

# Install in editable mode (development)
make develop

# Build wheel for distribution
make python

# Build with specific Python version
PYTHON=python3.11 make develop

Testing

# Run all tests (Rust + Python)
make test

# Run only Python tests
make test-python

# Run with filter
make test-python test_reader

# Or use pytest directly
pytest crates/loader/tests/ -v

Linting & Formatting

# Format all code (Rust + Python)
make fmt

# Lint (includes ruff for Python)
make lint

# Python-specific linting
ruff check crates/loader/python/

See make help for all available commands.

API Reference

Core Types

  • Reader: Read snapshots with file-like interface
  • AsyncReader: Async I/O reader
  • Writer: Create new snapshots
  • Dataset: PyTorch Dataset implementation
  • Loader: High-level loader (alias for Dataset)

Functions

  • open(path, mode='r', **kwargs): Open a snapshot (reader or writer)
  • read_array(reader, offset, shape, dtype): Zero-copy read into NumPy
  • mount(path): Mount snapshot as FUSE filesystem

See the Python API documentation for complete reference.

Performance

Optimized for ML training workloads:

Metric             Value
Sequential Read    ~2-3 GB/s
Random Access      ~1 ms (cold), ~0.08 ms (warm)
Prefetch Threads   Configurable (default: 4)
Memory Overhead    <150 MB per reader
Zero-Copy          Yes (via PyO3 buffer protocol)
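
To put these figures in context, the short calculation below turns them into rough time and rate estimates. The 150 GB dataset size is a made-up example, GB is taken to mean 10^9 bytes, and the results are back-of-the-envelope bounds for a single reader rather than measurements.

```python
GB = 10**9  # assuming decimal gigabytes


def stream_seconds(dataset_bytes: int, bytes_per_second: float) -> float:
    """Time to read a dataset once, front to back, at a given throughput."""
    if bytes_per_second <= 0:
        raise ValueError("bytes_per_second must be positive")
    return dataset_bytes / bytes_per_second


def reads_per_second(latency_ms: float) -> float:
    """Upper bound on serial random reads per second at a given latency."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return 1000.0 / latency_ms


# Hypothetical 150 GB dataset at the quoted sequential range of ~2-3 GB/s.
slowest = stream_seconds(150 * GB, 2 * GB)  # 75.0 seconds
fastest = stream_seconds(150 * GB, 3 * GB)  # 50.0 seconds

# Serial random reads: ~1 ms cold versus ~0.08 ms warm.
cold_rate = reads_per_second(1.0)   # 1000.0 reads per second
warm_rate = reads_per_second(0.08)  # roughly 12,500 reads per second
```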

PyTorch Integration

The Dataset class implements PyTorch's Dataset interface:

from hexz import Dataset
from torch.utils.data import DataLoader

# Create dataset
dataset = Dataset(
    "s3://bucket/train.hxz",
    transform=None,  # Optional transform function
    cache_size=1024  # Cache 1024 blocks in memory
)

# Use with DataLoader
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    shuffle=True
)
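
The cache_size parameter suggests a bounded in-memory block cache. One common way to build such a cache is least-recently-used (LRU) eviction, sketched below. The eviction policy Hexz actually uses is not stated here, so LRU is an assumption, and BlockCache and its fetch callback are hypothetical names.

```python
from collections import OrderedDict
from typing import Callable


class BlockCache:
    """Hypothetical LRU cache of decoded blocks, keyed by block index."""

    def __init__(self, fetch: Callable[[int], bytes], capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._fetch = fetch          # called on a miss to load a block
        self._capacity = capacity    # maximum number of blocks held
        self._blocks: "OrderedDict[int, bytes]" = OrderedDict()
        self.misses = 0

    def get(self, index: int) -> bytes:
        if index in self._blocks:
            self._blocks.move_to_end(index)  # mark as most recently used
            return self._blocks[index]
        self.misses += 1
        block = self._fetch(index)
        self._blocks[index] = block
        if len(self._blocks) > self._capacity:
            self._blocks.popitem(last=False)  # evict the least recently used
        return block


cache = BlockCache(fetch=lambda i: bytes([i]), capacity=2)
for block_index in (0, 1, 0, 2, 0, 1):
    cache.get(block_index)
miss_count = cache.misses
```

With shuffle=True the access pattern is close to random, so the hit rate of any such cache depends mostly on how cache_size compares to the number of blocks in the dataset.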

Requirements

  • Python: 3.8+ (ABI3 compatible)
  • Rust: Latest stable (for building from source)
  • System: Linux, macOS, or Windows
  • Optional: FUSE (for mounting)
