High-performance snapshot storage library with compression and encryption
hexz-loader
Python bindings for Hexz - high-performance ML data loading with zero-copy reads and background prefetching.
Overview
hexz-loader provides Python bindings to the Hexz engine via PyO3. It's designed for AI/ML training workflows where you need to stream massive datasets directly from compressed storage (local files, S3, HTTP) into GPU memory without Python GIL overhead.
The loader bypasses Python's multiprocessing by handling prefetching in lightweight Rust threads, eliminating "GPU starvation" during training.
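The prefetching idea can be illustrated in plain Python. The sketch below is not Hexz's implementation (which, per the text above, runs in Rust threads); it assumes a hypothetical `load_block` function and uses a bounded queue so that a background thread stays a few blocks ahead of the consumer.

```python
import queue
import threading


def load_block(index):
    # Hypothetical stand-in for reading and decompressing one block.
    return bytes([index % 256]) * 4


def prefetch(indices, depth=2):
    """Yield blocks in order while a background thread loads ahead."""
    q = queue.Queue(maxsize=depth)  # bounded: producer stays at most `depth` ahead
    done = object()                 # sentinel marking the end of the stream

    def producer():
        for i in indices:
            q.put(load_block(i))    # blocks when the queue is full
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is done:
            return
        yield item


blocks = list(prefetch(range(5)))
```

In pure Python the producer thread still competes for the GIL whenever it runs Python bytecode, which is the overhead a native-thread design is meant to avoid.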
Installation
From PyPI
# Minimal installation (core features only, ~5MB)
pip install hexz
# With PyTorch support
pip install hexz[torch]
# With TensorFlow support
pip install hexz[tensorflow]
# With NumPy arrays
pip install hexz[numpy]
# ML bundle (PyTorch + NumPy)
pip install hexz[ml]
# Everything
pip install hexz[full]
# Development
pip install hexz[dev]
From Source (Development)
Build and install from the repository root using the Makefile:
# One-time setup (creates venv, installs tools)
make setup
# Install in editable mode (recommended for development)
make develop
# Or build a wheel for distribution
make python
pip install target/wheels/*.whl
Note: Requires Rust toolchain and Python 3.8+. Run make setup-check to verify dependencies.
Custom Feature Selection
For advanced users who want to control binary size and compile-time features:
# Build with minimal features (no S3; zstd added on top of the always-included LZ4)
maturin build --release --no-default-features --features compression-zstd
# Build with S3 but no compression-zstd (LZ4 only)
maturin build --release --no-default-features --features s3
# Build with all features
maturin build --release --features full
# Install custom build
pip install target/wheels/*.whl
Binary Size Comparison (release, stripped):
- Minimal build (no default features): 12MB
- Default (S3 + zstd + signing): 12MB
- Full features: 12MB
Note: Binary size is dominated by PyO3 runtime and Tokio async runtime. The main benefits of feature gates are:
- Reduced dependency complexity and faster compilation
- Smaller dependency tree (fewer crates to audit/update)
- Cleaner runtime without unused functionality
Quick Start
PyTorch Integration
Drop-in replacement for standard PyTorch datasets:
import torch
from hexz import Loader
# Open a compressed dataset (local or remote)
dataset = Loader("s3://my-bucket/imagenet.hxz")
# Standard PyTorch DataLoader
# Hexz handles prefetching in Rust background threads
loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,
)
for batch in loader:
    # Batches come from zero-copy reads; train_step is your own training function
    train_step(batch)
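A DataLoader accepts any map-style dataset, meaning an object that exposes `__len__` and `__getitem__`. The sketch below shows that protocol on a hypothetical `RecordDataset` over fixed-size records in memory; the class name and record layout are assumptions for illustration and say nothing about how Loader is implemented.

```python
class RecordDataset:
    """Hypothetical map-style dataset over fixed-size records in a byte buffer."""

    def __init__(self, data, record_size):
        if record_size <= 0 or len(data) % record_size:
            raise ValueError("data length must be a multiple of record_size")
        self._view = memoryview(data)  # slicing a memoryview does not copy
        self._record_size = record_size

    def __len__(self):
        return len(self._view) // self._record_size

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self._record_size
        return self._view[start:start + self._record_size]


ds = RecordDataset(bytes(range(12)), record_size=4)
first = bytes(ds[0])
last = bytes(ds[2])
```

Raising IndexError for out-of-range indices lets the object also work with plain `for` loops, which stop iterating when that exception is raised.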
Reading Snapshots
Simple file-like interface for reading Hexz files:
import hexz
# Open a snapshot
reader = hexz.open("path/to/snapshot.hxz")
# Read entire file
data = reader.read()
# Read specific range
chunk = reader.read_at(offset=1024, length=512)
# File-like seek/read
reader.seek(0)
header = reader.read(100)
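The difference between positional and cursor-based reads can be shown on an ordinary in-memory file. This sketch uses `io.BytesIO` as a stand-in and a hypothetical `read_at` helper; it assumes a positional read leaves the cursor where it was, which this README does not state for hexz itself.

```python
import io


def read_at(f, offset, length):
    """Positional read that restores the stream position afterwards."""
    saved = f.tell()
    try:
        f.seek(offset)
        return f.read(length)
    finally:
        f.seek(saved)


f = io.BytesIO(b"HEADERpayload-bytes")
chunk = read_at(f, offset=6, length=7)  # positional read
f.seek(0)
header = f.read(6)                      # sequential read from the start
```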
Async I/O
Async context manager for asyncio integration:
import asyncio
import hexz
async def main():
    async with hexz.AsyncReader("path/to/snapshot.hxz") as reader:
        data = await reader.read_at(0, 1024)
asyncio.run(main())
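An async reader is most useful when several ranges are requested at once. The sketch below issues concurrent reads with `asyncio.gather`; `FakeAsyncReader` is a hypothetical in-memory stand-in that only mirrors the shape of the example above (async context manager plus `read_at`), so it runs without hexz installed.

```python
import asyncio


class FakeAsyncReader:
    """Hypothetical stand-in: async context manager with read_at()."""

    def __init__(self, data):
        self._data = data

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        return False

    async def read_at(self, offset, length):
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        return self._data[offset:offset + length]


async def read_ranges(reader, ranges):
    # gather() runs the reads concurrently and preserves argument order.
    return await asyncio.gather(*(reader.read_at(o, n) for o, n in ranges))


async def main():
    async with FakeAsyncReader(bytes(range(16))) as reader:
        return await read_ranges(reader, [(0, 4), (8, 2), (12, 4)])


chunks = asyncio.run(main())
```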
Key Features
- Zero-Copy Reads: Direct memory access without Python overhead
- Background Prefetching: Rust threads handle I/O while Python/GPU computes
- PyTorch Integration: Dataset implements PyTorch's Dataset interface
- Remote Streaming: Stream from S3/HTTP without downloading entire files
- NumPy Integration: Read directly into NumPy arrays
- Encryption Support: Transparent decryption of encrypted snapshots
- GIL-Free: Critical paths run in Rust without Python GIL contention
Feature Matrix
Hexz is designed with modularity in mind. Install only what you need:
| Feature | Default | Description | Size Impact |
|---|---|---|---|
| LZ4 Compression | ✅ | Fast compression (always included) | ~1MB |
| S3 Storage | ✅ | Stream from AWS S3, MinIO, Cloudflare R2 | ~3MB |
| Zstd Compression | ✅ | High-ratio compression | ~2MB |
| Encryption | ❌ | AES-GCM encryption for snapshots | ~1MB |
| Signing | ❌ | Ed25519 cryptographic signatures | ~500KB |
Python Extras
| Extra | Includes | Use Case |
|---|---|---|
| [torch] | PyTorch ≥2.0 | ML training with PyTorch DataLoader |
| [tensorflow] | TensorFlow ≥2.13 | ML training with TensorFlow Dataset |
| [numpy] | NumPy ≥1.20 | Scientific computing, array operations |
| [ml] | NumPy + PyTorch | Common ML stack |
| [full] | All ML frameworks | Everything for ML workflows |
| [dev] | Testing + linting tools | Development and contribution |
Compile-Time Features
Control Rust features at build time for minimal deployments:
# Minimal: local files only, LZ4 compression
maturin build --no-default-features
# Add S3 support
maturin build --no-default-features --features s3
# Add encryption
maturin build --no-default-features --features encryption,s3
# Everything
maturin build --features full
Use Cases:
- Edge Deployments: Disable S3 to reduce binary size for IoT/embedded
- Air-Gapped Systems: Build without network features for secure environments
- Size-Constrained Containers: Minimal builds for Lambda/Cloud Run
Architecture
hexz-loader/
├── src/ # Rust source (PyO3 bindings)
│ ├── lib.rs # Main Python module
│ ├── reader.rs # Reader bindings
│ ├── writer.rs # Writer bindings
│ └── utils.rs # Helper functions
├── python/hexz/ # Python wrapper code
│ ├── __init__.py # Public API
│ ├── dataset.py # PyTorch Dataset integration
│ ├── reader.py # High-level reader interface
│ ├── writer.py # High-level writer interface
│ ├── array.py # NumPy integration
│ ├── torch/ # PyTorch utilities
│ └── ml/ # ML-specific helpers
├── tests/ # Python tests (pytest)
└── examples/ # Usage examples
Usage Examples
Creating Snapshots
Create snapshots from Python:
import hexz
# From a file
with hexz.open("output.hxz", mode="w", compression="lz4") as w:
    w.add("source_disk.raw")
# Or use Writer directly
with hexz.Writer("output.hxz", compression="lz4") as w:
    w.add_file("source_disk.raw")
    w.add_bytes(b"additional data")
NumPy Integration
Read data directly into NumPy arrays without extra copies:
import hexz
import numpy as np
reader = hexz.open("data.hxz")
# Zero-copy read into NumPy array
array = hexz.read_array(
    reader,
    offset=0,
    shape=(100, 100),
    dtype=np.float32,
)
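What "zero-copy" means can be demonstrated with the standard library alone. The sketch below reinterprets a byte buffer as a 2x3 grid of 32-bit floats through a `memoryview`, then shows that a write to the buffer is visible through the view. It illustrates the general technique only (NumPy's `np.frombuffer` creates a comparable non-copying view); how hexz.read_array works internally is not shown here.

```python
import struct

# Six native-endian 32-bit floats packed into a mutable buffer.
buf = bytearray(struct.pack("6f", 0.0, 1.0, 2.0, 3.0, 4.0, 5.0))

# Reinterpret the bytes as a 2x3 float grid; no data is copied.
view = memoryview(buf).cast("f", [2, 3])
rows = view.tolist()

# A write to the underlying buffer shows up through the view,
# which confirms that the view shares memory with `buf`.
struct.pack_into("f", buf, 0, 9.5)
changed = view[0, 0]
```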
Mounting Snapshots
Mount as a read-only filesystem (requires FUSE):
import hexz
with hexz.mount("snapshot.hxz") as mp:
    print(f"Mounted at {mp.path}")
    # Access files in mp.path/disk
Remote Streaming
Stream from S3 or HTTP:
import hexz
# S3 streaming
dataset = hexz.open("s3://bucket/dataset.hxz")
# HTTP streaming
dataset = hexz.open("https://example.com/data.hxz")
# Read on-demand (only fetches needed blocks)
chunk = dataset.read_at(1024 * 1024, 4096)
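On-demand fetching comes down to mapping a byte range onto the blocks that cover it. The sketch below shows that arithmetic for fixed-size blocks; the 64 KiB block size and the `blocks_for_range` helper are assumptions for illustration, since this README does not state the actual block layout.

```python
def blocks_for_range(offset, length, block_size):
    """Return the indices of the fixed-size blocks that overlap a byte range."""
    if offset < 0 or length < 0 or block_size <= 0:
        raise ValueError("offset/length must be non-negative, block_size positive")
    if length == 0:
        return range(0)
    first = offset // block_size
    last = (offset + length - 1) // block_size  # index of the final byte's block
    return range(first, last + 1)


BLOCK_SIZE = 64 * 1024  # assumed block size, for illustration only

# The read from the example above: 4096 bytes at offset 1 MiB.
needed = list(blocks_for_range(1024 * 1024, 4096, BLOCK_SIZE))
# A 2-byte read that straddles a block boundary touches two blocks.
spanning = list(blocks_for_range(BLOCK_SIZE - 1, 2, BLOCK_SIZE))
```

Under these assumptions the 4 KiB read touches a single block, so only that block needs to be fetched and decompressed.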
Development
All development commands use the project Makefile from the repository root.
Building
# Install in editable mode (development)
make develop
# Build wheel for distribution
make python
# Build with specific Python version
PYTHON=python3.11 make develop
Testing
# Run all tests (Rust + Python)
make test
# Run only Python tests
make test-python
# Run with filter
make test-python test_reader
# Or use pytest directly
pytest crates/loader/tests/ -v
Linting & Formatting
# Format all code (Rust + Python)
make fmt
# Lint (includes ruff for Python)
make lint
# Python-specific linting
ruff check crates/loader/python/
See make help for all available commands.
API Reference
Core Types
- Reader: Read snapshots with file-like interface
- AsyncReader: Async I/O reader
- Writer: Create new snapshots
- Dataset: PyTorch Dataset implementation
- Loader: High-level loader (alias for Dataset)
Functions
- open(path, mode='r', **kwargs): Open a snapshot (reader or writer)
- read_array(reader, offset, shape, dtype): Zero-copy read into NumPy
- mount(path): Mount snapshot as FUSE filesystem
See the Python API documentation for complete reference.
Performance
Optimized for ML training workloads:
| Metric | Value |
|---|---|
| Sequential Read | ~2-3 GB/s |
| Random Access | ~1ms (cold), ~0.08ms (warm) |
| Prefetch Threads | Configurable (default: 4) |
| Memory Overhead | <150 MB per reader |
| Zero-Copy | Yes (via PyO3 buffer protocol) |
PyTorch Integration
The Dataset class implements PyTorch's Dataset interface:
from hexz import Dataset
from torch.utils.data import DataLoader
# Create dataset
dataset = Dataset(
    "s3://bucket/train.hxz",
    transform=None,   # Optional transform function
    cache_size=1024,  # Cache 1024 blocks in memory
)
# Use with DataLoader
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    shuffle=True,
)
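A bounded block cache such as the one configured by cache_size can be sketched in a few lines. The example below assumes least-recently-used eviction, which this README does not confirm; `BlockCache` and its `fetch` callback are hypothetical names used only for illustration.

```python
from collections import OrderedDict


class BlockCache:
    """Minimal LRU cache keyed by block index (illustrative only)."""

    def __init__(self, capacity, fetch):
        self._capacity = capacity
        self._fetch = fetch              # called on a miss to load a block
        self._blocks = OrderedDict()
        self.misses = 0

    def get(self, index):
        if index in self._blocks:
            self._blocks.move_to_end(index)   # mark as most recently used
            return self._blocks[index]
        self.misses += 1
        block = self._fetch(index)
        self._blocks[index] = block
        if len(self._blocks) > self._capacity:
            self._blocks.popitem(last=False)  # evict the least recently used
        return block


cache = BlockCache(capacity=2, fetch=lambda i: bytes([i]) * 4)
for i in (0, 1, 0, 2, 1):
    cache.get(i)
```

With shuffle=True the access pattern is random, so the hit rate of any such cache depends on how the cache size compares with the number of blocks in the dataset.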
Requirements
- Python: 3.8+ (ABI3 compatible)
- Rust: Latest stable (for building from source)
- System: Linux, macOS, or Windows
- Optional: FUSE (for mounting)
See Also
- User Documentation - Tutorials and guides
- Python API Reference - Complete API docs
- hexz-core - Core Rust engine
- CLI Tool - Command-line interface
- Project README - Main project overview