High-performance Object Storage Library for PyTorch, JAX and TensorFlow: Object support includes S3, Azure, GCS, and file system operations for AI/ML


s3dlio - Universal Storage I/O Library

High-performance, multi-protocol storage library for AI/ML workloads with universal copy operations across S3, Azure, GCS, local file systems, and DirectIO.

v0.9.102 — SDK error-chain diagnostics + cold-start timeout / retry knobs (mlcommons/storage#506)

Every SdkError on the S3 GET / HEAD / PUT / range / multipart paths now reaches Python with its full connector-error chain (I/O / TLS / DNS / timeout / refused) plus an actionable hint string — no more bare RuntimeError: dispatch failure. Two new wired env vars: S3DLIO_CONNECT_TIMEOUT_SECS (default 20 s, was effectively hardcoded 5 s) and S3DLIO_MAX_RETRY_ATTEMPTS (default 3, set to 1 for fast-fail debugging). See docs/Changelog.md for full details.
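As a rough illustration of the two knobs, the sketch below writes them into an environment mapping from Python. Only the variable names and defaults come from the release note above. The helper `set_s3dlio_timeouts` is hypothetical, and applying the values before `s3dlio` is imported is an assumption, made because the changelog notes that some environment variables are cached on first use.

```python
import os


def set_s3dlio_timeouts(connect_timeout_secs=20, max_retry_attempts=3, env=None):
    """Write the two documented knobs into an environment mapping.

    The helper itself is hypothetical; only the variable names and the
    defaults (20 s, 3 attempts) are taken from the release note.
    """
    if connect_timeout_secs <= 0:
        raise ValueError("connect_timeout_secs must be positive")
    if max_retry_attempts < 1:
        raise ValueError("max_retry_attempts must be at least 1")
    target = os.environ if env is None else env
    target["S3DLIO_CONNECT_TIMEOUT_SECS"] = str(connect_timeout_secs)
    target["S3DLIO_MAX_RETRY_ATTEMPTS"] = str(max_retry_attempts)
    return target


# Fast-fail debugging: a single attempt, so the first error surfaces at once.
debug_env = set_s3dlio_timeouts(connect_timeout_secs=5, max_retry_attempts=1, env={})
print(debug_env)
```

Passing `env=None` (the default) writes to `os.environ`, which is what a real run would need; the explicit dict here only keeps the example free of side effects.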

v0.9.100 (prior): General-purpose object data loader — PyDataset.from_uris(), items(), collect_batch(), skip_head HEAD optimisation. See docs/Python_Data-Loader.md.

v0.9.98 (prior): ParquetRowGroupDataset — epoch-aware Parquet DataLoader with epoch-2 fast path and Arrow IPC decode. See docs/Changelog.md for full history.

📦 Installation

Quick Install (Python)

# If using uv package manager + uv virtual environment:
uv pip install s3dlio

# If using pip without uv:
pip install s3dlio

Python Backend Profiles (PyPI vs Full Build)

If using uv package manager + uv virtual environment: uv pip install s3dlio.
If using standard pip without uv: pip install s3dlio.
The default published wheel is now S3-focused (Azure Blob and GCS are excluded).
If you want full backends (S3 + Azure Blob + GCS), build from source with:

# uv workflow:
uv pip install s3dlio --no-binary s3dlio --config-settings "cargo-extra-args=--features extension-module,full-backends"

# pip-only workflow:
pip install s3dlio --no-binary s3dlio --config-settings "cargo-extra-args=--features extension-module,full-backends"

You can still add a separate package name (for example s3dlio-full) later if you want a dedicated prebuilt full wheel distribution.

Maintainer note: for PyPI uploads, publish the default (./build_pyo3.sh) wheel unless intentionally releasing a separate distribution. full-backends is currently source-build only via the command above.

Building from Source (Rust)

System Dependencies

s3dlio requires some system libraries to build. Only OpenSSL and pkg-config are required by default. HDF5 and hwloc are optional and improve functionality but are not needed for the core library:

Ubuntu/Debian:

# Quick install - run our helper script
./scripts/install-system-deps.sh

# Or manually (required only):
sudo apt-get install -y build-essential pkg-config libssl-dev

# Optional - for NUMA topology support (--features numa):
sudo apt-get install -y libhwloc-dev

# Optional - for HDF5 data format support (--features hdf5):
sudo apt-get install -y libhdf5-dev

# All optional libraries at once:
sudo apt-get install -y libhdf5-dev libhwloc-dev cmake

RHEL/CentOS/Fedora/Rocky/AlmaLinux:

# Quick install
./scripts/install-system-deps.sh

# Or manually (required only):
sudo dnf install -y gcc gcc-c++ make pkg-config openssl-devel

# Optional - for NUMA topology support:
sudo dnf install -y hwloc-devel

# Optional - for HDF5 data format support:
sudo dnf install -y hdf5-devel

# All optional libraries at once:
sudo dnf install -y hdf5-devel hwloc-devel cmake

macOS:

# Quick install
./scripts/install-system-deps.sh

# Or manually (required only):
brew install pkg-config openssl@3

# Optional - for NUMA/HDF5 support:
brew install hdf5 hwloc cmake

# Set environment variables (add to ~/.zshrc or ~/.bash_profile):
export PKG_CONFIG_PATH="$(brew --prefix openssl@3)/lib/pkgconfig:$PKG_CONFIG_PATH"
export OPENSSL_DIR="$(brew --prefix openssl@3)"

Arch Linux:

# Quick install
./scripts/install-system-deps.sh

# Or manually (required only):
sudo pacman -S base-devel pkg-config openssl

# Optional - for NUMA/HDF5 support:
sudo pacman -S hdf5 hwloc cmake

WSL (Windows Subsystem for Linux) / Minimal Environments:

If you are building on WSL or any environment where libhdf5 or libhwloc may not be available, s3dlio builds without them by default. No extra libraries are required:

# Just the basics - works on WSL, Docker, CI, and minimal installs:
sudo apt-get install -y build-essential pkg-config libssl-dev
cargo build --release
# install Python package (no system HDF5/hwloc needed):
# uv workflow:
uv pip install s3dlio
# pip-only workflow:
pip install s3dlio

Install Rust (if not already installed)

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env

Build s3dlio

# Clone the repository
git clone https://github.com/russfellows/s3dlio.git
cd s3dlio

# Build with default features (no HDF5 or NUMA required)
cargo build --release

# Build s3-cli with all cloud backends enabled (AWS + Azure + GCS)
cargo build --release --bin s3-cli --features full-backends

# Build s3-cli with GCS enabled only (plus default backends)
cargo build --release --bin s3-cli --features backend-gcs

# Build with NUMA topology support (requires libhwloc-dev)
cargo build --release --features numa

# Build with HDF5 data format support (requires libhdf5-dev)
cargo build --release --features hdf5

# Build with all optional features
cargo build --release --features numa,hdf5

# Run tests
cargo test

# Build Python bindings (optional)
./build_pyo3.sh

# Build Python bindings with full backends (S3 + Azure + GCS)
./build_pyo3.sh full

# Named profile form is also supported:
./build_pyo3.sh --profile full
./build_pyo3.sh --profile default

# Show profile/help usage
./build_pyo3.sh --help

Build Profile Quick Reference

Rust backend feature profiles:

Default build (cargo build --release): S3-focused default backend set.
GCS-enabled build (--features backend-gcs): enables GCS in addition to default set.
Full cloud build (--features full-backends): enables AWS + Azure + GCS.

Python wheel build profiles via build_pyo3.sh:

default or slim: AWS + file/direct; excludes Azure and GCS.
full: AWS + Azure + GCS + file/direct.
Positional and named forms are equivalent:
- ./build_pyo3.sh full
- ./build_pyo3.sh -p full
- ./build_pyo3.sh --profile full

Optional extra Rust features for wheel builds can still be passed with EXTRA_FEATURES. Example: EXTRA_FEATURES="numa,hdf5" ./build_pyo3.sh full.

Note: NUMA support (--features numa) improves multi-socket performance but requires the hwloc2 C library. HDF5 support (--features hdf5) enables HDF5 data format generation but requires libhdf5. Both are optional and s3dlio is fully functional without them.

Platform support: s3dlio builds natively on Linux (x86_64, aarch64), macOS (x86_64 and Apple Silicon arm64), and WSL. Making numa and hdf5 optional was the key change for broad platform support — all remaining dependencies are pure Rust or use platform-independent system libraries (OpenSSL). To cross-compile Python wheels for Linux ARM64 from an x86_64 host, see build_pyo3.sh for instructions using the --zig linker. For macOS universal2 (fat binary covering both architectures), see the commented section in build_pyo3.sh.

✨ Key Features

High Performance: Multi-GB/s read and write throughput on platforms with sufficient network and storage capabilities
Zero-Copy Architecture: bytes::Bytes throughout for minimal memory overhead
Multi-Protocol: S3, Azure Blob, GCS, file://, direct:// (O_DIRECT)
HTTP/2 Support (opt-in): HTTP/2 is available but HTTP/1.1 is the default — HTTP/2 is almost always slower for bulk storage workloads. Opt in by setting S3DLIO_H2C=1. Both TLS ALPN (https://) and cleartext h2c (http://) are supported when enabled. See docs/HTTP2_ALPN_INVESTIGATION.md.
Python & Rust: Native Rust library with zero-copy Python bindings (PyO3), bytearray support for efficient memory management
Multi-Endpoint Load Balancing: RoundRobin/LeastConnections across storage endpoints
AI/ML Ready: PyTorch DataLoader integration, TFRecord/NPZ format support
Parquet DataLoader: Per-row-group epoch-aware DataLoader for any s3dlio storage backend (S3, Azure Blob, GCS, file://, direct://). Epoch-2 zero re-fetch speedup (2.5×+), Raw + Arrow IPC decode modes, 8-worker concurrency with shared metadata caches — see guide
High-Speed Data Generation: 50+ GB/s test data with configurable compression/dedup

🌟 Latest Release

v0.9.102 (June 2026) - SDK error-chain diagnostics plus cold-start timeout and retry knobs. See docs/Changelog.md.

Recent highlights:

v0.9.98 - Parquet DataLoader (ParquetRowGroupDataset): per-row-group Dataset, epoch-2 zero-re-fetch (2.5× speedup proven), Raw + ArrowIpc decode modes, 8-worker shared caches; 648 tests passing
v0.9.97 - XorStream (dedup-safe, ~15 GB/s/core); S3DLIO_UNSIGNED_PAYLOAD opt-in for private S3-compatible endpoints; 613 tests passing
v0.9.92 - MPU coordinator task, auto-scale, async write/flush/finish safety fixes, MAX_MULTIPART_PARTS guard; 580 tests passing
v0.9.90 - Full NVIDIA AIStore support (S3DLIO_FOLLOW_REDIRECTS=1) with all TLS security policies; HTTP/2 available (opt-in via S3DLIO_H2C=1, not the default); 5 issues closed (#126, #133, #134, #135, #136); 559 tests passing
v0.9.86 - Redirect follower for NVIDIA AIStore (S3 path); HTTPS→HTTP downgrade prevention; 21 new redirect tests; redirect security analysis documented
v0.9.84 - HEAD elimination (ObjectSizeCache); OnceLock env-var caching; lock-free range assembly; AWS_CA_BUNDLE_PATH → AWS_CA_BUNDLE; structured tracing
v0.9.80 - Python list hang fix (IMDSv2 legacy call removed); tracing deadlock fix (tokio::spawn → inline stream); async S3 delete/bucket helpers; deprecated Python APIs cleaned up

📖 Complete Changelog - Full version history, migration guides, API details

📚 Version History

For detailed release notes and migration guides, see the Complete Changelog.

Storage Backend Support

Universal Backend Architecture

s3dlio provides unified storage operations across all backends with consistent URI patterns:

🗄️ Amazon S3: s3://bucket/prefix/ - High-performance S3 operations (5+ GB/s reads, 2.5+ GB/s writes) with built-in concurrent range GETs (on by default)
☁️ Azure Blob Storage: az://container/prefix/ - Complete Azure integration with RangeEngine (30-50% faster for large blobs)
🌐 Google Cloud Storage: gs://bucket/prefix/ or gcs://bucket/prefix/ - Production ready with RangeEngine and full ObjectStore integration
📁 Local File System: file:///path/to/directory/ - High-speed local file operations with RangeEngine support
⚡ DirectIO: direct:///path/to/directory/ - Bypass OS cache for maximum I/O performance with RangeEngine
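The URI patterns above can be read as a simple scheme-to-backend mapping. The sketch below shows that idea with the standard library only; `backend_for` and the `BACKENDS` table are illustrative names, not part of the s3dlio API, and the library's own dispatch may differ.

```python
from urllib.parse import urlparse

# Scheme-to-backend table copied from the list above.
BACKENDS = {
    "s3": "Amazon S3",
    "az": "Azure Blob Storage",
    "gs": "Google Cloud Storage",
    "gcs": "Google Cloud Storage",
    "file": "Local File System",
    "direct": "DirectIO",
}


def backend_for(uri):
    """Return the backend name implied by a URI's scheme."""
    scheme = urlparse(uri).scheme
    try:
        return BACKENDS[scheme]
    except KeyError:
        raise ValueError(f"unsupported URI scheme: {scheme!r}") from None


for uri in ("s3://bucket/prefix/", "gcs://bucket/prefix/", "direct:///mnt/nvme/data/"):
    print(uri, "->", backend_for(uri))
```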

Concurrent Range GET Performance Features (v0.9.3+, Updated v0.9.60)

Concurrent range downloads hide network latency by parallelizing HTTP range requests.

All backends support concurrent range GETs — but via two different mechanisms:

Mechanism 1 — S3 built-in (on by default, v0.9.60+)

✅ Amazon S3: Concurrent range splitting enabled by default via S3DLIO_ENABLE_RANGE_OPTIMIZATION (default: on). Uses get_object_concurrent_range_async() — fires parallel GetObject(Range: bytes=N-M) requests via the AWS SDK with lock-free chunk assembly. Controlled by S3DLIO_RANGE_THRESHOLD_MB (default: 32 MiB) and S3DLIO_RANGE_CONCURRENCY (default: auto-scaled). Disable with S3DLIO_ENABLE_RANGE_OPTIMIZATION=0.
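To make the range-splitting idea concrete, here is a minimal Python sketch of the general technique: split an object into byte ranges, fetch them in parallel, and write each part into a preallocated buffer. It is not the library's Rust implementation. `plan_ranges`, `concurrent_range_get`, and the in-memory `fake_fetch` are hypothetical stand-ins, and the part size is an arbitrary illustrative value.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_ranges(object_size, part_size):
    """Split [0, object_size) into inclusive (start, end) byte ranges."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    return [
        (start, min(start + part_size, object_size) - 1)
        for start in range(0, object_size, part_size)
    ]


def concurrent_range_get(fetch_range, object_size, part_size, concurrency=4):
    """Fetch every range in parallel and assemble the parts in place."""
    out = bytearray(object_size)

    def fetch_into(byte_range):
        start, end = byte_range
        # Each worker fills its own disjoint slice of the preallocated buffer.
        out[start:end + 1] = fetch_range(start, end)

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(fetch_into, plan_ranges(object_size, part_size)))
    return bytes(out)


# Stand-in for an HTTP GET with a "Range: bytes=start-end" header.
blob = bytes(range(256)) * 40


def fake_fetch(start, end):
    return blob[start:end + 1]


assert concurrent_range_get(fake_fetch, len(blob), part_size=1000) == blob
print([f"bytes={s}-{e}" for s, e in plan_ranges(2500, 1000)])
```

Because every range maps to a fixed offset in the output buffer, parts can complete in any order without a reordering step.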

Mechanism 2 — RangeEngine (per-store config flag, must enable explicitly)

✅ Azure Blob Storage: 30-50% faster for large files (enable_range_engine: true in AzureConfig)
✅ Google Cloud Storage: 30-50% faster for large files (enable_range_engine: true in GcsConfig)
⚠️ Local File System: Rarely beneficial due to seek overhead (disabled by default)
⚠️ DirectIO: Rarely beneficial due to O_DIRECT overhead (disabled by default)

RangeEngine config flag defaults (v0.9.6+):

Status: enable_range_engine: false by default in all per-store config structs
Reason: Extra HEAD request on every GET causes ~50% slowdown for small-object workloads
Threshold: 32 MiB default (tunable per-store via RangeEngineConfig::min_split_size)
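The ~50% figure follows from simple arithmetic if the HEAD round trip costs about as much as a small GET. The sketch below is a back-of-the-envelope model, not a measurement: it assumes the two requests run back to back, and the millisecond values are made-up illustrative inputs.

```python
def relative_throughput(get_ms, head_ms):
    """Throughput with an extra HEAD per GET, relative to GET alone.

    Assumes requests run back to back, so time per object is additive.
    """
    return get_ms / (get_ms + head_ms)


# Small object: the HEAD round trip costs about as much as the GET itself.
print(f"small object: {relative_throughput(get_ms=10, head_ms=10):.2f}")
# Large object: the same HEAD is a small fraction of a long transfer.
print(f"large object: {relative_throughput(get_ms=1000, head_ms=10):.2f}")
```

Under this model the penalty shrinks as objects grow, which is consistent with leaving the flag off by default and enabling it only for large-file workloads.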

How to Enable for Large-File Workloads:

use s3dlio::object_store::{AzureObjectStore, AzureConfig};

let config = AzureConfig {
    enable_range_engine: true,  // Explicitly enable for large files
    ..Default::default()
};
let store = AzureObjectStore::with_config(config);

When to Enable:

✅ Large-file workloads (average size >= 32 MiB)
✅ High-bandwidth, high-latency networks
❌ Mixed or small-object workloads
❌ Local file systems

S3 Backend Options

s3dlio supports two S3 backend implementations. Native AWS SDK is the default and recommended for production use:

# Default: Native AWS SDK backend (RECOMMENDED for production)
cargo build --release
# or explicitly:
cargo build --no-default-features --features native-backends

# Experimental: Apache Arrow object_store backend (optional, for testing)
cargo build --no-default-features --features arrow-backend

Why native-backends is default:

Proven performance in production workloads
Optimized for high-throughput S3 operations (5+ GB/s reads, 2.5+ GB/s writes)
Well-tested with MinIO, Vast, and AWS S3

About arrow-backend:

Experimental alternative implementation
No proven performance advantage over native backend
Useful for comparison testing and development
Not recommended for production use

GCS Backend Options (Current)

GCS is now optional at build time.

Default build (cargo build --release) does not include GCS.
To include GCS, enable backend-gcs (or full-backends).
When enabled, s3dlio uses the official Google crates (google-cloud-storage + gax) from a patched fork maintained for s3dlio.

# Default build (S3-focused; no GCS)
cargo build --release

# Enable GCS explicitly
cargo build --release --features backend-gcs

# Enable all cloud backends (AWS + Azure + GCS)
cargo build --release --features full-backends

Patched official GCS fork used by s3dlio:

Repository: https://github.com/russfellows/google-cloud-rust
Integration in this repo is pinned in Cargo.toml (currently via release tag from that fork).

Legacy note: gcs-community remains as a legacy opt-in path, but the primary supported path is the official Google crates from the patched russfellows/google-cloud-rust fork.

Quick Start

Installation

Rust CLI:

git clone https://github.com/russfellows/s3dlio.git
cd s3dlio
cargo build --release

# Full cloud backend CLI build:
cargo build --release --bin s3-cli --features full-backends

Python Library:

# uv workflow:
uv pip install s3dlio

# pip-only workflow:
pip install s3dlio

# or build from source:
./build_pyo3.sh && ./install_pyo3_wheel.sh

# build from source with full cloud backends:
./build_pyo3.sh --profile full && ./install_pyo3_wheel.sh

Documentation

CLI Guide - Complete command-line interface reference with examples
Python API Guide - Complete Python library reference with examples
Parquet DataLoader Guide - Epoch-aware per-row-group Parquet DataLoader (v0.9.98+)
Multi-Endpoint Guide - Load balancing across multiple storage endpoints (v0.9.14+)
Rust API Guide v0.9.0 - Complete Rust library reference with migration guide
Changelog - Version history and release notes
Adaptive Tuning Guide - Optional performance auto-tuning
Testing Guide - Test suite documentation

Core Capabilities

🚀 Universal Copy Operations

s3dlio treats upload and download as enhanced versions of the Unix cp command, working across all storage backends:

CLI Usage:

# Upload to any backend with real-time progress
s3-cli upload /local/data/*.log s3://mybucket/logs/
s3-cli upload /local/files/* az://container/data/  
s3-cli upload /local/models/* gs://ml-bucket/models/
s3-cli upload /local/backup/* file:///remote-mount/backup/
s3-cli upload /local/cache/* direct:///nvme-storage/cache/

# Download from any backend  
s3-cli download s3://bucket/data/ ./local-data/
s3-cli download az://container/logs/ ./logs/
s3-cli download gs://ml-bucket/datasets/ ./datasets/
s3-cli download file:///network-storage/data/ ./data/

# Cross-backend copying workflow
s3-cli download s3://source-bucket/data/ ./temp/
s3-cli upload ./temp/* gs://dest-bucket/data/

Advanced Pattern Matching:

# Glob patterns for file selection (upload)
s3-cli upload "/data/*.log" s3://bucket/logs/
s3-cli upload "/files/data_*.csv" az://container/data/

# Regex patterns for listing (use single quotes to prevent shell expansion)
s3-cli ls -r s3://bucket/ -p '.*\.txt$'           # Only .txt files
s3-cli ls -r gs://bucket/ -p '.*\.(csv|json)$'    # CSV or JSON files
s3-cli ls -r az://acct/cont/ -p '.*/data_.*'      # Files with "data_" in path

# Count objects matching pattern (with progress indicator)
s3-cli ls -rc gs://bucket/data/ -p '.*\.npz$'
# Output: ⠙ [00:00:05] 71,305 objects (14,261 obj/s)
#         Total objects: 142,610 (10.0s, rate: 14,261 objects/s)

# Delete only matching files
s3-cli delete -r s3://bucket/logs/ -p '.*\.log$'

See CLI Guide for complete command reference and pattern syntax.
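The difference between the glob patterns used for uploads and the regex patterns used for listing can be tried out locally. The sketch below applies the same pattern strings to a made-up key list with Python's `fnmatch` and `re`. It assumes the regex is searched against the full object key; the CLI's exact matching semantics may differ, so treat it as an approximation.

```python
import fnmatch
import re

keys = [
    "logs/app.log",
    "logs/app.txt",
    "data/data_001.csv",
    "data/meta.json",
    "models/weights.npz",
]


def filter_regex(names, pattern):
    """Keep names where the regex matches somewhere in the full key."""
    rx = re.compile(pattern)
    return [name for name in names if rx.search(name)]


def filter_glob(names, pattern):
    """Keep names matching a shell-style glob (as used for uploads)."""
    return [name for name in names if fnmatch.fnmatchcase(name, pattern)]


print(filter_regex(keys, r".*\.txt$"))
print(filter_regex(keys, r".*\.(csv|json)$"))
print(filter_regex(keys, r".*/data_.*"))
print(filter_glob(keys, "data/data_*.csv"))
```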

🐍 Python Integration

High-Performance Data Operations:

import s3dlio

# Universal upload/download across all backends
s3dlio.upload(['/local/data.csv'], 's3://bucket/data/')
s3dlio.upload(['/local/logs/*.log'], 'az://container/logs/')  
s3dlio.upload(['/local/models/*.pt'], 'gs://ml-bucket/models/')
s3dlio.download('s3://bucket/data/', './local-data/')
s3dlio.download('gs://ml-bucket/datasets/', './datasets/')

# High-level AI/ML operations
dataset = s3dlio.create_dataset("s3://bucket/training-data/")
loader = s3dlio.create_async_loader("gs://ml-bucket/data/", {"batch_size": 32})

# PyTorch integration
from s3dlio.torch import S3IterableDataset
from torch.utils.data import DataLoader

dataset = S3IterableDataset("gs://bucket/data/", loader_opts={})
dataloader = DataLoader(dataset, batch_size=16)

Streaming & Compression:

# High-performance streaming with compression
options = s3dlio.PyWriterOptions()
options.compression = "zstd"
options.compression_level = 3

writer = s3dlio.create_s3_writer('s3://bucket/data.zst', options)
writer.write_chunk(large_data_bytes)
stats = writer.finalize()  # Returns (bytes_written, compressed_bytes)

# Data generation with configurable modes
s3dlio.put("s3://bucket/test-data-{}.bin", num=1000, size=4194304, 
          data_gen_mode="streaming")  # 2.6-3.5x faster for most cases

XorStream — dedup-safe data generation (v0.9.97+):

XorStream generates unique, incompressible data at ~15 GB/s per core without Rayon thread management overhead. Every fill() / generate() call is guaranteed to produce a different 512-byte-block-level fingerprint. Ideal for PUT-heavy benchmarks where many worker threads share a single generator.

import s3dlio
import numpy as np

stream = s3dlio.XorStream()

# --- Fastest path: in-place fill into pre-allocated bytearray ---
buf = bytearray(8 * 1024 * 1024)   # 8 MiB working buffer
stream.fill(buf)                    # fill once — unique payload
stream.fill(buf)                    # fill again — different payload, guaranteed

print(stream.objects_generated)     # == 2

# --- Convenience path: allocate + fill in one call ---
data = stream.generate(8 * 1024 * 1024)  # returns BytesView
view = memoryview(data)                   # zero-copy Python view
arr  = np.frombuffer(view, dtype=np.uint8)  # zero-copy numpy array

# --- PUT with XorStream data (benchmark pattern) ---
import threading

def worker(stream, uri_template, n):
    buf = bytearray(8 * 1024 * 1024)
    for i in range(n):
        stream.fill(buf)       # reuse buffer, unique data each time
        s3dlio.put_bytes(buf, uri_template.format(i))

threads = [threading.Thread(target=worker, args=(stream, "s3://bucket/obj-{}.bin", 100))
           for _ in range(32)]
for t in threads: t.start()
for t in threads: t.join()

Scenario	Best choice
High concurrency PUT (≥ 32 workers)	`XorStream` — no Rayon scheduling
Medium objects 1–32 MiB	`XorStream` — no per-call allocation
Controllable compress/dedup ratios	`generate_data()` / `Generator`
Very large objects (≥ 256 MiB)	`Generator.fill_chunk()`
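One way to sanity-check a dedup-safety claim at the 512-byte-block level is to hash every block and count distinct digests. The helper below is a hypothetical verification sketch, not part of s3dlio; it uses `os.urandom` as a stand-in for generator output so that it runs without the library installed.

```python
import hashlib
import os


def unique_block_ratio(payloads, block_size=512):
    """Fraction of distinct fixed-size blocks across all payloads."""
    seen = set()
    total = 0
    for payload in payloads:
        view = memoryview(payload)
        for offset in range(0, len(view), block_size):
            block = view[offset:offset + block_size]
            seen.add(hashlib.blake2b(block, digest_size=16).digest())
            total += 1
    return len(seen) / total if total else 1.0


# os.urandom stands in for generator output; swap in buffers filled by
# stream.fill(buf) to check real payloads.
random_payloads = [os.urandom(64 * 1024) for _ in range(4)]
zero_payloads = [bytes(64 * 1024) for _ in range(4)]

print(f"random data: {unique_block_ratio(random_payloads):.3f}")
print(f"zero-filled: {unique_block_ratio(zero_payloads):.3f}")
```

A ratio of 1.0 means no block repeats, so a deduplicating store would find nothing to collapse; values near zero indicate highly dedupable data.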

Multi-Endpoint Load Balancing (v0.9.14+):

import numpy as np
import s3dlio

# Distribute I/O across multiple storage endpoints
store = s3dlio.create_multi_endpoint_store(
    uris=[
        "s3://bucket-1/data",
        "s3://bucket-2/data", 
        "s3://bucket-3/data",
    ],
    strategy="least_connections"  # or "round_robin"
)

# Zero-copy data access (memoryview compatible)
data = store.get("s3://bucket-1/file.bin")
array = np.frombuffer(memoryview(data), dtype=np.float32)

# Monitor load distribution
stats = store.get_endpoint_stats()
for i, s in enumerate(stats):
    print(f"Endpoint {i}: {s['requests']} requests, {s['bytes_transferred']} bytes")

📖 Complete Multi-Endpoint Guide - Load balancing, configuration, use cases
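For readers unfamiliar with the two strategy names, the following is a minimal pure-Python sketch of how round-robin and least-connections selection generally work. The classes and the in-flight snapshot are hypothetical illustrations of the textbook policies, not s3dlio's internal implementation.

```python
import itertools


class RoundRobin:
    """Cycle through endpoints in a fixed order."""

    def __init__(self, endpoints):
        self._cycle = itertools.cycle(endpoints)

    def pick(self, in_flight):
        return next(self._cycle)


class LeastConnections:
    """Pick the endpoint with the fewest in-flight requests."""

    def __init__(self, endpoints):
        self._endpoints = list(endpoints)

    def pick(self, in_flight):
        # min() returns the first minimum, so ties go to the earliest endpoint.
        return min(self._endpoints, key=lambda ep: in_flight.get(ep, 0))


endpoints = ["s3://bucket-1/data", "s3://bucket-2/data", "s3://bucket-3/data"]
# Hypothetical snapshot: bucket-1 is busy with slow requests.
in_flight = {"s3://bucket-1/data": 8, "s3://bucket-2/data": 1, "s3://bucket-3/data": 3}

rr = RoundRobin(endpoints)
lc = LeastConnections(endpoints)
print([rr.pick(in_flight) for _ in range(4)])
print(lc.pick(in_flight))
```

Round-robin ignores load and spreads requests evenly, while least-connections steers new requests away from an endpoint that is currently slow.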

📦 Parquet DataLoader — Epoch-Aware Training (v0.9.98+)

The Parquet DataLoader provides per-row-group streaming for Parquet files on any s3dlio-accessible storage — S3, Azure Blob, GCS, file://, and direct:// (O_DIRECT; tested and working). Only the URI prefix changes; no code changes needed to switch backends. Two decode modes: Raw (Python decodes) and ArrowIpc (Rust decodes to Arrow IPC bytes). Zero footer re-fetches on epoch 2+ (row-group byte ranges cached in a process-global DashMap after epoch 1; backend-agnostic).

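import io

import pyarrow as pa
import pyarrow.parquet  # makes pyarrow.parquet.read_table available below
import s3dlio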

# Works with any s3dlio URI — just change the prefix:
#   "s3://bucket/train/"           Amazon S3 / MinIO / Ceph
#   "az://container/train/"        Azure Blob Storage
#   "gs://bucket/train/"           Google Cloud Storage
#   "file:///mnt/data/train/"      Local filesystem
#   "direct:///mnt/nvme/train/"    Local O_DIRECT (bypass page cache)

# Raw mode — Python decodes with PyArrow (default)
loader = s3dlio.create_async_loader(
    "s3://bucket/train/",
    {"format": "parquet", "prefetch": 32}
)
for item in loader:
    # item["data"]: bytes, item["uri"]: str, item["rg_idx"]: int
    table = pyarrow.parquet.read_table(io.BytesIO(item["data"]))

# Arrow IPC mode — Rust decodes, Python gets ready-to-use RecordBatch bytes
loader = s3dlio.create_async_loader(
    "direct:///mnt/nvme/train/",   # same API, different backend
    {"format": "parquet", "decode": "arrow", "prefetch": 32}
)
for item in loader:
    batch = pa.ipc.open_stream(pa.py_buffer(item["data"])).read_next_batch()

Option	Type	Default	Description
`"format"`	`str`	—	Must be `"parquet"` to activate Parquet mode
`"decode"`	`str`	`"raw"`	`"raw"` or `"arrow"` (Rust-side decode)
`"columns"`	`list[int]`	`None`	Column subset; `None` = all columns
`"footer_cap"`	`int`	4 MiB	Bytes from file tail for footer parsing
`"prefetch"`	`int`	`32`	Concurrent in-flight row-group GETs

Epoch-2+ speedup — measured against live MinIO:

Epoch	Construction time	Notes
1	20.4 ms	`list_objects` + footer GETs
2+	8.3 ms	`list_objects` only — 2.5× faster

Memory: only metadata in RAM; 8 concurrent workers share process-global caches (no 8× duplication).
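The epoch-2 fast path rests on a simple idea: remember each file's row-group byte ranges after the first footer read, so later epochs only need the object listing. The sketch below models that with a plain module-level dict. All names are hypothetical, the byte ranges are made up, and the real cache is described above as a process-global DashMap on the Rust side.

```python
# Process-global cache: URI -> list of (offset, length) row-group byte ranges.
_ROW_GROUP_CACHE = {}
footer_fetches = 0


def read_footer(uri):
    """Stand-in for a footer GET; returns made-up row-group byte ranges."""
    global footer_fetches
    footer_fetches += 1
    return [(0, 1024), (1024, 2048)]


def row_groups(uri):
    """Return cached row-group ranges, fetching the footer only on a miss."""
    ranges = _ROW_GROUP_CACHE.get(uri)
    if ranges is None:
        ranges = _ROW_GROUP_CACHE[uri] = read_footer(uri)
    return ranges


def build_epoch(uris):
    """Expand files into (uri, row_group_index, offset, length) work items."""
    return [
        (uri, idx, offset, length)
        for uri in uris
        for idx, (offset, length) in enumerate(row_groups(uri))
    ]


files = ["s3://bucket/train/a.parquet", "s3://bucket/train/b.parquet"]
epoch1 = build_epoch(files)
fetches_after_epoch1 = footer_fetches
epoch2 = build_epoch(files)
print(fetches_after_epoch1, footer_fetches, epoch1 == epoch2)
```

The footer counter stops growing after the first epoch, which is the effect the construction-time table above measures.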

📖 Complete Parquet DataLoader Guide

Performance

Benchmark Results

Headline results from the project's benchmarks:

Operation	Performance	Notes
S3 PUT	Up to 3.089 GB/s	Exceeds steady-state baseline by 17.8%
S3 GET	Up to 4.826 GB/s	Near line-speed performance
Multi-Process	2-3x faster	Improvement over single process
Streaming Mode	2.6-3.5x faster	For 1-8MB objects vs single-pass

Optimization Features

HTTP/2 Support (opt-in): HTTP/2 is supported but not the default — HTTP/1.1 is used unless you set S3DLIO_H2C=1. HTTP/2 is available for scenarios that benefit from multiplexing, but is typically slower for bulk storage workloads.
Intelligent Defaults: Streaming mode automatically selected based on benchmarks
Multi-Process Architecture: Massive parallelism for maximum performance
Zero-Copy Streaming: Memory-efficient operations for large datasets
Configurable Chunk Sizes: Fine-tune performance for your workload

Checkpoint system for model states

store = s3dlio.PyCheckpointStore('file:///tmp/checkpoints/')
store.save('model_state', your_model_data)
loaded_data = store.load('model_state')


Ready for Production: All core functionality validated, comprehensive test suite, and honest documentation matching actual capabilities.

Configuration & Tuning

Environment Variables

s3dlio supports comprehensive configuration through environment variables:

NVIDIA AIStore: S3DLIO_FOLLOW_REDIRECTS=1 - Enable HTTP 307 redirect following for AIStore (opt-in, disabled by default); S3DLIO_REDIRECT_MAX=5 - Maximum redirect hops per request
HTTP/2 mode: S3DLIO_H2C=1 - Force h2c (HTTP/2 cleartext) on http:// endpoints; S3DLIO_H2C=0 - Force HTTP/1.1; unset = auto-probe (default)
Runtime Scaling: S3DLIO_RT_THREADS=32 - Tokio worker threads
Connection Pool: S3DLIO_POOL_MAX_IDLE_PER_HOST=32 - Max idle connections per host (default: 32)
S3 Range GET: S3DLIO_ENABLE_RANGE_OPTIMIZATION=0 - Disable concurrent range splitting (enabled by default); S3DLIO_RANGE_THRESHOLD_MB=64 - Size threshold in MiB (default: 32); S3DLIO_RANGE_CONCURRENCY=64 - Max concurrent range requests
Operation Logging: S3DLIO_OPLOG_LEVEL=2 - S3 operation tracking

📖 Environment Variables Reference: docs/Environment_Variables.md
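As a worked reading of the S3 Range GET variables, the sketch below interprets them from a plain mapping using the documented defaults. `range_settings` and `uses_range_split` are hypothetical helpers. Treating only the literal value `0` as "disabled" and using `>=` for the threshold comparison are assumptions; the library's parsing may differ.

```python
MIB = 1024 * 1024


def range_settings(env):
    """Interpret the documented S3 range GET variables from a mapping."""
    enabled = env.get("S3DLIO_ENABLE_RANGE_OPTIMIZATION", "1") != "0"
    threshold_mib = int(env.get("S3DLIO_RANGE_THRESHOLD_MB", "32"))
    concurrency = env.get("S3DLIO_RANGE_CONCURRENCY")  # unset -> auto-scaled
    return {
        "enabled": enabled,
        "threshold_bytes": threshold_mib * MIB,
        "concurrency": int(concurrency) if concurrency is not None else "auto",
    }


def uses_range_split(object_size, settings):
    """True when an object is large enough to be split into range requests."""
    return settings["enabled"] and object_size >= settings["threshold_bytes"]


defaults = range_settings({})
tuned = range_settings({"S3DLIO_RANGE_THRESHOLD_MB": "64", "S3DLIO_RANGE_CONCURRENCY": "64"})

for label, settings in (("defaults", defaults), ("tuned", tuned)):
    print(label, settings, uses_range_split(48 * MIB, settings))
```

With the defaults, a 48 MiB object crosses the 32 MiB threshold and is split; raising the threshold to 64 MiB leaves it as a single GET.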

Operation Logging (Op-Log)

Universal operation trace logging across all backends with zstd-compressed TSV format, warp-replay compatible.

import s3dlio
s3dlio.init_op_log("operations.tsv.zst")
# All operations automatically logged
s3dlio.finalize_op_log()

See S3DLIO OpLog Implementation for detailed usage.

Building from Source

Prerequisites

Rust: Install Rust toolchain
Python 3.12+: For Python library development
UV (recommended): Install UV
OpenSSL: Required (libssl-dev on Ubuntu)
HDF5 (optional): Only needed with --features hdf5 (libhdf5-dev on Ubuntu, brew install hdf5 on macOS)
hwloc (optional): Only needed with --features numa (libhwloc-dev on Ubuntu)

Build Steps

# Python environment
uv venv && source .venv/bin/activate

# Rust CLI
cargo build --release

# Python library
./build_pyo3.sh && ./install_pyo3_wheel.sh

Configuration

Environment Setup

# Required for S3 operations
AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-key
AWS_ENDPOINT_URL=https://your-s3-endpoint
AWS_REGION=us-east-1
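A quick preflight check can catch a missing variable before the first request fails. The sketch below treats all four variables above as required, following this section's wording. `missing_s3_vars` is a hypothetical helper, not an s3dlio function, and some deployments may not need every variable (for example, an explicit endpoint URL).

```python
import os

REQUIRED_S3_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_ENDPOINT_URL",
    "AWS_REGION",
)


def missing_s3_vars(env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_S3_VARS if not env.get(name)]


example_env = {"AWS_ACCESS_KEY_ID": "your-access-key", "AWS_REGION": "us-east-1"}
print(missing_s3_vars(example_env))
```

Calling `missing_s3_vars()` with no argument checks the real process environment.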

S3 operation logging compatible with the MinIO warp format is enabled through the op-log calls described under Operation Logging (Op-Log) above.

Advanced Features

CPU Profiling & Analysis

cargo build --release --features profiling
cargo run --example simple_flamegraph_test --features profiling

Compression & Streaming

import s3dlio
options = s3dlio.PyWriterOptions()
options.compression = "zstd"
writer = s3dlio.create_s3_writer('s3://bucket/data.zst', options)
writer.write_chunk(large_data)
stats = writer.finalize()

Container Deployment

# Use pre-built container
podman pull quay.io/russfellows-sig65/s3dlio
podman run --net=host --rm -it quay.io/russfellows-sig65/s3dlio

# Or build locally
podman build -t s3dlio .

Note: Always use --net=host for storage backend connectivity.

Documentation & Support

🖥️ CLI Guide: docs/CLI_GUIDE.md - Complete command-line reference
🐍 Python API: docs/PYTHON_API_GUIDE.md - Python library reference
📦 Parquet DataLoader: docs/Parquet_Data-Loader.md - Epoch-aware Parquet DataLoader guide (v0.9.98+)
📚 API Documentation: docs/api/
📝 Changelog: docs/Changelog.md
🧪 Testing Guide: docs/TESTING-GUIDE.md
🚀 Performance: docs/performance/

🔗 Related Projects

sai3-bench - Multi-protocol I/O benchmarking suite built on s3dlio
polarWarp - Op-log analysis tool for parsing and visualizing s3dlio operation logs
google-cloud-rust (s3dlio patched fork) - Official Google Cloud Rust client fork used by s3dlio for patched GCS support

License

Licensed under the Apache License 2.0 - see LICENSE file.

🚀 Ready to get started? Check out the Quick Start section above or explore our example scripts for common use cases!

Release history

Version	Release date
0.9.106	Jul 1, 2026
0.9.104	Jun 30, 2026
0.9.102 (this version)	Jun 25, 2026
0.9.100	May 13, 2026
0.9.98	May 9, 2026
0.9.96	Apr 28, 2026
0.9.95	Apr 27, 2026
0.9.94	Apr 26, 2026
0.9.92	Apr 23, 2026
0.9.90	Apr 18, 2026
0.9.86	Mar 23, 2026
0.9.84	Mar 21, 2026
0.9.82	Mar 20, 2026
0.9.80	Mar 19, 2026
0.9.76	Mar 18, 2026
0.9.75	Mar 17, 2026
0.9.70	Mar 15, 2026
0.9.65	Mar 2, 2026
0.9.60	Mar 1, 2026
0.9.50	Feb 18, 2026
0.9.40	Feb 12, 2026
0.9.34	Jan 13, 2026
0.9.31	Dec 26, 2025
0.9.30	Dec 15, 2025
0.9.27	Dec 14, 2025
0.9.26 (yanked: broken import, use 0.9.27 instead)	Dec 14, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

s3dlio-0.9.102.tar.gz (1.6 MB view details)

Uploaded Jun 25, 2026 Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

s3dlio-0.9.102-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.8 MB view details)

Uploaded Jun 25, 2026 CPython 3.13manylinux: glibc 2.17+ x86-64

s3dlio-0.9.102-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (12.4 MB view details)

Uploaded Jun 25, 2026 CPython 3.13manylinux: glibc 2.17+ ARM64

s3dlio-0.9.102-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.8 MB view details)

Uploaded Jun 25, 2026 CPython 3.12manylinux: glibc 2.17+ x86-64

s3dlio-0.9.102-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (12.4 MB view details)

Uploaded Jun 25, 2026 CPython 3.12manylinux: glibc 2.17+ ARM64

s3dlio-0.9.102-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.8 MB view details)

Uploaded Jun 25, 2026 CPython 3.11manylinux: glibc 2.17+ x86-64

s3dlio-0.9.102-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (12.4 MB view details)

Uploaded Jun 25, 2026 CPython 3.11manylinux: glibc 2.17+ ARM64

File details

Details for the file s3dlio-0.9.102.tar.gz.

File metadata

Download URL: s3dlio-0.9.102.tar.gz
Upload date: Jun 25, 2026
Size: 1.6 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.14.1

File hashes

Hashes for s3dlio-0.9.102.tar.gz
Algorithm	Hash digest
SHA256	`306d68e4403cdbcde421e5559923b697cdbc021951aa1ded66af953d77642ba1`
MD5	`68784f17300fe6dd2bad695234e7c934`
BLAKE2b-256	`706849468354c9ac4e957c23914ea20e014d160d82c4ebee3d5272f59e5ec3d7`

See more details on using hashes here.

File details

Details for the file s3dlio-0.9.102-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

Download URL: s3dlio-0.9.102-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Upload date: Jun 25, 2026
Size: 12.8 MB
Tags: CPython 3.13, manylinux: glibc 2.17+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.14.1

File hashes

Hashes for s3dlio-0.9.102-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`8dd1619c0448538385c905990f970783f912b3a0b2dba5a2f8ab589c510883c6`
MD5	`a179be8e8837583a533ef82b85de7371`
BLAKE2b-256	`2e690016405d4df57c295c1a154522164dea0cc80c5c27a2b60fe60f2b2f53a4`

See more details on using hashes here.

File details

Details for the file s3dlio-0.9.102-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

Download URL: s3dlio-0.9.102-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Upload date: Jun 25, 2026
Size: 12.4 MB
Tags: CPython 3.13, manylinux: glibc 2.17+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.14.1

File hashes

Hashes for s3dlio-0.9.102-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm	Hash digest
SHA256	`5852510ac08bdca8bd1821ac95ae9b3080003e0e7da3d682f71c295666e6e90d`
MD5	`d82760edb982c482a0ed729bb4cae68f`
BLAKE2b-256	`02cd2ffa1dc2ef1ba3285240d59278228a1cc1967eeaa404df0701afbbf19678`

File details

Details for the file s3dlio-0.9.102-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

Download URL: s3dlio-0.9.102-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Upload date: Jun 25, 2026
Size: 12.8 MB
Tags: CPython 3.12, manylinux: glibc 2.17+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.14.1

File hashes

Hashes for s3dlio-0.9.102-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`2f3bdbe9709f2ef2ecb8a59fe728919894500c2897f782ced9e22cbaf6bc69a5`
MD5	`0a03e1d8f9adb136607d71c1ead96bd8`
BLAKE2b-256	`fdccec8d681a5da7de73fb521bdba5eea248948caf2f763f44aecef1c4c5406d`

File details

Details for the file s3dlio-0.9.102-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

Download URL: s3dlio-0.9.102-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Upload date: Jun 25, 2026
Size: 12.4 MB
Tags: CPython 3.12, manylinux: glibc 2.17+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.14.1

File hashes

Hashes for s3dlio-0.9.102-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm	Hash digest
SHA256	`d605e2ff84cc859fd76dc64d05538b9b1d210b6f4d1a33b7cbe2f14d1a64152a`
MD5	`d9df4f6f2147a0dbc7faf6dfce979fc9`
BLAKE2b-256	`b425d870f111cc99ba740a2f0bb6fbcab821553de57ad0e3fec952ea68d754e4`

File details

Details for the file s3dlio-0.9.102-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

Download URL: s3dlio-0.9.102-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Upload date: Jun 25, 2026
Size: 12.8 MB
Tags: CPython 3.11, manylinux: glibc 2.17+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.14.1

File hashes

Hashes for s3dlio-0.9.102-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`5844301f00a8dbc58fc8b879d1be524e4482bf512bd45d453daa22b1045f9db5`
MD5	`3909fa1b351314f3c15cd2aee5b234ae`
BLAKE2b-256	`31e3bea5af5e80272c01f662d36257d1cee3719a9d6f73d7e07501d61d9f853c`

File details

Details for the file s3dlio-0.9.102-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

Download URL: s3dlio-0.9.102-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Upload date: Jun 25, 2026
Size: 12.4 MB
Tags: CPython 3.11, manylinux: glibc 2.17+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.14.1

File hashes

Hashes for s3dlio-0.9.102-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm	Hash digest
SHA256	`6af4b34cf581892ea7891a138bff277da8570cea786ec31e0d00fbdfebba7fbd`
MD5	`6407e2678c3dc18386cb92495ae0f29d`
BLAKE2b-256	`695b68417270752d272f1fbd4748b5b4a51d542b63ffdc0cac42d837279086eb`
