
High-performance Object Storage Library for PyTorch, JAX, and TensorFlow: object support includes S3, Azure, GCS, and file system operations for AI/ML


s3dlio - Universal Storage I/O Library


High-performance, multi-protocol storage library for AI/ML workloads with universal copy operations across S3, Azure, GCS, local file systems, and DirectIO.

v0.9.102 — SDK error-chain diagnostics + cold-start timeout / retry knobs (mlcommons/storage#506)

Every SdkError on the S3 GET / HEAD / PUT / range / multipart paths now reaches Python with its full connector-error chain (I/O / TLS / DNS / timeout / refused) plus an actionable hint string — no more bare RuntimeError: dispatch failure. Two new wired env vars: S3DLIO_CONNECT_TIMEOUT_SECS (default 20 s, was effectively hardcoded 5 s) and S3DLIO_MAX_RETRY_ATTEMPTS (default 3, set to 1 for fast-fail debugging). See docs/Changelog.md for full details.
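A minimal sketch of how the two knobs might be set from Python for fail-fast debugging. The variable names and the 20 s default come from the note above; the helper name `fast_fail_env` is hypothetical, and the idea that the variables should be set before the first s3dlio call is an assumption, not something the release note states.

```python
import os

def fast_fail_env(connect_timeout_secs=20, max_retry_attempts=1):
    """Build env-var overrides for fail-fast debugging (hypothetical helper)."""
    if connect_timeout_secs <= 0 or max_retry_attempts < 1:
        raise ValueError("timeout must be positive and attempts must be >= 1")
    return {
        "S3DLIO_CONNECT_TIMEOUT_SECS": str(connect_timeout_secs),
        "S3DLIO_MAX_RETRY_ATTEMPTS": str(max_retry_attempts),
    }

# Assumption: apply the overrides before the first s3dlio call in the process.
os.environ.update(fast_fail_env())
print({k: os.environ[k] for k in fast_fail_env()})
```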

v0.9.100 (prior): General-purpose object data loader — PyDataset.from_uris(), items(), collect_batch(), skip_head HEAD optimisation. See docs/Python_Data-Loader.md.

v0.9.98 (prior): ParquetRowGroupDataset — epoch-aware Parquet DataLoader with epoch-2 fast path and Arrow IPC decode. See docs/Changelog.md for full history.

📦 Installation

Quick Install (Python)

# If using uv package manager + uv virtual environment:
uv pip install s3dlio

# If using pip without uv:
pip install s3dlio

Python Backend Profiles (PyPI vs Full Build)

  • If using uv package manager + uv virtual environment: uv pip install s3dlio.
  • If using standard pip without uv: pip install s3dlio.
  • The default published wheel is now S3-focused (Azure Blob and GCS are excluded).
  • If you want full backends (S3 + Azure Blob + GCS), build from source with:
# uv workflow:
uv pip install s3dlio --no-binary s3dlio --config-settings "cargo-extra-args=--features extension-module,full-backends"

# pip-only workflow:
pip install s3dlio --no-binary s3dlio --config-settings "cargo-extra-args=--features extension-module,full-backends"

You can still add a separate package name (for example s3dlio-full) later if you want a dedicated prebuilt full wheel distribution.

Maintainer note: for PyPI uploads, publish the default (./build_pyo3.sh) wheel unless intentionally releasing a separate distribution. full-backends is currently source-build only via the command above.

Building from Source (Rust)

System Dependencies

s3dlio requires some system libraries to build. Only OpenSSL and pkg-config are required by default. HDF5 and hwloc are optional and improve functionality but are not needed for the core library:

Ubuntu/Debian:

# Quick install - run our helper script
./scripts/install-system-deps.sh

# Or manually (required only):
sudo apt-get install -y build-essential pkg-config libssl-dev

# Optional - for NUMA topology support (--features numa):
sudo apt-get install -y libhwloc-dev

# Optional - for HDF5 data format support (--features hdf5):
sudo apt-get install -y libhdf5-dev

# All optional libraries at once:
sudo apt-get install -y libhdf5-dev libhwloc-dev cmake

RHEL/CentOS/Fedora/Rocky/AlmaLinux:

# Quick install
./scripts/install-system-deps.sh

# Or manually (required only):
sudo dnf install -y gcc gcc-c++ make pkg-config openssl-devel

# Optional - for NUMA topology support:
sudo dnf install -y hwloc-devel

# Optional - for HDF5 data format support:
sudo dnf install -y hdf5-devel

# All optional libraries at once:
sudo dnf install -y hdf5-devel hwloc-devel cmake

macOS:

# Quick install
./scripts/install-system-deps.sh

# Or manually (required only):
brew install pkg-config openssl@3

# Optional - for NUMA/HDF5 support:
brew install hdf5 hwloc cmake

# Set environment variables (add to ~/.zshrc or ~/.bash_profile):
export PKG_CONFIG_PATH="$(brew --prefix openssl@3)/lib/pkgconfig:$PKG_CONFIG_PATH"
export OPENSSL_DIR="$(brew --prefix openssl@3)"

Arch Linux:

# Quick install
./scripts/install-system-deps.sh

# Or manually (required only):
sudo pacman -S base-devel pkg-config openssl

# Optional - for NUMA/HDF5 support:
sudo pacman -S hdf5 hwloc cmake

WSL (Windows Subsystem for Linux) / Minimal Environments:

If you are building on WSL or any environment where libhdf5 or libhwloc may not be available, s3dlio builds without them by default. No extra libraries are required:

# Just the basics - works on WSL, Docker, CI, and minimal installs:
sudo apt-get install -y build-essential pkg-config libssl-dev
cargo build --release
# install Python package (no system HDF5/hwloc needed):
# uv workflow:
uv pip install s3dlio
# pip-only workflow:
pip install s3dlio

Install Rust (if not already installed)

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env

Build s3dlio

# Clone the repository
git clone https://github.com/russfellows/s3dlio.git
cd s3dlio

# Build with default features (no HDF5 or NUMA required)
cargo build --release

# Build s3-cli with all cloud backends enabled (AWS + Azure + GCS)
cargo build --release --bin s3-cli --features full-backends

# Build s3-cli with GCS enabled only (plus default backends)
cargo build --release --bin s3-cli --features backend-gcs

# Build with NUMA topology support (requires libhwloc-dev)
cargo build --release --features numa

# Build with HDF5 data format support (requires libhdf5-dev)
cargo build --release --features hdf5

# Build with all optional features
cargo build --release --features numa,hdf5

# Run tests
cargo test

# Build Python bindings (optional)
./build_pyo3.sh

# Build Python bindings with full backends (S3 + Azure + GCS)
./build_pyo3.sh full

# Named profile form is also supported:
./build_pyo3.sh --profile full
./build_pyo3.sh --profile default

# Show profile/help usage
./build_pyo3.sh --help

Build Profile Quick Reference

Rust backend feature profiles:

  • Default build (cargo build --release): S3-focused default backend set.
  • GCS-enabled build (--features backend-gcs): enables GCS in addition to default set.
  • Full cloud build (--features full-backends): enables AWS + Azure + GCS.

Python wheel build profiles via build_pyo3.sh:

  • default or slim: AWS + file/direct; excludes Azure and GCS.
  • full: AWS + Azure + GCS + file/direct.
  • Positional and named forms are equivalent:
    • ./build_pyo3.sh full
    • ./build_pyo3.sh -p full
    • ./build_pyo3.sh --profile full

Optional extra Rust features for wheel builds can still be passed with EXTRA_FEATURES. Example: EXTRA_FEATURES="numa,hdf5" ./build_pyo3.sh full.

Note: NUMA support (--features numa) improves multi-socket performance but requires the hwloc2 C library. HDF5 support (--features hdf5) enables HDF5 data format generation but requires libhdf5. Both are optional and s3dlio is fully functional without them.

Platform support: s3dlio builds natively on Linux (x86_64, aarch64), macOS (x86_64 and Apple Silicon arm64), and WSL. Making numa and hdf5 optional was the key change for broad platform support — all remaining dependencies are pure Rust or use platform-independent system libraries (OpenSSL). To cross-compile Python wheels for Linux ARM64 from an x86_64 host, see build_pyo3.sh for instructions using the --zig linker. For macOS universal2 (fat binary covering both architectures), see the commented section in build_pyo3.sh.

✨ Key Features

  • High Performance: Multi-GB/s read and write throughput on platforms with sufficient network and storage capabilities
  • Zero-Copy Architecture: bytes::Bytes throughout for minimal memory overhead
  • Multi-Protocol: S3, Azure Blob, GCS, file://, direct:// (O_DIRECT)
  • HTTP/2 Support (opt-in): HTTP/2 is available but HTTP/1.1 is the default — HTTP/2 is almost always slower for bulk storage workloads. Opt in by setting S3DLIO_H2C=1. Both TLS ALPN (https://) and cleartext h2c (http://) are supported when enabled. See docs/HTTP2_ALPN_INVESTIGATION.md.
  • Python & Rust: Native Rust library with zero-copy Python bindings (PyO3), bytearray support for efficient memory management
  • Multi-Endpoint Load Balancing: RoundRobin/LeastConnections across storage endpoints
  • AI/ML Ready: PyTorch DataLoader integration, TFRecord/NPZ format support
  • Parquet DataLoader: Per-row-group epoch-aware DataLoader for any s3dlio storage backend (S3, Azure Blob, GCS, file://, direct://). Epoch-2 zero re-fetch speedup (2.5×+), Raw + Arrow IPC decode modes, 8-worker concurrency with shared metadata caches — see guide
  • High-Speed Data Generation: 50+ GB/s test data with configurable compression/dedup

🌟 Latest Release

v0.9.98 (May 2026) — Parquet DataLoader with epoch-2 fast path and Arrow IPC decode. See docs/Changelog.md.

Recent highlights:

  • v0.9.98 - Parquet DataLoader (ParquetRowGroupDataset): per-row-group Dataset, epoch-2 zero-re-fetch (2.5× speedup proven), Raw + ArrowIpc decode modes, 8-worker shared caches; 648 tests passing
  • v0.9.97 - XorStream (dedup-safe, ~15 GB/s/core); S3DLIO_UNSIGNED_PAYLOAD opt-in for private S3-compatible endpoints; 613 tests passing
  • v0.9.92 - MPU coordinator task, auto-scale, async write/flush/finish safety fixes, MAX_MULTIPART_PARTS guard; 580 tests passing
  • v0.9.90 - Full NVIDIA AIStore support (S3DLIO_FOLLOW_REDIRECTS=1) with all TLS security policies; HTTP/2 available (opt-in via S3DLIO_H2C=1, not the default); 5 issues closed (#126, #133, #134, #135, #136); 559 tests passing
  • v0.9.86 - Redirect follower for NVIDIA AIStore (S3 path); HTTPS→HTTP downgrade prevention; 21 new redirect tests; redirect security analysis documented
  • v0.9.84 - HEAD elimination (ObjectSizeCache); OnceLock env-var caching; lock-free range assembly; AWS_CA_BUNDLE_PATH → AWS_CA_BUNDLE; structured tracing
  • v0.9.80 - Python list hang fix (IMDSv2 legacy call removed); tracing deadlock fix (tokio::spawn → inline stream); async S3 delete/bucket helpers; deprecated Python APIs cleaned up

📖 Complete Changelog - Full version history, migration guides, API details


📚 Version History

For detailed release notes and migration guides, see the Complete Changelog.


Storage Backend Support

Universal Backend Architecture

s3dlio provides unified storage operations across all backends with consistent URI patterns:

  • 🗄️ Amazon S3: s3://bucket/prefix/ - High-performance S3 operations (5+ GB/s reads, 2.5+ GB/s writes) with built-in concurrent range GETs (on by default)
  • ☁️ Azure Blob Storage: az://container/prefix/ - Complete Azure integration with RangeEngine (30-50% faster for large blobs)
  • 🌐 Google Cloud Storage: gs://bucket/prefix/ or gcs://bucket/prefix/ - Production ready with RangeEngine and full ObjectStore integration
  • 📁 Local File System: file:///path/to/directory/ - High-speed local file operations with RangeEngine support
  • ⚡ DirectIO: direct:///path/to/directory/ - Bypass OS cache for maximum I/O performance with RangeEngine
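The URI patterns above imply a simple scheme-to-backend dispatch. The sketch below illustrates that mapping with the standard library only; `backend_for` and the `BACKENDS` table are hypothetical names and are not part of the s3dlio API.

```python
from urllib.parse import urlparse

# Scheme-to-backend mapping taken from the list above.
BACKENDS = {
    "s3": "Amazon S3",
    "az": "Azure Blob Storage",
    "gs": "Google Cloud Storage",
    "gcs": "Google Cloud Storage",
    "file": "Local File System",
    "direct": "DirectIO",
}

def backend_for(uri):
    """Return the backend name for a URI, or raise ValueError if unknown."""
    scheme = urlparse(uri).scheme
    if scheme not in BACKENDS:
        raise ValueError(f"unsupported URI scheme: {scheme!r}")
    return BACKENDS[scheme]

print(backend_for("gs://bucket/prefix/"))
print(backend_for("direct:///path/to/directory/"))
```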

Concurrent Range GET Performance Features (v0.9.3+, Updated v0.9.60)

Concurrent range downloads hide network latency by parallelizing HTTP range requests.

All backends support concurrent range GETs — but via two different mechanisms:

Mechanism 1 — S3 built-in (on by default, v0.9.60+)

  • Amazon S3: Concurrent range splitting enabled by default via S3DLIO_ENABLE_RANGE_OPTIMIZATION (default: on). Uses get_object_concurrent_range_async() — fires parallel GetObject(Range: bytes=N-M) requests via the AWS SDK with lock-free chunk assembly. Controlled by S3DLIO_RANGE_THRESHOLD_MB (default: 32 MiB) and S3DLIO_RANGE_CONCURRENCY (default: auto-scaled). Disable with S3DLIO_ENABLE_RANGE_OPTIMIZATION=0.
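To make the range-splitting idea concrete, here is a small sketch that divides an object into inclusive byte ranges and renders the matching `Range` header values. It is not the implementation behind get_object_concurrent_range_async(); the function names and the 32 MiB part size are assumptions chosen for illustration.

```python
def split_ranges(object_size, part_size):
    """Split [0, object_size) into inclusive (start, end) byte ranges."""
    if object_size < 0 or part_size <= 0:
        raise ValueError("object_size must be >= 0 and part_size must be > 0")
    return [
        (start, min(start + part_size, object_size) - 1)
        for start in range(0, object_size, part_size)
    ]

def range_headers(object_size, part_size):
    """Render each range as an HTTP Range header value."""
    return [f"bytes={s}-{e}" for s, e in split_ranges(object_size, part_size)]

MIB = 1024 * 1024
# An 80 MiB object with 32 MiB parts yields two full parts and a 16 MiB tail.
for header in range_headers(80 * MIB, 32 * MIB):
    print(header)
```

Each header would correspond to one parallel GetObject request; the responses are then written into the output buffer at their `start` offsets.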

Mechanism 2 — RangeEngine (per-store config flag, must enable explicitly)

  • Azure Blob Storage: 30-50% faster for large files (enable_range_engine: true in AzureConfig)
  • Google Cloud Storage: 30-50% faster for large files (enable_range_engine: true in GcsConfig)
  • ⚠️ Local File System: Rarely beneficial due to seek overhead (disabled by default)
  • ⚠️ DirectIO: Rarely beneficial due to O_DIRECT overhead (disabled by default)

RangeEngine config flag defaults (v0.9.6+):

  • Status: enable_range_engine: false by default in all per-store config structs
  • Reason: Extra HEAD request on every GET causes ~50% slowdown for small-object workloads
  • Threshold: 32 MiB default (tunable per-store via RangeEngineConfig::min_split_size)
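A back-of-the-envelope model shows why an extra HEAD hurts small objects but is negligible for large ones. This is a simplified sketch under stated assumptions (one round trip per request, 1 ms round-trip time, 1 GB/s link, no concurrency); the numbers are illustrative and are not measurements from s3dlio.

```python
def get_time(size_bytes, rtt_s, bytes_per_s, extra_head=False):
    """Seconds for one GET under a latency + transfer-time model."""
    round_trips = 2 if extra_head else 1  # HEAD adds one full round trip
    return round_trips * rtt_s + size_bytes / bytes_per_s

def slowdown(size_bytes, rtt_s=0.001, bytes_per_s=1e9):
    """Fraction of throughput lost when every GET is preceded by a HEAD."""
    base = get_time(size_bytes, rtt_s, bytes_per_s)
    with_head = get_time(size_bytes, rtt_s, bytes_per_s, extra_head=True)
    return 1.0 - base / with_head

small = slowdown(64 * 1024)           # 64 KiB object
large = slowdown(256 * 1024 * 1024)   # 256 MiB object
print(f"64 KiB: {small:.0%} slower, 256 MiB: {large:.1%} slower")
```

Under these assumptions the small object loses roughly half its throughput, while the large object loses well under one percent, which matches the guidance to enable RangeEngine only for large-file workloads.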

How to Enable for Large-File Workloads:

use s3dlio::object_store::{AzureObjectStore, AzureConfig};

let config = AzureConfig {
    enable_range_engine: true,  // Explicitly enable for large files
    ..Default::default()
};
let store = AzureObjectStore::with_config(config);

When to Enable:

  • ✅ Large-file workloads (average size >= 32 MiB)
  • ✅ High-bandwidth, high-latency networks
  • ❌ Mixed or small-object workloads
  • ❌ Local file systems

S3 Backend Options

s3dlio supports two S3 backend implementations. Native AWS SDK is the default and recommended for production use:

# Default: Native AWS SDK backend (RECOMMENDED for production)
cargo build --release
# or explicitly:
cargo build --no-default-features --features native-backends

# Experimental: Apache Arrow object_store backend (optional, for testing)
cargo build --no-default-features --features arrow-backend

Why native-backends is default:

  • Proven performance in production workloads
  • Optimized for high-throughput S3 operations (5+ GB/s reads, 2.5+ GB/s writes)
  • Well-tested with MinIO, Vast, and AWS S3

About arrow-backend:

  • Experimental alternative implementation
  • No proven performance advantage over native backend
  • Useful for comparison testing and development
  • Not recommended for production use

GCS Backend Options (Current)

GCS is now optional at build time.

  • Default build (cargo build --release) does not include GCS.
  • To include GCS, enable backend-gcs (or full-backends).
  • When enabled, s3dlio uses the official Google crates (google-cloud-storage + gax) from a patched fork maintained for s3dlio.
# Default build (S3-focused; no GCS)
cargo build --release

# Enable GCS explicitly
cargo build --release --features backend-gcs

# Enable all cloud backends (AWS + Azure + GCS)
cargo build --release --features full-backends

Patched official GCS fork used by s3dlio:

Legacy note: gcs-community remains as a legacy opt-in path, but the primary supported path is the official Google crates from the patched russfellows/google-cloud-rust fork.

Quick Start

Installation

Rust CLI:

git clone https://github.com/russfellows/s3dlio.git
cd s3dlio
cargo build --release

# Full cloud backend CLI build:
cargo build --release --bin s3-cli --features full-backends

Python Library:

# uv workflow:
uv pip install s3dlio

# pip-only workflow:
pip install s3dlio

# or build from source:
./build_pyo3.sh && ./install_pyo3_wheel.sh

# build from source with full cloud backends:
./build_pyo3.sh --profile full && ./install_pyo3_wheel.sh

Documentation

Core Capabilities

🚀 Universal Copy Operations

s3dlio treats upload and download as enhanced versions of the Unix cp command, working across all storage backends:

CLI Usage:

# Upload to any backend with real-time progress
s3-cli upload /local/data/*.log s3://mybucket/logs/
s3-cli upload /local/files/* az://container/data/  
s3-cli upload /local/models/* gs://ml-bucket/models/
s3-cli upload /local/backup/* file:///remote-mount/backup/
s3-cli upload /local/cache/* direct:///nvme-storage/cache/

# Download from any backend  
s3-cli download s3://bucket/data/ ./local-data/
s3-cli download az://container/logs/ ./logs/
s3-cli download gs://ml-bucket/datasets/ ./datasets/
s3-cli download file:///network-storage/data/ ./data/

# Cross-backend copying workflow
s3-cli download s3://source-bucket/data/ ./temp/
s3-cli upload ./temp/* gs://dest-bucket/data/
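The same two-step workflow can be sketched in Python. The helper below stages objects in a temporary directory and takes the download and upload callables as parameters, so it can be exercised without s3dlio installed; `copy_across` is a hypothetical name, and the assumed call shapes (`download(uri, directory)` and `upload(list_of_paths, uri)`) follow the Python Integration examples in this README.

```python
import os
import tempfile

def copy_across(src_uri, dst_uri, download, upload):
    """Stage objects from src_uri in a temp dir, then upload them to dst_uri."""
    with tempfile.TemporaryDirectory() as staging:
        download(src_uri, staging)
        # Glob pattern selects everything that was staged.
        upload([os.path.join(staging, "*")], dst_uri)

# With s3dlio installed, the call would look like this (assumption):
#   import s3dlio
#   copy_across("s3://source-bucket/data/", "gs://dest-bucket/data/",
#               s3dlio.download, s3dlio.upload)
```

The temporary directory is removed when the function returns, so the staged copy does not linger on local disk.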

Advanced Pattern Matching:

# Glob patterns for file selection (upload)
s3-cli upload "/data/*.log" s3://bucket/logs/
s3-cli upload "/files/data_*.csv" az://container/data/

# Regex patterns for listing (use single quotes to prevent shell expansion)
s3-cli ls -r s3://bucket/ -p '.*\.txt$'           # Only .txt files
s3-cli ls -r gs://bucket/ -p '.*\.(csv|json)$'    # CSV or JSON files
s3-cli ls -r az://acct/cont/ -p '.*/data_.*'      # Files with "data_" in path

# Count objects matching pattern (with progress indicator)
s3-cli ls -rc gs://bucket/data/ -p '.*\.npz$'
# Output: ⠙ [00:00:05] 71,305 objects (14,261 obj/s)
#         Total objects: 142,610 (10.0s, rate: 14,261 objects/s)

# Delete only matching files
s3-cli delete -r s3://bucket/logs/ -p '.*\.log$'

See CLI Guide for complete command reference and pattern syntax.
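To preview which keys a pattern would select before running a listing or delete, the patterns above can be tried locally. This sketch uses Python's `re` module as a stand-in; it assumes the CLI's regex engine treats these simple patterns the same way, and the sample keys are made up.

```python
import re

# Hypothetical object keys for illustration.
keys = [
    "logs/app.txt",
    "data/data_01.csv",
    "data/meta.json",
    "models/w.npz",
    "notes.txt.bak",
]

def matching(keys, pattern):
    """Return the keys that the regex pattern matches."""
    rx = re.compile(pattern)
    return [k for k in keys if rx.search(k)]

print(matching(keys, r".*\.txt$"))          # only keys ending in .txt
print(matching(keys, r".*\.(csv|json)$"))   # CSV or JSON keys
print(matching(keys, r".*/data_.*"))        # keys with "/data_" in the path
```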

🐍 Python Integration

High-Performance Data Operations:

import s3dlio

# Universal upload/download across all backends
s3dlio.upload(['/local/data.csv'], 's3://bucket/data/')
s3dlio.upload(['/local/logs/*.log'], 'az://container/logs/')  
s3dlio.upload(['/local/models/*.pt'], 'gs://ml-bucket/models/')
s3dlio.download('s3://bucket/data/', './local-data/')
s3dlio.download('gs://ml-bucket/datasets/', './datasets/')

# High-level AI/ML operations
dataset = s3dlio.create_dataset("s3://bucket/training-data/")
loader = s3dlio.create_async_loader("gs://ml-bucket/data/", {"batch_size": 32})

# PyTorch integration
from s3dlio.torch import S3IterableDataset
from torch.utils.data import DataLoader

dataset = S3IterableDataset("gs://bucket/data/", loader_opts={})
dataloader = DataLoader(dataset, batch_size=16)

Streaming & Compression:

# High-performance streaming with compression
options = s3dlio.PyWriterOptions()
options.compression = "zstd"
options.compression_level = 3

writer = s3dlio.create_s3_writer('s3://bucket/data.zst', options)
writer.write_chunk(large_data_bytes)
stats = writer.finalize()  # Returns (bytes_written, compressed_bytes)

# Data generation with configurable modes
s3dlio.put("s3://bucket/test-data-{}.bin", num=1000, size=4194304, 
          data_gen_mode="streaming")  # 2.6-3.5x faster for most cases

XorStream — dedup-safe data generation (v0.9.97+):

XorStream generates unique, incompressible data at ~15 GB/s per core without Rayon thread management overhead. Every fill() / generate() call is guaranteed to produce a different 512-byte-block-level fingerprint. Ideal for PUT-heavy benchmarks where many worker threads share a single generator.

import s3dlio
import numpy as np

stream = s3dlio.XorStream()

# --- Fastest path: in-place fill into pre-allocated bytearray ---
buf = bytearray(8 * 1024 * 1024)   # 8 MiB working buffer
stream.fill(buf)                    # fill once — unique payload
stream.fill(buf)                    # fill again — different payload, guaranteed

print(stream.objects_generated)     # == 2

# --- Convenience path: allocate + fill in one call ---
data = stream.generate(8 * 1024 * 1024)  # returns BytesView
view = memoryview(data)                   # zero-copy Python view
arr  = np.frombuffer(view, dtype=np.uint8)  # zero-copy numpy array

# --- PUT with XorStream data (benchmark pattern) ---
import threading

def worker(stream, uri_template, n):
    buf = bytearray(8 * 1024 * 1024)
    for i in range(n):
        stream.fill(buf)       # reuse buffer, unique data each time
        s3dlio.put_bytes(buf, uri_template.format(i))

threads = [threading.Thread(target=worker, args=(stream, "s3://bucket/obj-{}.bin", 100))
           for _ in range(32)]
for t in threads: t.start()
for t in threads: t.join()

| Scenario | Best choice |
| --- | --- |
| High concurrency PUT (≥ 32 workers) | XorStream: no Rayon scheduling |
| Medium objects 1–32 MiB | XorStream: no per-call allocation |
| Controllable compress/dedup ratios | generate_data() / Generator |
| Very large objects (≥ 256 MiB) | Generator.fill_chunk() |
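One way to sanity-check a dedup-safety claim is to hash a buffer in 512-byte blocks and count repeats. The sketch below does that with the standard library only; `duplicate_blocks` is a hypothetical helper, and random and zero-filled buffers stand in for generator output so the example runs without s3dlio. With s3dlio installed, a bytearray filled by stream.fill(buf) could be passed in the same way.

```python
import hashlib
import os

def duplicate_blocks(buf, block_size=512):
    """Count blocks whose SHA-256 digest already appeared earlier in buf."""
    seen = set()
    dupes = 0
    view = memoryview(buf)
    for off in range(0, len(view), block_size):
        digest = hashlib.sha256(view[off:off + block_size]).digest()
        if digest in seen:
            dupes += 1
        else:
            seen.add(digest)
    return dupes

random_buf = os.urandom(64 * 1024)   # stand-in for unique data
zero_buf = bytes(64 * 1024)          # worst case: every block identical
print(duplicate_blocks(random_buf), duplicate_blocks(zero_buf))
```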

Multi-Endpoint Load Balancing (v0.9.14+):

import numpy as np
import s3dlio

# Distribute I/O across multiple storage endpoints
store = s3dlio.create_multi_endpoint_store(
    uris=[
        "s3://bucket-1/data",
        "s3://bucket-2/data",
        "s3://bucket-3/data",
    ],
    strategy="least_connections"  # or "round_robin"
)

# Zero-copy data access (memoryview compatible)
data = store.get("s3://bucket-1/file.bin")
array = np.frombuffer(memoryview(data), dtype=np.float32)

# Monitor load distribution
stats = store.get_endpoint_stats()
for i, s in enumerate(stats):
    print(f"Endpoint {i}: {s['requests']} requests, {s['bytes_transferred']} bytes")

📖 Complete Multi-Endpoint Guide - Load balancing, configuration, use cases
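The two strategies differ in what they track. The sketch below is a plain-Python illustration of round-robin and least-connections selection; the class names are hypothetical and this is not s3dlio's internal implementation.

```python
import itertools

class RoundRobin:
    """Cycle through endpoints in a fixed order, ignoring load."""
    def __init__(self, endpoints):
        self._cycle = itertools.cycle(list(endpoints))

    def pick(self):
        return next(self._cycle)

class LeastConnections:
    """Pick the endpoint with the fewest requests currently in flight."""
    def __init__(self, endpoints):
        self.in_flight = {ep: 0 for ep in endpoints}

    def pick(self):
        # min() returns the first endpoint among ties (dict insertion order).
        ep = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[ep] += 1
        return ep

    def done(self, ep):
        self.in_flight[ep] -= 1

endpoints = ["s3://bucket-1/data", "s3://bucket-2/data", "s3://bucket-3/data"]
rr = RoundRobin(endpoints)
print([rr.pick() for _ in range(4)])

lc = LeastConnections(endpoints)
first = lc.pick()
second = lc.pick()
lc.done(first)      # first request completes, freeing that endpoint
print(lc.pick())    # the freed endpoint is chosen again
```

Round-robin suits uniform request sizes; least-connections adapts when some requests take much longer than others.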

📦 Parquet DataLoader — Epoch-Aware Training (v0.9.98+)

The Parquet DataLoader provides per-row-group streaming for Parquet files on any s3dlio-accessible storage — S3, Azure Blob, GCS, file://, and direct:// (O_DIRECT; tested and working). Only the URI prefix changes; no code changes needed to switch backends. Two decode modes: Raw (Python decodes) and ArrowIpc (Rust decodes to Arrow IPC bytes). Zero footer re-fetches on epoch 2+ (row-group byte ranges cached in a process-global DashMap after epoch 1; backend-agnostic).

import io

import pyarrow as pa
import pyarrow.parquet  # also binds the name `pyarrow` used below
import s3dlio

# Works with any s3dlio URI — just change the prefix:
#   "s3://bucket/train/"           Amazon S3 / MinIO / Ceph
#   "az://container/train/"        Azure Blob Storage
#   "gs://bucket/train/"           Google Cloud Storage
#   "file:///mnt/data/train/"      Local filesystem
#   "direct:///mnt/nvme/train/"    Local O_DIRECT (bypass page cache)

# Raw mode — Python decodes with PyArrow (default)
loader = s3dlio.create_async_loader(
    "s3://bucket/train/",
    {"format": "parquet", "prefetch": 32}
)
for item in loader:
    # item["data"]: bytes, item["uri"]: str, item["rg_idx"]: int
    table = pyarrow.parquet.read_table(io.BytesIO(item["data"]))

# Arrow IPC mode — Rust decodes, Python gets ready-to-use RecordBatch bytes
loader = s3dlio.create_async_loader(
    "direct:///mnt/nvme/train/",   # same API, different backend
    {"format": "parquet", "decode": "arrow", "prefetch": 32}
)
for item in loader:
    batch = pa.ipc.open_stream(pa.py_buffer(item["data"])).read_next_batch()

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| "format" | str | | Must be "parquet" to activate Parquet mode |
| "decode" | str | "raw" | "raw" or "arrow" (Rust-side decode) |
| "columns" | list[int] | None | Column subset; None = all columns |
| "footer_cap" | int | 4 MiB | Bytes from file tail for footer parsing |
| "prefetch" | int | 32 | Concurrent in-flight row-group GETs |

Epoch-2+ speedup — measured against live MinIO:

| Epoch | Construction time | Notes |
| --- | --- | --- |
| 1 | 20.4 ms | list_objects + footer GETs |
| 2+ | 8.3 ms | list_objects only; 2.5× faster |

Memory: only metadata in RAM; 8 concurrent workers share process-global caches (no 8× duplication).
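The epoch-2 fast path comes from caching footer metadata once per process. As a rough illustration of the idea, the sketch below uses a lock-protected Python dict in place of the Rust DashMap; `FooterCache` and `fake_fetch_footer` are hypothetical names, and the byte ranges are made up.

```python
import threading

class FooterCache:
    """Process-wide cache of per-file row-group byte ranges (illustrative)."""
    def __init__(self, fetch_footer):
        self._fetch_footer = fetch_footer
        self._ranges = {}
        self._lock = threading.Lock()
        self.fetches = 0  # number of footer fetches actually performed

    def row_groups(self, uri):
        with self._lock:
            cached = self._ranges.get(uri)
        if cached is not None:
            return cached                  # epoch 2+: no footer GET
        ranges = self._fetch_footer(uri)   # epoch 1: one footer GET per file
        with self._lock:
            self.fetches += 1
            # Two racing workers may both fetch; the first stored value wins.
            return self._ranges.setdefault(uri, ranges)

def fake_fetch_footer(uri):
    # Stand-in for a ranged GET of the Parquet footer.
    return [(4, 1_000_003), (1_000_004, 2_000_003)]

cache = FooterCache(fake_fetch_footer)
files = ["s3://bucket/train/a.parquet", "s3://bucket/train/b.parquet"]
for epoch in (1, 2, 3):
    for uri in files:
        cache.row_groups(uri)
print(cache.fetches)  # footers were fetched only during epoch 1
```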

📖 Complete Parquet DataLoader Guide

Performance

Benchmark Results

Peak throughput and relative speedups reported for s3dlio:

| Operation | Performance | Notes |
| --- | --- | --- |
| S3 PUT | Up to 3.089 GB/s | Exceeds steady-state baseline by 17.8% |
| S3 GET | Up to 4.826 GB/s | Near line-speed performance |
| Multi-Process | 2-3x faster | Improvement over single process |
| Streaming Mode | 2.6-3.5x faster | For 1-8MB objects vs single-pass |

Optimization Features

  • HTTP/2 Support (opt-in): HTTP/2 is supported but not the default — HTTP/1.1 is used unless you set S3DLIO_H2C=1. HTTP/2 is available for scenarios that benefit from multiplexing, but is typically slower for bulk storage workloads.
  • Intelligent Defaults: Streaming mode automatically selected based on benchmarks
  • Multi-Process Architecture: Massive parallelism for maximum performance
  • Zero-Copy Streaming: Memory-efficient operations for large datasets
  • Configurable Chunk Sizes: Fine-tune performance for your workload

Checkpoint system for model states

store = s3dlio.PyCheckpointStore('file:///tmp/checkpoints/')
store.save('model_state', your_model_data)
loaded_data = store.load('model_state')


**Ready for Production**: All core functionality validated, comprehensive test suite, and honest documentation matching actual capabilities.

## Configuration & Tuning

### Environment Variables
s3dlio supports comprehensive configuration through environment variables:

- **NVIDIA AIStore**: `S3DLIO_FOLLOW_REDIRECTS=1` - Enable HTTP 307 redirect following for AIStore (opt-in, disabled by default); `S3DLIO_REDIRECT_MAX=5` - Maximum redirect hops per request
- **HTTP/2 mode**: `S3DLIO_H2C=1` - Opt in to HTTP/2 (h2c cleartext on http:// endpoints); `S3DLIO_H2C=0` or unset - Use HTTP/1.1 (default)
- **Runtime Scaling**: `S3DLIO_RT_THREADS=32` - Tokio worker threads
- **Connection Pool**: `S3DLIO_POOL_MAX_IDLE_PER_HOST=32` - Max idle connections per host (default: 32)
- **S3 Range GET**: `S3DLIO_ENABLE_RANGE_OPTIMIZATION=0` - Disable concurrent range splitting (enabled by default); `S3DLIO_RANGE_THRESHOLD_MB=64` - Size threshold in MiB (default: 32); `S3DLIO_RANGE_CONCURRENCY=64` - Max concurrent range requests
- **Operation Logging**: `S3DLIO_OPLOG_LEVEL=2` - S3 operation tracking

📖 [Environment Variables Reference](docs/Environment_Variables.md)
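As a sketch of how the S3 Range GET settings combine, the helper below reads the two documented variables with their documented defaults and decides whether an object of a given size would be split. It mirrors the description above rather than the library's own parser; the function names, the treatment of any value other than "0" as enabled, and the inclusive threshold comparison are assumptions.

```python
import os

MIB = 1024 * 1024

def range_get_settings(env=None):
    """Read the documented S3 range-GET variables with their defaults."""
    env = os.environ if env is None else env
    enabled = env.get("S3DLIO_ENABLE_RANGE_OPTIMIZATION", "1") != "0"
    threshold_mb = int(env.get("S3DLIO_RANGE_THRESHOLD_MB", "32"))
    return {"enabled": enabled, "threshold_bytes": threshold_mb * MIB}

def uses_range_split(object_size, env=None):
    """True if an object of this size would be fetched with range splitting."""
    settings = range_get_settings(env)
    return settings["enabled"] and object_size >= settings["threshold_bytes"]

print(uses_range_split(64 * MIB, {}))                                    # defaults
print(uses_range_split(64 * MIB, {"S3DLIO_ENABLE_RANGE_OPTIMIZATION": "0"}))
print(uses_range_split(40 * MIB, {"S3DLIO_RANGE_THRESHOLD_MB": "64"}))
```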

### Operation Logging (Op-Log)
Universal operation trace logging across all backends with zstd-compressed TSV format, warp-replay compatible.

```python
import s3dlio
s3dlio.init_op_log("operations.tsv.zst")
# All operations automatically logged
s3dlio.finalize_op_log()
```

See S3DLIO OpLog Implementation for detailed usage.

Building from Source

Prerequisites

  • Rust: Install Rust toolchain
  • Python 3.12+: For Python library development
  • UV (recommended): Install UV
  • OpenSSL: Required (libssl-dev on Ubuntu)
  • HDF5 (optional): Only needed with --features hdf5 (libhdf5-dev on Ubuntu, brew install hdf5 on macOS)
  • hwloc (optional): Only needed with --features numa (libhwloc-dev on Ubuntu)

Build Steps

# Python environment
uv venv && source .venv/bin/activate

# Rust CLI
cargo build --release

# Python library
./build_pyo3.sh && ./install_pyo3_wheel.sh

Configuration

Environment Setup

# Required for S3 operations
AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-key
AWS_ENDPOINT_URL=https://your-s3-endpoint
AWS_REGION=us-east-1
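A quick pre-flight check can catch a missing variable before the first S3 call fails. This is a small sketch using only the standard library; `missing_s3_settings` is a hypothetical helper, and the list of required names simply follows the block above.

```python
import os

# Names taken from the environment setup block above.
REQUIRED = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_ENDPOINT_URL",
    "AWS_REGION",
)

def missing_s3_settings(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]

missing = missing_s3_settings()
if missing:
    print("Missing S3 settings:", ", ".join(missing))
else:
    print("All S3 settings present")
```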

Enable comprehensive S3 operation logging compatible with MinIO warp format:

Advanced Features

CPU Profiling & Analysis

cargo build --release --features profiling
cargo run --example simple_flamegraph_test --features profiling

Compression & Streaming

import s3dlio
options = s3dlio.PyWriterOptions()
options.compression = "zstd"
writer = s3dlio.create_s3_writer('s3://bucket/data.zst', options)
writer.write_chunk(large_data)
stats = writer.finalize()

Container Deployment

# Use pre-built container
podman pull quay.io/russfellows-sig65/s3dlio
podman run --net=host --rm -it quay.io/russfellows-sig65/s3dlio

# Or build locally
podman build -t s3dlio .

Note: Always use --net=host for storage backend connectivity.

Documentation & Support

🔗 Related Projects

License

Licensed under the Apache License 2.0 - see LICENSE file.


🚀 Ready to get started? Check out the Quick Start section above or explore our example scripts for common use cases!


