High-performance object storage library for PyTorch, JAX, and TensorFlow: backend support includes S3, Azure, GCS, and file system operations for AI/ML


s3dlio - Universal Storage I/O Library


High-performance, multi-protocol storage library for AI/ML workloads with universal copy operations across S3, Azure, GCS, local file systems, and DirectIO.

v0.9.86 — Redirect-following connector (S3DLIO_FOLLOW_REDIRECTS=1) for transparent NVIDIA AIStore support via S3; scheme-downgrade (HTTPS→HTTP) prevention active; 21 new redirect unit tests. Note: direct AIStore end-to-end testing has not been performed; certificate-pinning security is pending (see docs/security/HTTPS_Redirect_Security_Issues.md).
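The downgrade check such a connector must perform can be sketched in a few lines (illustrative Python only; the actual connector is implemented in Rust, and the function name here is hypothetical):

```python
from urllib.parse import urlparse

def redirect_allowed(original_url: str, location: str) -> bool:
    """Allow a redirect unless it downgrades HTTPS to plain HTTP."""
    orig = urlparse(original_url)
    dest = urlparse(location)
    if not dest.scheme:          # relative redirect keeps the original scheme
        return True
    if orig.scheme == "https" and dest.scheme == "http":
        return False             # scheme downgrade: refuse to follow
    return dest.scheme in ("http", "https")
```

An AIStore proxy typically answers a GET with a 307 redirect to a target node; a safe connector follows it only when a check like this passes.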

📦 Installation

Quick Install (Python)

# If using uv package manager + uv virtual environment:
uv pip install s3dlio

# If using pip without uv:
pip install s3dlio

Python Backend Profiles (PyPI vs Full Build)

  • The default published wheel is now S3-focused (Azure Blob and GCS are excluded).
  • If you want full backends (S3 + Azure Blob + GCS), build from source with:
# uv workflow:
uv pip install s3dlio --no-binary s3dlio --config-settings "cargo-extra-args=--features extension-module,full-backends"

# pip-only workflow:
pip install s3dlio --no-binary s3dlio --config-settings "cargo-extra-args=--features extension-module,full-backends"

A separate package name (for example s3dlio-full) may be added later if a dedicated prebuilt full-backend wheel distribution is wanted.

Maintainer note: for PyPI uploads, publish the default (./build_pyo3.sh) wheel unless intentionally releasing a separate distribution. full-backends is currently source-build only via the command above.

Building from Source (Rust)

System Dependencies

s3dlio requires a few system libraries to build. Only OpenSSL and pkg-config are required by default; HDF5 and hwloc are optional, enabling extra features that are not needed for the core library:

Ubuntu/Debian:

# Quick install - run our helper script
./scripts/install-system-deps.sh

# Or manually (required only):
sudo apt-get install -y build-essential pkg-config libssl-dev

# Optional - for NUMA topology support (--features numa):
sudo apt-get install -y libhwloc-dev

# Optional - for HDF5 data format support (--features hdf5):
sudo apt-get install -y libhdf5-dev

# All optional libraries at once:
sudo apt-get install -y libhdf5-dev libhwloc-dev cmake

RHEL/CentOS/Fedora/Rocky/AlmaLinux:

# Quick install
./scripts/install-system-deps.sh

# Or manually (required only):
sudo dnf install -y gcc gcc-c++ make pkg-config openssl-devel

# Optional - for NUMA topology support:
sudo dnf install -y hwloc-devel

# Optional - for HDF5 data format support:
sudo dnf install -y hdf5-devel

# All optional libraries at once:
sudo dnf install -y hdf5-devel hwloc-devel cmake

macOS:

# Quick install
./scripts/install-system-deps.sh

# Or manually (required only):
brew install pkg-config openssl@3

# Optional - for NUMA/HDF5 support:
brew install hdf5 hwloc cmake

# Set environment variables (add to ~/.zshrc or ~/.bash_profile):
export PKG_CONFIG_PATH="$(brew --prefix openssl@3)/lib/pkgconfig:$PKG_CONFIG_PATH"
export OPENSSL_DIR="$(brew --prefix openssl@3)"

Arch Linux:

# Quick install
./scripts/install-system-deps.sh

# Or manually (required only):
sudo pacman -S base-devel pkg-config openssl

# Optional - for NUMA/HDF5 support:
sudo pacman -S hdf5 hwloc cmake

WSL (Windows Subsystem for Linux) / Minimal Environments:

If you are building on WSL or any environment where libhdf5 or libhwloc may not be available, s3dlio builds without them by default. No extra libraries are required:

# Just the basics - works on WSL, Docker, CI, and minimal installs:
sudo apt-get install -y build-essential pkg-config libssl-dev
cargo build --release
# install Python package (no system HDF5/hwloc needed):
# uv workflow:
uv pip install s3dlio
# pip-only workflow:
pip install s3dlio

Install Rust (if not already installed)

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env

Build s3dlio

# Clone the repository
git clone https://github.com/russfellows/s3dlio.git
cd s3dlio

# Build with default features (no HDF5 or NUMA required)
cargo build --release

# Build s3-cli with all cloud backends enabled (AWS + Azure + GCS)
cargo build --release --bin s3-cli --features full-backends

# Build s3-cli with GCS enabled only (plus default backends)
cargo build --release --bin s3-cli --features backend-gcs

# Build with NUMA topology support (requires libhwloc-dev)
cargo build --release --features numa

# Build with HDF5 data format support (requires libhdf5-dev)
cargo build --release --features hdf5

# Build with all optional features
cargo build --release --features numa,hdf5

# Run tests
cargo test

# Build Python bindings (optional)
./build_pyo3.sh

# Build Python bindings with full backends (S3 + Azure + GCS)
./build_pyo3.sh full

# Named profile form is also supported:
./build_pyo3.sh --profile full
./build_pyo3.sh --profile default

# Show profile/help usage
./build_pyo3.sh --help

Build Profile Quick Reference

Rust backend feature profiles:

  • Default build (cargo build --release): S3-focused default backend set.
  • GCS-enabled build (--features backend-gcs): enables GCS in addition to default set.
  • Full cloud build (--features full-backends): enables AWS + Azure + GCS.

Python wheel build profiles via build_pyo3.sh:

  • default or slim: AWS + file/direct; excludes Azure and GCS.
  • full: AWS + Azure + GCS + file/direct.
  • Positional and named forms are equivalent:
    • ./build_pyo3.sh full
    • ./build_pyo3.sh -p full
    • ./build_pyo3.sh --profile full

Optional extra Rust features for wheel builds can still be passed with EXTRA_FEATURES. Example: EXTRA_FEATURES="numa,hdf5" ./build_pyo3.sh full.

Note: NUMA support (--features numa) improves multi-socket performance but requires the hwloc2 C library. HDF5 support (--features hdf5) enables HDF5 data format generation but requires libhdf5. Both are optional and s3dlio is fully functional without them.

Platform support: s3dlio builds natively on Linux (x86_64, aarch64), macOS (x86_64 and Apple Silicon arm64), and WSL. Making numa and hdf5 optional was the key change for broad platform support — all remaining dependencies are pure Rust or use platform-independent system libraries (OpenSSL). To cross-compile Python wheels for Linux ARM64 from an x86_64 host, see build_pyo3.sh for instructions using the --zig linker. For macOS universal2 (fat binary covering both architectures), see the commented section in build_pyo3.sh.

✨ Key Features

  • High Performance: Multi-GB/s reads and writes on platforms with sufficient network and storage bandwidth
  • Zero-Copy Architecture: bytes::Bytes throughout for minimal memory overhead
  • Multi-Protocol: S3, Azure Blob, GCS, file://, direct:// (O_DIRECT)
  • Python & Rust: Native Rust library with zero-copy Python bindings (PyO3), bytearray support for efficient memory management
  • Multi-Endpoint Load Balancing: RoundRobin/LeastConnections across storage endpoints
  • AI/ML Ready: PyTorch DataLoader integration, TFRecord/NPZ format support
  • High-Speed Data Generation: 50+ GB/s test data with configurable compression/dedup
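The configurable-dedup idea can be illustrated with a tiny generator sketch (hypothetical semantics: roughly one unique block per dedup_factor blocks; the real generator is Rust and far faster):

```python
import os

def generate_blocks(num_blocks: int, block_size: int, dedup_factor: int) -> bytes:
    """Generate test data where roughly 1/dedup_factor of blocks are unique."""
    unique = max(1, num_blocks // dedup_factor)
    pool = [os.urandom(block_size) for _ in range(unique)]
    # Repeat blocks from the pool so a deduplicating store sees the target ratio.
    return b"".join(pool[i % unique] for i in range(num_blocks))
```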

🌟 Latest Release

v0.9.86 (March 2026) - Redirect-following connector for transparent NVIDIA AIStore S3 support (S3DLIO_FOLLOW_REDIRECTS=1); HTTPS→HTTP scheme-downgrade protection active; 21 new redirect unit tests. Note: tested against the AIStore protocol spec but not against a live AIStore cluster. Certificate pinning is pending (see security doc).

Recent highlights:

  • v0.9.86 - Redirect follower for NVIDIA AIStore (S3 path); HTTPS→HTTP downgrade prevention; 21 new redirect tests; redirect security analysis documented
  • v0.9.84 - HEAD elimination (ObjectSizeCache); OnceLock env-var caching; lock-free range assembly; AWS_CA_BUNDLE_PATH/AWS_CA_BUNDLE support; structured tracing
  • v0.9.80 - Python list hang fix (IMDSv2 legacy call removed); tracing deadlock fix (tokio::spawn → inline stream); async S3 delete/bucket helpers; deprecated Python APIs cleaned up
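The HEAD-elimination idea behind ObjectSizeCache is simple to sketch (hypothetical Python; the real cache lives in the Rust core):

```python
class ObjectSizeCache:
    """Memoize object sizes so repeated GETs skip the extra HEAD request."""

    def __init__(self, head_fn):
        self._head_fn = head_fn   # callable: uri -> size in bytes
        self._sizes = {}

    def size_of(self, uri: str) -> int:
        if uri not in self._sizes:
            self._sizes[uri] = self._head_fn(uri)  # one HEAD per unique URI
        return self._sizes[uri]
```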

📖 Complete Changelog - Full version history, migration guides, API details


📚 Version History

For detailed release notes and migration guides, see the Complete Changelog.


Storage Backend Support

Universal Backend Architecture

s3dlio provides unified storage operations across all backends with consistent URI patterns:

  • 🗄️ Amazon S3: s3://bucket/prefix/ - High-performance S3 operations (5+ GB/s reads, 2.5+ GB/s writes)
  • ☁️ Azure Blob Storage: az://container/prefix/ - Complete Azure integration with RangeEngine (30-50% faster for large blobs)
  • 🌐 Google Cloud Storage: gs://bucket/prefix/ or gcs://bucket/prefix/ - Production ready with RangeEngine and full ObjectStore integration
  • 📁 Local File System: file:///path/to/directory/ - High-speed local file operations with RangeEngine support
  • ⚡ DirectIO: direct:///path/to/directory/ - Bypass OS cache for maximum I/O performance with RangeEngine

RangeEngine Performance Features (v0.9.3+, Updated v0.9.6)

Concurrent range downloads hide network latency by parallelizing HTTP range requests.
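The core idea is to split one large GET into several range requests issued concurrently; a minimal planning sketch (the 16 MiB chunk size is an assumption for illustration, not the engine's actual default):

```python
def plan_ranges(object_size: int, chunk_size: int = 16 * 1024 * 1024):
    """Split an object into inclusive (start, end) byte ranges for parallel GETs."""
    ranges = []
    offset = 0
    while offset < object_size:
        end = min(offset + chunk_size, object_size) - 1
        ranges.append((offset, end))
        offset = end + 1
    return ranges
```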

Backends with RangeEngine Support:

  • Azure Blob Storage: 30-50% faster for large files (must enable explicitly)
  • Google Cloud Storage: 30-50% faster for large files (must enable explicitly)
  • Local File System: Rarely beneficial due to seek overhead (disabled by default)
  • DirectIO: Rarely beneficial due to O_DIRECT overhead (disabled by default)
  • 🔄 S3: Enabled by default since version 0.9.70.

Default Configuration (v0.9.6+):

  • Status: Disabled by default (was: enabled in v0.9.5)
  • Reason: Extra HEAD request on every GET causes 50% slowdown for typical workloads
  • Threshold: 32 MiB default threshold for automatic range-GET enablement (tunable via S3DLIO_RANGE_THRESHOLD)
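The threshold decision can be sketched as follows (illustrative only; this assumes S3DLIO_RANGE_THRESHOLD is given in bytes, which may not match the library's exact parsing):

```python
import os

DEFAULT_THRESHOLD = 32 * 1024 * 1024  # 32 MiB

def use_range_engine(object_size: int) -> bool:
    """Decide whether a GET should be split into concurrent range requests."""
    threshold = int(os.environ.get("S3DLIO_RANGE_THRESHOLD", DEFAULT_THRESHOLD))
    return object_size >= threshold
```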

How to Enable for Large-File Workloads:

use s3dlio::object_store::{AzureObjectStore, AzureConfig};

let config = AzureConfig {
    enable_range_engine: true,  // Explicitly enable for large files
    ..Default::default()
};
let store = AzureObjectStore::with_config(config);

When to Enable:

  • ✅ Large-file workloads (average size >= 32 MiB)
  • ✅ High-bandwidth, high-latency networks
  • ❌ Mixed or small-object workloads
  • ❌ Local file systems

S3 Backend Options

s3dlio supports two S3 backend implementations. Native AWS SDK is the default and recommended for production use:

# Default: Native AWS SDK backend (RECOMMENDED for production)
cargo build --release
# or explicitly:
cargo build --no-default-features --features native-backends

# Experimental: Apache Arrow object_store backend (optional, for testing)
cargo build --no-default-features --features arrow-backend

Why native-backends is default:

  • Proven performance in production workloads
  • Optimized for high-throughput S3 operations (5+ GB/s reads, 2.5+ GB/s writes)
  • Well-tested with MinIO, Vast, and AWS S3

About arrow-backend:

  • Experimental alternative implementation
  • No proven performance advantage over native backend
  • Useful for comparison testing and development
  • Not recommended for production use

GCS Backend Options (Current)

GCS is now optional at build time.

  • Default build (cargo build --release) does not include GCS.
  • To include GCS, enable backend-gcs (or full-backends).
  • When enabled, s3dlio uses the official Google crates (google-cloud-storage + gax) from a patched fork maintained for s3dlio.
# Default build (S3-focused; no GCS)
cargo build --release

# Enable GCS explicitly
cargo build --release --features backend-gcs

# Enable all cloud backends (AWS + Azure + GCS)
cargo build --release --features full-backends

Legacy note: gcs-community remains as a legacy opt-in path, but the primary supported path is the official Google crates from the patched russfellows/google-cloud-rust fork.

Quick Start

Installation

Rust CLI:

git clone https://github.com/russfellows/s3dlio.git
cd s3dlio
cargo build --release

# Full cloud backend CLI build:
cargo build --release --bin s3-cli --features full-backends

Python Library:

# uv workflow:
uv pip install s3dlio

# pip-only workflow:
pip install s3dlio

# or build from source:
./build_pyo3.sh && ./install_pyo3_wheel.sh

# build from source with full cloud backends:
./build_pyo3.sh --profile full && ./install_pyo3_wheel.sh

Documentation

Core Capabilities

🚀 Universal Copy Operations

s3dlio treats upload and download as enhanced versions of the Unix cp command, working across all storage backends:

CLI Usage:

# Upload to any backend with real-time progress
s3-cli upload /local/data/*.log s3://mybucket/logs/
s3-cli upload /local/files/* az://container/data/  
s3-cli upload /local/models/* gs://ml-bucket/models/
s3-cli upload /local/backup/* file:///remote-mount/backup/
s3-cli upload /local/cache/* direct:///nvme-storage/cache/

# Download from any backend  
s3-cli download s3://bucket/data/ ./local-data/
s3-cli download az://container/logs/ ./logs/
s3-cli download gs://ml-bucket/datasets/ ./datasets/
s3-cli download file:///network-storage/data/ ./data/

# Cross-backend copying workflow
s3-cli download s3://source-bucket/data/ ./temp/
s3-cli upload ./temp/* gs://dest-bucket/data/
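Conceptually, every command dispatches on the URI scheme; a simplified sketch (illustrative only, not the CLI's actual code):

```python
from urllib.parse import urlparse

BACKENDS = {
    "s3": "Amazon S3",
    "az": "Azure Blob Storage",
    "gs": "Google Cloud Storage",
    "gcs": "Google Cloud Storage",   # gs:// and gcs:// are aliases
    "file": "Local File System",
    "direct": "DirectIO",
}

def backend_for(uri: str) -> str:
    """Map a URI to the storage backend that will handle it."""
    scheme = urlparse(uri).scheme
    if scheme not in BACKENDS:
        raise ValueError(f"unsupported URI scheme: {scheme!r}")
    return BACKENDS[scheme]
```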

Advanced Pattern Matching:

# Glob patterns for file selection (upload)
s3-cli upload "/data/*.log" s3://bucket/logs/
s3-cli upload "/files/data_*.csv" az://container/data/

# Regex patterns for listing (use single quotes to prevent shell expansion)
s3-cli ls -r s3://bucket/ -p '.*\.txt$'           # Only .txt files
s3-cli ls -r gs://bucket/ -p '.*\.(csv|json)$'    # CSV or JSON files
s3-cli ls -r az://acct/cont/ -p '.*/data_.*'      # Files with "data_" in path

# Count objects matching pattern (with progress indicator)
s3-cli ls -rc gs://bucket/data/ -p '.*\.npz$'
# Output: ⠙ [00:00:05] 71,305 objects (14,261 obj/s)
#         Total objects: 142,610 (10.0s, rate: 14,261 objects/s)

# Delete only matching files
s3-cli delete -r s3://bucket/logs/ -p '.*\.log$'

See CLI Guide for complete command reference and pattern syntax.

🐍 Python Integration

High-Performance Data Operations:

import s3dlio

# Universal upload/download across all backends
s3dlio.upload(['/local/data.csv'], 's3://bucket/data/')
s3dlio.upload(['/local/logs/*.log'], 'az://container/logs/')  
s3dlio.upload(['/local/models/*.pt'], 'gs://ml-bucket/models/')
s3dlio.download('s3://bucket/data/', './local-data/')
s3dlio.download('gs://ml-bucket/datasets/', './datasets/')

# High-level AI/ML operations
dataset = s3dlio.create_dataset("s3://bucket/training-data/")
loader = s3dlio.create_async_loader("gs://ml-bucket/data/", {"batch_size": 32})

# PyTorch integration
from s3dlio.torch import S3IterableDataset
from torch.utils.data import DataLoader

dataset = S3IterableDataset("gs://bucket/data/", loader_opts={})
dataloader = DataLoader(dataset, batch_size=16)

Streaming & Compression:

# High-performance streaming with compression
options = s3dlio.PyWriterOptions()
options.compression = "zstd"
options.compression_level = 3

writer = s3dlio.create_s3_writer('s3://bucket/data.zst', options)
writer.write_chunk(large_data_bytes)
stats = writer.finalize()  # Returns (bytes_written, compressed_bytes)

# Data generation with configurable modes
s3dlio.put("s3://bucket/test-data-{}.bin", num=1000, size=4194304, 
          data_gen_mode="streaming")  # 2.6-3.5x faster for most cases

Multi-Endpoint Load Balancing (v0.9.14+):

# Distribute I/O across multiple storage endpoints
store = s3dlio.create_multi_endpoint_store(
    uris=[
        "s3://bucket-1/data",
        "s3://bucket-2/data", 
        "s3://bucket-3/data",
    ],
    strategy="least_connections"  # or "round_robin"
)

# Zero-copy data access (memoryview compatible)
data = store.get("s3://bucket-1/file.bin")
array = np.frombuffer(memoryview(data), dtype=np.float32)

# Monitor load distribution
stats = store.get_endpoint_stats()
for i, s in enumerate(stats):
    print(f"Endpoint {i}: {s['requests']} requests, {s['bytes_transferred']} bytes")

📖 Complete Multi-Endpoint Guide - Load balancing, configuration, use cases
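The least_connections strategy always routes a new request to the endpoint with the fewest in-flight requests; a minimal sketch (not the library's implementation):

```python
class LeastConnectionsBalancer:
    """Pick the endpoint with the fewest in-flight requests."""

    def __init__(self, endpoints):
        self.in_flight = {ep: 0 for ep in endpoints}

    def acquire(self) -> str:
        # Choose the least-loaded endpoint and count the new request against it.
        ep = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[ep] += 1
        return ep

    def release(self, ep: str) -> None:
        # Call when the request completes.
        self.in_flight[ep] -= 1
```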

Performance

Benchmark Results

s3dlio delivers world-class performance across all operations:

Operation        Performance        Notes
S3 PUT           Up to 3.089 GB/s   Exceeds steady-state baseline by 17.8%
S3 GET           Up to 4.826 GB/s   Near line-speed performance
Multi-Process    2-3x faster        Improvement over single process
Streaming Mode   2.6-3.5x faster    For 1-8 MB objects vs. single-pass

Optimization Features

  • HTTP/2 Support: Modern multiplexing for enhanced throughput (with Apache Arrow backend only)
  • Intelligent Defaults: Streaming mode automatically selected based on benchmarks
  • Multi-Process Architecture: Massive parallelism for maximum performance
  • Zero-Copy Streaming: Memory-efficient operations for large datasets
  • Configurable Chunk Sizes: Fine-tune performance for your workload

Checkpoint system for model states:

store = s3dlio.PyCheckpointStore('file:///tmp/checkpoints/')
store.save('model_state', your_model_data)
loaded_data = store.load('model_state')

Ready for Production: All core functionality validated, comprehensive test suite, and honest documentation matching actual capabilities.

Configuration & Tuning

Environment Variables

s3dlio supports comprehensive configuration through environment variables:

  • HTTP Client Optimization: S3DLIO_USE_OPTIMIZED_HTTP=true - Enhanced connection pooling
  • Runtime Scaling: S3DLIO_RT_THREADS=32 - Tokio worker threads
  • Connection Pool: S3DLIO_MAX_HTTP_CONNECTIONS=400 - Max connections per host
  • Range GET: S3DLIO_RANGE_CONCURRENCY=64 - Large-object optimization
  • Operation Logging: S3DLIO_OPLOG_LEVEL=2 - S3 operation tracking

📖 Environment Variables Reference (docs/api/Environment_Variables.md)

Operation Logging (Op-Log)

Universal operation-trace logging across all backends, written as zstd-compressed TSV and compatible with warp-replay.

import s3dlio
s3dlio.init_op_log("operations.tsv.zst")
# All operations automatically logged
s3dlio.finalize_op_log()
See S3DLIO OpLog Implementation for detailed usage.

Building from Source

Prerequisites

  • Rust: Install Rust toolchain
  • Python 3.12+: For Python library development
  • UV (recommended): Install UV
  • OpenSSL: Required (libssl-dev on Ubuntu)
  • HDF5 (optional): Only needed with --features hdf5 (libhdf5-dev on Ubuntu, brew install hdf5 on macOS)
  • hwloc (optional): Only needed with --features numa (libhwloc-dev on Ubuntu)

Build Steps

# Python environment
uv venv && source .venv/bin/activate

# Rust CLI
cargo build --release

# Python library
./build_pyo3.sh && ./install_pyo3_wheel.sh

Configuration

Environment Setup

# Required for S3 operations
AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-key
AWS_ENDPOINT_URL=https://your-s3-endpoint
AWS_REGION=us-east-1
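The same variables can be set from Python before using s3dlio, which is convenient in notebooks (placeholder values shown, matching the shell block above):

```python
import os

# Placeholder credentials/endpoint - replace with real values.
os.environ["AWS_ACCESS_KEY_ID"] = "your-access-key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your-secret-key"
os.environ["AWS_ENDPOINT_URL"] = "https://your-s3-endpoint"
os.environ["AWS_REGION"] = "us-east-1"
```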

To enable comprehensive S3 operation logging compatible with the MinIO warp format, use the Op-Log API described in the Operation Logging (Op-Log) section above.

Advanced Features

CPU Profiling & Analysis

cargo build --release --features profiling
cargo run --example simple_flamegraph_test --features profiling

Compression & Streaming

import s3dlio
options = s3dlio.PyWriterOptions()
options.compression = "zstd"
writer = s3dlio.create_s3_writer('s3://bucket/data.zst', options)
writer.write_chunk(large_data)
stats = writer.finalize()

Container Deployment

# Use pre-built container
podman pull quay.io/russfellows-sig65/s3dlio
podman run --net=host --rm -it quay.io/russfellows-sig65/s3dlio

# Or build locally
podman build -t s3dlio .

Note: Always use --net=host for storage backend connectivity.

Documentation & Support

🔗 Related Projects

License

Licensed under the Apache License 2.0 - see LICENSE file.


🚀 Ready to get started? Check out the Quick Start section above or explore our example scripts for common use cases!
