hyperload

License: MIT | Python 3.8+

High-performance distributed data loader powered by Rust.

Load files from S3 and local disk with blazing speed. Perfect for ML training pipelines.

Features

  • 🚀 Blazing Fast - Rust-powered async I/O with up to 50 concurrent file reads
  • ☁️ S3 Native - First-class Amazon S3 support
  • 💾 Local Disk - Seamless local file system access
  • 🐍 Pythonic API - Simple, intuitive interface
  • 🔒 Type Safe - Built with Rust's safety guarantees

Installation

pip install hyperload

Quick Start

from hyperload import DataLoader

# Initialize with local file system
loader = DataLoader("file://./data")

# Read a single file
content = loader.read_file("sample.txt")
print(content)

Usage Examples

Reading Files from Local Disk

from hyperload import DataLoader

# Create loader pointing to current directory
loader = DataLoader("file://.")

# Read a single file
content = loader.read_file("path/to/file.txt")
print(content)

# Read from subdirectory
data = loader.read_file("data/train/sample.json")

Listing Files in a Directory

from hyperload import DataLoader

loader = DataLoader("file://.")

# Get all files in a folder
files = loader.list_files("my_dataset/")
print(f"Found {len(files)} files")

for file_path in files:
    print(file_path)
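
Listings often mix data files with metadata or READMEs. The sketch below filters a listing by extension. It assumes (the source does not confirm this) that list_files returns plain string paths with forward slashes; filter_by_suffix is a hypothetical helper, not part of hyperload.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def filter_by_suffix(paths: Iterable[str], suffixes: Iterable[str]) -> List[str]:
    """Keep only paths whose extension is in `suffixes` (case-insensitive)."""
    wanted = {s.lower() for s in suffixes}
    return [p for p in paths if PurePosixPath(p).suffix.lower() in wanted]


# Stand-in for the result of loader.list_files("my_dataset/")
files = ["my_dataset/a.json", "my_dataset/b.JSON", "my_dataset/readme.md"]
json_files = filter_by_suffix(files, [".json"])
```

Filtering before calling read_batch avoids downloading files that would be discarded anyway.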

Batch Reading (Parallel I/O)

from hyperload import DataLoader

loader = DataLoader("file://.")

# List all training files
files = loader.list_files("data/training/")

# Read ALL files in parallel (50 concurrent reads!)
contents = loader.read_batch(files)

# Process the data
for i, content in enumerate(contents):
    print(f"File {i}: {len(content)} characters")  # read_batch returns strings, so len() counts characters
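
To make "50 concurrent reads" concrete, here is a pure-Python sketch of bounded-concurrency reading with a thread pool. It is not hyperload's implementation (which is described as Rust async I/O); read_batch_threads is a hypothetical name used only to illustrate the idea of capping the number of in-flight reads.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List, Sequence


def read_batch_threads(paths: Sequence[str], max_workers: int = 50) -> List[str]:
    """Read text files concurrently; results keep the order of `paths`."""

    def read_one(path: str) -> str:
        return Path(path).read_text(encoding="utf-8")

    # At most `max_workers` reads are in flight at any moment.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, not completion order.
        return list(pool.map(read_one, paths))
```

A sketch like this is also a reasonable baseline when comparing against loader.read_batch on your own data.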

Loading from Amazon S3

import os
from hyperload import DataLoader

# Set AWS credentials (or use IAM roles)
os.environ["AWS_ACCESS_KEY_ID"] = "your-key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your-secret"
os.environ["AWS_REGION"] = "us-east-1"

# Connect to S3 bucket
loader = DataLoader("s3://my-bucket")

# Read a file from S3
content = loader.read_file("path/to/file.txt")

# List files with prefix
files = loader.list_files("data/2024/")

# Batch download (50 concurrent S3 requests!)
contents = loader.read_batch(files)
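
For large prefixes, passing every key to a single read_batch call holds all contents in memory at once. A minimal sketch of reading in fixed-size chunks follows. The helpers chunked and iter_batches are hypothetical, and the code assumes read_batch returns one result per path in input order, which the source does not state explicitly.

```python
from typing import Iterator, Sequence, Tuple


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def iter_batches(loader, paths: Sequence[str], batch_size: int = 500) -> Iterator[Tuple[str, str]]:
    """Yield (path, content) pairs, reading one chunk of paths at a time."""
    for chunk in chunked(paths, batch_size):
        contents = loader.read_batch(list(chunk))
        # Assumption: results align with `chunk` by position.
        yield from zip(chunk, contents)
```

Usage would look like `for path, content in iter_batches(loader, files): ...`, where loader is any object with a read_batch method.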

ML Training Pipeline Example

from hyperload import DataLoader
import json

def load_training_data(data_path: str):
    """Load and parse all training samples."""
    loader = DataLoader(f"file://{data_path}")
    
    # Discover all JSON files
    files = loader.list_files("train/")
    json_files = [f for f in files if f.endswith(".json")]
    
    # Parallel load all files
    raw_data = loader.read_batch(json_files)
    
    # Parse JSON
    samples = [json.loads(content) for content in raw_data]
    
    print(f"Loaded {len(samples)} training samples")
    return samples

# Usage
data = load_training_data("./dataset")
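
The pipeline above stops at the first malformed file, because json.loads raises on invalid input. A hedged variant that records failures instead is sketched below. parse_json_batch is a hypothetical helper, and it assumes the contents returned by read_batch line up with the input paths by position.

```python
import json
from typing import Any, List, Sequence, Tuple


def parse_json_batch(
    paths: Sequence[str], raw: Sequence[str]
) -> Tuple[List[Any], List[Tuple[str, str]]]:
    """Parse each document; return (samples, failures) without raising."""
    samples: List[Any] = []
    failures: List[Tuple[str, str]] = []
    for path, content in zip(paths, raw):
        try:
            samples.append(json.loads(content))
        except json.JSONDecodeError as exc:
            # Keep the path so the bad file can be inspected later.
            failures.append((path, str(exc)))
    return samples, failures
```

Inside load_training_data this would replace the list comprehension, for example `samples, failures = parse_json_batch(json_files, raw_data)`.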

API Reference

DataLoader(url: str)

Create a new data loader instance.

Parameters:

  • url (str): Base URL for data loading
    • file://./path - Local file system (relative path)
    • file:///absolute/path - Local file system (absolute path)
    • s3://bucket-name - Amazon S3

Example:

# Local - current directory
loader = DataLoader("file://.")

# Local - specific path
loader = DataLoader("file://./data")

# S3 bucket
loader = DataLoader("s3://my-ml-bucket")
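
The URL forms listed above can be checked before constructing a loader. The sketch below only illustrates the three documented shapes; split_base_url is a hypothetical helper and does not claim to mirror how hyperload parses URLs internally.

```python
from typing import Tuple


def split_base_url(url: str) -> Tuple[str, str]:
    """Return (scheme, location) for file:// and s3:// base URLs."""
    scheme, sep, location = url.partition("://")
    if not sep or scheme not in ("file", "s3"):
        raise ValueError(f"unsupported base URL: {url!r}")
    if not location:
        raise ValueError(f"missing path or bucket in: {url!r}")
    # For file:// the location is a path; for s3:// it is the bucket name.
    return scheme, location
```

Validating early gives a clearer error message than a failure on the first read.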

Methods

| Method | Description |
|---|---|
| read_file(path) | Read a single file, returns string |
| list_files(prefix) | List files under prefix, returns list of paths |
| read_batch(paths) | Read multiple files in parallel, returns list of strings |
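
Because the interface is only three methods, code that consumes a loader can be unit-tested without touching disk or S3. The sketch below describes the methods as a typing.Protocol and adds an in-memory stand-in. FileLoader, InMemoryLoader, and total_characters are hypothetical names, and the type hints are inferred from the method descriptions above rather than taken from hyperload itself.

```python
from typing import Dict, List, Protocol


class FileLoader(Protocol):
    """Structural type matching the three documented methods."""

    def read_file(self, path: str) -> str: ...
    def list_files(self, prefix: str) -> List[str]: ...
    def read_batch(self, paths: List[str]) -> List[str]: ...


class InMemoryLoader:
    """Test double backed by a dict mapping path -> content."""

    def __init__(self, files: Dict[str, str]) -> None:
        self._files = dict(files)

    def read_file(self, path: str) -> str:
        return self._files[path]

    def list_files(self, prefix: str) -> List[str]:
        return sorted(p for p in self._files if p.startswith(prefix))

    def read_batch(self, paths: List[str]) -> List[str]:
        return [self._files[p] for p in paths]


def total_characters(loader: FileLoader, prefix: str) -> int:
    """Example consumer: works with any object providing the three methods."""
    return sum(len(c) for c in loader.read_batch(loader.list_files(prefix)))
```

In tests, pass an InMemoryLoader where production code would pass a DataLoader.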

Performance

hyperload uses a Rust async runtime with buffered parallel execution:

| Feature | Specification |
|---|---|
| Concurrent reads | 50 simultaneous I/O operations |
| Memory | Zero-copy where possible |
| Large files | Streaming support |
| S3 optimization | Connection pooling & keep-alive |

Benchmark: Loading 1000 JSON Files

| Method | Time |
|---|---|
| Python open() loop | 12.3s |
| Python ThreadPool | 4.1s |
| hyperload | 0.8s |
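
The source does not give the file sizes, hardware, or cache state behind these numbers, so results on other machines will differ. A minimal timing harness for running a comparable measurement on your own data might look like this. time_readers and read_loop are hypothetical helpers, and any callable that accepts a list of paths can be added to the readers dict.

```python
import time
from pathlib import Path
from typing import Callable, Dict, List, Sequence


def read_loop(paths: Sequence[str]) -> List[str]:
    """Baseline: read files one after another."""
    return [Path(p).read_text(encoding="utf-8") for p in paths]


def time_readers(
    paths: Sequence[str],
    readers: Dict[str, Callable[[Sequence[str]], object]],
    repeats: int = 3,
) -> Dict[str, float]:
    """Return the best wall-clock time in seconds for each reader."""
    timings: Dict[str, float] = {}
    for name, reader in readers.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            reader(paths)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return timings
```

To include hyperload, add an entry such as `"hyperload": loader.read_batch` with paths relative to the loader's base URL. Repeated runs mostly hit the OS page cache, so a cold-cache measurement needs separate care.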

Development

# Clone the repo
git clone https://github.com/DuhanJishnu/hyperload.git
cd hyperload

# Create and activate a virtual environment (run only the activate line for your OS)
python -m venv .venv
source .venv/bin/activate  # Linux/macOS
.\.venv\Scripts\activate   # Windows

# Install dev dependencies
pip install maturin pytest

# Build and install locally
maturin develop

# Run tests
pytest tests/ -v

# Build release wheel
maturin build --release

License

MIT License - see LICENSE for details.
