hyperload
High-performance distributed data loader powered by Rust.
Load files from S3 and local disk with blazing speed. Perfect for ML training pipelines.
Features
- 🚀 Blazing Fast - Rust-powered async I/O with up to 50 concurrent file reads
- ☁️ S3 Native - First-class Amazon S3 support
- 💾 Local Disk - Seamless local file system access
- 🐍 Pythonic API - Simple, intuitive interface
- 🔒 Type Safe - Built with Rust's safety guarantees
Installation
pip install hyperload
Quick Start
from hyperload import DataLoader
# Initialize with local file system
loader = DataLoader("file://./data")
# Read a single file
content = loader.read_file("sample.txt")
print(content)
Usage Examples
Reading Files from Local Disk
from hyperload import DataLoader
# Create loader pointing to current directory
loader = DataLoader("file://.")
# Read a single file
content = loader.read_file("path/to/file.txt")
print(content)
# Read from subdirectory
data = loader.read_file("data/train/sample.json")
Listing Files in a Directory
from hyperload import DataLoader
loader = DataLoader("file://.")
# Get all files in a folder
files = loader.list_files("my_dataset/")
print(f"Found {len(files)} files")
for file_path in files:
    print(file_path)
Batch Reading (Parallel I/O)
from hyperload import DataLoader
loader = DataLoader("file://.")
# List all training files
files = loader.list_files("data/training/")
# Read ALL files in parallel (50 concurrent reads!)
contents = loader.read_batch(files)
# Process the data
for i, content in enumerate(contents):
    print(f"File {i}: {len(content)} characters")
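Reading every file in one call keeps all contents in memory at once. For large datasets, one option is to feed the batch reader a slice of paths at a time. The helper below is a minimal sketch of that idea; `read_in_chunks` and its `chunk_size` default are hypothetical names and values, not part of hyperload, and the helper only assumes a callable that takes a list of paths and returns a list of strings.

```python
from typing import Callable, Iterator, List, Sequence


def read_in_chunks(
    read_batch: Callable[[List[str]], List[str]],
    paths: Sequence[str],
    chunk_size: int = 500,  # hypothetical default; tune to your memory budget
) -> Iterator[List[str]]:
    """Yield the contents of `paths` one chunk at a time."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(paths), chunk_size):
        # Only one chunk of file contents is held in memory per iteration.
        yield read_batch(list(paths[start:start + chunk_size]))
```

With the loader from the example above, this would be called as `for contents in read_in_chunks(loader.read_batch, files): ...`.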
Loading from Amazon S3
import os
from hyperload import DataLoader
# Set AWS credentials (or use IAM roles)
os.environ["AWS_ACCESS_KEY_ID"] = "your-key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your-secret"
os.environ["AWS_REGION"] = "us-east-1"
# Connect to S3 bucket
loader = DataLoader("s3://my-bucket")
# Read a file from S3
content = loader.read_file("path/to/file.txt")
# List files with prefix
files = loader.list_files("data/2024/")
# Batch download (50 concurrent S3 requests!)
contents = loader.read_batch(files)
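Rather than hard-coding credentials in a script, it may be preferable to set them in the shell and only check for them before connecting. The sketch below mirrors the three variables used above; `missing_aws_vars` is a hypothetical helper, and whether hyperload strictly requires all three (or accepts other credential sources such as IAM roles without them) is not stated here.

```python
import os
from typing import List, Mapping

# The variables set in the example above; this list is an assumption,
# not a documented hyperload requirement.
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION")


def missing_aws_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of expected AWS variables that are unset or empty."""
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]
```

A script can then fail early with a clear message, for example by raising if `missing_aws_vars()` returns a non-empty list and no IAM role is in use.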
ML Training Pipeline Example
from hyperload import DataLoader
import json
def load_training_data(data_path: str):
    """Load and parse all training samples."""
    loader = DataLoader(f"file://{data_path}")
    # Discover all JSON files
    files = loader.list_files("train/")
    json_files = [f for f in files if f.endswith(".json")]
    # Parallel load all files
    raw_data = loader.read_batch(json_files)
    # Parse JSON
    samples = [json.loads(content) for content in raw_data]
    print(f"Loaded {len(samples)} training samples")
    return samples
# Usage
data = load_training_data("./dataset")
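In the pipeline above, a single malformed file makes `json.loads` raise and aborts the whole load. A more tolerant variant could collect failures alongside the parsed samples. This is a sketch only: `parse_json_batch` is a hypothetical helper, and it assumes the batch reader returns contents in the same order as the paths it was given, which is not confirmed here.

```python
import json
from typing import Any, List, Sequence, Tuple


def parse_json_batch(
    paths: Sequence[str], contents: Sequence[str]
) -> Tuple[List[Any], List[Tuple[str, str]]]:
    """Parse each content string, returning (samples, failures).

    Assumes contents[i] belongs to paths[i].
    """
    samples: List[Any] = []
    failures: List[Tuple[str, str]] = []
    for path, content in zip(paths, contents):
        try:
            samples.append(json.loads(content))
        except json.JSONDecodeError as exc:
            # Keep the path and the parser message so bad files can be fixed later.
            failures.append((path, str(exc)))
    return samples, failures
```

In `load_training_data`, this would replace the list comprehension: `samples, failures = parse_json_batch(json_files, raw_data)`.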
API Reference
DataLoader(url: str)
Create a new data loader instance.
Parameters:
- url (str): Base URL for data loading
  - file://./path - Local file system (relative path)
  - file:///absolute/path - Local file system (absolute path)
  - s3://bucket-name - Amazon S3
Example:
# Local - current directory
loader = DataLoader("file://.")
# Local - specific path
loader = DataLoader("file://./data")
# S3 bucket
loader = DataLoader("s3://my-ml-bucket")
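To make the difference between the three URL forms concrete, the sketch below splits a base URL into its scheme and the remainder. It is an illustration only, not hyperload's own parser; `split_base_url` and `SUPPORTED_SCHEMES` are hypothetical names.

```python
from typing import Tuple

SUPPORTED_SCHEMES = ("file", "s3")  # the two schemes listed above


def split_base_url(url: str) -> Tuple[str, str]:
    """Split a base URL into (scheme, location)."""
    scheme, separator, location = url.partition("://")
    if not separator or scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported base URL: {url!r}")
    if not location:
        raise ValueError(f"base URL has no path or bucket: {url!r}")
    return scheme, location


print(split_base_url("file://./data"))          # relative path: "./data"
print(split_base_url("file:///absolute/path"))  # absolute path: "/absolute/path"
print(split_base_url("s3://my-ml-bucket"))      # bucket name: "my-ml-bucket"
```

Note that the absolute form has three slashes: two belong to `://` and the third starts the path.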
Methods
| Method | Description |
|---|---|
| read_file(path) | Read a single file, returns string |
| list_files(prefix) | List files under prefix, returns list of paths |
| read_batch(paths) | Read multiple files in parallel, returns list of strings |
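For unit tests that should not depend on the compiled extension, a pure-Python stand-in with the same three method names can be handy. The class below is a sketch under several assumptions that are not confirmed here: that files are UTF-8 text, that listing is recursive, and that listed paths are relative to the loader's base. `LocalStandInLoader` is a hypothetical name, and it reads sequentially rather than in parallel. S3 prefixes are plain string prefixes, so this directory-based listing only approximates local behavior.

```python
from pathlib import Path
from typing import List


class LocalStandInLoader:
    """Sequential, local-only stand-in exposing read_file, list_files, read_batch."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def read_file(self, path: str) -> str:
        # Assumes UTF-8 text files.
        return (self.root / path).read_text(encoding="utf-8")

    def list_files(self, prefix: str) -> List[str]:
        # Assumes recursive listing with paths relative to the root.
        base = self.root / prefix
        return sorted(
            p.relative_to(self.root).as_posix()
            for p in base.rglob("*")
            if p.is_file()
        )

    def read_batch(self, paths: List[str]) -> List[str]:
        # Sequential on purpose: correctness for tests, not speed.
        return [self.read_file(p) for p in paths]
```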
Performance
hyperload uses a Rust async runtime with buffered parallel execution:
| Feature | Specification |
|---|---|
| Concurrent reads | 50 simultaneous I/O operations |
| Memory | Zero-copy where possible |
| Large files | Streaming support |
| S3 optimization | Connection pooling & keep-alive |
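The idea of capping simultaneous I/O operations can be sketched in pure Python with a semaphore. This is a conceptual analogue only, not how hyperload is implemented in Rust; `read_all` and `_read_text` are hypothetical names, and the limit of 50 simply mirrors the figure in the table above.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List, Sequence


def _read_text(path: str) -> str:
    # Blocking read; run in a worker thread so the event loop stays free.
    return Path(path).read_text(encoding="utf-8")


async def read_all(paths: Sequence[str], limit: int = 50) -> List[str]:
    """Read all paths with at most `limit` reads in flight at once."""
    loop = asyncio.get_running_loop()
    semaphore = asyncio.Semaphore(limit)

    async def read_one(pool: ThreadPoolExecutor, path: str) -> str:
        async with semaphore:  # wait here once `limit` reads are in flight
            return await loop.run_in_executor(pool, _read_text, path)

    with ThreadPoolExecutor(max_workers=limit) as pool:
        # gather returns results in the same order as the input paths.
        results = await asyncio.gather(*(read_one(pool, p) for p in paths))
    return list(results)
```

From synchronous code this would be run with `asyncio.run(read_all(paths))`.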
Benchmark: Loading 1000 JSON Files
| Method | Time |
|---|---|
| Python open() loop | 12.3s |
| Python ThreadPool | 4.1s |
| hyperload | 0.8s |
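The hardware, file sizes, and storage behind these numbers are not given, so results on other machines will differ. A small harness along the following lines could be used to take comparable measurements on your own data. It is a sketch: the function names are hypothetical, and the two baselines are plausible readings of "open() loop" and "ThreadPool", not the author's benchmark code.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, List, Sequence


def read_text(path: str) -> str:
    return Path(path).read_text(encoding="utf-8")


def read_sequential(paths: Sequence[str]) -> List[str]:
    """Baseline: one file after another."""
    return [read_text(p) for p in paths]


def read_threaded(paths: Sequence[str], workers: int = 50) -> List[str]:
    """Baseline: standard-library thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(read_text, paths))


def time_reader(reader: Callable[[Sequence[str]], List[str]], paths: Sequence[str]) -> float:
    """Return wall-clock seconds for one call to `reader(paths)`."""
    start = time.perf_counter()
    reader(paths)
    return time.perf_counter() - start
```

The same timer accepts `loader.read_batch` as the reader, provided the paths passed in are the ones the loader expects relative to its base URL. Repeating each measurement several times helps separate disk-cache effects from real differences.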
Development
# Clone the repo
git clone https://github.com/DuhanJishnu/hyperload.git
cd hyperload
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # Linux/Mac
.\.venv\Scripts\activate # Windows
# Install dev dependencies
pip install maturin pytest
# Build and install locally
maturin develop
# Run tests
pytest tests/ -v
# Build release wheel
maturin build --release
License
MIT License - see LICENSE for details.