Decentralized AI dataset pipeline built on Shelby Protocol

ShelbyTrain

Streaming Dataset Shards from Shelby into ML Workflows

ShelbyTrain is an experimental ML data pipeline built to evaluate Shelby as a high-performance dataset storage and delivery layer for AI training workflows.

The project demonstrates how image datasets can be:

prepared locally,
converted into dataset shards,
uploaded to Shelby,
downloaded and cached,
streamed into PyTorch,
and benchmarked against local storage performance.

It is a developer/research layer built on top of Shelby to test how Shelby performs under repeated ML dataset access workloads.

Project Goal

The goal of this MVP is to answer a simple question:

Can Shelby function as a practical remote dataset layer for machine learning workflows?

The project focuses specifically on:

repeated dataset access,
caching efficiency,
initialization latency,
throughput during training,
and integration into PyTorch pipelines.

Current MVP Features

MNIST image dataset preparation
Dataset sharding system
Manifest generation
Shelby blob upload/download
Local shard caching
PyTorch dataset integration
Benchmarking system
Cold vs cached performance comparison

Architecture Overview

Images
   ↓
Dataset Shards (.tar)
   ↓
Manifest.json
   ↓
Upload to Shelby
   ↓
Shelby Blob Storage
   ↓
Download + Cache
   ↓
PyTorch Dataset Loader
   ↓
Benchmark + Training Workflow
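
The first three stages of the diagram (images, .tar shards, manifest) can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the function name write_shards, the shard naming scheme, the manifest.json file name, and the manifest fields are all hypothetical and are not taken from ShelbyTrain's sharder.py.

```python
import hashlib
import io
import json
import tarfile
from pathlib import Path


def write_shards(samples, out_dir, shard_size=1000):
    """Pack (name, payload_bytes, label) samples into .tar shards plus a manifest.

    The manifest layout is an assumption for illustration only.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = {"shards": []}

    for start in range(0, len(samples), shard_size):
        chunk = samples[start:start + shard_size]
        shard_name = f"shard-{start // shard_size:05d}.tar"
        shard_path = out_dir / shard_name

        labels = {}
        with tarfile.open(shard_path, "w") as tar:
            for name, payload, label in chunk:
                info = tarfile.TarInfo(name=name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
                labels[name] = label

        # A checksum lets a downloader verify each shard after transfer.
        digest = hashlib.sha256(shard_path.read_bytes()).hexdigest()
        manifest["shards"].append({
            "name": shard_name,
            "num_samples": len(chunk),
            "sha256": digest,
            "labels": labels,
        })

    (out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

Keeping per-shard sample counts and checksums in the manifest means a loader can size the dataset and validate downloads without opening every shard first.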

Project Structure

ShelbyTrain/
├── README.md
├── requirements.txt
├── .env.example
├── .gitignore
│
├── shelbytrain/
│   ├── __init__.py
│   ├── benchmark.py
│   ├── cache.py
│   ├── client.py
│   ├── dataset.py
│   └── sharder.py
│
├── scripts/
│   ├── benchmark.py
│   ├── create_shards.py
│   ├── prepare_sample_dataset.py
│   ├── test_loader.py
│   ├── test_loader_shelby.py
│   └── upload_with_cli.py
│
└── data/

System Requirements

Recommended environment:

Ubuntu / WSL / Linux / macOS
Python 3.9+
Node.js 20+
npm
Git
Shelby CLI

Install required system packages (Debian/Ubuntu, including WSL):

sudo apt update
sudo apt install python3 python3-venv python3-pip git curl unzip -y

Python Packages

Create requirements.txt:

torch
torchvision
pillow
requests
tqdm
python-dotenv

Environment Setup

Clone the repository:

git clone YOUR_REPO_URL
cd ShelbyTrain

Create a virtual environment:

python3 -m venv .venv

Activate the environment:

source .venv/bin/activate

Upgrade pip:

pip install --upgrade pip

Install dependencies:

pip install -r requirements.txt

If PyTorch install fails:

pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
pip install pillow requests tqdm python-dotenv

Verify installation:

python -c "import torch; import torchvision; print('OK')"

Fix Python Import Path

If you encounter:

ModuleNotFoundError: No module named 'shelbytrain'

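Run this from the repository root:

export PYTHONPATH=$PWD

Optional permanent fix (also run from the repository root, so that the absolute repository path is written to ~/.bashrc rather than a literal $PWD):

echo "export PYTHONPATH=\"$PWD\"" >> ~/.bashrc
source ~/.bashrc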

Shelby CLI Setup

ShelbyTrain uses Shelby CLI to upload dataset shards.

Install Node.js 20 using NVM

Install NVM:

curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash
source ~/.bashrc

Install Node.js:

nvm install 20
nvm use 20
nvm alias default 20

Verify:

node -v
npm -v

Install Shelby CLI

npm i -g @shelby-protocol/cli

Verify installation:

shelby --version

Initialize Shelby:

shelby init

Recommended context:

shelbynet

Verify contexts:

shelby context list

Verify account:

shelby account balance

Environment Variables

Create .env:

touch .env

Add:

SHELBY_ACCOUNT=0xyour_account_address
SHELBY_API_KEY=your_api_key_here

Create .env.example:

SHELBY_ACCOUNT=0xyour_account_address
SHELBY_API_KEY=your_api_key_here
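
A script that depends on these variables might validate them up front, so a missing key fails early instead of midway through an upload. The sketch below is an assumption about how that could look, not ShelbyTrain's client.py. The function name load_shelby_config and the 0x-prefix check are hypothetical; python-dotenv is used only if it is installed.

```python
import os

try:
    # python-dotenv is listed in requirements.txt; fall back gracefully without it.
    from dotenv import load_dotenv
except ImportError:
    load_dotenv = None

REQUIRED_VARS = ("SHELBY_ACCOUNT", "SHELBY_API_KEY")


def load_shelby_config(env=None):
    """Return the Shelby settings as a dict, raising if any are missing."""
    if env is None:
        if load_dotenv is not None:
            load_dotenv()  # loads variables from a .env file if one is found
        env = os.environ
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    account = env["SHELBY_ACCOUNT"]
    # Assumption: account addresses are 0x-prefixed, as in the example above.
    if not account.startswith("0x"):
        raise RuntimeError("SHELBY_ACCOUNT should be a 0x-prefixed address")
    return {"account": account, "api_key": env["SHELBY_API_KEY"]}
```

Passing a plain dict as env makes the function easy to test without touching the real environment.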

Run Local Test

Prepare sample dataset:

python scripts/prepare_sample_dataset.py

Create shards:

python scripts/create_shards.py

Test local loader:

python scripts/test_loader.py

Expected output:

Indexed 5000 samples
Images shape: torch.Size([32, 1, 28, 28])
Labels shape: torch.Size([32])
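
The "Indexed N samples" line suggests the loader builds an index of shard members before serving batches. A minimal sketch of that idea follows, using only the standard library. The class name TarShardIndex is hypothetical, and decoding payloads into tensors and looking up labels are deliberately left out. The class implements __len__ and __getitem__, which is the protocol torch.utils.data.DataLoader expects from a map-style dataset.

```python
import tarfile
from pathlib import Path


class TarShardIndex:
    """Map-style view over samples stored in .tar shards (illustrative only)."""

    def __init__(self, shard_paths):
        self.entries = []  # (shard_path, member_name) for every sample
        for path in sorted(Path(p) for p in shard_paths):
            with tarfile.open(path) as tar:
                for member in tar.getmembers():
                    if member.isfile():
                        self.entries.append((path, member.name))
        print(f"Indexed {len(self.entries)} samples")

    def __len__(self):
        return len(self.entries)

    def __getitem__(self, index):
        path, name = self.entries[index]
        # Reopening the shard per sample keeps the sketch simple; a real
        # loader would likely keep handles open or read shards into memory.
        with tarfile.open(path) as tar:
            payload = tar.extractfile(name).read()
        return name, payload
```

In a PyTorch pipeline, __getitem__ would additionally decode the payload (for example with Pillow) and return an (image tensor, label) pair so that batches collate to the shapes shown above.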

Upload Shards to Shelby

Upload dataset shards:

python scripts/upload_with_cli.py

Verify uploaded blobs:

shelby account blobs

Test Shelby Loader

Run Shelby-backed loader:

python scripts/test_loader_shelby.py

First run: downloads shards from Shelby.
Second run: uses the local cache.
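
The difference between the two runs comes down to a download-if-missing check. The sketch below shows one way to write it. The function name get_shard and the fetch callable (shard name in, bytes out) are hypothetical stand-ins for whatever cache.py and client.py actually do; the .shelby-cache directory name matches the one cleared in the benchmark step.

```python
from pathlib import Path


def get_shard(name, fetch, cache_dir=".shelby-cache"):
    """Return a local path for `name`, calling `fetch(name)` only on a cache miss.

    `fetch` is a hypothetical callable that returns the shard's bytes.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / name
    if target.exists():
        return target  # cached: no network access needed

    # Write to a temporary file first so an interrupted download
    # never leaves a truncated shard in the cache.
    partial = target.with_name(target.name + ".part")
    partial.write_bytes(fetch(name))
    partial.replace(target)
    return target
```

Writing to a .part file and renaming it is a design choice: the rename happens only after the full payload is on disk, so a later run never mistakes a partial file for a valid shard.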

Run Benchmark

Clear cache before cold benchmark:

rm -rf .shelby-cache

Run benchmark:

python scripts/benchmark.py

View benchmark results:

cat benchmark-results.json

Benchmark Results

Example benchmark from MVP test:

{
  "local": {
    "batches": 50,
    "batch_size": 32,
    "samples": 1600,
    "time_to_first_batch_sec": 0.1105,
    "total_time_sec": 0.6918,
    "samples_per_sec": 2312.66
  },
  "shelby_cold": {
    "batches": 50,
    "batch_size": 32,
    "samples": 1600,
    "time_to_first_batch_sec": 0.0196,
    "total_time_sec": 0.7025,
    "samples_per_sec": 2277.53,
    "dataset_init_download_sec": 16.4761
  },
  "shelby_cached": {
    "batches": 50,
    "batch_size": 32,
    "samples": 1600,
    "time_to_first_batch_sec": 0.0139,
    "total_time_sec": 0.6019,
    "samples_per_sec": 2658.17,
    "dataset_init_cache_sec": 1.7567
  }
}
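
The per-mode fields in this JSON can be produced by a small timing loop around any iterable of batches. The following is a sketch of that measurement, not the code in benchmark.py. The function name benchmark_batches is hypothetical, and it assumes every batch is full when computing the sample count.

```python
import time


def benchmark_batches(batches, batch_size, max_batches=50):
    """Time iteration over `batches` and report the fields used in the results."""
    start = time.perf_counter()
    first_batch_time = None
    count = 0
    for _ in batches:
        count += 1
        if first_batch_time is None:
            # Latency until the first batch is available to the training loop.
            first_batch_time = time.perf_counter() - start
        if count >= max_batches:
            break
    total = time.perf_counter() - start
    samples = count * batch_size  # assumes every batch is full
    return {
        "batches": count,
        "batch_size": batch_size,
        "samples": samples,
        "time_to_first_batch_sec": round(first_batch_time or 0.0, 4),
        "total_time_sec": round(total, 4),
        "samples_per_sec": round(samples / total, 2) if total > 0 else 0.0,
    }
```

Dataset initialization (download or cache lookup) would be timed separately, before this loop starts, which is why the init figures appear as their own fields.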

Benchmark Interpretation

Mode	Init Time (s)	Time to First Batch (s)	Throughput (samples/s)
Local	~0	0.1105	2312.66
Shelby Cold	16.48	0.0196	2277.53
Shelby Cached	1.76	0.0139	2658.17

Key Findings

Shelby cold start introduced a ~16.48 second initialization cost.
Cached startup reduced initialization time by ~9.4x (from 16.48 s to 1.76 s).
Cached throughput exceeded Shelby cold throughput by ~16.7%.
Cached throughput exceeded local throughput by ~14.9% in this experiment.
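
These figures can be re-derived directly from the example benchmark JSON. The short script below copies the relevant values and recomputes the ratios; the helper name pct_gain is introduced here purely for illustration.

```python
# Values copied from the example benchmark results.
results = {
    "local": {"samples_per_sec": 2312.66},
    "shelby_cold": {"samples_per_sec": 2277.53, "dataset_init_download_sec": 16.4761},
    "shelby_cached": {"samples_per_sec": 2658.17, "dataset_init_cache_sec": 1.7567},
}


def pct_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


init_speedup = (
    results["shelby_cold"]["dataset_init_download_sec"]
    / results["shelby_cached"]["dataset_init_cache_sec"]
)
cached_vs_cold = pct_gain(
    results["shelby_cached"]["samples_per_sec"],
    results["shelby_cold"]["samples_per_sec"],
)
cached_vs_local = pct_gain(
    results["shelby_cached"]["samples_per_sec"],
    results["local"]["samples_per_sec"],
)

print(f"init speedup:    {init_speedup:.1f}x")    # 9.4x
print(f"cached vs cold:  {cached_vs_cold:.1f}%")   # 16.7%
print(f"cached vs local: {cached_vs_local:.1f}%")  # 14.9%
```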

Conclusion

Shelby behaves like a remote dataset layer with a one-time initialization cost.

Once shards are downloaded and cached locally, training throughput is comparable to local dataset reads, and in this experiment it was ~14.9% higher.

This makes Shelby particularly promising for:

repeated training workflows,
reusable AI datasets,
distributed dataset delivery,
and large-scale media/data pipelines.

Limitations

This project is currently an MVP prototype intended for benchmarking and architectural validation.

Current limitations include:

Uses MNIST-scale image datasets only
Dataset loader downloads full shards before training begins
No true range-based or partial sample loading implemented
Cold-start performance varies with network conditions and shard count
Not optimized for distributed or multi-node training workloads
No parallel shard prefetching pipeline
No GPU-specific optimization or acceleration
Benchmark scope currently focused on repeated-read training scenarios

Recommended Next Improvements

Add range-request based loading
Add concurrent shard downloading
Add retry/failure analytics
Add larger dataset benchmarks
Add distributed training support
Add dataset metadata indexing
Add benchmark visualization dashboard
Add CLI commands:
- shelbytrain prepare
- shelbytrain shard
- shelbytrain upload
- shelbytrain benchmark
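
As one illustration of the "concurrent shard downloading" item, a thread pool can overlap several shard fetches, since downloads are I/O-bound. This is a rough sketch of a possible approach, not a planned design. The function name fetch_all and the fetch callable (shard name in, bytes out) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(names, fetch, max_workers=4):
    """Fetch every shard in `names` concurrently and return {name: bytes}.

    `fetch` is a hypothetical callable (name -> bytes).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order; an exception raised inside fetch
        # is re-raised here when that result is consumed.
        payloads = list(pool.map(fetch, names))
    return dict(zip(names, payloads))
```

With a cold cache, the initialization cost would then be bounded by the slowest group of concurrent downloads rather than the sum of all of them, although the actual gain depends on network conditions and shard count.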

Contribution Summary

ShelbyTrain is an experimental ML integration layer built on top of Shelby.

It extends Shelby’s AI use case by adding:

dataset sharding,
manifest generation,
PyTorch integration,
caching,
benchmarking,
and ML workflow validation.

The project demonstrates how Shelby can function as a reusable dataset layer for repeated AI training workloads.

Future Direction

The next stage of ShelbyTrain is a lightweight dashboard/dApp that will allow users to:

upload datasets,
visualize shards,
monitor benchmarks,
preview samples,
and test dataset delivery performance directly from the browser.

License

MIT

Release History

0.2.0: May 25, 2026
0.1.0: May 22, 2026 (this version)
