Decentralized AI dataset pipeline built on Shelby Protocol

ShelbyTrain

Streaming Dataset Shards from Shelby into ML Workflows

ShelbyTrain is an experimental ML data pipeline built to evaluate Shelby as a high-performance dataset storage and delivery layer for AI training workflows.

The project demonstrates how image datasets can be:

  1. prepared locally,
  2. converted into dataset shards,
  3. uploaded to Shelby,
  4. downloaded and cached,
  5. streamed into PyTorch,
  6. benchmarked against local storage performance.

It is a developer/research layer built on top of Shelby to test how Shelby performs under repeated ML dataset access workloads.

Project Goal

The goal of this MVP is to answer a simple question:

Can Shelby function as a practical remote dataset layer for machine learning workflows?

The project focuses specifically on:

  • repeated dataset access,
  • caching efficiency,
  • initialization latency,
  • throughput during training,
  • and integration into PyTorch pipelines.

Current MVP Features

  • MNIST image dataset preparation
  • Dataset sharding system
  • Manifest generation
  • Shelby blob upload/download
  • Local shard caching
  • PyTorch dataset integration
  • Benchmarking system
  • Cold vs cached performance comparison

Architecture Overview

Images
   ↓
Dataset Shards (.tar)
   ↓
Manifest.json
   ↓
Upload to Shelby
   ↓
Shelby Blob Storage
   ↓
Download + Cache
   ↓
PyTorch Dataset Loader
   ↓
Benchmark + Training Workflow
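
The first three stages of the diagram (images, .tar shards, manifest) can be illustrated with a minimal sketch. This is not the code in shelbytrain/sharder.py; the function name write_shards, the shard naming scheme, the labels.json member, and the manifest fields are all assumptions made for illustration.

```python
import io
import json
import tarfile
from pathlib import Path


def write_shards(samples, out_dir, shard_size=1000):
    """Pack (name, payload_bytes, label) samples into .tar shards plus a manifest.

    Hypothetical layout: the real sharder may name and structure things differently.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = {"shards": []}

    for start in range(0, len(samples), shard_size):
        chunk = samples[start:start + shard_size]
        shard_name = f"shard-{start // shard_size:05d}.tar"
        labels = {}
        with tarfile.open(out_dir / shard_name, "w") as tar:
            for name, payload, label in chunk:
                info = tarfile.TarInfo(name=name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
                labels[name] = label
            # Store labels inside the shard so each shard is self-describing.
            blob = json.dumps(labels).encode("utf-8")
            info = tarfile.TarInfo(name="labels.json")
            info.size = len(blob)
            tar.addfile(info, io.BytesIO(blob))
        manifest["shards"].append({"file": shard_name, "num_samples": len(chunk)})

    manifest["num_samples"] = len(samples)
    (out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

A manifest of this kind lets a loader discover every shard it needs to download without listing remote storage.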

Project Structure

ShelbyTrain/
├── README.md
├── requirements.txt
├── .env.example
├── .gitignore
│
├── shelbytrain/
│   ├── __init__.py
│   ├── benchmark.py
│   ├── cache.py
│   ├── client.py
│   ├── dataset.py
│   └── sharder.py
│
├── scripts/
│   ├── benchmark.py
│   ├── create_shards.py
│   ├── prepare_sample_dataset.py
│   ├── test_loader.py
│   ├── test_loader_shelby.py
│   └── upload_with_cli.py
│
└── data/

System Requirements

Recommended environment:

  • Ubuntu / WSL / Linux / macOS
  • Python 3.9+
  • Node.js 20+
  • npm
  • Git
  • Shelby CLI

Install required system packages (Debian/Ubuntu):

sudo apt update
sudo apt install python3 python3-venv python3-pip git curl unzip -y

Python Packages

Create requirements.txt:

torch
torchvision
pillow
requests
tqdm
python-dotenv

Environment Setup

Clone the repository:

git clone YOUR_REPO_URL
cd ShelbyTrain

Create a virtual environment:

python3 -m venv .venv

Activate the environment:

source .venv/bin/activate

Upgrade pip:

pip install --upgrade pip

Install dependencies:

pip install -r requirements.txt

If the PyTorch install fails, install the CPU-only wheels and the remaining packages separately:

pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
pip install pillow requests tqdm python-dotenv

Verify installation:

python -c "import torch; import torchvision; print('OK')"

Fix Python Import Path

If you encounter:

ModuleNotFoundError: No module named 'shelbytrain'

Run:

export PYTHONPATH=$PWD

Optional permanent fix (run from the repository root, so that the repository path itself is written into ~/.bashrc):

echo "export PYTHONPATH=$PWD" >> ~/.bashrc
source ~/.bashrc

Shelby CLI Setup

ShelbyTrain uses Shelby CLI to upload dataset shards.

Install Node.js 20 using NVM

Install NVM:

curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash
source ~/.bashrc

Install Node.js:

nvm install 20
nvm use 20
nvm alias default 20

Verify:

node -v
npm -v

Install Shelby CLI

npm i -g @shelby-protocol/cli

Verify installation:

shelby --version

Initialize Shelby:

shelby init

Recommended context:

shelbynet

Verify contexts:

shelby context list

Verify account:

shelby account balance

Environment Variables

Create .env:

touch .env

Add:

SHELBY_ACCOUNT=0xyour_account_address
SHELBY_API_KEY=your_api_key_here

Create .env.example:

SHELBY_ACCOUNT=0xyour_account_address
SHELBY_API_KEY=your_api_key_here
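
How the project reads these variables is not shown here. As a minimal sketch, a script could validate them up front so that a missing value fails early rather than in the middle of a download. The function name load_shelby_config and the 0x prefix check are assumptions based only on the example values above.

```python
import os


def load_shelby_config(env=None):
    """Return the Shelby settings from a mapping, failing early if any are missing."""
    env = os.environ if env is None else env
    config = {}
    for key in ("SHELBY_ACCOUNT", "SHELBY_API_KEY"):
        value = env.get(key, "").strip()
        if not value:
            raise RuntimeError(f"{key} is not set; add it to .env")
        config[key] = value
    # Assumption: account addresses are written with a 0x prefix, as in the example.
    if not config["SHELBY_ACCOUNT"].startswith("0x"):
        raise RuntimeError("SHELBY_ACCOUNT should start with 0x")
    return config
```

With python-dotenv installed, calling load_dotenv() from the dotenv package before load_shelby_config() copies the values from .env into the process environment.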

Run Local Test

Prepare sample dataset:

python scripts/prepare_sample_dataset.py

Create shards:

python scripts/create_shards.py

Test local loader:

python scripts/test_loader.py

Expected output:

Indexed 5000 samples
Images shape: torch.Size([32, 1, 28, 28])
Labels shape: torch.Size([32])
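
The "Indexed 5000 samples" line suggests the loader builds an index over all shard members before serving batches. A minimal sketch of that idea, using only the standard library, is shown below. The class name TarShardIndex and the "<label>/<filename>" member naming are assumptions; the actual shelbytrain/dataset.py may store labels differently.

```python
import tarfile


class TarShardIndex:
    """Map-style view over .tar shards: supports len() and integer indexing."""

    def __init__(self, shard_paths):
        self.entries = []  # (shard_path, member_name, label)
        for path in shard_paths:
            with tarfile.open(path) as tar:
                for member in tar.getmembers():
                    if member.isfile():
                        # Assumption: members are named "<label>/<filename>".
                        label = int(member.name.split("/")[0])
                        self.entries.append((path, member.name, label))
        print(f"Indexed {len(self.entries)} samples")

    def __len__(self):
        return len(self.entries)

    def __getitem__(self, idx):
        path, name, label = self.entries[idx]
        # Reopening the shard per sample is simple but slow; a real loader
        # would keep shards open or decode them once.
        with tarfile.open(path) as tar:
            payload = tar.extractfile(name).read()
        return payload, label
```

In a PyTorch version, the same two methods would live on a torch.utils.data.Dataset subclass, and __getitem__ would decode the payload into a 1 x 28 x 28 tensor so that a DataLoader with batch_size=32 yields the shapes shown above.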

Upload Shards to Shelby

Upload dataset shards:

python scripts/upload_with_cli.py

Verify uploaded blobs:

shelby account blobs

Test Shelby Loader

Run Shelby-backed loader:

python scripts/test_loader_shelby.py

First run:

  • downloads shards from Shelby

Second run:

  • uses local cache
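
The download-once, reuse-afterwards behaviour described above can be sketched as a small cache helper. The helper name ensure_cached and the fetch callable are hypothetical; fetch stands in for whatever code actually retrieves a blob from Shelby, and blob names are assumed to be flat file names.

```python
from pathlib import Path


def ensure_cached(blob_name, fetch, cache_dir=".shelby-cache"):
    """Return a local path for blob_name, calling fetch only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / blob_name
    if target.exists():
        return target, "hit"
    # Write to a temporary name first so an interrupted download
    # never leaves a truncated file that looks like a valid shard.
    partial = target.with_name(target.name + ".part")
    partial.write_bytes(fetch(blob_name))
    partial.replace(target)
    return target, "miss"
```

On the first run every shard is a miss and is downloaded; on the second run every shard is a hit and no network access is needed.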

Run Benchmark

Clear cache before cold benchmark:

rm -rf .shelby-cache

Run benchmark:

python scripts/benchmark.py

View benchmark results:

cat benchmark-results.json

Benchmark Results

Example benchmark from MVP test:

{
  "local": {
    "batches": 50,
    "batch_size": 32,
    "samples": 1600,
    "time_to_first_batch_sec": 0.1105,
    "total_time_sec": 0.6918,
    "samples_per_sec": 2312.66
  },
  "shelby_cold": {
    "batches": 50,
    "batch_size": 32,
    "samples": 1600,
    "time_to_first_batch_sec": 0.0196,
    "total_time_sec": 0.7025,
    "samples_per_sec": 2277.53,
    "dataset_init_download_sec": 16.4761
  },
  "shelby_cached": {
    "batches": 50,
    "batch_size": 32,
    "samples": 1600,
    "time_to_first_batch_sec": 0.0139,
    "total_time_sec": 0.6019,
    "samples_per_sec": 2658.17,
    "dataset_init_cache_sec": 1.7567
  }
}
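
The per-mode fields in this JSON can be measured with a simple timing loop. The sketch below is an assumption about how such numbers could be produced, not the code in scripts/benchmark.py; benchmark_batches is a hypothetical name, and the sample count assumes every batch is full.

```python
import time


def benchmark_batches(batches, batch_size, max_batches=50):
    """Time iteration over an iterable of batches and report throughput."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in batches:
        count += 1
        if first is None:
            # Latency until the first batch is available.
            first = time.perf_counter() - start
        if count >= max_batches:
            break
    total = time.perf_counter() - start
    samples = count * batch_size
    return {
        "batches": count,
        "batch_size": batch_size,
        "samples": samples,
        "time_to_first_batch_sec": round(first or 0.0, 4),
        "total_time_sec": round(total, 4),
        "samples_per_sec": round(samples / total, 2) if total > 0 else 0.0,
    }
```

Initialization time (download or cache lookup) is reported separately in the results above, so it would be timed around dataset construction rather than inside this loop.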

Benchmark Interpretation

Mode            Init Time   Time to First Batch   Samples/sec
Local           ~0 s        0.1105 s              2312.66
Shelby Cold     16.48 s     0.0196 s              2277.53
Shelby Cached   1.76 s      0.0139 s              2658.17

Key Findings

  • Shelby cold start introduced a ~16.48 second initialization cost.
  • Cached startup reduced initialization time by ~9.4x (16.48 s to 1.76 s).
  • Cached throughput exceeded Shelby cold throughput by ~16.7%.
  • Cached throughput exceeded local throughput by ~14.9% in this experiment.
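
The ratios in these findings follow directly from the benchmark JSON. The short calculation below reproduces them; the helper name percent_gain is only for illustration.

```python
results = {
    "local": {"samples_per_sec": 2312.66},
    "shelby_cold": {"samples_per_sec": 2277.53, "dataset_init_download_sec": 16.4761},
    "shelby_cached": {"samples_per_sec": 2658.17, "dataset_init_cache_sec": 1.7567},
}


def percent_gain(new, base):
    """Relative improvement of new over base, in percent."""
    return (new / base - 1.0) * 100.0


init_speedup = (results["shelby_cold"]["dataset_init_download_sec"]
                / results["shelby_cached"]["dataset_init_cache_sec"])
cached_vs_cold = percent_gain(results["shelby_cached"]["samples_per_sec"],
                              results["shelby_cold"]["samples_per_sec"])
cached_vs_local = percent_gain(results["shelby_cached"]["samples_per_sec"],
                               results["local"]["samples_per_sec"])

print(f"init speedup: {init_speedup:.1f}x")        # 9.4x
print(f"cached vs cold: {cached_vs_cold:.1f}%")    # 16.7%
print(f"cached vs local: {cached_vs_local:.1f}%")  # 14.9%
```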

Conclusion

Shelby behaves like a remote dataset layer with a one-time initialization cost.

Once shards are downloaded and cached locally, training throughput becomes comparable to or faster than local dataset reads.

This makes Shelby particularly promising for:

  • repeated training workflows,
  • reusable AI datasets,
  • distributed dataset delivery,
  • and large-scale media/data pipelines.

Limitations

This project is currently an MVP prototype intended for benchmarking and architectural validation.

Current limitations include:

  • Uses MNIST-scale image datasets only
  • Dataset loader downloads full shards before training begins
  • No true range-based or partial sample loading implemented
  • Cold-start performance varies with network conditions and shard count
  • Not optimized for distributed or multi-node training workloads
  • No parallel shard prefetching pipeline
  • No GPU-specific optimization or acceleration
  • Benchmark scope currently focused on repeated-read training scenarios

Recommended Next Improvements

  • Add range-request based loading
  • Add concurrent shard downloading
  • Add retry/failure analytics
  • Add larger dataset benchmarks
  • Add distributed training support
  • Add dataset metadata indexing
  • Add benchmark visualization dashboard
  • Add CLI commands:
    • shelbytrain prepare
    • shelbytrain shard
    • shelbytrain upload
    • shelbytrain benchmark
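
Of the improvements listed above, concurrent shard downloading is the most direct way to cut the cold-start cost. A minimal sketch using a thread pool is shown below; fetch_all and the fetch callable are hypothetical and stand in for whatever function retrieves a single blob from Shelby.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(blob_names, fetch, max_workers=4):
    """Fetch several blobs in parallel threads, preserving input order."""
    names = list(blob_names)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of names; an exception raised
        # by a fetch call is re-raised when its result is reached.
        return dict(zip(names, pool.map(fetch, names)))
```

Threads suit this workload because downloads spend most of their time waiting on the network rather than running Python code.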

Contribution Summary

ShelbyTrain is an experimental ML integration layer built on top of Shelby.

It extends Shelby’s AI use case by adding:

  • dataset sharding,
  • manifest generation,
  • PyTorch integration,
  • caching,
  • benchmarking,
  • and ML workflow validation.

The project demonstrates how Shelby can function as a reusable dataset layer for repeated AI training workloads.


Future Direction

The next stage of ShelbyTrain is a lightweight dashboard/dApp that will allow users to:

  • upload datasets,
  • visualize shards,
  • monitor benchmarks,
  • preview samples,
  • and test dataset delivery performance directly from the browser.

License

MIT
