Decentralized AI dataset pipeline built on Shelby Protocol
ShelbyTrain
Streaming Dataset Shards from Shelby into ML Workflows
ShelbyTrain is an experimental ML data pipeline built to evaluate Shelby as a high-performance dataset storage and delivery layer for AI training workflows.
The project demonstrates how image datasets can be:
- prepared locally,
- converted into dataset shards,
- uploaded to Shelby,
- downloaded and cached,
- streamed into PyTorch,
- benchmarked against local storage performance.

It is a developer/research layer built on top of Shelby to test how Shelby performs under repeated ML dataset access workloads.
Project Goal
The goal of this MVP is to answer a simple question:
Can Shelby function as a practical remote dataset layer for machine learning workflows?
The project focuses specifically on:
- repeated dataset access,
- caching efficiency,
- initialization latency,
- throughput during training,
- and integration into PyTorch pipelines.
Current MVP Features
- MNIST image dataset preparation
- Dataset sharding system
- Manifest generation
- Shelby blob upload/download
- Local shard caching
- PyTorch dataset integration
- Benchmarking system
- Cold vs cached performance comparison
Architecture Overview
Images
↓
Dataset Shards (.tar)
↓
Manifest.json
↓
Upload to Shelby
↓
Shelby Blob Storage
↓
Download + Cache
↓
PyTorch Dataset Loader
↓
Benchmark + Training Workflow
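
The first two stages of the flow above (images into .tar shards, plus a manifest) can be sketched as follows. This is a minimal illustration, not the code in shelbytrain/sharder.py: the function name `write_shards`, the shard naming pattern, and the manifest fields are all assumptions made for the example.

```python
import io
import json
import tarfile
from pathlib import Path


def write_shards(samples, out_dir, shard_size=1000):
    """Pack (name, bytes) samples into .tar shards and write a manifest.json.

    The manifest layout is hypothetical, not ShelbyTrain's actual schema.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = {"shards": []}

    for start in range(0, len(samples), shard_size):
        chunk = samples[start:start + shard_size]
        shard_name = f"shard-{start // shard_size:05d}.tar"
        with tarfile.open(out_dir / shard_name, "w") as tar:
            for name, payload in chunk:
                # Each sample becomes one tar member holding the raw bytes.
                info = tarfile.TarInfo(name=name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
        manifest["shards"].append({"file": shard_name, "num_samples": len(chunk)})

    manifest["total_samples"] = sum(s["num_samples"] for s in manifest["shards"])
    (out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

Recording the sample count per shard in the manifest lets a loader know the dataset size without opening every shard first.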
Project Structure
ShelbyTrain/
├── README.md
├── requirements.txt
├── .env.example
├── .gitignore
│
├── shelbytrain/
│ ├── __init__.py
│ ├── benchmark.py
│ ├── cache.py
│ ├── client.py
│ ├── dataset.py
│ └── sharder.py
│
├── scripts/
│ ├── benchmark.py
│ ├── create_shards.py
│ ├── prepare_sample_dataset.py
│ ├── test_loader.py
│ ├── test_loader_shelby.py
│ └── upload_with_cli.py
│
└── data/
System Requirements
Recommended environment:
- Ubuntu / WSL / Linux / macOS
- Python 3.9+
- Node.js 20+
- npm
- Git
- Shelby CLI
Install required system packages:
sudo apt update
sudo apt install python3 python3-venv python3-pip git curl unzip -y
Python Packages
Create requirements.txt:
torch
torchvision
pillow
requests
tqdm
python-dotenv
Environment Setup
Clone the repository:
git clone YOUR_REPO_URL
cd ShelbyTrain
Create a virtual environment:
python3 -m venv .venv
Activate the environment:
source .venv/bin/activate
Upgrade pip:
pip install --upgrade pip
Install dependencies:
pip install -r requirements.txt
If the PyTorch install fails, install the CPU-only build instead:
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
pip install pillow requests tqdm python-dotenv
Verify installation:
python -c "import torch; import torchvision; print('OK')"
Fix Python Import Path
If you encounter:
ModuleNotFoundError: No module named 'shelbytrain'
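Run this from the repository root:
export PYTHONPATH=$PWD
Optional permanent fix (also run from the repository root; the double quotes store the repository's absolute path instead of re-evaluating $PWD in every new shell):
echo "export PYTHONPATH=$PWD" >> ~/.bashrc
source ~/.bashrc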
Shelby CLI Setup
ShelbyTrain uses Shelby CLI to upload dataset shards.
Install Node.js 20 using NVM
Install NVM:
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash
source ~/.bashrc
Install Node.js:
nvm install 20
nvm use 20
nvm alias default 20
Verify:
node -v
npm -v
Install Shelby CLI
npm i -g @shelby-protocol/cli
Verify installation:
shelby --version
Initialize Shelby:
shelby init
Recommended context:
shelbynet
Verify contexts:
shelby context list
Verify account:
shelby account balance
Environment Variables
Create .env:
touch .env
Add:
SHELBY_ACCOUNT=0xyour_account_address
SHELBY_API_KEY=your_api_key_here
Create .env.example:
SHELBY_ACCOUNT=0xyour_account_address
SHELBY_API_KEY=your_api_key_here
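
One way a script could read these two variables and fail early when either is missing is sketched below. The helper name `load_shelby_credentials` is hypothetical, and how ShelbyTrain itself consumes the values is not shown here. With python-dotenv installed, calling `load_dotenv()` first would copy the entries from .env into `os.environ`.

```python
import os


def load_shelby_credentials(env=None):
    """Return (account, api_key), raising if either variable is unset or empty.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    required = ("SHELBY_ACCOUNT", "SHELBY_API_KEY")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env["SHELBY_ACCOUNT"], env["SHELBY_API_KEY"]
```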
Run Local Test
Prepare sample dataset:
python scripts/prepare_sample_dataset.py
Create shards:
python scripts/create_shards.py
Test local loader:
python scripts/test_loader.py
Expected output:
Indexed 5000 samples
Images shape: torch.Size([32, 1, 28, 28])
Labels shape: torch.Size([32])
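
The "Indexed 5000 samples" line suggests the loader builds an index over the shard contents before serving batches. A minimal sketch of that idea, using only the standard library, is shown below. The class name `TarShardIndex` is an assumption for illustration and is not the implementation in shelbytrain/dataset.py.

```python
import tarfile
from pathlib import Path


class TarShardIndex:
    """Index every file in a list of .tar shards so samples can be read by position."""

    def __init__(self, shard_paths):
        self.entries = []  # (shard_path, member_name) pairs
        for path in shard_paths:
            with tarfile.open(path) as tar:
                for member in tar.getmembers():
                    if member.isfile():
                        self.entries.append((Path(path), member.name))

    def __len__(self):
        return len(self.entries)

    def __getitem__(self, idx):
        shard_path, name = self.entries[idx]
        # Reopening the shard per read keeps the sketch simple; a real loader
        # would likely keep handles open or cache decoded samples.
        with tarfile.open(shard_path) as tar:
            return name, tar.extractfile(name).read()
```

Because the class exposes `__len__` and `__getitem__`, the same shape fits a map-style `torch.utils.data.Dataset` subclass that decodes the returned bytes into tensors.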
Upload Shards to Shelby
Upload dataset shards:
python scripts/upload_with_cli.py
Verify uploaded blobs:
shelby account blobs
Test Shelby Loader
Run Shelby-backed loader:
python scripts/test_loader_shelby.py
First run:
- downloads shards from Shelby
Second run:
- uses local cache
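
The download-once, reuse-afterwards behavior described above can be sketched as a small cache lookup. This is an illustration under assumptions: `ensure_cached` is a hypothetical name, and `fetch` stands in for whatever actually retrieves a blob from Shelby. Only the `.shelby-cache` directory name comes from this README.

```python
from pathlib import Path


def ensure_cached(blob_name, fetch, cache_dir=".shelby-cache"):
    """Return (local_path, was_cached) for a shard, downloading only on a miss.

    `fetch(blob_name) -> bytes` is a hypothetical stand-in for the real download.
    Assumes flat blob names without directory separators.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / blob_name

    if target.exists():
        return target, True  # cache hit: no network access

    # Write to a temporary name first so an interrupted download never
    # leaves a truncated file that looks like a valid cached shard.
    tmp = target.with_name(target.name + ".part")
    tmp.write_bytes(fetch(blob_name))
    tmp.replace(target)
    return target, False  # cache miss: shard was just downloaded
```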
Run Benchmark
Clear cache before cold benchmark:
rm -rf .shelby-cache
Run benchmark:
python scripts/benchmark.py
View benchmark results:
cat benchmark-results.json
Benchmark Results
Example benchmark from MVP test:
{
  "local": {
    "batches": 50,
    "batch_size": 32,
    "samples": 1600,
    "time_to_first_batch_sec": 0.1105,
    "total_time_sec": 0.6918,
    "samples_per_sec": 2312.66
  },
  "shelby_cold": {
    "batches": 50,
    "batch_size": 32,
    "samples": 1600,
    "time_to_first_batch_sec": 0.0196,
    "total_time_sec": 0.7025,
    "samples_per_sec": 2277.53,
    "dataset_init_download_sec": 16.4761
  },
  "shelby_cached": {
    "batches": 50,
    "batch_size": 32,
    "samples": 1600,
    "time_to_first_batch_sec": 0.0139,
    "total_time_sec": 0.6019,
    "samples_per_sec": 2658.17,
    "dataset_init_cache_sec": 1.7567
  }
}
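
As a rough guide to how the per-mode fields above could be measured, here is a minimal timing sketch. It is not the code in scripts/benchmark.py: `benchmark_loader` is a hypothetical name, and it accepts any iterable of batches rather than a specific PyTorch DataLoader.

```python
import time


def benchmark_loader(loader, batch_size, max_batches=50):
    """Time an iterable of batches and report throughput-style metrics.

    Assumes every batch is full, so samples = batches * batch_size.
    """
    start = time.perf_counter()
    first_batch_at = None
    batches = 0

    for _ in loader:
        if first_batch_at is None:
            # Latency until the first batch is available to the training loop.
            first_batch_at = time.perf_counter() - start
        batches += 1
        if batches >= max_batches:
            break

    total = time.perf_counter() - start
    samples = batches * batch_size
    return {
        "batches": batches,
        "batch_size": batch_size,
        "samples": samples,
        "time_to_first_batch_sec": round(first_batch_at or 0.0, 4),
        "total_time_sec": round(total, 4),
        "samples_per_sec": round(samples / total, 2) if total > 0 else 0.0,
    }
```

Dataset initialization (download or cache lookup) would be timed separately, before the loop, which matches how the results report it as its own field.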
Benchmark Interpretation
| Mode | Init Time | Time to First Batch | Samples/sec |
|---|---|---|---|
| Local | ~0s | 0.1105s | 2312.66 |
| Shelby Cold | 16.48s | 0.0196s | 2277.53 |
| Shelby Cached | 1.76s | 0.0139s | 2658.17 |
Key Findings
- Shelby cold start introduced a ~16.48 second initialization cost.
- Cached startup cut initialization time from 16.48 s to 1.76 s, roughly 9.4x faster.
- Cached throughput exceeded Shelby cold throughput by ~16.7%.
- Cached throughput exceeded local throughput by ~14.9% in this experiment.
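
The ratios in these findings follow directly from the benchmark numbers. The short check below recomputes them; the variable names are chosen for this example only.

```python
# Values copied from the example benchmark results in this README.
cold_init_sec = 16.4761
cached_init_sec = 1.7567
local_sps = 2312.66
cold_sps = 2277.53
cached_sps = 2658.17

init_speedup = cold_init_sec / cached_init_sec
cached_vs_cold_pct = (cached_sps / cold_sps - 1) * 100
cached_vs_local_pct = (cached_sps / local_sps - 1) * 100

print(f"init speedup:    {init_speedup:.1f}x")          # 9.4x
print(f"cached vs cold:  {cached_vs_cold_pct:.1f}%")    # 16.7%
print(f"cached vs local: {cached_vs_local_pct:.1f}%")   # 14.9%
```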
Conclusion
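Shelby behaves like a remote dataset layer with a one-time initialization cost.
Once shards are downloaded and cached locally, batches are read from local disk, so training throughput in this benchmark was comparable to, and in this run slightly higher than, local dataset reads (2658.17 vs 2312.66 samples/sec over 50 batches).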
This makes Shelby particularly promising for:
- repeated training workflows,
- reusable AI datasets,
- distributed dataset delivery,
- and large-scale media/data pipelines.
Limitations
This project is currently an MVP prototype intended for benchmarking and architectural validation.
Current limitations include:
- Uses MNIST-scale image datasets only
- Dataset loader downloads full shards before training begins
- No true range-based or partial sample loading implemented
- Cold-start performance varies with network conditions and shard count
- Not optimized for distributed or multi-node training workloads
- No parallel shard prefetching pipeline
- No GPU-specific optimization or acceleration
- Benchmark scope currently focused on repeated-read training scenarios
Recommended Next Improvements
- Add range-request based loading
- Add concurrent shard downloading
- Add retry/failure analytics
- Add larger dataset benchmarks
- Add distributed training support
- Add dataset metadata indexing
- Add benchmark visualization dashboard
- Add CLI commands:
shelbytrain prepare
shelbytrain shard
shelbytrain upload
shelbytrain benchmark
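
Of the improvements listed above, concurrent shard downloading is the simplest to sketch. The example below uses a thread pool from the standard library. `download_all` is a hypothetical name and `fetch` is a stand-in for the real Shelby download call, so treat this as one possible shape rather than a planned design.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(blob_names, fetch, max_workers=4):
    """Fetch several shards concurrently and return {blob_name: bytes}.

    `fetch(name) -> bytes` is a hypothetical stand-in for the real download.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with blob_names.
        payloads = list(pool.map(fetch, blob_names))
    return dict(zip(blob_names, payloads))
```

Threads suit this workload because downloads are I/O-bound; any exception raised by `fetch` surfaces when the results are collected.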
Contribution Summary
ShelbyTrain is an experimental ML integration layer built on top of Shelby.
It extends Shelby’s AI use case by adding:
- dataset sharding,
- manifest generation,
- PyTorch integration,
- caching,
- benchmarking,
- and ML workflow validation.
The project demonstrates how Shelby can function as a reusable dataset layer for repeated AI training workloads.
Future Direction
The next stage of ShelbyTrain is a lightweight dashboard/dApp that will allow users to:
- upload datasets,
- visualize shards,
- monitor benchmarks,
- preview samples,
- and test dataset delivery performance directly from the browser.
License
MIT