Decentralized AI dataset pipeline built on Shelby Protocol
ShelbyTrain
ShelbyTrain is a decentralized dataset pipeline for AI training.
It helps you take a dataset, split it into training-friendly shards, upload those shards to Shelby decentralized storage, and load them later from anywhere using PyTorch.
The simple idea:
dataset -> shards -> Shelby storage -> manifest -> PyTorch DataLoader
Install
pip install shelbytrain
The Python package provides the sharding, manifest, cache, Shelby download client, and PyTorch dataset loader pieces. The web app in this repository builds on top of those same ideas to make upload, benchmark, and reconstruction easier from a browser.
The Problem
AI datasets are often hard to share and reproduce.
Teams pass around zip files, cloud drive links, private buckets, or local folders that only work on one machine. That creates a few real problems:
- Training data is not portable.
- Large files are expensive to move repeatedly.
- A model run can silently depend on a local folder nobody else has.
- Sharing data with another researcher or team usually means copying the whole dataset again.
- Re-running experiments wastes time downloading the same data over and over.
ShelbyTrain solves this by making the dataset addressable through a manifest.
Instead of sending someone the whole dataset, you send them a small manifest.uploaded.json. That manifest tells ShelbyTrain where the dataset shards live on Shelby and how to load them.
What This Project Provides
ShelbyTrain has two parts:
- A web app. The app lets users connect a wallet, upload datasets, shard them, push them to Shelby, benchmark local vs Shelby loading, and reconstruct uploaded data from a manifest.
- A Python/PyTorch loader. The Python package reads a manifest, downloads shards from Shelby when needed, caches them locally, and exposes the data as a PyTorch dataset.
If you are installing from PyPI, the main thing you need is the Python loader/sharder API. If you are using the full repository, you also get the browser app.
Why Shelby?
Shelby provides decentralized storage for data blobs. ShelbyTrain uses that storage layer for dataset shards.
The goal is not just to upload files. The goal is to make datasets easier to:
- share,
- verify,
- cache,
- reload,
- benchmark,
- and use in training workflows.
Real World Use Cases
ShelbyTrain is useful for:
- AI researchers sharing reproducible training datasets.
- Small teams that do not want every member manually downloading the same large dataset.
- Hackathon and demo projects that need portable AI data.
- Open dataset publishing where the dataset should be accessible by manifest.
- Benchmarking local disk loading vs decentralized cold/cached loading.
- Reconstructing a dataset from a manifest someone sent you.
Current Features
- Wallet-gated app experience.
- User-owned uploads through the Shelby browser SDK.
- Dataset sharding before upload.
- Manifest generation.
- Manifest sharing.
- PyTorch-compatible dataset loading.
- Local shard cache for faster repeat runs.
- Benchmark page for local, Shelby cold, and Shelby cached loading.
- Reconstruct page for rebuilding data from a sent manifest.
- Support for image, text, CSV, JSONL, JSON, PDF, DOCX, Parquet, and audio-oriented dataset flows.
Supported Dataset Formats
| Format | Use case | Output |
|---|---|---|
| image-tar | Image datasets | TAR shards with images/ and labels.csv |
| text-jsonl | Text, CSV, PDF, DOCX, JSONL | TAR shards with data.jsonl |
| parquet | Tabular data or embeddings | TAR shards with Parquet data |
| audio-tar | Audio datasets | TAR shards with audio files and labels.csv |
For PDF and DOCX files, ShelbyTrain extracts the text and stores it as JSONL. Reconstructing those uploads returns extracted text, not the original binary PDF or DOCX file.
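As an illustration of the text-jsonl layout, the sketch below reads the records out of one TAR shard using only the standard library. It assumes data.jsonl sits at the root of the archive and holds one JSON object per line; the function names `read_jsonl_shard` and `build_demo_shard` are hypothetical and are not part of the shelbytrain API.

```python
import io
import json
import tarfile


def read_jsonl_shard(tar_bytes: bytes, member_name: str = "data.jsonl") -> list:
    """Return the JSON records stored in one text-jsonl TAR shard.

    Assumption: the shard keeps its records in a root-level data.jsonl file.
    """
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        member = tar.extractfile(member_name)
        if member is None:
            raise ValueError(f"{member_name} is not a regular file in the shard")
        lines = member.read().decode("utf-8").splitlines()
    # Skip blank lines so a trailing newline does not break parsing.
    return [json.loads(line) for line in lines if line.strip()]


def build_demo_shard(records: list) -> bytes:
    """Build an in-memory TAR shard so the reader can be tried without Shelby."""
    payload = "\n".join(json.dumps(r) for r in records).encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        info = tarfile.TarInfo(name="data.jsonl")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


demo_shard = build_demo_shard([{"text": "hello world", "label": 0}])
records = read_jsonl_shard(demo_shard)
```

For a PDF or DOCX upload, each record would carry extracted text rather than the original binary, which is why reconstruction cannot return the source file.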
How The App Works
- Connect an Aptos/Shelby-compatible wallet.
- Upload a dataset or select a dataset format.
- ShelbyTrain creates local shards.
- The browser uploads shards to Shelby using the connected wallet.
- The backend writes a manifest.uploaded.json.
- Anyone with the manifest and the Shelby account/API access can load or reconstruct the dataset.
The connected wallet is the upload authority.
Manifest Example
{
  "name": "bitcoinos",
  "format": "text-jsonl",
  "version": "0.2.0",
  "total_samples": 615,
  "shard_size": 10000,
  "text_field": "text",
  "label_field": "label",
  "shards": [
    {
      "index": 0,
      "file": "shard-00000.tar",
      "samples": 615,
      "blob_name": "bitcoinos-mpinvn45/shard-00000.tar",
      "size_bytes": 92160,
      "sha256": "..."
    }
  ]
}
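The size_bytes and sha256 fields make it possible to check a downloaded shard before using it. The sketch below shows one way to do that, assuming sha256 is the hex digest of the whole shard file (the example manifest elides the value, so this is an assumption). `verify_shard` is a hypothetical helper, not a shelbytrain function.

```python
import hashlib


def verify_shard(data: bytes, shard_entry: dict) -> bool:
    """Check shard bytes against one entry of the manifest's "shards" list.

    Assumption: "sha256" is the hex digest of the complete shard file.
    """
    # The size check is cheap, so do it before hashing.
    if len(data) != shard_entry["size_bytes"]:
        return False
    return hashlib.sha256(data).hexdigest() == shard_entry["sha256"].lower()


# Demo entry built from known bytes; a real entry comes from the manifest.
payload = b"example shard bytes"
entry = {
    "file": "shard-00000.tar",
    "size_bytes": len(payload),
    "sha256": hashlib.sha256(payload).hexdigest(),
}
is_valid = verify_shard(payload, entry)
```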
For best portability, a shared manifest should also include the Shelby owner account:
{
  "shelby_account": "0x..."
}
If the manifest does not include shelby_account, the Reconstruct page lets the user enter it manually.
For PyTorch users, the owner account matters because Shelby blob URLs are resolved from:
shelby_account + blob_name
So a manifest with blob_name but no owner account is not fully self-contained. It can still work, but the user must know which account owns the blobs.
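A minimal sketch of that check follows: it pairs each shard's blob_name with the owner account, falling back to an account supplied by the user when the manifest omits shelby_account. The helper name `blob_references` is hypothetical, and the sketch deliberately stops short of building a URL, since the URL format is not described here.

```python
from typing import List, Optional, Tuple


def blob_references(
    manifest: dict, fallback_account: Optional[str] = None
) -> List[Tuple[str, str]]:
    """Return (owner account, blob_name) pairs for every shard in a manifest."""
    # Prefer the account recorded in the manifest; otherwise use the caller's.
    account = manifest.get("shelby_account") or fallback_account
    if not account:
        raise ValueError(
            "manifest has no shelby_account; pass the owner account explicitly"
        )
    return [(account, shard["blob_name"]) for shard in manifest["shards"]]


# Demo manifest without an owner account, as in the example above.
manifest = {
    "name": "bitcoinos",
    "shards": [{"index": 0, "blob_name": "bitcoinos-mpinvn45/shard-00000.tar"}],
}
refs = blob_references(manifest, fallback_account="0x...")
```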
PyTorch Usage
from shelbytrain import load_dataset, ShelbyHTTPClient
from torch.utils.data import DataLoader

client = ShelbyHTTPClient(
    account="0x...",  # Shelby/Aptos account that owns the blobs
    api_key="...",    # Shelby API key
)

dataset = load_dataset("manifest.uploaded.json", client=client)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# model, criterion, and optimizer are assumed to be defined elsewhere.
for inputs, labels in loader:
    optimizer.zero_grad()  # clear gradients left over from the previous step
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
On the first run, ShelbyTrain downloads shards from Shelby. After that, it uses the local cache.
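The cache-first behavior can be pictured with the sketch below. It is not ShelbyTrain's actual cache code: `fetch_shard` and the `download` callback are hypothetical names, and the cache layout (one file per blob_name under a cache directory) is an assumption made for illustration.

```python
import tempfile
from pathlib import Path
from typing import Callable


def fetch_shard(
    cache_dir: Path, blob_name: str, download: Callable[[str], bytes]
) -> Path:
    """Return a local path for a shard, downloading only on a cache miss."""
    local_path = cache_dir / blob_name
    if local_path.exists():
        return local_path  # cache hit: no network access needed
    local_path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so a failed download never leaves
    # a partial file that looks like a valid cached shard.
    tmp_path = local_path.parent / (local_path.name + ".part")
    tmp_path.write_bytes(download(blob_name))
    tmp_path.replace(local_path)
    return local_path


# Demo with a fake downloader that records how often it is called.
calls = []


def fake_download(name: str) -> bytes:
    calls.append(name)
    return b"shard bytes"


with tempfile.TemporaryDirectory() as tmp_dir:
    cache = Path(tmp_dir)
    first = fetch_shard(cache, "demo/shard-00000.tar", fake_download)
    second = fetch_shard(cache, "demo/shard-00000.tar", fake_download)
    cached_bytes = second.read_bytes()
```

In the demo, the second call returns the cached file without invoking the downloader again, which is the same effect a repeat training run relies on.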
Creating Local Shards
Image datasets should be arranged like this:
dataset/
  images/
    sample-001.png
    sample-002.png
  labels.csv
labels.csv should contain:
filename,label
sample-001.png,0
sample-002.png,1
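Before sharding, it can help to confirm that every file named in labels.csv actually exists. The sketch below is one way to do that with the standard library, assuming labels.csv sits next to the images/ folder inside the dataset directory; `missing_images` is a hypothetical helper and not part of shelbytrain.

```python
import csv
import tempfile
from pathlib import Path


def missing_images(dataset_dir: Path) -> list:
    """Return filenames listed in labels.csv that are absent from images/."""
    with open(dataset_dir / "labels.csv", newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    return [
        row["filename"]
        for row in rows
        if not (dataset_dir / "images" / row["filename"]).is_file()
    ]


# Demo: two labelled samples, but only the first image is on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    root = Path(tmp_dir)
    (root / "images").mkdir()
    (root / "images" / "sample-001.png").write_bytes(b"")
    (root / "labels.csv").write_text(
        "filename,label\nsample-001.png,0\nsample-002.png,1\n", encoding="utf-8"
    )
    missing = missing_images(root)
```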
Then create shards:
from shelbytrain import create_image_shards

manifest = create_image_shards(
    dataset_dir="dataset",
    output_dir="data/my_dataset",
    shard_size=1000,
    dataset_name="my-dataset",
)
For text datasets, use JSONL:
{"text": "hello world", "label": 0}
{"text": "another sample", "label": 1}
The generated manifest.json describes local shards. After upload, manifest.uploaded.json should include Shelby blob_name values for each shard.
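To see how total_samples and shard_size relate to the shards list, the sketch below computes a shard plan. It assumes samples are split in order with the remainder in the last shard, and it reuses the shard-00000.tar naming pattern from the manifest example; `plan_shards` is a hypothetical helper, and the real sharder may behave differently.

```python
import math


def plan_shards(total_samples: int, shard_size: int) -> list:
    """Describe the shards needed to hold total_samples at shard_size each."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    shard_count = math.ceil(total_samples / shard_size)
    plan = []
    for index in range(shard_count):
        start = index * shard_size
        # The last shard holds whatever is left over.
        samples = min(shard_size, total_samples - start)
        plan.append(
            {"index": index, "file": f"shard-{index:05d}.tar", "samples": samples}
        )
    return plan


single = plan_shards(615, 10000)   # same numbers as the manifest example
several = plan_shards(2500, 1000)  # three shards: 1000, 1000, 500 samples
```

With 615 samples and a shard_size of 10000, the plan contains a single shard of 615 samples, which matches the example manifest.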
Reconstructing From A Sent Manifest
Use the app’s Reconstruct page.
- Upload manifest.uploaded.json.
- Enter the Shelby owner account if the manifest does not include it.
- Click Reconstruct data.
- ShelbyTrain downloads the shards from Shelby and returns the reconstructed file.
Current reconstruct outputs:
| Manifest format | Downloaded output |
|---|---|
| text-jsonl | .txt |
| image-tar | .tar.gz |
| parquet | .parquet |
Project Status
ShelbyTrain is an experimental dataset pipeline. The current focus is proving a practical decentralized workflow for AI data:
- user-owned upload,
- portable manifests,
- PyTorch loading,
- caching,
- benchmarking,
- and reconstruction.
Future improvements could include original-file preservation for PDFs/DOCX, richer label editing, manifest signing, public dataset pages, and a smoother manifest-first PyTorch API.