Decentralized AI dataset pipeline built on Shelby Protocol

ShelbyTrain

ShelbyTrain is a decentralized dataset pipeline for AI training.

It helps you take a dataset, split it into training-friendly shards, upload those shards to Shelby decentralized storage, and load them later from anywhere using PyTorch.

The simple idea:

dataset -> shards -> Shelby storage -> manifest -> PyTorch DataLoader

Install

pip install shelbytrain

The Python package provides the sharding, manifest, cache, Shelby download client, and PyTorch dataset loader pieces. The web app in this repository builds on top of those same ideas to make upload, benchmark, and reconstruction easier from a browser.

The Problem

AI datasets are often hard to share and reproduce.

Teams pass around zip files, cloud drive links, private buckets, or local folders that only work on one machine. That creates a few real problems:

Training data is not portable.
Large files are expensive to move repeatedly.
A model run can silently depend on a local folder nobody else has.
Sharing data with another researcher or team usually means copying the whole dataset again.
Re-running experiments wastes time downloading the same data over and over.

ShelbyTrain solves this by making the dataset addressable through a manifest.

Instead of sending someone the whole dataset, you send them a small manifest.uploaded.json. That manifest tells ShelbyTrain where the dataset shards live on Shelby and how to load them.

What This Project Provides

ShelbyTrain has two parts:

A web app

The app lets users connect a wallet, upload datasets, shard them, push them to Shelby, benchmark local vs Shelby loading, and reconstruct uploaded data from a manifest.

A Python/PyTorch loader

The Python package reads a manifest, downloads shards from Shelby when needed, caches them locally, and exposes the data as a PyTorch dataset.

If you are installing from PyPI, the main thing you need is the Python loader/sharder API. If you are using the full repository, you also get the browser app.

Why Shelby?

Shelby provides decentralized storage for data blobs. ShelbyTrain uses that storage layer for dataset shards.

The goal is not just to upload files. The goal is to make datasets easier to:

share,
verify,
cache,
reload,
benchmark,
and use in training workflows.

Real World Use Cases

ShelbyTrain is useful for:

AI researchers sharing reproducible training datasets.
Small teams that do not want every member manually downloading the same large dataset.
Hackathon and demo projects that need portable AI data.
Open dataset publishing where the dataset should be accessible by manifest.
Benchmarking local disk loading vs decentralized cold/cached loading.
Reconstructing a dataset from a manifest someone sent you.

Current Features

Wallet-gated app experience.
User-owned uploads through the Shelby browser SDK.
Dataset sharding before upload.
Manifest generation.
Manifest sharing.
PyTorch-compatible dataset loading.
Local shard cache for faster repeat runs.
Benchmark page for local, Shelby cold, and Shelby cached loading.
Reconstruct page for rebuilding data from a sent manifest.
Support for image, text, CSV, JSONL, JSON, PDF, DOCX, Parquet, and audio-oriented dataset flows.

Supported Dataset Formats

Format	Use case	Output
`image-tar`	Image datasets	TAR shards with `images/` and `labels.csv`
`text-jsonl`	Text, CSV, PDF, DOCX, JSONL	TAR shards with `data.jsonl`
`parquet`	Tabular data or embeddings	TAR shards with Parquet data
`audio-tar`	Audio datasets	TAR shards with audio files and `labels.csv`

For PDF and DOCX files, ShelbyTrain extracts the text and stores it as JSONL. Reconstructing those uploads returns extracted text, not the original binary PDF or DOCX file.
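To make the `text-jsonl` layout concrete, here is a minimal sketch of packing text records into TAR shards that each hold a single `data.jsonl` member. It is an illustration only, not ShelbyTrain's actual sharder: the function name `write_text_shards` is hypothetical, and the `shard-00000.tar` naming is assumed from the manifest example in this README.

```python
import io
import json
import tarfile
from pathlib import Path


def write_text_shards(records, output_dir, shard_size=10000):
    """Pack dict records into TAR shards, one data.jsonl member per shard.

    Hypothetical helper for illustration. Returns one dict per shard with
    its index, file name, and sample count.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    shards = []
    for index, start in enumerate(range(0, len(records), shard_size)):
        chunk = records[start:start + shard_size]
        # One JSON object per line, encoded as UTF-8.
        payload = "".join(
            json.dumps(record, ensure_ascii=False) + "\n" for record in chunk
        ).encode("utf-8")
        name = f"shard-{index:05d}.tar"
        with tarfile.open(output_dir / name, "w") as tar:
            info = tarfile.TarInfo(name="data.jsonl")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
        shards.append({"index": index, "file": name, "samples": len(chunk)})
    return shards
```

Text extracted from a PDF or DOCX file would enter this flow as ordinary records such as `{"text": "...", "label": 0}`, which is why the original binary file cannot be recovered from the shards.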

How The App Works

Connect an Aptos/Shelby-compatible wallet.
Upload a dataset or select a dataset format.
ShelbyTrain creates local shards.
The browser uploads shards to Shelby using the connected wallet.
The backend writes a manifest.uploaded.json.
Anyone with the manifest, the Shelby owner account, and API access can load or reconstruct the dataset.

The connected wallet is the upload authority.

Manifest Example

{
  "name": "bitcoinos",
  "format": "text-jsonl",
  "version": "0.2.0",
  "total_samples": 615,
  "shard_size": 10000,
  "text_field": "text",
  "label_field": "label",
  "shards": [
    {
      "index": 0,
      "file": "shard-00000.tar",
      "samples": 615,
      "blob_name": "bitcoinos-mpinvn45/shard-00000.tar",
      "size_bytes": 92160,
      "sha256": "..."
    }
  ]
}
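A manifest like this can be sanity-checked before any training run. The sketch below uses only the fields shown in the example; both function names are hypothetical, and it assumes that `sha256` holds the hex digest of the shard file's raw bytes, which this README does not state explicitly.

```python
import hashlib


def check_manifest(manifest):
    """Return a list of problems found in a manifest dict (empty if none)."""
    problems = []
    shards = manifest.get("shards", [])
    counted = sum(shard.get("samples", 0) for shard in shards)
    if counted != manifest.get("total_samples"):
        problems.append(
            f"total_samples is {manifest.get('total_samples')} "
            f"but shards sum to {counted}"
        )
    for position, shard in enumerate(shards):
        if shard.get("index") != position:
            problems.append(
                f"shard at position {position} has index {shard.get('index')}"
            )
    return problems


def sha256_matches(shard_path, expected_hex):
    """Compare a local shard file against an expected SHA-256 hex digest.

    Assumption: the manifest's sha256 covers the whole shard file.
    """
    digest = hashlib.sha256()
    with open(shard_path, "rb") as fh:
        # Read in 1 MiB blocks so large shards do not need to fit in memory.
        for block in iter(lambda: fh.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest() == expected_hex.lower()
```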

For best portability, a shared manifest should also include the Shelby owner account:

{
  "shelby_account": "0x..."
}

If the manifest does not include shelby_account, the Reconstruct page lets the user enter it manually.

For PyTorch users, the owner account matters because Shelby blob URLs are resolved from:

shelby_account + blob_name

So a manifest with blob_name but no owner account is not fully self-contained. It can still work, but the user must know which account owns the blobs.
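The fallback described above can be sketched as a small helper that pairs each shard's `blob_name` with an owner account. The name `blob_references` is hypothetical, and the sketch deliberately stops short of building a URL, because the Shelby URL scheme is not described in this README.

```python
def blob_references(manifest, fallback_account=None):
    """Return (owner_account, blob_name) pairs for every shard.

    The manifest's shelby_account is used when present; otherwise the
    caller must supply the owner account, mirroring the manual entry
    on the Reconstruct page.
    """
    owner = manifest.get("shelby_account") or fallback_account
    if not owner:
        raise ValueError(
            "manifest has no shelby_account; pass the owner account explicitly"
        )
    references = []
    for shard in manifest["shards"]:
        if "blob_name" not in shard:
            raise ValueError(
                f"shard {shard.get('index')} has no blob_name; "
                "is this an uploaded manifest?"
            )
        references.append((owner, shard["blob_name"]))
    return references
```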

PyTorch Usage

from shelbytrain import load_dataset, ShelbyHTTPClient
from torch.utils.data import DataLoader

client = ShelbyHTTPClient(
    account="0x...",      # Shelby/Aptos account that owns the blobs
    api_key="...",        # Shelby API key
)

dataset = load_dataset("manifest.uploaded.json", client=client)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

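# model, criterion, and optimizer are assumed to be defined elsewhere
for inputs, labels in loader:
    optimizer.zero_grad()             # clear gradients from the previous step
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()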

On the first run, ShelbyTrain downloads shards from Shelby. After that, it uses the local cache.
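The download-once behaviour can be pictured with a short sketch. This is not ShelbyTrain's cache implementation: `cached_shard` and its `fetch` callable are hypothetical stand-ins for whatever the package does internally.

```python
from pathlib import Path


def cached_shard(blob_name, cache_dir, fetch):
    """Return a local path for blob_name, downloading only on a cache miss.

    `fetch` is any callable that takes a blob name and returns bytes
    (a hypothetical stand-in for a Shelby download call).
    """
    path = Path(cache_dir) / blob_name
    if path.exists():
        return path  # cache hit: no network access needed
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # is never mistaken for a complete cached shard.
    partial = path.with_name(path.name + ".part")
    partial.write_bytes(fetch(blob_name))
    partial.replace(path)
    return path
```

The second call for the same blob returns immediately from disk, which is the difference a cold vs cached benchmark would measure.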

Creating Local Shards

Image datasets should be arranged like this:

dataset/
  images/
    sample-001.png
    sample-002.png
  labels.csv

labels.csv should contain:

filename,label
sample-001.png,0
sample-002.png,1

Then create shards:

from shelbytrain import create_image_shards

manifest = create_image_shards(
    dataset_dir="dataset",
    output_dir="data/my_dataset",
    shard_size=1000,
    dataset_name="my-dataset",
)
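Before sharding, it can help to confirm that `labels.csv` and the `images/` folder agree. The check below is a hypothetical pre-flight helper, not part of the shelbytrain API; it assumes only the directory layout and the `filename,label` header described above.

```python
import csv
from pathlib import Path


def check_image_dataset(dataset_dir):
    """Return (missing_files, unlabelled_files) for an image dataset folder.

    missing_files: listed in labels.csv but absent from images/
    unlabelled_files: present in images/ but not listed in labels.csv
    """
    root = Path(dataset_dir)
    with open(root / "labels.csv", newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        columns = set(reader.fieldnames or [])
        if not {"filename", "label"} <= columns:
            raise ValueError("labels.csv needs 'filename' and 'label' columns")
        labelled = {row["filename"] for row in reader}
    on_disk = {p.name for p in (root / "images").iterdir() if p.is_file()}
    return sorted(labelled - on_disk), sorted(on_disk - labelled)
```

Two empty lists mean every labelled file exists and every image has a label.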

For text datasets, use JSONL:

{"text": "hello world", "label": 0}
{"text": "another sample", "label": 1}

The generated manifest.json describes local shards. After upload, manifest.uploaded.json should include Shelby blob_name values for each shard.
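The difference between the two manifests can be illustrated with a small sketch. It assumes, based only on the manifest example in this README, that a `blob_name` is an upload prefix joined to the shard's `file` name; the helper `with_blob_names` is hypothetical, and the real app may assign names differently.

```python
import copy


def with_blob_names(manifest, blob_prefix):
    """Return a copy of a local manifest with blob_name set on each shard.

    Assumption: blob_name has the form "<prefix>/<file>", as in the
    example "bitcoinos-mpinvn45/shard-00000.tar".
    """
    uploaded = copy.deepcopy(manifest)  # leave the local manifest untouched
    for shard in uploaded["shards"]:
        shard["blob_name"] = f"{blob_prefix}/{shard['file']}"
    return uploaded
```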

Reconstructing From A Sent Manifest

Use the app’s Reconstruct page.

Upload manifest.uploaded.json.
Enter the Shelby owner account if the manifest does not include it.
Click Reconstruct data.
ShelbyTrain downloads the shards from Shelby and returns the reconstructed file.

Current reconstruct outputs:

Manifest format	Downloaded output
`text-jsonl`	`.txt`
`image-tar`	`.tar.gz`
`parquet`	`.parquet`
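For the `text-jsonl` case, reconstruction amounts to reading `data.jsonl` out of each shard and joining the text. The sketch below shows that idea with the standard library; `shards_to_text` is a hypothetical name, and the newline-joined output is an assumption about what the `.txt` download contains.

```python
import json
import tarfile


def shards_to_text(shard_paths, text_field="text"):
    """Join the text field of every record found in the given TAR shards."""
    parts = []
    for shard_path in shard_paths:
        with tarfile.open(shard_path) as tar:
            # Each text-jsonl shard is expected to hold one data.jsonl member.
            member = tar.extractfile("data.jsonl")
            for line in member.read().decode("utf-8").splitlines():
                if line.strip():
                    parts.append(json.loads(line)[text_field])
    return "\n".join(parts)
```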

Project Status

ShelbyTrain is an experimental dataset pipeline. The current focus is proving a practical decentralized workflow for AI data:

user-owned upload,
portable manifests,
PyTorch loading,
caching,
benchmarking,
and reconstruction.

Future improvements could include original-file preservation for PDF and DOCX uploads, richer label editing, manifest signing, public dataset pages, and a smoother manifest-first PyTorch API.

Release history

Version	Release date
0.2.0 (this version)	May 25, 2026
0.1.0	May 22, 2026

Download files

Source Distribution

shelbytrain-0.2.0.tar.gz (17.0 kB), uploaded May 25, 2026

Built Distribution

shelbytrain-0.2.0-py3-none-any.whl (16.0 kB, Python 3), uploaded May 25, 2026

File details

Details for the file shelbytrain-0.2.0.tar.gz.

File metadata

Download URL: shelbytrain-0.2.0.tar.gz
Upload date: May 25, 2026
Size: 17.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for shelbytrain-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`c1113311dd9e8734a997d1d2e5d82e7c78d680856d2571b955d69bdfadc01e3b`
MD5	`982f929e6a060c013524eae1eb30576b`
BLAKE2b-256	`fb7a1d6cc330b2a273ef1d6441d2fc4f686b8a4c6cabac119658725472c4f380`

File details

Details for the file shelbytrain-0.2.0-py3-none-any.whl.

File metadata

Download URL: shelbytrain-0.2.0-py3-none-any.whl
Upload date: May 25, 2026
Size: 16.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for shelbytrain-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a0cfcab4267dbf4bb710b698a9e7bfb109eba8dbab24611b416ff830bcd9591c`
MD5	`6b39dfbe4a88303553f47084b3fa89f5`
BLAKE2b-256	`ad4d9f05179d0322a2392aabb777058e00c5881ff381096b071bcb34560b300d`
