webshart

Fast and memory-efficient webdataset shard reader

These details have not been verified by PyPI

Project description

Fast dataloader and conversion utility for webdataset tar shards. Rust core with Python bindings.

Built for streaming large video and image datasets, but handles any byte data.

Install

pip install webshart

What is this?

Webshart is a fast reader for webdataset tar files with separate JSON index files. This format enables random access to any file in the dataset without downloading the entire archive.

The indexed format provides several performance and usability benefits:

Random access: Jump to any file instantly
Selective downloads: Only fetch the files you need
True parallelism: Read from multiple shards simultaneously
Cloud-optimized: Works efficiently with HTTP range requests
Aspect bucketing: Optionally store image geometry hints (width, height, and aspect) so that images can be bucketed by shape
Logical sample APIs: Treat image.ext + image.json pairs as one sample while still allowing raw file access
Caption metadata: Store captions in shard metadata under the plural captions key as either a string or a list of strings
Custom DataLoader: Includes state dict methods on the DataLoader so that you can resume training deterministically
Rate-limit friendly: Local caching allows high-frequency random seeking without encountering storage provider rate limits
Instant start-up with pre-sorted aspect buckets

Growing ecosystem: While not all datasets use this format yet, you can easily create indices for any tar-based dataset (see below).

Quick Start

import webshart

# Find your dataset
dataset = webshart.discover_dataset(
    source="laion/conceptual-captions-12m-webdataset",
    # Metadata can be hosted in a separate repo, which reduces load on Hugging Face infrastructure.
    metadata="webshart/conceptual-captions-12m-webdataset-metadata",
)
print(f"Found {dataset.num_shards} shards")

loader = webshart.TarDataLoader(dataset)

# File-oriented access is still available.
files = dataset.list_files_in_shard(0)

# Sample-oriented access skips paired JSON sidecars.
samples = dataset.list_samples_in_shard(0)
entry = loader.load_sample(0, 0)
print(entry.path, entry.captions, entry.json_metadata)

Common Patterns

For real-world, working examples:

Creating Indices for / Converting Existing Datasets

Any tar-based webdataset can benefit from indexing! Webshart includes tools to generate indices:

A command-line tool that auto-discovers tars to process:

% webshart extract-metadata \
    --source laion/conceptual-captions-12m-webdataset \
    --destination laion_output/ \
    --checkpoint-dir ./laion_output/checkpoints \
    --max-workers 2 \
    --include-image-geometry

Or, if you need to integrate directly with an existing Python application, use the API.

Uploading Indices to HuggingFace

Once you've generated indices, share them with the community:

# Upload all JSON files to your dataset
huggingface-cli upload --repo-type=dataset \
    username/dataset-name \
    ./indices/ \
    --include "*.json" \
    --path-in-repo "indices/"

Or if you want to contribute to an existing dataset you don't own:

Create a community dataset with indices: username/original-dataset-indices
Upload the JSON files there
Open a discussion on the original dataset suggesting they add the indices

Creating New Indexed Datasets

If you're creating a new dataset, generate indices during creation. Each index maps file names to their byte offset and length within the tar:

{
  "files": {
    "image_0001.webp": {"offset": 512, "length": 102400},
    "image_0002.webp": {"offset": 103424, "length": 98304},
    ...
  }
}

The JSON index should have the same name as the tar file (e.g., shard_0000.tar → shard_0000.json).
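One way to produce such an index with only the standard library is to write the shard first and then read the member offsets back with tarfile. This is a minimal sketch, not webshart's own implementation: write_demo_shard, build_index and read_member are hypothetical helper names, and it assumes that "offset" points at the first byte of a file's data (consistent with the first file starting at 512, right after its tar header). It relies on TarInfo.offset_data, which CPython fills in when an archive is read.

```python
import io
import json
import os
import tarfile
import tempfile


def write_demo_shard(tar_path, payloads):
    """Write an uncompressed tar from a {name: bytes} mapping (demo data only)."""
    with tarfile.open(tar_path, "w") as tar:
        for name, data in payloads.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))


def build_index(tar_path):
    """Return {"files": {name: {"offset": ..., "length": ...}}} for one shard."""
    files = {}
    with tarfile.open(tar_path, "r:") as tar:
        for member in tar:
            if member.isfile():
                # offset_data is the byte position where the member's payload starts.
                files[member.name] = {
                    "offset": member.offset_data,
                    "length": member.size,
                }
    return {"files": files}


def read_member(tar_path, entry):
    """Read one file straight from the shard by seeking to its indexed offset."""
    with open(tar_path, "rb") as fh:
        fh.seek(entry["offset"])
        return fh.read(entry["length"])


# Demo: two small payloads, with the index written next to the tar under a matching name.
payloads = {"image_0001.webp": b"a" * 1000, "image_0002.webp": b"b" * 300}
workdir = tempfile.mkdtemp()
tar_path = os.path.join(workdir, "shard_0000.tar")
write_demo_shard(tar_path, payloads)

index = build_index(tar_path)
with open(os.path.join(workdir, "shard_0000.json"), "w") as fh:
    json.dump(index, fh, indent=2)
```

Reading a member back with read_member(tar_path, index["files"]["image_0002.webp"]) touches only that file's bytes, which is the property the indexed format depends on.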

Image + JSON Sidecar Samples

Webshart supports webdataset shards that store each sample as an image-like payload plus a paired JSON sidecar:

sample_0001.webp
sample_0001.json
sample_0002.webp
sample_0002.json

When metadata is extracted or loaded, sidecars are attached to their paired sample entries:

{
  "files": {
    "sample_0001.webp": {
      "offset": 512,
      "length": 102400,
      "width": 1024,
      "height": 1024,
      "aspect": 1.0,
      "json_path": "sample_0001.json",
      "json_offset": 103424,
      "json_length": 128,
      "captions": "a product photo on a white background",
      "json_metadata": {
        "caption": "a product photo on a white background"
      }
    },
    "sample_0001.json": {
      "offset": 103424,
      "length": 128
    }
  }
}
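The pairing rule can be illustrated in a few lines of plain Python. This is a sketch of the idea only, assuming a sidecar is any .json member whose name without the extension matches a non-JSON member; pair_sidecars is a hypothetical helper, and treating unpaired JSON files as their own samples is an assumption rather than documented webshart behaviour.

```python
from pathlib import PurePosixPath


def pair_sidecars(filenames):
    """Return (samples, sidecars).

    samples maps each sample's file name to its sidecar name (or None);
    sidecars is the set of JSON files that were attached to a sample.
    """
    # Index every non-JSON payload by its name without the extension.
    payload_by_stem = {}
    for name in filenames:
        path = PurePosixPath(name)
        if path.suffix.lower() != ".json":
            payload_by_stem[str(path.with_suffix(""))] = name

    samples = {name: None for name in payload_by_stem.values()}
    sidecars = set()
    for name in filenames:
        path = PurePosixPath(name)
        if path.suffix.lower() != ".json":
            continue
        payload = payload_by_stem.get(str(path.with_suffix("")))
        if payload is not None:
            samples[payload] = name
            sidecars.add(name)
        else:
            # Assumption: a JSON file with no matching payload stays a sample.
            samples[name] = None
    return samples, sidecars


files = ["sample_0001.webp", "sample_0001.json", "sample_0002.webp", "notes.json"]
samples, sidecars = pair_sidecars(files)
```

Here sample_0001.json is attached to sample_0001.webp and drops out of the sample list, while sample_0002.webp simply has no sidecar.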

Use file-oriented APIs when you want every archive member, including sidecars:

dataset.list_files_in_shard(0)

reader = dataset.open_shard(0)
raw_file_bytes = reader.read_file(0)

Use sample-oriented APIs when you want training samples:

dataset.list_samples_in_shard(0)
dataset.get_shard_sample_count(0)

reader = dataset.open_shard(0)
image_bytes = reader.read_sample(0)
json_bytes = reader.read_sample_json(0)

entry = loader.load_sample(0, 0)
print(entry.path)
print(entry.captions)
print(entry.json_data)

Captions are canonicalized to the plural captions metadata key. The value may be a single string, a list of strings, or absent.

webshart.write_captions_to_metadata(
    "shard_0000.json",
    {
        "sample_0001.webp": "a short caption",
        "sample_0002": ["caption one", "caption two"],
    },
)

The writer updates existing webshart metadata JSON in place, removes old singular caption keys from updated samples, and leaves paired .json sidecar entries untouched.
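The canonicalization rule itself is small enough to sketch. The function below is an illustration of the behaviour described above, not webshart's code; canonicalize_captions is a hypothetical name, and preferring an existing captions value over a leftover singular caption is an assumption.

```python
def canonicalize_captions(entry):
    """Return a copy of a metadata entry with captions under the plural key."""
    result = dict(entry)
    # Always drop the old singular key from the returned entry.
    singular = result.pop("caption", None)
    value = result.get("captions", singular)
    if value is None:
        # No caption at all: leave the key absent.
        result.pop("captions", None)
    elif isinstance(value, str):
        result["captions"] = value
    else:
        result["captions"] = [str(item) for item in value]
    return result


single = canonicalize_captions({"offset": 512, "caption": "a short caption"})
multi = canonicalize_captions({"offset": 512, "captions": ("one", "two")})
absent = canonicalize_captions({"offset": 512})
```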

Aspect Bucketing Samples

list_shard_aspect_buckets() is file-oriented and buckets any indexed file that has width and height.

For training pipelines, prefer list_shard_sample_aspect_buckets():

loader = webshart.TarDataLoader(dataset)
buckets = loader.list_shard_sample_aspect_buckets(
    [0],
    key="geometry-tuple",
    target_pixel_area=1024**2,
)[0]["buckets"]

for bucket_key, entries in buckets.items():
    for item in entries:
        virtual_id = f"webshart://0/{item['sample_idx']}/{item['filename']}"
        image = loader.load_sample(0, item["sample_idx"])

This uses logical samples from metadata.sample_range() / get_sample_by_index() and excludes paired JSON sidecars before bucketing. Each bucket entry includes sample_idx, so callers can build stable IDs and load images directly with loader.load_sample(shard_idx, sample_idx).
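To make the idea of bucketing by geometry concrete, here is a sketch of one common scheme: scale each image to roughly target_pixel_area while keeping its aspect ratio, then snap both sides to a fixed step. It is an assumption that this resembles what the "geometry-tuple" key does; bucket_geometry, bucket_samples, the step parameter and the rounding rule are all hypothetical and may differ from webshart's output.

```python
import math
from collections import defaultdict


def snap(value, step):
    """Round to the nearest multiple of step, never below one step."""
    return max(step, int(round(value / step)) * step)


def bucket_geometry(width, height, target_pixel_area=1024**2, step=64):
    """Return a (width, height) bucket with about target_pixel_area pixels."""
    aspect = width / height
    # Solve w * h = area with w / h = aspect.
    target_height = math.sqrt(target_pixel_area / aspect)
    target_width = target_height * aspect
    return snap(target_width, step), snap(target_height, step)


def bucket_samples(entries, target_pixel_area=1024**2):
    """Group {filename: metadata} entries by bucket, skipping files without geometry."""
    buckets = defaultdict(list)
    for sample_idx, (filename, meta) in enumerate(entries.items()):
        if "width" not in meta or "height" not in meta:
            continue
        key = bucket_geometry(meta["width"], meta["height"], target_pixel_area)
        buckets[key].append({"sample_idx": sample_idx, "filename": filename})
    return dict(buckets)


entries = {
    "a.webp": {"width": 1024, "height": 1024},
    "b.webp": {"width": 1920, "height": 1080},
    "c.webp": {"width": 512, "height": 512},
}
buckets = bucket_samples(entries)
```

Both square images land in the (1024, 1024) bucket and the 16:9 image in (1344, 768), so a batch drawn from one bucket can share a single tensor shape.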

Why is it fast?

Problem: Standard tar files require sequential reading. To get file #10,000, you must read through files #1-9,999 first.

Solution: The indexed format stores byte offsets and sample metadata in a separate JSON file, enabling:

HTTP range requests for any file
True random access over network
Parallel reads from multiple shards
Large scale, aspect-bucketed datasets
No wasted bandwidth
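As an illustration of how an index entry turns into a range request, the sketch below builds the Range header with the standard library. It is not webshart's internal code (the library does this in Rust); range_request is a hypothetical helper and the URL is a placeholder.

```python
import urllib.request


def range_request(url, entry):
    """Build a request for exactly one file's bytes inside a remote tar."""
    start = entry["offset"]
    # HTTP byte ranges are inclusive on both ends.
    end = start + entry["length"] - 1
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


request = range_request(
    "https://example.com/shard_0000.tar",  # placeholder URL
    {"offset": 512, "length": 102400},
)
```

Passing the request to urllib.request.urlopen against a server that honours range requests should return a 206 response whose body is only that file's bytes, so nothing else in the shard is downloaded.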

The Rust implementation provides:

Real parallelism (no Python GIL)
Zero-copy operations where possible
Efficient HTTP connection pooling
Optimized tokio async runtime
Optional local caching for metadata and shards
Fast aspect bucketing for image data

Datasets Using This Format

I discovered after creating this library that cheesechaser is the origin of the indexed tar format, which webshart has formalised and extended to include aspect bucketing support.

NebulaeWis/e621-2024-webp-4Mpixel
picollect/danbooru2 (subfolder: images)
Many picollect image datasets
Your dataset could be next! See "Creating Indices" above

Requirements

Python 3.8+
Linux/macOS/Windows

Roadmap

Image decoding is not yet handled by this library; zero-copy decoding is planned.
A more informative API for caching and other Rust implementation details
A multi-GPU/multi-node friendly dataloader

Projects using webshart

CaptionFlow uses this library to solve the memory use and seek performance issues typical of webdatasets

License

MIT

Release history

0.4.6 (this version): Apr 27, 2026
0.4.5: Apr 26, 2026
0.4.4: Apr 26, 2026
0.4.3: Sep 6, 2025
0.4.2: Sep 6, 2025
0.4.1: Sep 6, 2025
0.4.0: Sep 3, 2025
0.3.0: Sep 2, 2025
0.2.0: Aug 31, 2025
0.1.0: Aug 30, 2025

Download files

Download the file for your platform.

Source Distribution

webshart-0.4.6.tar.gz (100.2 kB), uploaded Apr 27, 2026, Source

Built Distributions

webshart-0.4.6-cp38-abi3-win_amd64.whl (2.8 MB), uploaded Apr 27, 2026, CPython 3.8+, Windows x86-64
webshart-0.4.6-cp38-abi3-manylinux_2_39_x86_64.whl (5.0 MB), uploaded Apr 27, 2026, CPython 3.8+, manylinux: glibc 2.39+ x86-64
webshart-0.4.6-cp38-abi3-manylinux_2_39_aarch64.whl (5.1 MB), uploaded Apr 27, 2026, CPython 3.8+, manylinux: glibc 2.39+ ARM64
webshart-0.4.6-cp38-abi3-manylinux_2_35_x86_64.whl (5.0 MB), uploaded Apr 27, 2026, CPython 3.8+, manylinux: glibc 2.35+ x86-64
webshart-0.4.6-cp38-abi3-macosx_11_0_arm64.whl (3.0 MB), uploaded Apr 27, 2026, CPython 3.8+, macOS 11.0+ ARM64

File details

Details for the file webshart-0.4.6.tar.gz.

File metadata

Download URL: webshart-0.4.6.tar.gz
Upload date: Apr 27, 2026
Size: 100.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for webshart-0.4.6.tar.gz
Algorithm	Hash digest
SHA256	`9b61f1918f7efd0b795815cf6431f0fb4d2da3d9b2b89086ab6747df806039b7`
MD5	`0728f8a881b4bb06cca0ed2f26ab8e04`
BLAKE2b-256	`b290db57a0de02c3cabb9ce27069a00686e711a25f437b8f39f3f7e66993246d`

See more details on using hashes here.

File details

Details for the file webshart-0.4.6-cp38-abi3-win_amd64.whl.

File metadata

Download URL: webshart-0.4.6-cp38-abi3-win_amd64.whl
Upload date: Apr 27, 2026
Size: 2.8 MB
Tags: CPython 3.8+, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for webshart-0.4.6-cp38-abi3-win_amd64.whl
Algorithm	Hash digest
SHA256	`001286d3e8d9351d3c796855c82b60c7e7b12fd52b2b98f93c0a9dadc2c6727e`
MD5	`860e61d0ce350dc63a1aa1bab854e0c8`
BLAKE2b-256	`7765f93048bbd260169e603bd7bceaba4d35d32fb678634a2f82e9def2c92d8f`

File details

Details for the file webshart-0.4.6-cp38-abi3-manylinux_2_39_x86_64.whl.

File metadata

Download URL: webshart-0.4.6-cp38-abi3-manylinux_2_39_x86_64.whl
Upload date: Apr 27, 2026
Size: 5.0 MB
Tags: CPython 3.8+, manylinux: glibc 2.39+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for webshart-0.4.6-cp38-abi3-manylinux_2_39_x86_64.whl
Algorithm	Hash digest
SHA256	`ae50c07f66d44db33f2e666bc4d8f85c5a69db8bd0d657835316c4b296a4c9fd`
MD5	`d670f515d4d3c68e6c615a23f45ebd11`
BLAKE2b-256	`3d5f1d9cc5df8145a58733c9e746b6000669b1143b08d4967e36fe570ffa64d3`

File details

Details for the file webshart-0.4.6-cp38-abi3-manylinux_2_39_aarch64.whl.

File metadata

Download URL: webshart-0.4.6-cp38-abi3-manylinux_2_39_aarch64.whl
Upload date: Apr 27, 2026
Size: 5.1 MB
Tags: CPython 3.8+, manylinux: glibc 2.39+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for webshart-0.4.6-cp38-abi3-manylinux_2_39_aarch64.whl
Algorithm	Hash digest
SHA256	`69c69c490f6f1bac144b3688a6fcc294c856b69fe6dbf8b3d7177d386894afd8`
MD5	`047f5e174bba8f1e59bd6e14a6ff6ae0`
BLAKE2b-256	`fda02e83b7b286afa153e03d5c91eb7d8308b141810797de485e9332e3ba2f18`

File details

Details for the file webshart-0.4.6-cp38-abi3-manylinux_2_35_x86_64.whl.

File metadata

Download URL: webshart-0.4.6-cp38-abi3-manylinux_2_35_x86_64.whl
Upload date: Apr 27, 2026
Size: 5.0 MB
Tags: CPython 3.8+, manylinux: glibc 2.35+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for webshart-0.4.6-cp38-abi3-manylinux_2_35_x86_64.whl
Algorithm	Hash digest
SHA256	`28a638732037a018bf9044b053c2a80c60b6a8c1846818aef01e0119045d0754`
MD5	`243215b4020504f4bb8cbb774783940f`
BLAKE2b-256	`f98d782b3e1f5c78a87ca250ad845753ffba883ea216804160ce0bad9bb754f2`

File details

Details for the file webshart-0.4.6-cp38-abi3-macosx_11_0_arm64.whl.

File metadata

Download URL: webshart-0.4.6-cp38-abi3-macosx_11_0_arm64.whl
Upload date: Apr 27, 2026
Size: 3.0 MB
Tags: CPython 3.8+, macOS 11.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for webshart-0.4.6-cp38-abi3-macosx_11_0_arm64.whl
Algorithm	Hash digest
SHA256	`28c4bd8420c381195e45610a880aebd65ab5dc1621a1b49774dab09b16c265aa`
MD5	`7466b84a9544fc269277e2f5abb54397`
BLAKE2b-256	`f9cfc5c630da7f23bae12ddc3c9d8d5d5e5e9aa13a0c91167abb5ba3b8975182`

