Fast and memory-efficient webdataset shard reader

Project description

Fast dataloader and conversion utility for webdataset tar shards. Rust core with Python bindings.

Built for streaming large video and image datasets, but handles any byte data.

Install

pip install webshart

What is this?

Webshart is a fast reader for webdataset tar files with separate JSON index files. This format enables random access to any file in the dataset without downloading the entire archive.

The indexed format provides several practical benefits:

  • Random access: Jump to any file instantly
  • Selective downloads: Only fetch the files you need
  • True parallelism: Read from multiple shards simultaneously
  • Cloud-optimized: Works efficiently with HTTP range requests
  • Aspect bucketing: Optionally include image geometry hints (width, height, and aspect ratio) so that samples can be bucketed by shape
  • Custom DataLoader: Includes state dict methods on the DataLoader so that you can resume training deterministically
  • Rate-limit friendly: Local caching allows high-frequency random seeking without encountering storage provider rate limits
  • Instant start-up with pre-sorted aspect buckets
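
The state dict idea mentioned in the list above can be illustrated with a minimal sketch. This is not webshart's DataLoader: the class name `ResumableSampler` and its fields are hypothetical, and it only shows the general pattern of saving a seed and a position so that iteration resumes in the same order.

```python
import random


class ResumableSampler:
    """Hypothetical sketch of a seeded, resumable sampler (not webshart's API)."""

    def __init__(self, num_samples, seed=0):
        self.num_samples = num_samples
        self.seed = seed
        self.position = 0  # number of indices handed out so far
        self.order = self._build_order()

    def _build_order(self):
        # A seeded shuffle produces the same order on every run.
        order = list(range(self.num_samples))
        random.Random(self.seed).shuffle(order)
        return order

    def __iter__(self):
        return self

    def __next__(self):
        if self.position >= self.num_samples:
            raise StopIteration
        index = self.order[self.position]
        self.position += 1
        return index

    def state_dict(self):
        # Everything needed to rebuild the exact same position later.
        return {
            "num_samples": self.num_samples,
            "seed": self.seed,
            "position": self.position,
        }

    def load_state_dict(self, state):
        self.num_samples = state["num_samples"]
        self.seed = state["seed"]
        self.order = self._build_order()
        self.position = state["position"]
```

Saving `state_dict()` alongside a training checkpoint and calling `load_state_dict()` on a fresh sampler continues from the next unseen index rather than restarting the epoch.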

Growing ecosystem: While not all datasets use this format yet, you can easily create indices for any tar-based dataset (see below).

Quick Start

import webshart

# Find your dataset
dataset = webshart.discover_dataset(
    source="laion/conceptual-captions-12m-webdataset",
    # we're able to upload metadata separately so that we reduce load on huggingface infra.
    metadata="webshart/conceptual-captions-12m-webdataset-metadata",
)
print(f"Found {dataset.num_shards} shards")

Common Patterns

For real-world, working examples:

Creating Indices for / Converting Existing Datasets

Any tar-based webdataset can benefit from indexing! Webshart includes tools to generate indices:

A command-line tool that auto-discovers tars to process:

% webshart extract-metadata \
    --source laion/conceptual-captions-12m-webdataset \
    --destination laion_output/ \
    --checkpoint-dir ./laion_output/checkpoints \
    --max-workers 2 \
    --include-image-geometry

Or, if you need to integrate directly with an existing Python application, use the Python API.

Uploading Indices to HuggingFace

Once you've generated indices, share them with the community:

# Upload all JSON files to your dataset
huggingface-cli upload --repo-type=dataset \
    username/dataset-name \
    ./indices/ \
    indices/ \
    --include "*.json"

Or if you want to contribute to an existing dataset you don't own:

  1. Create a community dataset with indices: username/original-dataset-indices
  2. Upload the JSON files there
  3. Open a discussion on the original dataset suggesting they add the indices

Creating New Indexed Datasets

If you're creating a new dataset, generate indices during creation:

{
  "files": {
    "image_0001.webp": {"offset": 512, "length": 102400},
    "image_0002.webp": {"offset": 102912, "length": 98304},
    ...
  }
}

The JSON index should have the same base name as the tar file (e.g., shard_0000.tar → shard_0000.json).
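
As a rough illustration of how such an index can be produced, here is a minimal sketch using only Python's standard `tarfile` module. It is not webshart's own indexer: the helper names `build_index`, `write_index`, and `read_member` are hypothetical, and it assumes that `offset` means the position where a file's data begins (after its tar header) and that the shard is an uncompressed tar.

```python
import json
import tarfile
from pathlib import Path


def build_index(tar_path):
    """Return {"files": {name: {"offset": ..., "length": ...}}} for one tar shard."""
    files = {}
    with tarfile.open(tar_path, "r:") as tar:  # "r:" accepts uncompressed tar only
        for member in tar:
            if member.isfile():
                # offset_data is where the file's bytes start, after its tar header.
                files[member.name] = {
                    "offset": member.offset_data,
                    "length": member.size,
                }
    return {"files": files}


def write_index(tar_path):
    """Write the index next to the tar, e.g. shard_0000.tar -> shard_0000.json."""
    index_path = Path(tar_path).with_suffix(".json")
    index_path.write_text(json.dumps(build_index(tar_path), indent=2))
    return index_path


def read_member(tar_path, entry):
    """Random access: seek straight to the stored offset and read `length` bytes."""
    with open(tar_path, "rb") as f:
        f.seek(entry["offset"])
        return f.read(entry["length"])
```

With the index written, `read_member` returns one file's bytes without scanning the members that precede it in the archive.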

Why is it fast?

Problem: Standard tar files require sequential reading. To get file #10,000, you must read through files #1-9,999 first.

Solution: The indexed format stores byte offsets and sample metadata in a separate JSON file, enabling:

  • HTTP range requests for any file
  • True random access over network
  • Parallel reads from multiple shards
  • Large scale, aspect-bucketed datasets
  • No wasted bandwidth
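
To make the range-request idea concrete, here is a minimal sketch using only the standard library. It is not how webshart itself performs requests (that happens in the Rust core): the function names `byte_range_header` and `fetch_member` are hypothetical, and the sketch assumes an index entry of the form `{"offset": ..., "length": ...}` where `offset` is the start of the file's data.

```python
import urllib.request


def byte_range_header(entry):
    """Build a Range header value; HTTP byte ranges are inclusive on both ends."""
    first = entry["offset"]
    last = entry["offset"] + entry["length"] - 1
    return f"bytes={first}-{last}"


def fetch_member(shard_url, entry, timeout=30):
    """Fetch one file's bytes from a remote tar shard with a single range request."""
    if entry["length"] == 0:
        return b""  # an empty file has no valid byte range to request
    request = urllib.request.Request(
        shard_url, headers={"Range": byte_range_header(entry)}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # 206 Partial Content means the server honoured the range.
        if response.status != 206:
            raise RuntimeError(
                f"server ignored the Range header (status {response.status})"
            )
        return response.read()
```

For an entry with offset 512 and length 102400, `byte_range_header` yields `bytes=512-102911`, so only that file's bytes cross the network.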

The Rust implementation provides:

  • Real parallelism (no Python GIL)
  • Zero-copy operations where possible
  • Efficient HTTP connection pooling
  • Optimized tokio async runtime
  • Optional local caching for metadata and shards
  • Fast aspect bucketing for image data
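
Aspect bucketing itself is simple to describe in Python, even though webshart does it in Rust. The following is a sketch under assumptions, not the library's algorithm: the function names, the `step` parameter, and the `{"width": ..., "height": ...}` metadata layout are all hypothetical.

```python
from collections import defaultdict


def aspect_bucket_key(width, height, step=0.25):
    """Round width/height to the nearest multiple of `step` (e.g. 0.75, 1.0, 1.75)."""
    aspect = width / height
    return round(round(aspect / step) * step, 2)


def bucket_by_aspect(samples, step=0.25):
    """Group sample names by rounded aspect ratio.

    `samples` maps a file name to a dict holding "width" and "height".
    """
    buckets = defaultdict(list)
    for name, meta in samples.items():
        key = aspect_bucket_key(meta["width"], meta["height"], step)
        buckets[key].append(name)
    return dict(buckets)
```

Because the geometry comes from the index, buckets like these can be formed without downloading or decoding any image.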

Datasets Using This Format

I discovered after creating this library that cheesechaser is the origin of the indexed tar format, which webshart has formalised and extended to include aspect bucketing support.

  • NebulaeWis/e621-2024-webp-4Mpixel
  • picollect/danbooru2 (subfolder: images)
  • Many picollect image datasets
  • Your dataset could be next! See "Creating Indices" above

Requirements

  • Python 3.8+
  • Linux/macOS/Windows

Roadmap

  • image decoding is not yet handled by this library; zero-copy decoding is planned.
  • more informative API for caching and other Rust implementation details
  • multi-gpu/multi-node friendly dataloader

Projects using webshart

  • CaptionFlow uses this library to solve the memory use and seek performance issues typical of webdatasets

License

MIT
