Fast and memory-efficient webdataset shard reader

Fast dataloader and conversion utility for webdataset tar shards. Rust core with Python bindings.

Built for streaming large video and image datasets, but handles any byte data.

Install

pip install webshart

What is this?

Webshart is a fast reader for webdataset tar files with separate JSON index files. This format enables random access to any file in the dataset without downloading the entire archive.

The indexed format provides massive performance benefits:

  • Random access: Jump to any file instantly
  • Selective downloads: Only fetch the files you need
  • True parallelism: Read from multiple shards simultaneously
  • Cloud-optimized: Works efficiently with HTTP range requests
  • Aspect bucketing: Optionally include image geometry hints (width, height, and aspect) so that samples can be bucketed by shape
  • Custom DataLoader: Includes state dict methods on the DataLoader so that you can resume training deterministically
  • Rate-limit friendly: Local caching allows high-frequency random seeking without encountering storage provider rate limits
  • Instant start-up with pre-sorted aspect buckets
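
To illustrate what "state dict methods" buy you when resuming, here is a minimal pure-Python sketch. It is not webshart's DataLoader: the class name `ResumableOrder` and its fields are hypothetical, and it only shows the general idea that a seed plus a position is enough to continue a shuffled pass exactly where it stopped.

```python
import random


class ResumableOrder:
    """Deterministic shuffled order over sample indices that can be checkpointed.

    Hypothetical illustration only; not part of the webshart API.
    """

    def __init__(self, num_samples: int, seed: int = 0):
        self.num_samples = num_samples
        self.seed = seed
        self.position = 0
        self._order = self._make_order()

    def _make_order(self) -> list:
        # A private Random instance keeps the order independent of global state.
        order = list(range(self.num_samples))
        random.Random(self.seed).shuffle(order)
        return order

    def __iter__(self):
        return self

    def __next__(self) -> int:
        if self.position >= self.num_samples:
            raise StopIteration
        index = self._order[self.position]
        self.position += 1
        return index

    def state_dict(self) -> dict:
        return {"seed": self.seed, "position": self.position}

    def load_state_dict(self, state: dict) -> None:
        # Rebuild the same order from the seed, then jump to the saved position.
        self.seed = state["seed"]
        self.position = state["position"]
        self._order = self._make_order()


original = ResumableOrder(10, seed=42)
seen = [next(original) for _ in range(4)]
checkpoint = original.state_dict()

resumed = ResumableOrder(10)
resumed.load_state_dict(checkpoint)

rest_original = list(original)
rest_resumed = list(resumed)
print(rest_original == rest_resumed)
```

The resumed iterator must be created with the same number of samples as the original; a real loader would also need to record which shard and file it was reading.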

Growing ecosystem: While not all datasets use this format yet, you can easily create indices for any tar-based dataset (see below).

Quick Start

import webshart

# Find your dataset
dataset = webshart.discover_dataset(
    source="laion/conceptual-captions-12m-webdataset",
    # Metadata can be hosted in a separate repository, which reduces load on Hugging Face infrastructure.
    metadata="webshart/conceptual-captions-12m-webdataset-metadata",
)
print(f"Found {dataset.num_shards} shards")

Common Patterns

For real-world, working examples:

Creating Indices for / Converting Existing Datasets

Any tar-based webdataset can benefit from indexing! Webshart includes tools to generate indices:

A command-line tool that auto-discovers tars to process:

% webshart extract-metadata \
    --source laion/conceptual-captions-12m-webdataset \
    --destination laion_output/ \
    --checkpoint-dir ./laion_output/checkpoints \
    --max-workers 2 \
    --include-image-geometry

Or, if you need to integrate directly with an existing Python application, use the Python API.

Uploading Indices to HuggingFace

Once you've generated indices, share them with the community:

# Upload all JSON files to your dataset
huggingface-cli upload --repo-type=dataset \
    username/dataset-name \
    ./indices/ \
    --include "*.json" \
    --path-in-repo "indices/"

Or if you want to contribute to an existing dataset you don't own:

  1. Create a community dataset with indices: username/original-dataset-indices
  2. Upload the JSON files there
  3. Open a discussion on the original dataset suggesting they add the indices

Creating New Indexed Datasets

If you're creating a new dataset, generate the indices as you create it. Each index is a JSON file with the following layout:

{
  "files": {
    "image_0001.webp": {"offset": 512, "length": 102400},
    "image_0002.webp": {"offset": 102912, "length": 98304},
    ...
  }
}

The JSON index should have the same base name as the tar file (e.g., shard_0000.tar → shard_0000.json).
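
As a rough illustration of the layout above, the sketch below builds such an index with Python's standard tarfile module. It assumes that "offset" is the byte position where a member's data begins (after its 512-byte tar header) and that "length" is its size in bytes. The helper names build_index and make_demo_tar are hypothetical and not part of webshart; for real datasets, prefer the webshart extract-metadata tool.

```python
import io
import json
import tarfile


def build_index(tar_bytes: bytes) -> dict:
    """Map each regular file in an uncompressed tar to its data offset and length."""
    files = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for member in tar:
            if member.isfile():
                # offset_data is where the payload starts, after the tar header.
                files[member.name] = {
                    "offset": member.offset_data,
                    "length": member.size,
                }
    return {"files": files}


def make_demo_tar() -> bytes:
    """Create a tiny in-memory tar shard so the example is self-contained."""
    buffer = io.BytesIO()
    payloads = [("image_0001.webp", b"a" * 700), ("image_0002.webp", b"b" * 300)]
    with tarfile.open(fileobj=buffer, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, payload in payloads:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


demo_tar = make_demo_tar()
index = build_index(demo_tar)
print(json.dumps(index, indent=2))

# The index is enough to slice a file straight out of the archive bytes.
entry = index["files"]["image_0002.webp"]
payload = demo_tar[entry["offset"]:entry["offset"] + entry["length"]]
```

In practice the resulting dictionary would be written next to the shard with json.dump, under the matching file name.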

Why is it fast?

Problem: Standard tar files require sequential reading. To get file #10,000, you must read through files #1-9,999 first.

Solution: The indexed format stores byte offsets and sample metadata in a separate JSON file, enabling:

  • HTTP range requests for any file
  • True random access over network
  • Parallel reads from multiple shards
  • Large scale, aspect-bucketed datasets
  • No wasted bandwidth
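
To make the mechanism concrete, here is a minimal sketch of how one index entry turns into a ranged read. The helper names range_header and read_entry are illustrative assumptions, not webshart's API, and the entry is assumed to hold the data offset and length in bytes. Over HTTP the entry becomes a Range header; on a local or cached shard it is a seek followed by a bounded read.

```python
import io


def range_header(offset: int, length: int) -> dict:
    """Build an HTTP Range header; byte ranges are inclusive on both ends."""
    if length <= 0:
        raise ValueError("length must be positive")
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def read_entry(stream, entry: dict) -> bytes:
    """Read exactly one file's bytes from a seekable shard using its index entry."""
    stream.seek(entry["offset"])
    data = stream.read(entry["length"])
    if len(data) != entry["length"]:
        raise IOError("shard is shorter than the index claims")
    return data


# A stand-in "shard": 512 bytes of header-like padding, then the payload.
shard = io.BytesIO(b"\0" * 512 + b"PAYLOAD" + b"\0" * 505)
entry = {"offset": 512, "length": 7}

headers = range_header(entry["offset"], entry["length"])
data = read_entry(shard, entry)
print(headers)  # {'Range': 'bytes=512-518'}
print(data)     # b'PAYLOAD'
```

Because each request names only the bytes it needs, reads against different shards are independent and can be issued in parallel.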

The Rust implementation provides:

  • Real parallelism (no Python GIL)
  • Zero-copy operations where possible
  • Efficient HTTP connection pooling
  • Optimized tokio async runtime
  • Optional local caching for metadata and shards
  • Fast aspect bucketing for image data
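
Aspect bucketing itself is easy to picture. The sketch below is a hypothetical pure-Python version, not webshart's Rust implementation: it assumes each sample carries width and height hints, as the optional image geometry metadata does, and groups samples whose aspect ratio rounds to the same value so that a batch can be drawn from a single shape.

```python
from collections import defaultdict


def bucket_by_aspect(samples: dict, decimals: int = 1) -> dict:
    """Group sample names by aspect ratio (width / height), rounded to `decimals`."""
    buckets = defaultdict(list)
    for name, meta in samples.items():
        aspect = round(meta["width"] / meta["height"], decimals)
        buckets[aspect].append(name)
    return dict(buckets)


# Made-up geometry hints for illustration.
samples = {
    "a.webp": {"width": 1024, "height": 1024},
    "b.webp": {"width": 1280, "height": 720},
    "c.webp": {"width": 512, "height": 512},
    "d.webp": {"width": 1920, "height": 1080},
}
buckets = bucket_by_aspect(samples)
print(buckets)
```

Since the hints live in the index, this grouping needs no image decoding, which is what makes pre-sorted buckets cheap to load at start-up.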

Datasets Using This Format

I discovered after creating this library that cheesechaser is the origin of the indexed tar format, which webshart has formalised and extended to include aspect bucketing support.

  • NebulaeWis/e621-2024-webp-4Mpixel
  • picollect/danbooru2 (subfolder: images)
  • Many picollect image datasets
  • Your dataset could be next! See "Creating Indices" above

Requirements

  • Python 3.8+
  • Linux/macOS/Windows

Roadmap

  • Image decoding is not yet handled by this library; it is planned, using zero-copy where possible
  • A more informative API for caching and other Rust implementation details
  • A multi-GPU/multi-node friendly dataloader

Projects using webshart

  • CaptionFlow uses this library to solve the memory use and seek performance issues typical of webdatasets

License

MIT
