Fast and memory-efficient webdataset shard reader

Fast dataloader and conversion utility for webdataset tar shards. Rust core with Python bindings.

Built for streaming large video and image datasets, but handles any byte data.

Install

pip install webshart

What is this?

Webshart is a fast reader for webdataset tar files with separate JSON index files. This format enables random access to any file in the dataset without downloading the entire archive.

The indexed format provides several practical benefits:

  • Random access: Jump to any file instantly
  • Selective downloads: Only fetch the files you need
  • True parallelism: Read from multiple shards simultaneously
  • Cloud-optimized: Works efficiently with HTTP range requests
  • Aspect bucketing: Optionally include image geometry hints (width, height, and aspect ratio) so that samples can be bucketed by shape
  • Custom DataLoader: Includes state dict methods on the DataLoader so that you can resume training deterministically
  • Rate-limit friendly: Local caching allows high-frequency random seeking without encountering storage provider rate limits
  • Instant start-up with pre-sorted aspect buckets
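
The deterministic-resume idea from the list above can be illustrated with a small sketch. This is not webshart's DataLoader or its API: the class name `ResumableOrder` and its fields are hypothetical, and the only assumption is that resuming requires saving the shuffle seed and the current position.

```python
import random


class ResumableOrder:
    """Hypothetical iterator over sample indices that can be saved and resumed."""

    def __init__(self, num_items: int, seed: int = 0):
        self.num_items = num_items
        self.seed = seed
        self.position = 0
        self._order = self._make_order()

    def _make_order(self) -> list:
        # A seeded shuffle gives the same order every time for the same seed.
        order = list(range(self.num_items))
        random.Random(self.seed).shuffle(order)
        return order

    def __iter__(self):
        return self

    def __next__(self) -> int:
        if self.position >= self.num_items:
            raise StopIteration
        index = self._order[self.position]
        self.position += 1
        return index

    def state_dict(self) -> dict:
        # Everything needed to rebuild the order and continue from the same spot.
        return {"num_items": self.num_items, "seed": self.seed, "position": self.position}

    def load_state_dict(self, state: dict) -> None:
        self.num_items = state["num_items"]
        self.seed = state["seed"]
        self.position = state["position"]
        self._order = self._make_order()
```

Saving `state_dict()` alongside a training checkpoint and calling `load_state_dict()` on restart continues with the remaining indices in the same order.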

Growing ecosystem: While not all datasets use this format yet, you can easily create indices for any tar-based dataset (see below).

Quick Start

import webshart

# Find your dataset
dataset = webshart.discover_dataset(
    source="laion/conceptual-captions-12m-webdataset",
    # Metadata can be hosted in a separate repository, which reduces load on Hugging Face infrastructure.
    metadata="webshart/conceptual-captions-12m-webdataset-metadata",
)
print(f"Found {dataset.num_shards} shards")

Common Patterns

For real-world, working examples:

Creating Indices for / Converting Existing Datasets

Any tar-based webdataset can benefit from indexing! Webshart includes tools to generate indices:

A command-line tool that auto-discovers tars to process:

% webshart extract-metadata \
    --source laion/conceptual-captions-12m-webdataset \
    --destination laion_output/ \
    --checkpoint-dir ./laion_output/checkpoints \
    --max-workers 2 \
    --include-image-geometry

Or, if you need to integrate directly with an existing Python application, use the Python API.

Uploading Indices to HuggingFace

Once you've generated indices, share them with the community:

# Upload all JSON files to your dataset
huggingface-cli upload --repo-type=dataset \
    username/dataset-name \
    ./indices/ \
    --include "*.json" \
    --path-in-repo "indices/"

Or if you want to contribute to an existing dataset you don't own:

  1. Create a community dataset with indices: username/original-dataset-indices
  2. Upload the JSON files there
  3. Open a discussion on the original dataset suggesting they add the indices

Creating New Indexed Datasets

If you're creating a new dataset, generate indices during creation. Each index is a JSON file that maps every file in the shard to its byte offset and length:

{
  "files": {
    "image_0001.webp": {"offset": 512, "length": 102400},
    "image_0002.webp": {"offset": 102912, "length": 98304},
    ...
  }
}

The JSON index should have the same base name as the tar file (e.g., shard_0000.tar → shard_0000.json).
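
As a rough illustration of the format above, an index for an uncompressed tar shard can be produced with Python's standard `tarfile` module. This is a minimal sketch, not webshart's own indexer: the function names `build_index` and `write_index` are hypothetical, and it assumes that "offset" means the start of each file's data (after its 512-byte tar header), which matches the first entry in the example.

```python
import json
import tarfile
from pathlib import Path


def build_index(tar_path: str) -> dict:
    """Return {"files": {name: {"offset": ..., "length": ...}}} for one tar shard."""
    files = {}
    # "r:" opens the archive without compression, so offsets refer to the raw file.
    with tarfile.open(tar_path, "r:") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, and other non-file entries
            files[member.name] = {
                # offset_data is where the member's payload starts, after its header
                "offset": member.offset_data,
                "length": member.size,
            }
    return {"files": files}


def write_index(tar_path: str) -> Path:
    """Write the index next to the tar, e.g. shard_0000.tar -> shard_0000.json."""
    index_path = Path(tar_path).with_suffix(".json")
    index_path.write_text(json.dumps(build_index(tar_path), indent=2))
    return index_path
```

For real datasets, the extract-metadata tool described earlier is the intended route; the sketch only shows where the numbers in the JSON come from.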

Why is it fast?

Problem: Standard tar files require sequential reading. To get file #10,000, you must read through files #1-9,999 first.

Solution: The indexed format stores byte offsets and sample metadata in a separate JSON file, enabling:

  • HTTP range requests for any file
  • True random access over network
  • Parallel reads from multiple shards
  • Large scale, aspect-bucketed datasets
  • No wasted bandwidth
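
To make the mechanism concrete, here is a small sketch of how an index entry turns into a byte-range read. The helper names `range_header` and `read_entry` are hypothetical and are not part of webshart; the sketch assumes index entries shaped like the JSON example above, with "offset" pointing at the start of the file's data.

```python
def range_header(offset: int, length: int) -> dict:
    """Build an HTTP Range header covering `length` bytes starting at `offset`."""
    if length <= 0:
        raise ValueError("length must be positive")
    # HTTP byte ranges use an inclusive end position, hence the -1.
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def read_entry(tar_path: str, entry: dict) -> bytes:
    """Local equivalent of a range request: seek to the payload and read it."""
    with open(tar_path, "rb") as f:
        f.seek(entry["offset"])
        return f.read(entry["length"])
```

For the first entry in the index example, `range_header(512, 102400)` yields `bytes=512-102911`, so only that file's bytes are transferred, no matter how large the shard is.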

The Rust implementation provides:

  • Real parallelism (no Python GIL)
  • Zero-copy operations where possible
  • Efficient HTTP connection pooling
  • Optimized tokio async runtime
  • Optional local caching for metadata and shards
  • Fast aspect bucketing for image data
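
Aspect bucketing itself is simple to picture. The sketch below is plain Python, not the Rust implementation: the function `bucket_by_aspect` and the "width" and "height" field names are assumptions for illustration. It groups samples by width divided by height, rounded to two decimals.

```python
from collections import defaultdict


def bucket_by_aspect(samples: dict, precision: int = 2) -> dict:
    """Group sample names by rounded aspect ratio (width / height)."""
    buckets = defaultdict(list)
    for name, meta in samples.items():
        aspect = round(meta["width"] / meta["height"], precision)
        buckets[aspect].append(name)
    return dict(buckets)


samples = {
    "a.webp": {"width": 1024, "height": 1024},
    "b.webp": {"width": 512, "height": 512},
    "c.webp": {"width": 1920, "height": 1080},
    "d.webp": {"width": 512, "height": 1024},
}
buckets = bucket_by_aspect(samples)
```

When geometry hints are stored in the index, this grouping can be done from metadata alone, without opening any image.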

Datasets Using This Format

I discovered after creating this library that cheesechaser is the origin of the indexed tar format, which webshart has formalised and extended to include aspect bucketing support.

  • NebulaeWis/e621-2024-webp-4Mpixel
  • picollect/danbooru2 (subfolder: images)
  • Many picollect image datasets
  • Your dataset could be next! See "Creating Indices" above

Requirements

  • Python 3.8+
  • Linux/macOS/Windows

Roadmap

  • Image decoding is not currently handled by this library; it will be added with zero-copy support.
  • More informative API for caching and other Rust implementation details
  • Multi-GPU/multi-node friendly dataloader

Projects using webshart

  • CaptionFlow uses this library to solve the memory use and seek performance issues typical of webdatasets

License

MIT
