Fast and memory-efficient webdataset shard reader

Fast dataloader and conversion utility for webdataset tar shards. Rust core with Python bindings.

Built for streaming large video and image datasets, but handles any byte data.

Install

pip install webshart

What is this?

Webshart is a fast reader for webdataset tar files with separate JSON index files. This format enables random access to any file in the dataset without downloading the entire archive.

The indexed format provides massive performance benefits:

  • Random access: Jump to any file instantly
  • Selective downloads: Only fetch the files you need
  • True parallelism: Read from multiple shards simultaneously
  • Cloud-optimized: Works efficiently with HTTP range requests
  • Aspect bucketing: Optionally include image geometry hints (width, height, and aspect ratio) so that samples can be bucketed by shape
  • Custom DataLoader: Includes state dict methods on the DataLoader so that you can resume training deterministically
  • Rate-limit friendly: Local caching allows high-frequency random seeking without encountering storage provider rate limits
  • Instant start-up with pre-sorted aspect buckets
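
To illustrate what aspect bucketing means in practice, here is a minimal sketch in plain Python. It does not use webshart's API; the `samples` dictionary and the `bucket_by_aspect` helper are hypothetical, and the real index may store geometry hints in a different layout.

```python
from collections import defaultdict


def bucket_by_aspect(samples, precision=1):
    """Group sample names by aspect ratio (width / height), rounded to `precision` decimals."""
    buckets = defaultdict(list)
    for name, meta in samples.items():
        aspect = round(meta["width"] / meta["height"], precision)
        buckets[aspect].append(name)
    return dict(buckets)


# Hypothetical geometry hints, for illustration only.
samples = {
    "a.webp": {"width": 1024, "height": 1024},
    "b.webp": {"width": 1920, "height": 1080},
    "c.webp": {"width": 512, "height": 512},
    "d.webp": {"width": 1280, "height": 720},
}

buckets = bucket_by_aspect(samples)
for aspect, names in sorted(buckets.items()):
    print(aspect, names)
```

Because the geometry is already in the metadata, grouping like this needs no image decoding, which is why pre-sorted buckets can make start-up fast.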

Growing ecosystem: While not all datasets use this format yet, you can easily create indices for any tar-based dataset (see below).

Quick Start

import webshart

# Find your dataset
dataset = webshart.discover_dataset(
    source="laion/conceptual-captions-12m-webdataset",
    # Metadata can be hosted separately, which reduces load on Hugging Face infrastructure.
    metadata="webshart/conceptual-captions-12m-webdataset-metadata",
)
print(f"Found {dataset.num_shards} shards")

Common Patterns

For real-world, working examples:

Creating Indices for / Converting Existing Datasets

Any tar-based webdataset can benefit from indexing! Webshart includes tools to generate indices:

A command-line tool that auto-discovers tars to process:

% webshart extract-metadata \
    --source laion/conceptual-captions-12m-webdataset \
    --destination laion_output/ \
    --checkpoint-dir ./laion_output/checkpoints \
    --max-workers 2 \
    --include-image-geometry

Or, if you need to integrate directly into an existing Python application, use the Python API.

Uploading Indices to HuggingFace

Once you've generated indices, share them with the community:

# Upload all JSON files to your dataset
huggingface-cli upload --repo-type=dataset \
    username/dataset-name \
    ./indices/ \
    --include "*.json" \
    --path-in-repo "indices/"

Or if you want to contribute to an existing dataset you don't own:

  1. Create a community dataset with indices: username/original-dataset-indices
  2. Upload the JSON files there
  3. Open a discussion on the original dataset suggesting they add the indices

Creating New Indexed Datasets

If you're creating a new dataset, generate indices during creation. Each index is a JSON file that maps file names to their byte offsets and lengths within the tar:

{
  "files": {
    "image_0001.webp": {"offset": 512, "length": 102400},
    "image_0002.webp": {"offset": 103424, "length": 98304},
    ...
  }
}

The JSON index should have the same name as the tar file (e.g., shard_0000.tar → shard_0000.json).
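
As a rough sketch of how such an index could be produced, the following uses only Python's standard `tarfile` module rather than webshart's own tooling. The helper names `build_index` and `make_demo_tar` are hypothetical, and it assumes that `offset` means the start of a file's payload (just after its 512-byte tar header), which matches the first entry in the example above.

```python
import io
import json
import tarfile


def build_index(tar_bytes):
    """Return {"files": {name: {"offset", "length"}}} for every regular file in the tar."""
    files = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for member in tar:
            if member.isfile():
                # offset_data is where the payload starts, after the member's header.
                files[member.name] = {"offset": member.offset_data, "length": member.size}
    return {"files": files}


def make_demo_tar(payloads):
    """Build a small uncompressed tar in memory so the example needs no files on disk."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, data in payloads.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


payloads = {"image_0001.webp": b"a" * 1000, "image_0002.webp": b"b" * 300}
tar_bytes = make_demo_tar(payloads)
index = build_index(tar_bytes)
print(json.dumps(index, indent=2))

# Random access: slice one payload straight out of the archive using the index.
entry = index["files"]["image_0002.webp"]
recovered = tar_bytes[entry["offset"]:entry["offset"] + entry["length"]]
```

Tar pads each payload to a multiple of 512 bytes and puts a header before every member, so consecutive payload offsets are never simply the previous offset plus the previous length.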

Why is it fast?

Problem: Standard tar files require sequential reading. To get file #10,000, you must read through files #1-9,999 first.

Solution: The indexed format stores byte offsets and sample metadata in a separate JSON file, enabling:

  • HTTP range requests for any file
  • True random access over network
  • Parallel reads from multiple shards
  • Large scale, aspect-bucketed datasets
  • No wasted bandwidth
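
The mechanism behind the list above can be sketched with the standard library alone. This is not webshart's implementation (which is written in Rust); `range_header` and `fetch_member` are hypothetical helpers, and the shard URL would be whatever your storage provider exposes.

```python
import urllib.request


def range_header(offset, length):
    """Build a Range header value; HTTP ranges are inclusive on both ends."""
    if length <= 0:
        raise ValueError("length must be positive")
    return f"bytes={offset}-{offset + length - 1}"


def fetch_member(url, offset, length, timeout=30):
    """Fetch one file's bytes from a remote tar shard with a single range request."""
    request = urllib.request.Request(url, headers={"Range": range_header(offset, length)})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # 206 Partial Content means the server honoured the requested range.
        if response.status != 206:
            raise RuntimeError(f"server ignored Range header (status {response.status})")
        return response.read()


# Offset and length as they would appear in an index entry.
header = range_header(512, 102400)
print(header)
```

Only the requested bytes cross the network, so the cost of reading one file does not depend on the shard's size or the file's position within it.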

The Rust implementation provides:

  • Real parallelism (no Python GIL)
  • Zero-copy operations where possible
  • Efficient HTTP connection pooling
  • Optimized tokio async runtime
  • Optional local caching for metadata and shards
  • Fast aspect bucketing for image data

Datasets Using This Format

I discovered after creating this library that cheesechaser is the origin of the indexed tar format, which webshart has formalised and extended to include aspect bucketing support.

  • NebulaeWis/e621-2024-webp-4Mpixel
  • picollect/danbooru2 (subfolder: images)
  • Many picollect image datasets
  • Your dataset could be next! See "Creating Indices" above

Requirements

  • Python 3.8+
  • Linux/macOS/Windows

Roadmap

  • Image decoding is not yet handled by this library; it is planned, with zero-copy support.
  • More informative API for caching and other Rust implementation details
  • Multi-GPU/multi-node friendly dataloader

Projects using webshart

  • CaptionFlow uses this library to solve the memory-use and seek-performance issues typical of webdatasets

License

MIT

Download files

Source Distribution

  • webshart-0.4.1.tar.gz (94.1 kB): Source

Built Distributions

  • webshart-0.4.1-cp313-cp313-win_amd64.whl (2.7 MB): CPython 3.13, Windows x86-64
  • webshart-0.4.1-cp313-cp313-macosx_11_0_arm64.whl (3.0 MB): CPython 3.13, macOS 11.0+ ARM64
  • webshart-0.4.1-cp312-cp312-win_amd64.whl (2.7 MB): CPython 3.12, Windows x86-64
  • webshart-0.4.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.8 MB): CPython 3.12, manylinux (glibc 2.17+) x86-64
  • webshart-0.4.1-cp312-cp312-macosx_11_0_arm64.whl (3.0 MB): CPython 3.12, macOS 11.0+ ARM64
  • webshart-0.4.1-cp311-cp311-win_amd64.whl (2.7 MB): CPython 3.11, Windows x86-64
  • webshart-0.4.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.8 MB): CPython 3.11, manylinux (glibc 2.17+) x86-64
  • webshart-0.4.1-cp311-cp311-macosx_11_0_arm64.whl (3.0 MB): CPython 3.11, macOS 11.0+ ARM64
  • webshart-0.4.1-cp310-cp310-win_amd64.whl (2.7 MB): CPython 3.10, Windows x86-64
  • webshart-0.4.1-cp39-cp39-win_amd64.whl (2.7 MB): CPython 3.9, Windows x86-64

