Fast and memory-efficient webdataset shard reader

Fast dataloader and conversion utility for webdataset tar shards. Rust core with Python bindings.

Built for streaming large video and image datasets, but handles any byte data.

Install

pip install webshart

What is this?

Webshart is a fast reader for webdataset tar files with separate JSON index files. This format enables random access to any file in the dataset without downloading the entire archive.

The indexed format provides massive performance benefits:

  • Random access: Jump to any file instantly
  • Selective downloads: Only fetch the files you need
  • True parallelism: Read from multiple shards simultaneously
  • Cloud-optimized: Works efficiently with HTTP range requests
  • Aspect bucketing: Optionally include image geometry hints (width, height, and aspect ratio) so samples can be bucketed by shape
  • Custom DataLoader: Includes state dict methods on the DataLoader so that you can resume training deterministically
  • Rate-limit friendly: Local caching allows high-frequency random seeking without encountering storage provider rate limits
  • Instant start-up with pre-sorted aspect buckets

Growing ecosystem: While not all datasets use this format yet, you can easily create indices for any tar-based dataset (see below).
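To make the "resume training deterministically" point concrete, here is a minimal sketch of the state-dict idea. ResumableSampler is a hypothetical stand-in written for illustration, not webshart's DataLoader, and its method bodies and state keys are assumptions. The point is only that a seed plus a cursor is enough to replay the same sample order after a restart.

```python
import random


class ResumableSampler:
    """Hypothetical illustration (not webshart's API): seeded shuffle plus a cursor."""

    def __init__(self, num_samples: int, seed: int = 0):
        self.num_samples = num_samples
        self.seed = seed
        self.position = 0
        self._order = self._shuffled(num_samples, seed)

    @staticmethod
    def _shuffled(num_samples: int, seed: int) -> list:
        # A private Random instance keeps the order independent of global RNG state.
        order = list(range(num_samples))
        random.Random(seed).shuffle(order)
        return order

    def __iter__(self):
        return self

    def __next__(self) -> int:
        if self.position >= self.num_samples:
            raise StopIteration
        index = self._order[self.position]
        self.position += 1
        return index

    def state_dict(self) -> dict:
        # Everything needed to rebuild the order and the cursor.
        return {"num_samples": self.num_samples, "seed": self.seed, "position": self.position}

    def load_state_dict(self, state: dict) -> None:
        self.num_samples = state["num_samples"]
        self.seed = state["seed"]
        self.position = state["position"]
        self._order = self._shuffled(self.num_samples, self.seed)


# Consume a few samples, checkpoint, then resume in a fresh object.
first = ResumableSampler(10, seed=42)
seen = [next(first) for _ in range(4)]
checkpoint = first.state_dict()

resumed = ResumableSampler(10, seed=0)
resumed.load_state_dict(checkpoint)
remaining_after_resume = list(resumed)
remaining_uninterrupted = list(first)
print(remaining_after_resume == remaining_uninterrupted)
```

The resumed sampler yields exactly the samples the original run had not yet seen, in the same order, which is the property a checkpointed training loop relies on.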

Quick Start

import webshart

# Find your dataset
dataset = webshart.discover_dataset(
    source="laion/conceptual-captions-12m-webdataset",
    # we're able to upload metadata separately so that we reduce load on huggingface infra.
    metadata="webshart/conceptual-captions-12m-webdataset-metadata",
)
print(f"Found {dataset.num_shards} shards")

Common Patterns

For real-world, working examples:

Creating Indices for / Converting Existing Datasets

Any tar-based webdataset can benefit from indexing! Webshart includes tools to generate indices:

A command-line tool that auto-discovers tars to process:

% webshart extract-metadata \
    --source laion/conceptual-captions-12m-webdataset \
    --destination laion_output/ \
    --checkpoint-dir ./laion_output/checkpoints \
    --max-workers 2 \
    --include-image-geometry

Or, if you need to integrate directly with an existing Python application, use the Python API.

Uploading Indices to HuggingFace

Once you've generated indices, share them with the community:

# Upload all JSON files to your dataset
huggingface-cli upload --repo-type=dataset \
    username/dataset-name \
    ./indices/ \
    --include "*.json" \
    --path-in-repo "indices/"

Or if you want to contribute to an existing dataset you don't own:

  1. Create a community dataset with indices: username/original-dataset-indices
  2. Upload the JSON files there
  3. Open a discussion on the original dataset suggesting they add the indices

Creating New Indexed Datasets

If you're creating a new dataset, generate indices during creation. Each index is a JSON file of the following form:

{
  "files": {
    "image_0001.webp": {"offset": 512, "length": 102400},
    "image_0002.webp": {"offset": 102912, "length": 98304},
    ...
  }
}

The JSON index should share the tar file's base name (e.g., shard_0000.tar → shard_0000.json).
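As a rough illustration of how such an index can be derived, the sketch below uses only Python's standard tarfile module. It is not webshart's own indexer: build_index and make_demo_tar are hypothetical helper names, the "offset" and "length" keys follow the JSON shown above, and the sketch assumes that "offset" means the position of the file's first payload byte in an uncompressed tar.

```python
import io
import json
import tarfile


def build_index(tar_bytes: bytes) -> dict:
    """Map each regular file in an uncompressed tar to its payload offset and length."""
    files = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for member in tar:
            if member.isfile():
                # offset_data is the first payload byte, just past the member's header.
                files[member.name] = {"offset": member.offset_data, "length": member.size}
    return {"files": files}


def make_demo_tar(samples: dict) -> bytes:
    """Write an in-memory, uncompressed tar from a name -> bytes mapping (demo data only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in samples.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


samples = {"image_0001.webp": b"a" * 1000, "image_0002.webp": b"b" * 300}
tar_bytes = make_demo_tar(samples)
index = build_index(tar_bytes)
print(json.dumps(index, indent=2))

# Slicing the raw archive with an index entry returns the original payload.
entry = index["files"]["image_0002.webp"]
recovered = tar_bytes[entry["offset"]:entry["offset"] + entry["length"]]
print(recovered == samples["image_0002.webp"])
```

Offsets like these are only meaningful for uncompressed tar files, since compression removes the fixed mapping between byte positions and members.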

Why is it fast?

Problem: Standard tar files require sequential reading. To get file #10,000, you must read through files #1-9,999 first.

Solution: The indexed format stores byte offsets and sample metadata in a separate JSON file, enabling:

  • HTTP range requests for any file
  • True random access over network
  • Parallel reads from multiple shards
  • Large scale, aspect-bucketed datasets
  • No wasted bandwidth
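The following sketch shows how one index entry could translate into a random-access read, either as an HTTP Range header or as a seek on a local file. The helper names range_header and read_member are hypothetical and are not part of webshart; the entry layout ("offset", "length") is taken from the JSON index format described above.

```python
import io


def range_header(entry: dict) -> dict:
    """Build an HTTP Range header for one index entry (byte ranges are inclusive)."""
    if entry["length"] <= 0:
        raise ValueError("a byte range needs a positive length")
    start = entry["offset"]
    end = start + entry["length"] - 1
    return {"Range": f"bytes={start}-{end}"}


def read_member(shard, entry: dict) -> bytes:
    """Read one file from an open, seekable shard without scanning earlier members."""
    shard.seek(entry["offset"])
    return shard.read(entry["length"])


# The first entry from the example index above.
headers = range_header({"offset": 512, "length": 102400})
print(headers)  # {'Range': 'bytes=512-102911'}

# A fake shard: 512 filler bytes, a 5-byte payload, then padding.
fake_shard = b"\0" * 512 + b"hello" + b"\0" * 507
payload = read_member(io.BytesIO(fake_shard), {"offset": 512, "length": 5})
print(payload)  # b'hello'
```

Because each read depends only on its own offset and length, requests for different files or shards can be issued independently, which is what makes parallel fetching straightforward.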

The Rust implementation provides:

  • Real parallelism (no Python GIL)
  • Zero-copy operations where possible
  • Efficient HTTP connection pooling
  • Optimized tokio async runtime
  • Optional local caching for metadata and shards
  • Fast aspect bucketing for image data
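For readers unfamiliar with aspect bucketing, here is a minimal pure-Python sketch of the idea. It is illustrative only and says nothing about how the Rust implementation works: bucket_by_aspect is a hypothetical helper, and the per-file "width" and "height" keys are an assumed layout based on the geometry hints mentioned earlier.

```python
from collections import defaultdict


def bucket_by_aspect(samples: dict, precision: int = 1) -> dict:
    """Group sample names by width/height ratio, rounded to `precision` decimals."""
    buckets = defaultdict(list)
    for name, meta in samples.items():
        aspect = round(meta["width"] / meta["height"], precision)
        buckets[aspect].append(name)
    return dict(buckets)


# Hypothetical geometry metadata for four images.
samples = {
    "a.webp": {"width": 1024, "height": 1024},
    "b.webp": {"width": 1280, "height": 720},
    "c.webp": {"width": 512, "height": 512},
    "d.webp": {"width": 1920, "height": 1080},
}

buckets = bucket_by_aspect(samples)
for aspect, names in sorted(buckets.items()):
    print(aspect, names)
```

With geometry stored in the index, this grouping needs no image downloads or decoding, which is why pre-computed hints allow bucketed training to start quickly.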

Datasets Using This Format

I discovered after creating this library that cheesechaser is the origin of the indexed tar format, which webshart has formalised and extended to include aspect bucketing support.

  • NebulaeWis/e621-2024-webp-4Mpixel
  • picollect/danbooru2 (subfolder: images)
  • Many picollect image datasets
  • Your dataset could be next! See "Creating Indices" above

Requirements

  • Python 3.8+
  • Linux/macOS/Windows

Roadmap

  • image decoding is not currently handled by this library; zero-copy decoding is planned.
  • more informative API for caching and other Rust implementation details
  • multi-gpu/multi-node friendly dataloader

License

MIT
