Fast and memory-efficient webdataset shard reader

Fast dataloader and conversion utility for webdataset tar shards. Rust core with Python bindings.

Built for streaming large video and image datasets, but handles any byte data.

Install

pip install webshart

What is this?

Webshart is a fast reader for webdataset tar files with separate JSON index files. This format enables random access to any file in the dataset without downloading the entire archive.

The indexed format provides several performance benefits:

  • Random access: Jump to any file instantly
  • Selective downloads: Only fetch the files you need
  • True parallelism: Read from multiple shards simultaneously
  • Cloud-optimized: Works efficiently with HTTP range requests
  • Aspect bucketing: Optionally include image geometry hints (width, height, and aspect) so that samples can be bucketed by shape
  • Custom DataLoader: Includes state dict methods on the DataLoader so that you can resume training deterministically
  • Rate-limit friendly: Local caching allows high-frequency random seeking without encountering storage provider rate limits
  • Instant start-up with pre-sorted aspect buckets
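To make the aspect bucketing idea concrete, here is a minimal sketch in plain Python. It is not webshart's implementation: the `bucket_by_aspect` helper, the tuple layout, and the rounding precision are all assumptions chosen for illustration. It only shows how width and height hints stored alongside each sample allow grouping by shape without decoding any image.

```python
from collections import defaultdict


def bucket_by_aspect(samples, precision=2):
    """Group sample names by width/height ratio, rounded to `precision` decimals.

    `samples` is an iterable of (name, width, height) tuples. This layout is
    hypothetical and stands in for geometry hints stored in an index.
    """
    buckets = defaultdict(list)
    for name, width, height in samples:
        key = round(width / height, precision)
        buckets[key].append(name)
    return dict(buckets)


# Made-up geometry hints for five samples.
samples = [
    ("a.webp", 1024, 1024),
    ("b.webp", 1280, 720),
    ("c.webp", 512, 512),
    ("d.webp", 1920, 1080),
    ("e.webp", 768, 1024),
]

buckets = bucket_by_aspect(samples)
for aspect, names in sorted(buckets.items()):
    print(aspect, names)
```

Because the grouping needs only the stored hints, buckets can be computed before any image bytes are fetched.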

Growing ecosystem: While not all datasets use this format yet, you can easily create indices for any tar-based dataset (see below).

Quick Start

import webshart

# Find your dataset
dataset = webshart.discover_dataset(
    source="laion/conceptual-captions-12m-webdataset",
    # Metadata can be hosted in a separate repository, which reduces load on Hugging Face infrastructure.
    metadata="webshart/conceptual-captions-12m-webdataset-metadata",
)
print(f"Found {dataset.num_shards} shards")

Common Patterns

For real-world, working examples:

Creating Indices for / Converting Existing Datasets

Any tar-based webdataset can benefit from indexing! Webshart includes tools to generate indices:

A command-line tool that auto-discovers tars to process:

% webshart extract-metadata \
    --source laion/conceptual-captions-12m-webdataset \
    --destination laion_output/ \
    --checkpoint-dir ./laion_output/checkpoints \
    --max-workers 2 \
    --include-image-geometry

Or, if you need to integrate directly with an existing Python application, use the API.

Uploading Indices to HuggingFace

Once you've generated indices, share them with the community:

# Upload all JSON files to your dataset
huggingface-cli upload --repo-type=dataset \
    username/dataset-name \
    ./indices/ \
    --include "*.json" \
    --path-in-repo "indices/"

Or if you want to contribute to an existing dataset you don't own:

  1. Create a community dataset with indices: username/original-dataset-indices
  2. Upload the JSON files there
  3. Open a discussion on the original dataset suggesting they add the indices

Creating New Indexed Datasets

If you're creating a new dataset, generate indices during creation:

{
  "files": {
    "image_0001.webp": {"offset": 512, "length": 102400},
    "image_0002.webp": {"offset": 102912, "length": 98304},
    ...
  }
}

The JSON index should have the same name as the tar file (e.g., shard_0000.tar → shard_0000.json).
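As an illustration of how such an index could be produced, the sketch below uses only Python's standard `tarfile` module rather than webshart's own tooling. The `build_index` and `make_demo_tar` helpers are hypothetical, and the sketch assumes that `offset` refers to the byte position where a member's data begins (`TarInfo.offset_data`), not the position of its tar header.

```python
import io
import json
import tarfile


def build_index(tar_bytes: bytes) -> dict:
    """Map each regular file in a tar archive to its data offset and length."""
    files = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for member in tar:
            if member.isfile():
                files[member.name] = {
                    # Assumption: offset of the payload, not of the 512-byte header.
                    "offset": member.offset_data,
                    "length": member.size,
                }
    return {"files": files}


def make_demo_tar() -> bytes:
    """Build a tiny in-memory tar with two members, for illustration only."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, payload in [("image_0001.webp", b"a" * 700),
                              ("image_0002.webp", b"b" * 300)]:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


index = build_index(make_demo_tar())
print(json.dumps(index, indent=2))
```

For a real shard, the resulting dictionary would be written next to the tar under the matching name, for example with `json.dump`.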

Why is it fast?

Problem: Standard tar files require sequential reading. To get file #10,000, you must read through files #1-9,999 first.

Solution: The indexed format stores byte offsets and sample metadata in a separate JSON file, enabling:

  • HTTP range requests for any file
  • True random access over network
  • Parallel reads from multiple shards
  • Large scale, aspect-bucketed datasets
  • No wasted bandwidth
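The following sketch shows why a stored offset and length are enough for random access. It is an illustration using the standard library only, not webshart's code path: `range_header` and `read_member` are hypothetical helpers, and the local seek-and-read stands in for what an HTTP range request does over the network.

```python
import io
import tarfile


def range_header(offset: int, length: int) -> dict:
    """HTTP Range header covering exactly `length` bytes starting at `offset`."""
    # Byte ranges are inclusive at both ends, hence the -1.
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def read_member(archive, offset: int, length: int) -> bytes:
    """Local equivalent of a range request: seek, then read only the payload."""
    archive.seek(offset)
    return archive.read(length)


# Build a small tar in memory so the example is self-contained.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
    for name, payload in [("a.txt", b"first file"), ("b.txt", b"second file")]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
tar_bytes = buf.getvalue()

# Stand-in for an index lookup: take the offset and length of b.txt once.
with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
    entry = tar.getmember("b.txt")
    offset, length = entry.offset_data, entry.size

# Fetch b.txt directly, without scanning past a.txt.
data = read_member(io.BytesIO(tar_bytes), offset, length)
print(data, range_header(offset, length))
```

With a remote shard, the same header would be sent with a GET request so that only the requested bytes are transferred.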

The Rust implementation provides:

  • Real parallelism (no Python GIL)
  • Zero-copy operations where possible
  • Efficient HTTP connection pooling
  • Optimized tokio async runtime
  • Optional local caching for metadata and shards
  • Fast aspect bucketing for image data

Datasets Using This Format

I discovered after creating this library that cheesechaser is the origin of the indexed tar format, which webshart has formalised and extended to include aspect bucketing support.

  • NebulaeWis/e621-2024-webp-4Mpixel
  • picollect/danbooru2 (subfolder: images)
  • Many picollect image datasets
  • Your dataset could be next! See "Creating Indices" above

Requirements

  • Python 3.8+
  • Linux/macOS/Windows

Roadmap

  • Image decoding is not yet handled by this library; zero-copy decoding is planned.
  • more informative API for caching and other Rust implementation details
  • multi-gpu/multi-node friendly dataloader

Projects using webshart

  • CaptionFlow uses this library to solve memory use and seek performance issues typical to webdatasets

License

MIT
