
Fast and memory-efficient webdataset shard reader

Fast dataloader and conversion utility for webdataset tar shards. Rust core with Python bindings.

Built for streaming large video and image datasets, but handles any byte data.

Install

pip install webshart

What is this?

Webshart is a fast reader for webdataset tar files with separate JSON index files. This format enables random access to any file in the dataset without downloading the entire archive.

The indexed format provides several practical benefits:

  • Random access: Jump to any file instantly
  • Selective downloads: Only fetch the files you need
  • True parallelism: Read from multiple shards simultaneously
  • Cloud-optimized: Works efficiently with HTTP range requests
  • Aspect bucketing: Optionally include image geometry hints (width, height, and aspect) so that images can be bucketed by shape
  • Logical sample APIs: Treat image.ext + image.json pairs as one sample while still allowing raw file access
  • Caption metadata: Store captions in shard metadata under the plural captions key as either a string or a list of strings
  • Custom DataLoader: Includes state dict methods on the DataLoader so that you can resume training deterministically
  • Rate-limit friendly: Local caching allows high-frequency random seeking without encountering storage provider rate limits
  • Instant start-up with pre-sorted aspect buckets

Growing ecosystem: While not all datasets use this format yet, you can easily create indices for any tar-based dataset (see below).

Quick Start

import webshart

# Find your dataset
dataset = webshart.discover_dataset(
    source="laion/conceptual-captions-12m-webdataset",
    # Metadata can live in a separate repo, which reduces load on Hugging Face infrastructure.
    metadata="webshart/conceptual-captions-12m-webdataset-metadata",
)
print(f"Found {dataset.num_shards} shards")

loader = webshart.TarDataLoader(dataset)

# File-oriented access is still available.
files = dataset.list_files_in_shard(0)

# Sample-oriented access skips paired JSON sidecars.
samples = dataset.list_samples_in_shard(0)
entry = loader.load_sample(0, 0)
print(entry.path, entry.captions, entry.json_metadata)

Common Patterns

For real-world, working examples:

Creating Indices for / Converting Existing Datasets

Any tar-based webdataset can benefit from indexing! Webshart includes tools to generate indices:

A command-line tool that auto-discovers tars to process:

% webshart extract-metadata \
    --source laion/conceptual-captions-12m-webdataset \
    --destination laion_output/ \
    --checkpoint-dir ./laion_output/checkpoints \
    --max-workers 2 \
    --include-image-geometry

Or, if you prefer or require direct integration with an existing Python application, use the Python API.

Uploading Indices to HuggingFace

Once you've generated indices, share them with the community:

# Upload all JSON files to your dataset
# Positional arguments: repo ID, local path, then destination path inside the repo
huggingface-cli upload --repo-type=dataset \
    username/dataset-name \
    ./indices/ \
    indices/ \
    --include "*.json"

Or if you want to contribute to an existing dataset you don't own:

  1. Create a community dataset with indices: username/original-dataset-indices
  2. Upload the JSON files there
  3. Open a discussion on the original dataset suggesting they add the indices

Creating New Indexed Datasets

If you're creating a new dataset, generate indices during creation:

{
  "files": {
    "image_0001.webp": {"offset": 512, "length": 102400},
    "image_0002.webp": {"offset": 103424, "length": 98304},
    ...
  }
}

The JSON index should have the same base name as the tar file (e.g., shard_0000.tar → shard_0000.json).
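As a rough illustration of the index layout above, the following sketch builds the "files" mapping for an existing uncompressed tar with Python's standard tarfile module. It is not webshart's own implementation: build_index and write_index are hypothetical helper names, and it assumes that "offset" means the byte position where a member's data begins (after its 512-byte tar header), which is what the sample values suggest.

```python
import json
import tarfile
from pathlib import Path


def build_index(tar_path):
    """Map each regular file in an uncompressed tar to its data offset and length."""
    files = {}
    # "r:" opens the archive without transparent decompression, so the
    # offsets refer to byte positions in the file as stored on disk.
    with tarfile.open(tar_path, "r:") as tar:
        for member in tar:
            if member.isfile():
                files[member.name] = {
                    "offset": member.offset_data,  # start of the payload bytes
                    "length": member.size,         # payload size in bytes
                }
    return {"files": files}


def write_index(tar_path):
    """Write the index next to the archive: shard_0000.tar -> shard_0000.json."""
    index_path = Path(tar_path).with_suffix(".json")
    index_path.write_text(json.dumps(build_index(tar_path), indent=2))
    return index_path
```

For real datasets, the extract-metadata command shown earlier is the supported route; this sketch only shows where the numbers in the JSON come from.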

Image + JSON Sidecar Samples

Webshart supports webdataset shards that store each sample as an image-like payload plus a paired JSON sidecar:

sample_0001.webp
sample_0001.json
sample_0002.webp
sample_0002.json

When metadata is extracted or loaded, sidecars are attached to their paired sample entries:

{
  "files": {
    "sample_0001.webp": {
      "offset": 512,
      "length": 102400,
      "width": 1024,
      "height": 1024,
      "aspect": 1.0,
      "json_path": "sample_0001.json",
      "json_offset": 103424,
      "json_length": 128,
      "captions": "a product photo on a white background",
      "json_metadata": {
        "caption": "a product photo on a white background"
      }
    },
    "sample_0001.json": {
      "offset": 103424,
      "length": 128
    }
  }
}
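The pairing rule can be pictured with a small sketch. This is an illustration only, not webshart's code: pair_sidecars is a hypothetical helper, and treating a .json file with no matching payload as a standalone sample is an assumption.

```python
from pathlib import PurePosixPath


def _stem(name):
    """Drop the final extension: 'dir/sample_0001.webp' -> 'dir/sample_0001'."""
    return str(PurePosixPath(name).with_suffix(""))


def pair_sidecars(names):
    """Group archive member names into (payload, sidecar_or_None) tuples.

    A .json member counts as a sidecar only when a non-JSON member shares
    its stem; otherwise it is kept as a sample of its own (an assumption).
    """
    available = set(names)
    payload_stems = {_stem(n) for n in names if not n.endswith(".json")}
    samples = []
    for name in names:
        if name.endswith(".json"):
            if _stem(name) in payload_stems:
                continue  # paired sidecar: reported alongside its payload
            samples.append((name, None))
        else:
            sidecar = _stem(name) + ".json"
            samples.append((name, sidecar if sidecar in available else None))
    return samples
```

With the four-member listing shown earlier, this yields two logical samples, each carrying its sidecar name.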

Use file-oriented APIs when you want every archive member, including sidecars:

dataset.list_files_in_shard(0)

reader = dataset.open_shard(0)
raw_file_bytes = reader.read_file(0)

Use sample-oriented APIs when you want training samples:

dataset.list_samples_in_shard(0)
dataset.get_shard_sample_count(0)

reader = dataset.open_shard(0)
image_bytes = reader.read_sample(0)
json_bytes = reader.read_sample_json(0)

entry = loader.load_sample(0, 0)
print(entry.path)
print(entry.captions)
print(entry.json_data)

Captions are canonicalized to the plural captions metadata key. The value may be a single string, a list of strings, or absent.

webshart.write_captions_to_metadata(
    "shard_0000.json",
    {
        "sample_0001.webp": "a short caption",
        "sample_0002": ["caption one", "caption two"],
    },
)

The writer updates existing webshart metadata JSON in place, removes old singular caption keys from updated samples, and leaves paired .json sidecar entries untouched.
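The canonicalization step can be sketched as a small pure function. This is a hedged illustration rather than the library's implementation: canonicalize_captions is a hypothetical name, and carrying an old singular value over to the plural key when no plural value exists is an assumption.

```python
def canonicalize_captions(entry, new_captions=None):
    """Return a copy of a metadata entry that uses only the plural 'captions' key.

    new_captions may be a string, a list of strings, or None (keep what exists).
    """
    result = dict(entry)                    # leave the caller's dict untouched
    singular = result.pop("caption", None)  # drop the old singular key
    if new_captions is not None:
        result["captions"] = new_captions
    elif "captions" not in result and singular is not None:
        result["captions"] = singular       # assumption: migrate the old value
    return result
```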

Aspect Bucketing Samples

list_shard_aspect_buckets() is file-oriented and buckets any indexed file that has width and height.

For training pipelines, prefer list_shard_sample_aspect_buckets():

loader = webshart.TarDataLoader(dataset)
buckets = loader.list_shard_sample_aspect_buckets(
    [0],
    key="geometry-tuple",
    target_pixel_area=1024**2,
)[0]["buckets"]

for bucket_key, entries in buckets.items():
    for item in entries:
        virtual_id = f"webshart://0/{item['sample_idx']}/{item['filename']}"
        image = loader.load_sample(0, item["sample_idx"])

This uses logical samples from metadata.sample_range() / get_sample_by_index() and excludes paired JSON sidecars before bucketing. Each bucket entry includes sample_idx, so callers can build stable IDs and load images directly with loader.load_sample(shard_idx, sample_idx).
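To make the idea of a geometry-tuple bucket concrete, here is a minimal sketch of one common way to derive such a key: rescale each image to the target pixel area while preserving its aspect ratio, then snap both sides to a fixed step. The function names, the step of 64, and the rounding rule are all assumptions for illustration; webshart's actual bucketing rules may differ.

```python
import math
from collections import defaultdict


def geometry_bucket(width, height, target_pixel_area=1024**2, step=64):
    """Return a (width, height) bucket key near the target area (hypothetical rule)."""
    aspect = width / height
    scaled_w = math.sqrt(target_pixel_area * aspect)
    scaled_h = math.sqrt(target_pixel_area / aspect)

    def snap(value):
        # Round to the nearest multiple of step, never below one step.
        return max(step, round(value / step) * step)

    return snap(scaled_w), snap(scaled_h)


def bucket_samples(samples, **kwargs):
    """Group (filename, width, height) tuples by bucket, keeping sample_idx."""
    buckets = defaultdict(list)
    for sample_idx, (filename, width, height) in enumerate(samples):
        key = geometry_bucket(width, height, **kwargs)
        buckets[key].append({"sample_idx": sample_idx, "filename": filename})
    return dict(buckets)
```

Under this rule a 512x512 image and a 1024x1024 image share the (1024, 1024) bucket, since only shape matters once both are rescaled to the same area.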

Why is it fast?

Problem: Standard tar files have no central index, so locating a member means walking the archive sequentially. To get file #10,000, you must scan past files #1-9,999 first.

Solution: The indexed format stores byte offsets and sample metadata in a separate JSON file, enabling:

  • HTTP range requests for any file
  • True random access over network
  • Parallel reads from multiple shards
  • Large scale, aspect-bucketed datasets
  • No wasted bandwidth
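The following sketch shows how an offset and length from the index translate into an HTTP range request, using only the Python standard library. The helper names are hypothetical and this is not how webshart performs its reads internally (those happen in the Rust core); it assumes the server honours Range headers and that the offset points at the member's data.

```python
import urllib.request


def range_header(offset, length):
    """Build an HTTP Range header value; the end position is inclusive."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return f"bytes={offset}-{offset + length - 1}"


def fetch_member(url, offset, length):
    """Download one archive member without fetching the rest of the tar."""
    request = urllib.request.Request(
        url, headers={"Range": range_header(offset, length)}
    )
    with urllib.request.urlopen(request) as response:
        # 206 Partial Content means the server honoured the range;
        # 200 would mean it ignored it and is sending the whole file.
        if response.status != 206:
            raise RuntimeError(f"range request not honoured (HTTP {response.status})")
        return response.read()
```

For the first entry in the index example above, range_header(512, 102400) produces "bytes=512-102911".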

The Rust implementation provides:

  • Real parallelism (no Python GIL)
  • Zero-copy operations where possible
  • Efficient HTTP connection pooling
  • Optimized tokio async runtime
  • Optional local caching for metadata and shards
  • Fast aspect bucketing for image data

Datasets Using This Format

I discovered after creating this library that cheesechaser is the origin of the indexed tar format, which webshart has formalised and extended to include aspect bucketing support.

  • NebulaeWis/e621-2024-webp-4Mpixel
  • picollect/danbooru2 (subfolder: images)
  • Many picollect image datasets
  • Your dataset could be next! See "Creating Indices" above

Requirements

  • Python 3.8+
  • Linux/macOS/Windows

Roadmap

  • Image decoding is not yet handled by this library; zero-copy decoding is planned.
  • A more informative API for caching and other Rust implementation details
  • A multi-GPU/multi-node friendly dataloader

Projects using webshart

  • CaptionFlow uses this library to solve the memory use and seek performance issues typical of webdatasets

License

MIT
