Fast and memory-efficient webdataset shard reader

Project description

Fast parallel reader for webdataset tar shards. Rust core with Python bindings. Built for streaming large video and image datasets, but handles any byte data.

Install

pip install webshart

What is this?

Webshart is a fast reader for webdataset tar files with separate JSON index files. This format enables random access to any file in the dataset without downloading the entire archive.

The indexed format provides massive performance benefits:

  • Random access: Jump to any file instantly
  • Selective downloads: Only fetch the files you need
  • True parallelism: Read from multiple shards simultaneously
  • Cloud-optimized: Works efficiently with HTTP range requests

Performance: 10-20x faster for random access, 5-10x faster for batch reads compared to standard tar extraction.

Growing ecosystem: While not all datasets use this format yet, you can easily create indices for any tar-based dataset (see below).
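The core idea is simple enough to sketch in plain Python (this is an illustration, not the webshart API): once you know a file's byte offset and length from the JSON index, reading it from a local shard is a single seek plus read.

```python
# Sketch: index-driven random access into a local tar shard.
# `offset` and `length` would come from the shard's JSON index.
def read_entry(tar_path: str, offset: int, length: int) -> bytes:
    with open(tar_path, "rb") as f:
        f.seek(offset)         # jump straight to the file's payload
        return f.read(length)  # read exactly that file's bytes
```

No scanning through earlier archive members is needed, which is where the random-access speedup comes from.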

Quick Start

import webshart

# Find your dataset
dataset = webshart.discover_dataset("NebulaeWis/e621-2024-webp-4Mpixel", subfolder="original")
print(f"Found {dataset.num_shards} shards")

# Read a single file
shard = dataset.open_shard(0)
data = shard.read_file(42)  # -> bytes

# Read many files at once (fast)
byte_list = webshart.read_files_batch(dataset, [
    (0, 0),   # shard 0, file 0
    (0, 1),   # shard 0, file 1  
    (1, 0),   # shard 1, file 0
    (10, 5),  # shard 10, file 5
])

# Save the files
for i, data in enumerate(byte_list):
    if data:  # skip failed reads
        with open(f"image_{i}.webp", "wb") as f:
            f.write(data)

Common Patterns

Stream a subset efficiently:

# Read files 0-100 from each of the first 10 shards
requests = []
for shard_idx in range(10):
    for file_idx in range(100):
        requests.append((shard_idx, file_idx))

# Batch read in chunks of 500 files
for i in range(0, len(requests), 500):
    byte_list = webshart.read_files_batch(dataset, requests[i:i+500])
    for j, data in enumerate(byte_list):
        if data:  # process successful reads
            # Save with meaningful names
            shard, file = requests[i+j]
            with open(f"shard_{shard:04d}_file_{file:04d}.webp", "wb") as f:
                f.write(data)

Quick dataset stats:

# Without downloading anything
size, num_files = dataset.quick_stats()
print(f"Dataset size: {size / 1e9:.1f} GB")

Creating Indices for Existing Datasets

Any tar-based webdataset can benefit from indexing. Webshart includes tools to generate indices.

The command-line tool auto-discovers tars to process:

% webshart extract-metadata \
    --source laion/conceptual-captions-12m-webdataset \
    --destination laion_output/ \
    --checkpoint-dir ./laion_output/checkpoints \
    --max-workers 2

Or, if you prefer direct integration with an existing Python application, use the API:

from webshart import MetadataExtractor

# Create an extractor (optionally with HF token for private datasets)
extractor = MetadataExtractor(hf_token="hf_...")

# Generate indices for a dataset
extractor.extract_metadata(
    source="username/dataset-name",  # HF dataset or local path
    destination="./indices/",        # Where to save JSON files
    max_workers=4                    # Parallel processing
)

Uploading Indices to HuggingFace

Once you've generated indices, share them with the community:

# Upload all JSON files to your dataset
huggingface-cli upload --repo-type=dataset \
    username/dataset-name \
    ./indices/ \
    --include "*.json" \
    --path-in-repo "indices/"

Or if you want to contribute to an existing dataset you don't own:

  1. Create a community dataset with indices: username/original-dataset-indices
  2. Upload the JSON files there
  3. Open a discussion on the original dataset suggesting they add the indices

Creating New Indexed Datasets

If you're creating a new dataset, generate indices during creation:

{
  "files": {
    "image_0001.webp": {"offset": 512, "length": 102400},
    "image_0002.webp": {"offset": 102912, "length": 98304},
    ...
  }
}

The JSON index should have the same name as the tar file, with the extension swapped (e.g., shard_0000.tar → shard_0000.json).
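For a local tar shard, an index in this shape can be built with nothing but the standard library; this is a minimal sketch (not webshart's own extractor) that uses `tarfile`'s `offset_data` and `size` attributes to record each member's payload range.

```python
# Sketch: generate a webshart-style JSON index for a local tar shard.
import json
import tarfile

def build_index(tar_path: str) -> dict:
    index = {"files": {}}
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if member.isfile():
                index["files"][member.name] = {
                    "offset": member.offset_data,  # start of payload bytes
                    "length": member.size,         # payload length
                }
    return index

def write_index(tar_path: str) -> None:
    # shard_0000.tar -> shard_0000.json
    out = tar_path.rsplit(".tar", 1)[0] + ".json"
    with open(out, "w") as f:
        json.dump(build_index(tar_path), f)
```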

Batch Operations

# Discover multiple datasets in parallel
datasets = webshart.discover_datasets_batch([
    "NebulaeWis/e621-2024-webp-4Mpixel",
    "picollect/danbooru2",
    "/local/path/to/dataset"
], subfolders=["original", "images", None])

# Process large dataset in chunks
processor = webshart.BatchProcessor()
results = processor.process_dataset(
    "NebulaeWis/e621-2024-webp-4Mpixel",
    batch_size=100,
    callback=lambda data: len(data)  # process each file
)

Advanced

Local dataset:

dataset = webshart.discover_dataset("/path/to/shards/")

Custom auth:

# Pass token directly
dataset = webshart.discover_dataset("private/dataset", hf_token="hf_...")

# Or use your existing HF token from huggingface_hub
from huggingface_hub import get_token
token = get_token()
dataset = webshart.discover_dataset("private/dataset", hf_token=token)

Async interface (if you're already in async code):

dataset = await webshart.discover_dataset_async("NebulaeWis/e621-2024-webp-4Mpixel")

Why is it fast?

Problem: Standard tar files require sequential reading. To get file #10,000, you must read through files #1-9,999 first.

Solution: The indexed format stores byte offsets in a separate JSON file, enabling:

  • HTTP range requests for any file
  • True random access over network
  • Parallel reads from multiple shards
  • No wasted bandwidth
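The range-request mechanism can be sketched in plain Python (again, an illustration rather than webshart's internals; the shard URL is a placeholder, and the offset/length would come from the JSON index):

```python
# Sketch: fetch one file from a remote tar shard via an HTTP range request.
import urllib.request

def range_header(offset: int, length: int) -> str:
    # HTTP byte ranges are inclusive on both ends.
    return f"bytes={offset}-{offset + length - 1}"

def fetch_entry(shard_url: str, offset: int, length: int) -> bytes:
    req = urllib.request.Request(
        shard_url, headers={"Range": range_header(offset, length)}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()  # only the requested bytes cross the network
```

Webshart's Rust core does the same thing, but with connection pooling and many such requests in flight at once.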

The Rust implementation provides:

  • Real parallelism (no Python GIL)
  • Zero-copy operations where possible
  • Efficient HTTP connection pooling
  • Optimized tokio async runtime

Datasets Using This Format

  • NebulaeWis/e621-2024-webp-4Mpixel
  • picollect/danbooru2 (subfolder: images)
  • Many picollect image datasets
  • Your dataset could be next! See "Creating Indices" above

Requirements

  • Python 3.8+
  • Linux/macOS/Windows

License

MIT
