Fast and memory-efficient webdataset shard reader

webshart

Fast parallel reader for webdataset tar shards. Rust core with Python bindings. Built for streaming large video and image datasets, but handles any byte data.

Install

pip install webshart

What is this?

Webshart is a fast reader for a specific webdataset format: tar files with separate JSON index files. This format enables random access to any file in the dataset without downloading the entire archive.

The format is rare but used by some large image datasets:

  • NebulaeWis/e621-2024-webp-4Mpixel
  • picollect/danbooru2 (subfolder: images)
  • Other picollect datasets

Webshart is not a replacement for Hugging Face datasets or the webdataset library; it is a purpose-built tool for this indexed format.

Performance: 10-20x faster for random access, 5-10x faster for batch reads compared to standard tar extraction.

Quick Start

import webshart

# Find your dataset
dataset = webshart.discover_dataset("NebulaeWis/e621-2024-webp-4Mpixel", subfolder="original")
print(f"Found {dataset.num_shards} shards")

# Read a single file
shard = dataset.open_shard(0)
data = shard.read_file(42)  # -> bytes

# Read many files at once (fast)
byte_list = webshart.read_files_batch(dataset, [
    (0, 0),   # shard 0, file 0
    (0, 1),   # shard 0, file 1  
    (1, 0),   # shard 1, file 0
    (10, 5),  # shard 10, file 5
])

# Save the files
for i, data in enumerate(byte_list):
    if data:  # skip failed reads
        with open(f"image_{i}.webp", "wb") as f:
            f.write(data)
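
The `if data:` check above drops failed reads without recording them. A minimal sketch for keeping track of which requests failed, assuming (as the comment above suggests) that a failed read comes back as a falsy value such as `None` or empty bytes, and that results arrive in request order. `split_results` is a hypothetical helper, not part of webshart:

```python
from typing import List, Optional, Tuple

Request = Tuple[int, int]  # (shard index, file index)


def split_results(
    requests: List[Request],
    results: List[Optional[bytes]],
) -> Tuple[List[Tuple[Request, bytes]], List[Request]]:
    """Pair each request with its result and separate successes from failures.

    Assumes results are in the same order as requests and that a failed
    read is falsy (None or b"").
    """
    succeeded, failed = [], []
    for request, data in zip(requests, results):
        if data:
            succeeded.append((request, data))
        else:
            failed.append(request)
    return succeeded, failed


# Stand-in data; in practice `results` would come from read_files_batch.
requests = [(0, 0), (0, 1), (1, 0)]
results = [b"abc", None, b"xyz"]
succeeded, failed = split_results(requests, results)
```

The `failed` list can then be passed to another batch read as a retry, or logged.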

Common Patterns

Stream a subset efficiently:

# Read files 0-99 from each of the first 10 shards
requests = []
for shard_idx in range(10):
    for file_idx in range(100):
        requests.append((shard_idx, file_idx))

# Batch read in chunks of 500 files
for i in range(0, len(requests), 500):
    byte_list = webshart.read_files_batch(dataset, requests[i:i+500])
    for j, data in enumerate(byte_list):
        if data:  # process successful reads
            # Save with meaningful names
            shard, file = requests[i+j]
            with open(f"shard_{shard:04d}_file_{file:04d}.webp", "wb") as f:
                f.write(data)
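
For larger subsets, the request list does not need to be built in memory up front. A sketch of the same request-building and chunking logic using only the standard library; `chunked` is a hypothetical helper, not a webshart function:

```python
from itertools import islice, product
from typing import Iterable, Iterator, List, Tuple

Request = Tuple[int, int]  # (shard index, file index)


def chunked(items: Iterable[Request], size: int) -> Iterator[List[Request]]:
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Files 0-99 from each of the first 10 shards, generated lazily.
requests = product(range(10), range(100))
chunks = list(chunked(requests, 500))
```

Each chunk is a plain list of `(shard, file)` tuples, so it can be passed to a batch read in the same way as the slices above.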

Quick dataset stats:

# Without downloading anything
size, num_files = dataset.quick_stats()
print(f"Dataset size: {size / 1e9:.1f} GB")
print(f"Number of files: {num_files}")

Batch Operations

# Discover multiple datasets in parallel
datasets = webshart.discover_datasets_batch([
    "NebulaeWis/e621-2024-webp-4Mpixel",
    "picollect/danbooru2",
    "/local/path/to/dataset"
], subfolders=["original", "images", None])

# Process large dataset in chunks
processor = webshart.BatchProcessor()
results = processor.process_dataset(
    "NebulaeWis/e621-2024-webp-4Mpixel",
    batch_size=100,
    callback=lambda data: len(data)  # process each file
)

Advanced

Local dataset:

dataset = webshart.discover_dataset("/path/to/shards/")

Custom auth:

# Pass token directly
dataset = webshart.discover_dataset("private/dataset", hf_token="hf_...")

# Or use your existing HF token from huggingface_hub
from huggingface_hub import get_token
token = get_token()
dataset = webshart.discover_dataset("private/dataset", hf_token=token)

Async interface (if you're already in async code):

dataset = await webshart.discover_dataset_async("NebulaeWis/e621-2024-webp-4Mpixel")

Why is it fast?

Problem: A standard tar file has no central index, so finding a file means walking the archive sequentially. To get file #10,000, a reader must first step past files #1-9,999.

Solution: The indexed format stores byte offsets in a separate JSON file, enabling:

  • HTTP range requests for any file
  • True random access over network
  • Parallel reads from multiple shards
  • No wasted bandwidth
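
To illustrate the idea (this is not webshart's internal code), an index entry's offset and length translate directly into an HTTP `Range` header, or into a seek-and-read on a local file. The URL and helper names below are hypothetical:

```python
import io
import urllib.request


def range_header(offset: int, length: int) -> str:
    """Build an HTTP Range header value covering `length` bytes from `offset`.

    HTTP byte ranges are inclusive on both ends, hence the -1.
    """
    return f"bytes={offset}-{offset + length - 1}"


def build_range_request(url: str, offset: int, length: int) -> urllib.request.Request:
    """Prepare (but do not send) a GET request for one file inside a shard."""
    return urllib.request.Request(url, headers={"Range": range_header(offset, length)})


def read_at(fileobj, offset: int, length: int) -> bytes:
    """Local equivalent: seek straight to the data and read it."""
    fileobj.seek(offset)
    return fileobj.read(length)


# Hypothetical URL; the offset and length would come from the JSON index.
request = build_range_request("https://example.com/shard-0000.tar", 512, 102400)

# Local demonstration on an in-memory buffer.
buffer = io.BytesIO(b"x" * 512 + b"payload" + b"y" * 100)
payload = read_at(buffer, 512, 7)
```

Either way, only the requested bytes are transferred or read, regardless of where the file sits in the archive.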

The Rust implementation provides:

  • Real parallelism (no Python GIL)
  • Zero-copy operations where possible
  • Efficient HTTP connection pooling
  • Optimized tokio async runtime

Creating indexed datasets

If you're making a new webdataset, consider using the indexed format:

{
  "files": {
    "image_0001.webp": {"offset": 512, "length": 102400},
    "image_0002.webp": {"offset": 103424, "length": 98304},
    ...
  }
}

This enables random access over HTTP, making cloud-stored datasets as fast as local ones for many use cases.
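
A rough sketch of how such an index could be generated with Python's standard `tarfile` module. It assumes that `offset` means the start of the file's data (after the 512-byte tar header), which the first entry in the example above suggests; the exact fields and index file naming that webshart expects are not specified here, so treat `build_index` as a hypothetical helper:

```python
import io
import json
import tarfile


def build_index(fileobj) -> dict:
    """Map each regular file in an uncompressed tar to its data offset and length."""
    files = {}
    with tarfile.open(fileobj=fileobj, mode="r:") as archive:
        for member in archive:
            if member.isfile():
                files[member.name] = {
                    "offset": member.offset_data,  # where the file's bytes start
                    "length": member.size,
                }
    return {"files": files}


# Build a small uncompressed tar in memory to demonstrate.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w", format=tarfile.USTAR_FORMAT) as archive:
    for name, payload in [("image_0001.webp", b"a" * 1000), ("image_0002.webp", b"b" * 300)]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))

buffer.seek(0)
index = build_index(buffer)
index_json = json.dumps(index, indent=2)
```

The tar must be uncompressed for these offsets to be usable with range requests; offsets into a gzip-compressed archive do not map to byte positions in the stored file.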

Requirements

  • Python 3.8+
  • Linux/macOS/Windows

License

MIT
