Fast and memory-efficient webdataset shard reader
webshart
Fast parallel reader for webdataset tar shards. Rust core with Python bindings. Built for streaming large video and image datasets, but handles any byte data.
Install
pip install webshart
What is this?
Webshart is a fast reader for a specific webdataset format: tar files with separate JSON index files. This format enables random access to any file in the dataset without downloading the entire archive.
The format is rare but used by some large image datasets:
- NebulaeWis/e621-2024-webp-4Mpixel
- picollect/danbooru2 (subfolder: images)
- Other picollect datasets
Not a replacement for HF datasets or the webdataset library - just a purpose-built tool for this indexed format.
Performance: 10-20x faster for random access, 5-10x faster for batch reads compared to standard tar extraction.
Quick Start
import webshart
# Find your dataset
dataset = webshart.discover_dataset("NebulaeWis/e621-2024-webp-4Mpixel", subfolder="original")
print(f"Found {dataset.num_shards} shards")
# Read a single file
shard = dataset.open_shard(0)
data = shard.read_file(42) # -> bytes
# Read many files at once (fast)
byte_list = webshart.read_files_batch(dataset, [
    (0, 0),   # shard 0, file 0
    (0, 1),   # shard 0, file 1
    (1, 0),   # shard 1, file 0
    (10, 5),  # shard 10, file 5
])
# Save the files
for i, data in enumerate(byte_list):
    if data:  # skip failed reads
        with open(f"image_{i}.webp", "wb") as f:
            f.write(data)
Common Patterns
Stream a subset efficiently:
# Read files 0-99 from each of the first 10 shards
requests = []
for shard_idx in range(10):
    for file_idx in range(100):
        requests.append((shard_idx, file_idx))
# Batch read in chunks of 500 files
for i in range(0, len(requests), 500):
    byte_list = webshart.read_files_batch(dataset, requests[i:i+500])
    for j, data in enumerate(byte_list):
        if data:  # process successful reads
            # Save with meaningful names
            shard, file = requests[i+j]
            with open(f"shard_{shard:04d}_file_{file:04d}.webp", "wb") as f:
                f.write(data)
Quick dataset stats:
# Without downloading anything
size, num_files = dataset.quick_stats()
print(f"Dataset size: {size / 1e9:.1f} GB")
Batch Operations
# Discover multiple datasets in parallel
datasets = webshart.discover_datasets_batch([
    "NebulaeWis/e621-2024-webp-4Mpixel",
    "picollect/danbooru2",
    "/local/path/to/dataset"
], subfolders=["original", "images", None])
# Process large dataset in chunks
processor = webshart.BatchProcessor()
results = processor.process_dataset(
    "NebulaeWis/e621-2024-webp-4Mpixel",
    batch_size=100,
    callback=lambda data: len(data)  # process each file
)
Advanced
Local dataset:
dataset = webshart.discover_dataset("/path/to/shards/")
Custom auth:
# Pass token directly
dataset = webshart.discover_dataset("private/dataset", hf_token="hf_...")
# Or use your existing HF token from huggingface_hub
from huggingface_hub import get_token
token = get_token()
dataset = webshart.discover_dataset("private/dataset", hf_token=token)
Async interface (if you're already in async code):
dataset = await webshart.discover_dataset_async("NebulaeWis/e621-2024-webp-4Mpixel")
Why is it fast?
Problem: A standard tar archive has no central index, so it can only be read sequentially. To find file #10,000, a reader has to walk past files #1-9,999 first.
Solution: The indexed format stores byte offsets in a separate JSON file, enabling:
- HTTP range requests for any file
- True random access over network
- Parallel reads from multiple shards
- No wasted bandwidth
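To make the idea concrete, here is a minimal sketch of what an offset/length pair buys you. It does not use webshart or show how webshart is implemented; `range_header` and `read_span` are hypothetical helpers, and the tiny in-memory tar stands in for a real shard. The same two numbers drive both a local seek-and-read and an HTTP `Range` header.

```python
import io
import tarfile


def range_header(offset: int, length: int) -> dict:
    """Build an HTTP Range header for `length` bytes starting at `offset`.

    The end position in a Range header is inclusive, hence the -1.
    """
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def read_span(fileobj, offset: int, length: int) -> bytes:
    """Local equivalent of a range request: seek, then read one member's bytes."""
    fileobj.seek(offset)
    return fileobj.read(length)


# Build a tiny two-member tar in memory to stand in for a shard.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, payload in [("a.bin", b"first"), ("b.bin", b"second member")]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# Locate the second member once. An index file would record these two numbers,
# so later readers never need to scan the archive.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    member = tar.getmember("b.bin")
    offset, length = member.offset_data, member.size

second = read_span(buf, offset, length)  # jumps straight past a.bin
headers = range_header(offset, length)   # what a remote read would send
```

With the offset known in advance, the cost of reading one member no longer depends on how many members precede it.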
The Rust implementation provides:
- Real parallelism (no Python GIL)
- Zero-copy operations where possible
- Efficient HTTP connection pooling
- Optimized tokio async runtime
Creating indexed datasets
If you're making a new webdataset, consider using the indexed format:
{
  "files": {
    "image_0001.webp": {"offset": 512, "length": 102400},
    "image_0002.webp": {"offset": 102912, "length": 98304},
    ...
  }
}
This enables random access over HTTP, making cloud-stored datasets as fast as local ones for many use cases.
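One possible way to produce such an index for an existing uncompressed tar is Python's standard `tarfile` module. This is a sketch under assumptions, not webshart's own tooling: `build_index` is a hypothetical helper, and it assumes `offset` means the start of the file's payload rather than its tar header. The README does not say which one webshart expects, or how the index file should be named, so check that against a real dataset before relying on it.

```python
import io
import json
import tarfile


def build_index(fileobj) -> dict:
    """Map each regular file in an uncompressed tar to its data offset and length."""
    files = {}
    # Mode "r:" refuses compressed archives, where byte offsets would be meaningless.
    with tarfile.open(fileobj=fileobj, mode="r:") as tar:
        for member in tar:
            if member.isfile():
                files[member.name] = {
                    # Assumption: offset of the payload, not of the tar header.
                    "offset": member.offset_data,
                    "length": member.size,
                }
    return {"files": files}


# Stand-in shard built in memory; a real shard would be opened with open(path, "rb").
payloads = {"image_0001.webp": b"\x00" * 700, "image_0002.webp": b"\x01" * 300}
shard = io.BytesIO()
with tarfile.open(fileobj=shard, mode="w") as tar:
    for name, payload in payloads.items():
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

shard.seek(0)
index = build_index(shard)
index_json = json.dumps(index, indent=2)  # text to store alongside the shard
```

Slicing the shard bytes with any recorded `offset` and `length` should return exactly that file's contents, which is a cheap sanity check to run before uploading.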
Requirements
- Python 3.8+
- Linux/macOS/Windows
License
MIT