Fast and memory-efficient webdataset shard reader
Fast dataloader and conversion utility for webdataset tar shards. Rust core with Python bindings.
Built for streaming large video and image datasets, but handles any byte data.
Install
pip install webshart
What is this?
Webshart is a fast reader for webdataset tar files with separate JSON index files. This format enables random access to any file in the dataset without downloading the entire archive.
The indexed format provides massive performance benefits:
- Random access: Jump to any file instantly
- Selective downloads: Only fetch the files you need
- True parallelism: Read from multiple shards simultaneously
- Cloud-optimized: Works efficiently with HTTP range requests
- Aspect bucketing: Optionally include image geometry hints (width, height and aspect) so that samples can be bucketed by shape
- Custom DataLoader: Includes state dict methods on the DataLoader so that you can resume training deterministically
- Rate-limit friendly: Local caching allows high-frequency random seeking without encountering storage provider rate limits
- Instant start-up with pre-sorted aspect buckets
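To make the aspect bucketing idea concrete, here is a minimal sketch of grouping samples by shape from geometry hints. It assumes each sample carries `width` and `height` (and optionally a precomputed `aspect`), as described in the list above; the function name `bucket_by_aspect` and the dict layout are hypothetical and are not webshart's API.

```python
from collections import defaultdict


def bucket_by_aspect(entries, precision=2):
    """Group sample names by aspect ratio rounded to `precision` decimals.

    `entries` maps a sample name to a dict with "width" and "height",
    and optionally a precomputed "aspect" (hypothetical layout).
    """
    buckets = defaultdict(list)
    for name, meta in entries.items():
        aspect = meta.get("aspect")
        if aspect is None:
            # Fall back to computing the ratio from the geometry hints.
            aspect = meta["width"] / meta["height"]
        buckets[round(aspect, precision)].append(name)
    return dict(buckets)


samples = {
    "a.webp": {"width": 1024, "height": 1024},
    "b.webp": {"width": 512, "height": 512},
    "c.webp": {"width": 1920, "height": 1080},
}
buckets = bucket_by_aspect(samples)
```

Because the geometry lives in the index, a grouping like this can be computed without opening or decoding any image.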
Growing ecosystem: While not all datasets use this format yet, you can easily create indices for any tar-based dataset (see below).
Quick Start
import webshart
# Find your dataset
dataset = webshart.discover_dataset(
    source="laion/conceptual-captions-12m-webdataset",
    # Metadata can be hosted in a separate repository, which reduces load on Hugging Face infrastructure.
    metadata="webshart/conceptual-captions-12m-webdataset-metadata",
)
print(f"Found {dataset.num_shards} shards")
Common Patterns
For real-world, working examples:
Creating Indices for / Converting Existing Datasets
Any tar-based webdataset can benefit from indexing! Webshart includes tools to generate indices:
A command-line tool that auto-discovers tars to process:
% webshart extract-metadata \
--source laion/conceptual-captions-12m-webdataset \
--destination laion_output/ \
--checkpoint-dir ./laion_output/checkpoints \
--max-workers 2 \
--include-image-geometry
Or, if you need to integrate directly into an existing Python application, use the Python API.
Uploading Indices to HuggingFace
Once you've generated indices, share them with the community:
# Upload all JSON files to your dataset, under indices/ in the repo
# (the path in the repo is the third positional argument)
huggingface-cli upload --repo-type=dataset \
  username/dataset-name \
  ./indices/ \
  indices/ \
  --include "*.json"
Or if you want to contribute to an existing dataset you don't own:
- Create a community dataset with indices: username/original-dataset-indices
- Upload the JSON files there
- Open a discussion on the original dataset suggesting they add the indices
Creating New Indexed Datasets
If you're creating a new dataset, generate indices during creation:
{
"files": {
"image_0001.webp": {"offset": 512, "length": 102400},
"image_0002.webp": {"offset": 102912, "length": 98304},
...
}
}
The JSON index should have the same name as the tar file (e.g., shard_0000.tar → shard_0000.json).
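One way to produce such an index while writing a shard is sketched below using only the Python standard library. This is an illustration, not webshart's own tooling: the function name `write_shard_with_index` is hypothetical, and the sketch assumes that `offset` means the byte position where a file's payload starts inside the tar and `length` is the payload size in bytes.

```python
import io
import json
import tarfile


def write_shard_with_index(tar_path, index_path, samples):
    """Write `samples` (name -> bytes) to a tar shard plus a JSON index.

    Assumption: "offset" is where the member's data begins in the tar,
    and "length" is the member's size in bytes.
    """
    # Write the shard itself.
    with tarfile.open(tar_path, "w") as tar:
        for name, payload in samples.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))

    # Re-open the shard to read back where each member's data landed.
    files = {}
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if member.isfile():
                files[member.name] = {
                    "offset": member.offset_data,
                    "length": member.size,
                }

    with open(index_path, "w") as f:
        json.dump({"files": files}, f, indent=2)
    return files
```

Calling `write_shard_with_index("shard_0000.tar", "shard_0000.json", {...})` follows the naming rule above. Re-opening the archive costs one extra pass over the headers, but it avoids computing tar padding by hand.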
Why is it fast?
Problem: Standard tar files require sequential reading. To get file #10,000, you must read through files #1-9,999 first.
Solution: The indexed format stores byte offsets and sample metadata in a separate JSON file, enabling:
- HTTP range requests for any file
- True random access over network
- Parallel reads from multiple shards
- Large scale, aspect-bucketed datasets
- No wasted bandwidth
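As a rough illustration of how an index entry maps to a range request, the sketch below converts an offset/length pair into an HTTP `Range` header and performs the equivalent seek-and-read on a local file. It assumes `offset` points at the first payload byte and `length` is the payload size; the helper names are hypothetical and not part of webshart's API, which does this work in Rust.

```python
import urllib.request


def range_header(offset, length):
    """Build the Range header value for one indexed file.

    HTTP byte ranges are inclusive on both ends, hence the -1.
    """
    return f"bytes={offset}-{offset + length - 1}"


def read_local(path, entry):
    """Read one file's bytes from a local tar by seeking to its offset."""
    with open(path, "rb") as f:
        f.seek(entry["offset"])
        return f.read(entry["length"])


def fetch_remote(url, entry):
    """Fetch one file's bytes from a remote tar with a single range request."""
    request = urllib.request.Request(
        url,
        headers={"Range": range_header(entry["offset"], entry["length"])},
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Either path touches only the bytes of the requested file, so the cost of a read is independent of where the file sits in the shard.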
The Rust implementation provides:
- Real parallelism (no Python GIL)
- Zero-copy operations where possible
- Efficient HTTP connection pooling
- Optimized tokio async runtime
- Optional local caching for metadata and shards
- Fast aspect bucketing for image data
Datasets Using This Format
I discovered after creating this library that cheesechaser is the origin of the indexed tar format, which webshart has formalised and extended to include aspect bucketing support.
- NebulaeWis/e621-2024-webp-4Mpixel
- picollect/danbooru2 (subfolder: images)
- Many picollect image datasets
- Your dataset could be next! See "Creating Indices" above
Requirements
- Python 3.8+
- Linux/macOS/Windows
Roadmap
- Image decoding is not yet handled by this library; it is planned, with zero-copy support.
- A more informative API for caching and other Rust implementation details
- A multi-GPU/multi-node friendly dataloader
Projects using webshart
- CaptionFlow uses this library to solve the memory-use and seek-performance issues typical of webdatasets
License
MIT