Iterable streaming WebDataset for PyTorch from boto3-compliant storage
streaming-wds (Streaming WebDataset)
streaming-wds is a Python library that streams WebDataset-format datasets from boto3-compliant object stores into PyTorch. It is designed to handle large-scale datasets with ease, especially in distributed training contexts.
Features
- Streaming of WebDataset-format data from S3-compatible object stores
- Efficient sharding of data across both torch distributed workers and dataloader multiprocessing workers
- Approximate shard-level mid-epoch resumption when used with StreamingDataLoader
- Blazing-fast data loading with local caching and explicit control over memory consumption
- Customizable decoding of dataset elements via StreamingWebDataset.process_sample
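The sharding feature above splits the shard list across both torch distributed ranks and DataLoader workers. The usual strided-partitioning idea can be sketched as follows (an illustrative helper, not the library's internal implementation):

```python
# Illustrative sketch of shard assignment across distributed ranks and
# dataloader workers. Each (rank, worker_id) pair streams a disjoint,
# strided slice of the shard list, so no shard is read twice.

def assign_shards(shards, rank, world_size, worker_id, num_workers):
    """Return the shards this (rank, worker) pair should stream."""
    stride = world_size * num_workers
    offset = rank * num_workers + worker_id
    return shards[offset::stride]

shards = [f"shard-{i:05d}.tar" for i in range(8)]

# 2 ranks x 2 dataloader workers -> 4 disjoint partitions covering all shards
parts = [
    assign_shards(shards, r, 2, w, 2)
    for r in range(2) for w in range(2)
]
```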
TODO
- Faster tar extraction in C++ threads (using pybind11)
- Key-level mid-epoch resumption
- Tensor Parallel replication strategy
Installation
You can install streaming-wds using pip:
```shell
pip install streaming-wds
```
Quick Start
Here's a basic example of how to use streaming-wds:
```python
import torch
import torchvision.transforms.v2 as T

from streaming_wds import StreamingWebDataset, StreamingDataLoader

# Create the dataset
dataset = StreamingWebDataset(
    remote="s3://your-bucket/your-dataset",
    split="train",
    profile="your_aws_profile",
    shuffle=True,
    max_workers=4,
    schema={".jpg": "PIL", ".json": "json"},
)

# ...or subclass and override process_sample for custom per-sample processing
class ImageNetWebDataset(StreamingWebDataset):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.transforms = T.Compose([
            T.ToImage(),
            T.Resize((64,)),
            T.ToDtype(torch.float32),
            T.Normalize(mean=(128,), std=(128,)),
        ])

    def process_sample(self, sample):
        sample[".jpg"] = self.transforms(sample[".jpg"])
        return sample

# Create a StreamingDataLoader for mid-epoch resumption
dataloader = StreamingDataLoader(dataset, batch_size=32, num_workers=4)

# Iterate through the data
for batch in dataloader:
    # Your training loop here
    pass

# You can save the state for resumption
state_dict = dataloader.state_dict()

# Later, you can resume from this state
dataloader.load_state_dict(state_dict)
```
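The state_dict() / load_state_dict() pair above follows the usual stateful-iterator pattern. A toy illustration of shard-level resumption (a simplified stand-in, not streaming-wds's implementation) might look like:

```python
# Toy illustration of the shard-level resumption contract that
# StreamingDataLoader exposes via state_dict()/load_state_dict().
# This is a simplified stand-in, not streaming-wds code.

class ResumableShardIterator:
    def __init__(self, shards):
        self.shards = shards
        self.position = 0  # index of the next shard to stream

    def __iter__(self):
        while self.position < len(self.shards):
            shard = self.shards[self.position]
            self.position += 1
            yield shard

    def state_dict(self):
        return {"position": self.position}

    def load_state_dict(self, state):
        self.position = state["position"]

it = ResumableShardIterator(["a.tar", "b.tar", "c.tar"])
stream = iter(it)
next(stream)                    # consume "a.tar"
saved = it.state_dict()         # records how far we got

# A fresh iterator restored from the saved state skips already-seen shards
resumed = ResumableShardIterator(["a.tar", "b.tar", "c.tar"])
resumed.load_state_dict(saved)
remaining = list(resumed)
```

Because resumption is shard-level (not key-level, which is still on the TODO list), restarting may replay some samples from the partially consumed shard.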
Configuration
- `remote` (str): The S3 URI of the dataset.
- `split` (Optional[str]): The dataset split (e.g., "train", "val", "test"). Defaults to None.
- `profile` (str): The AWS profile to use for authentication. Defaults to "default".
- `shuffle` (bool): Whether to shuffle the data. Defaults to False.
- `max_workers` (int): Maximum number of worker threads for download and extraction. Defaults to 2.
- `schema` (Dict[str, str]): A dictionary defining the decoding method for each data field. Defaults to {}.
- `memory_buffer_limit_bytes` (Union[Bytes, int, str]): The maximum size of the memory buffer in bytes, per worker. Defaults to "2GB".
- `file_cache_limit_bytes` (Union[Bytes, int, str]): The maximum size of the file cache in bytes, per worker. Defaults to "2GB".
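The two size limits accept either an integer byte count or a human-readable string such as "2GB". A sketch of how such a value could be normalized to bytes (a hypothetical helper; the library's actual parsing rules may differ):

```python
# Hypothetical sketch of normalizing the Union[int, str] size values accepted
# by memory_buffer_limit_bytes / file_cache_limit_bytes into byte counts.
# streaming-wds's actual parsing may differ.

UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}

def to_bytes(limit):
    """Normalize an int byte count or a string like '2GB' to bytes."""
    if isinstance(limit, int):
        return limit
    s = limit.strip().upper()
    for unit in ("GB", "MB", "KB", "B"):  # check longest suffixes first
        if s.endswith(unit):
            return int(float(s[: -len(unit)]) * UNITS[unit])
    return int(s)  # bare numeric string

print(to_bytes("2GB"))     # 2147483648
print(to_bytes("512MB"))   # 536870912
print(to_bytes(4096))      # 4096
```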
Contributing
Contributions to streaming-wds are welcome! Please feel free to submit a Pull Request.
License
MIT License