Iterable Streaming WebDataset for PyTorch from boto3-compatible storage

streaming-wds (Streaming WebDataset)

streaming-wds is a Python library that enables efficient streaming of WebDataset-format datasets from boto3-compatible object stores into PyTorch. It is designed to handle large-scale datasets with ease, especially in distributed training contexts.
Features
- Streaming of WebDataset-format data from S3-compatible object stores
- Efficient sharding of data across both torch distributed workers and dataloader multiprocessing workers
- Supports (approximate) shard-level mid-epoch resumption when used with StreamingDataLoader
- Blazing fast data loading with local caching and explicit control over memory consumption
- Customizable decoding of dataset elements via StreamingDataset.process_sample
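For intuition, splitting shards across both torch distributed ranks and dataloader multiprocessing workers typically reduces to striding over a flattened global worker index. The helper below is an illustrative sketch of that pattern, not streaming-wds's actual implementation; all names in it are hypothetical:

```python
def shards_for_worker(shards, rank, world_size, worker_id, num_workers):
    """Assign shards round-robin across every (rank, dataloader-worker) pair."""
    global_worker = rank * num_workers + worker_id  # flatten the two levels
    total_workers = world_size * num_workers
    # Strided slicing yields disjoint shard sets that together cover all shards.
    return shards[global_worker::total_workers]

# 8 shards across 2 ranks x 2 workers: rank 1 / worker 0 gets shards 2 and 6
print(shards_for_worker(list(range(8)), rank=1, world_size=2, worker_id=0, num_workers=2))
```

Because every worker derives its slice from the same shard list, no coordination between processes is needed at iteration time.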
TODO
- Faster tar extraction in C++ threads (using pybind11)
- Key-level mid-epoch resumption
- Tensor Parallel replication strategy
Installation
You can install streaming-wds using pip:

```shell
pip install streaming-wds
```
Quick Start
Here's a basic example of how to use streaming-wds:
```python
import torch
import torchvision.transforms.v2 as T

from streaming_wds import StreamingWebDataset, StreamingDataLoader

# Create the dataset
dataset = StreamingWebDataset(
    remote="s3://your-bucket/your-dataset",
    split="train",
    profile="your_aws_profile",
    shuffle=True,
    max_workers=4,
    schema={".jpg": "PIL", ".json": "json"},
)


# ... or subclass it with a custom processing function
class ImageNetWebDataset(StreamingWebDataset):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.transforms = T.Compose([
            T.ToImage(),
            T.Resize((64,)),
            T.ToDtype(torch.float32),
            T.Normalize(mean=(128,), std=(128,)),
        ])

    def process_sample(self, sample):
        sample[".jpg"] = self.transforms(sample[".jpg"])
        return sample


# Create a StreamingDataLoader for mid-epoch resumption
dataloader = StreamingDataLoader(dataset, batch_size=32, num_workers=4)

# Iterate through the data
for batch in dataloader:
    # Your training loop here
    pass

# You can save the state for resumption
state_dict = dataloader.state_dict()

# Later, you can resume from this state
dataloader.load_state_dict(state_dict)
```
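Since `state_dict()` returns a plain Python mapping, it can be written to disk between runs with any serializer. A minimal sketch using `pickle`, with a hypothetical stand-in dict in place of a live dataloader (the real dict's keys are internal to streaming-wds):

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for dataloader.state_dict()
state_dict = {"epoch": 0, "shard_index": 12}

path = os.path.join(tempfile.gettempdir(), "loader_state.pkl")

# Persist mid-epoch state so a pre-empted job can resume later
with open(path, "wb") as f:
    pickle.dump(state_dict, f)

# On restart, reload the state and hand it back to the dataloader
with open(path, "rb") as f:
    restored = pickle.load(f)
# dataloader.load_state_dict(restored)  # resumes at (approximate) shard granularity

assert restored == state_dict
```

In a real training script you would typically use `torch.save`/`torch.load` or your checkpointing framework instead; the point is only that the state round-trips as ordinary data.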
Configuration
- remote (str): The S3 URI of the dataset.
- split (Optional[str]): The dataset split (e.g., "train", "val", "test"). Defaults to None.
- profile (str): The AWS profile to use for authentication. Defaults to "default".
- shuffle (bool): Whether to shuffle the data. Defaults to False.
- max_workers (int): Maximum number of worker threads for download and extraction. Defaults to 2.
- schema (Dict[str, str]): A dictionary defining the decoding method for each data field. Defaults to {}.
- memory_buffer_limit_bytes (Union[Bytes, int, str]): The maximum size of the memory buffer in bytes per worker. Defaults to "2GB".
- file_cache_limit_bytes (Union[Bytes, int, str]): The maximum size of the file cache in bytes per worker. Defaults to "2GB".
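The two limit parameters accept either raw byte counts or human-readable strings such as "2GB". As an illustration of that equivalence, assuming binary (1024-based) units — this helper is not part of the library:

```python
def parse_size(value):
    """Convert a size like '2GB' or 512 to a byte count (illustrative only)."""
    if isinstance(value, int):
        return value
    units = {"GB": 1024**3, "MB": 1024**2, "KB": 1024}
    text = value.strip().upper()
    for suffix, factor in units.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    return int(text)  # plain digit string

print(parse_size("2GB"))  # same budget as passing 2 * 1024**3 directly
```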
Contributing
Contributions to streaming-wds are welcome! Please feel free to submit a Pull Request.
License
MIT License