Iterable Streaming Webdataset for PyTorch from boto3 compliant storage
Reason this release was yanked: core bugs
streaming-wds (Streaming WebDataset)
streaming-wds is a Python library that streams WebDataset-format datasets from boto3-compliant object stores into PyTorch. It is designed for large-scale datasets, especially in distributed training contexts.
Note: this was a weekend project and is not yet optimized for production use. Feedback & contributions welcome, especially for performance improvements.
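For readers new to the format: a WebDataset shard is a tar archive in which consecutive files that share a basename form one sample, with the file extension naming the field. The sketch below illustrates that grouping with the standard library only. It is not how streaming-wds reads shards internally; `iter_samples` and `make_demo_shard` are hypothetical helpers written for this example.

```python
import io
import posixpath
import tarfile


def iter_samples(fileobj):
    """Yield (key, {extension: raw_bytes}) pairs from one tar shard.

    Assumes files belonging to the same sample are stored next to each other.
    """
    current_key, sample = None, {}
    # "r|*" reads the archive as a forward-only stream, as a network body would be.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            dirname, basename = posixpath.split(member.name)
            # Split at the first dot so "0001.seg.png" has extension ".seg.png".
            stem, _, ext = basename.partition(".")
            key = posixpath.join(dirname, stem)
            if key != current_key and sample:
                yield current_key, sample
                sample = {}
            current_key = key
            sample["." + ext] = tar.extractfile(member).read()
    if sample:
        yield current_key, sample


def make_demo_shard(files):
    """Build a small in-memory tar shard (hypothetical helper for the demo)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


shard = make_demo_shard([
    ("0001.jpg", b"img-1"),
    ("0001.json", b'{"label": 1}'),
    ("0002.jpg", b"img-2"),
    ("0002.json", b'{"label": 2}'),
])
samples = list(iter_samples(shard))
```

Each element of `samples` pairs a key such as `"0001"` with a dictionary of raw bytes keyed by extension, which is the shape a decoding schema can then act on.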
Features
- Streaming of WebDataset-format data from S3-compatible object stores
- Sharding of data across workers
- Mid-epoch resumption when used with StreamingDataLoader
- Efficient prefetching and parallel processing of data
- Customizable decoding of dataset elements
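To make "sharding of data across workers" concrete, here is a minimal sketch of one common scheme: give every (rank, worker) pair a global index and deal shards out round-robin. This is an assumption about how such sharding could work, not a description of the strategy streaming-wds actually uses; `shards_for_worker` and the shard names are hypothetical.

```python
def shards_for_worker(shards, rank, world_size, worker_id, num_workers):
    """Return the shards owned by one DataLoader worker on one training rank."""
    # Flatten (rank, worker_id) into a single index over all workers.
    global_worker = rank * num_workers + worker_id
    total_workers = world_size * num_workers
    # Strided slicing gives disjoint subsets that together cover every shard.
    return shards[global_worker::total_workers]


all_shards = [f"shard-{i:04d}.tar" for i in range(8)]

# Two ranks with two workers each: four workers share eight shards.
assignments = {
    (rank, worker): shards_for_worker(all_shards, rank, 2, worker, 2)
    for rank in range(2)
    for worker in range(2)
}
```

With this layout no shard is read twice within an epoch, at the cost of uneven load when the shard count is not a multiple of the total worker count.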
Installation
You can install streaming-wds using pip:
pip install streaming-wds
Quick Start
Here's a basic example of how to use streaming-wds:
from streaming_wds import StreamingWebDataset, StreamingDataLoader

# Create the dataset
dataset = StreamingWebDataset(
    remote="s3://your-bucket/your-dataset",
    split="train",
    profile="your_aws_profile",
    prefetch=2,
    shuffle=True,
    max_workers=4,
    schema={".jpg": "PIL", ".json": "json"},
)

# Create a StreamingDataLoader for mid-epoch resumption
dataloader = StreamingDataLoader(dataset, batch_size=32, num_workers=4)

# Iterate through the data
for batch in dataloader:
    # Your training loop here
    pass

# You can save the state for resumption
state_dict = dataloader.state_dict()

# Later, you can resume from this state
dataloader.load_state_dict(state_dict)
Configuration
- remote: The S3 URI of your dataset
- split: The dataset split (e.g., "train", "val", "test")
- profile: The AWS profile to use for authentication
- prefetch: Number of samples to prefetch
- shuffle: Whether to shuffle the data
- max_workers: Maximum number of worker threads for download and extraction
- schema: A dictionary defining the decoding method for each data field
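As an illustration of what a schema such as `{".jpg": "PIL", ".json": "json"}` implies, the sketch below maps each file extension to a named decoder and applies it to the raw bytes of a sample. The `DECODERS` registry and `decode_sample` function are hypothetical and written for this example; the decoders streaming-wds really ships, and how it treats fields missing from the schema, may differ.

```python
import io
import json


def _decode_pil(raw):
    # Imported lazily so the sketch runs without Pillow unless "PIL" is requested.
    from PIL import Image
    return Image.open(io.BytesIO(raw))


# Hypothetical registry: decoder name -> function from raw bytes to a Python object.
DECODERS = {
    "json": lambda raw: json.loads(raw.decode("utf-8")),
    "PIL": _decode_pil,
}


def decode_sample(sample, schema):
    """Decode each field of a raw sample according to the schema."""
    decoded = {}
    for ext, raw in sample.items():
        decoder_name = schema.get(ext)
        if decoder_name is None:
            # Assumption: fields not named in the schema are passed through as bytes.
            decoded[ext] = raw
        else:
            decoded[ext] = DECODERS[decoder_name](raw)
    return decoded


raw_sample = {".json": b'{"label": 3}', ".cls": b"7"}
decoded = decode_sample(raw_sample, {".json": "json"})
```

Keeping the mapping keyed by extension means a new field type only needs a new registry entry rather than a change to the iteration code.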
Mid-Epoch Resumption
When used with StatefulDataLoader from torchdata, streaming-wds supports mid-epoch resumption. This is particularly useful for long-running training jobs that may be interrupted.
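The idea behind mid-epoch resumption can be shown with a toy iterable that records how far it has advanced. This is only a sketch of the `state_dict()` / `load_state_dict()` pattern seen in the Quick Start; `ResumableIterable` is a hypothetical class and does not reflect the state that streaming-wds or torchdata actually track.

```python
class ResumableIterable:
    """Toy iterable that can checkpoint and restore its position."""

    def __init__(self, items):
        self.items = list(items)
        self.position = 0

    def __iter__(self):
        while self.position < len(self.items):
            item = self.items[self.position]
            # Advance before yielding so a checkpoint never replays this item.
            self.position += 1
            yield item
        # Epoch finished: start from the beginning next time.
        self.position = 0

    def state_dict(self):
        return {"position": self.position}

    def load_state_dict(self, state):
        self.position = state["position"]


# Consume part of an epoch, then checkpoint.
source = ResumableIterable(range(10))
iterator = iter(source)
seen = [next(iterator) for _ in range(4)]
checkpoint = source.state_dict()

# A fresh object, as after a job restart, picks up where the first stopped.
resumed = ResumableIterable(range(10))
resumed.load_state_dict(checkpoint)
remaining = list(resumed)
```

In a real training job the checkpoint dictionary would be saved alongside the model and optimizer state so that all three are restored together.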
Contributing
Contributions to streaming-wds are welcome! Please feel free to submit a Pull Request.
License
MIT License
Acknowledgements
This project was inspired by the WebDataset format and built to work seamlessly with PyTorch and torchdata.
Hashes for streaming_wds-0.1.5-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | ba5e1c4c82dafb6a6f134dd5e525919d3c56bb94be9f3f1a706eb79b769dc916
MD5 | 3593040e4772e306268d0938f64384a8
BLAKE2b-256 | e7b89644a91e1f96c353d97c8d8c55d19ce3ffa8cfc0136c1c422cc50a839d1f