Iterable Streaming Webdataset for PyTorch from boto3 compliant storage
Note: this release was yanked. Reason: core bugs.
streaming-wds (Streaming WebDataset)
streaming-wds is a Python library that enables efficient streaming of WebDataset-format datasets from boto3-compliant object stores into PyTorch. It is designed to handle large-scale datasets with ease, especially in distributed training contexts.
Features
- Streaming of WebDataset-format data from S3-compatible object stores
- Efficient sharding of data across both torch distributed workers and dataloader multiprocessing workers
- Supports mid-epoch resumption when used with StreamingDataLoader
- Blazing-fast data loading with local caching and explicit control over memory consumption
- Customizable decoding of dataset elements via StreamingWebDataset.process_sample
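The actual shard-assignment scheme is internal to streaming-wds; as an illustration of the general idea only, a round-robin split across distributed ranks and dataloader workers might look like this (all names and the modulo scheme here are assumptions, not the library's API):

```python
# Illustrative sketch: derive a unique global index for each
# (distributed rank, dataloader worker) pair, then assign shards
# round-robin so every shard is read by exactly one worker.
def global_worker_index(rank, num_dataloader_workers, worker_id):
    # Flatten the 2-D (rank, worker) grid into a single index.
    return rank * num_dataloader_workers + worker_id

def shards_for_worker(num_shards, rank, world_size, num_workers, worker_id):
    total = world_size * num_workers
    gidx = global_worker_index(rank, num_workers, worker_id)
    # Each worker keeps the shards whose index is congruent to its own.
    return [s for s in range(num_shards) if s % total == gidx]
```

With 8 shards, 2 ranks, and 2 dataloader workers per rank, each of the 4 workers receives 2 disjoint shards.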
Installation
You can install streaming-wds using pip:

```shell
pip install streaming-wds
```
Quick Start
Here's a basic example of how to use streaming-wds:
```python
from streaming_wds import StreamingWebDataset, StreamingDataLoader

# Create the dataset
dataset = StreamingWebDataset(
    remote="s3://your-bucket/your-dataset",
    split="train",
    profile="your_aws_profile",
    prefetch=2,
    shuffle=True,
    max_workers=4,
    schema={".jpg": "PIL", ".json": "json"},
)

# Or subclass to add a custom processing step
class ImageNetWebDataset(StreamingWebDataset):
    def process_sample(self, sample):
        # Custom processing logic here
        return sample

# Create a StreamingDataLoader for mid-epoch resumption
dataloader = StreamingDataLoader(dataset, batch_size=32, num_workers=4)

# Iterate through the data
for batch in dataloader:
    # Your training loop here
    pass

# You can save the state for resumption
state_dict = dataloader.state_dict()

# Later, you can resume from this state
dataloader.load_state_dict(state_dict)
```
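To survive a process restart, the state dict has to be written to disk alongside your model checkpoint. A minimal sketch using pickle (assuming the state returned by StreamingDataLoader.state_dict() is picklable; in a real training loop you would more likely bundle it into a torch.save checkpoint):

```python
import pickle

def save_loader_state(state_dict, path):
    # Persist the dataloader state next to your model checkpoint.
    with open(path, "wb") as f:
        pickle.dump(state_dict, f)

def load_loader_state(path):
    # Restore the state to pass back to dataloader.load_state_dict(...).
    with open(path, "rb") as f:
        return pickle.load(f)
```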
Configuration
- remote: The S3 URI of your dataset
- split: The dataset split (e.g., "train", "val", "test")
- profile: The AWS profile to use for authentication
- buffer_size: The size of the buffer for each worker
- shuffle: Whether to shuffle the data
- max_workers: Maximum number of worker threads for download and extraction
- schema: A dictionary defining the decoding method for each data field
- cache_limit_bytes: The maximum size of the local file cache in bytes. This should be fairly large to prevent frequent cache evictions.
- process_sample: Override this method to customize the processing of each sample (for example, with torchvision transforms)
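Since cache_limit_bytes is a plain byte count, it is easiest to compute it from binary units; a 10 GiB budget (an illustrative figure, not a library default) would be expressed as:

```python
# cache_limit_bytes is specified in bytes; compute it from binary units.
GIB = 1024 ** 3                 # 1 GiB = 1,073,741,824 bytes
cache_limit_bytes = 10 * GIB    # 10 GiB local cache budget (example value)
```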
Contributing
Contributions to streaming-wds are welcome! Please feel free to submit a Pull Request.
License
MIT License