
Iterable streaming WebDataset for PyTorch from boto3-compliant storage

streaming-wds (Streaming WebDataset)

streaming-wds is a Python library that streams WebDataset-format datasets from boto3-compliant object stores into PyTorch. It is designed for large-scale datasets, particularly in distributed training.

Features

  • Streaming of WebDataset-format data from S3-compatible object stores
  • Efficient sharding of data across both torch distributed workers and dataloader multiprocessing workers
  • Supports (approximate) shard-level mid-epoch resumption when used with StreamingDataLoader
  • Fast data loading with local caching and explicit limits on memory consumption
  • Customizable decoding of dataset elements via StreamingDataset.process_sample
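
To make the sharding feature concrete, here is a minimal sketch of one way shards could be divided across distributed ranks and dataloader workers. It is an illustration only: `shards_for_worker` is a hypothetical helper, the strided split is an assumed strategy, and none of this is taken from the streaming-wds internals.

```python
from typing import List


def shards_for_worker(
    shards: List[str],
    rank: int,
    world_size: int,
    worker_id: int,
    num_workers: int,
) -> List[str]:
    """Return the shards one dataloader worker on one rank would read.

    Hypothetical helper: a strided split, not the library's own logic.
    """
    # Give every (rank, worker) pair a unique global index.
    global_index = rank * num_workers + worker_id
    total_consumers = world_size * num_workers
    # Consumer i takes shards i, i + total, i + 2 * total, ...
    return shards[global_index::total_consumers]


# Example: 8 shards, 2 ranks, 2 dataloader workers per rank.
shards = [f"shard-{i:04d}.tar" for i in range(8)]
assignment = {
    (rank, worker): shards_for_worker(shards, rank, 2, worker, 2)
    for rank in range(2)
    for worker in range(2)
}
```

With this split every shard is read by exactly one consumer, so no sample is duplicated across ranks or workers within an epoch.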

TODO

  • Faster tar extraction in C++ threads (using pybind11)
  • Key-level mid-epoch resumption
  • Tensor Parallel replication strategy

Installation

You can install streaming-wds using pip:

pip install streaming-wds

Quick Start

Here's a basic example of how to use streaming-wds:

from streaming_wds import StreamingWebDataset, StreamingDataLoader

# Create the dataset
dataset = StreamingWebDataset(
    remote="s3://your-bucket/your-dataset",
    split="train",
    profile="your_aws_profile",
    shuffle=True,
    max_workers=4,
    schema={".jpg": "PIL", ".json": "json"}
)

# or use a custom processing function
import torch  # needed for torch.float32 below
import torchvision.transforms.v2 as T

class ImageNetWebDataset(StreamingWebDataset):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.transforms = T.Compose([
            T.ToImage(),
            T.Resize((64,)),
            T.ToDtype(torch.float32),
            T.Normalize(mean=(128,), std=(128,)),
        ])

    def process_sample(self, sample):
        sample[".jpg"] = self.transforms(sample[".jpg"])
        return sample

# Create a StreamingDataLoader for mid-epoch resumption
dataloader = StreamingDataLoader(dataset, batch_size=32, num_workers=4)

# Iterate through the data
for batch in dataloader:
    # Your training loop here
    pass

# You can save the state for resumption
state_dict = dataloader.state_dict()

# Later, you can resume from this state
dataloader.load_state_dict(state_dict)
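
Resumption is described above as approximate and shard-level. One way to picture what that could mean is sketched below: only fully consumed shards are recorded, so a shard that was partly read at checkpoint time is replayed from its start after resuming. `ShardProgress` is a hypothetical stand-in written for this illustration, not the state format that `StreamingDataLoader` actually uses.

```python
from typing import Dict, Iterator, List, Tuple


class ShardProgress:
    """Hypothetical tracker that records fully consumed shards."""

    def __init__(self, shards: List[str]) -> None:
        self.shards = list(shards)
        self.completed: List[str] = []

    def state_dict(self) -> Dict[str, List[str]]:
        return {"completed": list(self.completed)}

    def load_state_dict(self, state: Dict[str, List[str]]) -> None:
        self.completed = list(state["completed"])

    def iterate(self, samples_per_shard: int) -> Iterator[Tuple[str, int]]:
        done = set(self.completed)
        for shard in self.shards:
            if shard in done:
                continue  # finished before the checkpoint, skip it entirely
            for index in range(samples_per_shard):
                yield shard, index
            # Mark the shard only after its last sample has been consumed.
            self.completed.append(shard)


# Consume all of a.tar plus one sample of b.tar, then checkpoint.
progress = ShardProgress(["a.tar", "b.tar", "c.tar"])
stream = progress.iterate(samples_per_shard=3)
seen = [next(stream) for _ in range(4)]
state = progress.state_dict()

# Resume in a fresh tracker: b.tar restarts from its first sample.
resumed = ShardProgress(["a.tar", "b.tar", "c.tar"])
resumed.load_state_dict(state)
replay = list(resumed.iterate(samples_per_shard=3))
```

In this sketch the first sample of `b.tar` is seen twice across the two runs, which is the sense in which shard-level resumption is approximate.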

Configuration

  • remote (str): The S3 URI of the dataset.
  • split (Optional[str]): The dataset split (e.g., "train", "val", "test"). Defaults to None.
  • profile (str): The AWS profile to use for authentication. Defaults to "default".
  • shuffle (bool): Whether to shuffle the data. Defaults to False.
  • max_workers (int): Maximum number of worker threads for download and extraction. Defaults to 2.
  • schema (Dict[str, str]): A dictionary defining the decoding method for each data field. Defaults to {}.
  • memory_buffer_limit_bytes (Union[Bytes, int, str]): The maximum size of the memory buffer in bytes per worker. Defaults to "2GB".
  • file_cache_limit_bytes (Union[Bytes, int, str]): The maximum size of the file cache in bytes per worker. Defaults to "2GB".
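
The two limit parameters accept either an integer or a string such as "2GB". The sketch below shows how such a value might be normalized to a byte count. `parse_size` is a hypothetical helper, and the decimal multipliers (1 GB = 10**9 bytes) are an assumption; the library's own `Bytes` type may interpret units differently.

```python
import re

# Decimal (SI) multipliers. This choice is an assumption made for the sketch.
_UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def parse_size(value) -> int:
    """Convert an int or a string such as "2GB" into a number of bytes."""
    if isinstance(value, int):
        return value  # already a byte count
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", value)
    if match is None:
        raise ValueError(f"unrecognized size: {value!r}")
    number, unit = match.groups()
    unit = unit.upper()
    if unit not in _UNITS:
        raise ValueError(f"unknown unit in size: {value!r}")
    return int(float(number) * _UNITS[unit])


default_limit = parse_size("2GB")
```

Because both limits are documented as per worker, total usage would scale with the number of dataloader workers on a machine.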

Contributing

Contributions to streaming-wds are welcome! Please feel free to submit a Pull Request.

License

MIT License

