Record sequential storage for deep learning.

WebDataset

WebDataset is a PyTorch Dataset (IterableDataset) implementation providing efficient access to datasets stored in POSIX tar archives.

Storing data in POSIX tar archives greatly speeds up I/O operations on rotational storage and on networked file systems because it permits all I/O operations to operate as large sequential reads and writes.

WebDataset fulfills a similar function to TensorFlow's TFRecord/tf.Example classes, but it is much easier to adopt because it does not require any data conversion: data is stored in exactly the same format inside tar files as it is on disk, and all preprocessing and data augmentation code remains unchanged.

Installation and Documentation

    $ pip install webdataset

For the GitHub version:

    $ pip install git+https://github.com/tmbdev/webdataset.git

Documentation: ReadTheDocs

Using WebDataset

WebDataset reads datasets that are stored as tar files, using the simple convention that files belonging to the same training sample share the same basename. WebDataset can read files from local disk or from any pipe, which allows it to access files stored in common cloud object stores.

%%bash
curl -s http://storage.googleapis.com/nvdata-openimages/openimages-train-000000.tar | tar tf - | sed 10q
e39871fd9fd74f55.jpg
e39871fd9fd74f55.json
f18b91585c4d3f3e.jpg
f18b91585c4d3f3e.json
ede6e66b2fb59aab.jpg
ede6e66b2fb59aab.json
ed600d57fcee4f94.jpg
ed600d57fcee4f94.json
ff47e649b23f446d.jpg
ff47e649b23f446d.json
%pylab inline

import torch
from torchvision import transforms
import webdataset as wds
from itertools import islice

url = "http://storage.googleapis.com/nvdata-openimages/openimages-train-000000.tar"
url = f"pipe:curl -L -s {url} || true"
Populating the interactive namespace from numpy and matplotlib

WebDataset is an implementation of PyTorch's IterableDataset and is fully compatible with PyTorch input pipelines. By default, WebDataset simply iterates through the files in a tar file without decoding anything, returning related files in each sample.

dataset = wds.Dataset(url)

for sample in islice(dataset, 0, 3):
    for key, value in sample.items():
        print(key, repr(value)[:50])
    print()
__key__ 'e39871fd9fd74f55'
jpg b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x01
json b'[{"ImageID": "e39871fd9fd74f55", "Source": "xcli

__key__ 'f18b91585c4d3f3e'
jpg b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00
json b'[{"ImageID": "f18b91585c4d3f3e", "Source": "acti

__key__ 'ede6e66b2fb59aab'
jpg b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x00
json b'[{"ImageID": "ede6e66b2fb59aab", "Source": "acti

There are common processing stages you can add to a dataset to make it a drop-in replacement for any existing dataset. For convenience, common operations are available through a "fluent" interface (as chained method calls).

dataset = (
    wds.Dataset(url)
    .shuffle(100)
    .decode()
    .rename(image="jpg;png", data="json")
    .to_tuple("image", "data")
)

for image, data in islice(dataset, 0, 3):
    print(image.shape, image.dtype, type(data))
(699, 1024, 3) float32 <class 'list'>
(683, 1024, 3) float32 <class 'list'>
(768, 1024, 3) float32 <class 'list'>

Common operations:

  • shuffle(n): shuffle the dataset with a buffer of size n; also shuffles shards (see below)
  • decode([type]): automatically decode files; the type determines desired outputs for images, video, and audio: pil, rgb, rgb8, rgbtorch, etc.
  • rename(new="old1;old2", ...): rename fields
  • map(f): apply f to each sample
  • map_dict(key=f, ...): apply f to its corresponding key
  • map_tuple(f, g, ...): apply f, g, etc. to their corresponding values in the tuple
  • pipe(f): f should be a function that takes an iterator and returns a new iterator (see the sketch after this list)
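
For illustration, here is a sketch that chains several of these stages. The landscape_only filter is invented for this example (it simply drops portrait-orientation samples); everything else reuses the url and islice already defined above.

def landscape_only(source):
    # pipe() expects a function that maps an iterator of samples
    # to a new iterator; here we keep only landscape-orientation images
    for image, data in source:
        height, width = image.shape[:2]
        if width >= height:
            yield image, data

dataset = (
    wds.Dataset(url)
    .shuffle(100)
    .decode()
    .rename(image="jpg;png", data="json")
    .to_tuple("image", "data")
    .pipe(landscape_only)
    .map_tuple(lambda image: image.transpose(2, 0, 1),  # HWC -> CHW
               lambda data: len(data))                  # keep only the annotation count
)

for image, num_annotations in islice(dataset, 0, 3):
    print(image.shape, num_annotations)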

Stages commonly take a handler= argument, which is a function that gets called when there is an exception; you can write your own handler, but the following common handlers are provided (a usage sketch follows the list):

  • webdataset.ignore_and_stop
  • webdataset.ignore_and_continue
  • webdataset.warn_and_stop
  • webdataset.warn_and_continue
  • webdataset.reraise_exception
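
For example, to keep an input pipeline going past occasionally corrupt samples, you might warn and skip rather than raise. A minimal sketch, based on the handler= convention described above:

dataset = (
    wds.Dataset(url)
    .shuffle(100)
    .decode("pil", handler=wds.warn_and_continue)
    .rename(image="jpg;png", data="json", handler=wds.warn_and_continue)
    .to_tuple("image", "data")
)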

Here is an example that uses torchvision data augmentation the same way you might use it with a FileDataset.

normalize = transforms.Normalize(
    mean=[0.485, 0.456, 0.406],
    std=[0.229, 0.224, 0.225])

preproc = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    normalize,
])

dataset = (
    wds.Dataset(url)
    .shuffle(100)
    .decode("pil")
    .rename(image="jpg;png", data="json")
    .map_dict(image=preproc)
    .to_tuple("image", "data")
)

for image, data in islice(dataset, 0, 3):
    print(image.shape, image.dtype, type(data))
torch.Size([3, 224, 224]) torch.float32 <class 'list'>
torch.Size([3, 224, 224]) torch.float32 <class 'list'>
torch.Size([3, 224, 224]) torch.float32 <class 'list'>

Sharding and Parallel I/O

In order to be able to shuffle data better and to process and load data in parallel, it is a good idea to shard it; that is, to split up the dataset into several .tar files.

WebDataset uses standard UNIX brace notation for sharded datasets. For example, the OpenImages dataset consists of 554 shards, each containing about 1 Gbyte of images. You can open the entire dataset as follows.

url = "http://storage.googleapis.com/nvdata-openimages/openimages-train-{000000..000554}.tar"
url = f"pipe:curl -L -s {url} || true"
dataset = (
    wds.Dataset(url)
    .shuffle(100)
    .decode("pil")
    .rename(image="jpg;png", data="json")
    .map_dict(image=preproc)
    .to_tuple("image", "data")
)

When used with a standard Torch DataLoader, this will now perform parallel I/O and preprocessing.

dataloader = torch.utils.data.DataLoader(dataset, num_workers=4, batch_size=16)
images, targets = next(iter(dataloader))
images.shape
torch.Size([16, 3, 224, 224])

Data Sources

When creating a dataset with webdataset.Dataset(url), the URL can be:

  • the string "-", referring to stdin
  • a UNIX path, opened as a regular file
  • a URL-like string with the schema "pipe:"; such URLs are opened with subprocess.Popen. For example:
    • pipe:curl -s -L http://server/file accesses a file via HTTP
    • pipe:gsutil cat gs://bucket/file accesses a file on GCS
    • pipe:az cp --container bucket --name file --file /dev/stdout accesses a file on Azure
    • pipe:ssh host cat file accesses a file via ssh
  • any other URL-like string with another schema; such URLs are passed to the objectio library if it is installed

It might seem at first glance to be "more efficient" to use built-in Python libraries for accessing object stores rather than subprocesses, but efficient object store access from Python really requires spawning a separate process anyway. This approach to accessing object stores is therefore not only convenient, it is also about as efficient as it can be made in Python.
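
As a concrete illustration, here is a sketch of opening a sharded dataset stored on Google Cloud Storage through a pipe: URL (the bucket and shard names are placeholders; brace notation inside pipe: URLs is expanded the same way as shown above):

url = "pipe:gsutil cat gs://my-bucket/openimages-train-{000000..000554}.tar"
dataset = wds.Dataset(url).decode("pil").rename(image="jpg;png", data="json")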

Creating a WebDataset

Since WebDatasets are just regular tar files, you can usually create them by just using the tar command. All you have to do is to arrange for any files that should be in the same sample to share the same basename. Many datasets already come that way. For those, you can simply create a WebDataset with

$ tar --sort=name -cf dataset.tar dataset/

If your dataset has some other directory layout, you can either rearrange the files on disk, or you can use tar --transform to get the right kinds of names in your tar file.

You can also create a WebDataset with library functions in this library:

  • webdataset.TarWriter takes dictionaries containing key-value pairs and writes them to disk as a tar file
  • webdataset.ShardWriter takes dictionaries containing key-value pairs and writes them to disk as a series of shards

Here is how you can use TarWriter for writing a dataset:

# basenames and lookup_cls stand in for your own list of sample names
# and label lookup; encoder=False disables automatic encoding, so the
# values are written as-is
sink = wds.TarWriter("dest.tar", encoder=False)
for basename in basenames:
    with open(f"{basename}.png", "rb") as stream:
        image = stream.read()
    cls = lookup_cls(basename)
    sample = {
        "__key__": basename,
        "png": image,
        "cls": cls
    }
    sink.write(sample)
sink.close()
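
ShardWriter is used the same way, but writes a numbered series of tar files. A minimal sketch, reusing the hypothetical basenames and lookup_cls from above and assuming ShardWriter accepts the same keyword arguments as TarWriter:

# a new shard (dataset-000000.tar, dataset-000001.tar, ...) is started
# automatically after maxcount samples
sink = wds.ShardWriter("dataset-%06d.tar", maxcount=10000, encoder=False)
for basename in basenames:
    with open(f"{basename}.png", "rb") as stream:
        image = stream.read()
    sample = {
        "__key__": basename,
        "png": image,
        "cls": lookup_cls(basename)
    }
    sink.write(sample)
sink.close()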

Writing Filters and Offline Augmentation

WebDataset can be used for filters and offline augmentation of datasets. Here is a complete example that pre-augments a shard and extracts class labels.

def extract_class(data):
    # mock implementation
    return 0

def augment_wds(input, output, maxcount=999999999):
    src = (
        wds.Dataset(input)
        .decode("pil")
        .rename(key="__key__", image="jpg;png", data="json")
        .map_dict(image=preproc)
        .to_tuple("key", "image", "data")
    )
    with wds.TarWriter(output) as dst:
        for key, image, data in islice(src, 0, maxcount):
            print(key)
            image = image.numpy().transpose(1, 2, 0)
            image -= amin(image)
            image /= amax(image)
            sample = {
                "__key__": key,
                "png": image,
                "cls": extract_class(data)
            }
            dst.write(sample)
url = "http://storage.googleapis.com/nvdata-openimages/openimages-train-000000.tar"
url = f"pipe:curl -L -s {url} || true"
augment_wds(url, "_temp.tar", maxcount=5)
e39871fd9fd74f55
f18b91585c4d3f3e
ede6e66b2fb59aab
ed600d57fcee4f94
ff47e649b23f446d
%%bash
tar tf _temp.tar
e39871fd9fd74f55.cls
e39871fd9fd74f55.png
f18b91585c4d3f3e.cls
f18b91585c4d3f3e.png
ede6e66b2fb59aab.cls
ede6e66b2fb59aab.png
ed600d57fcee4f94.cls
ed600d57fcee4f94.png
ff47e649b23f446d.cls
ff47e649b23f446d.png

If you want to preprocess the entire OpenImages dataset with a process like this, you can use your favorite job queueing or workflow system.

For example, using Dask, you could process all 554 shards in parallel using code like this:

import braceexpand
import dask

# materialize the shard names so the generator can be iterated twice
shards = list(braceexpand.braceexpand("{000000..000554}"))
inputs = [f"gs://bucket/openimages-{shard}.tar" for shard in shards]
outputs = [f"gs://bucket2/openimages-augmented-{shard}.tar" for shard in shards]
results = [dask.delayed(augment_wds)(inp, outp) for inp, outp in zip(inputs, outputs)]
dask.compute(*results)

Note that the data is streaming from and to Google Cloud Storage buckets, so very little local storage is required on each worker.

For very large scale processing, it's easiest to submit separate jobs to a Kubernetes cluster using the Kubernetes Job template, or using a workflow engine like Argo.

Related Libraries and Software

The AIStore server provides an efficient backend for WebDataset; it functions like a combination of web server, content distribution network, P2P network, and distributed file system. Together, AIStore and WebDataset can serve input data from rotational drives distributed across many servers at the speed of local SSDs to many GPUs, at a fraction of the cost. We can easily achieve hundreds of MBytes/s of I/O per GPU even in large, distributed training jobs.

The tarp utilities provide command line manipulation and processing of webdatasets and other tar files, including splitting, concatenation, and xargs-like functionality. (There is a Python prototype called tarproc available as well.)

For large scale ETL and preprocessing jobs, you can simply schedule Kubernetes Pods or Jobs to process shards in parallel. Workflow systems like Argo may be useful for that, or you may find qupods useful, since it will pace job submission and keep track of logs for you.

The tensorcom library provides fast three-tiered I/O; it can be inserted between AIStore and WebDataset to permit distributed data augmentation and I/O. It is particularly useful when data augmentation requires more CPU than the GPU server has available.
