Record sequential storage for deep learning.

WebDataset

WebDataset is a PyTorch Dataset (IterableDataset) implementation providing efficient access to datasets stored in POSIX tar archives.

Storing data in POSIX tar archives greatly speeds up I/O on rotational storage and on networked file systems, because it allows all I/O to be performed as large, sequential reads and writes.
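As a minimal sketch of this layout (standard library only; the file names and byte payloads below are made up), reading a tar archive amounts to a single front-to-back scan, with each sample's files appearing next to each other:

```python
import io
import tarfile

# Build a tiny in-memory archive: each sample's files share a basename
# ("0001", "0002") and are written adjacent to each other.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, payload in [
        ("0001.jpg", b"<jpeg bytes>"),
        ("0001.cls", b"3"),
        ("0002.jpg", b"<jpeg bytes>"),
        ("0002.cls", b"7"),
    ]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# Reading it back is one sequential pass over the archive; no seeks
# are needed to gather the files that belong to a sample.
buf.seek(0)
names = []
with tarfile.open(fileobj=buf, mode="r") as tar:
    for member in tar:
        names.append(member.name)

print(names)  # members come back in storage order
```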

WebDataset fulfills a similar function to TensorFlow's TFRecord/tf.Example classes, but it is much easier to adopt because it requires no data conversion: data is stored inside the tar files in exactly the same format as on disk, and all preprocessing and data augmentation code remains unchanged.

Installation

    $ pip install webdataset

For the GitHub version:

    $ pip install git+https://github.com/tmbdev/webdataset.git

Documentation

ReadTheDocs

Using WebDataset

Here is an example of an ImageNet input pipeline used for training common visual object recognition models. Note that, apart from the single call that constructs the WebDataset, this code is identical to standard FileDataset I/O.

    import torch
    from torchvision import transforms
    import webdataset as wds

    normalize = transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225])

    preproc = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        normalize,
    ]) 

    # Brace notation expands to 148 shard URLs:
    # imagenet_train-0000.tgz ... imagenet_train-0147.tgz
    path = "http://server/imagenet_train-{0000..0147}.tgz"

    dataset = wds.WebDataset(path,
                             decoder="pil",             # decode images to PIL
                             extensions="jpg;png cls",  # each sample yields (image, class)
                             transforms=[preproc, lambda x: x - 1])  # adjust class labels

    loader = torch.utils.data.DataLoader(dataset, batch_size=16, num_workers=4)
    for xs, ys in loader:
        train_batch(xs, ys)

Creating WebDataset

In order to permit record-sequential access to the data, WebDataset only requires that the files comprising a single training sample are stored adjacent to each other inside the tar archive. Such archives are easily created using GNU tar:

    tar --sort=name -cf dataset.tar dir

On BSD and macOS, you can use:

    find dir -type f -print | sort | tar -T - -cf dataset.tar
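The same layout can also be produced programmatically. Below is a hedged sketch (standard library only; `write_samples` is a made-up helper for this example, not part of WebDataset) that writes each sample's files adjacently under a shared basename:

```python
import io
import tarfile

def write_samples(fileobj, samples):
    """Write samples into a tar so each sample's files are adjacent.

    `samples` is a sequence of (basename, {extension: bytes}) pairs;
    files within a sample are emitted in a deterministic order.
    """
    with tarfile.open(fileobj=fileobj, mode="w") as tar:
        for basename, files in samples:
            for ext in sorted(files):
                payload = files[ext]
                info = tarfile.TarInfo(f"{basename}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))

buf = io.BytesIO()
write_samples(buf, [
    ("0001", {"jpg": b"<jpeg bytes>", "cls": b"0"}),
    ("0002", {"jpg": b"<jpeg bytes>", "cls": b"1"}),
])
```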

Very large datasets are best stored as shards, each comprising a number of samples. Shards can be shuffled, read, and processed in parallel. The companion tarproc library permits easy sharding, as well as parallel processing of web datasets and shards. The tarproc programs simply operate as filters on tar streams, so for sharding, you can use a command like this:

    tar --sort=name -cf - dir | tarsplit -s 1e9 -o out
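The core of such a splitter is easy to sketch. The following is an illustrative Python version of the size-based grouping only (not `tarsplit` itself; `split_into_shards` and the byte threshold are made up for this example), which never lets a sample span two shards:

```python
def split_into_shards(samples, max_bytes):
    """Group (name, payload) samples into shards whose total payload
    stays at or under max_bytes; a sample never spans two shards."""
    shards, current, size = [], [], 0
    for name, payload in samples:
        # Start a new shard if adding this sample would overflow.
        if current and size + len(payload) > max_bytes:
            shards.append(current)
            current, size = [], 0
        current.append((name, payload))
        size += len(payload)
    if current:
        shards.append(current)
    return shards

# Five 400-byte samples with a 1000-byte shard limit: two samples fit
# per shard (800 bytes), so the split is 2 + 2 + 1.
samples = [(f"{i:04d}.jpg", b"x" * 400) for i in range(5)]
shards = split_into_shards(samples, max_bytes=1000)
```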

TODO

  • support image.* and image=jpg,png,jpeg syntax for extensions
