
Repackaged with modifications.



WebDataset

WebDataset is a PyTorch Dataset (IterableDataset) implementation that provides efficient access to datasets stored in POSIX tar archives, using only sequential/streaming data access. This brings a substantial performance advantage in many compute environments, and it is essential for very large scale training.

While WebDataset scales to very large problems, it also works well with smaller datasets and simplifies creation, management, and distribution of training data for deep learning.

WebDataset implements the standard PyTorch IterableDataset interface and works with the PyTorch DataLoader. Access to datasets is as simple as:

import torch
import webdataset as wds

dataset = wds.WebDataset(url).shuffle(1000).decode("torchrgb").to_tuple("jpg;png", "json")
dataloader = torch.utils.data.DataLoader(dataset, num_workers=4, batch_size=16)

for inputs, outputs in dataloader:
    ...

In that code snippet, url can refer to a local file, a local HTTP server, a cloud storage object, an object in an object store, or even the output of an arbitrary command pipeline.
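As a rough illustration of the kinds of values url might take, the sketch below lists a few forms. It assumes the "pipe:" prefix that WebDataset uses to read from a command's standard output. Every host, bucket, and path is a hypothetical placeholder, and the exact set of supported schemes may differ in this repackaged version.

```python
# Illustrative URL forms only; all hosts, buckets, and paths are hypothetical.
example_urls = {
    "local file": "/data/train-000000.tar",
    "local HTTP server": "http://localhost:8080/train-000000.tar",
    # Assumption: "pipe:" runs the command and reads the tar from its stdout.
    "command pipeline": "pipe:curl -s -L http://example.com/train-000000.tar",
    "cloud storage via a CLI": "pipe:gsutil cat gs://example-bucket/train-000000.tar",
}

for kind, example in example_urls.items():
    print(f"{kind:24s} {example}")
```

Any of these strings would be passed as the first argument to wds.WebDataset in the snippet above.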

WebDataset fulfills a similar function to TensorFlow's TFRecord/tf.Example classes, but it is much easier to adopt because it does not require any kind of data conversion: data is stored in exactly the same format inside tar files as it is on disk, and all preprocessing and data augmentation code remains unchanged.
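To make the storage layout concrete, here is a minimal sketch that uses only Python's standard tarfile module, not WebDataset itself. It assumes the convention that files sharing a basename (for example sample0001.jpg and sample0001.json) form one training sample. The helper names and payloads are hypothetical, and the key is split at the last dot for simplicity, which may differ from how WebDataset handles names containing several dots.

```python
import io
import os
import tarfile


def add_member(tar, name, payload):
    """Add one in-memory file to an open tar archive."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


def write_shard(samples):
    """Pack {key: {extension: bytes}} into an uncompressed tar held in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for key, fields in samples.items():
            for ext, payload in fields.items():
                add_member(tar, f"{key}.{ext}", payload)
    return buffer.getvalue()


def read_shard(data):
    """Stream through the tar once, grouping consecutive files by basename."""
    current_key, current = None, {}
    # Mode "r|" disallows seeking, so the archive is read strictly sequentially.
    with tarfile.open(fileobj=io.BytesIO(data), mode="r|") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, ext = os.path.splitext(member.name)
            if key != current_key and current:
                yield current_key, current
                current = {}
            current_key = key
            current[ext.lstrip(".")] = tar.extractfile(member).read()
    if current:
        yield current_key, current


# Hypothetical placeholder payloads, not real image data.
samples = {
    "sample0001": {"jpg": b"<jpeg bytes>", "json": b'{"label": 3}'},
    "sample0002": {"jpg": b"<jpeg bytes>", "json": b'{"label": 7}'},
}
shard = write_shard(samples)
recovered = dict(read_shard(shard))
print(sorted(recovered))
```

Because the files inside the archive are byte-for-byte the files that would sit on disk, the same decoding and augmentation code can be applied to either.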

Documentation

Installation

$ pip install webdataset

For the GitHub version:

$ pip install git+https://github.com/tmbdev/webdataset.git

Documentation: ReadTheDocs

Introductory Videos

Here are some videos talking about WebDataset and large scale deep learning:

More Examples

Related Libraries and Software

The AIStore server provides an efficient backend for WebDataset; it functions like a combination of web server, content distribution network, P2P network, and distributed file system. Together, AIStore and WebDataset can serve input data from rotational drives distributed across many servers at the speed of local SSDs to many GPUs, at a fraction of the cost. We can easily achieve hundreds of MBytes/s of I/O per GPU even in large, distributed training jobs.

The tarproc utilities provide command line manipulation and processing of webdatasets and other tar files, including splitting, concatenation, and xargs-like functionality.
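The idea behind splitting a tar file into shards can be sketched with the standard library alone. This is not tarproc's implementation or command line interface; the function names are hypothetical, and the sketch assumes that all files of one sample share a basename and are stored consecutively, so that a sample is never divided between two shards.

```python
import io
import os
import tarfile


def make_tar(names):
    """Build an in-memory tar whose members contain their own names as payload."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name in names:
            payload = name.encode()
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def split_tar(data, samples_per_shard):
    """Split tar bytes into smaller tars with at most samples_per_shard samples each."""
    shards, buffer, out = [], None, None
    count, last_key = 0, None
    with tarfile.open(fileobj=io.BytesIO(data), mode="r|") as src:
        for member in src:
            if not member.isfile():
                continue
            key = os.path.splitext(member.name)[0]
            if key != last_key:
                # A new sample starts: close the current shard if it is full.
                if out is not None and count == samples_per_shard:
                    out.close()
                    shards.append(buffer.getvalue())
                    out = None
                if out is None:
                    buffer = io.BytesIO()
                    out = tarfile.open(fileobj=buffer, mode="w")
                    count = 0
                count += 1
                last_key = key
            # Copy the member's header and contents into the current shard.
            out.addfile(member, src.extractfile(member))
    if out is not None:
        out.close()
        shards.append(buffer.getvalue())
    return shards


def list_names(data):
    """Return the member names of a tar held in memory."""
    with tarfile.open(fileobj=io.BytesIO(data)) as tar:
        return tar.getnames()


names = [f"s{i}.{ext}" for i in range(5) for ext in ("jpg", "json")]
shards = split_tar(make_tar(names), samples_per_shard=2)
shard_names = [list_names(s) for s in shards]
print(shard_names)
```

Splitting on sample boundaries rather than on raw file counts keeps each image together with its metadata, which is what a reader that groups files by basename expects.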

The tensorcom library provides fast three-tiered I/O; it can be inserted between AIStore and WebDataset to permit distributed data augmentation and I/O. It is particularly useful when data augmentation requires more CPU than the GPU server has available.

You can find the full PyTorch ImageNet sample code converted to WebDataset at tmbdev/pytorch-imagenet-wds.
