WebDataset
Record-sequential storage for deep learning.
WebDataset is a PyTorch Dataset (IterableDataset) implementation providing efficient access to datasets stored in POSIX tar archives.
Storing data in POSIX tar archives greatly speeds up I/O on rotational storage and on networked file systems, because all I/O operations can proceed as large sequential reads and writes.
WebDataset serves a similar function to TensorFlow's TFRecord/tf.Example classes, but it is much easier to adopt because it requires no data conversion: data is stored inside the tar files in exactly the same format as on disk, and all preprocessing and data augmentation code remains unchanged.
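Concretely, the files making up one training sample share a basename and sit next to each other in the archive. The sketch below builds a tiny archive in memory this way; the sample names and payloads are purely illustrative:

```python
import io
import tarfile

# Each training sample is a group of files sharing a basename
# (e.g. "n01440764_10026"), stored adjacently so one sequential
# pass over the archive reads whole samples.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, payload in [
        ("n01440764_10026.jpg", b"<jpeg bytes>"),
        ("n01440764_10026.cls", b"0"),
        ("n01440764_10027.jpg", b"<jpeg bytes>"),
        ("n01440764_10027.cls", b"0"),
    ]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# Reading the archive back returns the members in the same adjacent order.
buf.seek(0)
with tarfile.open(fileobj=buf) as tar:
    names = tar.getnames()
```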
Installation
$ pip install webdataset
For the Github version:
$ pip install git+https://github.com/tmbdev/webdataset.git
Documentation
Using WebDataset
Here is an example of an ImageNet input pipeline used for training common visual object recognition models. Note that this code is identical to standard FileDataset I/O except for the single call that constructs the WebDataset.
import torch
from torchvision import transforms
import webdataset as wds

normalize = transforms.Normalize(
    mean=[0.485, 0.456, 0.406],
    std=[0.229, 0.224, 0.225])

preproc = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    normalize,
])

path = "http://server/imagenet_train-{0000..0147}.tgz"
dataset = wds.WebDataset(path,
                         decoder="pil",
                         extensions="jpg;png cls",
                         transforms=[preproc, lambda x: x - 1])
loader = torch.utils.data.DataLoader(dataset, batch_size=16, num_workers=4)

for xs, ys in loader:
    train_batch(xs, ys)
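The path above uses shell-style brace notation to reference 148 shard files (0000 through 0147). As a minimal sketch of what that expansion means (WebDataset's own pattern handling may differ, and the function name here is hypothetical):

```python
import re

def expand_shards(pattern):
    """Expand one {AAAA..BBBB} brace range in a shard URL pattern.
    Minimal sketch only; not WebDataset's actual parser."""
    m = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if not m:
        return [pattern]
    lo, hi = m.group(1), m.group(2)
    width = len(lo)  # preserve zero padding from the pattern
    return [pattern[:m.start()] + str(i).zfill(width) + pattern[m.end():]
            for i in range(int(lo), int(hi) + 1)]

urls = expand_shards("http://server/imagenet_train-{0000..0147}.tgz")
```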
Creating WebDataset
In order to permit record-sequential access to the data, WebDataset only requires that the files comprising a single training sample are stored adjacent to each other inside the tar archive. Such archives can easily be created using GNU tar:
tar --sort=name -cf dataset.tar dir
On BSD and macOS, you can use:
find dir -type f -print | sort | tar -T - -cf dataset.tar
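Equivalently, a sorted archive can be produced with Python's standard tarfile module. This is only a sketch (the function name is illustrative, and it assumes samples share a basename so sorting makes their files adjacent):

```python
import os
import tarfile

def make_webdataset_tar(src_dir, out_tar):
    """Archive src_dir with members in sorted order, so the files
    belonging to one sample (which share a basename) end up adjacent."""
    paths = []
    for root, _, files in os.walk(src_dir):
        for f in files:
            paths.append(os.path.join(root, f))
    with tarfile.open(out_tar, "w") as tar:
        for path in sorted(paths):
            tar.add(path, arcname=os.path.relpath(path, src_dir))
```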
Very large datasets are best stored as shards, each containing a number of samples. Shards can be shuffled, read, and processed in parallel. The companion tarproc library supports easy sharding, as well as parallel processing of web datasets and shards. The tarproc programs simply operate as filters on tar streams, so for sharding, you can use a command like this:
tar --sort=name -cf - dir | tarsplit -s 1e9 -o out
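If tarsplit is not at hand, the key idea (cut shards only at sample boundaries, never inside a sample) can be sketched with the standard tarfile module. The shard naming and size logic below are assumptions for illustration, not tarsplit's actual behavior:

```python
import tarfile

def split_shards(in_tar, prefix, max_bytes):
    """Split a tar file into shards of roughly max_bytes each, starting
    new shards only between samples (a sample is identified by the
    basename before the first dot)."""
    def sample_key(name):
        return name.split("/")[-1].split(".")[0]

    shard = None
    shard_bytes = 0
    count = 0
    last_key = None
    with tarfile.open(in_tar) as src:
        for info in src:
            key = sample_key(info.name)
            # open a new shard once the limit is reached, but only at
            # a sample boundary, never in the middle of one
            if shard is None or (shard_bytes >= max_bytes and key != last_key):
                if shard is not None:
                    shard.close()
                shard = tarfile.open("%s-%06d.tar" % (prefix, count), "w")
                count += 1
                shard_bytes = 0
            data = src.extractfile(info) if info.isreg() else None
            shard.addfile(info, data)
            shard_bytes += info.size
            last_key = key
    if shard is not None:
        shard.close()
```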
TODO
- support `image.*` and `image=jpg,png,jpeg` syntax for extensions
Download files
File details
Details for the file webdataset-0.1.2.tar.gz.
File metadata
- Download URL: webdataset-0.1.2.tar.gz
- Upload date:
- Size: 18.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.5
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 9cb669794559089fda6bbde701473a60de3180c0623bf56c5559d9702a330fe6 |
| MD5 | 097542c7ffb39a9fb3436c864f4ea477 |
| BLAKE2b-256 | e76e89a09620082ec90789ba4a9a45a4fd25ec7ad7a56ef633e47a52032df28b |
File details
Details for the file webdataset-0.1.2-py3-none-any.whl.
File metadata
- Download URL: webdataset-0.1.2-py3-none-any.whl
- Upload date:
- Size: 18.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.5
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | aaa4a8567b2905faabe1f0b5331c8da954239c682dac2728b71639cdaa00bf3d |
| MD5 | 87463feccfc4dc98bf76d92655970046 |
| BLAKE2b-256 | 3dd4a603fe8ecea32265a19a926fccd26b3ddccebabb52a9657c2555fdb4eb14 |