
PyTorch-based library focused on data processing and input pipelines in general.


Package renamed to torchdatasets!

  • Use map, apply, reduce or filter directly on Dataset objects
  • Cache data in RAM/disk or via your own method (partial caching supported)
  • Full support for PyTorch's Dataset and IterableDataset
  • General torchdatasets.maps like Flatten or Select
  • Extensible interface (your own cache methods, cache modifiers, maps etc.)
  • Useful torchdatasets.datasets classes designed for general tasks (e.g. file reading)
  • Support for torchvision datasets (e.g. ImageFolder, MNIST, CIFAR10) via td.datasets.WrapDataset
  • Minimal overhead (single call to super().__init__())

:bulb: Examples

Check documentation here: https://szymonmaszke.github.io/torchdatasets

General example

  • Create image dataset, convert it to Tensors, cache and concatenate with smoothed labels:
import pathlib

import torchdatasets as td
import torchvision
from PIL import Image


class Images(td.Dataset):  # Inherit from td.Dataset instead of torch.utils.data.Dataset
    def __init__(self, path: str):
        super().__init__()  # This is the only change
        self.files = list(pathlib.Path(path).glob("*"))

    def __getitem__(self, index):
        return Image.open(self.files[index])

    def __len__(self):
        return len(self.files)


# Convert every image to a Tensor and cache the results
images = Images("./data").map(torchvision.transforms.ToTensor()).cache()

You can concatenate the above dataset with another one (say, labels) and iterate over both as usual:

# `labels` stands for another dataset, e.g. one holding the labels
for data, label in images | labels:
    ...  # Do whatever you want with your data
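
The loop above receives one tuple per index, much as `zip` does for plain sequences. Below is a minimal pure-Python sketch of that pairing idea. `ListDataset` and `Paired` are hypothetical names invented for illustration, and letting the shorter dataset set the length is an assumption of this sketch, not a statement about how torchdatasets implements `|`.

```python
class Paired:
    """Hypothetical view that yields (first[i], second[i]) tuples."""

    def __init__(self, first, second):
        self.first = first
        self.second = second

    def __len__(self):
        # Assumption: the shorter dataset determines the length
        return min(len(self.first), len(self.second))

    def __getitem__(self, index):
        if index >= len(self):
            raise IndexError(index)  # Stops `for` loops over this object
        return self.first[index], self.second[index]


class ListDataset:
    """Hypothetical map-style dataset backed by a list."""

    def __init__(self, items):
        self.items = list(items)

    def __len__(self):
        return len(self.items)

    def __getitem__(self, index):
        return self.items[index]

    def __or__(self, other):
        # `a | b` pairs samples index by index
        return Paired(self, other)


pairs = list(ListDataset([1, 2, 3]) | ListDataset("abc"))
```

Because the pairing is computed on access, neither dataset has to be loaded up front.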
  • Cache first 1000 samples in memory, save the rest on disk in folder ./cache:
images = (
    Images("./data").map(torchvision.transforms.ToTensor())
    # Keep the first 1000 samples in memory
    .cache(td.modifiers.UpToIndex(1000, td.cachers.Memory()))
    # Pickle samples from index 1000 onward to disk in ./cache
    .cache(td.modifiers.FromIndex(1000, td.cachers.Pickle("./cache")))
    # You can define your own cachers and modifiers, see docs
)
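
To make the idea of partial caching concrete, here is a small pure-Python sketch in which indices below a threshold are kept in a dictionary and the rest are pickled to a folder. `TieredCache` is a hypothetical class written for this illustration; it does not reproduce the library's cachers or modifiers.

```python
import pickle
import tempfile
from pathlib import Path


class TieredCache:
    """Hypothetical cache: indices below `threshold` live in RAM, the rest on disk."""

    def __init__(self, source, threshold, folder):
        self.source = source
        self.threshold = threshold
        self.folder = Path(folder)
        self.folder.mkdir(parents=True, exist_ok=True)
        self.memory = {}

    def __len__(self):
        return len(self.source)

    def __getitem__(self, index):
        if index < self.threshold:
            # In-memory tier: compute once, then reuse
            if index not in self.memory:
                self.memory[index] = self.source[index]
            return self.memory[index]
        # Disk tier: one pickle file per sample
        path = self.folder / f"{index}.pkl"
        if not path.exists():
            path.write_bytes(pickle.dumps(self.source[index]))
        return pickle.loads(path.read_bytes())


squares = [i * i for i in range(10)]
with tempfile.TemporaryDirectory() as folder:
    cached = TieredCache(squares, threshold=3, folder=folder)
    values = [cached[i] for i in range(len(cached))]
    in_memory = sorted(cached.memory)
    on_disk = sorted(int(path.stem) for path in Path(folder).glob("*.pkl"))
```

Splitting by index keeps the most frequently needed samples fast to reach while bounding RAM usage.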

To see what else you can do, please check the torchdatasets documentation.

Integration with torchvision

Using torchdatasets, you can easily split torchvision datasets and apply augmentation only to the training part of the data:

import torch
import torchvision

import torchdatasets as td

# Wrap torchvision dataset with WrapDataset
dataset = td.datasets.WrapDataset(torchvision.datasets.ImageFolder("./images"))

# Split dataset; the test split takes the remainder so the lengths sum to len(dataset)
train_size = int(0.6 * len(dataset))
validation_size = int(0.2 * len(dataset))
test_size = len(dataset) - train_size - validation_size
train_dataset, validation_dataset, test_dataset = torch.utils.data.random_split(
    dataset,
    (train_size, validation_size, test_size),
)
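
When `torch.utils.data.random_split` is given integer lengths, they must sum to the length of the dataset, and truncating each fraction with `int()` can fall short of that. A small helper makes the arithmetic explicit for any set of fractions. `split_lengths` is a hypothetical name used only in this sketch.

```python
def split_lengths(total, fractions):
    """Turn fractions into integer lengths that sum exactly to `total`."""
    lengths = [int(fraction * total) for fraction in fractions]
    # Hand the samples lost to truncation to the last split
    lengths[-1] += total - sum(lengths)
    return lengths


lengths = split_lengths(101, (0.6, 0.2, 0.2))
```

For 101 samples, truncation alone would account for only 100 of them. The helper assigns the leftover sample to the final split.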

# Apply torchvision mappings ONLY to train dataset
train_dataset.map(
    td.maps.To(
        torchvision.transforms.Compose(
            [
                torchvision.transforms.RandomResizedCrop(224),
                torchvision.transforms.RandomHorizontalFlip(),
                torchvision.transforms.ToTensor(),
                torchvision.transforms.Normalize(
                    mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
                ),
            ]
        )
    ),
    # Apply this transformation to element 0 of each sample (the image);
    # element 1 is the label
    0,
)
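
Samples from `ImageFolder` are `(image, label)` tuples, so the augmentation has to touch the image while leaving the label alone. The sketch below shows that idea in plain Python. `to_element` is a hypothetical helper written for this illustration and is not the implementation of `td.maps.To`.

```python
def to_element(function, index):
    """Return a callable that applies `function` only to `sample[index]`."""

    def mapper(sample):
        items = list(sample)
        items[index] = function(items[index])
        return tuple(items)  # Other elements pass through unchanged

    return mapper


# Stand-in "image" (a number) and label, just to show the mechanics
double_image = to_element(lambda image: image * 2, 0)
result = double_image((21, "cat"))
```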

Note that you can use td.datasets.WrapDataset with any existing torch.utils.data.Dataset instance to give it additional caching and mapping capabilities!

:wrench: Installation

:snake: pip

Latest release:

pip install --user torchdatasets

Nightly:

pip install --user torchdatasets-nightly

:whale2: Docker

CPU standalone and various GPU-enabled images are available on Docker Hub.

For CPU quickstart, issue:

docker pull szymonmaszke/torchdatasets:18.04

Nightly builds are also available; just prefix the tag with nightly_. If you are going for a GPU image, make sure you have nvidia/docker installed and its runtime set.

:question: Contributing

If you find an issue, or think some functionality may be useful to others and fits this library, please open a new Issue or create a Pull Request.

To get an overview of things one can do to help this project, see the Roadmap.
