PyTorch based library focused on data processing and input pipelines in general.
Project description
Package renamed to torchdatasets!
- Use `map`, `apply`, `reduce` or `filter` directly on `Dataset` objects
- `cache` data in RAM/disk or via your own method (partial caching supported)
- Full PyTorch's `Dataset` and `IterableDataset` support
- General `torchdatasets.maps` like `Flatten` or `Select`
- Extensible interface (your own cache methods, cache modifiers, maps etc.)
- Useful `torchdatasets.datasets` classes designed for general tasks (e.g. file reading)
- Support for `torchvision` datasets (e.g. `ImageFolder`, `MNIST`, `CIFAR10`) via `td.datasets.WrapDataset`
- Minimal overhead (single call to `super().__init__()`)
:bulb: Examples
Check documentation here: https://szymonmaszke.github.io/torchdatasets
General example
- Create an image dataset, convert it to tensors, cache it and concatenate it with smoothed labels:

```python
import pathlib

import torchdatasets as td
import torchvision
from PIL import Image


class Images(td.Dataset):  # Different inheritance
    def __init__(self, path: str):
        super().__init__()  # This is the only change
        self.files = [file for file in pathlib.Path(path).glob("*")]

    def __getitem__(self, index):
        return Image.open(self.files[index])

    def __len__(self):
        return len(self.files)


images = Images("./data").map(torchvision.transforms.ToTensor()).cache()
```
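To see why a single `super().__init__()` call is enough to enable the fluent `.map(...).cache()` chaining, it helps to picture the wrapper pattern behind it. The sketch below is a plain-Python illustration of that mechanism under simplifying assumptions (maps applied lazily per sample, an in-memory dict as the cache); it is not torchdatasets' actual implementation.

```python
# Plain-Python sketch of the map/cache chaining mechanism
# (illustration only -- not torchdatasets' real code).

class SketchDataset:
    def __init__(self, items):
        self._items = list(items)
        self._maps = []      # transformations applied lazily to each sample
        self._cache = {}     # index -> already-computed sample
        self._caching = False

    def map(self, fn):
        self._maps.append(fn)
        return self          # returning self enables fluent chaining

    def cache(self):
        self._caching = True
        return self

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        if self._caching and index in self._cache:
            return self._cache[index]       # cache hit: skip all maps
        sample = self._items[index]
        for fn in self._maps:
            sample = fn(sample)
        if self._caching:
            self._cache[index] = sample     # store the transformed sample
        return sample


dataset = SketchDataset([1, 2, 3]).map(lambda x: x * 10).cache()
print([dataset[i] for i in range(len(dataset))])  # [10, 20, 30]
```

Each sample is transformed once on first access and served from the cache afterwards, which is why caching after an expensive `map` (like image decoding) pays off.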
You can concatenate the above dataset with another (say `labels`) and iterate over both as usual:

```python
for data, label in images | labels:
    ...  # Do whatever you want with your data
```
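The `|` operator above performs a sample-wise zip: each item of the combined dataset is a tuple of the corresponding items from both datasets. A minimal plain-Python sketch of that idea (class and variable names here are illustrative, not torchdatasets' internals):

```python
# Sketch of sample-wise dataset concatenation via zipping
# (illustration of the idea, not the library's implementation).

class Zipped:
    def __init__(self, left, right):
        assert len(left) == len(right), "datasets must have equal length"
        self.left, self.right = left, right

    def __len__(self):
        return len(self.left)

    def __getitem__(self, index):
        # Each sample of the combined dataset is a tuple
        return self.left[index], self.right[index]


images = ["img0", "img1"]
labels = [0, 1]
for data, label in Zipped(images, labels):
    print(data, label)
```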
- Cache the first `1000` samples in memory and save the rest on disk in the `./cache` folder:

```python
images = (
    ImageDataset.from_folder("./data").map(torchvision.transforms.ToTensor())
    # First 1000 samples in memory
    .cache(td.modifiers.UpToIndex(1000, td.cachers.Memory()))
    # Samples from 1000 to the end saved with Pickle on disk
    .cache(td.modifiers.FromIndex(1000, td.cachers.Pickle("./cache")))
    # You can define your own cachers, modifiers, see docs
)
```
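The idea behind this split caching (a modifier routing early indices to a RAM cacher and later indices to a disk cacher) can be sketched in pure Python. Everything below is illustrative, with assumed names; it is not the torchdatasets cacher/modifier API.

```python
import pickle
import pathlib
import tempfile

# Sketch of partial caching: indices below a threshold live in RAM,
# the rest are pickled to disk. Names are illustrative, not the real API.

class PartialCache:
    def __init__(self, threshold, folder):
        self.threshold = threshold
        self.folder = pathlib.Path(folder)
        self.memory = {}

    def _path(self, index):
        return self.folder / f"{index}.pkl"

    def __contains__(self, index):
        return index in self.memory or self._path(index).exists()

    def save(self, index, sample):
        if index < self.threshold:
            self.memory[index] = sample  # early indices stay in RAM
        else:
            # later indices go to disk via pickle
            self._path(index).write_bytes(pickle.dumps(sample))

    def load(self, index):
        if index in self.memory:
            return self.memory[index]
        return pickle.loads(self._path(index).read_bytes())


cache = PartialCache(threshold=2, folder=tempfile.mkdtemp())
for i, sample in enumerate(["a", "b", "c"]):
    cache.save(i, sample)
print(cache.load(0), cache.load(2))  # "a" from RAM, "c" from disk
```

This keeps the hot head of the dataset fast while bounding memory use, which is the trade-off the `UpToIndex`/`FromIndex` modifiers express.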
To see what else you can do, please check the torchdatasets documentation.
Integration with torchvision
Using torchdatasets you can easily split torchvision datasets and apply augmentation
only to the training part of the data without any trouble:
```python
import torch
import torchvision

import torchdatasets as td

# Wrap torchvision dataset with WrapDataset
dataset = td.datasets.WrapDataset(torchvision.datasets.ImageFolder("./images"))

# Split dataset (the split sizes have to sum to len(dataset))
train_size = int(0.6 * len(dataset))
validation_size = int(0.2 * len(dataset))
test_size = len(dataset) - train_size - validation_size
train_dataset, validation_dataset, test_dataset = torch.utils.data.random_split(
    dataset,
    (train_size, validation_size, test_size),
)

# Apply torchvision mappings ONLY to train dataset
train_dataset.map(
    td.maps.To(
        torchvision.transforms.Compose(
            [
                torchvision.transforms.RandomResizedCrop(224),
                torchvision.transforms.RandomHorizontalFlip(),
                torchvision.transforms.ToTensor(),
                torchvision.transforms.Normalize(
                    mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
                ),
            ]
        )
    ),
    # Apply this transformation only to the zeroth element of each sample
    # (the image); the first element is the label
    0,
)
```
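The trailing `0` argument restricts the transformation to a single element of each `(image, label)` sample. The mechanism can be sketched in plain Python as a mapper factory; `apply_to` is a hypothetical name for illustration, not the `td.maps.To` implementation.

```python
# Sketch of applying a transformation to one element of each sample tuple,
# leaving the rest (e.g. the label) untouched. Illustration only.

def apply_to(fn, index):
    def mapper(sample):
        sample = list(sample)
        sample[index] = fn(sample[index])  # transform only the chosen element
        return tuple(sample)
    return mapper


# Uppercase stands in for an image augmentation here
augment = apply_to(lambda image: image.upper(), 0)
print(augment(("cat", 3)))  # ('CAT', 3) -- label left untouched
```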
Please note that you can use td.datasets.WrapDataset with any existing torch.utils.data.Dataset
instance to give it additional caching and mapping powers!
:wrench: Installation
:snake: pip
Latest release:

```shell
pip install --user torchdatasets
```

Nightly:

```shell
pip install --user torchdatasets-nightly
```
:whale2: Docker
CPU standalone and various versions of GPU enabled images are available at dockerhub.
For CPU quickstart, issue:
```shell
docker pull szymonmaszke/torchdatasets:18.04
```
Nightly builds are also available, just prefix the tag with nightly_. If you are going for a GPU image, make sure you have
nvidia/docker installed and its runtime set.
:question: Contributing
If you find an issue or think some functionality may be useful to others and fits this library, please open a new Issue or create a Pull Request.
To get an overview of things you can do to help this project, see the Roadmap.
File details
Details for the file torchdatasets-nightly-1711929801.tar.gz.
File metadata
- Download URL: torchdatasets-nightly-1711929801.tar.gz
- Upload date:
- Size: 25.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.9.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `f97d1931c2b9e96d324d75f85688d745fb8ba5c6d535cb243ac18124d96a86d6` |
| MD5 | `7b0ec21e8773a98c0fdb37f903a0aace` |
| BLAKE2b-256 | `efc53ee6e924ca65bdc9fba87c8e0eb28ca5f1befa72d19a589cf6091f86b476` |
File details
Details for the file torchdatasets_nightly-1711929801-py3-none-any.whl.
File metadata
- Download URL: torchdatasets_nightly-1711929801-py3-none-any.whl
- Upload date:
- Size: 29.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.9.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `bd0f3b5a7a3790f671dc7add38d8ba9b4396eda68cbd9a1ef0a39ba48f248028` |
| MD5 | `fabeed6d5eba9e63b4283175a57b3aa4` |
| BLAKE2b-256 | `d1cc2c2794ff82b7a4a407f4d49dd07546827873cc3bea70c8ec4e337e9cca9e` |