PyTorch dataset wrappers for in-memory caching

Project description

KappaData

Utilities for datasets and data loading with PyTorch:

- modular datasets
- caching datasets in-memory
- various dataset filters and other manipulations (filter by class, limit size to a percentage, ...)

Modular datasets

PyTorch datasets typically load all data of a sample in __getitem__. KappaData splits __getitem__ into separate methods so that individual properties of a sample (e.g. only the label) can be loaded independently.

Image classification dataset example

Let's take an image classification dataset as an example. A sample consists of an image with an associated class label.

import torch

# load_image and image_path_to_class_label are placeholders for your own loading code
class ImageClassificationDataset(torch.utils.data.Dataset):
    def __init__(self, image_paths):
        super().__init__()
        self.image_paths = image_paths

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # the image and the label are always loaded together
        img = load_image(self.image_paths[idx])
        class_label = image_path_to_class_label(self.image_paths[idx])
        return img, class_label

If your training process contains a step that only requires the class labels, the dataset still has to load every image, which can take a long time (whereas loading only the labels is very fast). With KappaData, the __getitem__ method is split into subparts:

import kappadata

# inherit from kappadata.KDDataset
class ImageClassificationDataset(kappadata.KDDataset):
    def __init__(self, image_paths):
        super().__init__()
        self.image_paths = image_paths

    def __len__(self):
        return len(self.image_paths)

    # each property gets its own getitem_<name> method
    def getitem_x(self, idx, ctx=None):
        return load_image(self.image_paths[idx])

    def getitem_y(self, idx, ctx=None):
        return image_path_to_class_label(self.image_paths[idx])

Now each subpart of the dataset can be retrieved by wrapping the dataset into a ModeWrapper:

ds = ImageClassificationDataset(image_paths=...)
for y in kappadata.ModeWrapper(ds, mode="y"):
    ...
for x, y in kappadata.ModeWrapper(ds, mode="x y"):
    ...
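
To make the mechanism concrete, here is a minimal pure-Python sketch of how a mode string could be dispatched to getitem_* methods. ToyDataset and ToyModeWrapper are hypothetical names used only for illustration; this is not KappaData's implementation, and the real ModeWrapper may work differently.

```python
class ToyDataset:
    """Hypothetical dataset with one getter per property (no torch required)."""

    def __init__(self, samples):
        # samples: list of (x, y) pairs held in memory for illustration
        self.samples = samples
        self.x_loads = 0  # counts how often the "expensive" property is loaded

    def __len__(self):
        return len(self.samples)

    def getitem_x(self, idx, ctx=None):
        self.x_loads += 1
        return self.samples[idx][0]

    def getitem_y(self, idx, ctx=None):
        return self.samples[idx][1]


class ToyModeWrapper:
    """Hypothetical wrapper: resolves each name in `mode` to a getitem_<name> method."""

    def __init__(self, dataset, mode):
        self.dataset = dataset
        self.getters = [getattr(dataset, f"getitem_{name}") for name in mode.split()]

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        # raising IndexError lets plain for-loops terminate at the end of the dataset
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        values = tuple(getter(idx) for getter in self.getters)
        return values[0] if len(values) == 1 else values


ds = ToyDataset([("img0", 0), ("img1", 1), ("img2", 0)])
labels = list(ToyModeWrapper(ds, mode="y"))    # never touches getitem_x
pairs = list(ToyModeWrapper(ds, mode="x y"))   # loads x once per sample
```

In this sketch, iterating with mode="y" leaves the x_loads counter untouched, which is the point of splitting __getitem__: the expensive property is only loaded when it is requested.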

torch.utils.data.Subset and torch.utils.data.ConcatDataset can still be used: simply replace them with kappadata.KDSubset and kappadata.KDConcatDataset.

Augmentation parameters

With KappaData you can also retrieve various properties of your data preprocessing (e.g. augmentation parameters). The following example shows how to retrieve the parameters of torchvision.transforms.RandomResizedCrop.

import kappadata
import torchvision
import torchvision.transforms.functional as F

class MyRandomResizedCrop(torchvision.transforms.RandomResizedCrop):
    def forward(self, img, ctx=None):
        # make random resized crop
        i, j, h, w = self.get_params(img, self.scale, self.ratio)
        cropped = F.resized_crop(img, i, j, h, w, self.size, self.interpolation)
        # store parameters
        if ctx is not None:
            ctx["crop_parameters"] = (i, j, h, w)
        return cropped

class ImageClassificationDataset(kappadata.KDDataset):
    def __init__(self, ...):
        ...
        # RandomResizedCrop requires a size; 224 is only an example value
        self.random_resized_crop = MyRandomResizedCrop(size=224)
    ...
    def getitem_x(self, idx, ctx=None):
        img = load_image(self.image_paths[idx])
        return self.random_resized_crop(img, ctx=ctx)

To access the parameters, simply pass return_ctx=True to the ModeWrapper:

ds = ImageClassificationDataset(image_paths=...)
for x, ctx in kappadata.ModeWrapper(ds, mode="x", return_ctx=True):
    print(ctx["crop_parameters"])
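
The ctx pattern itself is independent of images. The following sketch, which uses only the standard library, shows one way a per-sample context dictionary could be created, filled by a transform, and returned next to the data. random_crop_1d and get_with_ctx are hypothetical helpers, and the key "crop_start" is chosen only for this example; none of them are part of KappaData.

```python
import random


def random_crop_1d(values, size, ctx=None, rng=random):
    """Crop `size` consecutive elements at a random offset; record the offset in ctx."""
    start = rng.randrange(len(values) - size + 1)
    if ctx is not None:
        ctx["crop_start"] = start
    return values[start:start + size]


def get_with_ctx(getter, idx):
    """Hypothetical helper: create a fresh ctx per sample and return it with the data."""
    ctx = {}
    return getter(idx, ctx=ctx), ctx


data = [list(range(10)), list(range(100, 110))]
rng = random.Random(0)  # seeded generator so runs are repeatable


def getitem_x(idx, ctx=None):
    return random_crop_1d(data[idx], size=4, ctx=ctx, rng=rng)


x, ctx = get_with_ctx(getitem_x, 1)
# the recorded offset is enough to reproduce the crop
reproduced = data[1][ctx["crop_start"]:ctx["crop_start"] + 4]
```

Creating a fresh dictionary per sample keeps the parameters of different samples from overwriting each other.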

Caching datasets in-memory

SharedDictDataset

kappadata.SharedDictDataset wraps an arbitrary dataset and stores its samples in memory in a dictionary that is shared between all worker processes (using Python multiprocessing data structures). The sharing is important when loading data with num_workers > 0. Small and medium-sized datasets can be cached in memory to avoid bottlenecks when loading data from disk. For example, even the full ImageNet (~130 GB) can be cached on many servers, since it is not uncommon for GPU servers to have more RAM than that.
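
As a rough illustration of the caching idea, here is a minimal sketch of a wrapper that loads each sample once and serves later accesses from a dictionary. ToyCachedDataset and SlowSource are hypothetical classes, not the SharedDictDataset implementation. The sketch uses a plain dict, which is only visible within one process; a dictionary from multiprocessing.Manager().dict() could be passed as cache instead to share it between processes.

```python
class ToyCachedDataset:
    """Hypothetical caching wrapper: loads each sample once, then serves it from `cache`."""

    def __init__(self, dataset, cache=None):
        self.dataset = dataset
        # any dict-like object works here, including a multiprocessing manager dict
        self.cache = {} if cache is None else cache

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        if idx not in self.cache:
            self.cache[idx] = self.dataset[idx]
        return self.cache[idx]


class SlowSource:
    """Stand-in for a dataset that reads from disk; counts how often it is accessed."""

    def __init__(self, n):
        self.n = n
        self.reads = 0

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        self.reads += 1
        return idx * idx


source = SlowSource(5)
cached = ToyCachedDataset(source)
first_epoch = [cached[i] for i in range(len(cached))]
second_epoch = [cached[i] for i in range(len(cached))]  # served from the cache
```

After the first pass, the underlying source is no longer accessed, so the cost of slow storage is paid only once per sample.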

RedisDataset [EXPERIMENTAL]

kappadata.RedisDataset provides an in-memory cache via the Redis in-memory database. This enables sharing data between multiple GPU processes (not only worker processes) for multi-GPU training.

Caching image datasets

Naively caching image datasets can lead to high memory consumption because image data is usually stored in a compressed format and decompressed during loading, so a cache of loaded images holds the much larger decompressed data. To reduce memory, cache the raw compressed data (the encoded file contents) and decompress it on access.

Example caching a torchvision.datasets.ImageFolder:

import kappadata
import torchvision
from kappadata.loading.image_folder import raw_image_loader, raw_image_folder_sample_to_pil_sample

class CachedImageFolder(kappadata.KDDataset):
    def __init__(self, ...):
        super().__init__()
        # modify ImageFolder to load raw samples (NOTE: can't apply transforms onto raw data)
        self.ds = torchvision.datasets.ImageFolder(..., transform=None, loader=raw_image_loader)
        # initialize cached dataset that decompresses the raw data into a PIL image
        self.cached_ds = kappadata.SharedDictDataset(self.ds, transform=raw_image_folder_sample_to_pil_sample)
        # store transforms to apply after decompression
        self.transform = ...

    def getitem_x(self, idx, ctx=None):
        # the label of the ImageFolder sample is not needed for x
        x, _ = self.cached_ds[idx]
        if self.transform is not None:
            x = self.transform(x)
        return x
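
The memory trade-off can be illustrated without any image library. In the sketch below, zlib stands in for an image codec such as JPEG, and CompressedCache is a hypothetical class, not part of KappaData: only the encoded bytes are kept in memory, and decoding happens on every access.

```python
import zlib


class CompressedCache:
    """Hypothetical cache that stores encoded bytes and decodes them on every access."""

    def __init__(self, raw_items):
        # keep only the compressed representation in memory
        self.encoded = [zlib.compress(item) for item in raw_items]

    def __len__(self):
        return len(self.encoded)

    def __getitem__(self, idx):
        # decompression happens per access, trading CPU time for memory
        return zlib.decompress(self.encoded[idx])

    def nbytes(self):
        """Total size of the cached (compressed) data in bytes."""
        return sum(len(blob) for blob in self.encoded)


# highly redundant payloads stand in for decoded image pixels
pixels = [bytes([i]) * 100_000 for i in range(4)]
cache = CompressedCache(pixels)
decoded_size = sum(len(p) for p in pixels)
compressed_size = cache.nbytes()
```

The savings in this toy example are exaggerated because the payloads are trivially compressible; real images compress far less, but the principle is the same.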

Automatically copy datasets to a local (fast) disk

Datasets are often stored on global (slow) storage and moved to a local (fast) disk before training. kappadata.copy_folder_from_global_to_local provides a utility function to do this automatically:

- local path doesn't exist -> automatically copy from global to local
- local path exists -> do nothing
- local path exists but is incomplete -> clear directory and copy again

from pathlib import Path
from kappadata import copy_folder_from_global_to_local
global_path = Path("/system/data/ImageNet")
local_path = Path("/local/data")
# /system/data/ImageNet contains a 'train' and a 'val' folder -> copy whole dataset
copy_folder_from_global_to_local(global_path, local_path)
# copy only "train"
copy_folder_from_global_to_local(global_path, local_path, relative_path="train")

The above code will also work (without modification) if /system/data/ImageNet contains only two zip files, train.zip and val.zip.
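
The three cases listed above can be sketched with the standard library. This is not how copy_folder_from_global_to_local is implemented; in particular, detecting an incomplete copy through a marker file (here the hypothetical name .copy_complete) is an assumption made for this example, and zip handling is omitted.

```python
import shutil
import tempfile
from pathlib import Path

MARKER = ".copy_complete"  # hypothetical marker file, not part of KappaData


def copy_if_needed(src: Path, dst: Path) -> str:
    """Copy src to dst unless a complete copy already exists; return the action taken."""
    if dst.exists():
        if (dst / MARKER).exists():
            return "skipped"
        # a previous copy was interrupted: clear the directory and start over
        shutil.rmtree(dst)
        action = "recopied"
    else:
        action = "copied"
    shutil.copytree(src, dst)
    # written last, so its presence implies that the copy finished
    (dst / MARKER).touch()
    return action


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "global" / "train"
    src.mkdir(parents=True)
    (src / "sample.txt").write_text("data")
    dst = Path(tmp) / "local" / "train"

    actions = [copy_if_needed(src, dst), copy_if_needed(src, dst)]
    (dst / MARKER).unlink()  # simulate an interrupted copy
    actions.append(copy_if_needed(src, dst))
    copied_text = (dst / "sample.txt").read_text()
```

Writing the marker only after the copy has finished means a crash in the middle of copying leaves the directory without a marker, so the next run starts from scratch instead of training on partial data.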

Dataset manipulation/filters

Filter by class
- kappadata.ClassFilterWrapper(ds, valid_classes=[0, 1])
- kappadata.ClassFilterWrapper(ds, invalid_classes=[0, 1])

Balance data by oversampling underrepresented classes
- kappadata.OversamplingWrapper(ds)

Subset by specifying percentages
- kappadata.PercentFilterWrapper(ds, from_percent=0.25)
- kappadata.PercentFilterWrapper(ds, to_percent=0.75)
- kappadata.PercentFilterWrapper(ds, from_percent=0.25, to_percent=0.75)

Repeat the whole dataset
- repeat twice: kappadata.RepeatWrapper(ds, repetitions=2)
- repeat until size is > 100: kappadata.RepeatWrapper(ds, min_size=100)

Shuffle dataset
- kappadata.ShuffleWrapper(ds, seed=5)
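
Wrappers like these essentially select a list of indices into the underlying dataset. The following sketch shows how a percentage window and a class filter could be turned into indices. percent_range and class_filter_indices are hypothetical helpers, and truncating fractional boundaries is an assumption; the library may round differently.

```python
def percent_range(size, from_percent=0.0, to_percent=1.0):
    """Map a [from_percent, to_percent) window onto dataset indices."""
    if not 0.0 <= from_percent <= to_percent <= 1.0:
        raise ValueError("expected 0 <= from_percent <= to_percent <= 1")
    # truncation is an assumption made for this sketch
    start = int(size * from_percent)
    end = int(size * to_percent)
    return range(start, end)


def class_filter_indices(labels, valid_classes):
    """Return the indices whose label is in valid_classes."""
    valid = set(valid_classes)
    return [i for i, label in enumerate(labels) if label in valid]


# middle 50% of a dataset with 8 samples
middle = list(percent_range(8, from_percent=0.25, to_percent=0.75))
# keep only samples of class 0 or 1
kept = class_filter_indices([0, 2, 1, 2, 0], valid_classes=[0, 1])
```

Because each filter only produces indices, filters of this kind can be chained by applying the next one to the already selected subset.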

Miscellaneous

All datasets derived from kappadata.KDDataset automatically support Python slicing:
- all_class_labels = ModeWrapper(ds, mode="y")[:]
- all_class_labels = ModeWrapper(ds, mode="y")[5:-3:2]
All datasets derived from kappadata.KDDataset implement __iter__:
```
for y in ModeWrapper(ds, mode="y"):
    ...
```
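
For readers who want to see what slicing and iteration support involves, here is a minimal sketch of a dataset-like class that handles both. SliceableLabels is a hypothetical class written for this example and does not reflect how KDDataset implements these features.

```python
class SliceableLabels:
    """Hypothetical dataset-like class supporting int indexing, slicing and iteration."""

    def __init__(self, labels):
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def getitem_y(self, idx):
        return self.labels[idx]

    def __getitem__(self, idx):
        if isinstance(idx, slice):
            # slice.indices resolves negative and missing bounds against the length
            return [self.getitem_y(i) for i in range(*idx.indices(len(self)))]
        if idx < 0:
            idx += len(self)
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        return self.getitem_y(idx)

    def __iter__(self):
        for i in range(len(self)):
            yield self[i]


ds = SliceableLabels(list(range(10)))
all_labels = ds[:]
some_labels = ds[5:-3:2]  # start at 5, stop before index 7, step 2
iterated = [y for y in ds]
```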

Project details

Release history

1.4.15

Jun 11, 2024

1.4.14

May 17, 2024

1.4.13

Jan 23, 2024

1.4.12

Dec 18, 2023

1.4.11

Dec 14, 2023

1.4.10

Dec 14, 2023

1.4.9

Dec 14, 2023

1.4.8

Dec 14, 2023

1.4.7

Dec 4, 2023

1.4.6

Dec 4, 2023

1.4.5

Dec 4, 2023

1.4.4

Dec 4, 2023

1.4.3

Dec 4, 2023

1.4.2

Nov 22, 2023

1.4.1

Nov 14, 2023

1.3.86

Nov 12, 2023

1.3.84

Nov 10, 2023

1.3.82

Nov 4, 2023

1.3.81

Nov 4, 2023

1.3.80

Oct 21, 2023

1.3.79

Oct 12, 2023

1.3.78

Oct 6, 2023

1.3.69

Sep 23, 2023

1.3.68

Sep 23, 2023

1.3.67

Sep 21, 2023

1.3.66

Sep 21, 2023

1.3.65

Sep 20, 2023

1.3.64

Sep 19, 2023

1.3.62

Sep 19, 2023

1.3.61

Sep 19, 2023

1.3.60

Sep 19, 2023

1.3.59

Sep 19, 2023

1.3.58

Sep 19, 2023

1.3.57

Sep 18, 2023

1.3.56

Sep 18, 2023

1.3.55

Sep 18, 2023

1.3.54

Sep 17, 2023

1.3.53

Sep 16, 2023

1.3.52

Sep 16, 2023

1.3.51

Sep 16, 2023

1.3.50

Sep 15, 2023

1.3.49

Sep 15, 2023

1.3.48

Sep 15, 2023

1.3.47

Sep 15, 2023

1.3.46

Sep 15, 2023

1.3.45

Sep 12, 2023

1.3.42

Sep 12, 2023

1.3.41

Sep 11, 2023

1.3.39

Sep 1, 2023

1.3.38

Aug 22, 2023

1.3.37

Aug 21, 2023

1.3.35

Aug 21, 2023

1.3.34

Aug 17, 2023

1.3.33

Aug 17, 2023

1.3.32

Aug 17, 2023

1.3.31

Aug 16, 2023

1.3.30

Aug 16, 2023

1.3.29

Aug 16, 2023

1.3.28

Aug 15, 2023

1.3.27

Aug 11, 2023

1.3.26

Aug 11, 2023

1.3.25

Aug 11, 2023

1.3.24

Aug 9, 2023

1.3.22

Aug 9, 2023

1.3.21

Aug 8, 2023

1.3.20

Aug 7, 2023

1.3.19

Aug 7, 2023

1.3.17

Aug 7, 2023

1.3.16

Aug 6, 2023

1.3.15

Aug 6, 2023

1.3.14

Aug 6, 2023

1.3.13

Aug 6, 2023

1.3.12

Aug 6, 2023

1.3.11

Aug 6, 2023

1.3.10

Jul 26, 2023

1.3.9

Jul 26, 2023

1.3.8

Jul 26, 2023

1.3.7

Jul 17, 2023

1.3.5

Jul 11, 2023

1.3.4

Jul 11, 2023

1.3.3

Jul 11, 2023

1.3.2

Jul 10, 2023

1.3.1

Jul 10, 2023

1.3.0

Jul 3, 2023

1.2.63

Jul 3, 2023

1.2.62

Jun 28, 2023

1.2.61

Jun 28, 2023

1.2.60

Jun 28, 2023

1.2.59

Jun 28, 2023

1.2.58

Jun 27, 2023

1.2.57

Jun 27, 2023

1.2.56

Jun 27, 2023

1.2.55

Jun 27, 2023

1.2.54

Jun 27, 2023

1.2.53

Jun 27, 2023

1.2.52

Jun 27, 2023

1.2.49

Jun 27, 2023

1.2.48

Jun 27, 2023

1.2.47

Jun 25, 2023

1.2.46

Jun 25, 2023

1.2.45

Jun 25, 2023

1.2.44

Jun 22, 2023

1.2.43

Jun 20, 2023

1.2.42

Jun 19, 2023

1.2.41

Jun 18, 2023

1.2.39

Jun 2, 2023

1.2.38

Jun 2, 2023

1.2.36

Jun 2, 2023

1.2.35

May 28, 2023

1.2.34

May 18, 2023

1.2.33

May 16, 2023

1.2.32

May 15, 2023

1.2.30

May 14, 2023

1.2.28

May 14, 2023

1.2.27

May 14, 2023

1.2.26

May 14, 2023

1.2.25

May 12, 2023

1.2.24

May 11, 2023

1.2.23

May 9, 2023

1.2.22

May 7, 2023

1.2.21

May 7, 2023

1.2.20

May 4, 2023

1.2.19

May 3, 2023

1.2.18

May 2, 2023

1.2.17

Apr 30, 2023

1.2.14

Apr 28, 2023

1.2.12

Apr 27, 2023

1.2.11

Apr 24, 2023

1.2.10

Apr 24, 2023

1.2.9

Apr 12, 2023

1.2.8

Apr 9, 2023

1.2.7

Apr 7, 2023

1.2.6

Apr 7, 2023

1.2.5

Apr 7, 2023

1.2.4

Apr 7, 2023

1.2.2

Apr 7, 2023

1.2.1

Apr 1, 2023

1.1.23

Mar 31, 2023

1.1.22

Mar 30, 2023

1.1.21

Mar 30, 2023

1.1.20

Mar 30, 2023

1.1.18

Mar 29, 2023

1.1.17

Mar 29, 2023

1.1.16

Mar 19, 2023

1.1.13

Mar 19, 2023

1.1.11

Mar 18, 2023

1.1.10

Mar 16, 2023

1.1.9

Mar 16, 2023

1.1.8

Mar 16, 2023

1.1.7

Mar 15, 2023

1.1.6

Mar 13, 2023

1.1.5

Feb 18, 2023

1.1.4

Feb 17, 2023

1.1.3

Feb 17, 2023

1.1.2

Feb 16, 2023

1.1.1

Jan 31, 2023

1.1.0

Jan 30, 2023

1.0.99

Jan 22, 2023

1.0.98

Jan 22, 2023

1.0.97

Jan 21, 2023

1.0.96

Jan 21, 2023

1.0.95

Jan 21, 2023

1.0.94

Jan 21, 2023

1.0.93

Jan 21, 2023

1.0.92

Jan 21, 2023

1.0.91

Jan 21, 2023

1.0.90

Jan 21, 2023

1.0.89

Jan 21, 2023

1.0.88

Jan 21, 2023

1.0.86

Jan 21, 2023

1.0.85

Jan 21, 2023

1.0.84

Jan 21, 2023

1.0.83

Jan 21, 2023

1.0.82

Jan 21, 2023

1.0.81

Jan 19, 2023

1.0.80

Jan 19, 2023

1.0.79

Jan 17, 2023

1.0.77

Jan 16, 2023

1.0.75

Jan 14, 2023

1.0.74

Jan 14, 2023

1.0.73

Jan 14, 2023

1.0.72

Jan 14, 2023

1.0.71

Jan 14, 2023

1.0.70

Jan 14, 2023

1.0.69

Jan 14, 2023

1.0.68

Jan 14, 2023

1.0.67

Jan 14, 2023

1.0.64

Jan 14, 2023

1.0.63

Jan 11, 2023

1.0.62

Jan 11, 2023

1.0.61

Jan 11, 2023

1.0.60

Jan 10, 2023

1.0.59

Jan 6, 2023

1.0.58

Jan 6, 2023

1.0.57

Jan 6, 2023

1.0.55

Jan 6, 2023

1.0.54

Jan 4, 2023

1.0.53

Dec 15, 2022

1.0.52

Dec 15, 2022

1.0.51

Dec 15, 2022

1.0.50

Dec 14, 2022

1.0.49

Dec 14, 2022

1.0.48

Dec 12, 2022

1.0.47

Dec 8, 2022

1.0.46

Dec 8, 2022

1.0.45

Dec 8, 2022

1.0.44

Nov 25, 2022

1.0.43

Nov 25, 2022

1.0.42

Nov 25, 2022

1.0.41

Nov 25, 2022

1.0.40

Nov 15, 2022

1.0.39

Nov 15, 2022

1.0.38

Nov 14, 2022

1.0.37

Nov 13, 2022

1.0.36

Nov 12, 2022

1.0.34

Nov 11, 2022

1.0.33

Nov 10, 2022

1.0.31

Nov 10, 2022

1.0.30

Nov 10, 2022

1.0.29

Nov 9, 2022

1.0.27

Nov 9, 2022

1.0.26

Nov 8, 2022

1.0.25

Nov 8, 2022

1.0.23

Nov 8, 2022

1.0.22

Nov 8, 2022

1.0.21

Nov 7, 2022

1.0.20

Nov 7, 2022

1.0.19

Nov 7, 2022

1.0.16

Nov 6, 2022

1.0.15

Nov 6, 2022

1.0.14

Nov 6, 2022

1.0.13

Nov 6, 2022

1.0.12

Nov 6, 2022

1.0.11

Nov 1, 2022

1.0.10

Nov 1, 2022

1.0.9

Oct 30, 2022

1.0.8

Oct 30, 2022

1.0.7

Oct 27, 2022

1.0.6

Oct 27, 2022

1.0.5

Oct 24, 2022

1.0.4

Oct 23, 2022

1.0.3

Oct 23, 2022

This version

1.0.2

Oct 23, 2022

1.0.1

Oct 23, 2022

0.0.10

Oct 23, 2022

0.0.9

Oct 23, 2022

0.0.8

Oct 22, 2022

0.0.7

Oct 22, 2022

0.0.4

Oct 15, 2022

0.0.3

Oct 14, 2022

0.0.0

Oct 14, 2022

Download files

Source Distribution

kappadata-1.0.2.tar.gz (14.0 kB)

Uploaded Oct 23, 2022 Source

Built Distribution

kappadata-1.0.2-py3-none-any.whl (17.3 kB)

Uploaded Oct 23, 2022 Python 3

Hashes for kappadata-1.0.2.tar.gz
Algorithm	Hash digest
SHA256	`cd75f8420887d6a9db4517de04d67718099b151830c815b3419da01bacb7b6f3`
MD5	`57049899c6f305dc073399abd1df3e56`
BLAKE2b-256	`9e07414cd733e252ccad9cf00b502e7a7c0be75650eaa2229ca4a128fd41d200`

Hashes for kappadata-1.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2cb1d3ff6844ba5e3ddff5f44e6ce712249644b2df6b98d9a1c9a0cb3d442a38`
MD5	`25f4885c603e3d91821b584309e9c8b4`
BLAKE2b-256	`488eaa3d0d53f8653b8e52e1d6c9831f36d78cd2ed919b648a4c4ec7ddbcf919`