
Process large datasets as if they were iterables.

Project description

lazy_dataset


Lazy_dataset is a helper to deal with large datasets that do not fit into memory. It allows you to define transformations that are applied lazily (e.g. a mapping function that reads data from disk). The transformations are only executed when you iterate over the dataset.

Supported transformations (a short sketch of a few of them follows this list):

  • dataset.map(map_fn): Apply the function map_fn to each example (builtins.map)
  • dataset[2]: Get example at index 2.
  • dataset['example_id']: Get the example with the example ID 'example_id'.
  • dataset[10:20]: Get a sub dataset that contains only the examples in the slice 10 to 20.
  • dataset.filter(filter_fn, lazy=True): Drops examples for which filter_fn(example) is false (builtins.filter).
  • dataset.concatenate(*others): Concatenates two or more datasets (numpy.concatenate)
  • dataset.intersperse(*others): Combine two or more datasets such that examples of each input dataset are evenly spaced (https://stackoverflow.com/a/19293603).
  • dataset.zip(*others): Zip two or more datasets
  • dataset.shuffle(reshuffle=False): Shuffles the dataset. When reshuffle is True, it reshuffles each time you iterate over the data.
  • dataset.tile(reps, shuffle=False): Repeats the dataset reps times and concatenates it (numpy.tile)
  • dataset.cycle(): Repeats the dataset endlessly (itertools.cycle but without caching)
  • dataset.groupby(group_fn): Groups examples together. In contrast to itertools.groupby, no prior sort is necessary, similar to pandas (itertools.groupby, pandas.DataFrame.groupby)
  • dataset.sort(key_fn, sort_fn=sorted): Sorts the examples by the value of key_fn(example) (list.sort)
  • dataset.batch(batch_size, drop_last=False): Batches batch_size examples together as a list. Usually followed by a map (tensorflow.data.Dataset.batch)
  • dataset.random_choice(): Get a random example (numpy.random.choice)
  • dataset.cache(): Cache in RAM (similar to ESPnet's keep_all_data_on_mem)
  • dataset.diskcache(): Cache to a cache directory on the local filesystem (useful on clusters with slow network filesystems)
  • ...
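
As a minimal sketch of how a few of these transformations compose (made-up integer examples; only methods from the list above are used):

>>> import lazy_dataset
>>> ds = lazy_dataset.new({'a': 1, 'b': 2, 'c': 3, 'd': 4})
>>> list(ds.map(lambda x: x * 10))
[10, 20, 30, 40]
>>> list(ds.filter(lambda x: x % 2 == 0, lazy=True))
[2, 4]
>>> list(ds.batch(2))
[[1, 2], [3, 4]]
>>> list(ds.concatenate(ds))
[1, 2, 3, 4, 1, 2, 3, 4]

The walkthrough below shows the same idea on dictionary examples: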
>>> from IPython.lib.pretty import pprint
>>> import lazy_dataset
>>> examples = {
...     'example_id_1': {
...         'observation': [1, 2, 3],
...         'label': 1,
...     },
...     'example_id_2': {
...         'observation': [4, 5, 6],
...         'label': 2,
...     },
...     'example_id_3': {
...         'observation': [7, 8, 9],
...         'label': 3,
...     },
... }
>>> for example_id, example in examples.items():
...     example['example_id'] = example_id
>>> ds = lazy_dataset.new(examples)
>>> ds
  DictDataset(len=3)
MapDataset(_pickle.loads)
>>> ds.keys()
('example_id_1', 'example_id_2', 'example_id_3')
>>> for example in ds:
...     print(example)
{'observation': [1, 2, 3], 'label': 1, 'example_id': 'example_id_1'}
{'observation': [4, 5, 6], 'label': 2, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 3, 'example_id': 'example_id_3'}
>>> def transform(example):
...     example['label'] *= 10
...     return example
>>> ds = ds.map(transform)
>>> for example in ds:
...     print(example)
{'observation': [1, 2, 3], 'label': 10, 'example_id': 'example_id_1'}
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 30, 'example_id': 'example_id_3'}
>>> ds = ds.filter(lambda example: example['label'] > 15)
>>> for example in ds:
...     print(example)
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 30, 'example_id': 'example_id_3'}
>>> ds['example_id_2']
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
>>> ds
      DictDataset(len=3)
    MapDataset(_pickle.loads)
  MapDataset(<function transform at 0x7ff74efb6620>)
FilterDataset(<function <lambda> at 0x7ff74efb67b8>)

Comparison with PyTorch's DataLoader

See here for a feature and throughput comparison of lazy_dataset with PyTorch's DataLoader.

Installation

If you just want to use it, install it directly with pip:

pip install lazy_dataset

If you want to make changes or need the most recent version, clone the repository and install it as follows:

git clone https://github.com/fgnt/lazy_dataset.git
cd lazy_dataset
pip install --editable .
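
For a quick smoke test after either installation method (this one-liner is only an illustration, not part of the project's documentation):

python -c "import lazy_dataset; print(lazy_dataset.new({'a': 1, 'b': 2}).keys())"

This should print the example keys ('a', 'b') without raising an error.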

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lazy_dataset-0.0.15.tar.gz (53.2 kB)


Built Distribution

lazy_dataset-0.0.15-py3-none-any.whl (44.0 kB)


File details

Details for the file lazy_dataset-0.0.15.tar.gz.

File metadata

  • Download URL: lazy_dataset-0.0.15.tar.gz
  • Upload date:
  • Size: 53.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for lazy_dataset-0.0.15.tar.gz

  • SHA256: 6171608976c98429ea2fa87b496d5b387bef26ef205e53a9db611b7adee81a10
  • MD5: e8ed61cc03e8e736bb6d568325ae46a2
  • BLAKE2b-256: 22fc9bf757bca27ddf30c5ae1177b180000a082fe6b02080f93d6efdabf3f8b9

See more details on using hashes here.

File details

Details for the file lazy_dataset-0.0.15-py3-none-any.whl.


File hashes

Hashes for lazy_dataset-0.0.15-py3-none-any.whl

  • SHA256: 2998bb526880c6a1ecbf960e356410a358ddf289e636d48825112f7649a5f1f2
  • MD5: 4473a662b907898ecf283e47669485a4
  • BLAKE2b-256: 14afa0f0bfe49afc8c92edbcc8656f6ae9309fab6ce29564ae9f9cc36908d0b0

See more details on using hashes here.
