Process large datasets as if they were iterables.
lazy_dataset
lazy_dataset is a helper for dealing with large datasets that do not fit into memory. It allows you to define transformations that are applied lazily (e.g. a mapping function that reads data from disk). The transformations are only applied when you iterate over the dataset.
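For instance, a map only records the function to apply; nothing is computed until you iterate. A minimal sketch (`lazy_dataset.new` is introduced in detail further below):

>>> import lazy_dataset
>>> ds = lazy_dataset.new({'a': 1, 'b': 2, 'c': 3})
>>> ds = ds.map(lambda example: example * 10)  # nothing is computed yet
>>> list(ds)  # the map function runs now, once per example
[10, 20, 30]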
Supported transformations:
- `dataset.map(map_fn)`: Apply the function `map_fn` to each example (builtins.map)
- `dataset[2]`: Get the example at index `2`
- `dataset['example_id']`: Get the example that has the example ID `'example_id'`
- `dataset[10:20]`: Get a sub dataset that contains only the examples in the slice 10 to 20
- `dataset.filter(filter_fn, lazy=True)`: Drop examples for which `filter_fn(example)` is false (builtins.filter)
- `dataset.concatenate(*others)`: Concatenate two or more datasets (numpy.concatenate)
- `dataset.intersperse(*others)`: Combine two or more datasets such that the examples of each input dataset are evenly spaced (https://stackoverflow.com/a/19293603)
- `dataset.zip(*others)`: Zip two or more datasets
- `dataset.shuffle(reshuffle=False)`: Shuffle the dataset. When `reshuffle` is `True`, the dataset is reshuffled every time you iterate over it
- `dataset.tile(reps, shuffle=False)`: Repeat the dataset `reps` times and concatenate the repetitions (numpy.tile)
- `dataset.cycle()`: Repeat the dataset endlessly (itertools.cycle, but without caching)
- `dataset.groupby(group_fn)`: Group examples together. In contrast to itertools.groupby, no prior sort is necessary, as in pandas (itertools.groupby, pandas.DataFrame.groupby)
- `dataset.sort(key_fn, sort_fn=sorted)`: Sort the examples by the values `key_fn(example)` (list.sort)
- `dataset.batch(batch_size, drop_last=False)`: Batch `batch_size` examples together as a list; usually followed by a map (tensorflow.data.Dataset.batch)
- `dataset.random_choice()`: Get a random example (numpy.random.choice)
- `dataset.cache()`: Cache the examples in RAM (similar to ESPnet's `keep_all_data_on_mem`)
- `dataset.diskcache()`: Cache the examples in a cache directory on the local filesystem (useful on clusters with slow network filesystems)
- ...
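A couple of these in action; a minimal sketch whose expected outputs follow directly from the descriptions above (`tile` repeats and concatenates, `cycle` repeats endlessly, so we only take a slice):

>>> import itertools
>>> import lazy_dataset
>>> ds = lazy_dataset.new({'a': 1, 'b': 2})
>>> list(ds.tile(3))
[1, 2, 1, 2, 1, 2]
>>> list(itertools.islice(ds.cycle(), 5))
[1, 2, 1, 2, 1]

The full walkthrough below builds a small dict-backed dataset and chains several transformations: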
>>> from IPython.lib.pretty import pprint
>>> import lazy_dataset
>>> examples = {
... 'example_id_1': {
... 'observation': [1, 2, 3],
... 'label': 1,
... },
... 'example_id_2': {
... 'observation': [4, 5, 6],
... 'label': 2,
... },
... 'example_id_3': {
... 'observation': [7, 8, 9],
... 'label': 3,
... },
... }
>>> for example_id, example in examples.items():
... example['example_id'] = example_id
>>> ds = lazy_dataset.new(examples)
>>> ds
  DictDataset(len=3)
MapDataset(_pickle.loads)
>>> ds.keys()
('example_id_1', 'example_id_2', 'example_id_3')
>>> for example in ds:
... print(example)
{'observation': [1, 2, 3], 'label': 1, 'example_id': 'example_id_1'}
{'observation': [4, 5, 6], 'label': 2, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 3, 'example_id': 'example_id_3'}
>>> def transform(example):
... example['label'] *= 10
... return example
>>> ds = ds.map(transform)
>>> for example in ds:
... print(example)
{'observation': [1, 2, 3], 'label': 10, 'example_id': 'example_id_1'}
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 30, 'example_id': 'example_id_3'}
>>> ds = ds.filter(lambda example: example['label'] > 15)
>>> for example in ds:
... print(example)
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 30, 'example_id': 'example_id_3'}
>>> ds['example_id_2']
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
>>> ds
      DictDataset(len=3)
    MapDataset(_pickle.loads)
  MapDataset(<function transform at 0x7ff74efb6620>)
FilterDataset(<function <lambda> at 0x7ff74efb67b8>)
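The walkthrough above covers map, filter, and key/index access. Here is a short sketch of `sort`, `batch` (with `drop_last`), and `concatenate` on a toy dataset; the outputs are written from the semantics documented in the transformation list above:

>>> ds = lazy_dataset.new({'a': 3, 'b': 1, 'c': 4, 'd': 2})
>>> list(ds.sort(lambda example: example))
[1, 2, 3, 4]
>>> list(ds.batch(3, drop_last=True))  # the incomplete batch [2] is dropped
[[3, 1, 4]]
>>> other = lazy_dataset.new({'e': 5, 'f': 6})
>>> list(ds.concatenate(other))
[3, 1, 4, 2, 5, 6]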
Comparison with PyTorch's DataLoader
See here for a feature and throughput comparison of lazy_dataset with PyTorch's DataLoader.
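lazy_dataset itself has no PyTorch dependency, but a finite, indexable dataset provides `__len__` and integer `__getitem__`, so it can also be consumed directly by a map-style `torch.utils.data.DataLoader`. A minimal sketch, assuming torch is installed; `collate_fn=list` is only used here to keep the dict examples uncollated:

import lazy_dataset
from torch.utils.data import DataLoader

examples = {f'id_{i}': {'label': i} for i in range(4)}
ds = lazy_dataset.new(examples)

# A map-style DataLoader only needs len() and integer indexing,
# both of which this dataset supports.
loader = DataLoader(ds, batch_size=2, collate_fn=list)
for batch in loader:
    print(batch)  # a plain list of two example dicts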
Installation
If you just want to use it, install it directly with pip:
pip install lazy_dataset
If you want to make changes or need the most recent version, clone the repository and install it in editable mode:
git clone https://github.com/fgnt/lazy_dataset.git
cd lazy_dataset
pip install --editable .