A data processing pipeline and iterator for machine learning.

Lunas

Lunas is a Python-based library that provides a set of simple interfaces for building data processing pipelines, together with an iterator for looping through data.

Lunas draws its data-handling style from TensorFlow and PyTorch, and borrows some implementation details from AllenNLP.

Overview

A Dataset represents a dataset and holds corresponding pre-processing and filtering operations. Currently the following features are supported:

- Buffered reading.
- Buffered shuffling.
- Chained processing and filtering interface.
- Handling multiple input sources.
- Persistable.

Supported datasets:

- Zip: Zips multiple datasets.
- Shuffle: A wrapper that performs buffered shuffling.
- Sort: A wrapper that performs buffered sorting.
- InvertibleSort: A wrapper that performs buffered sorting and returns each sample along with its original index in the dataset.
- Enumerate: Similar to Python's enumerate; wraps a dataset and attaches an index to each of its elements.
- Range: Similar to Python's range.
- Count: Similar to Python's itertools.count.
- TextLine: A wrapper around a plain-text file. Each line of the file is taken as a sample of the dataset.
- Stdin: A wrapper that reads from standard input.
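
To illustrate why returning the original index is useful, here is a minimal pure-Python sketch of buffered sorting that keeps each sample's position so the original order can be restored later (for example, after running inference on length-sorted batches). It does not use Lunas; the function name `buffered_invertible_sort` and its parameters are hypothetical, and the actual InvertibleSort implementation may differ.

```python
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def buffered_invertible_sort(
    samples: Iterable[T],
    bufsize: int,
    key: Callable[[T], int],
) -> Iterator[Tuple[int, T]]:
    """Sort within fixed-size buffers, yielding (original_index, sample) pairs."""
    buffer: List[Tuple[int, T]] = []
    for index, sample in enumerate(samples):
        buffer.append((index, sample))
        if len(buffer) == bufsize:
            # Sort only the current buffer, then flush it.
            yield from sorted(buffer, key=lambda pair: key(pair[1]))
            buffer = []
    # Flush the final, possibly smaller, buffer.
    yield from sorted(buffer, key=lambda pair: key(pair[1]))


sentences = ["a b c", "a", "a b", "a b c d", "a b"]
pairs = list(buffered_invertible_sort(sentences, bufsize=3, key=len))
# Restore the original order from the attached indices.
restored = [s for _, s in sorted(pairs, key=lambda pair: pair[0])]
```

Because sorting only happens inside each buffer, memory use is bounded by `bufsize`, at the cost of the output being only locally sorted.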

An Iterator generates batches by iterating through the dataset and maintains the iteration state. The following features are supported:

- Dynamic batching at runtime.
- Custom stopping criteria.
- Persistable.
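
As a rough illustration of dynamic batching, the sketch below groups samples so that the total "size" of a batch (for example, the number of tokens) stays within a budget, instead of using a fixed number of samples per batch. It is plain Python and does not use Lunas; `dynamic_batches` is a hypothetical helper, and the batching policy of the real Iterator may differ (for instance in how it accounts for padding).

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def dynamic_batches(
    samples: Iterable[T],
    batch_size: int,
    sample_size_fn: Callable[[T], int],
) -> Iterator[List[T]]:
    """Group samples so the summed sample size of a batch stays within batch_size."""
    batch: List[T] = []
    total = 0
    for sample in samples:
        size = sample_size_fn(sample)
        # Close the current batch if adding this sample would exceed the budget.
        if batch and total + size > batch_size:
            yield batch
            batch, total = [], 0
        batch.append(sample)
        total += size
    if batch:
        # Emit the last, possibly under-filled, batch.
        yield batch


sentences = [["tok"] * n for n in (3, 4, 2, 6, 1)]
batches = list(dynamic_batches(sentences, batch_size=7, sample_size_fn=len))
sizes = [[len(s) for s in b] for b in batches]
print(sizes)  # [[3, 4], [2], [6, 1]]
```

In this sketch, a single sample larger than the budget still forms a batch of its own rather than being dropped.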

We also modify PyTorch's DataLoader to make it compatible with our batch iterator.

Persistable provides a class with a PyTorch-compatible interface for dumping and loading instance state, which is useful for resuming the training process.

Requirements

- NumPy
- overrides
- typings
- Python >= 3.7

Lunas hardly relies on any third-party libraries; the required libraries are there mainly to take advantage of the type hint features provided by Python 3.

This project uses type hints, and the built-in typing module in Python versions lower than 3.7 can decrease performance. This has been resolved since Python 3.7, so Lunas currently requires Python 3.7 or later to work efficiently.

Installation

Install using pip:

```
pip install lunas
```

Example

Create a dataset and iterate through it.
```
from lunas import Range

ds = Range(10)
for x in ds: # epoch 1
    print(x)
for x in ds: # epoch 2
    print(x)
```
- A Range dataset is created, similar to range(10), and each for loop iterates through it for one epoch. As you can see, the same dataset can be iterated through several times.
Build a data processing pipeline.
```
ds = Range(10).select(lambda x: x + 1).select(lambda x: x * 2).where(lambda x: x % 2 == 0)
```
- The chained calls on a Dataset object define a processing pipeline on the dataset.
- select(fn) applies a transformation to each dataset element lazily. The argument fn is a custom mapping function that takes a single sample as input and returns the transformed sample. You can apply any transformation and return a sample of any type, e.g., a Dict, a List, or a custom Sample.
- where(fn) accepts a predicate fn that returns a bool for an input sample: if True, the sample is preserved; otherwise it is discarded.
- The mapping and filtering ops given by select(fn) and where(fn) are not executed immediately, but later, when iterating through the dataset object.
- Both select(fn) and where(fn) return the dataset itself to enable chained invocations. The mapping and filtering ops are attached to the dataset in place.
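
The lazy, in-place chaining behaviour described above can be sketched in plain Python as follows. This is only an illustration of the idea and not Lunas's implementation: `LazyPipeline` and `add_one` are hypothetical names, and the sketch only records the ops when select and where are called and runs them during iteration.

```python
from typing import Any, Callable, Iterable, Iterator, List, Tuple


class LazyPipeline:
    """Hypothetical stand-in for a dataset with chained, lazily applied ops."""

    def __init__(self, source: Iterable[Any]) -> None:
        self._source = source
        self._ops: List[Tuple[str, Callable[[Any], Any]]] = []

    def select(self, fn: Callable[[Any], Any]) -> "LazyPipeline":
        self._ops.append(("select", fn))  # record only; nothing runs yet
        return self  # returning self enables chaining

    def where(self, predicate: Callable[[Any], bool]) -> "LazyPipeline":
        self._ops.append(("where", predicate))
        return self

    def __iter__(self) -> Iterator[Any]:
        for sample in self._source:
            keep = True
            for kind, fn in self._ops:
                if kind == "select":
                    sample = fn(sample)  # apply the mapping
                elif not fn(sample):
                    keep = False  # predicate rejected the sample
                    break
            if keep:
                yield sample


calls = []


def add_one(x):
    calls.append(x)  # record when the mapping actually runs
    return x + 1


ds = LazyPipeline(range(5)).select(add_one).where(lambda x: x % 2 == 0)
calls_before = list(calls)  # still empty: no op has run yet
result = list(ds)
print(calls_before, result)  # [] [2, 4]
```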
Deal with multiple input sources.
```
from lunas import Range, Zip, Shuffle

ds1 = Range(10)
ds2 = Range(start=10, stop=20, step=1)
ds = Zip(ds1, ds2).select(lambda x,y: x + y)
ds = Shuffle(ds, bufsize=5)
```
- The code above creates two datasets and combines them with Zip. A Zip dataset yields a tuple containing one sample from each of its internal datasets.
- Shuffle performs buffered random shuffling on the dataset, here with a buffer of 5 samples.
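
For intuition about what buffered shuffling means, here is a minimal sketch in plain Python: samples are collected into a buffer of `bufsize` elements, the buffer is shuffled, and then it is emitted. The helper `buffered_shuffle` and its `seed` parameter are assumptions made for this illustration; Lunas's Shuffle may use a different strategy internally.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def buffered_shuffle(
    samples: Iterable[T],
    bufsize: int,
    seed: Optional[int] = None,
) -> Iterator[T]:
    """Shuffle a stream chunk by chunk, holding at most bufsize samples in memory."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for sample in samples:
        buffer.append(sample)
        if len(buffer) == bufsize:
            rng.shuffle(buffer)
            yield from buffer
            buffer = []
    # Shuffle and emit whatever is left at the end of the stream.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(10), bufsize=5, seed=0))
```

With this scheme a sample never moves further than its own buffer, so a larger `bufsize` gives better randomization at the cost of more memory.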

A practical use case in a machine translation scenario.
```
from lunas import TextLine, Zip, Shuffle, Sort, Iterator

# src_vocab and tgt_vocab are assumed to be vocabulary objects defined elsewhere.

# Tokenize the input into a list of tokens.
source = TextLine('train.fr').select(lambda x: x.split())
target = TextLine('train.en').select(lambda x: x.split())
ds = Zip(source, target).select(lambda x, y: {
    'x': src_vocab.lookup(x),  # Map words to ids
    'y': tgt_vocab.lookup(y),
    'size_x': len(x),
    'size_y': len(y),
})
# Ensure the inputs are of length not exceeding 50.
ds = ds.where(lambda x: max(x['size_x'], x['size_y']) <= 50)
# Shuffle the dataset within a buffer of 10000 samples.
ds = Shuffle(ds, bufsize=10000)
# Sort samples within each buffer by source text length.
sort_key = lambda x: x['size_x']
ds = Sort(ds, bufsize=1000, sort_key_fn=sort_key)

# Convert a list of samples to model inputs.
collate_fn = lambda x: ...

it = Iterator(iterable=ds, batch_size=4096,
              sample_size_fn=lambda x: x['size_x'],
              collate_fn=collate_fn,
              dist_world_size=1,
              dist_local_rank=0,
              drop_tail=True)

# Iterate for at most 100 epochs and 1000000 steps.
for batch in it.while_true(lambda: it.epoch < 100 and it.step < 1e6):
    print(it.epoch, it.step, it.step_in_epoch, batch)
```

This code should be simple enough to understand, even if you are not familiar with machine translation.

Save and reload iteration state.

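```
import pickle

# Save the iteration state.
with open('state.pkl', 'wb') as f:
    pickle.dump(it.state_dict(), f)
# ...
# Reload the state later to resume iteration.
with open('state.pkl', 'rb') as f:
    state = pickle.load(f)
it.load_state_dict(state)
```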

state_dict() returns a picklable dictionary, which can be loaded by it.load_state_dict() to resume the iteration process later.
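
To make the save-and-resume idea concrete without depending on Lunas, the following sketch shows a toy iterator that exposes the same two method names, state_dict() and load_state_dict(). The class `ResumableCounter` is hypothetical and tracks only a step counter; the state kept by the real Iterator is richer.

```python
import pickle
from typing import Any, Dict, Iterator, Sequence


class ResumableCounter:
    """Hypothetical iterator that can dump and restore its position."""

    def __init__(self, data: Sequence[Any]) -> None:
        self._data = data
        self.step = 0

    def __iter__(self) -> Iterator[Any]:
        # Continue from the current step rather than from the beginning.
        while self.step < len(self._data):
            item = self._data[self.step]
            self.step += 1
            yield item

    def state_dict(self) -> Dict[str, int]:
        return {"step": self.step}

    def load_state_dict(self, state: Dict[str, int]) -> None:
        self.step = state["step"]


it = ResumableCounter(list("abcde"))
stream = iter(it)
first = [next(stream), next(stream)]  # consume two items

blob = pickle.dumps(it.state_dict())  # serialize the state

restored = ResumableCounter(list("abcde"))
restored.load_state_dict(pickle.loads(blob))
rest = list(restored)
print(first, rest)  # ['a', 'b'] ['c', 'd', 'e']
```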

Extend the dataset.
- You can refer to the implementation of the TextLine dataset to create your own custom dataset.

Conclusions

Please feel free to contact me if you have any questions or find any bugs in Lunas.


