A data processing pipeline and iterator with minimal dependencies for machine learning.

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Lunas

Lunas is a Python 3-based library that provides a set of simple interfaces for data processing pipelines and an iterator for looping through data.

Lunas borrows its data-handling style from TensorFlow and PyTorch, and some implementation details from AllenNLP.

Features

Reader

A reader defines a dataset and corresponding preprocessing and filtering rules. Currently the following features are supported:

- Buffered reading.
- Buffered shuffling.
- Chained processing and filtering interface.
- Preprocess and filter the data buffer in parallel.
- Handling multiple input sources.
- Persistable.
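
To make "buffered shuffling" concrete, here is a minimal sketch in plain Python. It is not the Lunas implementation: `buffered_shuffle`, `buffer_size`, and `seed` are hypothetical names chosen for illustration. The idea is to fill a fixed-size buffer, shuffle it, emit its contents, and repeat, so that only `buffer_size` samples are held in memory at once.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def buffered_shuffle(source: Iterable[T], buffer_size: int,
                     seed: Optional[int] = None) -> Iterator[T]:
    """Yield items from `source` in a locally shuffled order.

    Hypothetical helper, not part of Lunas. At most `buffer_size`
    items are held in memory at any time.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in source:
        buffer.append(item)
        if len(buffer) == buffer_size:
            # The buffer is full: shuffle it and emit everything in it.
            rng.shuffle(buffer)
            yield from buffer
            buffer = []
    # Flush whatever is left once the source is exhausted.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=0))
```

With this scheme a sample never moves outside its own buffer window, so a larger buffer gives a better shuffle at the cost of more memory.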

DataIterator

An iterator performs multi-pass iterations over the dataset and maintains the iteration state:

- Dynamic batch size at runtime.
- Custom stopping criteria.
- Sort samples of a batch, which is useful for learning text representations with RNNs in PyTorch.
- Persistable.

Persistable gives a class a PyTorch-like interface to dump and load instance state, which is useful for resuming when the training process is aborted unexpectedly.

Requirements

- NumPy
- overrides
- typings
- Python 3.x

Lunas relies on very few third-party libraries. The required libraries are there mainly to take advantage of the type hint feature provided by Python 3.

Installation

You can install Lunas with pip:

```
pip install lunas
```

Example

Lunas exposes minimal interfaces to the user so as to make it as simple as possible. We try to avoid adding any unnecessary features to keep it light-weight.

However, you can still extend this library at any time to handle arbitrary data types such as text, images, and audio.

Create a dataset reader and iterate through it.
```
from lunas.readers import Range

ds = Range(10)
for sample in ds:
    print(sample)
for sample in ds:
    print(sample)
```
- We create a dataset similar to range(10) and iterate through it. Each for loop is one epoch, and as the second loop shows, the dataset can be iterated several times.
Build a data processing pipeline.
```
ds = Range(10).select(lambda x: x + 1).select(lambda x: x * 2).where(lambda x: x % 2 == 0)
```
- We call Reader.select(fn) to define a processing step for the dataset.
- select() returns the dataset itself to enable chained calls. You can apply any transformation to the dataset and return a sample of any type, such as a Dict, a List, or a custom Sample.
- where() accepts a predicate that returns a bool for each input sample. If it returns True, the sample is kept; otherwise it is discarded.
- Note that the processing is not executed immediately; it is performed when iterating through ds.
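
The deferred execution described above can be illustrated with a small stand-in class in plain Python. This is only a sketch of the pattern, not the Lunas implementation: `LazyPipeline` and `add_one` are hypothetical names. Each call to select() or where() only records a step; the steps run when the object is iterated.

```python
from typing import Any, Callable, Iterable, Iterator, List, Tuple


class LazyPipeline:
    """Hypothetical stand-in for a reader; not the Lunas implementation."""

    def __init__(self, source: Iterable[Any]) -> None:
        self._source = source
        self._steps: List[Tuple[str, Callable[[Any], Any]]] = []

    def select(self, fn: Callable[[Any], Any]) -> "LazyPipeline":
        # Record the transformation; nothing is computed yet.
        self._steps.append(("select", fn))
        return self

    def where(self, predicate: Callable[[Any], bool]) -> "LazyPipeline":
        # Record the filter; nothing is computed yet.
        self._steps.append(("where", predicate))
        return self

    def __iter__(self) -> Iterator[Any]:
        for sample in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "select":
                    sample = fn(sample)
                elif not fn(sample):
                    keep = False
                    break
            if keep:
                yield sample


calls = []


def add_one(x):
    calls.append(x)  # lets us observe when the step actually runs
    return x + 1


ds = LazyPipeline(range(5)).select(add_one).where(lambda x: x % 2 == 0)
calls_before_iteration = list(calls)  # still empty: nothing has run yet
first_pass = list(ds)                 # [2, 4]
second_pass = list(ds)                # [2, 4] again: the pipeline is reusable
```

Because the steps are stored rather than applied, building the pipeline is cheap and every new pass over the dataset re-runs them from the start.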
Deal with multiple input sources.
```
from lunas.readers import Range, Zip, Shuffle

ds1 = Range(10)
ds2 = Range(10)
ds = Zip(ds1, ds2).select(lambda x: x[0] + x[1])
ds = Shuffle(ds)
```
- In the above code, we create two datasets and combine them with a Zip reader. A Zip reader yields a tuple containing one sample from each of its internal readers.
- Shuffle performs randomized shuffling on the dataset.

A practical use case in a machine translation scenario.

```
from lunas.readers import TextLine, Zip, Shuffle
from lunas.iterator import Iterator

# Tokenize the input into a list of tokens.
tokenize = lambda line: line.split()
# Keep only sentence pairs whose longer side has at most 50 tokens.
limit = lambda src_tgt: max(map(len, src_tgt)) <= 50
# Map word to id.
word2id = lambda src_tgt: ...

source = TextLine('train.fr').select(tokenize)
target = TextLine('train.en').select(tokenize)
ds = Zip(source, target).where(limit)
# Shuffle the zipped dataset, then map words to ids.
ds = Shuffle(ds).select(word2id)

# Take the maximum length of the sentence pair as the sample size.
sample_size = lambda x: max(map(len, x))
# Convert a list of samples to model inputs.
collate_fn = lambda x: ...
# Sort samples in a batch by source text length.
sort_key = lambda x: len(x[0])

it = Iterator(ds, batch_size=4096, cache_size=40960, sample_size_fn=sample_size,
              collate_fn=collate_fn, sort_desc_by=sort_key)

# Iterate for at most 100 epochs and 1000000 steps.
for batch in it.while_true(lambda: it.epoch < 100 and it.step < 1e6):
    print(it.epoch, it.step, it.step_in_epoch, batch)
```

This code should be simple enough to understand, even if you are not familiar with machine translation.
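
To show how a size-based batch budget and per-batch sorting can interact, here is a minimal sketch in plain Python. It assumes that batch_size is a budget on the sum of per-sample sizes (for example, tokens) rather than a sample count; that reading is an assumption, and `batch_by_size` is a hypothetical helper, not the Lunas Iterator.

```python
from typing import Any, Callable, Iterable, Iterator, List, Optional


def batch_by_size(samples: Iterable[Any], batch_size: int,
                  sample_size_fn: Callable[[Any], int],
                  sort_desc_by: Optional[Callable[[Any], Any]] = None
                  ) -> Iterator[List[Any]]:
    """Group samples so that the summed sample sizes stay within batch_size.

    Hypothetical helper, not part of Lunas. A single sample larger than
    the budget is still emitted, as a batch of its own.
    """

    def finish(batch: List[Any]) -> List[Any]:
        if sort_desc_by is None:
            return batch
        return sorted(batch, key=sort_desc_by, reverse=True)

    batch: List[Any] = []
    total = 0
    for sample in samples:
        size = sample_size_fn(sample)
        if batch and total + size > batch_size:
            # Adding this sample would exceed the budget: emit the batch.
            yield finish(batch)
            batch, total = [], 0
        batch.append(sample)
        total += size
    if batch:
        yield finish(batch)


# Toy (source tokens, target tokens) pairs.
pairs = [
    (["a"], ["x", "y"]),
    (["a", "b", "c"], ["x"]),
    (["a", "b"], ["x", "y"]),
    (["a", "b", "c", "d"], ["x"]),
]
pair_size = lambda pair: max(map(len, pair))
batches = list(batch_by_size(pairs, batch_size=5, sample_size_fn=pair_size,
                             sort_desc_by=lambda pair: len(pair[0])))
```

Here the number of samples per batch varies with sentence length, and each batch is ordered from the longest source sentence to the shortest.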

Save and reload iteration state.

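```
import pickle

# Save the iteration state; the with block closes the file afterwards.
with open('state.pkl', 'wb') as f:
    pickle.dump(it.state_dict(), f)
# ...
with open('state.pkl', 'rb') as f:
    state = pickle.load(f)
it.load_state_dict(state)
```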

state_dict() returns a picklable dictionary, which can be loaded by it.load_state_dict() to resume the iteration process later.
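
The state_dict() / load_state_dict() pattern can be sketched with a toy class in plain Python. `CountingIterator` is hypothetical and much simpler than the Lunas iterator; it only shows how storing a few counters in a picklable dictionary is enough to resume from the same position.

```python
import pickle
from typing import Any, Dict, Sequence


class CountingIterator:
    """Hypothetical endless iterator with resumable state; not the Lunas class."""

    def __init__(self, data: Sequence[Any]) -> None:
        if len(data) == 0:
            raise ValueError("data must not be empty")
        self._data = list(data)
        self.epoch = 0
        self.step_in_epoch = 0

    def __iter__(self) -> "CountingIterator":
        return self

    def __next__(self) -> Any:
        if self.step_in_epoch == len(self._data):
            # The previous pass is finished: start a new epoch.
            self.epoch += 1
            self.step_in_epoch = 0
        item = self._data[self.step_in_epoch]
        self.step_in_epoch += 1
        return item

    def state_dict(self) -> Dict[str, int]:
        # Only plain integers are stored, so the dictionary is picklable.
        return {"epoch": self.epoch, "step_in_epoch": self.step_in_epoch}

    def load_state_dict(self, state: Dict[str, int]) -> None:
        self.epoch = state["epoch"]
        self.step_in_epoch = state["step_in_epoch"]


original = CountingIterator("abc")
consumed = [next(original) for _ in range(4)]   # 'a', 'b', 'c', 'a'
blob = pickle.dumps(original.state_dict())

# Later, in a fresh process: rebuild the iterator and restore its position.
resumed = CountingIterator("abc")
resumed.load_state_dict(pickle.loads(blob))
next_item = next(resumed)                       # 'b'
```

The dataset itself is not stored in the state; it is rebuilt from the same source, and only the position within it is restored.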

Extend the reader.
- You can refer to the implementation of the TextLine reader to write your own data reader.

Conclusions

Please feel free to contact me if you have any questions or find any bugs in Lunas.


Release history

0.5.1

Jan 13, 2022

0.5.0

Jan 12, 2022

0.4.2

Jan 12, 2022

0.4.2a0 pre-release

Jan 12, 2022

0.4.1

Jan 6, 2022

0.4.0

Aug 28, 2020

0.3.9

Jul 29, 2020

0.3.8

May 31, 2020

0.3.7

Oct 27, 2019

0.3.6

Oct 25, 2019

0.3.5

Sep 7, 2019

0.3.4

Apr 2, 2019

0.3.3

Apr 2, 2019

0.3.2

Mar 25, 2019

0.3.1

Mar 1, 2019

0.3.0

Feb 18, 2019

0.2.8

Feb 5, 2019

0.2.7

Jan 31, 2019

0.2.6

Jan 29, 2019

0.2.5

Jan 28, 2019

0.2.4

Jan 28, 2019

0.2.3

Jan 28, 2019

This version

0.2.2

Jan 23, 2019

0.2.1

Jan 23, 2019

0.2.0

Jan 16, 2019

0.1.9

Jan 11, 2019

0.1.8

Jan 11, 2019

0.1.7

Jan 11, 2019

0.1.6

Jan 10, 2019

0.1.5

Jan 10, 2019

0.1.4

Jan 10, 2019

0.1.3

Jan 9, 2019

0.1.2

Jan 9, 2019

0.1.1

Jan 4, 2019

0.1.0

Jan 3, 2019

Download files

Download the file for your platform.

Source Distribution

Lunas-0.2.2.tar.gz (11.5 kB view details)

Uploaded Jan 23, 2019 Source

Built Distribution

Lunas-0.2.2-py3-none-any.whl (14.8 kB view details)

Uploaded Jan 23, 2019 Python 3

File details

Details for the file Lunas-0.2.2.tar.gz.

File metadata

Download URL: Lunas-0.2.2.tar.gz
Upload date: Jan 23, 2019
Size: 11.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.21.0 setuptools/40.6.3 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.6.8

File hashes

Hashes for Lunas-0.2.2.tar.gz
Algorithm	Hash digest
SHA256	`5003de7c72af309bec41b06ac5ac7905eab9de2833e64496fc1e45bd191e0d2d`
MD5	`3c560e9912b31958c95eb9f982156219`
BLAKE2b-256	`06b6e0c58cbf47772cd79613b26c4ed40c2d19127f9ef8479a33fe15e4815657`


File details

Details for the file Lunas-0.2.2-py3-none-any.whl.

File metadata

Download URL: Lunas-0.2.2-py3-none-any.whl
Upload date: Jan 23, 2019
Size: 14.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.21.0 setuptools/40.6.3 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.6.8

File hashes

Hashes for Lunas-0.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c2e70f7a356a43e23521729d7133fdb74655eaa773c0d2097346a4ce1242a187`
MD5	`b56d2c2a3164b1bfb27dfb88a53c5b70`
BLAKE2b-256	`d956c1d41a715a87df436a68401715b4a5244947b7062cdc76368b34711d9f1a`

