A data processing pipeline and iterator for machine learning.
Lunas
Lunas is a Python-based library that mimics TensorFlow's dataset API to provide data processing pipelines and data feeding utilities.
The implementation mostly draws on TensorFlow, but in a simplified, pure-Python fashion.
Overview
A Dataset represents a dataset and holds the corresponding pre-processing and filtering operations.
Currently the following features are supported:
- Dataset.repeat()
- Dataset.interleave()
- Dataset.shard()
- Dataset.shuffle()
- Dataset.sort()
- Dataset.take()
- Dataset.skip()
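For illustration, here is a minimal sketch chaining a few of these ops. Argument names follow the usages shown in the examples later on this page; the count argument of take(3) is an assumption.

```python
from lunas import Range

# Keep shard 0 of 2 shards, shuffle with a small buffer,
# then take the first 3 samples (count argument assumed).
ds = Range(100).shard(2, 0).shuffle(buffer_size=10).take(3)

for x in ds:
    print(x)
```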
Datasets
- Zip: Zips multiple datasets.
- Concat: Concatenates two datasets.
- Range: Behaves similarly to the builtin range().
- Enumerate: Similar to the builtin enumerate().
- Array: Represents an array as a dataset. The array can be an iterable, an iterator or a numpy array.
- Glob: Uses Python's glob module to match potentially multiple files and wraps them into a dataset.
- Stdin: A dataset that reads from standard input.
- TextLine: Represents a plain-text file.
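As a quick sketch of a few of these wrappers; constructor arguments beyond those appearing in the examples below are assumptions, so check the source for the exact signatures:

```python
from lunas import Array, Enumerate, Glob, Stdin

# Wrap a plain Python list; an iterator or a numpy array
# would work the same way, per the description above.
ds = Array([3, 1, 4, 1, 5])

# Pair each sample with its index, like the builtin enumerate()
# (wrapping a dataset this way is an assumption).
ds = Enumerate(ds)

# Match potentially multiple files with a glob pattern
# (hypothetical path).
files = Glob('data/*.txt')

# Read samples from standard input (no-argument constructor assumed).
stdin = Stdin()
```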
Iterator
An Iterator generates batches by iterating through the dataset.
Currently supported batching schemes include:
- SimpleIterator: Batches according to the number of samples.
- BucketIterator: Arranges samples of similar length together. It is usually used to reduce redundant computation and thus maximize GPU utilization.
These iterators can work with PyTorch's DataLoader to exploit parallelism for data processing. A convenient DataLoader
is provided for this purpose.
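A minimal sketch of that pairing, restricted to the arguments that appear in the distributed example later on this page; collate_fn here is a user-supplied placeholder:

```python
from lunas import Range, SimpleIterator, DataLoader

ds = Range(1000).shuffle(buffer_size=100)
it = SimpleIterator(ds, batch_size=32, drop_tail=False)

def collate_fn(samples):
    # Placeholder collate function: keep the batch as a list.
    return list(samples)

loader = DataLoader(it, batch_size=32, num_workers=2,
                    collate_fn=collate_fn, pin_memory=False)

for batch in loader:
    ...  # feed the batch to the model
```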
Requirements
- setuptools
- typings
- numpy
- Python >= 3.7
Installation
Install using pip:
```
pip install lunas
```
Examples
Create a dataset and iterate through it

```python
from lunas import Range

ds = Range(10).shuffle(buffer_size=5)

for x in ds:  # epoch 1
    print(x)
for x in ds:  # epoch 2
    print(x)
```
- A Range dataset is created similarly to range(10) and then iterated through for one epoch. As you can see, we can scan through it several times.
- Instead of iterating twice, you can also use ds.repeat(2) to create another dataset that does the same work, as sketched below.
- Dataset.shuffle() performs buffered shuffling, which uses a queue under the hood.
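A minimal sketch of that repeat variant:

```python
from lunas import Range

# One pass over this dataset yields two passes over range(5).
ds = Range(5).repeat(2)
for x in ds:
    print(x)  # expected: 0..4, twice
```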
Build a data processing pipeline

```python
ds = Range(10).select(lambda x: x * 2).select(lambda x: x + 1).where(lambda x: x % 2 == 0)
```
- The chained calls on a Dataset object define a processing pipeline on the dataset.
- select(fn) lazily applies a transformation to every dataset element. The argument fn is a custom mapping function that takes a single sample as input and returns a transformed sample. You can apply any transformation to a dataset and return a sample of any type, e.g., a Dict, a List or a custom Sample, as sketched below.
- where(fn) accepts a predicate that returns a bool value to filter an input sample. If it returns True, the sample is preserved; otherwise it is discarded.
- Both select(fn) and where(fn) return the dataset itself, just to enable chaining-style invocations. The mapping and filtering ops are attached to the dataset immediately, in an in-place fashion.
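For example, a small sketch where select(fn) returns a Dict sample and where(fn) filters on it:

```python
from lunas import Range

# Map each integer to a dict-typed sample, then keep only
# samples whose square is even.
ds = Range(10).select(lambda x: {'value': x, 'square': x * x})
ds = ds.where(lambda d: d['square'] % 2 == 0)

for sample in ds:
    print(sample['value'], sample['square'])
```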
Deal with multiple input sources

```python
from lunas import Range, Zip

ds1 = Range(10)
ds2 = Range(start=10, stop=20, step=1)
ds = Zip([ds1, ds2]).select(lambda x, y: x + y)
```
- We create two datasets and pack them as a Zip dataset. A Zip dataset returns a tuple from its internal datasets in the given order.
- Zip also allows zipping multiple datasets of different sizes by providing additional mode and padding arguments to indicate how the shorter datasets should be padded.
Practical use case in a distributed Machine Translation training scenario

```python
from lunas import *

X = TextLine('train.fr').select(lambda x: x.split())
Y = TextLine('train.en').select(lambda x: x.split())

# Each worker holds a mutually different shard of the original dataset.
# This should be done before shuffling to avoid unnecessary shuffling
# effort in each worker.
ds = Zip(X, Y).shard(dist_world_size, dist_local_rank)

# Construct a sample from the dataset.
ds = ds.select(lambda x, y: {
    'x': vocab_s.lookup(x),
    'y': vocab_t.lookup(y),
    'size_x': len(x),
    'size_y': len(y),
})

ds = ds.shuffle(100000)
# Repeat endlessly to ensure different workers receive the same number
# of batches and stay in sync.
ds = ds.repeat()

it = SimpleIterator(ds, batch_size=128, drop_tail=False)
dataloader = DataLoader(it,
                        batch_size=4096,
                        num_workers=6,
                        collate_fn=collate_fn,
                        pin_memory=True)

it = iter(dataloader)
for _ in range(max_steps):
    batch = cuda(next(it))
    ...
```
- It doesn't matter if you are not familiar with the machine translation task, since this code should be simple enough to explain itself.
Continuable iteration

```python
import pickle

pickle.dump(it.state_dict(), open('state.pkl', 'wb'))
# ...
state = pickle.load(open('state.pkl', 'rb'))
it.load_state_dict(state)

r = Range(5)
# outputs: [0, 1, 2]
for i, x in enumerate(r):
    print(x)
    if i == 2:
        break
# outputs: [3, 4]
for i, x in enumerate(r):
    print(x)
```
- state_dict() returns a picklable dictionary, which can be loaded by it.load_state_dict() to resume the iteration process later.
- lunas provides limited support for resumable iteration for Dataset, Iterator and DataLoader. The iteration state is maintained by an iteration pointer in Dataset. For classes that manage the dataset by internal buffering, including Shuffle, Sort, BucketIterator and DataLoader, this does NOT recover the dataset elements in the buffer.
Extend the dataset

- You can refer to the implementation of TextLine to customize your own dataset.
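As a purely hypothetical illustration of the idea (the actual hooks to override are defined by lunas's Dataset base class, so the generator method name below is an assumption; mirror TextLine's source instead):

```python
from lunas import Dataset  # assuming Dataset is exported; see the source

class CsvLine(Dataset):
    """Hypothetical dataset yielding comma-split lines of a CSV file."""

    def __init__(self, filename):
        super().__init__()
        self.filename = filename

    def generator(self):  # assumed hook name; check TextLine for the real one
        with open(self.filename) as f:
            for line in f:
                yield line.rstrip('\n').split(',')
```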
Conclusions
Please feel free to get in touch if you have any questions or find any bugs in Lunas.