Lineflow: Framework-Agnostic NLP Data Loader in Python

Lineflow is a simple text dataset loader for NLP deep learning tasks.

Lineflow is designed to be used with any deep learning framework.
Lineflow enables you to build data pipelines.
Lineflow supports a functional API and lazy evaluation.

Lineflow is heavily inspired by tensorflow.data.Dataset and chainer.dataset.
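
To make "pipelines", "functional API", and "lazy evaluation" concrete, here is a minimal pure-Python sketch of the idea. It does not use Lineflow and is not Lineflow's implementation; `LazyMapped` and `tokenize` are hypothetical names introduced only for illustration. The point is that mapping builds a new view, and the function runs only when an item is actually read.

```python
class LazyMapped:
    """Hypothetical lazy view: applies `func` to items of `base` on access."""

    def __init__(self, base, func):
        self._base = base
        self._func = func

    def __len__(self):
        return len(self._base)

    def __getitem__(self, index):
        # The function runs here, at access time, not when the view is built.
        # Only integer indices are handled in this sketch.
        return self._func(self._base[index])

    def map(self, func):
        # Chaining wraps this view in another view, forming a pipeline.
        return LazyMapped(self, func)


calls = []


def tokenize(line):
    calls.append(line)  # record each call so laziness is observable
    return line.split()


lines = ["i 'm a line 1 .", "i 'm a line 2 .", "i 'm a line 3 ."]
pipeline = LazyMapped(lines, tokenize).map(len)

calls_before_access = len(calls)  # nothing has been tokenized yet
first_length = pipeline[0]        # tokenizes only the first line
calls_after_access = len(calls)
```

Because the sketch relies only on `__len__` and `__getitem__`, any framework that accepts a plain indexable sequence could consume such a view, which is one way to read "framework-agnostic".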

Installation

To install Lineflow:

pip install lineflow

Basic Usage

lineflow.TextDataset expects line-oriented text files:

import lineflow as lf


'''/path/to/text is expected to look like this:
i 'm a line 1 .
i 'm a line 2 .
i 'm a line 3 .
'''
ds = lf.TextDataset('/path/to/text')

ds.first()  # "i 'm a line 1 ."
ds.all() # ["i 'm a line 1 .", "i 'm a line 2 .", "i 'm a line 3 ."]
len(ds)  # 3
ds.map(lambda x: x.split()).first()  # ["i", "'m", "a", "line", "1", "."]
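
To try the snippet above without a real corpus, the following sketch writes the three sample lines to a temporary file using only the standard library. The plain-Python equivalents are shown for comparison and are an assumption about what the calls above return, based on the comments in that snippet; `path` stands in for '/path/to/text'.

```python
import os
import tempfile

sample = "i 'm a line 1 .\ni 'm a line 2 .\ni 'm a line 3 .\n"

# Write the sample to a temporary file; `path` stands in for '/path/to/text'.
with tempfile.NamedTemporaryFile(
    'w', suffix='.txt', delete=False, encoding='utf-8'
) as f:
    f.write(sample)
    path = f.name

# Plain-Python equivalents of the calls above, for comparison.
with open(path, encoding='utf-8') as f:
    lines = [line.rstrip('\n') for line in f]

first = lines[0]              # compare with ds.first()
count = len(lines)            # compare with len(ds)
first_tokens = first.split()  # compare with ds.map(lambda x: x.split()).first()

os.remove(path)  # clean up the temporary file
```

With Lineflow installed, passing `path` to `lf.TextDataset` before the cleanup step should give a dataset over the same three lines.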

Example

See examples/small_parallel_enja_pytorch.py for how to tokenize sentences, build a vocabulary, and index tokens.
The other examples show further ways to use Lineflow.

Load the predefined dataset:

>>> import lineflow.datasets as lfds
>>> train = lfds.SmallParallelEnJa('train')
>>> train.first()
("i can 't tell who will arrive first .", '誰 が 一番 に 着 く か 私 に は 分か り ま せ ん 。')

Split each sentence into words:

>>> # continuing from above
>>> train = train.map(lambda x: (x[0].split(), x[1].split()))
>>> train.first()
(['i', 'can', "'t", 'tell', 'who', 'will', 'arrive', 'first', '.'],
 ['誰', 'が', '一番', 'に', '着', 'く', 'か', '私', 'に', 'は', '分か', 'り', 'ま', 'せ', 'ん', '。'])

Collect the words in the dataset:

>>> # continuing from above
>>> import lineflow as lf
>>> en_tokens = lf.flat_map(lambda x: x[0], train)
>>> en_tokens[:5] # This is useful to build vocabulary.
['i', 'can', "'t", 'tell', 'who']
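
As a rough illustration of the vocabulary and indexing steps mentioned above, here is a pure-Python sketch that does not depend on Lineflow. The flattening uses `itertools.chain` as a stand-in for the `flat_map` call; the special tokens `<pad>` and `<unk>`, their ids, the helper `to_ids`, and the second sentence pair are all assumptions made for the example, not part of the dataset or the library.

```python
from collections import Counter
from itertools import chain

# Stand-in for the tokenized dataset: (English, Japanese) token pairs.
# The first pair is the sample shown above; the second is made up.
pairs = [
    (['i', 'can', "'t", 'tell', 'who', 'will', 'arrive', 'first', '.'],
     ['誰', 'が', '一番', 'に', '着', 'く', 'か', '私', 'に', 'は',
      '分か', 'り', 'ま', 'せ', 'ん', '。']),
    (['i', 'can', "'t", 'tell', '.'],
     ['私', 'に', 'は', '分か', 'り', 'ま', 'せ', 'ん', '。']),
]

# Flatten the English side into one token stream.
en_tokens = list(chain.from_iterable(en for en, _ in pairs))

# Hypothetical special tokens; the names and ids are assumptions.
PAD, UNK = '<pad>', '<unk>'

counts = Counter(en_tokens)
# Most frequent words first, after the special tokens.
vocab = [PAD, UNK] + [word for word, _ in counts.most_common()]
token_to_id = {word: i for i, word in enumerate(vocab)}


def to_ids(tokens):
    """Map tokens to ids, falling back to the UNK id for unseen words."""
    unk_id = token_to_id[UNK]
    return [token_to_id.get(token, unk_id) for token in tokens]


ids = to_ids(['i', 'can', 'fly', '.'])  # 'fly' is not in the vocabulary
```

In a real pipeline the same `to_ids` function could be passed to `map` so that indexing happens lazily per example.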

Datasets

CNN / Daily Mail:

import lineflow.datasets as lfds

train = lfds.CnnDailymail('train')
dev = lfds.CnnDailymail('dev')
test = lfds.CnnDailymail('test')

IMDB:

import lineflow.datasets as lfds

train = lfds.Imdb('train')
test = lfds.Imdb('test')

Microsoft Research Paraphrase Corpus:

import lineflow.datasets as lfds

train = lfds.MsrParaphrase('train')
test = lfds.MsrParaphrase('test')

small_parallel_enja:

import lineflow.datasets as lfds

train = lfds.SmallParallelEnJa('train')
dev = lfds.SmallParallelEnJa('dev')
test = lfds.SmallParallelEnJa('test')

SQuAD:

import lineflow.datasets as lfds

train = lfds.Squad('train')
dev = lfds.Squad('dev')

WikiText-2 (added by @sobamchan, thanks):

import lineflow.datasets as lfds

train = lfds.WikiText2('train')
dev = lfds.WikiText2('dev')
test = lfds.WikiText2('test')

Project details

This version: 0.4.0 (released Jul 14, 2019)
