Lineflow: Framework-Agnostic NLP Data Loader in Python
Lineflow is a simple text dataset loader for NLP deep learning tasks.
- Lineflow is designed to be used with any deep learning framework.
- Lineflow enables you to build pipelines.
- Lineflow supports a functional API and lazy evaluation.
Lineflow is heavily inspired by tensorflow.data.Dataset and chainer.dataset.
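To make the idea of lazy evaluation concrete, here is a minimal pure-Python sketch. It is not Lineflow's implementation: the `LazyMapped` class and the `tokenize` function are hypothetical names introduced only for illustration. The point is that the mapped function runs when an item is accessed, not when the mapping is declared.

```python
class LazyMapped:
    """Hypothetical helper: applies `func` only when an item is accessed."""

    def __init__(self, data, func):
        self._data = data
        self._func = func

    def __getitem__(self, index):
        # Slices are transformed item by item; single indices call func once.
        if isinstance(index, slice):
            return [self._func(x) for x in self._data[index]]
        return self._func(self._data[index])

    def __len__(self):
        # The length is known without transforming anything.
        return len(self._data)


calls = []  # records every line that was actually tokenized


def tokenize(line):
    calls.append(line)
    return line.split()


lines = ["i 'm a line 1 .", "i 'm a line 2 .", "i 'm a line 3 ."]
mapped = LazyMapped(lines, tokenize)  # nothing is tokenized yet
first = mapped[0]                     # only the first line is tokenized here
```

After this runs, `calls` holds a single line, which shows that declaring the mapping did no work on the other two lines.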
Installation
To install Lineflow:
pip install lineflow
Basic Usage
lineflow.TextDataset expects line-oriented text files:
import lineflow as lf
'''/path/to/text is expected to look like this:
i 'm a line 1 .
i 'm a line 2 .
i 'm a line 3 .
'''
ds = lf.TextDataset('/path/to/text')
ds.first() # "i 'm a line 1 ."
ds.all() # ["i 'm a line 1 .", "i 'm a line 2 .", "i 'm a line 3 ."]
len(ds) # 3
ds.map(lambda x: x.split()).first() # ["i", "'m", "a", "line", "1", "."]
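To try the snippet above without an existing corpus, one option is to write the three sample lines to a temporary file first. The sketch below uses only the standard library; the variable names `sample_lines`, `path`, and `read_back` are illustrative and not part of Lineflow.

```python
import os
import tempfile

sample_lines = ["i 'm a line 1 .", "i 'm a line 2 .", "i 'm a line 3 ."]

# Write one sentence per line, the layout the snippet above expects.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as f:
    f.write("\n".join(sample_lines) + "\n")
    path = f.name

# Read the file back to confirm it is line-oriented.
with open(path, encoding="utf-8") as f:
    read_back = [line.rstrip("\n") for line in f]

# Clean up. To experiment with Lineflow, use `path` before this line.
os.remove(path)
```

With Lineflow installed, `path` can be passed to `lf.TextDataset` in place of '/path/to/text' before the file is removed.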
Example
- See examples/small_parallel_enja_pytorch.py for how to tokenize sentences, build a vocabulary, and convert tokens to indices.
- Also check out the other examples to see how to use Lineflow.
Load the predefined dataset:
>>> import lineflow.datasets as lfds
>>> train = lfds.SmallParallelEnJa('train')
>>> train.first()
("i can 't tell who will arrive first .", '誰 が 一番 に 着 く か 私 に は 分か り ま せ ん 。')
Split each sentence into words:
>>> # continuing from above
>>> train = train.map(lambda x: (x[0].split(), x[1].split()))
>>> train.first()
(['i', 'can', "'t", 'tell', 'who', 'will', 'arrive', 'first', '.'],
['誰', 'が', '一番', 'に', '着', 'く', 'か', '私', 'に', 'は', '分か', 'り', 'ま', 'せ', 'ん', '。'])
Collect the words in the dataset:
>>> # continuing from above
>>> import lineflow as lf
>>> en_tokens = lf.flat_map(lambda x: x[0], train)
>>> en_tokens[:5] # This is useful to build vocabulary.
['i', 'can', "'t", 'tell', 'who']
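A flat token list like this can feed a vocabulary. The sketch below is one possible approach using only the standard library, not the method used in the Lineflow examples. The toy `en_tokens` list stands in for the real flattened dataset, and the special tokens `<pad>` and `<unk>` as well as the `encode` helper are assumptions made for illustration.

```python
from collections import Counter

# Toy stand-in for the flattened token list (tokens of the first sentence).
en_tokens = ['i', 'can', "'t", 'tell', 'who', 'will', 'arrive', 'first', '.']

# Count token frequencies so frequent tokens receive small indices.
counts = Counter(en_tokens)

# Assumed special tokens: padding and out-of-vocabulary placeholders.
specials = ['<pad>', '<unk>']
vocab = specials + [token for token, _ in counts.most_common()]
token_to_id = {token: index for index, token in enumerate(vocab)}
unk_id = token_to_id['<unk>']


def encode(tokens):
    """Map tokens to indices, falling back to <unk> for unseen tokens."""
    return [token_to_id.get(token, unk_id) for token in tokens]


ids = encode(['i', 'can', 'swim'])  # 'swim' is unseen and maps to <unk>
```

In a pipeline, a function like `encode` could be applied to the tokenized dataset with `map`, in the same way `split` is applied above.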
Datasets
CNN / Daily Mail:
import lineflow.datasets as lfds
train = lfds.CnnDailymail('train')
dev = lfds.CnnDailymail('dev')
test = lfds.CnnDailymail('test')
IMDB:
import lineflow.datasets as lfds
train = lfds.Imdb('train')
test = lfds.Imdb('test')
Microsoft Research Paraphrase Corpus:
import lineflow.datasets as lfds
train = lfds.MsrParaphrase('train')
test = lfds.MsrParaphrase('test')
Small Parallel Enja:
import lineflow.datasets as lfds
train = lfds.SmallParallelEnJa('train')
dev = lfds.SmallParallelEnJa('dev')
test = lfds.SmallParallelEnJa('test')
SQuAD:
import lineflow.datasets as lfds
train = lfds.Squad('train')
dev = lfds.Squad('dev')
WikiText-2 (added by @sobamchan, thanks):
import lineflow.datasets as lfds
train = lfds.WikiText2('train')
dev = lfds.WikiText2('dev')
test = lfds.WikiText2('test')