Lineflow: Framework-Agnostic NLP Data Loader in Python
Lineflow is a simple text dataset loader for NLP deep learning tasks.
- Lineflow is designed to be used with any deep learning framework.
- Lineflow enables you to build data pipelines.
- Lineflow supports a functional API and lazy evaluation.
Lineflow is heavily inspired by tensorflow.data.Dataset and chainer.dataset.
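To make "lazy evaluation" concrete, here is a minimal plain-Python sketch of the idea. It does not use Lineflow and is not how Lineflow is implemented; the class name `LazyMapped` and the helper `tokenize` are hypothetical and exist only for this illustration. The point is that a mapped function runs when an item is requested, not when the pipeline is built.

```python
from typing import Any, Callable, List, Sequence


class LazyMapped:
    """Hypothetical stand-in for a mapped dataset; not Lineflow's own class."""

    def __init__(self, data: Sequence[Any], func: Callable[[Any], Any]) -> None:
        self._data = data
        self._func = func

    def __len__(self) -> int:
        return len(self._data)

    def __getitem__(self, index: int) -> Any:
        # The function runs only when an item is actually requested.
        return self._func(self._data[index])

    def map(self, func: Callable[[Any], Any]) -> "LazyMapped":
        # Chaining wraps the current object; no item is processed here.
        return LazyMapped(self, func)


calls: List[str] = []


def tokenize(line: str) -> List[str]:
    calls.append(line)  # record each invocation to make the laziness visible
    return line.split()


lines = ["i 'm a line 1 .", "i 'm a line 2 .", "i 'm a line 3 ."]
tokens = LazyMapped(lines, tokenize)  # nothing is tokenized yet
lengths = tokens.map(len)             # still nothing is tokenized
first_length = lengths[0]             # tokenizes only the first line
```

After the last statement, `calls` holds a single entry, because only the first line was ever passed through `tokenize`.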
Installation
To install Lineflow:
pip install lineflow
Basic Usage
lineflow.TextDataset expects line-oriented text files:
import lineflow as lf
'''/path/to/text is expected to look like this:
i 'm a line 1 .
i 'm a line 2 .
i 'm a line 3 .
'''
ds = lf.TextDataset('/path/to/text')
ds.first() # "i 'm a line 1 ."
ds.all() # ["i 'm a line 1 .", "i 'm a line 2 .", "i 'm a line 3 ."]
len(ds) # 3
ds.map(lambda x: x.split()).first() # ["i", "'m", "a", "line", "1", "."]
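To try the snippet above without an existing corpus, a small line-oriented file can be created first. The sketch below uses only the standard library; the temporary file location is arbitrary, and the plain-Python read at the end is just a way to check the file's contents, not a description of what Lineflow does internally.

```python
import os
import tempfile

sample = "i 'm a line 1 .\ni 'm a line 2 .\ni 'm a line 3 .\n"

# Write the three sample lines to a temporary file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as f:
    f.write(sample)
    path = f.name

# Read the file back, one item per line with the trailing newline stripped.
with open(path, encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

# Remove the file once you are finished with it.
os.remove(path)
```

Before the file is removed, `path` can be passed to lf.TextDataset in place of '/path/to/text'.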
Example
- Please check out the examples to see how to use Lineflow, especially for tokenization, building vocabulary, and indexing.
Load Penn Treebank:
>>> import lineflow.datasets as lfds
>>> train = lfds.PennTreebank('train')
>>> train.first()
' aer banknote berlitz calloway centrust cluett fromstein gitano guterman hydro-quebec ipo kia memotec mlx nahb punts rake regatta rubens sim snack-food ssangyong swapo wachter '
Split each sentence into words:
>>> # continuing from above
>>> train = train.map(str.split)
>>> train.first()
['aer', 'banknote', 'berlitz', 'calloway', 'centrust', 'cluett', 'fromstein', 'gitano', 'guterman', 'hydro-quebec', 'ipo', 'kia', 'memotec', 'mlx', 'nahb', 'punts', 'rake', 'regatta', 'rubens', 'sim', 'snack-food', 'ssangyong', 'swapo', 'wachter']
Obtain the words in the dataset:
>>> # continuing from above
>>> words = train.flat_map(lambda x: x)
>>> words.take(5) # This is useful to build vocabulary.
['aer', 'banknote', 'berlitz', 'calloway', 'centrust']
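As a rough illustration of the vocabulary-building and indexing steps mentioned above, here is a plain-Python sketch. It is not part of Lineflow: `build_vocab`, `to_ids`, the `min_count` parameter, and the `<unk>` placeholder are hypothetical names chosen for this example, and the input is a small hand-written list rather than the real Penn Treebank data.

```python
from collections import Counter
from typing import Dict, Iterable, List

UNK = "<unk>"  # hypothetical placeholder for out-of-vocabulary words


def build_vocab(words: Iterable[str], min_count: int = 1) -> Dict[str, int]:
    """Map each sufficiently frequent word to an integer id."""
    counts = Counter(words)
    vocab = {UNK: 0}
    # most_common() yields words in order of descending frequency.
    for word, count in counts.most_common():
        if count >= min_count and word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def to_ids(sentence: List[str], vocab: Dict[str, int]) -> List[int]:
    """Replace each word by its id, falling back to the UNK id."""
    return [vocab.get(word, vocab[UNK]) for word in sentence]


sentences = [["aer", "banknote", "aer"], ["banknote", "aer", "kia"]]
# Plain-Python equivalent of flattening sentences into a stream of words.
words = [word for sentence in sentences for word in sentence]

vocab = build_vocab(words)
ids = to_ids(["aer", "kia", "unseen"], vocab)
```

`build_vocab` accepts any iterable of strings, so a flattened word collection like the one produced by flat_map above could be passed in, assuming it can be iterated over.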
Datasets
CNN / Daily Mail:
import lineflow.datasets as lfds
train = lfds.CnnDailymail('train')
dev = lfds.CnnDailymail('dev')
test = lfds.CnnDailymail('test')
IMDB:
import lineflow.datasets as lfds
train = lfds.Imdb('train')
test = lfds.Imdb('test')
Microsoft Research Paraphrase Corpus:
import lineflow.datasets as lfds
train = lfds.MsrParaphrase('train')
test = lfds.MsrParaphrase('test')
Penn Treebank:
import lineflow.datasets as lfds
train = lfds.PennTreebank('train')
dev = lfds.PennTreebank('dev')
test = lfds.PennTreebank('test')
small_parallel_enja:
import lineflow.datasets as lfds
train = lfds.SmallParallelEnJa('train')
dev = lfds.SmallParallelEnJa('dev')
test = lfds.SmallParallelEnJa('test')
SQuAD:
import lineflow.datasets as lfds
train = lfds.Squad('train')
dev = lfds.Squad('dev')
WikiText-2 (Added by @sobamchan, thanks.)
import lineflow.datasets as lfds
train = lfds.WikiText2('train')
dev = lfds.WikiText2('dev')
test = lfds.WikiText2('test')