lineflow: Framework-Agnostic NLP Data Loader in Python
lineflow is a simple text dataset loader for NLP deep learning tasks.
- lineflow is designed to be used with any deep learning framework.
- lineflow lets you build data pipelines.
- lineflow supports a functional API and lazy evaluation.
lineflow is heavily inspired by tensorflow.data.Dataset and chainer.dataset.
Installation
To install lineflow, simply:
pip install lineflow
If you'd like to use lineflow with AllenNLP:
pip install "lineflow[allennlp]"
Also, if you'd like to use lineflow with torchtext:
pip install "lineflow[torchtext]"
Basic Usage
lineflow.TextDataset expects line-oriented text files:
import lineflow as lf
'''/path/to/text is expected to look like this:
i 'm a line 1 .
i 'm a line 2 .
i 'm a line 3 .
'''
ds = lf.TextDataset('/path/to/text')
ds.first() # "i 'm a line 1 ."
ds.all() # ["i 'm a line 1 .", "i 'm a line 2 .", "i 'm a line 3 ."]
len(ds) # 3
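Datasets compose into pipelines with map. A minimal sketch, building on the ds above (str.split is just an example transform; any callable works):

tokenized = ds.map(str.split)
tokenized.first()  # ['i', "'m", 'a', 'line', '1', '.']
tokenized.map(len).all()  # [6, 6, 6]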
lineflow with PyTorch, torchtext, AllenNLP
You can find more examples here.
PyTorch
You can check full code here.
...
import lineflow as lf
import lineflow.datasets as lfds
...
if __name__ == '__main__':
    train = lfds.SmallParallelEnJa('train')
    validation = lfds.SmallParallelEnJa('dev')

    train = train.map(preprocess)
    validation = validation.map(preprocess)

    en_tokens = lf.flat_map(lambda x: x[0],
                            train + validation,
                            lazy=True)
    ja_tokens = lf.flat_map(lambda x: x[1],
                            train + validation,
                            lazy=True)
    en_token_to_index, _ = build_vocab(en_tokens, 'en.vocab')
    ja_token_to_index, _ = build_vocab(ja_tokens, 'ja.vocab')

    ...

    loader = DataLoader(
        train
        .map(postprocess(en_token_to_index, en_unk_index, ja_token_to_index, ja_unk_index))
        .save('enja.cache'),
        batch_size=32,
        num_workers=4,
        collate_fn=get_collate_fn(pad_index))
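The helpers above (preprocess, build_vocab, postprocess, get_collate_fn) are elided; see the full code for their definitions. As one illustration, a padding collate function could look like the sketch below. This is a hypothetical stand-in, not the repository's actual get_collate_fn, and it assumes each example is a pair of (English token indices, Japanese token indices):

import torch

def get_collate_fn(pad_index):
    # Hypothetical sketch: pad both sides of each (en_ids, ja_ids) pair
    # to the longest sequence in the batch.
    def collate(batch):
        en, ja = zip(*batch)
        en_max = max(len(x) for x in en)
        ja_max = max(len(x) for x in ja)
        en = torch.LongTensor([x + [pad_index] * (en_max - len(x)) for x in en])
        ja = torch.LongTensor([x + [pad_index] * (ja_max - len(x)) for x in ja])
        return en, ja
    return collate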
torchtext
You can check full code here.
...
from torchtext import data

import lineflow.datasets as lfds

if __name__ == '__main__':
    src = data.Field(tokenize=str.split, init_token='<s>', eos_token='</s>')
    tgt = data.Field(tokenize=str.split, init_token='<s>', eos_token='</s>')
    fields = [('src', src), ('tgt', tgt)]

    train = lfds.SmallParallelEnJa('train').to_torchtext(fields)
    validation = lfds.SmallParallelEnJa('dev').to_torchtext(fields)

    src.build_vocab(train, validation)
    tgt.build_vocab(train, validation)

    iterator = data.BucketIterator(
        dataset=train, batch_size=32, sort_key=lambda x: len(x.src))
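From here the iterator behaves like any legacy torchtext iterator; a minimal usage sketch, where the batch attributes follow the field names registered above:

for batch in iterator:
    src, tgt = batch.src, batch.tgt  # LongTensors of shape (seq_len, batch_size)
    # feed src/tgt to the model here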
AllenNLP
You can check full code here.
...
import os.path as osp

from allennlp.data.iterators import BucketIterator
from allennlp.data.vocabulary import Vocabulary

import lineflow.datasets as lfds

...

if __name__ == '__main__':
    train = lfds.SmallParallelEnJa('train') \
        .to_allennlp(source_field_name=SOURCE_FIELD_NAME, target_field_name=TARGET_FIELD_NAME).all()
    validation = lfds.SmallParallelEnJa('dev') \
        .to_allennlp(source_field_name=SOURCE_FIELD_NAME, target_field_name=TARGET_FIELD_NAME).all()

    if not osp.exists('./enja_vocab'):
        vocab = Vocabulary.from_instances(train + validation, max_vocab_size=50000)
        vocab.save_to_files('./enja_vocab')
    else:
        vocab = Vocabulary.from_files('./enja_vocab')

    iterator = BucketIterator(sorting_keys=[(SOURCE_FIELD_NAME, 'num_tokens')], batch_size=32)
    iterator.index_with(vocab)
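The instances can then be batched by calling the iterator directly; a minimal sketch, assuming the pre-1.0 AllenNLP API that this example targets:

for batch in iterator(train, num_epochs=1):
    # batch is a dict of tensors keyed by field name
    source = batch[SOURCE_FIELD_NAME]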
Supported datasets
Small Parallel En-Ja:

import lineflow.datasets as lfds

train = lfds.SmallParallelEnJa('train')
dev = lfds.SmallParallelEnJa('dev')
test = lfds.SmallParallelEnJa('test')

IMDB:

import lineflow.datasets as lfds

train = lfds.Imdb('train')
test = lfds.Imdb('test')

SQuAD:

import lineflow.datasets as lfds

train = lfds.Squad('train')
dev = lfds.Squad('dev')

CNN / Daily Mail:

import lineflow.datasets as lfds

train = lfds.CnnDailymail('train')
dev = lfds.CnnDailymail('dev')
test = lfds.CnnDailymail('test')
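Each loader returns an ordinary lineflow dataset, so the methods from Basic Usage apply. A small sketch (that SmallParallelEnJa yields (English, Japanese) pairs is inferred from its use in the PyTorch example above):

import lineflow.datasets as lfds

train = lfds.SmallParallelEnJa('train')
en, ja = train.first()  # one (English sentence, Japanese sentence) pair
tokenized = train.map(lambda x: (x[0].split(), x[1].split()))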