Skip to main content

Framework-Agnostic NLP Data Loader in Python

Project description

lineflow: Framework-Agnostic NLP Data Loader in Python

Build Status codecov

lineflow is a simple text dataset loader for NLP deep learning tasks.

  • lineflow was designed to use in all deep learning frameworks.
  • lineflow enables you to build pipelines.
  • lineflow supports functional API and lazy evaluation.

lineflow is heavily inspired by tensorflow.data.Dataset and chainer.dataset.

Installation

To install lineflow, simply:

pip install lineflow

If you'd like to use lineflow with AllenNLP:

pip install "lineflow[allennlp]"

Also, if you'd like to use lineflow with torchtext:

pip install "lineflow[torchtext]"

Basic Usage

lineflow.TextDataset expects line-oriented text files:

import lineflow as lf


'''/path/to/text will be expected as follows:
i 'm a line 1 .
i 'm a line 2 .
i 'm a line 3 .
'''
ds = lf.TextDataset('/path/to/text')

ds.first()  # "i 'm a line 1 ."
ds.all() # ["i 'm a line 1 .", "i 'm a line 2 .", "i 'm a line 3 ."]
len(ds)  # 3

lineflow with PyTorch, torchtext, AllenNLP

You can find more examples here.

PyTorch

You can check full code here.

...
import lineflow as lf
import lineflow.datasets as lfds

...


if __name__ == '__main__':
    train = lfds.SmallParallelEnJa('train')
    validation = lfds.SmallParallelEnJa('dev')

    train = train.map(preprocess)
    validation = validation.map(preprocess)

    en_tokens = lf.flat_map(lambda x: x[0],
                            train + validation,
                            lazy=True)
    ja_tokens = lf.flat_map(lambda x: x[1],
                            train + validation,
                            lazy=True)

    en_token_to_index, _ = build_vocab(en_tokens, 'en.vocab')
    ja_token_to_index, _ = build_vocab(ja_tokens, 'ja.vocab')

    ...

    loader = DataLoader(
        train
        .map(postprocess(en_token_to_index, en_unk_index, ja_token_to_index, ja_unk_index))
        .save('enja.cache'),
        batch_size=32,
        num_workers=4,
        collate_fn=get_collate_fn(pad_index))

torchtext

You can check full code here.

...
import lineflow.datasets as lfds


if __name__ == '__main__':
    src = data.Field(tokenize=str.split, init_token='<s>', eos_token='</s>')
    tgt = data.Field(tokenize=str.split, init_token='<s>', eos_token='</s>')
    fields = [('src', src), ('tgt', tgt)]
    train = lfds.SmallParallelEnJa('train').to_torchtext(fields)
    validation = lfds.SmallParallelEnJa('dev').to_torchtext(fields)

    src.build_vocab(train, validation)
    tgt.build_vocab(train, validation)

    iterator = data.BucketIterator(
        dataset=train, batch_size=32, sort_key=lambda x: len(x.src))

AllenNLP

You can check full code here.

...
import lineflow.datasets as lfds


if __name__ == '__main__':
    train = lfds.SmallParallelEnJa('train') \
        .to_allennlp(source_field_name=SOURCE_FIELD_NAME, target_field_name=TARGET_FIELD_NAME).all()
    validation = lfds.SmallParallelEnJa('dev') \
        .to_allennlp(source_field_name=SOURCE_FIELD_NAME, target_field_name=TARGET_FIELD_NAME).all()

    if not osp.exists('./enja_vocab'):
        vocab = Vocabulary.from_instances(train + validation, max_vocab_size=50000)
        vocab.save_to_files('./enja_vocab')
    else:
        vocab = Vocabulary.from_files('./enja_vocab')

    iterator = BucketIterator(sorting_keys=[(SOURCE_FIELD_NAME, 'num_tokens')], batch_size=32)
    iterator.index_with(vocab)

Datasets

CNN / Daily Mail:

import lineflow.datasets as lfds

train = lfds.CnnDailymail('train')
dev = lfds.CnnDailymail('dev')
test = lfds.CnnDailymail('test')

IMDB:

import lineflow.datasets as lfds

train = lfds.Imdb('train')
test = lfds.Imdb('test')

Microsoft Research Paraphrase Corpus:

import lineflow.datasets as lfds

train = lfds.MsrParaphrase('train')
test = lfds.MsrParaphrase('test')

small_parallel_enja:

import lineflow.datasets as lfds

train = lfds.SmallParallelEnJa('train')
dev = lfds.SmallParallelEnJa('dev')
test = lfd.SmallParallelEnJa('test')

SQuAD:

import lineflow.datasets as lfds

train = lfds.Squad('train')
dev = lfds.Squad('dev')

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lineflow-0.3.4.tar.gz (47.7 kB view details)

Uploaded Source

File details

Details for the file lineflow-0.3.4.tar.gz.

File metadata

  • Download URL: lineflow-0.3.4.tar.gz
  • Upload date:
  • Size: 47.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.1 CPython/3.6.7

File hashes

Hashes for lineflow-0.3.4.tar.gz
Algorithm Hash digest
SHA256 71befecaa46769d73dea61473f29df123847ef9e364027d24dda8a99985f5b26
MD5 360b0869d7b725f81783991ce9416f56
BLAKE2b-256 905ecfa524d928dc90e90634554b84747fe3e940d8ca0667871b83dcbaee63f4

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page