Project description

lineflow: Framework-Agnostic NLP Data Loader in Python

lineflow is a simple text dataset loader for NLP deep learning tasks.

  • lineflow was designed to be used with any deep learning framework.
  • lineflow lets you build data pipelines.
  • lineflow supports a functional API and lazy evaluation (see the pipeline sketch under Basic Usage below).

lineflow is heavily inspired by tensorflow.data.Dataset and chainer.dataset.

Installation

To install lineflow, simply:

pip install lineflow

If you'd like to use lineflow with AllenNLP:

pip install "lineflow[allennlp]"

Also, if you'd like to use lineflow with torchtext:

pip install "lineflow[torchtext]"

Basic Usage

lineflow.TextDataset expects line-oriented text files:

import lineflow as lf


'''/path/to/text is expected to look like this:
i 'm a line 1 .
i 'm a line 2 .
i 'm a line 3 .
'''
ds = lf.TextDataset('/path/to/text')

ds.first()  # "i 'm a line 1 ."
ds.all()  # ["i 'm a line 1 .", "i 'm a line 2 .", "i 'm a line 3 ."]
len(ds)  # 3
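
Because datasets are evaluated lazily, transformations chain into a pipeline and nothing is computed until you actually pull items out. A minimal sketch building on the dataset above (the str.split tokenization is just an illustration):

import lineflow as lf

ds = lf.TextDataset('/path/to/text')

# map returns a new dataset; the function is applied lazily, item by item.
tokens = ds.map(str.split)

tokens.first()  # ['i', "'m", 'a', 'line', '1', '.']
tokens.all()    # evaluates the whole pipeline over every line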

lineflow with PyTorch, torchtext, and AllenNLP

You can find more examples here.

PyTorch

You can check full code here.

...
import lineflow as lf
import lineflow.datasets as lfds

...


if __name__ == '__main__':
    train = lfds.SmallParallelEnJa('train')
    validation = lfds.SmallParallelEnJa('dev')

    train = train.map(preprocess)
    validation = validation.map(preprocess)

    en_tokens = lf.flat_map(lambda x: x[0],
                            train + validation,
                            lazy=True)
    ja_tokens = lf.flat_map(lambda x: x[1],
                            train + validation,
                            lazy=True)

    en_token_to_index, _ = build_vocab(en_tokens, 'en.vocab')
    ja_token_to_index, _ = build_vocab(ja_tokens, 'ja.vocab')

    ...

    loader = DataLoader(
        train
        .map(postprocess(en_token_to_index, en_unk_index, ja_token_to_index, ja_unk_index))
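        # save() caches the post-processed dataset to 'enja.cache' so later runs can reuse it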
        .save('enja.cache'),
        batch_size=32,
        num_workers=4,
        collate_fn=get_collate_fn(pad_index))
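
The preprocess, postprocess, build_vocab, and get_collate_fn helpers are elided above (see the full example for the real definitions). Purely as a hypothetical sketch of their shape, assuming each raw example is an (English, Japanese) pair of strings and each post-processed example is a pair of index lists:

import torch


def preprocess(x):
    # Hypothetical: split each side of an (en, ja) string pair into tokens.
    en, ja = x
    return en.split(), ja.split()


def get_collate_fn(pad_index):
    def collate(batch):
        # Hypothetical: pad each side of the batch to its longest sequence.
        sources, targets = zip(*batch)
        max_src = max(len(s) for s in sources)
        max_tgt = max(len(t) for t in targets)
        src = torch.LongTensor([s + [pad_index] * (max_src - len(s)) for s in sources])
        tgt = torch.LongTensor([t + [pad_index] * (max_tgt - len(t)) for t in targets])
        return src, tgt
    return collate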

torchtext

You can check full code here.

...
import lineflow.datasets as lfds


if __name__ == '__main__':
    src = data.Field(tokenize=str.split, init_token='<s>', eos_token='</s>')
    tgt = data.Field(tokenize=str.split, init_token='<s>', eos_token='</s>')
    fields = [('src', src), ('tgt', tgt)]
    train = lfds.SmallParallelEnJa('train').to_torchtext(fields)
    validation = lfds.SmallParallelEnJa('dev').to_torchtext(fields)

    src.build_vocab(train, validation)
    tgt.build_vocab(train, validation)

    iterator = data.BucketIterator(
        dataset=train, batch_size=32, sort_key=lambda x: len(x.src))
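
Once the vocabularies are built, iteration works as with any torchtext iterator: each batch exposes one attribute per field name. A minimal consumption loop (standard legacy torchtext behavior, shown here as a sketch):

for batch in iterator:
    # batch.src / batch.tgt are LongTensors of shape (sequence_length, batch_size)
    src, tgt = batch.src, batch.tgt
    break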

AllenNLP

You can check full code here.

...
import lineflow.datasets as lfds


if __name__ == '__main__':
    train = lfds.SmallParallelEnJa('train') \
        .to_allennlp(source_field_name=SOURCE_FIELD_NAME, target_field_name=TARGET_FIELD_NAME).all()
    validation = lfds.SmallParallelEnJa('dev') \
        .to_allennlp(source_field_name=SOURCE_FIELD_NAME, target_field_name=TARGET_FIELD_NAME).all()

    if not osp.exists('./enja_vocab'):
        vocab = Vocabulary.from_instances(train + validation, max_vocab_size=50000)
        vocab.save_to_files('./enja_vocab')
    else:
        vocab = Vocabulary.from_files('./enja_vocab')

    iterator = BucketIterator(sorting_keys=[(SOURCE_FIELD_NAME, 'num_tokens')], batch_size=32)
    iterator.index_with(vocab)
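
With the vocabulary indexed, the iterator can generate tensor batches directly from the instance list. A minimal sketch, assuming the AllenNLP 0.x DataIterator interface:

for batch in iterator(train, num_epochs=1):
    # batch is a dict mapping field names to (dicts of) tensors
    source = batch[SOURCE_FIELD_NAME]
    break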

Datasets

CNN / Daily Mail:

import lineflow.datasets as lfds

train = lfds.CnnDailymail('train')
dev = lfds.CnnDailymail('dev')
test = lfds.CnnDailymail('test')
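
These dataset classes share the lineflow dataset interface shown above, so first(), all(), len(), and map() work on all of them. For example, to peek at one CNN / Daily Mail example (the exact item structure depends on the dataset):

import lineflow.datasets as lfds

train = lfds.CnnDailymail('train')
len(train)     # number of examples
train.first()  # one raw example; its structure varies by dataset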

IMDB:

import lineflow.datasets as lfds

train = lfds.Imdb('train')
test = lfds.Imdb('test')

Microsoft Research Paraphrase Corpus:

import lineflow.datasets as lfds

train = lfds.MsrParaphrase('train')
test = lfds.MsrParaphrase('test')

small_parallel_enja:

import lineflow.datasets as lfds

train = lfds.SmallParallelEnJa('train')
dev = lfds.SmallParallelEnJa('dev')
test = lfds.SmallParallelEnJa('test')

SQuAD:

import lineflow.datasets as lfds

train = lfds.Squad('train')
dev = lfds.Squad('dev')

WikiText-2 (added by sobamchan, thanks!):

import lineflow.datasets as lfds

train = lfds.WikiText2('train')
dev = lfds.WikiText2('dev')
test = lfds.WikiText2('test')

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lineflow-0.3.7.tar.gz (53.3 kB)

File details

Details for the file lineflow-0.3.7.tar.gz.

File metadata

  • Download URL: lineflow-0.3.7.tar.gz
  • Upload date:
  • Size: 53.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.1 CPython/3.7.1

File hashes

Hashes for lineflow-0.3.7.tar.gz:

  • SHA256: 71284c89a74bf07d54049b4306cf00e8a7bbb82129d8c9484a9878af2169ebe2
  • MD5: d3897ad0db792a058b78a91f22605501
  • BLAKE2b-256: ca8a72ca1c7e6455dd43e1c18db9c98f92f9430822e561a23311aef040ee0ffe

See more details on using hashes here.
