Framework-Agnostic NLP Data Loader in Python

lineflow: Framework-Agnostic NLP Data Loader in Python

lineflow is a simple text dataset loader for NLP deep learning tasks.

  • lineflow is designed to be used with any deep learning framework.
  • lineflow lets you build data pipelines that are lazily evaluated.
  • lineflow provides a functional-style API.
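
To make the lazy-evaluation point concrete, here is a minimal pure-Python sketch of a lazily mapped dataset. It does not import lineflow and is not lineflow's implementation; the class `LazyMapDataset`, the `tokenize` helper, and the `calls` list are hypothetical names introduced only for illustration.

```python
from typing import Any, Callable, List, Sequence


class LazyMapDataset:
    """Hypothetical illustration: apply `func` only when an item is accessed."""

    def __init__(self, base: Sequence[Any], func: Callable[[Any], Any]) -> None:
        self._base = base
        self._func = func

    def __len__(self) -> int:
        return len(self._base)

    def __getitem__(self, index: int) -> Any:
        # The transformation runs here, at access time, not at construction.
        return self._func(self._base[index])

    def map(self, func: Callable[[Any], Any]) -> "LazyMapDataset":
        # Chaining wraps the dataset again; still nothing is computed.
        return LazyMapDataset(self, func)


calls: List[str] = []


def tokenize(line: str) -> List[str]:
    calls.append(line)  # record each invocation to show when work happens
    return line.split()


lines = ["i 'm a line 1 .", "i 'm a line 2 .", "i 'm a line 3 ."]
ds = LazyMapDataset(lines, tokenize).map(len)

calls_before_access = len(calls)  # building the pipeline touched no item
second = ds[1]                    # only the second line is tokenized
```

In this sketch, building the two-step pipeline calls `tokenize` zero times, and indexing one item runs the pipeline for that item alone.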

Installation

To install lineflow, simply run:

$ pip install lineflow

If you'd like to use lineflow with AllenNLP:

$ pip install "lineflow[allennlp]"

Also, if you'd like to use lineflow with torchtext:

$ pip install "lineflow[torchtext]"

Usage

lineflow.TextDataset expects line-oriented text files:

import lineflow as lf

'''/path/to/text looks like this:
i 'm a line 1 .
i 'm a line 2 .
i 'm a line 3 .
'''


def preprocess(x):
    return x.split()

ds = lf.TextDataset('/path/to/text')
ds.first()  # "i 'm a line 1 ."
ds[1]  # "i 'm a line 2 ."

ds = ds.map(preprocess)
ds.first()  # ["i", "'m", "a", "line", "1", "."]

ds = lf.TextDataset(['/path/to/text', '/path/to/text'])
ds.first()  # ("i 'm a line 1 .", "i 'm a line 1 .")

ds = ds.map(lambda x: (x[0].split(), x[1].split()))
ds.first()  # (["i", "'m", "a", "line", "1", "."], ["i", "'m", "a", "line", "1", "."])
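
The pairing behaviour shown above for a list of paths can be sketched without lineflow. The helper `read_parallel_lines` below is hypothetical and reads files eagerly, so it illustrates only the shape of the data (one tuple per line index), not how lineflow loads it.

```python
import os
import tempfile
from typing import List, Tuple


def read_parallel_lines(paths: List[str]) -> List[Tuple[str, ...]]:
    """Hypothetical helper: pair up the i-th lines of several text files."""
    columns = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            # Strip only the trailing newline so tokens stay untouched.
            columns.append([line.rstrip("\n") for line in f])
    # zip stops at the shortest file, so extra lines are silently dropped.
    return list(zip(*columns))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "text")
    with open(path, "w", encoding="utf-8") as f:
        f.write("i 'm a line 1 .\ni 'm a line 2 .\ni 'm a line 3 .\n")
    # Passing the same file twice mirrors the example above.
    pairs = read_parallel_lines([path, path])

first_pair = pairs[0]
tokenized = (first_pair[0].split(), first_pair[1].split())
```

This is the layout a seq2seq task typically wants: line i of the source file is aligned with line i of the target file.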

Use lineflow with AllenNLP

The following example converts a Seq2SeqDataset to AllenNLP instances and iterates over it with a BucketIterator:

import math

from allennlp.common.tqdm import Tqdm
from allennlp.data.vocabulary import Vocabulary
from allennlp.data.iterators import BucketIterator

from lineflow.datasets import Seq2SeqDataset


ds = Seq2SeqDataset(
    source_file_path='/path/to/source',
    target_file_path='/path/to/target'
).to_allennlp()

vocab = Vocabulary.from_instances(ds)

iterator = BucketIterator(sorting_keys=[('source_tokens', 'num_tokens')])
iterator.index_with(vocab)

num_batches = math.ceil(len(ds) / iterator._batch_size)

for batch in Tqdm.tqdm(iterator(ds, num_epochs=1), total=num_batches):
    ...  # Your training code here

See the examples for more.
