lineflow: Framework-Agnostic NLP Data Loader in Python

lineflow is a simple text dataset loader for NLP deep learning tasks.

  • lineflow is designed to work with any deep learning framework.
  • lineflow lets you build data pipelines by chaining operations.
  • lineflow supports a functional API and lazy evaluation.
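
To make the idea of a lazily evaluated pipeline concrete, here is a minimal pure-Python sketch. It is not lineflow's implementation: the `LazyMapDataset` class and the `tokenize` helper are hypothetical names used only for illustration. The point is that `map` records a function without running it, and the work happens only when an item is requested.

```python
from typing import Any, Callable, List, Optional, Sequence


class LazyMapDataset:
    """Hypothetical illustration of a lazily mapped dataset (not lineflow's code)."""

    def __init__(self,
                 data: Sequence[Any],
                 funcs: Optional[List[Callable[[Any], Any]]] = None) -> None:
        self._data = data
        self._funcs = list(funcs or [])

    def map(self, func: Callable[[Any], Any]) -> 'LazyMapDataset':
        # Return a new dataset that remembers func; nothing is computed yet.
        return LazyMapDataset(self._data, self._funcs + [func])

    def __len__(self) -> int:
        return len(self._data)

    def __getitem__(self, index: int) -> Any:
        # Apply the recorded functions, in order, to the requested item only.
        x = self._data[index]
        for func in self._funcs:
            x = func(x)
        return x

    def first(self) -> Any:
        return self[0]


calls = []


def tokenize(line: str) -> List[str]:
    calls.append(line)  # record each call so the laziness is observable
    return line.split()


lines = ["i 'm a line 1 .", "i 'm a line 2 .", "i 'm a line 3 ."]
ds = LazyMapDataset(lines).map(tokenize).map(len)
```

In this sketch, building `ds` leaves `calls` empty, and `ds.first()` tokenizes only the first line.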

Installation

To install lineflow, simply:

$ pip install lineflow

If you'd like to use lineflow with AllenNLP:

$ pip install "lineflow[allennlp]"

Also, if you'd like to use lineflow with torchtext:

$ pip install "lineflow[torchtext]"

Usage

lineflow.TextDataset expects line-oriented text files:

import lineflow as lf


def preprocess(x):
    return x.split()

'''/path/to/text will look like below:
i 'm a line 1 .
i 'm a line 2 .
i 'm a line 3 .
'''
ds = lf.TextDataset('/path/to/text')
ds.first()  # "i 'm a line 1 ."
ds[1]  # "i 'm a line 2 ."

ds = ds.map(preprocess)
ds.first()  # ["i", "'m", "a", "line", "1", "."]

ds = lf.TextDataset(['/path/to/text', '/path/to/text'])
ds.first()  # ("i 'm a line 1 .", "i 'm a line 1 .")

ds = ds.map(lambda x: (x[0].split(), x[1].split()))
ds.first()  # (["i", "'m", "a", "line", "1", "."], ["i", "'m", "a", "line", "1", "."])
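
The tuple-per-line behavior shown above for a list of paths can be pictured as zipping line-aligned files together. The following is a rough sketch under that assumption. `read_parallel_lines` is a hypothetical helper, not part of lineflow, and it reads eagerly for simplicity, whereas lineflow may handle this differently.

```python
import os
import tempfile
from typing import List, Tuple


def read_parallel_lines(paths: List[str]) -> List[Tuple[str, ...]]:
    """Hypothetical helper: read several line-aligned files into tuples of lines."""
    columns = []
    for path in paths:
        with open(path, encoding='utf-8') as f:
            columns.append([line.rstrip('\n') for line in f])
    # zip stops at the shortest file, so extra lines in longer files are dropped.
    return list(zip(*columns))


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, 'source.txt')
    target = os.path.join(tmp, 'target.txt')
    with open(source, 'w', encoding='utf-8') as f:
        f.write("i 'm a line 1 .\ni 'm a line 2 .\n")
    with open(target, 'w', encoding='utf-8') as f:
        f.write("je suis la ligne 1 .\nje suis la ligne 2 .\n")
    pairs = read_parallel_lines([source, target])
```

Each element of `pairs` holds the lines found at the same position in every file, which is the shape a source/target translation pair needs.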

lineflow with Deep Learning Frameworks

Use lineflow with AllenNLP:

import math

from allennlp.common.tqdm import Tqdm
from allennlp.data.vocabulary import Vocabulary
from allennlp.data.iterators import BucketIterator

from lineflow.datasets import Seq2SeqDataset


ds = Seq2SeqDataset(
    source_file_path='/path/to/source',
    target_file_path='/path/to/target'
).to_allennlp()

vocab = Vocabulary.from_instances(ds)

iterator = BucketIterator(sorting_keys=[('source_tokens', 'num_tokens')])
iterator.index_with(vocab)

# The batch size is read from the iterator's private _batch_size attribute.
num_batches = math.ceil(len(ds) / iterator._batch_size)

for batch in Tqdm.tqdm(iterator(ds, num_epochs=1), total=num_batches):
    ...  # Your training code here
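
The `math.ceil` call in the example rounds up so that a final, smaller batch is still counted in the progress total. A small framework-free sketch of that arithmetic follows. `iter_batches` is a hypothetical helper written for this illustration and is not an AllenNLP or lineflow function.

```python
import math
from typing import Any, Iterator, List, Sequence


def iter_batches(dataset: Sequence[Any], batch_size: int) -> Iterator[List[Any]]:
    """Hypothetical helper: yield consecutive batches from an indexable dataset."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]


examples = list(range(10))
batch_size = 4
batches = list(iter_batches(examples, batch_size))

# Ten examples in batches of four give two full batches plus one partial batch.
num_batches = math.ceil(len(examples) / batch_size)
```

Here `len(batches)` equals `num_batches`, and the last batch contains only the two leftover examples.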

You can find other examples here.

Download files

Source Distribution

lineflow-0.2.2.tar.gz (5.1 kB)

Uploaded Source

Built Distribution

lineflow-0.2.2-py3-none-any.whl (6.6 kB)

Uploaded Python 3
