Framework-Agnostic NLP Data Loader in Python


lineflow: Framework-Agnostic NLP Data Loader in Python

lineflow is a simple text dataset loader for NLP deep learning tasks.

lineflow is designed to work with any deep learning framework.
lineflow lets you build data pipelines.
lineflow supports a functional API and lazy evaluation.

lineflow is heavily inspired by tensorflow.data.Dataset and chainer.dataset.
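
To make "pipelines", "functional API", and "lazy evaluation" concrete, here is a minimal plain-Python sketch of the idea. `LazyDataset` is a hypothetical stand-in written for this illustration. It is not lineflow's implementation and only mirrors the `map`, `first`, and `all` calls that appear in the examples in this README.

```python
from typing import Any, Callable, List, Sequence


class LazyDataset:
    """Hypothetical stand-in: wraps a sequence and applies functions on access."""

    def __init__(self, data: Sequence[Any], funcs: tuple = ()) -> None:
        self._data = data
        self._funcs = funcs

    def map(self, func: Callable[[Any], Any]) -> "LazyDataset":
        # Nothing is computed here; the function is only recorded.
        return LazyDataset(self._data, self._funcs + (func,))

    def __getitem__(self, index: int) -> Any:
        # The recorded functions run only when an item is requested.
        x = self._data[index]
        for func in self._funcs:
            x = func(x)
        return x

    def __len__(self) -> int:
        return len(self._data)

    def first(self) -> Any:
        return self[0]

    def all(self) -> List[Any]:
        return [self[i] for i in range(len(self))]


calls = []


def tokenize(line: str) -> List[str]:
    calls.append(line)  # record each call so laziness is observable
    return line.split()


lines = ["i 'm a line 1 .", "i 'm a line 2 .", "i 'm a line 3 ."]
ds = LazyDataset(lines).map(tokenize).map(len)

print(len(calls))  # 0: building the pipeline ran nothing
print(ds.first())  # 6
print(len(calls))  # 1: only the first line was tokenized
```

Chaining `map` calls builds the pipeline, and the work is deferred until an item is actually read.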

Installation

To install lineflow, run:

pip install lineflow

If you'd like to use lineflow with AllenNLP:

pip install "lineflow[allennlp]"

Also, if you'd like to use lineflow with torchtext:

pip install "lineflow[torchtext]"

Basic Usage

lineflow.TextDataset expects line-oriented text files:

import lineflow as lf


'''/path/to/text is expected to look like this:
i 'm a line 1 .
i 'm a line 2 .
i 'm a line 3 .
'''
ds = lf.TextDataset('/path/to/text')

ds.first()  # "i 'm a line 1 ."
ds.all()  # ["i 'm a line 1 .", "i 'm a line 2 .", "i 'm a line 3 ."]
len(ds)  # 3
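
To try the snippet above you need a file in that shape. The sketch below writes one with the standard library only. `read_lines` is a hypothetical helper that shows the one-item-per-line view the example relies on; it is not lineflow code.

```python
import os
import tempfile

sample = "i 'm a line 1 .\ni 'm a line 2 .\ni 'm a line 3 .\n"

# Write the sample to a temporary file, one example per line.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as f:
    f.write(sample)
    path = f.name


def read_lines(path):
    """Hypothetical helper: return the file's lines without trailing newlines."""
    with open(path, encoding='utf-8') as f:
        return [line.rstrip('\n') for line in f]


lines = read_lines(path)
print(lines[0])    # i 'm a line 1 .
print(len(lines))  # 3

# With lineflow installed, `path` could be passed to lf.TextDataset(path)
# at this point, as in the example above. Remove the file when done.
os.remove(path)
```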

lineflow with PyTorch, torchtext, AllenNLP

PyTorch
torchtext
AllenNLP

You can find more examples here.

PyTorch

You can check the full code here.

...
import lineflow as lf
import lineflow.datasets as lfds

...


if __name__ == '__main__':
    train = lfds.SmallParallelEnJa('train')
    validation = lfds.SmallParallelEnJa('dev')

    train = train.map(preprocess)
    validation = validation.map(preprocess)

    en_tokens = lf.flat_map(lambda x: x[0],
                            train + validation,
                            lazy=True)
    ja_tokens = lf.flat_map(lambda x: x[1],
                            train + validation,
                            lazy=True)

    en_token_to_index, _ = build_vocab(en_tokens, 'en.vocab')
    ja_token_to_index, _ = build_vocab(ja_tokens, 'ja.vocab')

    ...

    loader = DataLoader(
        train
        .map(postprocess(en_token_to_index, en_unk_index, ja_token_to_index, ja_unk_index))
        .save('enja.cache'),
        batch_size=32,
        num_workers=4,
        collate_fn=get_collate_fn(pad_index))
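
The helpers used in the snippet above (`preprocess`, `build_vocab`, `postprocess`, `get_collate_fn`) are elided there. As a rough illustration of the vocabulary step only, the sketch below flattens token lists and assigns an index to each token in plain Python. `flatten_tokens` and `build_vocab_sketch` are hypothetical names, the `<pad>` and `<unk>` specials and the frequency ordering are assumptions, and the data is a toy sample. None of it is taken from lineflow or from the full example.

```python
from collections import Counter
from itertools import chain


def flatten_tokens(pairs, side):
    """Lazily yield every token from one side of each (source, target) pair."""
    return chain.from_iterable(pair[side] for pair in pairs)


def build_vocab_sketch(tokens, specials=('<pad>', '<unk>'), min_freq=1):
    """Map tokens to indices: specials first, then tokens by descending count."""
    counter = Counter(tokens)
    index_to_token = list(specials)
    # Sort by (-count, token) so the result is deterministic.
    for token, count in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_freq and token not in specials:
            index_to_token.append(token)
    token_to_index = {token: i for i, token in enumerate(index_to_token)}
    return token_to_index, index_to_token


# Toy parallel data: (source tokens, target tokens).
pairs = [
    (['i', 'like', 'tea'], ['watashi', 'wa', 'ocha', 'ga', 'suki']),
    (['i', 'like', 'rice'], ['watashi', 'wa', 'gohan', 'ga', 'suki']),
]

en_token_to_index, en_index_to_token = build_vocab_sketch(flatten_tokens(pairs, 0))
unk_index = en_token_to_index['<unk>']

# Unknown tokens fall back to the <unk> index.
sentence = ['i', 'like', 'coffee']
ids = [en_token_to_index.get(t, unk_index) for t in sentence]
print(ids)  # [2, 3, 1]
```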

torchtext

You can check the full code here.

...
import lineflow.datasets as lfds


if __name__ == '__main__':
    src = data.Field(tokenize=str.split, init_token='<s>', eos_token='</s>')
    tgt = data.Field(tokenize=str.split, init_token='<s>', eos_token='</s>')
    fields = [('src', src), ('tgt', tgt)]
    train = lfds.SmallParallelEnJa('train').to_torchtext(fields)
    validation = lfds.SmallParallelEnJa('dev').to_torchtext(fields)

    src.build_vocab(train, validation)
    tgt.build_vocab(train, validation)

    iterator = data.BucketIterator(
        dataset=train, batch_size=32, sort_key=lambda x: len(x.src))
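
The `sort_key=lambda x: len(x.src)` argument tells the iterator which length to group examples by. The idea behind length bucketing, batching examples of similar length to reduce padding, can be sketched in plain Python. `bucket_batches` and `padding_needed` are hypothetical helpers for illustration and do not reproduce what torchtext's `BucketIterator` does internally.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar('T')


def bucket_batches(examples: Sequence[T], batch_size: int,
                   sort_key: Callable[[T], int]) -> List[List[T]]:
    """Sort examples by length, then cut the sorted list into batches."""
    ordered = sorted(examples, key=sort_key)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def padding_needed(batch: Sequence[Sequence[str]]) -> int:
    """Count the pad tokens needed to bring every example to the batch maximum."""
    longest = max(len(x) for x in batch)
    return sum(longest - len(x) for x in batch)


examples = [
    'a b c d e f'.split(),
    'a'.split(),
    'a b c d e'.split(),
    'a b'.split(),
]

# Batches taken in the original order versus batches grouped by length.
naive = [examples[0:2], examples[2:4]]
bucketed = bucket_batches(examples, batch_size=2, sort_key=len)

print(sum(padding_needed(b) for b in naive))     # 8
print(sum(padding_needed(b) for b in bucketed))  # 2
```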

AllenNLP

You can check the full code here.

...
import lineflow.datasets as lfds


if __name__ == '__main__':
    train = lfds.SmallParallelEnJa('train') \
        .to_allennlp(source_field_name=SOURCE_FIELD_NAME, target_field_name=TARGET_FIELD_NAME).all()
    validation = lfds.SmallParallelEnJa('dev') \
        .to_allennlp(source_field_name=SOURCE_FIELD_NAME, target_field_name=TARGET_FIELD_NAME).all()

    if not osp.exists('./enja_vocab'):
        vocab = Vocabulary.from_instances(train + validation, max_vocab_size=50000)
        vocab.save_to_files('./enja_vocab')
    else:
        vocab = Vocabulary.from_files('./enja_vocab')

    iterator = BucketIterator(sorting_keys=[(SOURCE_FIELD_NAME, 'num_tokens')], batch_size=32)
    iterator.index_with(vocab)
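
The `if not osp.exists(...)` branch above is a build-once, load-afterwards cache for the vocabulary. The same pattern can be sketched with the standard library alone. `load_or_build` and `build_toy_vocab` are hypothetical names, and JSON is an assumed storage format chosen for the illustration; AllenNLP's own `Vocabulary` files are not involved.

```python
import json
import os
import tempfile


def load_or_build(path, build):
    """Return the cached value at `path`, building and saving it on first use."""
    if os.path.exists(path):
        with open(path, encoding='utf-8') as f:
            return json.load(f)
    value = build()
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(value, f)
    return value


build_calls = []


def build_toy_vocab():
    build_calls.append(1)  # track how often the expensive step runs
    return {'<unk>': 0, 'hello': 1, 'world': 2}


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, 'vocab.json')
    first = load_or_build(cache_path, build_toy_vocab)   # builds and writes the file
    second = load_or_build(cache_path, build_toy_vocab)  # reads the file instead

print(first == second)   # True
print(len(build_calls))  # 1
```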

Datasets

CNN / Daily Mail:

import lineflow.datasets as lfds

train = lfds.CnnDailymail('train')
dev = lfds.CnnDailymail('dev')
test = lfds.CnnDailymail('test')

IMDB:

import lineflow.datasets as lfds

train = lfds.Imdb('train')
test = lfds.Imdb('test')

Microsoft Research Paraphrase Corpus:

import lineflow.datasets as lfds

train = lfds.MsrParaphrase('train')
test = lfds.MsrParaphrase('test')

small_parallel_enja:

import lineflow.datasets as lfds

train = lfds.SmallParallelEnJa('train')
dev = lfds.SmallParallelEnJa('dev')
test = lfds.SmallParallelEnJa('test')

SQuAD:

import lineflow.datasets as lfds

train = lfds.Squad('train')
dev = lfds.Squad('dev')


Release history

0.6.8	Nov 22, 2021
0.6.7	Oct 31, 2021
0.6.6	Oct 10, 2021
0.6.5	Oct 5, 2021
0.6.4	Jan 20, 2020
0.6.3	Oct 21, 2019
0.6.2	Oct 8, 2019
0.6.1	Aug 19, 2019
0.6.0	Aug 19, 2019
0.5.0	Aug 9, 2019
0.4.5	Jul 23, 2019
0.4.4	Jul 18, 2019
0.4.3	Jul 16, 2019
0.4.2	Jul 16, 2019
0.4.1	Jul 15, 2019
0.4.0	Jul 14, 2019
0.3.10	Jul 14, 2019
0.3.9	Jun 21, 2019
0.3.7	Jun 13, 2019
0.3.6	Jun 11, 2019
0.3.5	Jun 11, 2019
0.3.4	Jun 10, 2019 (this version)
0.3.3	Jun 8, 2019
0.2.8	May 9, 2019
0.2.7	May 7, 2019
0.2.6	Mar 27, 2019
0.2.5	Mar 25, 2019
0.2.4	Mar 15, 2019
0.2.3	Mar 15, 2019
0.2.2	Mar 15, 2019
0.2.1	Mar 10, 2019
0.2.0	Mar 8, 2019
0.1.9	Mar 7, 2019
0.1.8	Mar 7, 2019
0.1.7	Mar 7, 2019
0.1.6	Mar 7, 2019
0.1.5	Mar 7, 2019
0.1.4	Mar 4, 2019
0.1.3	Feb 28, 2019
0.1.2	Feb 26, 2019
0.1.1	Feb 25, 2019
0.1.0	Feb 25, 2019


Source Distribution

lineflow-0.3.4.tar.gz (47.7 kB, source), uploaded Jun 10, 2019

Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.1 CPython/3.6.7

Hashes for lineflow-0.3.4.tar.gz

SHA256: `71befecaa46769d73dea61473f29df123847ef9e364027d24dda8a99985f5b26`
MD5: `360b0869d7b725f81783991ce9416f56`
BLAKE2b-256: `905ecfa524d928dc90e90634554b84747fe3e940d8ca0667871b83dcbaee63f4`
