Lineflow: Framework-Agnostic NLP Data Loader in Python

Lineflow is a simple text dataset loader for NLP deep learning tasks.

  • Lineflow is designed to be used with any deep learning framework.
  • Lineflow enables you to build pipelines.
  • Lineflow supports a functional API and lazy evaluation.

Lineflow is heavily inspired by tensorflow.data.Dataset and chainer.dataset.
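
To make the idea of lazy evaluation concrete, here is a minimal pure-Python sketch. It is not Lineflow's implementation: the class name LazyMapped and its calls counter are hypothetical and exist only for illustration. The point is that map only records the function, and the function runs when an item is actually requested.

```python
from typing import Any, Callable, List, Sequence


class LazyMapped:
    """Hypothetical illustration of a lazily mapped dataset (not Lineflow's class)."""

    def __init__(self, data: Sequence, func: Callable[[Any], Any]) -> None:
        self._data = data
        self._func = func
        self.calls = 0  # how many times func has run, for demonstration only

    def __len__(self) -> int:
        return len(self._data)

    def __getitem__(self, index: int) -> Any:
        # The function is applied only when an item is requested.
        self.calls += 1
        return self._func(self._data[index])

    def map(self, func: Callable[[Any], Any]) -> "LazyMapped":
        # Chaining wraps the current dataset; nothing is computed yet.
        return LazyMapped(self, func)

    def first(self) -> Any:
        return self[0]

    def all(self) -> List[Any]:
        return [self[i] for i in range(len(self))]


lines = ["i 'm a line 1 .", "i 'm a line 2 .", "i 'm a line 3 ."]
tokens = LazyMapped(lines, str.split)
lengths = tokens.map(len)       # builds a two-step pipeline, no work done yet
calls_before = tokens.calls     # still 0 at this point
first_len = lengths.first()     # only the first line is split and counted
```

After the last line, str.split has run exactly once, even though the dataset has three lines.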

Installation

To install Lineflow:

pip install lineflow

Basic Usage

lineflow.TextDataset expects line-oriented text files:

import lineflow as lf


'''/path/to/text is expected to look like this:
i 'm a line 1 .
i 'm a line 2 .
i 'm a line 3 .
'''
ds = lf.TextDataset('/path/to/text')

ds.first()  # "i 'm a line 1 ."
ds.all()  # ["i 'm a line 1 .", "i 'm a line 2 .", "i 'm a line 3 ."]
len(ds)  # 3
ds.map(lambda x: x.split()).first()  # ["i", "'m", "a", "line", "1", "."]
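
To try the calls above without real data, one option is to write the three sample lines to a temporary file first. The sketch below assumes nothing beyond the standard library for creating the file; the Lineflow part is guarded so that the script still runs when Lineflow is not installed, and it only uses the calls already shown above.

```python
import os
import tempfile

# Write the three sample lines to a temporary file, one example per line.
lines = ["i 'm a line 1 .", "i 'm a line 2 .", "i 'm a line 3 ."]
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as f:
    f.write("\n".join(lines) + "\n")
    path = f.name

# Plain-Python check of what a line-oriented reader should see.
with open(path, encoding="utf-8") as f:
    loaded = [line.rstrip("\n") for line in f]

try:
    import lineflow as lf  # optional here; skipped if Lineflow is not installed
except ImportError:
    lf = None

if lf is not None:
    ds = lf.TextDataset(path)
    print(ds.first())
    print(len(ds))

os.remove(path)  # clean up the temporary file
```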

Example

Load a predefined dataset:

>>> import lineflow.datasets as lfds
>>> train = lfds.SmallParallelEnJa('train')
>>> train.first()
("i can 't tell who will arrive first .", '誰 が 一番 に 着 く か 私 に は 分か り ま せ ん 。')

Split each sentence into words:

>>> # continuing from above
>>> train = train.map(lambda x: (x[0].split(), x[1].split()))
>>> train.first()
(['i', 'can', "'t", 'tell', 'who', 'will', 'arrive', 'first', '.'],
 ['誰', 'が', '一番', 'に', '着', 'く', 'か', '私', 'に', 'は', '分か', 'り', 'ま', 'せ', 'ん', '。'])

Collect the words in the dataset:

>>> # continuing from above
>>> import lineflow as lf
>>> en_tokens = lf.flat_map(lambda x: x[0], train)
>>> en_tokens[:5] # This is useful to build vocabulary.
['i', 'can', "'t", 'tell', 'who']
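
As a follow-up to the vocabulary remark above, here is one possible way to turn a flat token list into a token-to-index mapping. This is a sketch using only the standard library; build_vocab, min_freq, and the special tokens <pad> and <unk> are hypothetical names chosen for illustration and are not part of Lineflow.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def build_vocab(
    tokens: Iterable[str],
    min_freq: int = 1,
    specials: Sequence[str] = ("<pad>", "<unk>"),
) -> Dict[str, int]:
    """Map each token to an integer id, most frequent tokens first."""
    counter = Counter(tokens)
    # Special tokens take the lowest ids.
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tok, freq in counter.most_common():
        if freq >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


# A stand-in for the en_tokens list obtained above.
en_tokens = ['i', 'can', "'t", 'tell', 'who', 'will', 'arrive', 'first', '.']
vocab = build_vocab(en_tokens)

# Unknown words fall back to the <unk> id.
ids = [vocab.get(tok, vocab["<unk>"]) for tok in ['i', 'can', 'swim', '.']]
print(ids)  # [2, 3, 1, 10]
```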

Datasets

CNN / Daily Mail:

import lineflow.datasets as lfds

train = lfds.CnnDailymail('train')
dev = lfds.CnnDailymail('dev')
test = lfds.CnnDailymail('test')

IMDB:

import lineflow.datasets as lfds

train = lfds.Imdb('train')
test = lfds.Imdb('test')

Microsoft Research Paraphrase Corpus:

import lineflow.datasets as lfds

train = lfds.MsrParaphrase('train')
test = lfds.MsrParaphrase('test')

small_parallel_enja:

import lineflow.datasets as lfds

train = lfds.SmallParallelEnJa('train')
dev = lfds.SmallParallelEnJa('dev')
test = lfds.SmallParallelEnJa('test')

SQuAD:

import lineflow.datasets as lfds

train = lfds.Squad('train')
dev = lfds.Squad('dev')

WikiText-2 (added by @sobamchan, thanks):

import lineflow.datasets as lfds

train = lfds.WikiText2('train')
dev = lfds.WikiText2('dev')
test = lfds.WikiText2('test')
