Lightweight NLP Data Loader for All Deep Learning Frameworks in Python

LineFlow: Framework-Agnostic NLP Data Loader in Python

LineFlow is a simple text dataset loader for NLP deep learning tasks.

LineFlow is designed to be used with any deep learning framework.
LineFlow enables you to build pipelines via functional APIs (.map, .filter, .flat_map).
LineFlow provides common NLP datasets.

LineFlow is heavily inspired by tensorflow.data.Dataset and chainer.dataset.
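
For readers new to the three pipeline operations named above, the following plain-Python sketch shows the semantics these operations conventionally have. It uses only built-ins and `itertools`, not LineFlow itself, and the `lines` list is made-up sample data.

```python
from itertools import chain

# Made-up sample data standing in for the lines of a text file.
lines = ["i 'm a line 1 .", "", "i 'm a line 3 ."]

# map: apply a function to every item (here, whitespace tokenization).
tokenized = list(map(str.split, lines))

# filter: keep only the items for which the predicate is true (here, non-empty lines).
non_empty = list(filter(lambda tokens: len(tokens) > 0, tokenized))

# flat_map: map each item to a sequence, then flatten the result by one level.
words = list(chain.from_iterable(non_empty))

print(tokenized)
print(non_empty)
print(words)
```

The point of chaining such operations is that each step takes a dataset and returns a dataset, so tokenization, cleaning, and flattening can be composed in any order.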

Basic Usage

lineflow.TextDataset expects line-oriented text files:

import lineflow as lf


'''The file at /path/to/text is expected to look like this:
i 'm a line 1 .
i 'm a line 2 .
i 'm a line 3 .
'''
ds = lf.TextDataset('/path/to/text')

ds.first()  # "i 'm a line 1 ."
ds.all()  # ["i 'm a line 1 .", "i 'm a line 2 .", "i 'm a line 3 ."]
len(ds)  # 3
ds.map(lambda x: x.split()).first()  # ["i", "'m", "a", "line", "1", "."]
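
To make the expected input concrete, the sketch below writes a three-line file of the kind described above and reads it back with plain Python. It does not use LineFlow; the temporary file and the variable names are illustrative only.

```python
import os
import tempfile

sentences = ["i 'm a line 1 .", "i 'm a line 2 .", "i 'm a line 3 ."]

# Write one example per line; the final newline terminates the last line.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as f:
    f.write("\n".join(sentences) + "\n")
    path = f.name

# Read the file back to confirm the layout: one example per line.
with open(path, encoding="utf-8") as f:
    loaded = [line.rstrip("\n") for line in f]

os.remove(path)  # clean up the temporary file

print(loaded[0])
print(len(loaded))
```

A file written this way has the layout that `lf.TextDataset` is described as expecting: one example per line, with tokens separated by spaces.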

Example

Please check out the examples to see how to use LineFlow, especially for tokenization, building vocabulary, and indexing.

Loads the Penn Treebank dataset:

>>> import lineflow.datasets as lfds
>>> train = lfds.PennTreebank('train')
>>> train.first()
' aer banknote berlitz calloway centrust cluett fromstein gitano guterman hydro-quebec ipo kia memotec mlx nahb punts rake regatta rubens sim snack-food ssangyong swapo wachter '

Splits each sentence into words:

>>> # continuing from above
>>> train = train.map(str.split)
>>> train.first()
['aer', 'banknote', 'berlitz', 'calloway', 'centrust', 'cluett', 'fromstein', 'gitano', 'guterman', 'hydro-quebec', 'ipo', 'kia', 'memotec', 'mlx', 'nahb', 'punts', 'rake', 'regatta', 'rubens', 'sim', 'snack-food', 'ssangyong', 'swapo', 'wachter']

Obtains all the words in the dataset:

>>> # continuing from above
>>> words = train.flat_map(lambda x: x)
>>> words.take(5) # This is useful to build vocabulary.
['aer', 'banknote', 'berlitz', 'calloway', 'centrust']
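
As a rough illustration of the vocabulary-building and indexing steps mentioned above, here is a minimal sketch in plain Python. The `words` list is a small made-up sample standing in for the flattened word stream, and the special tokens `<pad>` and `<unk>` as well as the `encode` helper are assumptions for this example, not part of LineFlow.

```python
from collections import Counter

# Made-up sample standing in for the flattened word stream.
words = ['aer', 'banknote', 'berlitz', 'calloway', 'centrust',
         'aer', 'banknote', 'aer']

# Assumed special tokens: padding and out-of-vocabulary.
specials = ['<pad>', '<unk>']

# Order words by descending frequency; ties keep first-seen order.
counts = Counter(words)
itos = specials + [word for word, _ in counts.most_common()]
stoi = {word: index for index, word in enumerate(itos)}


def encode(tokens):
    """Map tokens to integer ids, falling back to the <unk> id."""
    return [stoi.get(token, stoi['<unk>']) for token in tokens]


print(encode(['aer', 'banknote', 'unseen']))  # [2, 3, 1]
```

A function like `encode` could then be passed to a `.map` call on a tokenized dataset to turn each sentence into a list of ids.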

Further reading:

How to fine-tune BERT with pytorch-lightning by @sobamchan

Requirements

Python 3.6+

Installation

To install LineFlow:

pip install lineflow

Datasets

Is the dataset you want to use not supported? Suggest a new dataset.

Commonsense Reasoning
Language Modeling
Machine Translation
Paraphrase
Question Answering
Sentiment Analysis
Sequence Tagging
Text Summarization

Commonsense Reasoning

CommonsenseQA

Loads the CommonsenseQA dataset:

>>> import lineflow.datasets as lfds

>>> train = lfds.CommonsenseQA("train")
>>> dev = lfds.CommonsenseQA("dev")
>>> test = lfds.CommonsenseQA("test")

The items in this dataset look as follows:

>>> import lineflow.datasets as lfds

>>> train = lfds.CommonsenseQA("train")
>>> train.first()
{"id": "075e483d21c29a511267ef62bedc0461",
 "answer_key": "A",
 "options": {"A": "ignore",
             "B": "enforce",
             "C": "authoritarian",
             "D": "yell at",
             "E": "avoid"},
 "stem": "The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?"}
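
One common way to feed such an item to a multiple-choice model is to turn it into a question, an ordered list of choices, and an integer label. The sketch below does that for the item shown above; the helper name `to_multiple_choice` is hypothetical, and the code assumes every item carries an `answer_key`.

```python
# The item shown above, copied as a plain dictionary.
item = {
    "id": "075e483d21c29a511267ef62bedc0461",
    "answer_key": "A",
    "options": {"A": "ignore", "B": "enforce", "C": "authoritarian",
                "D": "yell at", "E": "avoid"},
    "stem": "The sanctions against the school were a punishing blow, "
            "and they seemed to what the efforts the school had made to change?",
}


def to_multiple_choice(item):
    """Return (question, choices, label) with choices in key order."""
    keys = sorted(item["options"])
    choices = [item["options"][key] for key in keys]
    label = keys.index(item["answer_key"])  # position of the correct choice
    return item["stem"], choices, label


question, choices, label = to_multiple_choice(item)
print(choices, label)
```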

Language Modeling

Penn Treebank

Loads the Penn Treebank dataset:

import lineflow.datasets as lfds

train = lfds.PennTreebank('train')
dev = lfds.PennTreebank('dev')
test = lfds.PennTreebank('test')

WikiText-103

Loads the WikiText-103 dataset:

import lineflow.datasets as lfds

train = lfds.WikiText103('train')
dev = lfds.WikiText103('dev')
test = lfds.WikiText103('test')

This dataset is preprocessed, so you can tokenize each line with str.split:

>>> import lineflow.datasets as lfds
>>> train = lfds.WikiText103('train').flat_map(lambda x: x.split() + ['<eos>'])
>>> train.take(5)
['<eos>', '=', 'Valkyria', 'Chronicles', 'III']
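
For language modeling, a flat token stream like the one above is typically cut into fixed-length inputs whose targets are the same tokens shifted by one position. The sketch below shows one way to do this in plain Python; the `tokens` list is a short illustrative sample, and `next_token_windows` and `seq_len` are hypothetical names, not LineFlow APIs.

```python
# Short illustrative token stream in the style of the output above.
tokens = ['<eos>', '=', 'Valkyria', 'Chronicles', 'III', '=', '<eos>']


def next_token_windows(tokens, seq_len):
    """Yield (inputs, targets); targets are inputs shifted by one token."""
    # Stop early enough that every window has a full-length target.
    for start in range(0, len(tokens) - seq_len, seq_len):
        inputs = tokens[start:start + seq_len]
        targets = tokens[start + 1:start + seq_len + 1]
        yield inputs, targets


pairs = list(next_token_windows(tokens, seq_len=3))
for inputs, targets in pairs:
    print(inputs, '->', targets)
```

Non-overlapping windows are used here for simplicity; a smaller stride would produce more training examples from the same stream.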

WikiText-2 (Added by @sobamchan, thanks.)

Loads the WikiText-2 dataset:

import lineflow.datasets as lfds

train = lfds.WikiText2('train')
dev = lfds.WikiText2('dev')
test = lfds.WikiText2('test')

This dataset is preprocessed, so you can tokenize each line with str.split:

>>> import lineflow.datasets as lfds
>>> train = lfds.WikiText2('train').flat_map(lambda x: x.split() + ['<eos>'])
>>> train.take(5)
['<eos>', '=', 'Valkyria', 'Chronicles', 'III']

Machine Translation

small_parallel_enja:

Loads the small_parallel_enja dataset, which is a small English-Japanese parallel corpus:

import lineflow.datasets as lfds

train = lfds.SmallParallelEnJa('train')
dev = lfds.SmallParallelEnJa('dev')
test = lfds.SmallParallelEnJa('test')

This dataset is preprocessed, so you can tokenize each line with str.split:

>>> import lineflow.datasets as lfds
>>> train = lfds.SmallParallelEnJa('train').map(lambda x: (x[0].split(), x[1].split()))
>>> train.first()
(['i', 'can', "'t", 'tell', 'who', 'will', 'arrive', 'first', '.'], ['誰', 'が', '一番', 'に', '着', 'く', 'か', '私', 'に', 'は', '分か', 'り', 'ま', 'せ', 'ん', '。'])
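
A tokenized pair like the one above usually needs boundary markers before it can train a sequence-to-sequence model. The following sketch shows one conventional arrangement in plain Python; the marker strings `<s>` and `</s>` and the helper `to_seq2seq` are assumptions for illustration, not something LineFlow defines.

```python
# The tokenized (English, Japanese) pair shown above.
pair = (
    ['i', 'can', "'t", 'tell', 'who', 'will', 'arrive', 'first', '.'],
    ['誰', 'が', '一番', 'に', '着', 'く', 'か', '私', 'に', 'は',
     '分か', 'り', 'ま', 'せ', 'ん', '。'],
)

BOS, EOS = '<s>', '</s>'  # assumed begin/end-of-sentence markers


def to_seq2seq(pair):
    """Return (source, decoder_input, decoder_output) for teacher forcing."""
    source, target = pair
    decoder_input = [BOS] + target   # what the decoder reads
    decoder_output = target + [EOS]  # what the decoder should predict
    return source, decoder_input, decoder_output


source, decoder_input, decoder_output = to_seq2seq(pair)
print(decoder_input[:3], decoder_output[-3:])
```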

Paraphrase

Microsoft Research Paraphrase Corpus:

Loads the Microsoft Research Paraphrase Corpus:

import lineflow.datasets as lfds

train = lfds.MsrParaphrase('train')
test = lfds.MsrParaphrase('test')

Each item in this dataset looks as follows:

>>> import lineflow.datasets as lfds
>>> train = lfds.MsrParaphrase('train')
>>> train.first()
{'quality': '1',
 'id1': '702876',
 'id2': '702977',
 'string1': 'Amrozi accused his brother, whom he called "the witness", of deliberately distorting his evidence.',
 'string2': 'Referring to him as only "the witness", Amrozi accused his brother of deliberately distorting his evidence.'
}
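
Note that every field in the item above is a string, including `quality`. A classifier usually wants an integer label, so a small conversion step is needed. The sketch below is one possible version; `to_pair_example` is a hypothetical helper name.

```python
# The item shown above, copied as a plain dictionary.
item = {
    'quality': '1',
    'id1': '702876',
    'id2': '702977',
    'string1': 'Amrozi accused his brother, whom he called "the witness", '
               'of deliberately distorting his evidence.',
    'string2': 'Referring to him as only "the witness", Amrozi accused his '
               'brother of deliberately distorting his evidence.',
}


def to_pair_example(item):
    """Return (sentence_a, sentence_b, label) with an integer label."""
    return item['string1'], item['string2'], int(item['quality'])


sentence_a, sentence_b, label = to_pair_example(item)
print(label)
```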

Question Answering

SQuAD:

Loads the SQuAD dataset:

import lineflow.datasets as lfds

train = lfds.Squad('train')
dev = lfds.Squad('dev')

Each item in this dataset looks as follows:

>>> import lineflow.datasets as lfds
>>> train = lfds.Squad('train')
>>> train.first()
{'answers': [{'answer_start': 515, 'text': 'Saint Bernadette Soubirous'}],
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.'}
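
In SQuAD, `answer_start` is a character offset into `context`, so the answer span can be recovered by slicing. The sketch below demonstrates this on a shortened excerpt of the context above, which is why the offset is computed here rather than being 515; `answer_span` is a hypothetical helper name.

```python
# Shortened excerpt of the context above, so the offset differs from 515.
context = ('It is a replica of the grotto at Lourdes, France where the Virgin '
           'Mary reputedly appeared to Saint Bernadette Soubirous in 1858.')
answer_text = 'Saint Bernadette Soubirous'

item = {
    'context': context,
    'answers': [{'answer_start': context.index(answer_text),
                 'text': answer_text}],
}


def answer_span(item):
    """Return (start, end) character offsets of the first answer."""
    answer = item['answers'][0]
    start = answer['answer_start']
    end = start + len(answer['text'])
    return start, end


start, end = answer_span(item)
print(item['context'][start:end])
```

Models that predict token positions need a further step that maps these character offsets onto token indices, which depends on the tokenizer in use.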

Sentiment Analysis

IMDB:

Loads the IMDB dataset:

import lineflow.datasets as lfds

train = lfds.Imdb('train')
test = lfds.Imdb('test')

Each item in this dataset looks as follows:

>>> import lineflow.datasets as lfds
>>> train = lfds.Imdb('train')
>>> train.first()
('For a movie that gets no respect there sure are a lot of memorable quotes listed for this gem. Imagine a movie where Joe Piscopo is actually funny! Maureen Stapleton is a scene stealer. The Moroni character is an absolute scream. Watch for Alan "The Skipper" Hale jr. as a police Sgt.', 0)
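
The review text above is raw rather than pre-tokenized, so `str.split` would leave punctuation attached to words. As a minimal sketch, a regular expression can separate words from punctuation. The pattern and the helper `tokenize_example` are illustrative choices, not part of LineFlow, and the example uses one sentence from the review above.

```python
import re

# One sentence from the review above, paired with its label.
example = ('Imagine a movie where Joe Piscopo is actually funny!', 0)

# Match runs of word characters, or single punctuation characters.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize_example(example):
    """Lowercase the text and split it into word and punctuation tokens."""
    text, label = example
    return TOKEN_PATTERN.findall(text.lower()), label


tokens, label = tokenize_example(example)
print(tokens, label)
```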

Sequence Tagging

CoNLL2000

Loads the CoNLL2000 dataset:

import lineflow.datasets as lfds

train = lfds.Conll2000('train')
test = lfds.Conll2000('test')

Text Summarization

CNN / Daily Mail:

Loads the CNN / Daily Mail dataset:

import lineflow.datasets as lfds

train = lfds.CnnDailymail('train')
dev = lfds.CnnDailymail('dev')
test = lfds.CnnDailymail('test')

This dataset is preprocessed, so you can tokenize each line with str.split:

>>> import lineflow.datasets as lfds
>>> train = lfds.CnnDailymail('train').map(lambda x: (x[0].split(), x[1].split()))
>>> train.first()
... # the output is omitted because it's too long to display here.

SciTLDR

Loads the SciTLDR dataset:

import lineflow.datasets as lfds

train = lfds.SciTLDR('train')
dev = lfds.SciTLDR('dev')
test = lfds.SciTLDR('test')

Project details

Release history

This version

0.6.8

Nov 22, 2021

0.6.7

Oct 31, 2021

0.6.6

Oct 10, 2021

0.6.5

Oct 5, 2021

0.6.4

Jan 20, 2020

0.6.3

Oct 21, 2019

0.6.2

Oct 8, 2019

0.6.1

Aug 19, 2019

0.6.0

Aug 19, 2019

0.5.0

Aug 9, 2019

0.4.5

Jul 23, 2019

0.4.4

Jul 18, 2019

0.4.3

Jul 16, 2019

0.4.2

Jul 16, 2019

0.4.1

Jul 15, 2019

0.4.0

Jul 14, 2019

0.3.10

Jul 14, 2019

0.3.9

Jun 21, 2019

0.3.7

Jun 13, 2019

0.3.6

Jun 11, 2019

0.3.5

Jun 11, 2019

0.3.4

Jun 10, 2019

0.3.3

Jun 8, 2019

0.2.8

May 9, 2019

0.2.7

May 7, 2019

0.2.6

Mar 27, 2019

0.2.5

Mar 25, 2019

0.2.4

Mar 15, 2019

0.2.3

Mar 15, 2019

0.2.2

Mar 15, 2019

0.2.1

Mar 10, 2019

0.2.0

Mar 8, 2019

0.1.9

Mar 7, 2019

0.1.8

Mar 7, 2019

0.1.7

Mar 7, 2019

0.1.6

Mar 7, 2019

0.1.5

Mar 7, 2019

0.1.4

Mar 4, 2019

0.1.3

Feb 28, 2019

0.1.2

Feb 26, 2019

0.1.1

Feb 25, 2019

0.1.0

Feb 25, 2019

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lineflow-0.6.8.tar.gz (42.2 kB)

Uploaded Nov 22, 2021 Source

Built Distribution

lineflow-0.6.8-py3-none-any.whl (24.0 kB)

Uploaded Nov 22, 2021 Python 3

File details

Details for the file lineflow-0.6.8.tar.gz.

File metadata

Download URL: lineflow-0.6.8.tar.gz
Upload date: Nov 22, 2021
Size: 42.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.6.0 importlib_metadata/4.8.2 pkginfo/1.8.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9

File hashes

Hashes for lineflow-0.6.8.tar.gz
Algorithm	Hash digest
SHA256	`71bd764868e874e796c6b1c8e01a1e01ed76a3ae8d8539b3010007eb6296465d`
MD5	`ce71f64a74555722febee92d66c0fa73`
BLAKE2b-256	`cc0bc9400b5fa331b674ba74d27ef18754c7dcf85cd15aae1b499d7d9438a580`

File details

Details for the file lineflow-0.6.8-py3-none-any.whl.

File metadata

Download URL: lineflow-0.6.8-py3-none-any.whl
Upload date: Nov 22, 2021
Size: 24.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.6.0 importlib_metadata/4.8.2 pkginfo/1.8.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9

File hashes

Hashes for lineflow-0.6.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2bbda2a7a713a555294ec2ff1490ee935aa0e7da54cda7b871dc0cefe5d84d8e`
MD5	`1d1d57d8f63e3f7d99bd128ddef93340`
BLAKE2b-256	`4223bbffe38b572c5426c11ce2d9323c0519b9889676e20141f3b9479859deea`
