nlptasks
Boilerplate code to wrap different libraries for NLP tasks.
A collection of boilerplate code for common NLP tasks with standardised input/output data types, so that NLP tasks can be combined easily while the libraries and models under the hood remain interchangeable.
NLP Tasks
- Sentence Boundary Disambiguation (SBD)
- Word Tokenization
- Lemmatization
- PoS-Tagging
- Named Entity Recognition (NER)
- Dependency Relations
Sentence Boundary Disambiguation
Input:
- A list of M documents as strings (data type: `List[str]`)

Output:
- A list of K sentences as strings (data type: `List[str]`)
Algorithms:

| Factory name | Package | Algorithm | Notes |
|---|---|---|---|
| `'spacy'` | `de_core_news_lg-2.3.0` | Rule-based tokenization followed by dependency parsing for SBD | |
| `'stanza'` | `stanza==1.1.*`, `de` | Char-based Bi-LSTM + 1D-CNN dependency parser for tokenization, MWT and SBD | Qi et al. (2018), GitHub |
| `'nltk_punkt'` | `nltk==3.5`, `german` | Punkt tokenizer, rule-based | Kiss and Strunk (2006), Source Code |
| `'somajo'` | `SoMaJo==2.1.1`, `de_CMC` | Rule-based | Proisl and Uhrig (2016), GitHub |
| `'spacy_rule'` | `spacy==2.3.0` | Rule-based | `Sentencizer` class |
Usage:
from nlptasks.sbd import sbd_factory

# two German example documents
docs = ["Die Kuh ist bunt. Die Bäuerin mäht die Wiese.", "Ein anderes Dokument: Ganz super! Oder nicht?"]

# pick the SoMaJo-based sentence splitter from the table above
my_sbd_fn = sbd_factory("somajo")

# returns one flat list of sentences across all documents
sents = my_sbd_fn(docs)
Word Tokenization
Input:
- A list of K sentences as strings (data type: `List[str]`)

Output:
- A list of K token sequences (data type: `List[List[str]]`)
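Usage (a minimal sketch; the module `nlptasks.token` and the factory name `token_factory` are assumptions that mirror `sbd_factory`, not confirmed API):

from nlptasks.token import token_factory  # module/factory names are assumptions

sents = ["Die Kuh ist bunt.", "Die Bäuerin mäht die Wiese."]

# pick a tokenizer backend, analogous to sbd_factory("somajo")
my_token_fn = token_factory("somajo")

# one token sequence per sentence, e.g. [["Die", "Kuh", "ist", "bunt", "."], ...]
sequences = my_token_fn(sents)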
Lemmatization
Input:
- A list of token sequences (data type: `List[List[str]]`)

Outputs A:
- A list of ID sequences (data type: `List[List[int]]`)
- Vocabulary with `ID:Lemma` mapping (data type: `List[str]`)
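Usage (a sketch; `nlptasks.lemma`, `lemma_factory`, and the `(ID sequences, vocabulary)` return shape are assumptions derived from Outputs A above):

from nlptasks.lemma import lemma_factory  # module/factory names are assumptions

sequences = [["Die", "Bäuerin", "mäht", "die", "Wiese", "."]]
my_lemma_fn = lemma_factory("spacy")

# idseqs: List[List[int]], vocab: List[str] with ID:Lemma mapping (assumed return shape)
idseqs, vocab = my_lemma_fn(sequences)

# decode IDs back to lemmata
lemmata = [[vocab[i] for i in seq] for seq in idseqs]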
PoS-Tagging
Input:
- A list of token sequences (data type: `List[List[str]]`)

Outputs A:
- A list of ID sequences (data type: `List[List[int]]`)
- Vocabulary with `ID:postag` mapping, i.e. the "tag set" (data type: `List[str]`)

Outputs B:
- A list of index pairs of a logical matrix (data type: `List[List[Tuple[int, int]]]`)
- Number of PoS-tags: `len(tagset)`
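A pair `(i, j)` in Outputs B presumably marks that token `i` carries tag `j`, i.e. the pairs index a logical matrix of shape `(sequence length, len(tagset))`. A minimal sketch with illustrative data:

import numpy as np

# illustrative tag set and index pairs for one 4-token sequence
tagset = ["ADJ", "DET", "NOUN", "VERB"]
pairs = [(0, 1), (1, 2), (2, 3), (3, 0)]  # assumed to be (token position, tag ID)

# densify into a logical matrix: row = token position, column = tag ID
mask = np.zeros((4, len(tagset)), dtype=bool)
for i, j in pairs:
    mask[i, j] = True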
Named Entity Recognition
Input:
- A list of token sequences (data type: `List[List[str]]`)

Outputs A:
- A list of ID sequences (data type: `List[List[int]]`)
- Vocabulary with `ID:nerscheme` mapping (data type: `List[str]`)

Outputs B:
- A list of index pairs of a logical matrix (data type: `List[List[Tuple[int, int]]]`)
- Number of NER scheme tags: `len(nerscheme)`
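Outputs A can be decoded back to scheme tags through the vocabulary. A sketch with an illustrative IOB-style vocabulary (the actual scheme depends on the model behind the factory):

# illustrative ID:nerscheme vocabulary and one ID sequence
nerscheme = ["O", "B-PER", "I-PER", "B-LOC"]
idseqs = [[1, 2, 0, 3]]

# decode each ID back to its scheme tag
tags = [[nerscheme[i] for i in seq] for seq in idseqs]
# -> [['B-PER', 'I-PER', 'O', 'B-LOC']]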
Dependency Relations
Input:
- A list of token sequences (data type: `List[List[str]]`)

Outputs:
- A list of index pairs of an adjacency matrix (data type: `List[List[Tuple[int, int]]]`)
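Each pair `(i, j)` indexes one edge of a sentence's parse in the adjacency matrix (which of `i` and `j` is the head is an assumption here). A minimal sketch that densifies the pairs of one sentence:

import numpy as np

# illustrative (dependent, head) pairs for one 5-token sentence
pairs = [(0, 1), (2, 1), (3, 4), (4, 1)]
num_tokens = 5

# densify into a boolean adjacency matrix
adj = np.zeros((num_tokens, num_tokens), dtype=bool)
for i, j in pairs:
    adj[i, j] = True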
Appendix
Install a virtual environment
python3.8 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements-dev.txt
pip install -r requirements.txt
bash download.sh
(If your git repo is stored in a folder whose path contains whitespace, don't use the subfolder .venv. Use an absolute path without whitespace instead.)
Python commands
- Jupyter for the examples:
jupyter lab
- Check syntax:
flake8 --ignore=F401 --exclude=$(grep -v '^#' .gitignore | xargs | sed -e 's/ /,/g')
- Run Unit Tests:
pytest
- Upload to PyPI with twine:
python setup.py sdist && twine upload -r pypi dist/*
Clean up
find . -type f -name "*.pyc" | xargs rm
find . -type d -name "__pycache__" | xargs rm -r
rm -r .pytest_cache
rm -r .venv
Support
Please open an issue for support.
Contributing
Please contribute using GitHub Flow: create a branch, add commits, and open a pull request.