Boilerplate code to wrap different libs for NLP tasks.
nlptasks
A collection of boilerplate code for different NLP tasks with standardised input/output data types, so that it becomes easier to combine NLP tasks that use different libraries/models under the hood.
- Sentence Boundary Disambiguation (SBD)
- Word Tokenization
- Lemmatization
- PoS-Tagging
- Named Entity Recognition (NER)
- Dependency Relations
Installation
The `nlptasks` package is available on PyPI:

```bash
pip install "nlptasks>=0.2.1"
```
Sentence Boundary Disambiguation
Input:
- A list of M documents as strings (data type: `List[str]`)

Output:
- A list of K sentences as strings (data type: `List[str]`)
Usage:

```python
from nlptasks.sbd import sbd_factory

docs = [
    "Die Kuh ist bunt. Die Bäuerin mäht die Wiese.",
    "Ein anderes Dokument: Ganz super! Oder nicht?"]

my_sbd_fn = sbd_factory(name="somajo")
sents = my_sbd_fn(docs)
print(sents)
```
Example output:

```python
[
    'Die Kuh ist bunt.',
    'Die Bäuerin mäht die Wiese.',
    'Ein anderes Dokument: Ganz super!',
    'Oder nicht?'
]
```
Algorithms:

| Factory name | Package | Algorithm | Notes |
|---|---|---|---|
| `'spacy-de'` | `de_core_news_lg-2.3.0` | Rule-based tokenization followed by dependency parsing for SBD | |
| `'stanza-de'` | `stanza==1.1.*`, `de` | Char-based Bi-LSTM + 1D-CNN dependency parser for tokenization, MWT and SBD | Qi et al. (2018), GitHub |
| `'nltk-punkt-de'` | `nltk==3.5`, `german` | Punkt tokenizer, rule-based | Kiss and Strunk (2006), Source Code |
| `'somajo-de'` | `SoMaJo==2.1.1`, `de_CMC` | Rule-based | Proisl and Uhrig (2016), GitHub |
| `'spacy-rule-de'` | `spacy==2.3.0` | Rule-based | `Sentencizer` class |
Notes:

- Dependency-parser-based SBD algorithms (e.g. `'spacy'`, `'stanza'`) are more suitable for documents with typos (e.g. `','` instead of `'.'`, or `' .'` instead of `'. '`) or missing punctuation.
- Rule-based SBD algorithms (e.g. `'nltk_punkt'`, `'somajo'`, `'spacy_rule'`) are more suitable for documents that can be assumed to be error-free, i.e. it is very likely that the author followed spelling and grammar rules, e.g. newspaper articles, published books, reviewed articles.
Word Tokenization
Input:
- A list of K sentences as strings (data type: `List[str]`)

Output:
- A list of K token sequences (data type: `List[List[str]]`)
Usage:

```python
from nlptasks.token import token_factory

sentences = [
    "Die Kuh ist bunt.",
    "Die Bäuerin mäht die Wiese."]

my_tokenizer_fn = token_factory(name="stanza")
sequences = my_tokenizer_fn(sentences)
print(sequences)
```
Example output:

```python
[
    ['Die', 'Kuh', 'ist', 'bunt', '.'],
    ['Die', 'Bäuerin', 'mäht', 'die', 'Wiese', '.']
]
```
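Because all tasks use standardised input/output types, they compose directly: the `List[str]` of sentences returned by SBD is exactly what the tokenizer expects. A minimal sketch chaining both factories, using the factory names from the examples above:

```python
from nlptasks.sbd import sbd_factory
from nlptasks.token import token_factory

docs = ["Die Kuh ist bunt. Die Bäuerin mäht die Wiese."]

# SBD: List[str] documents -> List[str] sentences
sents = sbd_factory(name="somajo")(docs)

# Tokenization: List[str] sentences -> List[List[str]] token sequences
sequences = token_factory(name="stanza")(sents)
```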
Algorithms:

| Factory name | Package | Algorithm | Notes |
|---|---|---|---|
| `'spacy-de'` | `de_core_news_lg-2.3.0` | Rule-based tokenization | Docs |
| `'stanza-de'` | `stanza==1.1.*`, `de` | Char-based Bi-LSTM + 1D-CNN dependency parser for tokenization, MWT and SBD | Qi et al. (2018), GitHub |
Lemmatization
Input:
- A list of token sequences (data type: `List[List[str]]`)

Outputs:
- A list of ID sequences (data type: `List[List[int]]`)
- A vocabulary with `ID:Lemma` mapping (data type: `List[str]`)
Usage:

```python
from nlptasks.lemma import lemma_factory

sequences = [
    ['Die', 'Kuh', 'ist', 'bunt', '.'],
    ['Die', 'Bäuerin', 'mäht', 'die', 'Wiese', '.']
]

my_lemmatizer_fn = lemma_factory(name="spacy")
idseqs, VOCAB = my_lemmatizer_fn(sequences, min_occurrences=0)
print(idseqs)
print(VOCAB)
```
Example output:

```python
[[5, 2, 7, 4, 0], [5, 1, 6, 5, 3, 0]]
['.', 'Bäuerin', 'Kuh', 'Wiese', 'bunt', 'der', 'mähen', 'sein', '[UNK]']
```
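Since the vocabulary is indexed by ID (`'[UNK]'` presumably catches out-of-vocabulary lemmas), the ID sequences can be decoded back into lemmas with a plain list lookup. A minimal sketch based on the example output above:

```python
# decode the ID sequences from the example above back into lemmas
lemmas = [[VOCAB[i] for i in seq] for seq in idseqs]
print(lemmas)
# [['der', 'Kuh', 'sein', 'bunt', '.'],
#  ['der', 'Bäuerin', 'mähen', 'der', 'Wiese', '.']]
```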
Algorithms:

| Factory name | Package | Algorithm | Notes |
|---|---|---|---|
| `'spacy-de'` | `de_core_news_lg-2.3.0` | Rule-based lemmatization | Docs |
| `'stanza-de'` | `stanza==1.1.*`, `de` | n.a. | Qi et al. (2018), Ch. 2.3, GitHub |
PoS-Tagging
Input:
- A list of token sequences (data type: `List[List[str]]`)

Outputs:
- A list of ID sequences (data type: `List[List[int]]`)
- A vocabulary with `ID:postag` mapping, i.e. the "tag set" (data type: `List[str]`)
Usage:

```python
from nlptasks.pos import pos_factory

sequences = [
    ['Die', 'Kuh', 'ist', 'bunt', '.'],
    ['Die', 'Bäuerin', 'mäht', 'die', 'Wiese', '.']
]

my_postagger = pos_factory(name="spacy")
idseqs, TAGSET = my_postagger(sequences, maxlen=4)
print(idseqs)
```
Example output:

```python
[[19, 41, 4, 2], [48, 10, 19, 2]]
```
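With `maxlen=4`, every ID sequence has the same length (the example output shows both sequences cut to 4 IDs), so the result can be turned into a rectangular one-hot array for a downstream model. A minimal numpy sketch, assuming the usual `(batch, time, tags)` layout is wanted:

```python
import numpy as np

ids = np.array(idseqs)  # shape (2, 4), rectangular thanks to maxlen=4
onehot = np.eye(len(TAGSET), dtype=np.uint8)[ids]  # shape (2, 4, len(TAGSET))
```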
Algorithms:

| Factory name | Package | Algorithm | Notes |
|---|---|---|---|
| `'spacy-de'` | `de_core_news_lg-2.3.0` | Multi-task CNN | Docs |
| `'stanza-de'` | `stanza==1.1.*`, `de` | Bi-LSTM with a) word2vec, b) its own embedding layer, c) char-based embeddings as input | Qi et al. (2018), Ch. 2.2, GitHub |
| `'flair-de'` | `flair==0.6.*`, `de-pos-ud-hdt-v0.5.pt` | | Docs |
PoS (Variant 2)
The PoS tagger returns the UPOS tag and the UD features (v2) for each token, e.g. `"DET"` and `"Case=Gen|Definite=Def|Gender=Neut|Number=Sing|PronType=Art"`. All information is one-hot encoded, i.e. one token (column) can have one or more 1s.
Input:
- A list of token sequences (data type: `List[List[str]]`)

Outputs:
- A list of index pairs of a logical matrix (data type: `List[List[Tuple[int, int]]]`)
- A list with the original sequence lengths
- The combined UPOS and UD feature scheme
Usage:

```python
from nlptasks.pos2 import pos2_factory

sequences = [
    ['Die', 'Frau', 'arbeit', 'in', 'der', 'UN', '.'],
    ['Angela', 'Merkel', 'mäht', 'die', 'Wiese', '.']
]

myfunc = pos2_factory(name="stanza-de")
maskseqs, seqlen, SCHEME = myfunc(sequences)
print(maskseqs)
print(seqlen)
print(SCHEME)
```
Example output:

```python
[
    [
        (5, 0), (112, 0), (115, 0), (41, 0), (77, 0), (17, 0), (7, 1),
        ...
        (11, 5), (100, 5), (41, 5), (77, 5), (12, 6)
    ],
    [
        (11, 0), (112, 0), (41, 0), (77, 0), (11, 1), (112, 1), (41, 1),
        ...
        (17, 3), (7, 4), (110, 4), (41, 4), (77, 4), (12, 5)
    ]
]
[7, 6]
['ADJ', 'ADP', ... 'VERB', 'X', 'PronType=Art', ..., 'Clusivity=In']
```
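The index pairs are a sparse encoding of the logical matrix: each pair `(i, j)` marks feature `i` of `SCHEME` as active for token `j`, as the example output suggests. A minimal numpy sketch that rebuilds the dense boolean matrix of the first sequence:

```python
import numpy as np

# rows = entries of SCHEME, columns = token positions of the first sequence
mat = np.zeros((len(SCHEME), seqlen[0]), dtype=bool)
for feat, tok in maskseqs[0]:
    mat[feat, tok] = True
```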
Algorithms:

| Factory name | Package | Algorithm | Notes |
|---|---|---|---|
| `'stanza-de'` | `stanza==1.1.*`, `de` | Bi-LSTM with a) word2vec, b) its own embedding layer, c) char-based embeddings as input | Qi et al. (2018), Ch. 2.2, GitHub |
Named Entity Recognition
The NE tags without IOB prefix (e.g. `LOC`, `PER`) are mapped to IDs, i.e. `int`.
Input:
- A list of token sequences (data type: `List[List[str]]`)

Outputs:
- A list of ID sequences (data type: `List[List[int]]`)
- A vocabulary with `ID:nerscheme` mapping (data type: `List[str]`)
Usage:

```python
from nlptasks.ner import ner_factory

sequences = [
    ['Die', 'Frau', 'arbeit', 'in', 'der', 'UN', '.'],
    ['Angela', 'Merkel', 'mäht', 'die', 'Wiese', '.']
]

my_ner = ner_factory(name="spacy")
idseqs, SCHEME = my_ner(sequences)
print(idseqs)
print(SCHEME)
```
Example output:

```python
[[4, 4, 4, 4, 4, 2, 4], [0, 0, 4, 4, 4, 4]]
['PER', 'LOC', 'ORG', 'MISC', '[UNK]']
```
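Decoding follows the same ID lookup pattern as for the lemmatizer. For instance, pairing each token with its decoded tag and dropping the `'[UNK]'` entries isolates the recognised entities; a minimal sketch based on the example above:

```python
# keep only the tokens that carry a named-entity tag
entities = [
    [(tok, SCHEME[i]) for tok, i in zip(toks, ids) if SCHEME[i] != '[UNK]']
    for toks, ids in zip(sequences, idseqs)
]
print(entities)
# [[('UN', 'ORG')], [('Angela', 'PER'), ('Merkel', 'PER')]]
```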
Algorithms:

| Factory name | Package | Algorithm | Notes |
|---|---|---|---|
| `'flair-multi'` | `flair==0.6.*`, `quadner-large.pt` | | Docs |
| `'spacy-de'` | `de_core_news_lg-2.3.0` | Multi-task CNN | Docs |
| `'stanza-de'` | `stanza==1.1.*`, `de` | n.a. | Docs, GitHub |
NER (Variant 2)
The NER tagger returns NE tags with IOB prefix, e.g. `E-LOC`. Both the NE tag and the IOB prefix are one-hot encoded, i.e. one token (column) can have one or two 1s.
Input:
- A list of token sequences (data type: `List[List[str]]`)

Outputs:
- A list of index pairs of a logical matrix (data type: `List[List[Tuple[int, int]]]`)
- A list with the original sequence lengths
- The NER scheme tags
Usage:

```python
from nlptasks.ner2 import ner2_factory

sequences = [
    ['Die', 'Frau', 'arbeit', 'in', 'der', 'UN', '.'],
    ['Angela', 'Merkel', 'mäht', 'die', 'Wiese', '.']
]

my_ner = ner2_factory(name="flair-multi")
maskseqs, seqlen, SCHEME = my_ner(sequences)
print(maskseqs)
print(seqlen)
print(SCHEME)
```
Example output:

```python
[
    [(6, 0), (6, 1), (6, 2), (6, 3), (6, 4), (8, 5), (2, 5), (6, 6)],
    [(4, 0), (0, 0), (7, 1), (0, 1), (6, 2), (6, 3), (6, 4), (6, 5)]
]
[7, 6]
['PER', 'LOC', 'ORG', 'MISC', 'B', 'I', 'O', 'E', 'S']
```
Algorithms:

| Factory name | Package | Algorithm | Notes |
|---|---|---|---|
| `'flair-multi'` | `flair==0.6.*`, `quadner-large.pt` | | Docs |
Dependency Relations
Input:
- A list of token sequences (data type: `List[List[str]]`)

Outputs:
- A list of index pairs of an adjacency matrix (data type: `List[List[Tuple[int, int]]]`) for
  - the child relations of a token
  - the parent relation of a token
- A list with the original sequence lengths
Usage:

```python
from nlptasks.deprel import deprel_factory

sequences = [
    ['Die', 'Kuh', 'ist', 'bunt', '.'],
    ['Die', 'Bäuerin', 'mäht', 'die', 'Wiese', '.']
]

my_deps = deprel_factory("spacy")
deps_child, deps_parent, seqlens = my_deps(sequences)
print(deps_child)
print(deps_parent)
```
Example output:

```python
[
    [(0, 1), (1, 2), (3, 2), (4, 2)],
    [(0, 1), (1, 2), (4, 2), (5, 2), (3, 4)]
]
[
    [(1, 0), (2, 1), (2, 2), (2, 3), (2, 4)],
    [(1, 0), (2, 1), (2, 2), (4, 3), (2, 4), (2, 5)]
]
```
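The example output suggests that each pair in `deps_parent` reads `(parent, token)`, with the root pointing to itself. Under that assumption, the head position of every token can be recovered like this; a minimal sketch for the first sequence:

```python
# recover the head index of each token of ['Die', 'Kuh', 'ist', 'bunt', '.']
heads = [0] * seqlens[0]
for parent, token in deps_parent[0]:
    heads[token] = parent
print(heads)  # [1, 2, 2, 2, 2] -> 'ist' (position 2) is the root
```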
Algorithms:

| Factory name | Package | Algorithm | Notes |
|---|---|---|---|
| `'spacy-de'` | `de_core_news_lg-2.3.0` | Multi-task CNN | Docs |
Appendix
Install a virtual environment

```bash
python3.6 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements-dev.txt --use-feature=2020-resolver
pip install -r requirements.txt --use-feature=2020-resolver
python scripts/nlptasks_downloader.py
bash download_testdata.sh
```

(If your git repo is stored in a folder whose path contains whitespace, don't use the subfolder `.venv`; use an absolute path without whitespace instead.)
Python commands
- Jupyter for the examples: `jupyter lab`
- Check syntax: `flake8 --ignore=F401 --exclude=$(grep -v '^#' .gitignore | xargs | sed -e 's/ /,/g')`
- Run unit tests: `pytest`
- Upload to PyPI with twine: `python setup.py sdist && twine upload -r pypi dist/*`
Clean up

```bash
find . -type f -name "*.pyc" | xargs rm
find . -type d -name "__pycache__" | xargs rm -r
rm -r .pytest_cache
rm -r .venv
```
Support
Please open an issue for support.
Contributing
Please contribute using GitHub Flow: create a branch, add commits, and open a pull request.