Boilerplate code to wrap different libs for NLP tasks.

nlptasks

A collection of boilerplate code for different NLP tasks with standardised input/output data types, so that it becomes easier to combine NLP tasks while swapping the libraries/models used under the hood.

NLP Tasks

Sentence Boundary Disambiguation

Input:

  • A list of M documents as strings (data type: List[str])

Output:

  • A list of K sentences as strings (data type: List[str])

Algorithms:

| Factory name | Package | Algorithm | Notes |
|---|---|---|---|
| 'spacy' | de_core_news_lg-2.3.0 | Rule-based tokenization followed by dependency parsing for SBD | |
| 'stanza' | stanza==1.1.*, de | Char-based Bi-LSTM + 1D-CNN dependency parser for tokenization, MWT and SBD | Qi et al. (2018), GitHub |
| 'nltk_punkt' | nltk==3.5, german | Punkt tokenizer, rule-based | Kiss and Strunk (2006), Source Code |
| 'somajo' | SoMaJo==2.1.1, de_CMC | Rule-based | Proisl and Uhrig (2016), GitHub |
| 'spacy_rule' | spacy==2.3.0 | Rule-based | Sentencizer class |

Usage:

from nlptasks.sbd import sbd_factory
docs = ["Die Kuh ist bunt. Die Bäuerin mäht die Wiese.", "Ein anderes Dokument: Ganz super! Oder nicht?"]
my_sbd_fn = sbd_factory("somajo")
sents = my_sbd_fn(docs)

Word Tokenization

Input:

  • A list of K sentences as strings (data type: List[str])

Output:

  • A list of K token sequences (data type: List[List[str]])
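
Whatever backend is used, the input/output shape is fixed: K sentences in, K token sequences out. A minimal stand-in (naive whitespace splitting, not one of the package's actual tokenizers) illustrates the data types:

```python
from typing import List

def whitespace_tokenize(sentences: List[str]) -> List[List[str]]:
    """Illustrative stand-in for a word tokenizer: maps K sentences
    (List[str]) to K token sequences (List[List[str]])."""
    return [s.split() for s in sentences]

sents = ["Die Kuh ist bunt.", "Die Bäuerin mäht die Wiese."]
tokens = whitespace_tokenize(sents)
# one token sequence per input sentence
assert len(tokens) == len(sents)
```

A real tokenizer would also split off punctuation; the sketch only shows the type contract.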

Lemmatization

Input:

  • A list of token sequences (data type: List[List[str]])

Outputs:

  • A list of ID sequences (data type: List[List[int]])
  • Vocabulary with ID:Lemma mapping (data type: List[str])
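
How the two outputs relate can be sketched in a few lines: each lemma is replaced by its index in a vocabulary list, and the list itself is the ID:Lemma mapping. `encode_lemmata` is illustrative, not a function of the package:

```python
from typing import List, Tuple

def encode_lemmata(lemmata: List[List[str]]) -> Tuple[List[List[int]], List[str]]:
    """Illustrative sketch of the output contract: map each lemma to a
    vocabulary index; vocab[i] recovers the lemma for ID i."""
    vocab: List[str] = []
    ids: List[List[int]] = []
    for seq in lemmata:
        row = []
        for lemma in seq:
            if lemma not in vocab:
                vocab.append(lemma)
            row.append(vocab.index(lemma))
        ids.append(row)
    return ids, vocab

ids, vocab = encode_lemmata([["die", "kuh", "sein", "bunt"]])
# decoding the ID sequence reproduces the lemmata
assert [vocab[i] for i in ids[0]] == ["die", "kuh", "sein", "bunt"]
```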

PoS-Tagging

Input:

  • A list of token sequences (data type: List[List[str]])

Outputs A:

  • A list of ID sequences (data type: List[List[int]])
  • Vocabulary with ID:postag mapping, i.e. the "tag set" (data type: List[str])

Outputs B:

  • A list of index pairs of a logical matrix (data type: List[List[Tuple[int, int]]])
  • Number of PoS-tags, i.e. len(tagset)
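
"Outputs B" can be read as the sparse coordinates of a one-hot logical matrix. The sketch below assumes rows index token positions and columns index tag IDs; that layout is an interpretation for illustration, not the package's documented format:

```python
from typing import List, Tuple

def ids_to_index_pairs(seqs: List[List[int]]) -> List[List[Tuple[int, int]]]:
    """Convert tag-ID sequences (Outputs A) into the (row, col) index
    pairs of a one-hot logical matrix with shape
    (sequence length, len(tagset)) -- an assumed layout."""
    return [[(pos, tag_id) for pos, tag_id in enumerate(seq)] for seq in seqs]

pairs = ids_to_index_pairs([[2, 0, 1]])
assert pairs == [[(0, 2), (1, 0), (2, 1)]]
```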

Named Entity Recognition

Input:

  • A list of token sequences (data type: List[List[str]])

Outputs A:

  • A list of ID sequences (data type: List[List[int]])
  • Vocabulary with ID:nerscheme mapping (data type: List[str])

Outputs B:

  • A list of index pairs of a logical matrix (data type: List[List[Tuple[int, int]]])
  • Number of NER-scheme tags, i.e. len(nerscheme)
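
Going the other way, the index pairs of "Outputs B" can be expanded into a dense logical matrix of shape (sequence length, len(nerscheme)), e.g. as model input. `pairs_to_matrix` is an illustrative helper, not part of the package:

```python
from typing import List, Tuple
import numpy as np

def pairs_to_matrix(pairs: List[Tuple[int, int]],
                    seq_len: int, num_tags: int) -> np.ndarray:
    """Expand (row, col) index pairs of one sequence back into a dense
    logical matrix of shape (seq_len, num_tags)."""
    mat = np.zeros((seq_len, num_tags), dtype=bool)
    for row, col in pairs:
        mat[row, col] = True
    return mat

m = pairs_to_matrix([(0, 1), (1, 0), (2, 1)], seq_len=3, num_tags=2)
# exactly one True entry per token position
assert int(m.sum()) == 3
```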

Dependency Relations

Input:

  • A list of token sequences (data type: List[List[str]])

Outputs:

  • A list of index pairs of an adjacency matrix (data type: List[List[Tuple[int, int]]])
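
The index pairs of one sentence can be materialised as a boolean adjacency matrix over its tokens. Whether the first index is the dependent (child) or the head is an assumption in this sketch; check the backend's convention:

```python
from typing import List, Tuple
import numpy as np

def edges_to_adjacency(edges: List[Tuple[int, int]],
                       num_tokens: int) -> np.ndarray:
    """Build a dense adjacency matrix from (child, head) index pairs
    of a single sentence (assumed edge orientation)."""
    adj = np.zeros((num_tokens, num_tokens), dtype=bool)
    for child, head in edges:
        adj[child, head] = True
    return adj

# tokens 0 and 2 both attach to token 1
adj = edges_to_adjacency([(0, 1), (2, 1)], num_tokens=3)
assert bool(adj[0, 1]) and bool(adj[2, 1])
```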

Appendix

Install a virtual environment

python3.8 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements-dev.txt
pip install -r requirements.txt
bash download.sh

(If your git repo is stored in a folder whose path contains whitespace, don't use the subfolder .venv; use an absolute path without whitespace instead.)

Python commands

  • Jupyter for the examples: jupyter lab
  • Check syntax: flake8 --ignore=F401 --exclude=$(grep -v '^#' .gitignore | xargs | sed -e 's/ /,/g')
  • Run Unit Tests: pytest
  • Upload to PyPI with twine: python setup.py sdist && twine upload -r pypi dist/*

Clean up

find . -type f -name "*.pyc" | xargs rm
find . -type d -name "__pycache__" | xargs rm -r
rm -r .pytest_cache
rm -r .venv

Support

Please open an issue for support.

Contributing

Please contribute using Github Flow. Create a branch, add commits, and open a pull request.

