German Lemmatizer
A Python package (using a Docker image under the hood) to lemmatize German texts.
Built upon:
- IWNLP: Uses the crowd-generated token tables on de.wiktionary.
- GermaLemma: Looks up lemmas in the TIGER Corpus and uses Pattern as a fallback for some rule-based lemmatizations.
It works as follows. First, spaCy tags each token with its part of speech (POS). Then German Lemmatizer looks up the lemma in both IWNLP and GermaLemma. If the two disagree, the IWNLP lemma is chosen. If they agree, or only one tool finds a lemma, that lemma is taken. Where possible, the casing of the original token is preserved.
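The selection rule described above can be sketched in a few lines. This is only an illustration, not the package's actual code: the names choose_lemma and match_casing are hypothetical, the fallback to the original token when neither tool finds a lemma is an assumption, and so is the choice to carry over only the case of the first letter.

```python
from typing import Optional


def match_casing(token: str, lemma: str) -> str:
    """Carry the first-letter case of the token over to the lemma.

    Assumption: "preserving casing" means matching the first letter only.
    """
    if not token or not lemma:
        return lemma
    if token[0].isupper():
        return lemma[0].upper() + lemma[1:]
    return lemma[0].lower() + lemma[1:]


def choose_lemma(token: str,
                 iwnlp: Optional[str],
                 germalemma: Optional[str]) -> str:
    """Pick a lemma from two lookups (None means "not found")."""
    if iwnlp is not None:
        # Covers agreement, IWNLP-only hits, and disagreement (IWNLP wins).
        lemma = iwnlp
    elif germalemma is not None:
        # Only GermaLemma found something.
        lemma = germalemma
    else:
        # Assumption: keep the token unchanged when both lookups fail.
        lemma = token
    return match_casing(token, lemma)


print(choose_lemma("Lieder", "Lied", "Lied"))    # both tools agree
print(choose_lemma("sang", None, "singen"))      # only one tool answers
print(choose_lemma("guter", "gut", "guter"))     # disagreement, IWNLP wins
```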
You may want to use the underlying Docker image directly: german-lemmatizer-docker
Installation
- Install Docker.
- Install the package: pip install german-lemmatizer
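Before the first run, it can help to confirm that the Docker CLI is installed and the daemon is reachable. The sketch below is not part of german-lemmatizer. The helper name docker_available is made up for illustration, and it relies only on the standard docker info command, which exits with a non-zero status when the daemon cannot be reached.

```python
import shutil
import subprocess


def docker_available(timeout: float = 10.0) -> bool:
    """Return True if the docker CLI exists and the daemon answers."""
    if shutil.which("docker") is None:
        # The docker executable is not on PATH.
        return False
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # The CLI could not be started or did not answer in time.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("Docker ready" if docker_available() else "Docker not reachable")
```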
Usage
- Read and accept the license terms of the TIGER Corpus (free to use for non-commercial purposes).
- Make sure the Docker daemon is running.
- Write some Python code:
from german_lemmatizer import lemmatize

lemmatize(
    ['Johannes war ein guter Schüler', 'Sabiene sang zahlreiche Lieder'],
    working_dir='*',
    chunk_size=10000,
    n_jobs=1,
    escape=False,
    remove_stop=False)
The list of texts is split into chunks (chunk_size) and processed in parallel (n_jobs). Enable the escape parameter if your text contains newlines. remove_stop removes stopwords as defined by spaCy.
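To make the roles of chunk_size and n_jobs concrete, here is a minimal sketch of chunked, parallel processing. It does not reproduce how german-lemmatizer does this internally: split_into_chunks and process_in_chunks are hypothetical names, the use of a thread pool is an assumption, and the worker below is a stand-in that merely upper-cases its input instead of lemmatizing it.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence


def split_into_chunks(texts: Sequence[str],
                      chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most chunk_size texts."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(texts), chunk_size):
        yield list(texts[start:start + chunk_size])


def process_in_chunks(texts: Sequence[str],
                      worker: Callable[[List[str]], List[str]],
                      chunk_size: int = 10000,
                      n_jobs: int = 1) -> List[str]:
    """Run worker on every chunk with n_jobs threads, keeping input order."""
    chunks = list(split_into_chunks(texts, chunk_size))
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # Executor.map returns results in the order the chunks were given.
        results = pool.map(worker, chunks)
        return [text for chunk in results for text in chunk]


def fake_worker(chunk: List[str]) -> List[str]:
    # Stand-in for the real per-chunk work.
    return [text.upper() for text in chunk]


print(process_in_chunks(["a", "b", "c"], fake_worker, chunk_size=2, n_jobs=2))
```

A larger chunk_size means fewer, bigger units of work, while n_jobs bounds how many of them run at the same time.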
License
MIT.