German Lemmatizer

A Python package (using a Docker image under the hood) to lemmatize German texts.

Built upon IWNLP and GermaLemma, with spaCy for part-of-speech (POS) tagging.

It works as follows. First, spaCy tags each token with its POS. German Lemmatizer then looks up the lemma in both IWNLP and GermaLemma. If the two disagree, the lemma from IWNLP is chosen. If they agree, or only one tool finds a lemma, that lemma is taken. The casing of the original token is preserved where possible.
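The selection rule above can be sketched in a few lines of Python. This is only an illustration of the described logic, not the package's actual code: the names `choose_lemma` and `match_casing` are hypothetical, the fallback to the original token when neither tool finds a lemma is an assumption, and the casing heuristic is one plausible reading of "preserve the casing".

```python
from typing import Optional


def match_casing(lemma: str, token: str) -> str:
    """Carry the casing of the original token over to the lemma (assumed heuristic)."""
    if len(token) > 1 and token.isupper():
        return lemma.upper()
    if token[:1].isupper():
        return lemma[:1].upper() + lemma[1:]
    return lemma[:1].lower() + lemma[1:]


def choose_lemma(
    token: str,
    iwnlp_lemma: Optional[str],
    germalemma_lemma: Optional[str],
) -> str:
    """Pick a lemma from the two lookups; None means 'not found'."""
    if iwnlp_lemma is not None:
        # Covers three cases: both agree, both disagree (IWNLP wins),
        # or only IWNLP found something.
        lemma = iwnlp_lemma
    elif germalemma_lemma is not None:
        # Only GermaLemma found something.
        lemma = germalemma_lemma
    else:
        # Assumption: keep the token unchanged if neither tool knows it.
        lemma = token
    return match_casing(lemma, token)
```

For example, `choose_lemma("Lieder", "Lied", "Lieder")` returns the IWNLP candidate because the two lookups disagree.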

You may want to use the underlying Docker image directly: german-lemmatizer-docker

Installation

  1. Install Docker.
  2. pip install german-lemmatizer
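Because the package depends on Docker, it can help to confirm that the Docker CLI is installed and the daemon is reachable before calling the lemmatizer. The helper below is a minimal sketch and not part of german-lemmatizer; the name `docker_is_ready` is hypothetical. It relies on `docker info` exiting with a non-zero status when the daemon cannot be reached.

```python
import shutil
import subprocess


def docker_is_ready(timeout: float = 10.0) -> bool:
    """Return True if the docker CLI exists and the daemon answers `docker info`."""
    if shutil.which("docker") is None:
        # The docker executable is not on PATH.
        return False
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # The CLI could not be started or the daemon did not respond in time.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("Docker ready:", docker_is_ready())
```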

Usage

  1. Read and accept the license terms of the TIGER Corpus (free to use for non-commercial purposes).
  2. Make sure the Docker daemon is running.
  3. Write some Python code:
from german_lemmatizer import lemmatize

lemmatize(
    ['Johannes war ein guter Schüler', 'Sabiene sang zahlreiche Lieder'],
    working_dir='*',
    chunk_size=10000,
    n_jobs=1,
    escape=False,
    remove_stop=False)

The list of texts is split into chunks (chunk_size) and processed in parallel (n_jobs).
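To illustrate what chunked, parallel processing means here, the following sketch splits a list into chunks of `chunk_size` and maps a worker function over them with `n_jobs` workers. It is not the package's implementation: the helper names `split_into_chunks` and `process_in_chunks` are hypothetical, and the use of a thread pool is an assumption, since the source does not say how the jobs are run.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence


def split_into_chunks(texts: Sequence[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts` with at most `chunk_size` items each."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(texts), chunk_size):
        yield list(texts[start:start + chunk_size])


def process_in_chunks(
    texts: Sequence[str],
    process_chunk: Callable[[List[str]], List[str]],
    chunk_size: int = 10000,
    n_jobs: int = 1,
) -> List[str]:
    """Apply `process_chunk` to every chunk using `n_jobs` workers, keeping input order."""
    chunks = split_into_chunks(texts, chunk_size)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # Executor.map returns results in the order of the submitted chunks.
        results = pool.map(process_chunk, chunks)
        return [item for chunk_result in results for item in chunk_result]
```

A smaller `chunk_size` gives more, shorter units of work, which only pays off when `n_jobs` is greater than 1.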

Enable the escape parameter if your text contains newlines. The remove_stop parameter removes stopwords as defined by spaCy.
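Escaping matters whenever texts are exchanged in a line-based format, where a raw newline inside a text would be read as a record boundary. The source does not say how german-lemmatizer escapes newlines internally, so the following is only a generic sketch of a reversible scheme; `escape_newlines` and `unescape_newlines` are hypothetical helpers.

```python
import re


def escape_newlines(text: str) -> str:
    """Encode backslashes and newlines so the result fits on a single line."""
    # Escape backslashes first so that existing "\n" character pairs stay distinguishable.
    return text.replace("\\", "\\\\").replace("\n", "\\n")


def unescape_newlines(text: str) -> str:
    """Reverse escape_newlines by scanning the escape sequences left to right."""
    return re.sub(
        r"\\(.)",
        lambda m: "\n" if m.group(1) == "n" else m.group(1),
        text,
        flags=re.DOTALL,
    )
```

The round trip `unescape_newlines(escape_newlines(text))` returns the original text, including texts that already contain a literal backslash followed by `n`.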

License

MIT.
