Lemmatizer for Danish

🤘 Lemmy

Lemmy is a lemmatizer for Danish 🇩🇰 and Swedish 🇸🇪. It comes ready for use. The Danish model is trained on Dansk Sprognævn's (DSN) word list (‘fuldformliste’) and the Danish Universal Dependencies. The Swedish model is trained on SALDO's morphology dataset and the Swedish Universal Dependencies (Talbanken). Lemmy also supports training on your own dataset.

The models included in Lemmy were evaluated on the respective Universal Dependencies dev datasets. The Danish model scored > 99% accuracy, while the Swedish model scored > 97%. All reported scores were obtained when supplying Lemmy with POS tags.
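As a rough illustration of what an accuracy score of this kind measures, the sketch below computes lemmatization accuracy over a list of gold examples. It is not the evaluation script used for the reported numbers. The helper name `lemmatization_accuracy` is hypothetical, and the sketch assumes that `lemmatize(pos, word)` returns a list of candidate lemmas and that a prediction counts as correct only when the lemmatizer returns exactly the gold lemma. How ambiguous predictions were scored in the original evaluation is not stated here.

```python
from typing import Any, Iterable, Tuple


def lemmatization_accuracy(lemmatizer: Any,
                           examples: Iterable[Tuple[str, str, str]]) -> float:
    """Return the share of (pos, word, gold_lemma) examples predicted correctly.

    `lemmatizer` is any object with a `lemmatize(pos, word)` method.
    Assumption: the method returns a list of candidate lemmas (a bare
    string is also accepted and treated as a one-item list).
    """
    total = 0
    correct = 0
    for pos, word, gold in examples:
        total += 1
        predicted = lemmatizer.lemmatize(pos, word)
        if isinstance(predicted, str):
            predicted = [predicted]
        # Strict scoring (an assumption): an ambiguous answer with several
        # candidates is counted as wrong even if it contains the gold lemma.
        if list(predicted) == [gold]:
            correct += 1
    return correct / total if total else 0.0
```

To use it with Lemmy, pass the object returned by `lemmy.load("da")` together with triples taken from a tagged dev set.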

You can use Lemmy as a spaCy extension, more specifically a spaCy pipeline component. This is highly recommended and makes the lemmas easily accessible from the spaCy tokens. Lemmy makes use of POS tags to predict the lemmas. When wired up to the spaCy pipeline, Lemmy has the benefit of using spaCy’s built-in POS tagger.

Lemmy can also be used without spaCy, as a standalone lemmatizer. In that case, you will have to provide the POS tags. Alternatively, you can use Lemmy without POS tags, though the accuracy will most likely suffer. Currently, only the Danish version of Lemmy ships with a model trained for use without POS tags. That is, if you want to use Lemmy on Swedish text without POS tags, you must train your own Lemmy model.
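When only some tokens come with POS tags, one option is to pass an empty tag for the rest, mirroring the untagged usage. The sketch below shows that idea. The helper `lemmatize_tokens` is hypothetical and not part of Lemmy, and it assumes that `lemmatize(pos, word)` returns a list of candidate lemmas. Per the paragraph above, the empty-tag path is only expected to work with the bundled Danish model.

```python
from typing import Any, Iterable, List, Optional, Tuple


def lemmatize_tokens(lemmatizer: Any,
                     tokens: Iterable[Tuple[str, Optional[str]]]
                     ) -> List[Tuple[str, List[str]]]:
    """Lemmatize (word, pos) pairs, using an empty POS tag when pos is missing.

    `lemmatizer` is any object with a `lemmatize(pos, word)` method, for
    example the object returned by `lemmy.load("da")`.
    """
    results = []
    for word, pos in tokens:
        tag = pos or ""  # None or "" both mean "no POS tag available"
        lemmas = lemmatizer.lemmatize(tag, word)
        if isinstance(lemmas, str):
            # Tolerate a single-string return value (an assumption).
            lemmas = [lemmas]
        results.append((word, list(lemmas)))
    return results
```

Keeping the tagged and untagged cases in one call site makes it easy to compare how much the output changes when tags are dropped.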

Lemmy is heavily inspired by the CST Lemmatizer for Danish.

Install

pip install lemmy

Basic Usage Without POS tags

import lemmy

# Create an instance of the standalone lemmatizer.
lemmatizer = lemmy.load("da")

# Find lemma for the word 'akvariernes'. First argument is an empty POS tag.
lemmatizer.lemmatize("", "akvariernes")

Basic Usage With POS tags

import lemmy

# Create an instance of the standalone lemmatizer.
# Replace 'da' with 'sv' for the Swedish lemmatizer.
lemmatizer = lemmy.load("da")

# Find lemma for the word 'akvariernes'. First argument is the user-provided POS tag.
lemmatizer.lemmatize("NOUN", "akvariernes")

Usage with spaCy Model

# Replace da_custom_model with the name of your installed spaCy model package.
import da_custom_model as da
import lemmy.pipe

nlp = da.load()

# Create an instance of Lemmy's pipeline component for spaCy.
# Replace 'da' with 'sv' for the Swedish lemmatizer.
pipe = lemmy.pipe.load('da')

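# Add the component to the spaCy pipeline, after the tagger so POS tags are available.
# Passing the component object directly is the spaCy v2 calling style.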
nlp.add_pipe(pipe, after='tagger')

# Lemmas can now be accessed using the `._.lemmas` attribute on the tokens.
nlp("akvariernes")[0]._.lemmas

Training

The notebooks folder contains examples showing how to train your own model using Lemmy.
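Before training, you need examples that pair a POS tag and a word form with its lemma. Universal Dependencies treebanks are distributed in the CoNLL-U format, where each token line has ten tab-separated columns and the form, lemma and universal POS tag sit in columns 2, 3 and 4. The sketch below extracts those triples from such a file. The function name `read_conllu_triples` is hypothetical, and the exact input that Lemmy's training code expects is documented in the notebooks, not here.

```python
from typing import Iterable, Iterator, Tuple


def read_conllu_triples(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (upos, form, lemma) triples from CoNLL-U formatted lines."""
    for line in lines:
        line = line.rstrip("\n")
        # Blank lines separate sentences; lines starting with '#' are comments.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        token_id, form, lemma, upos = cols[0], cols[1], cols[2], cols[3]
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in token_id or "." in token_id:
            continue
        yield upos, form, lemma
```

A typical call would be `list(read_conllu_triples(open(path, encoding="utf-8")))`, where `path` points to a treebank file you have downloaded.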
