A Dictionary-Based, Variety-Aware Lemmatizer for Romansh

Reason this release was yanked:

Packaging issues, use 1.0.1 instead

Project description

Basic Lemmatizer for Romansh Varieties (Beta)

This Python package presents a basic dictionary-based lemmatizer for the Romansh language. Provided a Romansh text, the lemmatizer splits it into words and looks up each word in the Pledari Grond dictionaries for the five standard Romansh idioms: Sursilvan, Sutsilvan, Surmiran, Puter, and Vallader, as well as the dictionary for Rumantsch Grischun.

For example, if a Romansh text contains the word lavuraiva, the lemmatizer traces the word back to the Vallader and Puter dictionaries:

[Illustration: the word lavuraiva traced to entries in the Vallader and Puter dictionaries]

Typical use cases for the lemmatizer include:

  • Accessing potential German translations (glosses) of Romansh words
  • Automatically detecting the variety of a Romansh text, based on how many words are found in the respective dictionaries
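The coverage-based detection idea can be sketched as a toy function: score each idiom by the share of the text's words found in its dictionary, then pick the idiom with the highest score. The mini word lists and scoring below are illustrative only, not the package's actual data or algorithm:

```python
# Toy sketch of coverage-based variety detection: the idiom whose
# dictionary covers the largest share of the text's words wins.
# The mini word lists are illustrative, not Pledari Grond data.
MINI_DICTS = {
    "rm-vallader": {"la", "vuolp", "eira", "darcheu", "üna", "jada", "fomantada"},
    "rm-sursilv": {"la", "ina", "gada"},
}

def idiom_scores(words):
    """Fraction of words found in each idiom's word list."""
    words = [w.lower() for w in words]
    return {
        idiom: sum(w in vocab for w in words) / len(words)
        for idiom, vocab in MINI_DICTS.items()
    }

def detect_idiom(words):
    """Return the idiom with the highest coverage score."""
    scores = idiom_scores(words)
    return max(scores, key=scores.get)

sent = ["La", "vuolp", "eira", "darcheu", "üna", "jada", "fomantada"]
print(detect_idiom(sent))  # rm-vallader
```

The real lemmatizer exposes the same idea through `doc.idiom` and `doc.idiom_scores`, as shown in the usage examples.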

A limitation of the current version is that the lemmatizer does not disambiguate between multiple possible ways of lemmatizing a word. Specifically:

  1. If a word has multiple dictionary entries, all the dictionary entries are returned, irrespective of the context in which the word occurs.
  2. If there are multiple ways of morphologically analysing a given word form, all possible analyses are returned.
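To make the two cases concrete: a dictionary-based lookup without disambiguation simply returns every stored analysis for a surface form, across idioms and across homographs. The sketch below uses the analyses of fomantada shown later in this README, stored in a plain dict for illustration:

```python
# Toy sketch: a surface form maps to ALL of its stored analyses,
# with no context-based disambiguation. The entries mirror the
# fomantada example shown later in this README.
ANALYSES = {
    "fomantada": [
        ("rm-surmiran", "fomanto", "PoS=ADJ;Gender=FEM;Number=SG"),
        ("rm-surmiran", "fomantar", "PoS=V;VerbForm=PTCP;Tense=PST;Gender=FEM;Number=SG"),
        ("rm-vallader", "fomantar", "PoS=V;VerbForm=PTCP;Tense=PST;Gender=FEM;Number=SG"),
    ],
}

def lookup(form):
    """Return every candidate analysis for a surface form."""
    return ANALYSES.get(form.lower(), [])

for idiom, lemma, feats in lookup("fomantada"):
    print(f"{idiom}::{lemma}: [{feats}]")
```

Choosing among such candidates (e.g. by sentence context) would require a disambiguation step that the current version does not implement.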

Acknowledgements and Data Rights

This package incorporates dictionary data from the Pledari Grond project.

  • The dictionaries for Rumantsch Grischun, Surmiran, Sursilvan and Sutsilvan are openly licensed. © Lia Rumantscha 1980 – 2025
  • The dictionaries for Vallader and Puter are kindly provided by Uniun dals Grischs and may only be used in the context of this lemmatizer. © Uniun dals Grischs. All rights reserved.

Usage

Installation

pip install git+https://github.com/ZurichNLP/romansh_lemmatizer.git@v0.0.4

Demo: https://huggingface.co/spaces/ZurichNLP/romansh-lemmatizer

Examples

Example notebooks for a few use cases are provided under "example_notebooks":

  • Analysis of words in a corpus
  • Idiom classification
  • Romansh vs. non-Romansh identification

Initialising the lemmatizer

from romansh_lemmatizer import Lemmatizer

lemmatizer = Lemmatizer()
sent = "La vuolp d'eira darcheu üna jada fomantada."
doc = lemmatizer(sent)

Automatic idiom detection

Calling the lemmatizer without a fixed idiom automatically detects the idiom as the one with the highest coverage score, and also returns each idiom with its corresponding score.

print("Automatic Idiom Detection:")
print(f"The sentence '{sent}' is in:", doc.idiom) 
# <Idiom.VALLADER: 'rm-vallader'>

print("\nScores across idioms:")
for k, v in doc.idiom_scores.items():
    print("\t", k, v) 
    # {<Idiom.RUMGR: 'rm-rumgr'>: 0.77..., <Idiom.SURSILV: 'rm-sursilv'>: 0.22..., ...}

Idiom detection with an idiom-specific lemmatizer

If the idiom is passed explicitly, that idiom is always "detected" and assigned a score of 1, while all other idioms receive a score of 0.

idiom = "rm-vallader"
vallader_lemmatizer = Lemmatizer(idiom=idiom)
doc = vallader_lemmatizer(sent)

print(f"\nPassing '{idiom}' as an argument:")
print(f"The sentence '{sent}' is in: ", doc.idiom) 
# <Idiom.VALLADER: 'rm-vallader'>

print("\nScores across idioms:")
for k, v in doc.idiom_scores.items():
    print("\t", k, v) 
    # {<Idiom.RUMGR: 'rm-rumgr'>: 0.0, <Idiom.SURSILV: 'rm-sursilv'>: 0.0,...}

Accessing the tokens and their attributes

The tokens can be accessed as follows:

print("\n", doc.tokens) 
# ['La', 'vuolp', "d'", 'eira', 'darcheu', 'üna', 'jada', 'fomantada', '.']
token = doc.tokens[-2]

The lemmas of a token in the detected idiom can be accessed via its ".lemmas" attribute:

print(f"\nPrint {idiom}-lemmas of token '{token}'")
print(token.lemmas) 
# {rm-vallader::fomantar: [PoS=V;VerbForm=PTCP;Tense=PST;Gender=FEM;Number=SG]}

The ".all_lemmas" attribute gives lemmas in any idiom, not just in the detected one.

print(f"\nPrint all lemmas of token '{token}'")
for t in token.all_lemmas:
    print(t)
# {
#   rm-surmiran::fomanto: [PoS=ADJ;Gender=FEM;Number=SG],
#   rm-surmiran::fomantar: [PoS=V;VerbForm=PTCP;Tense=PST;Gender=FEM;Number=SG],
#   rm-vallader::fomantar: [PoS=V;VerbForm=PTCP;Tense=PST;Gender=FEM;Number=SG]
# }

Lemma objects have an attribute ".translation_de" with potential German translations:

token = doc.tokens[1]
lemmas = list(token.lemmas.keys())

print(f"\nGerman translations of the lemmas of '{token}':")
for lemma in lemmas:
    print(f"{lemma}: {lemma.translation_de}")

# rm-vallader::vuolp: Filou (Gauner, Spitzbube)
# rm-vallader::vuolp: Fuchs
# rm-vallader::vuolp: Schlauberger (Filou)

For more detailed information on object types and their attributes, cf. example_notebooks/overview.ipynb.

Development

Installation

pip install -e ".[dev]"

Running the tests

python -m unittest discover -s tests
# or, with pytest:
pytest -v

License

The software in this project is licensed under the MIT License. For license information regarding the dictionary data, please refer to the Acknowledgements and Data Rights section above.

Project details


Download files

Download the file for your platform.

Source Distribution

rumlem-1.0.0.tar.gz (25.3 MB view details)

Uploaded Source

Built Distribution

rumlem-1.0.0-py3-none-any.whl (26.3 MB view details)

Uploaded Python 3

File details

Details for the file rumlem-1.0.0.tar.gz.

File metadata

  • Download URL: rumlem-1.0.0.tar.gz
  • Upload date:
  • Size: 25.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.5

File hashes

Hashes for rumlem-1.0.0.tar.gz
Algorithm Hash digest
SHA256 6938f28ca5a587a1b61e0e001c42807290fb97d867271de55cca4e3b7114ae15
MD5 791a9cbdda6d4ca0074c6df65cbf1c48
BLAKE2b-256 69df93e784f70211dec1d0ba70fd498091e8601b5f4627810db3848bdee1acfc

File details

Details for the file rumlem-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: rumlem-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 26.3 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.5

File hashes

Hashes for rumlem-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 fbc1c1d65f91314cd8d2d86838f3beeba666daf86d507b7e8154f5bb03929ecc
MD5 bef3c252d080160871e08cab755259e4
BLAKE2b-256 6be3631b69bf4bfae5374161102fb62045d4277770dbbeda9d81a5ba5d44f3bd
