
An NLP library for Uralic languages such as Finnish and Sami. Also supports Spanish, Arabic, Russian etc.


UralicNLP

Natural language processing for many languages


UralicNLP can produce morphological analyses, generate morphological forms, lemmatize words and give lexical information about words in Uralic and other languages. Supported languages include Finnish, Russian, German, English, Norwegian, Swedish, Arabic, Ingrian, Meadow & Eastern Mari, Votic, Olonets-Karelian, Erzya, Moksha, Hill Mari, Udmurt, Tundra Nenets, Komi-Permyak, North Sami, South Sami and Skolt Sami. Currently, UralicNLP uses stable builds for the supported languages.

See the catalog of supported languages

Some of the supported languages: 🇸🇦 🇪🇸 🇮🇹 🇵🇹 🇩🇪 🇫🇷 🇳🇱 🇬🇧 🇷🇺 🇫🇮 🇸🇪 🇳🇴 🇩🇰 🇱🇻 🇪🇪

Check out UralicGUI - a graphical user interface for UralicNLP.

☕ Check out UralicNLP official Java version

♯ Check out UralicNLP official C# version

Installation

The library can be installed from PyPI.

pip install uralicNLP

If you want to use the Constraint Grammar features (from uralicNLP.cg3 import Cg3), you will also need to install VISL CG-3.

MCP

Who said LLMs don't speak endangered languages? UralicNLP now supports MCP! Connect UralicNLP main functionality directly to your favorite MCP supporting LLM! Read more in the UralicMCP wiki.

Large language models (LLMs)

UralicNLP supports a wide range of LLMs, and it can even embed text in some endangered languages. Check out LLMs.

UralicNLP can cluster texts into semantically similar categories. Learn more about clustering.

List supported languages

The API is under constant development, and new languages are added through the nightly build system. For this reason, UralicNLP provides a function for looking up the currently supported languages. It returns three-letter ISO codes for the languages.

from uralicNLP import uralicApi
uralicApi.supported_languages()
>>{'cg': ['vot', 'lav', 'izh', 'rus', 'lut', 'fao', 'est', 'nob', 'ron', 'olo', 'bxr', 'hun', 'crk', 'chr', 'vep', 'deu', 'mrj', 'gle', 'sjd', 'nio', 'myv', 'som', 'sma', 'sms', 'smn', 'kal', 'bak', 'kca', 'otw', 'ciw', 'fkv', 'nds', 'kpv', 'sme', 'sje', 'evn', 'oji', 'ipk', 'fit', 'fin', 'mns', 'rmf', 'liv', 'cor', 'mdf', 'yrk', 'tat', 'smj'], 'dictionary': ['vot', 'lav', 'rus', 'est', 'nob', 'ron', 'olo', 'hun', 'koi', 'chr', 'deu', 'mrj', 'sjd', 'myv', 'som', 'sma', 'sms', 'smn', 'kal', 'fkv', 'mhr', 'kpv', 'sme', 'sje', 'hdn', 'fin', 'mns', 'mdf', 'vro', 'udm', 'smj'], 'morph': ['vot', 'lav', 'izh', 'rus', 'lut', 'fao', 'est', 'nob', 'swe', 'ron', 'eng', 'olo', 'bxr', 'hun', 'koi', 'crk', 'chr', 'vep', 'deu', 'mrj', 'ara', 'gle', 'sjd', 'nio', 'myv', 'som', 'sma', 'sms', 'smn', 'kal', 'bak', 'kca', 'otw', 'ciw', 'fkv', 'nds', 'mhr', 'kpv', 'sme', 'sje', 'evn', 'oji', 'ipk', 'fit', 'fin', 'mns', 'rmf', 'liv', 'cor', 'mdf', 'yrk', 'vro', 'udm', 'tat', 'smj']}

The dictionary key lists the languages that are supported by the lexical lookup, whereas morph lists the languages that have morphological FSTs and cg lists the languages that have a CG.
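Since each key holds its own list, a language may have a morphological FST but no dictionary or CG. The following is a minimal sketch of finding the codes that appear under every key. It uses a small excerpt of the output shown above rather than a live call, and fully_supported is a hypothetical helper, not part of UralicNLP.

```python
# Excerpt of the dictionary shown above; a real call to
# uralicApi.supported_languages() returns the full lists.
supported = {
    "cg": ["vot", "fin", "sme", "myv"],
    "dictionary": ["vot", "fin", "sme", "myv", "udm"],
    "morph": ["vot", "fin", "sme", "myv", "udm", "eng"],
}

def fully_supported(supported):
    """Return the sorted codes listed under every key (hypothetical helper)."""
    code_sets = [set(codes) for codes in supported.values()]
    return sorted(set.intersection(*code_sets))

print(fully_supported(supported))  # ['fin', 'myv', 'sme', 'vot']
```

With the full output, the same helper would list the languages for which lexical lookup, morphology and CG are all available.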

Download models

On the command line:

python -m uralicNLP.download --languages fin eng

From Python code:

from uralicNLP import uralicApi
uralicApi.download("fin")

When models are installed, generate(), analyze() and lemmatize() methods will automatically use them instead of the server side API. More information about the models.

Lemmatize words

A word form can be lemmatized with UralicNLP. This does not do any disambiguation but rather returns a list of all the possible lemmas.

from uralicNLP import uralicApi
uralicApi.lemmatize("вирев", "myv")
>>['вирев', 'вирь']
uralicApi.lemmatize("luutapiiri", "fin", word_boundaries=True)
>>['luuta|piiri', 'luu|tapiiri']

An example of lemmatizing the word вирев in Erzya (myv). By default, a descriptive analyzer is used. Use uralicApi.lemmatize("вирев", "myv", descriptive=False) for a non-descriptive analyzer. If word_boundaries is set to True, the lemmatizer will mark word boundaries with a |.
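When word_boundaries=True is used, the returned lemmas can be split on the | marker to recover the parts of a compound. A minimal sketch, working on the output shown above instead of calling the library; split_lemma is a hypothetical helper name.

```python
# Lemmas copied from the lemmatize(..., word_boundaries=True) example above.
lemmas = ["luuta|piiri", "luu|tapiiri"]

def split_lemma(lemma, boundary="|"):
    """Split a lemma into its compound parts at the boundary marker."""
    return lemma.split(boundary)

for lemma in lemmas:
    print(split_lemma(lemma))
# ['luuta', 'piiri']
# ['luu', 'tapiiri']
```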

Morphological analysis

Apart from just getting the lemmas, it's also possible to perform a complete morphological analysis.

from uralicNLP import uralicApi
uralicApi.analyze("voita", "fin")
>>[['voi+N+Sg+Par', 0.0], ['voi+N+Pl+Par', 0.0], ['voitaa+V+Act+Imprt+Prs+ConNeg+Sg2', 0.0], ['voitaa+V+Act+Imprt+Sg2', 0.0], ['voitaa+V+Act+Ind+Prs+ConNeg', 0.0], ['voittaa+V+Act+Imprt+Prs+ConNeg+Sg2', 0.0], ['voittaa+V+Act+Imprt+Sg2', 0.0], ['voittaa+V+Act+Ind+Prs+ConNeg', 0.0], ['vuo+N+Pl+Par', 0.0]]

An example of analyzing the word voita in Finnish (fin). The default analyzer is descriptive. To use a normative analyzer instead, use uralicApi.analyze("voita", "fin", descriptive=False).
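Each result is a pair of an analysis string and a weight, and in the output above the analysis string consists of the lemma followed by tags separated by +. The sketch below splits such strings and groups lemmas by their first tag. It assumes that the first tag is the part of speech, which holds for the Finnish readings shown here but may not hold for every language or for compound analyses. Both helpers are hypothetical and not part of UralicNLP.

```python
# Sample taken from the analyze("voita", "fin") output shown above.
analyses = [
    ["voi+N+Sg+Par", 0.0],
    ["voi+N+Pl+Par", 0.0],
    ["voittaa+V+Act+Imprt+Sg2", 0.0],
    ["vuo+N+Pl+Par", 0.0],
]

def parse_analysis(analysis):
    """Split 'lemma+Tag1+Tag2' into (lemma, [tags])."""
    lemma, *tags = analysis.split("+")
    return lemma, tags

def lemmas_by_pos(analyses):
    """Group unique lemmas by their first tag (assumed to be the part of speech)."""
    grouped = {}
    for analysis, _weight in analyses:
        lemma, tags = parse_analysis(analysis)
        pos = tags[0] if tags else None
        lemmas = grouped.setdefault(pos, [])
        if lemma not in lemmas:
            lemmas.append(lemma)
    return grouped

print(parse_analysis("voi+N+Sg+Par"))  # ('voi', ['N', 'Sg', 'Par'])
print(lemmas_by_pos(analyses))         # {'N': ['voi', 'vuo'], 'V': ['voittaa']}
```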

Morphological generation

From a lemma and a morphological analysis, it's possible to generate the desired word form.

from uralicNLP import uralicApi
uralicApi.generate("käsi+N+Sg+Par", "fin")
>>[['kättä', 0.0]]

An example of generating the singular partitive form for the Finnish noun käsi. The result is kättä. The default generator is a regular normative generator. uralicApi.generate("käsi+N+Sg+Par", "fin", dictionary_forms=True) uses a normative dictionary generator and uralicApi.generate("käsi+N+Sg+Par", "fin", descriptive=True) a descriptive generator.
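The input to generate() has the same lemma+tags shape as the analyzer output, so query strings can be assembled programmatically. A minimal sketch follows; build_generation_input is a hypothetical helper, and only tags that appear in the examples above are used.

```python
def build_generation_input(lemma, tags):
    """Join a lemma and its tags into a 'lemma+Tag1+Tag2' query string."""
    return "+".join([lemma, *tags])

# Singular and plural partitive queries for the noun käsi.
queries = [
    build_generation_input("käsi", ["N", number, "Par"])
    for number in ("Sg", "Pl")
]
print(queries)  # ['käsi+N+Sg+Par', 'käsi+N+Pl+Par']
```

Each resulting string could then be passed to uralicApi.generate(query, "fin").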

Morphological segmentation

UralicNLP makes it possible to split a word form into morphemes. (Note: this does not work with all languages.)

from uralicNLP import uralicApi
uralicApi.segment("luutapiirinikin", "fin")
>>[['luu', 'tapiiri', 'ni', 'kin'], ['luuta', 'piiri', 'ni', 'kin']]

In the example, the word luutapiirinikin has two possible interpretations, luu|tapiiri and luuta|piiri, and the segmentation is done for both.
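The returned segmentations are plain lists of strings, so they are easy to post-process. The sketch below formats each one with a visible separator and checks whether the morphemes concatenate back to the original word form. That check succeeds for this example, but it is an assumption rather than a guarantee for every word or language. format_segmentation is a hypothetical helper.

```python
# Output copied from the segment("luutapiirinikin", "fin") example above.
word = "luutapiirinikin"
segmentations = [
    ["luu", "tapiiri", "ni", "kin"],
    ["luuta", "piiri", "ni", "kin"],
]

def format_segmentation(morphemes, separator="-"):
    """Join morphemes with a visible separator for display."""
    return separator.join(morphemes)

for morphemes in segmentations:
    rejoined = "".join(morphemes)
    print(format_segmentation(morphemes), rejoined == word)
# luu-tapiiri-ni-kin True
# luuta-piiri-ni-kin True
```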

Disambiguation

This section has been moved to the UralicNLP wiki page on disambiguation.

Dictionaries

Learn more about dictionaries in the wiki page on dictionaries.

Parsing UD CoNLL-U annotated TreeBank data

UralicNLP comes with tools for parsing and searching CoNLL-U formatted data. Please refer to the Wiki for the UD parser documentation.

Other functionalities

Cite

If you use UralicNLP in an academic publication, please cite it as follows:

Hämäläinen, Mika. (2019). UralicNLP: An NLP Library for Uralic Languages. Journal of Open Source Software, 4(37), 1345. https://doi.org/10.21105/joss.01345

@article{uralicnlp_2019, 
    title={{UralicNLP}: An {NLP} Library for {U}ralic Languages},
    DOI={10.21105/joss.01345}, 
    journal={Journal of Open Source Software}, 
    author={Mika Hämäläinen}, 
    year={2019}, 
    volume={4},
    number={37},
    pages={1345}
}

For citing the FSTs and CGs, see uralicApi.model_info(language).

The FST and CG tools and dictionaries come mostly from the GiellaLT repositories and Apertium.
