simplemma

A simple multilingual lemmatizer for Python.

Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word’s lemma, or dictionary form. Unlike stemming, lemmatization outputs word units that are still valid linguistic forms.
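The contrast with stemming can be illustrated with a small, self-contained sketch. Everything in it is hypothetical: the lookup table, `toy_lemmatize` and `toy_stem` are made up for illustration and are not part of simplemma or its data.

```python
# Hypothetical inflected-form -> lemma table (illustration only, not simplemma's data).
LEMMA_TABLE = {
    "studies": "study",
    "studying": "study",
    "mice": "mouse",
}


def toy_lemmatize(word):
    """Dictionary lookup: return the listed lemma, else the lowercased input."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)


def toy_stem(word):
    """Naive suffix stripping, shown only for contrast with lemmatization."""
    key = word.lower()
    for suffix in ("ing", "ies", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if key.endswith(suffix) and len(key) > len(suffix) + 2:
            return key[: -len(suffix)]
    return key


for w in ("studies", "studying", "mice"):
    print(w, "->", toy_lemmatize(w), "/", toy_stem(w))
```

The lookup returns dictionary forms such as "study" and "mouse", while the suffix stripper produces the truncated "stud" for "studies" and leaves the irregular plural "mice" untouched.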

In modern natural language processing (NLP), this task is often handled indirectly by more complex systems that cover a whole processing pipeline. However, there appears to be no straightforward way to address lemmatization on its own in Python, although the task is useful in information retrieval and NLP.

Simplemma provides a simple and multilingual approach (currently 22 languages, see list below) to look for base forms or lemmata. It may not be as powerful as full-fledged solutions but it is generic, easy to install and straightforward to use. By design it should be reasonably fast and work in a large majority of cases, without being perfect. With its comparatively small footprint it is especially useful when speed and simplicity matter, for educational purposes or as a baseline system for lemmatization and morphological analysis.

Installation

The current library is written in pure Python with no dependencies:

pip install simplemma (or pip3 where applicable)

Usage

Simplemma is used by selecting a language of interest and then applying its data to one word or to a list of words.

>>> import simplemma
# decide which language data to load
>>> langdata = simplemma.load_data('de')
# apply it on a word form
>>> myword = 'sind'
>>> simplemma.lemmatize(myword, langdata)
'sein'
# grab a list of tokens
>>> mytokens = ['Hier', 'sind', 'Tokens']
>>> for token in mytokens:
...     simplemma.lemmatize(token, langdata)
...
'hier'
'sein'
'tokens'
# list comprehensions can be faster
>>> [simplemma.lemmatize(t, langdata) for t in mytokens]
['hier', 'sein', 'tokens']

Chaining several languages can improve coverage:

>>> langdata = simplemma.load_data('de', 'en')
>>> simplemma.lemmatize('Tokens', langdata)
'token'
>>> langdata = simplemma.load_data('en')
>>> simplemma.lemmatize('spaghetti', langdata)
'spaghetti'
>>> langdata = simplemma.load_data('en', 'it')
>>> simplemma.lemmatize('spaghetti', langdata)
'spaghetto'
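One way to picture the chaining behaviour shown above is a lookup that tries each language table in the order given and returns the first hit. The sketch below is an assumption about the general idea, not simplemma's actual implementation; `TOY_DATA`, `toy_load` and `toy_chain_lemmatize` are hypothetical names, and the lowercasing fallback is inferred from the outputs in the examples.

```python
# Hypothetical miniature tables; the real language data is far larger.
TOY_DATA = {
    "de": {"sind": "sein"},
    "en": {"tokens": "token"},
    "it": {"spaghetti": "spaghetto"},
}


def toy_load(*langs):
    """Return the tables for the requested languages, in the order given."""
    return [TOY_DATA[lang] for lang in langs]


def toy_chain_lemmatize(token, tables):
    """Return the first hit across the tables, else the lowercased token."""
    key = token.lower()
    for table in tables:
        if key in table:
            return table[key]
    return key


print(toy_chain_lemmatize("spaghetti", toy_load("en")))
print(toy_chain_lemmatize("spaghetti", toy_load("en", "it")))
print(toy_chain_lemmatize("Tokens", toy_load("de", "en")))
```

Under this model the order of the language codes matters: a word listed in several tables is resolved by the first table that contains it.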

In some cases a greedier algorithm gives better results. It is activated by default and can be switched off with greedy=False:

>>> langdata = simplemma.load_data('de')
>>> simplemma.lemmatize('angekündigten', langdata)
'ankündigen' # infinitive verb
>>> simplemma.lemmatize('angekündigten', langdata, greedy=False)
'angekündigt' # past participle
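The text does not say how the greedy mode works internally. One possible reading, offered purely as an assumption, is that a result which is itself listed in the table gets looked up again. The sketch below models that idea with a hypothetical two-entry table; `TOY_TABLE`, `toy_greedy_lemmatize` and `max_steps` are illustrative names and not part of simplemma.

```python
# Hypothetical table in which the participle is itself listed with a lemma.
TOY_TABLE = {
    "angekündigten": "angekündigt",
    "angekündigt": "ankündigen",
}


def toy_greedy_lemmatize(token, table, greedy=True, max_steps=5):
    """Single lookup, optionally followed by repeated lookups of the result."""
    key = token.lower()
    result = table.get(key, key)
    if not greedy:
        return result
    steps = 0
    # Follow the chain while the current result is itself an entry;
    # max_steps guards against cycles in the table.
    while result in table and table[result] != result and steps < max_steps:
        result = table[result]
        steps += 1
    return result


print(toy_greedy_lemmatize("angekündigten", TOY_TABLE))
print(toy_greedy_lemmatize("angekündigten", TOY_TABLE, greedy=False))
```

With this toy table the greedy call ends on the infinitive and the non-greedy call stops at the participle, which matches the two outputs shown above.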

Caveats:

# don't expect too much though
>>> langdata = simplemma.load_data('it')
# this diminutive form isn't in the model data
>>> simplemma.lemmatize('spaghettini', langdata)
'spaghettini' # should read 'spaghettino'
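A quick way to spot such gaps is to list the tokens that come back unchanged. The helper below is a minimal sketch: `unchanged_tokens` and the stand-in `stub_lemmatize` are hypothetical, and the comparison against the lowercased input assumes the lowercasing behaviour seen in the examples above. A token that is returned unchanged may simply be a lemma already, so the result is a list of candidates to check, not a list of errors.

```python
def unchanged_tokens(tokens, lemmatize_fn):
    """Return the tokens whose output equals their lowercased input."""
    return [t for t in tokens if lemmatize_fn(t) == t.lower()]


def stub_lemmatize(token):
    """Stand-in lemmatizer with a single known form (illustration only)."""
    known = {"spaghetti": "spaghetto"}
    key = token.lower()
    return known.get(key, key)


print(unchanged_tokens(["spaghetti", "spaghettini"], stub_lemmatize))
```

With the library itself, the second argument could be a small wrapper such as `lambda t: simplemma.lemmatize(t, langdata)`, using the call signature shown in the usage examples.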

Supported languages

The following languages are available using their ISO 639-1 code:

bg: Bulgarian (low coverage)
ca: Catalan
cs: Czech (low coverage)
cy: Welsh
de: German (see also this list)
en: English (alternative: LemmInflect)
es: Spanish
fa: Persian (low coverage)
fr: French
ga: Irish
gd: Gaelic
gl: Galician
gv: Manx
hu: Hungarian (low coverage)
it: Italian
pt: Portuguese
ro: Romanian
ru: Russian (alternative: pymorphy2)
sk: Slovak
sl: Slovenian (low coverage)
sv: Swedish (alternative: lemmy)
uk: Ukrainian (alternative: pymorphy2)

A "low coverage" mention means you would probably be better off with a language-specific library, although simplemma will still work to a limited extent. Open-source alternatives for Python are referenced where available.

Free software: MIT license
Documentation: https://github.com/adbar/simplemma

Roadmap

[ ] Integrate further lemmatization lists
[ ] Function as a meta-package?
[ ] Integrate optional, more complex models?

Credits

The current version basically acts as a wrapper for lemmatization lists by Michal Měchura (Open Database License). This rule-based approach, built on inflection and lemmatization dictionaries, is still used today in popular libraries such as spaCy.

Contributions

Feel free to contribute, notably by filing issues for feedback, bug reports, or links to further lemmatization lists, rules and tests.

You can also contribute to this lemmatization list repository.

Release history

1.1.2 (Nov 19, 2024)
1.1.1 (Aug 8, 2024)
1.1.0 (Aug 6, 2024)
1.0.0 (May 31, 2024)
0.9.1 (Jan 20, 2023)
0.9.0 (Oct 18, 2022)
0.8.2 (Sep 5, 2022)
0.8.1 (Sep 1, 2022)
0.8.0 (Aug 2, 2022)
0.7.0 (Jun 16, 2022)
0.6.0 (Apr 6, 2022)
0.5.0 (Nov 19, 2021)
0.4.0 (Oct 19, 2021)
0.3.0 (Apr 8, 2021)
0.2.2 (Feb 24, 2021)
0.2.1 (Feb 2, 2021)
0.2.0 (Jan 25, 2021)
0.1.0 (Jan 18, 2021), this version

Download files

Source Distribution

simplemma-0.1.0.tar.gz (6.7 kB)

Uploaded Jan 18, 2021 Source

Built Distribution

simplemma-0.1.0-py3-none-any.whl (39.5 MB)

Uploaded Jan 18, 2021 Python 3

File details

Details for the file simplemma-0.1.0.tar.gz.

File metadata

Download URL: simplemma-0.1.0.tar.gz
Upload date: Jan 18, 2021
Size: 6.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.3.0 pkginfo/1.5.0.1 requests/2.25.0 setuptools/46.0.0 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.6.9

File hashes

Hashes for simplemma-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`356cbc67036ffc43b2a5cd5d418e855a42e89e3fe484a0caf379feeca91a6973`
MD5	`6e76ce2ae3b35be95c22de67ffd07da6`
BLAKE2b-256	`db5f47134ab9c7516fdaa66861715d99120d8b7097445ce113a75fce2309c2de`

File details

Details for the file simplemma-0.1.0-py3-none-any.whl.

File metadata

Download URL: simplemma-0.1.0-py3-none-any.whl
Upload date: Jan 18, 2021
Size: 39.5 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.3.0 pkginfo/1.5.0.1 requests/2.25.0 setuptools/46.0.0 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.6.9

File hashes

Hashes for simplemma-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`32ee14b19b7ab5b43673ef6c1d0571ef2c1cf118c65b350e6967e8609f3a5c6f`
MD5	`e6d14bc2dc851ad0e26c64f91cf6ca1c`
BLAKE2b-256	`a7da73413ba73d186190f04a00ca0f6f1853a8e705e6ff2e5d723ec56fa26b57`
