
A simple multilingual lemmatizer for Python.

Project description


Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word’s lemma, or dictionary form. Unlike stemming, lemmatization outputs word units that are still valid linguistic forms.

In modern natural language processing (NLP), this task is often tackled indirectly by more complex systems encompassing a whole processing pipeline. However, there is no straightforward way to address lemmatization on its own in Python, although the task is useful in information retrieval as well as NLP.

Simplemma provides a simple and multilingual approach to the search for base forms or lemmata. It may not be as powerful as full-fledged solutions, but it is generic, easy to install and straightforward to use. By design it should be reasonably fast and work in a large majority of cases, without being perfect. Currently, 35 languages are partly or fully supported; see the table below.

With its comparatively small footprint it is especially useful when speed and simplicity matter, for educational purposes or as a baseline system for lemmatization and morphological analysis.

Installation

The current library is written in pure Python with no dependencies:

pip install simplemma (or pip3 where applicable)

Usage

Word-by-word

Simplemma is used by selecting a language of interest and then applying its data to single words or lists of words.

>>> import simplemma
# get a word
>>> myword = 'masks'
# decide which language data to load
>>> langdata = simplemma.load_data('en')
# apply it on a word form
>>> simplemma.lemmatize(myword, langdata)
'mask'
# grab a list of tokens
>>> mytokens = ['Hier', 'sind', 'Vaccines']
>>> langdata = simplemma.load_data('de')
>>> for token in mytokens:
...     simplemma.lemmatize(token, langdata)
...
'hier'
'sein'
'Vaccines'
# list comprehensions can be faster
>>> [simplemma.lemmatize(t, langdata) for t in mytokens]
['hier', 'sein', 'Vaccines']

Chaining several languages can improve coverage:

>>> langdata = simplemma.load_data('de', 'en')
>>> simplemma.lemmatize('Vaccines', langdata)
'vaccine'
>>> langdata = simplemma.load_data('it')
>>> simplemma.lemmatize('spaghettis', langdata)
'spaghettis'
>>> langdata = simplemma.load_data('it', 'fr')
>>> simplemma.lemmatize('spaghettis', langdata)
'spaghetti'
>>> simplemma.lemmatize('spaghetti', langdata)
'spaghetto'

There are cases in which a greedier decomposition and lemmatization algorithm is better. It is deactivated by default:

# same example as before, now solved in one step by the greedy algorithm
>>> simplemma.lemmatize('spaghettis', langdata, greedy=True)
'spaghetto'
# a German case
>>> langdata = simplemma.load_data('de')
>>> simplemma.lemmatize('angekündigten', langdata, greedy=True)
'ankündigen' # infinitive verb
>>> simplemma.lemmatize('angekündigten', langdata, greedy=False)
'angekündigt' # past participle

Tokenization

A simple tokenization is included for convenience:

>>> from simplemma import simple_tokenizer
>>> simple_tokenizer('Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.')
['Lorem', 'ipsum', 'dolor', 'sit', 'amet', ',', 'consectetur', 'adipiscing', 'elit', ',', 'sed', 'do', 'eiusmod', 'tempor', 'incididunt', 'ut', 'labore', 'et', 'dolore', 'magna', 'aliqua', '.']

The function text_lemmatizer() chains tokenization and lemmatization. It can take greedy and silent as arguments:

>>> from simplemma import text_lemmatizer
>>> langdata = simplemma.load_data('pt')
>>> text_lemmatizer('Sou o intervalo entre o que desejo ser e os outros me fizeram.', langdata)
['ser', 'o', 'intervalo', 'entre', 'o', 'que', 'desejo', 'ser', 'e', 'o', 'outro', 'me', 'fazer', '.']
# caveat: desejo is also a noun, it should be lemmatized to desejar (verb) here
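
Both options can be passed as keyword arguments. A minimal sketch reusing the Portuguese data above; the precise effect of silent (presumably toggling error reporting for words missing from the data) is an assumption here:

# greedy and silent are optional keyword arguments
>>> text_lemmatizer('Sou o intervalo entre o que desejo ser e os outros me fizeram.', langdata, greedy=True, silent=True)
# returns a list of lemmas as in the example above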

Caveats

# don't expect too much though
>>> langdata = simplemma.load_data('it')
# this diminutive form isn't in the model data
>>> simplemma.lemmatize('spaghettini', langdata)
'spaghettini' # should read 'spaghettino'
# the algorithm cannot choose between valid alternatives yet
>>> langdata = simplemma.load_data('es')
>>> simplemma.lemmatize('son', langdata)
'son' # a valid common noun, but what about the verb form?

As the focus lies on overall coverage, some short and frequent words (typically: pronouns) may need post-processing; this generally concerns 10-20 tokens per language.
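
A minimal sketch of such post-processing, assuming a small hand-made override table (the entries and the helper function below are hypothetical and not part of simplemma):

import simplemma

langdata = simplemma.load_data('de')
# hypothetical hand-made fixes for a handful of short, frequent word forms
OVERRIDES = {'sie': 'sie', 'ihr': 'ihr'}

def lemmatize_with_overrides(token, data):
    # check the hand-made table first, fall back to simplemma otherwise
    return OVERRIDES.get(token.lower(), simplemma.lemmatize(token, data))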

The greedy algorithm rarely produces invalid forms. Still, it is mainly useful for long words and neologisms rather than for general-purpose processing.

Bug reports via the issues page are welcome.

Supported languages

The following languages are available using their ISO 639-1 code:

Available languages (2021-02-02)

Code  Language       Word pairs  Scores  Comments
bg    Bulgarian         69,680           low coverage
ca    Catalan          583,969
cs    Czech             35,021           low coverage
cy    Welsh            349,638
da    Danish           555,559           alternative: lemmy
de    German           623,249   0.94    on UD DE-GSD. See also this list
en    English          136,226   0.93    on UD EN-GUM. Alternative: LemmInflect
es    Spanish          666,016   0.87    on UD ES-GSD.
et    Estonian         112,501           low coverage
fa    Persian            9,333           low coverage
fi    Finnish        2,096,328           alternative: voikko
fr    French           217,091   0.93    on UD FR-GSD.
ga    Irish            366,086
gd    Gaelic            49,080
gl    Galician         386,714
gv    Manx              63,667
hu    Hungarian        446,650
id    Indonesian        36,461
it    Italian          333,682
ka    Georgian          65,938
la    Latin             96,409           low coverage
lb    Luxembourgish    305,398
lt    Lithuanian       247,418
lv    Latvian           57,154
nl    Dutch            228,123
pt    Portuguese       933,730
ro    Romanian         313,181
ru    Russian          608,770           alternative: pymorphy2
sk    Slovak           847,383
sl    Slovene           97,460           low coverage
sv    Swedish          663,984           alternative: lemmy
tr    Turkish        1,333,970
uk    Ukrainian        190,725           alternative: pymorphy2
ur    Urdu              28,848

A "low coverage" mention means that you would probably be better off with a language-specific library, though simplemma will still work to a limited extent. Open-source alternatives for Python are referenced where available.

The scores are calculated on Universal Dependencies treebanks, on single word tokens (including some contractions but not merged prepositions); they describe to what extent simplemma can accurately map tokens to their lemma form.
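
As a rough illustration, such a score can be approximated by comparing simplemma's output with the gold lemmas in a CoNLL-U file. The sketch below is not the exact evaluation setup used for the table, and the treebank path is a placeholder:

import simplemma

langdata = simplemma.load_data('de')
total = correct = 0
with open('de_gsd-ud-test.conllu', encoding='utf-8') as treebank:  # placeholder path
    for line in treebank:
        if not line.strip() or line.startswith('#'):
            continue  # skip blank lines and sentence-level comments
        fields = line.rstrip('\n').split('\t')
        if not fields[0].isdigit():
            continue  # skip multi-word token ranges such as '1-2'
        form, gold_lemma = fields[1], fields[2]
        total += 1
        if simplemma.lemmatize(form, langdata) == gold_lemma:
            correct += 1

print(f'accuracy on single tokens: {correct / total:.2f}')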

Roadmap

  • [-] Add further lemmatization lists

  • [ ] Grammatical categories as option

  • [ ] Function as a meta-package?

  • [ ] Integrate optional, more complex models?

Credits

The current version basically acts as a wrapper around freely available lemmatization lists.

This rule-based approach, relying on inflection and lemmatization dictionaries, is still used to this day in popular libraries such as spaCy.

Contributions

Feel free to contribute, notably by filing issues for feedback, bug reports, or links to further lemmatization lists, rules and tests.

You can also contribute to this lemmatization list repository.

Other solutions

See lists: German-NLP and other awesome-NLP lists.

For a more complex but universal approach in Python see universal-lemmatizer.

