A Python 2/3 wrapper for the Lemmagen lemmatizer, supporting 19 languages.

About

lemmagen3 is a Python 2/3 wrapper for the Lemmagen lemmatizer (version 2.2).

It differs from other Lemmagen wrappers on PyPI in that it offers a clean, fast object-oriented interface built with the excellent pybind11 library and supports an additional language (Croatian).

The models for Slovene, Croatian and Serbian have been significantly updated and use frequency data to prefer the most frequent lemma, e.g., for Slovene: je->biti instead of je->jesti, mene->jaz instead of mene->mena, od->od instead of od->oda, etc.

In total, 19 languages are supported:

  1. Bulgarian: bg
  2. Croatian: hr
  3. Czech: cs
  4. English: en
  5. Estonian: et
  6. Farsi/Persian: fa
  7. French: fr
  8. German: de
  9. Hungarian: hu
  10. Italian: it
  11. Macedonian: mk
  12. Polish: pl
  13. Romanian: ro
  14. Russian: ru
  15. Serbian: sr
  16. Slovak: sk
  17. Slovene: sl
  18. Spanish: es
  19. Ukrainian: uk
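
The list above can be kept as a small lookup table for validating user input before a lemmatizer is created. This is only a sketch: the names `LANGUAGE_NAMES` and `is_supported` are hypothetical helpers and not part of lemmagen3; at runtime, `Lemmatizer.list_supported_languages()` remains the authoritative source.

```python
# Codes and names copied from the list above.
# LANGUAGE_NAMES and is_supported are illustrative helpers, not lemmagen3 API.
LANGUAGE_NAMES = {
    "bg": "Bulgarian", "hr": "Croatian", "cs": "Czech", "en": "English",
    "et": "Estonian", "fa": "Farsi/Persian", "fr": "French", "de": "German",
    "hu": "Hungarian", "it": "Italian", "mk": "Macedonian", "pl": "Polish",
    "ro": "Romanian", "ru": "Russian", "sr": "Serbian", "sk": "Slovak",
    "sl": "Slovene", "es": "Spanish", "uk": "Ukrainian",
}


def is_supported(code):
    """Return True if `code` is one of the 19 listed language codes."""
    return code.strip().lower() in LANGUAGE_NAMES
```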

Installation and requirements

pip install lemmagen3

will install the module and the language model files. Please note that on Python <= 3.5 and Python 2.7 the package is built from source, so you will need a C++ compiler.

Note: If you use Python 3.5.0 or 3.5.1, you will likely get the error shown below. This is a known bug in these two versions, so please consider upgrading your Python.

ImportError: ..._lemmagen.cpython-35m-x86_64-linux-gnu.so: undefined symbol: _PyThreadState_UncheckedGet
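
If you want to fail early with a readable message instead of the ImportError above, a version check before importing the package is one option. The following is a minimal sketch; `is_affected_python` and `AFFECTED_VERSIONS` are hypothetical names introduced here, not something lemmagen3 provides.

```python
import sys

# The two CPython releases named in the note above.
AFFECTED_VERSIONS = {(3, 5, 0), (3, 5, 1)}


def is_affected_python(version_info=None):
    """Return True if the given (or running) interpreter is 3.5.0 or 3.5.1."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:3]) in AFFECTED_VERSIONS


if is_affected_python():
    sys.stderr.write(
        "Python 3.5.0/3.5.1 detected: importing lemmagen3 will likely fail; "
        "consider upgrading Python.\n"
    )
```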

How to use

The following snippet illustrates how to use lemmagen3.

from lemmagen3 import Lemmatizer

# first, we can list all supported languages
print(Lemmatizer.list_supported_languages())

# then, create a few lemmatizer objects using ISO 639-1 language codes
# (English, Slovene and Russian)

lem_en = Lemmatizer('en')
lem_sl = Lemmatizer('sl')
lem_ru = Lemmatizer('ru')

# now lemmatize the word "cats" in all three languages
print(lem_en.lemmatize('cats'))
print(lem_sl.lemmatize('mačke'))
print(lem_ru.lemmatize('коты'))

# you can also change the language for an existing Lemmatizer object
# lem_en will now become a French lemmatizer:
lem_en.load_language('fr')

# finally, you can also load your own Lemmagen model
my_lem = Lemmatizer()
my_lem.load_model('/path/to/my/model')

Note that the lemmatize function accepts a single string token and does not split the input string! If you want to lemmatize a chunk of text, you have to tokenize it first, e.g.:

sentence = 'cats hate dogs'
tokens = sentence.split()
sentence_lemmatized = ' '.join([lem_en.lemmatize(token) for token in tokens])
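
Splitting on whitespace leaves punctuation attached to words ('dogs.' is passed to lemmatize as one token). A slightly more careful approach, sketched below, separates punctuation with a regular expression and lemmatizes only the word tokens. The helper `lemmatize_text` is a hypothetical name; it only assumes that the object passed in has a `lemmatize(token)` method, as the Lemmatizer objects above do. The tokenizer is deliberately simple and will, for instance, split contractions such as "don't" at the apostrophe.

```python
import re

# A token is either a run of word characters or one non-space,
# non-word character (punctuation).
TOKEN_RE = re.compile(r"\w+|[^\w\s]", re.UNICODE)
WORD_RE = re.compile(r"\w", re.UNICODE)


def lemmatize_text(lemmatizer, text):
    """Tokenize `text`, lemmatize word tokens, pass punctuation through."""
    lemmas = []
    for token in TOKEN_RE.findall(text):
        if WORD_RE.match(token):
            lemmas.append(lemmatizer.lemmatize(token))
        else:
            lemmas.append(token)
    return lemmas
```

With the `lem_en` object from the snippet above, `lemmatize_text(lem_en, 'cats hate dogs.')` returns a list with one entry per token, which can be joined back into a string if needed.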

Note also that lemmagen3 operates on Unicode strings, so if you use Python 2 make sure that your input is a unicode string rather than a byte string.
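
One way to guard against byte strings is to normalize the input before calling lemmatize. The sketch below uses a hypothetical helper, `to_text`, and assumes the bytes are UTF-8 encoded unless another encoding is given; it behaves the same on Python 2 and Python 3.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as a Unicode string, decoding byte strings first."""
    # On Python 2, `bytes` is an alias of `str`, so plain byte strings
    # are decoded while `unicode` objects pass through unchanged.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value
```

A call such as `lem_sl.lemmatize(to_text(raw_token))` then works regardless of whether `raw_token` arrived as bytes or as text.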

License

Please note that this repository contains code and binary models compiled and built from different sources which are under different licenses:

  1. C++ files and headers come from Lemmagen and are modified and adapted to work as a Python module (LGPL)
  2. Binary models are built from Multext and Multext-east sources:
    • Language resources used to build Farsi/Persian, Macedonian, Polish, and Russian models are for non-commercial use only.
    • Language resources for the other supported languages are released under CC BY-SA 4.0.

The rest of the code in this repository was created by the author and is licensed under the MIT license.

Authors
