Polish stemmer.

Project description

Python port of Stempel, an algorithmic stemmer for the Polish language, originally written in Java.

The original stemmer was implemented as part of the Egothor Project, adopted virtually unchanged into the Stempel Stemmer Java library by Andrzej Białecki, and later included in Apache Lucene, a free and open-source search engine library.

This package also includes a high-quality stemming table for Polish, pretrained by Andrzej Białecki on 20,000 training sets.

The port does not include code for compiling stemming tables.

How to use

Install in your local environment:

pip install pystempel

Use in your code:

>>> from stempel import StempelStemmer
>>> stemmer = StempelStemmer.default()
>>> for word in ['książki', 'książki', 'książkami', 'książkowa', 'książkowymi']:
...   print(stemmer.stem(word))
...
książek
książek
książek
książkowy
książkowy
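
The same call can be applied to running text. The following is a minimal sketch, not part of pystempel: `stem_text` is a hypothetical helper that accepts any object exposing a `stem(word)` method, such as the `stemmer` created above. It assumes `stem()` may return `None` for a word it cannot reduce, and in that case keeps the original token.

```python
import re

# Hypothetical helper, not part of pystempel.
# Matches runs of word characters, including Polish letters.
_WORD = re.compile(r"\w+")


def stem_text(stemmer, text):
    """Lowercase `text`, split it into words and stem each one."""
    stems = []
    for token in _WORD.findall(text.lower()):
        stem = stemmer.stem(token)
        # Assumption: stem() may return None when no stem is found,
        # so fall back to the unmodified token.
        stems.append(stem if stem else token)
    return stems
```

With the stemmer from the example above, `stem_text(stemmer, "Czytam książki")` would return one stem per word in the sentence.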

Choosing between port and wrapper

If you work on an NLP project in Python, you can choose between a Python port and a Python wrapper. A port is what pystempel aims to be: a translation of the Java implementation into Python. A wrapper is what I used in the tests: Python functions that call the original Java implementation of the stemmer. You can find more about wrappers and ports in the Stack Overflow comparison post. The comparison below should help you decide:

  • Same accuracy. I verified the Python port by comparing its output with that of the original Java implementation for 331224 words from the Free Polish dictionary (sjp.pl); it returns the same output for 100% of the words.

  • Similar performance. On this dataset both versions performed comparably: the Python port completed stemming in 4.4 seconds, the Python wrapper in 5 seconds (Intel Core i5-6000 3.30 GHz, 16GB RAM, Windows 10, OpenJDK).

  • Different setup. The Python wrapper additionally requires installing Cython and pyjnius. It also makes debugging harder, since you have to switch between two programming languages.
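
The accuracy and performance checks described above can be approximated with a small harness. This is only a sketch under stated assumptions, not the project's own benchmark (that lives in the test suite): `compare_stemmers` is a hypothetical function, and `stem_a` and `stem_b` stand for any two callables that map a word to its stem, for example the `stem` method of the port and a function that calls the Java implementation.

```python
import time


def compare_stemmers(stem_a, stem_b, words):
    """Return (mismatches, seconds_a, seconds_b) for two stem callables."""
    words = list(words)  # allow any iterable and reuse it for both runs

    start = time.perf_counter()
    out_a = [stem_a(w) for w in words]
    seconds_a = time.perf_counter() - start

    start = time.perf_counter()
    out_b = [stem_b(w) for w in words]
    seconds_b = time.perf_counter() - start

    # Collect every word for which the two implementations disagree.
    mismatches = [(w, a, b) for w, a, b in zip(words, out_a, out_b) if a != b]
    return mismatches, seconds_a, seconds_b
```

An empty `mismatches` list corresponds to the "same output for 100% of the words" result; the two timings are wall-clock seconds and will vary between machines.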

Development setup

To set up a development environment, you will need Anaconda installed.

conda create -n stempel-stemmer
conda activate stempel-stemmer
conda install -c conda-forge --file requirements.txt

To run tests:

curl https://repo1.maven.org/maven2/org/apache/lucene/lucene-analyzers-stempel/8.1.1/lucene-analyzers-stempel-8.1.1.jar > stempel-8.1.1.jar
python -m pytest ./

To run benchmark:

python tests\test_benchmark.py

Licensing

Most of the code is covered by the Egothor Open Source License, an Apache-style license. The rest of the code and the pretrained stemming table are covered by the Apache License 2.0. Unit tests use the Free Polish dictionary for spell-checking from sjp.pl, also covered by the Apache License 2.0.

Other languages

  • Estem is an Erlang wrapper (not a port) for the Stempel stemmer.
