Polish stemmer.

Python port of Stempel, an algorithmic stemmer for the Polish language, originally written in Java.

The original stemmer was implemented as part of the Egothor Project, then taken virtually unchanged into the Stempel Stemmer Java library by Andrzej Białecki, and later included in Apache Lucene, a free and open-source search engine library. It is also used by the Elasticsearch search engine.

This package also includes high-quality stemming tables for Polish: the original one, pretrained by Andrzej Białecki on 20,000 training sets, and a new one, which I pretrained on 259,080 training sets from the Polimorf dictionary.

The port does not include code for compiling stemming tables.

How to use

Install in your local environment:

pip install pystempel

Use in your code:

>>> from stempel import StempelStemmer

Choose the original (called default) version of the stemmer:

>>> stemmer = StempelStemmer.default()

or a version with the new stemming table pretrained on training sets from the Polimorf dictionary:

>>> stemmer = StempelStemmer.polimorf()

Stem:

>>> for word in ['książka', 'książki', 'książkami', 'książkowa', 'książkowymi']:
...   print(stemmer.stem(word))
...
książek
książek
książek
książkowy
książkowy
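
A common next step is to group word forms that share a stem, for example when building a search index. The sketch below is only an illustration: `group_by_stem` is a hypothetical helper, not part of pystempel, and `SAMPLE_STEMS` is a stand-in lookup built from the output shown above so that the snippet runs without the package. With pystempel installed, `stemmer.stem` could presumably be passed in place of `SAMPLE_STEMS.get`.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional

# Stand-in for stemmer.stem, copied from the sample output above.
SAMPLE_STEMS: Dict[str, str] = {
    "książka": "książek",
    "książki": "książek",
    "książkami": "książek",
    "książkowa": "książkowy",
    "książkowymi": "książkowy",
}


def group_by_stem(
    words: Iterable[str], stem: Callable[[str], Optional[str]]
) -> Dict[str, List[str]]:
    """Group words under the stem returned by the given stem function."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for word in words:
        # Assumption: if the stem function yields nothing (None or an empty
        # string), keep the word itself as the grouping key.
        key = stem(word) or word
        groups[key].append(word)
    return dict(groups)


groups = group_by_stem(SAMPLE_STEMS, SAMPLE_STEMS.get)
for stem_value, forms in groups.items():
    print(stem_value, forms)
```

The fallback to the word itself is a defensive choice made for this sketch; it does not describe documented pystempel behaviour.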

Choosing stemming table

Performance differs significantly between the original (default) stemming table and the new, Polimorf-based one. With the default table the stemmer understems, i.e., it produces different stems for multiple forms of the same lemma, more often (63%) than with the Polimorf-based table (13%). However, the Polimorf-based table has a bigger file footprint (2.2 MB vs. 0.3 MB) and takes longer to load (7.5 seconds vs. 1.3 seconds), though loading happens only once, when a stemmer is created. Stemming is also slightly faster with the original table: ~60,000 vs. ~51,000 words per second. See the Evaluation Jupyter Notebook for detailed evaluation results.

Note also that the two stemming tables are licensed differently, and so is the data generated with each one. See the “Licensing” section for details.
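
To make the understemming figures above more concrete, here is a minimal sketch of one way such a rate could be computed: the share of lemmas whose forms end up with more than one distinct stem. This is an assumption about the metric, based only on the definition given in this section; the Evaluation notebook may compute it differently. `understemming_rate`, `FORMS_BY_LEMMA`, and `SAMPLE_STEMS` are hypothetical names, and the data is a toy set taken from the usage example.

```python
from typing import Callable, Dict, List


def understemming_rate(
    forms_by_lemma: Dict[str, List[str]], stem: Callable[[str], str]
) -> float:
    """Fraction of lemmas whose forms map to more than one distinct stem."""
    if not forms_by_lemma:
        return 0.0
    split = sum(
        1
        for forms in forms_by_lemma.values()
        if len({stem(form) for form in forms}) > 1
    )
    return split / len(forms_by_lemma)


# Toy data: two lemmas with a few inflected forms each (illustrative only).
FORMS_BY_LEMMA = {
    "książka": ["książka", "książki", "książkami"],
    "książkowy": ["książkowa", "książkowymi"],
}

# Stand-in for a real stemmer, copied from the sample output in "How to use".
SAMPLE_STEMS = {
    "książka": "książek",
    "książki": "książek",
    "książkami": "książek",
    "książkowa": "książkowy",
    "książkowymi": "książkowy",
}

# Every form of a lemma gets the same stem, so no lemma is split.
rate_sample = understemming_rate(FORMS_BY_LEMMA, SAMPLE_STEMS.__getitem__)
# With no stemming at all (identity), every lemma is split.
rate_identity = understemming_rate(FORMS_BY_LEMMA, lambda word: word)
print(rate_sample, rate_identity)
```

On a real dictionary, the same function could be called once per stemming table to compare the two.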

Choosing between port and wrapper

If you work on an NLP project in Python, you can choose between a Python port and a Python wrapper. A Python port is what pystempel aims to be: a translation of the Java implementation into Python. A Python wrapper is what I used in tests: Python functions that call the original Java implementation of the stemmer. You can find more about wrappers and ports in the Stack Overflow comparison post. Here, I compare both approaches to help you decide:

  • Same accuracy. I verified the Python port by comparing its output with that of the original Java implementation for 331,224 words from the Free Polish dictionary (sjp.pl); it returns the same output for 100% of the words.

  • Similar performance. On the dataset mentioned above, both versions performed comparably: the Python port completed stemming in 4.4 seconds and the Python wrapper in 5 seconds (Intel Core i5-6000 3.30 GHz, 16GB RAM, Windows 10, OpenJDK).

  • Different setup. The Python wrapper additionally requires installing Cython and pyjnius. It also makes debugging harder, since you have to switch between two programming languages.
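
The accuracy and performance comparison described above can be sketched as a small harness that runs two stem functions over the same word list, records any words on which they disagree, and times each run. This is not the project's actual test code: `compare_stemmers` is a hypothetical helper, and the two stand-in functions below merely play the roles of the port and the wrapper so that the snippet runs without pystempel or a JVM.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple

StemFn = Callable[[str], Optional[str]]


def compare_stemmers(
    words: Iterable[str], stem_a: StemFn, stem_b: StemFn
) -> Tuple[List[str], float, float]:
    """Return (words with differing output, seconds for A, seconds for B)."""
    word_list = list(words)

    start = time.perf_counter()
    out_a = [stem_a(word) for word in word_list]
    seconds_a = time.perf_counter() - start

    start = time.perf_counter()
    out_b = [stem_b(word) for word in word_list]
    seconds_b = time.perf_counter() - start

    mismatches = [w for w, a, b in zip(word_list, out_a, out_b) if a != b]
    return mismatches, seconds_a, seconds_b


# Stand-ins: two differently written functions with identical output.
def truncate_fast(word: str) -> str:
    return word[:6]


def truncate_slow(word: str) -> str:
    return "".join(list(word)[:6])


WORDS = ["książka", "książki", "książkami"]
mismatches, seconds_a, seconds_b = compare_stemmers(WORDS, truncate_fast, truncate_slow)
print(f"{len(mismatches)} mismatches, {seconds_a:.6f}s vs {seconds_b:.6f}s")
```

An empty mismatch list corresponds to the "100% same output" result reported above; timings from a toy list like this are not meaningful and are shown only to illustrate the shape of the measurement.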

Development setup

To set up the environment for development, you will need Anaconda installed.

conda env create --file environment.yml
conda activate pystempel-env

To run tests:

curl https://repo1.maven.org/maven2/org/apache/lucene/lucene-analyzers-stempel/8.1.1/lucene-analyzers-stempel-8.1.1.jar > stempel-8.1.1.jar
python -m pytest ./

To run the benchmark (the commands below use Windows Command Prompt syntax):

set PYTHONPATH=%PYTHONPATH%;%cd%
python tests\test_benchmark.py

Licensing

Alternatives

  • Estem is an Erlang wrapper (not a port) for the Stempel stemmer.

  • pl_stemmer is a Python stemmer based on Porter’s Algorithm.

  • polish-stem is a Python stemmer using Finite State Transducers.
