pystempel

Polish stemmer.

Python port of Stempel, an algorithmic stemmer for the Polish language, originally written in Java.

The original stemmer was implemented as part of the Egothor project, adopted virtually unchanged into the Stempel stemmer Java library by Andrzej Białecki, and later included in Apache Lucene, a free and open-source search engine library. It is also used by the Elasticsearch search engine.

This package also includes high-quality stemming tables for Polish: the original one, pretrained by Andrzej Białecki on 20,000 training sets, and a new one, pretrained by me on 259,080 training sets from the Polimorf dictionary.

The port does not include code for compiling stemming tables.

How to use

Install in your local environment:

pip install pystempel

Use in your code:

from pystempel import Stemmer

Choose the original (default) version of the stemmer:

stemmer = Stemmer.default()

or a version with a new stemming table pretrained on training sets from the Polimorf dictionary:

stemmer = Stemmer.polimorf()

Stem:

>>> for word in ['książka', 'książki', 'książkami', 'książkowa', 'książkowymi']:
...   print(stemmer(word))
...
książek
książek
książek
książkowy
książkowy
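
A common next step is to group inflected forms that share a stem. The sketch below is not part of pystempel: `group_by_stem` is a hypothetical helper that accepts any callable mapping a word to its stem, so a stemmer created with `Stemmer.default()` or `Stemmer.polimorf()` could be passed in directly. To keep the example runnable without the package installed, a small lookup table built from the session output above stands in for the real stemmer.

```python
from collections import defaultdict


def group_by_stem(words, stem):
    """Group words by the stem returned by `stem` (any callable: word -> stem)."""
    groups = defaultdict(list)
    for word in words:
        groups[stem(word)].append(word)
    return dict(groups)


# Stand-in for a real stemmer: the word -> stem pairs from the session above.
SAMPLE_STEMS = {
    "książka": "książek",
    "książki": "książek",
    "książkami": "książek",
    "książkowa": "książkowy",
    "książkowymi": "książkowy",
}

# Iterating over the dict yields its keys, i.e. the word forms.
groups = group_by_stem(SAMPLE_STEMS, SAMPLE_STEMS.get)
for stem, forms in groups.items():
    print(stem, forms)
```

With the package installed, replacing `SAMPLE_STEMS.get` with a stemmer instance should give the same grouping for these five words.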

Choosing stemming table

The original (default) and the new (Polimorf-based) stemming tables differ significantly in quality and cost. With the default table the stemmer understems, i.e., multiple forms of the same lemma yield different stems, far more often (63%) than with the Polimorf-based table (13%). However, the Polimorf-based table has a larger file footprint (2.2 MB vs. 0.3 MB) and takes longer to load (7.5 seconds vs. 1.3 seconds), though loading happens only once, when a stemmer is created. Stemming is also slightly faster with the original table: ~60,000 vs. ~51,000 words per second. See the Evaluation Jupyter Notebook for the detailed evaluation results.

Please also note that the two stemming tables are licensed differently, and so is the data generated with each of them. See the “Licensing” section for details.
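
To make the understemming figures quoted above more concrete, here is a minimal sketch of one way such a rate could be computed: the share of lemmas whose forms produce more than one distinct stem. This is an assumption about the metric, not the definition used in the Evaluation notebook; `understemming_rate`, the toy data, and the crude prefix "stemmer" are all hypothetical.

```python
def understemming_rate(forms_by_lemma, stem):
    """Share of lemmas whose forms yield more than one distinct stem."""
    if not forms_by_lemma:
        return 0.0
    split = sum(
        1
        for forms in forms_by_lemma.values()
        if len({stem(form) for form in forms}) > 1
    )
    return split / len(forms_by_lemma)


def crude_stem(word):
    """Deliberately crude stand-in for a stemmer: keep the first 5 letters."""
    return word[:5]


# Toy data: lemma -> some of its inflected forms.
forms_by_lemma = {
    "książka": ["książka", "książki", "książkami"],
    "dom": ["dom", "domu", "domami"],
}

rate = understemming_rate(forms_by_lemma, crude_stem)
print(f"{rate:.0%}")  # prints "50%": only the forms of "dom" split into several stems
```

Passing a real stemmer instance instead of `crude_stem`, together with a lemma-to-forms mapping from a dictionary, would let you compare the two tables on your own vocabulary.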

Choosing between port and wrapper

If you work on an NLP project in Python, you can choose between a Python port and a Python wrapper. A port is what pystempel aims to be: a translation of the Java implementation into Python. A wrapper is what I used in the tests: Python functions that call the original Java implementation of the stemmer. You can read more about wrappers and ports in the Stack Overflow comparison post. The comparison below may help you decide:

- Same accuracy. I verified the Python port by comparing its output with that of the original Java implementation for 331,224 words from the Free Polish dictionary (sjp.pl); it returns the same output for 100% of the words.
- Similar performance. On the same dataset, the Python port completed stemming in 4.4 seconds and the Python wrapper in 5 seconds (Intel Core i5-6000 3.30 GHz, 16 GB RAM, Windows 10, OpenJDK).
- Different setup. The Python wrapper additionally requires installing Cython and pyjnius. It also makes debugging harder, since you have to switch between two programming languages.
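
The accuracy check described above boils down to running two stemmers over the same word list and collecting every disagreement. The sketch below illustrates that idea only; it is not the project's test code. `find_mismatches` is a hypothetical helper, and the two lookup tables are stand-ins for the port and the wrapper (the stem "książk" is made up to force a mismatch).

```python
def find_mismatches(words, stem_a, stem_b):
    """Return (word, stem_a(word), stem_b(word)) wherever the two stemmers disagree."""
    mismatches = []
    for word in words:
        a, b = stem_a(word), stem_b(word)
        if a != b:
            mismatches.append((word, a, b))
    return mismatches


# Stand-ins for two stemmer implementations (any callables word -> stem work).
reference = {"książka": "książek", "książki": "książek"}.get
candidate = {"książka": "książek", "książki": "książk"}.get  # made-up wrong stem

words = ["książka", "książki"]
mismatches = find_mismatches(words, reference, candidate)
agreement = 1 - len(mismatches) / len(words)
print(f"agreement: {agreement:.0%}, mismatches: {mismatches}")
```

An empty `mismatches` list over the whole dictionary corresponds to the 100% agreement reported above.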

Options

To disable the progress bar shown while stemming tables are loading, set the environment variable DISABLE_TQDM=True.
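
If you prefer to set the variable from Python rather than from the shell, a minimal sketch follows. It assumes that setting the variable before importing the package is early enough; the source does not say when exactly the variable is read, so setting it first is the cautious choice. The `try`/`except` only keeps the snippet runnable when pystempel is not installed.

```python
import os

# Set the variable before the package is imported and before any table is loaded.
os.environ["DISABLE_TQDM"] = "True"

try:
    from pystempel import Stemmer
except ImportError:
    Stemmer = None  # pystempel is not installed in this environment

if Stemmer is not None:
    stemmer = Stemmer.default()  # expected to load without a progress bar
    print(stemmer("książki"))
```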

Development setup

To set up the development environment, you need Poetry 1.4.0 or higher installed.

poetry install
poetry shell
pre-commit install

To run the tests, download the original Java stemmer:

curl https://repo1.maven.org/maven2/org/apache/lucene/lucene-analyzers-stempel/8.1.1/lucene-analyzers-stempel-8.1.1.jar > stempel-8.1.1.jar

and run:

poetry run pytest

To run the performance benchmark:

PYTHONPATH=$PWD poetry run python tests/test_benchmark.py
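
For a quick throughput estimate on your own word list, something like the following may be enough. This is a rough sketch and not the contents of tests/test_benchmark.py; `words_per_second` is a hypothetical helper, and `str.lower` merely stands in for a stemmer so that the snippet runs without the package.

```python
import time


def words_per_second(words, stem):
    """Time one pass of `stem` over `words` and return the throughput."""
    start = time.perf_counter()
    for word in words:
        stem(word)
    elapsed = time.perf_counter() - start
    return len(words) / elapsed if elapsed > 0 else float("inf")


words = ["książka", "książki", "książkami"] * 10_000
rate = words_per_second(words, str.lower)  # str.lower stands in for a stemmer
print(f"{rate:,.0f} words per second")
```

Create the stemmer before calling the helper so that the one-time table loading is not counted in the measurement.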

Licensing

Code. Most of the code is covered by the Egothor Open Source License, an Apache-style license. The rest of the code is covered by the Apache License 2.0. The preamble of each file should make clear which license applies.
Data.
- The original pretrained stemming table is covered by the Apache License 2.0.
- The new pretrained stemming table is covered by the 2-Clause BSD License, like the copy of the Polimorf dictionary it was derived from. The copyright owner of both the stemming table and the dictionary is the Institute of Computer Science of the Polish Academy of Sciences (IPI PAN).
- The Polish dictionary used by the unit tests comes from sjp.pl and is also covered by the Apache License 2.0.

Alternatives

- Estem is an Erlang wrapper (not a port) for the Stempel stemmer.
- pl_stemmer is a Python stemmer based on Porter’s algorithm.
- polish-stem is a Python stemmer using finite-state transducers.

Release notes

2.0.0: Backward-incompatible API changes
- Refactor stempel to pystempel package (#26)
- Refactor StempelStemmer to Stemmer and Stemmer.stem to callable (#26)

1.2.0: Stable version

Release history

- 2.0.0 (this version): Jul 10, 2024
- 1.2.0: Oct 29, 2020
- 1.1.0: Oct 15, 2019
- 1.0.1: Jul 23, 2019
- 1.0.post1: Jul 23, 2019