Python version of the BulStem stemming algorithm
BulStem-py: A Python Re-implementation of BulStem - inflectional stemmer for Bulgarian
Introduction
This is the Python version of the BulStem stemming algorithm. It follows the algorithm presented in
Nakov, P. BulStem: Design and evaluation of inflectional stemmer for Bulgarian. In Workshop on
Balkan Language Resources and Tools (Balkan Conference in Informatics).
See http://people.ischool.berkeley.edu/~nakov/bulstem/ for the homepage of the algorithm. Also, check the original paper for more details and examples.
Implementation
Unlike the other available implementations, this one uses a trie instead of a dictionary/hash table to find the longest rule that can be applied to a token.
Basic algorithm steps:
1. Find the position of the first vowel in the token.
2. Find the longest matching rule by traversing the token in reverse order until no longer suffix matches, or until the position of the first vowel (found in step 1) is reached.
3. Prepend the non-stemmed prefix to the stemmed suffix (from step 2).
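The steps above can be illustrated with a minimal sketch. It is not the code shipped in this package: the names `build_suffix_trie`, `first_vowel_index`, and `stem_token` are hypothetical, the vowel set is an assumption, and the rules are passed as a plain suffix-to-stem mapping without frequency filtering or a left context. The trie is keyed on reversed suffixes, so a single backward walk over the token finds the longest matching rule.

```python
VOWELS = set("аъоуеияю")  # assumed Bulgarian vowel set for this sketch
_END = ""  # marker key: a single character of a token is never the empty string


def build_suffix_trie(rules):
    """Build a trie keyed on reversed suffixes; `rules` maps suffix -> stem."""
    root = {}
    for suffix, replacement in rules.items():
        node = root
        for ch in reversed(suffix):
            node = node.setdefault(ch, {})
        node[_END] = replacement  # a rule ends here; store its replacement
    return root


def first_vowel_index(token):
    """Step 1: index of the first vowel, or -1 if the token has none."""
    for i, ch in enumerate(token):
        if ch in VOWELS:
            return i
    return -1


def stem_token(token, trie):
    first_vowel = first_vowel_index(token)
    if first_vowel < 0:
        return token  # no vowel: leave the token unchanged

    node = trie
    best = None  # (start index of the matched suffix, replacement)
    # Step 2: walk backwards from the last character and stop before the
    # first vowel; every later hit is a longer suffix and overwrites `best`.
    for i in range(len(token) - 1, first_vowel, -1):
        node = node.get(token[i])
        if node is None:
            break
        if _END in node:
            best = (i, node[_END])

    if best is None:
        return token
    start, replacement = best
    # Step 3: untouched prefix + stemmed suffix.
    return token[:start] + replacement


trie = build_suffix_trie({"ой": "о", "ят": "", "ият": "и"})
print(stem_token("порой", trie))    # поро
print(stem_token("вторият", trie))  # втори (the longer rule "ият" wins over "ят")
```

Compared with probing a hash table once per candidate suffix, the trie walk touches each character at most once and stops as soon as no rule can match.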
Installation
This library is compatible with Python >= 3.6.
Clone the repository and run:
With pip
pip install -e .
pip install -r requirements.txt
Test
A set of tests is included in the project, under the tests folder. The test suite can be run as follows:
pip install -e ".[testing]"
pip install -r requirements-test.txt
python -m unittest
Usage
The library works with a set of rules used for stemming. The rules can be either passed as a list to the BulStemmer
constructor, or as a path to a file.
For both options the rules need to be formatted as follows:
word ==> stem ==> freq
A pre-defined set of rules is included in the package and can be used directly; see "Reading the rules from an external file" for an example.
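To make the rule format concrete, here is a minimal parsing sketch. `Rule` and `parse_rule` are hypothetical helpers, not part of the bulstem API. Because this README shows both `word ==> stem ==> freq` and a rule string written as `"ой ==> о 10"`, the sketch assumes either layout may occur and accepts both.

```python
import re
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    suffix: str  # the word ending to match
    stem: str    # what the ending is replaced with
    freq: int    # how often the rule was observed


# The second "==>" is optional so that both layouts are accepted (assumption).
_RULE_RE = re.compile(r"^\s*(\S+)\s+==>\s+(\S+)\s+(?:==>\s+)?(\d+)\s*$")


def parse_rule(line: str) -> Optional[Rule]:
    """Parse one rule line; return None if the line is not a rule."""
    match = _RULE_RE.match(line)
    if match is None:
        return None
    suffix, stem, freq = match.groups()
    return Rule(suffix, stem, int(freq))


print(parse_rule("ой ==> о 10"))
print(parse_rule("ой ==> о ==> 10"))
```

Both calls yield the same `Rule`, so downstream code such as a frequency filter would not need to care which layout the input used.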
Manually loading rules
from bulstem.stem import BulStemmer
stemmer = BulStemmer(["ой ==> о 10"], min_freq=0, left_context=2)
stemmer.stem('порой')  # Expected output: 'поро'
BulStemmer constructor params:
- rules - Iterable of strings containing rules.
- min_freq - The minimum frequency of a rule to be used when stemming.
- left_context - Size of the prefix which will not be stemmed.
Reading the rules from an external file
from bulstem.stem import BulStemmer
# Pre-defined names of rule sets
PRE_DEFINED_RULES = ['stem-context-1',
'stem-context-2',
'stem-context-3']
# Expected output:
# 1 втор
# 2 втори
# 3 вторият
for i, rules_name in enumerate(PRE_DEFINED_RULES, start=1):
    stemmer = BulStemmer.from_file(rules_name, min_freq=2, left_context=i)
    print(i, stemmer.stem('вторият'))

# Load the rules from an explicit file path instead of a pre-defined name
stemmer = BulStemmer.from_file('stem_rules_context_2_utf8.txt', min_freq=2, left_context=2)
stemmer.stem('вторият')  # Expected output: 'втори'
stemmer.stem('вероятен')  # Expected output: 'вероят'
BulStemmer.from_file params:
- path - Path (or pre-defined name) to the rules file formatted as follows: word ==> stem ==> freq.
- min_freq - The minimum frequency of a rule to be used when stemming.
- left_context - Size of the prefix which will not be stemmed.
Other implementations
Perl (Original), Java (JDK 1.4), Ruby, C#, Python2, GATE plugin (Java)
License
For license information, see LICENSE.