Python version of the BulStem stemming algorithm

BulStem-py: A Python Re-implementation of BulStem - inflectional stemmer for Bulgarian

Introduction

This is the Python version of the BulStem stemming algorithm. It follows the algorithm presented in

Nakov, P. BulStem: Design and evaluation of inflectional stemmer for Bulgarian. In Workshop on 
Balkan Language Resources and Tools (Balkan Conference in Informatics).

See http://people.ischool.berkeley.edu/~nakov/bulstem/ for the homepage of the algorithm. Also, check the original paper for more details and examples.

Implementation

Unlike the other available implementations, this one uses a trie instead of a dictionary/hash table to find the longest rule that can be applied to a token.
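To illustrate the idea, here is a minimal sketch of a suffix trie that stores rules keyed by their reversed suffix and returns the longest match for a token. It is not the code used by this package: the class name `SuffixTrie`, its methods, and the `min_start` parameter are hypothetical names chosen for this example.

```python
_END = ""  # marker key; never collides with a single-character edge


class SuffixTrie:
    """Illustrative trie over reversed suffixes (not the package's class)."""

    def __init__(self):
        self.root = {}

    def insert(self, suffix, stem):
        # Walk the suffix from its last character to its first.
        node = self.root
        for ch in reversed(suffix):
            node = node.setdefault(ch, {})
        node[_END] = stem

    def longest_match(self, token, min_start=0):
        """Return (start_index, replacement) for the longest matching
        suffix that starts at or after min_start, or None."""
        node = self.root
        best = None
        for i in range(len(token) - 1, min_start - 1, -1):
            node = node.get(token[i])
            if node is None:
                break
            if _END in node:
                best = (i, node[_END])  # longer matches overwrite shorter ones
        return best


trie = SuffixTrie()
trie.insert("ой", "о")
trie.insert("й", "")
print(trie.longest_match("порой"))  # (3, 'о')
```

A single right-to-left pass over the token visits every candidate suffix once, whereas a hash table needs one lookup per candidate suffix length.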

Basic algorithm steps:

  1. Find the position of the first vowel in the token.
  2. Find the longest possible rule by traversing the string in reverse order until there is a matching suffix, or down to the position of the first vowel (found in Step 1).
  3. Prepend the non-stemmed prefix to the stemmed suffix (from Step 2).
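The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: it uses a plain dict lookup instead of a trie to stay short, the vowel set and the exact boundary after the first vowel are assumptions, `left_context` and `min_freq` are ignored, and the names `first_vowel_index` and `stem_token` are hypothetical.

```python
# Assumed set of Bulgarian vowels; the package may use a different set.
BG_VOWELS = set("аъоуеияю")


def first_vowel_index(token):
    """Step 1: position of the first vowel, or -1 if there is none."""
    for i, ch in enumerate(token):
        if ch in BG_VOWELS:
            return i
    return -1


def stem_token(token, rules):
    """Apply the longest matching suffix rule to token.

    rules maps a suffix to its replacement, e.g. {"ой": "о"}.
    """
    vowel = first_vowel_index(token)
    if vowel < 0:
        return token  # no vowel: leave the token unchanged
    # Step 2: try suffixes from longest to shortest, never reaching
    # back past the first vowel (assumed boundary).
    for start in range(vowel + 1, len(token)):
        suffix = token[start:]
        if suffix in rules:
            # Step 3: untouched prefix + stemmed suffix.
            return token[:start] + rules[suffix]
    return token


print(stem_token("порой", {"ой": "о"}))  # поро
```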

Installation

This library is compatible with Python >= 3.6.

Clone the repository and run:

With pip

pip install -e .
pip install -r requirements.txt

Test

A set of tests is included in the project, under the tests folder. The test suite can be run as follows:

pip install -e ".[testing]"
pip install -r requirements-test.txt
python -m unittest

Usage

The library works with a set of rules used for stemming. The rules can be either passed as a list to the BulStemmer constructor, or as a path to a file.

For both options the rules need to be formatted as follows:

word ==> stem ==> freq

A pre-defined set of rules is included in the package and can be used directly; see "Reading the rules from an external file" for an example.
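As a rough illustration of the rule format, the sketch below parses a single rule line into its three fields. It is not the package's parser: the `Rule` tuple and `parse_rule` function are hypothetical names. Because the manual-loading example in this README writes the frequency after plain whitespace ("ой ==> о 10"), the sketch assumes both that form and a second `==>` separator should be accepted.

```python
import re
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    suffix: str  # the "word" part to be replaced
    stem: str    # the replacement
    freq: int    # how often the rule was observed


# Split on "==>" (with optional surrounding spaces) or on plain whitespace.
_SEPARATOR = re.compile(r"\s*==>\s*|\s+")


def parse_rule(line: str) -> Optional[Rule]:
    """Return a Rule, or None if the line does not have three valid fields."""
    parts = _SEPARATOR.split(line.strip())
    if len(parts) != 3 or not parts[2].isdecimal():
        return None
    return Rule(parts[0], parts[1], int(parts[2]))


print(parse_rule("ой ==> о 10"))
# Rule(suffix='ой', stem='о', freq=10)
```

Returning `None` for malformed lines lets a caller skip comments or blank lines in a rules file without raising.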

Manually loading rules

from bulstem.stem import BulStemmer

stemmer = BulStemmer(["ой ==> о 10"], min_freq=0, left_context=2)
stemmer.stem('порой')  # Expected output: 'поро'

BulStemmer constructor params:

  1. rules - Iterable of strings containing rules.
  2. min_freq - The minimum frequency of a rule to be used when stemming.
  3. left_context - Size of the prefix which will not be stemmed.

Reading the rules from an external file

from bulstem.stem import BulStemmer


# Pre-defined names of rule sets
PRE_DEFINED_RULES = ['stem-context-1', 
                     'stem-context-2',
                     'stem-context-3']

# Expected output:
# 1 втор
# 2 втори
# 3 вторият
for i, rules_name in enumerate(PRE_DEFINED_RULES, start=1):
    stemmer = BulStemmer.from_file(rules_name, min_freq=2, left_context=i)
    print(i, stemmer.stem('вторият'))

# Load rules from an explicit file path; left_context is set to 2 to match the rule set
stemmer = BulStemmer.from_file('stem_rules_context_2_utf8.txt', min_freq=2, left_context=2)
stemmer.stem('вторият')  # Expected output: 'втори'
stemmer.stem('вероятен')  # Expected output: 'вероят'

BulStemmer.from_file params:

  1. path - Path (or pre-defined name) to the rules file formatted as follows: word ==> stem ==> freq.
  2. min_freq - The minimum frequency of a rule to be used when stemming.
  3. left_context - Size of the prefix which will not be stemmed.

Other implementations

Perl (Original), Java (JDK 1.4), Ruby, C#, Python2, GATE plugin (Java)

License

For license information, see LICENSE.
