Fast and parallel snowball stemmer

py-rust-stemmers

py-rust-stemmers is a high-performance Python wrapper around the rust-stemmers library, which implements the Snowball stemming algorithms. It stems words efficiently and supports parallel processing for batches, which makes it well suited to text processing tasks. The package is built with maturin, which compiles the Rust code into a Python package.

Features

  • Snowball Stemmer: Uses the well-known Snowball stemming algorithms for efficient word stemming in multiple languages.
  • Parallelism Support: Offers parallel processing for batch stemming, providing significant speedup for larger text sequences.
  • Rust Performance: Leverages the performance of Rust for fast, reliable text processing.

Installation

You can install py-rust-stemmers via pip:

pip install py-rust-stemmers

Usage

Here's a simple example showing how to use py-rust-stemmers to stem words using the Snowball algorithm:

from py_rust_stemmers import SnowballStemmer

# Initialize the stemmer for the English language
s = SnowballStemmer('english')

# Input text
text = """This stem form is often a word itself, but this is not always the case as this is not a requirement for text search systems, which are the intended field of use. We also aim to conflate words with the same meaning, rather than all words with a common linguistic root (so awe and awful don't have the same stem), and over-stemming is more problematic than under-stemming so we tend not to stem in cases that are hard to resolve. If you want to always reduce words to a root form and/or get a root form which is itself a word then Snowball's stemming algorithms likely aren't the right answer."""
words = text.split()

# Example usage of the methods
stemmed = s.stem_word(words[0])
print(f"Stemmed word: {stemmed}")

# Stem a list of words
stemmed_words = s.stem_words(words)
print(f"Stemmed words: {stemmed_words}")

# Stem words in parallel
stemmed_words_parallel = s.stem_words_parallel(words)
print(f"Stemmed words (parallel): {stemmed_words_parallel}")

Methods

stem_word(word: str) -> str

This method stems a single word. It is best used for small or isolated stemming tasks.

Example:

s.stem_word("running")  # Output: "run"

stem_words(words: List[str]) -> List[str]

This method stems a list of words sequentially. It is ideal for processing short to moderately sized text sequences.

Example:

s.stem_words(["running", "jumps", "easily"])  # Output: ["run", "jump", "easili"]
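
The example above suggests that the returned list is aligned with the input list, one stem per word. Assuming that holds, the stems can be zipped back against the input to see which surface forms conflate. The following is a minimal sketch: group_by_stem is a hypothetical helper, not part of py-rust-stemmers, and toy_stem_words is only a stand-in for s.stem_words so that the snippet runs without the package installed.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def group_by_stem(
    words: List[str],
    stem_words: Callable[[List[str]], List[str]],
) -> Dict[str, List[str]]:
    """Map each stem to the distinct input words that produced it."""
    stems = stem_words(words)  # assumed to be aligned with `words`
    groups: Dict[str, List[str]] = defaultdict(list)
    for word, stem in zip(words, stems):
        if word not in groups[stem]:
            groups[stem].append(word)
    return dict(groups)


def toy_stem_words(words: List[str]) -> List[str]:
    # Stand-in for SnowballStemmer.stem_words: only strips a trailing "s".
    return [w[:-1] if w.endswith("s") else w for w in words]


groups = group_by_stem(["jump", "jumps", "cat", "jumps"], toy_stem_words)
print(groups)
```

With the real library, passing s.stem_words as the second argument should work the same way, provided the alignment assumption is correct.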

stem_words_parallel(words: List[str]) -> List[str]

This method stems a list of words in parallel. It provides significant speedup for longer text sequences (e.g., sequences longer than 512 tokens) by utilizing parallel processing. It is ideal for batch processing of large datasets.

Example:

s.stem_words_parallel(["running", "jumps", "easily"])  # Output: ["run", "jump", "easili"]
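
One way to apply the guidance above is to pick the method from the batch size. The sketch below is an assumption-laden illustration, not part of the library: stem_auto, PARALLEL_THRESHOLD, and the BatchStemmer protocol are hypothetical names, and the 512 cutoff is simply the figure quoted above, which may need tuning for a given workload. RecordingStemmer is a stand-in that records which method was called.

```python
from typing import List, Protocol


class BatchStemmer(Protocol):
    """Anything exposing the two batch methods described above."""

    def stem_words(self, words: List[str]) -> List[str]: ...
    def stem_words_parallel(self, words: List[str]) -> List[str]: ...


# Hypothetical default, taken from the "longer than 512 tokens" guidance.
PARALLEL_THRESHOLD = 512


def stem_auto(
    stemmer: BatchStemmer,
    words: List[str],
    threshold: int = PARALLEL_THRESHOLD,
) -> List[str]:
    """Use the parallel method only when the batch exceeds `threshold`."""
    if len(words) > threshold:
        return stemmer.stem_words_parallel(words)
    return stemmer.stem_words(words)


class RecordingStemmer:
    """Stand-in for SnowballStemmer; lowercases instead of stemming."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def stem_words(self, words: List[str]) -> List[str]:
        self.calls.append("sequential")
        return [w.lower() for w in words]

    def stem_words_parallel(self, words: List[str]) -> List[str]:
        self.calls.append("parallel")
        return [w.lower() for w in words]


demo = RecordingStemmer()
stem_auto(demo, ["Running"] * 10)    # short batch
stem_auto(demo, ["Running"] * 1000)  # long batch
print(demo.calls)
```

A SnowballStemmer instance could be passed in place of RecordingStemmer, since the helper relies only on the two method names documented above.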

Build from source

  • Install maturin
  • Go to the project directory, then build and install the wheel
maturin build --release
pip install target/wheels/py_rust_stemmers-<your os/architecture/etc>.whl

Development

  • Install uv
  • Install dependencies
uv sync
uv pip install maturin pytest tqdm snowballstemmer
  • Develop the package using the maturin develop command
uv run maturin develop
  • Build the package using the maturin build command
uv run maturin build
  • Test the package using pytest
uv run pytest
  • Run speedtest
uv run python tests/speedtest.py
  • Run benchmark for quantile
uv run python tests/benchmark_for_quantile.py
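
For a quick comparison outside the bundled scripts, a small timing harness is usually enough. The sketch below does not reproduce tests/speedtest.py, whose contents are not shown here. best_time is a hypothetical helper, and placeholder_stem_words stands in for s.stem_words or s.stem_words_parallel so that the snippet runs without the package.

```python
import time
from typing import Callable, List


def best_time(
    fn: Callable[[List[str]], List[str]],
    words: List[str],
    repeats: int = 5,
) -> float:
    """Return the fastest wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(words)
        best = min(best, time.perf_counter() - start)
    return best


def placeholder_stem_words(words: List[str]) -> List[str]:
    # Stand-in workload; replace with s.stem_words or s.stem_words_parallel.
    return [w.lower() for w in words]


words = ["Running", "Jumps", "Easily"] * 1000
elapsed = best_time(placeholder_stem_words, words)
print(f"best of 5 runs: {elapsed:.6f} s for {len(words)} words")
```

Taking the minimum over several runs reduces noise from other processes. Timing both batch methods on inputs of different lengths is one way to check where the parallel version starts to pay off on a given machine.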

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

py_rust_stemmers_tuned-0.1.7-cp314-cp314t-manylinux_2_34_x86_64.whl (334.3 kB)

Uploaded: CPython 3.14t, manylinux: glibc 2.34+, x86-64

py_rust_stemmers_tuned-0.1.7-cp314-cp314t-manylinux_2_34_aarch64.whl (325.8 kB)

Uploaded: CPython 3.14t, manylinux: glibc 2.34+, ARM64

py_rust_stemmers_tuned-0.1.7-cp314-cp314t-macosx_11_0_arm64.whl (287.3 kB)

Uploaded: CPython 3.14t, macOS 11.0+, ARM64

File details

Details for the file py_rust_stemmers_tuned-0.1.7-cp314-cp314t-manylinux_2_34_x86_64.whl.

File hashes

Hashes for py_rust_stemmers_tuned-0.1.7-cp314-cp314t-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 15e36742224f06b2dc7a61806162b58cc046331f2d8de049c553184b7d4d9489
MD5 1d78a26847f222b697164aa62c3c92b1
BLAKE2b-256 232ee4ddd8eb8a86bc74a55054f64f36080f032bc30c0bd25af1f1407e849c02
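
A downloaded wheel can be checked against a listed SHA256 digest with the standard library. This is a minimal sketch: sha256_of and matches_sha256 are hypothetical helper names, and the demo hashes a small temporary file rather than a real wheel so that it runs anywhere.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with an expected hex string, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demo on a throwaway file; with a real download, pass the wheel's path
# and the SHA256 value listed for that file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"abc")
    expected = hashlib.sha256(b"abc").hexdigest()
    ok = matches_sha256(sample, expected)
    bad = matches_sha256(sample, "0" * 64)

print(ok, bad)
```

pip can also enforce digests at install time through its hash-checking mode, which may be preferable to a manual check.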

Provenance

The following attestation bundles were made for py_rust_stemmers_tuned-0.1.7-cp314-cp314t-manylinux_2_34_x86_64.whl:

Publisher: publish.yaml on andreribeiro87/py-rust-stemmers-tuned

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file py_rust_stemmers_tuned-0.1.7-cp314-cp314t-manylinux_2_34_aarch64.whl.

File hashes

Hashes for py_rust_stemmers_tuned-0.1.7-cp314-cp314t-manylinux_2_34_aarch64.whl
Algorithm Hash digest
SHA256 332ad69490ef3e3a596ad6a3129671406092c8b0f662aa905025957c53d68179
MD5 99de0f6b57a4f94230ab91296587830f
BLAKE2b-256 cf9eb3d321c0aa0ee6ad3e8a5921a59cc6b0e632d0e07a48e5c56dc0e5b2ec34

Provenance

The following attestation bundles were made for py_rust_stemmers_tuned-0.1.7-cp314-cp314t-manylinux_2_34_aarch64.whl:

Publisher: publish.yaml on andreribeiro87/py-rust-stemmers-tuned

File details

Details for the file py_rust_stemmers_tuned-0.1.7-cp314-cp314t-macosx_11_0_arm64.whl.

File hashes

Hashes for py_rust_stemmers_tuned-0.1.7-cp314-cp314t-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 c8a6873af9a63cd8c3c5408173484861c9cacb93e321e9429e61e6265c74d2ef
MD5 99cdaed25d9df7cf683d348b2502b1d3
BLAKE2b-256 ec6a2ebb1539374bfee443ee75cdc383bf13e47d02fde0f84c146cb0f343340d

Provenance

The following attestation bundles were made for py_rust_stemmers_tuned-0.1.7-cp314-cp314t-macosx_11_0_arm64.whl:

Publisher: publish.yaml on andreribeiro87/py-rust-stemmers-tuned
