Skip to main content

This package provides 36 stemmers for 34 languages generated from Snowball algorithms.

Project description

Python 3 (>= 3.3) is supported. We no longer support Python 2 as the Python developers stopped supporting it at the start of 2020. Snowball 2.1.0 was the last release to officially support Python 2; Snowball 3.0.1 was the last release which had the code to support Python 2, but we were no longer testing it.

What is Stemming?

Stemming maps different forms of the same word to a common “stem” - for example, the English stemmer maps connection, connections, connective, connected, and connecting to connect. So a search for connected would also find documents which only have the other forms.

This stem form is often a word itself, but this is not always the case as this is not a requirement for text search systems, which are the intended field of use. We also aim to conflate words with the same meaning, rather than all words with a common linguistic root (so awe and awful don’t have the same stem), and over-stemming is more problematic than under-stemming so we tend not to stem in cases that are hard to resolve. If you want to always reduce words to a root form and/or get a root form which is itself a word then Snowball’s stemming algorithms likely aren’t the right answer.

How to use library

The stemming algorithms generally expect the input text to use composed accents (Unicode NFC or NFKC) and to have been folded to lower case already.

The snowballstemmer module has two functions.

The snowballstemmer.algorithms function returns a list of available algorithm names.

The snowballstemmer.stemmer function takes an algorithm name and returns a Stemmer object.

Stemmer objects have a Stemmer.stemWord(word) method and a Stemmer.stemWords(word[]) method.

import snowballstemmer

stemmer = snowballstemmer.stemmer('english')
print(stemmer.stemWords("We are the world".split()))

Generally you should create a stemmer object and reuse it rather than creating a fresh object for each word stemmed (since there’s some cost to creating and destroying the object).

The stemmer code is re-entrant, but not thread-safe if the same stemmer object is used concurrently in different threads.

If you want to perform stemming concurrently in different threads, we suggest creating a new stemmer object for each thread. The alternative is to share stemmer objects between threads and protect access using a mutex or similar (e.g. threading.Lock in Python) but that’s liable to slow your program down as threads can end up waiting for the lock.

Automatic Acceleration

PyStemmer is a wrapper module for Snowball’s libstemmer_c and should provide results 100% compatible to snowballstemmer.

PyStemmer is faster because it wraps generated C versions of the stemmers; snowballstemmer uses generate Python code and is slower but offers a pure Python solution.

If PyStemmer is installed, snowballstemmer.stemmer returns a PyStemmer Stemmer object which provides the same Stemmer.stemWord() and Stemmer.stemWords() methods.

Benchmark

This is a crude benchmark which measures the time for running each stemmer on every word in its sample vocabulary (10,787,583 words over 26 languages). It’s not a realistic test of normal use as a real application would do much more than just stemming. It’s also skewed towards the stemmers which do more work per word and towards those with larger sample vocabularies.

  • Python 2.7 + snowballstemmer : 13m00s (15.0 * PyStemmer)

  • Python 3.7 + snowballstemmer : 12m19s (14.2 * PyStemmer)

  • PyPy 7.1.1 (Python 2.7.13) + snowballstemmer : 2m14s (2.6 * PyStemmer)

  • PyPy 7.1.1 (Python 3.6.1) + snowballstemmer : 1m46s (2.0 * PyStemmer)

  • Python 2.7 + PyStemmer : 52s

For reference the equivalent test for C runs in 9 seconds.

These results are for Snowball 2.0.0. They’re likely to evolve over time as the code Snowball generates for both Python and C continues to improve (for a much older test over a different set of stemmers using Python 2.7, snowballstemmer was 30 times slower than PyStemmer, or 9 times slower with PyPy).

The message to take away is that if you’re stemming a lot of words you should either install PyStemmer (which snowballstemmer will then automatically use for you as described above) or use PyPy.

The TestApp example

The testapp.py example program allows you to run any of the stemmers on a sample vocabulary.

Usage:

testapp.py <algorithm> "sentences ... "
$ python testapp.py English "sentences... "

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

snowballstemmer-3.1.0.tar.gz (122.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

snowballstemmer-3.1.0-py3-none-any.whl (104.5 kB view details)

Uploaded Python 3

File details

Details for the file snowballstemmer-3.1.0.tar.gz.

File metadata

  • Download URL: snowballstemmer-3.1.0.tar.gz
  • Upload date:
  • Size: 122.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for snowballstemmer-3.1.0.tar.gz
Algorithm Hash digest
SHA256 fd9e34526b23340cd23ffea6c9f9760974ecc2c2ac9e1d81401443ccdb2a801f
MD5 31f57eca1efcef1a355e4a4be916abbc
BLAKE2b-256 63ee67eef9600338e245ad7838230969a34c823ddbdbccc5e1fc43cd75b55bc9

See more details on using hashes here.

File details

Details for the file snowballstemmer-3.1.0-py3-none-any.whl.

File metadata

  • Download URL: snowballstemmer-3.1.0-py3-none-any.whl
  • Upload date:
  • Size: 104.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for snowballstemmer-3.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 17e6d1da216aa07db6dad37139ea70cf13c4b2e9a096f6e64a9648fc657d3154
MD5 ba83ff525e9e6f9c6659194bbf1869f8
BLAKE2b-256 4983ddbf4533c62dd32667ef1238952abef155f3d3391f5be69a352ad1638a42

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page