This package provides 29 stemmers for 28 languages generated from Snowball algorithms.
Both Python 2 and Python 3 (>= 3.3) are supported.
What is Stemming?
Stemming maps different forms of the same word to a common “stem” - for example, the English stemmer maps connection, connections, connective, connected, and connecting to connect. So a search for connected would also find documents which only have the other forms.
The stem is often a word itself, but not always: text search systems, which are the intended field of use, do not require it. We also aim to conflate words with the same meaning, rather than all words with a common linguistic root (so awe and awful don’t have the same stem). Over-stemming is more problematic than under-stemming, so we tend not to stem in cases that are hard to resolve. If you want to always reduce words to a root form, and/or get a root form which is itself a word, then Snowball’s stemming algorithms likely aren’t the right answer.
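To make the search use case concrete, here is a minimal sketch of an index keyed by stems. The helper names build_index and search, and the naive whitespace tokenisation, are illustrative assumptions and not part of snowballstemmer. The only thing assumed of stem_word is that it maps a word to its stem.

```python
from collections import defaultdict


def build_index(docs, stem_word):
    """Map each stem to the set of document ids containing a word with that stem.

    `docs` is a dict of id -> text; `stem_word` is any callable word -> stem.
    Tokenisation is a naive lowercase split, purely for illustration.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[stem_word(word)].add(doc_id)
    return index


def search(index, query, stem_word):
    """Return the ids of documents that share a stem with the query word."""
    # .get avoids creating an empty entry in the defaultdict on a miss.
    return index.get(stem_word(query.lower()), set())
```

With the library installed, stem_word could be the stemWord method of the object returned by snowballstemmer.stemmer('english'). The query and the indexed text must go through the same stemmer, otherwise the stems will not line up.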
How to use the library
The snowballstemmer module has two functions.
The snowballstemmer.algorithms function returns a list of available algorithm names.
The snowballstemmer.stemmer function takes an algorithm name and returns a Stemmer object.
Stemmer objects have a Stemmer.stemWord(word) method, which stems a single word, and a Stemmer.stemWords(words) method, which stems a list of words.
import snowballstemmer

stemmer = snowballstemmer.stemmer('english')
print(stemmer.stemWords("We are the world".split()))
If PyStemmer is installed, snowballstemmer.stemmer returns a PyStemmer Stemmer object which provides the same Stemmer.stemWord() and Stemmer.stemWords() methods.
PyStemmer is a wrapper module for Snowball’s libstemmer_c and should provide results 100% compatible with snowballstemmer.
PyStemmer is faster because it wraps generated C versions of the stemmers; snowballstemmer uses generated Python code and is slower but offers a pure Python solution.
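If you want to know ahead of time which implementation you are likely to get, one option is to check whether PyStemmer's module can be found. The sketch below is an assumption-laden illustration: the helper names are hypothetical, it relies on PyStemmer being importable as Stemmer, and it uses importlib.util.find_spec, which needs Python 3.4 or later. It does not inspect snowballstemmer itself.

```python
import importlib.util


def pystemmer_installed():
    """Return True if PyStemmer's module (imported as `Stemmer`) can be found."""
    return importlib.util.find_spec("Stemmer") is not None


def expected_backend():
    """Name the implementation snowballstemmer.stemmer is expected to return."""
    if pystemmer_installed():
        return "PyStemmer (C)"
    return "snowballstemmer (pure Python)"
```

Either way the calling code stays the same, since both kinds of Stemmer object provide stemWord and stemWords.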
This is a crude benchmark which measures the time for running each stemmer on every word in its sample vocabulary (10,787,583 words over 26 languages). It’s not a realistic test of normal use as a real application would do much more than just stemming. It’s also skewed towards the stemmers which do more work per word and towards those with larger sample vocabularies.
- Python 2.7 + snowballstemmer : 13m00s (15.0 * PyStemmer)
- Python 3.7 + snowballstemmer : 12m19s (14.2 * PyStemmer)
- PyPy 7.1.1 (Python 2.7.13) + snowballstemmer : 2m14s (2.6 * PyStemmer)
- PyPy 7.1.1 (Python 3.6.1) + snowballstemmer : 1m46s (2.0 * PyStemmer)
- Python 2.7 + PyStemmer : 52s
For reference the equivalent test for C runs in 9 seconds.
These results are for Snowball 2.0.0. They’re likely to evolve over time as the code Snowball generates for both Python and C continues to improve (for a much older test over a different set of stemmers using Python 2.7, snowballstemmer was 30 times slower than PyStemmer, or 9 times slower with PyPy).
The message to take away is that if you’re stemming a lot of words you should either install PyStemmer (which snowballstemmer will then automatically use for you as described above) or use PyPy.
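For a rough measurement on your own word lists, a timing loop along the following lines may be enough. This is a sketch and not the benchmark script used for the figures above: time_stemmer and slowdown are hypothetical helpers, and stem_words stands for any callable that takes a list of words, such as the stemWords method of a Stemmer object.

```python
import time


def time_stemmer(stem_words, words, repeats=3):
    """Return the best wall-clock time in seconds for stemming `words`.

    Taking the minimum over several runs reduces noise from other processes.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        stem_words(words)
        best = min(best, time.perf_counter() - start)
    return best


def slowdown(seconds, baseline_seconds):
    """Express a timing as a multiple of a baseline timing."""
    return seconds / baseline_seconds
```

The multipliers in the list above follow the same arithmetic: 13m00s is 780 seconds, and 780 / 52 gives the quoted 15.0.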
The TestApp example
The testapp.py example program allows you to run any of the stemmers on a sample vocabulary.
testapp.py <algorithm> "sentences ... "
$ python testapp.py English "sentences... "