A `set` subclass providing fuzzy search based on N-grams.
The NGram class extends the Python `set` class with efficient fuzzy search for members by means of an N-gram similarity measure. It also has static methods to compare a pair of strings.
The N-grams are character-based, not word-based, and the class does not implement a language model; it merely searches for members by string similarity.
See the documentation, which includes a tutorial and release notes.
Use the GitHub issue tracker to report issues.
To install python-ngram from PyPI:
pip install ngram
How does it work?
The set stores arbitrary items, but for non-string items a key function (such as str) must be specified to provide a string representation. The key function can also be used to normalise string items (e.g. lower-casing) prior to N-gram indexing.
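As an illustration of the key-function idea only, here is a minimal sketch in plain Python. The helper `index_keys` and the sample records are hypothetical and are not part of the python-ngram API; the sketch simply shows how a key function turns arbitrary items into normalised strings before any N-gram indexing would take place.

```python
def index_keys(items, key=str):
    """Map each item's string key to the item itself (hypothetical helper)."""
    return {key(item): item for item in items}

# Non-string items: the key function picks a field and lower-cases it.
records = [(1, "Spam"), (2, "Eggs")]
by_key = index_keys(records, key=lambda record: record[1].lower())

# With the default key (str), numbers are indexed by their string form.
numbers_by_key = index_keys([10, 20])
```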
To index a string, it pads the string with a specified dummy character, then splits it into overlapping substrings of N characters (default N=3) and associates each N-gram with the items that contain it.
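The indexing step can be sketched in plain Python as follows. This is not the library's own code: the function names `split_ngrams` and `build_index` are hypothetical, and the choices of `$` as the padding character and N-1 padding characters on each side are assumptions made for the example.

```python
from collections import defaultdict

def split_ngrams(text, n=3, pad_char="$"):
    """Pad text on both sides (assumed n-1 characters) and return its overlapping n-grams."""
    padding = pad_char * (n - 1)
    padded = padding + text + padding
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def build_index(items, n=3, pad_char="$"):
    """Associate each n-gram with the set of items that contain it."""
    index = defaultdict(set)
    for item in items:
        for gram in split_ngrams(item, n, pad_char):
            index[gram].add(item)
    return index

grams = split_ngrams("abc")  # "$$abc$$" -> ['$$a', '$ab', 'abc', 'bc$', 'c$$']
index = build_index(["abc", "abd"])
```

Padding gives the first and last characters of a string as many N-grams as the characters in the middle, so matches at the start or end of a string count as much as matches elsewhere.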
To find items similar to a query string, it splits the query into N-grams, collects all items sharing at least one N-gram with the query, and ranks the items by score based on the ratio of shared to unshared N-grams between strings.
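The search step can be sketched in the same way. The names `ngrams` and `search` below are hypothetical, and the score used here (shared N-grams divided by all distinct N-grams of the two strings, computed over sets) is a simplified stand-in chosen for the example; the library's exact scoring formula may differ.

```python
from collections import defaultdict

def ngrams(text, n=3, pad_char="$"):
    """Return the set of padded n-grams of text (padding of n-1 is an assumption)."""
    padding = pad_char * (n - 1)
    padded = padding + text + padding
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def search(items, query, n=3):
    """Rank items that share at least one n-gram with the query."""
    # Build an inverted index from n-gram to the items containing it.
    index = defaultdict(set)
    for item in items:
        for gram in ngrams(item, n):
            index[gram].add(item)

    # Collect candidates: every item sharing at least one n-gram with the query.
    query_grams = ngrams(query, n)
    candidates = set()
    for gram in query_grams:
        candidates |= index.get(gram, set())

    # Score each candidate as shared n-grams over all distinct n-grams.
    scored = []
    for item in candidates:
        item_grams = ngrams(item, n)
        shared = len(query_grams & item_grams)
        total = len(query_grams | item_grams)
        scored.append((item, shared / total))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

matches = search(["spam", "spasm", "eggs"], "spam")
```

In this sketch "eggs" never becomes a candidate because it shares no N-gram with the query, so only "spam" and "spasm" are scored.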
In 2007, Michel Albert (exhuma) wrote the python-ngram module based on Perl’s String::Trigram module by Tarek Ahmed, and committed the code for 2.0.0b2 to a now-disused SourceForge Subversion repo.
Since late 2008, Graham Poulter has maintained python-ngram, initially refactoring it to build on the set class, and also adding features, documentation, tests, performance improvements and Python 3 support.
Development takes place on GitHub. After checking out the repo, run `tox` to build the Sphinx documentation and run the tests. Run `pip install -e .` inside a virtualenv to install the module in editable mode.