N-grams based similarity score
A method for similarity scoring of two strings.
The method, namely
nratio, belongs to the class
SequenceMatcherExtended, which is an extension of the
SequenceMatcher class of the difflib package. In particular,
nratio (method of
SequenceMatcherExtended) is an augmenation of
ratio (method of
ngramratio is to be pronounced as "n gram ratio". The library uses n-grams to find a similarity score via a division (ratio) of the number of matched characters by the total number of characters. See below for more details.
To compute a similarity score based on matching n-grams (with n>=1 chosen by the user) rather than matching single characters (as in the case of the
To install the Python library run:
pip install ngramratio
The library will be installed as
/usr/bin); or as
Scripts in your
Python installation on Windows (e.g.
You may consider installing the library only for the current user:
pip install ngramratio --user
In this case the library will be installed to
~/.local/bin/ngramratio on Linux and to
%APPDATA%\Python\Scripts\ngramratio.exe on Windows.
The module provides a method,
nratio, which takes an integer number (the user's required minimum n-gram length, i.e. number of consecutive characters, to be matched) and outputs a similarity index (float number in [0,1]).
First step: initialize an object of class SequenceMatcherExtended specifying the two strings to be compared:
>>> import ngramratio from ngramratio >>> SequenceMatcherExtended = ngrmaratio.SequenceMatcherExtended >>> string_one = "ab cde" >>> string_two = "bcde" >>> s = SequenceMatcherExtended(None, string_one, string_two, None) >>> # The "None" arguments prevents from any character being considered junk.. >>> # .. see the difflib documentation for more information on this.
Second step: apply the
nratio methods and compare similarity scores:
>>> s.ratio() >>> # Matches any character. Matches: "b" (length 1), "cde"(length 3). Score: (3+1)*2/10. 0.8 >>> s.nratio(1) >>> # Matches substring of length 1 or more. It replicates `ratio()`'s functionality. 0.8 >>> s.nratio(2) >>> # Matches substring of length 2 or more. Matches: "cde"(length 3). Score: 3*2/10. 0.6 >>> s.nratio(3) >>> # Matches substring of length 3 or more. Matches: "cde"(length 3). Score: 3*2/10. 0.6 >>> s.nratio(4) >>> # Matches substring of length 3 or more. Score 0/10. 0.0
The similarity score is computed as
the number of characters matched (m) mutiplied by
two (2) and divided by
the total numer of characters (T) of the two strings, i.e. similarity score = 2m/T. Note that Python always returns a float upon computing a division.
Testing in a virtual environment
To test a specific python version, for example version 3.6, edit the last few characters of the
startTest.sh script to py36 AND change the image to python 3.6 on line 4 of the
To run tests, run
bash _scripts/startTest.sh. This will start a docker container using the specified python image. After testing, or before testing a different python version, run
bash _scripts/teardown.sh to remove the docker container.
The library has been tested successfully for python >= 3.6.
Testing on your local machine with no v.e.
You can use
tox directly in your local machine. Make sure to install
pytest before testing.
tox expects to find executables like
python3.10 etc. On Windows it looks for
To test a specific Python environment, use the
-e option. For example, to
test against Python 3.7 run:
tox -e py37
in the root of the project source tree.
To fix code formatting (this will install
pre-commit as a dependency), run:
tox -e lint
tox.ini file in the repository to learn more about the testing instructions being used.
Contributions should include tests and an explanation for the changes they propose. Documentation (examples, docstrings, README.md) should be updated accordingly.
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Hashes for ngramratio-0.0.5-py3-none-any.whl