N-grams based similarity score

python-ngramratio

A method for similarity scoring of two strings.

The method, nratio, belongs to the class SequenceMatcherExtended, which extends the SequenceMatcher class of the difflib package. In particular, nratio (a method of SequenceMatcherExtended) is an augmentation of ratio (a method of SequenceMatcher).

ngramratio is pronounced "n gram ratio". The library uses n-grams to compute a similarity score as a ratio: the number of matched characters divided by the total number of characters. See below for more details.
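For readers unfamiliar with the term, a character n-gram is a run of n consecutive characters. The following minimal sketch lists the n-grams of a string; the helper name char_ngrams is hypothetical and is not part of the ngramratio library.

```python
def char_ngrams(text, n):
    """Return every run of n consecutive characters in text (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A string shorter than n has no n-grams, so the range below is empty.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("bcde", 2))  # ['bc', 'cd', 'de']
print(char_ngrams("bcde", 3))  # ['bcd', 'cde']
```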

Motivation

The goal is to compute a similarity score based on matching n-grams (with n>=1 chosen by the user) rather than on matching single characters, as the ratio method does.

Installation

To install the Python library run:

pip install ngramratio

The library will be installed as ngramratio to bin on Linux (e.g. /usr/bin); or as ngramratio.exe to Scripts in your Python installation on Windows (e.g. C:\Python27\Scripts\ngramratio.exe).

You may consider installing the library only for the current user:

pip install ngramratio --user

In this case the library will be installed to ~/.local/bin/ngramratio on Linux and to %APPDATA%\Python\Scripts\ngramratio.exe on Windows.

Library usage

The module provides a method, nratio, which takes an integer (the minimum n-gram length, i.e. the minimum number of consecutive characters that a match must contain) and returns a similarity score (a float in [0,1]).

First step: initialize an object of class SequenceMatcherExtended specifying the two strings to be compared:

    >>> from ngramratio import ngramratio

    >>> SequenceMatcherExtended = ngramratio.SequenceMatcherExtended

    >>> string_one = "ab cde"
    >>> string_two = "bcde"

    >>> s = SequenceMatcherExtended(None, string_one, string_two, None)
    >>> # The two "None" arguments prevent any character from being treated as junk;
    >>> # see the difflib documentation for more information on this.
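To make the two None arguments concrete, the sketch below uses difflib's own SequenceMatcher, whose constructor signature is (isjunk, a, b, autojunk). It assumes that SequenceMatcherExtended keeps this signature, which the positional call above suggests but which is not confirmed here. A None in the last position is falsy, so it disables the automatic junk heuristic in the same way as autojunk=False.

```python
from difflib import SequenceMatcher

string_one = "ab cde"
string_two = "bcde"

# Positional form, mirroring the call used in this README.
positional = SequenceMatcher(None, string_one, string_two, None)

# Keyword form: no junk function, automatic junk heuristic switched off.
explicit = SequenceMatcher(isjunk=None, a=string_one, b=string_two, autojunk=False)

# Both objects are configured the same way, so the scores agree.
print(positional.ratio() == explicit.ratio())  # True
```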

Second step: apply the ratio and nratio methods and compare similarity scores:

    >>> # Matches any character. Matches: "b" (length 1), "cde" (length 3). Score: (3+1)*2/10.
    >>> s.ratio()
    0.8
    >>> # Matches substrings of length 1 or more. It replicates `ratio()`'s functionality.
    >>> s.nratio(1)
    0.8
    >>> # Matches substrings of length 2 or more. Matches: "cde" (length 3). Score: 3*2/10.
    >>> s.nratio(2)
    0.6
    >>> # Matches substrings of length 3 or more. Matches: "cde" (length 3). Score: 3*2/10.
    >>> s.nratio(3)
    0.6
    >>> # Matches substrings of length 4 or more. No matches. Score: 0/10.
    >>> s.nratio(4)
    0.0

The similarity score is computed as the number of characters matched (m) multiplied by two (2) and divided by the total number of characters (T) of the two strings, i.e. similarity score = 2m/T. Note that in Python 3 the / operator always returns a float.
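The formula 2m/T can be checked by hand with difflib alone. The sketch below is not the library's implementation: it assumes that counting only the matching blocks of length n or more approximates what nratio does, and the function name filtered_ratio is hypothetical. It reproduces the scores of the example above for the strings "ab cde" and "bcde".

```python
from difflib import SequenceMatcher


def filtered_ratio(a, b, n=1):
    """Compute 2m/T, counting only matching blocks of at least n characters.

    Hypothetical stand-in for illustration; not ngramratio's own code.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # m: total characters in matching blocks that are long enough.
    # The zero-length terminator block is skipped because n >= 1.
    matched = sum(
        block.size for block in matcher.get_matching_blocks() if block.size >= n
    )
    total = len(a) + len(b)  # T: combined length of both strings
    # difflib's ratio() returns 1.0 for two empty strings; do the same here.
    return 2.0 * matched / total if total else 1.0


for n in (1, 2, 3, 4):
    print(n, filtered_ratio("ab cde", "bcde", n))
```

For n=1 the matched blocks are "b" and "cde" (m=4, T=10), for n=2 and n=3 only "cde" remains (m=3), and for n=4 nothing qualifies (m=0).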

Testing in a virtual environment

This project uses the pytest testing framework with tox and Docker to automate testing in different Python environments. Tests are stored in the test/ folder.

To test a specific Python version, for example version 3.6, edit the last few characters of the startTest.sh script to py36 and change the image to Python 3.6 on line 4 of the docker-compose.yaml file.

To run the tests, run bash _scripts/startTest.sh. This will start a Docker container using the specified Python image. After testing, or before testing a different Python version, run bash _scripts/teardown.sh to remove the Docker container.

The library has been tested successfully for Python >= 3.6.

Testing on your local machine without a virtual environment

You can use tox directly on your local machine. Make sure to install tox and pytest before testing.

On Linux tox expects to find executables like python3.6, python3.10 etc. On Windows it looks for C:\Python36\python.exe and C:\Python310\python.exe respectively.

To test a specific Python environment, use the -e option. For example, to test against Python 3.7 run:

tox -e py37

in the root of the project source tree.

To fix code formatting (this will install pre-commit as a dependency), run:

tox -e lint

See the tox.ini file in the repository to learn more about the testing instructions being used.

Contributions

Contributions should include tests and an explanation for the changes they propose. Documentation (examples, docstrings, README.md) should be updated accordingly.
