Finds the Jaro Winkler Distance indicating a distance or similarity score between two strings.

Jaro Winkler Distance

Finds a non-euclidean distance or similarity between two strings.

The Jaro and Jaro-Winkler equations score the similarity of two short strings in which errors are more likely toward the end of the string. Jaro's measure is the weighted sum of the percentage of matching and transposed characters from each string. Winkler's factor adds weight to Jaro's formula to increase the calculated measure when both strings start with the same sequence of characters (a common prefix).

This version is based on the original C implementation of strcmp95 but, by default, does not attempt to normalize homoglyphs (e.g., O vs. 0).

  • Impact of the prefix is limited to 4 characters, as originally defined by Winkler.
  • Input strings are not modified beyond whitespace trimming.
  • In-word whitespace and character case optionally affect the score.
  • Supports optional UTF-8 normalization and homoglyph sanitization.
  • Returns a floating point number rounded to the desired decimals (defaults to 2) using Python's round.
  • Consider usual floating point arithmetic characteristics when working with this module.
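
The two points about rounding and floating point arithmetic can be illustrated with Python's built-in round alone. This is a small sketch, not code from this module; the helper scores_match is a hypothetical name introduced here to show a tolerance-based comparison.

```python
import math

# Rounding the worked example's score to the default two decimals.
print(round(0.89801864801, 2))  # 0.9

# round() sends exact ties to the nearest even digit.
print(round(0.125, 2))  # 0.12 (0.125 is exactly representable in binary)

# Many decimal literals are not exact once stored as floats,
# so an apparent tie may round down.
print(round(2.675, 2))  # 2.67


def scores_match(a: float, b: float, decimals: int = 2) -> bool:
    """Hypothetical helper: compare two scores within half a unit
    of the last kept decimal instead of using ==."""
    return math.isclose(a, b, abs_tol=10 ** -decimals / 2)


print(scores_match(0.9, 0.89801864801))  # True
```

Comparing scores with a tolerance rather than strict equality avoids surprises when two computations differ only in their last binary digits.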

Implementation

The complexity of this algorithm lies in finding the matching and transposed characters, because both the matching conditions and the definition of a transposition are open to interpretation. Different definitions of the two make the score vary between implementations of this algorithm.

Here is how matching and transposed are defined in this module:

  • A character of the first string at position N is matching if it is found in the second string at position N or within distance d on either side of it.
  • The distance d is the length of the longer string divided by two, rounded down, minus one.
  • Characters in the first string are matched only once against characters of the second string.
  • Two characters are transposed if they previously matched and aren't at the same position in the matching character subset.
  • Decimals are rounded according to the scientific method.
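
The definitions above can be sketched as follows. This is a minimal illustration and not the module's own code: it assumes a greedy left-to-right search that takes the first unmatched candidate inside the window, and it counts t as half the number of matched characters that are out of order, which is the conventional Jaro definition. Under those assumptions it yields m = 11 and t = 3 for the PENNSYLVANIA example. The function name match_and_transpose is hypothetical.

```python
def match_and_transpose(s1: str, s2: str) -> tuple[int, int]:
    """Return (m, t): matching characters and transpositions.

    Assumption: greedy first-fit matching and t = out_of_order // 2.
    """
    if not s1 or not s2:
        return 0, 0

    # Window size: floor(longest / 2) - 1, never negative.
    d = max(max(len(s1), len(s2)) // 2 - 1, 0)

    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)

    for i, ch in enumerate(s1):
        lo = max(0, i - d)
        hi = min(len(s2), i + d + 1)
        for j in range(lo, hi):
            # Each character of s2 can be matched only once.
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                break

    # Matched characters, kept in the order they appear in each string.
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]

    out_of_order = sum(a != b for a, b in zip(seq1, seq2))
    return len(seq1), out_of_order // 2


print(match_and_transpose("PENNSYLVANIA", "PENNCISYLVNIA"))  # (11, 3)
print(match_and_transpose("hello", "haloa"))                 # (3, 0)
```

A different tie-breaking rule inside the window, or a different way of counting t, changes these numbers, which is why scores differ between implementations.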

Example

Calculate the Jaro Winkler similarity ($sim_{w}$) between PENNSYLVANIA and PENNCISYLVNIA:

$$ s_{1}=\text{PENNSYLVANIA} \qquad\text{and}\qquad s_{2}=\text{PENNCISYLVNIA} $$

    P E N N C I S Y L V N I A
  ┌-─────────────────────────
P │ 1          ╎
E │   1          ╎
N │     1          ╎
N │       1          ╎           Symbols '╎' represent the sliding windows
S │             1      ╎        boundary in the second string where we look
Y │ ╎             1      ╎           for the first string's character.
L │   ╎             1      ╎
V │     ╎             1                   d = 5 in this example.
A │       ╎                 1
N │         ╎           1
I │           ╎           1
A │             ╎

$$ \begin{split} d &= \left\lfloor {\max(12, 13) \over 2} \right\rfloor - 1 \newline &= 5 \newline \end{split} \qquad \text{ and } \qquad \begin{split} |s_{1}| &= 12 \newline |s_{2}| &= 13 \newline \end{split} \qquad \text{ and } \qquad \begin{split} \ell &= 4 \newline m &= 11 \newline t &= 3 \newline p &= 0.1 \newline \end{split} $$

Considering the input parameters calculated above:

$$ \begin{split} sim_{j} &=\begin{cases} 0 & \text{if } m = 0 \newline {1 \over 3} \times \left({m \over |s_{1}|} + {m \over |s_{2}|} + {{m - t} \over m} \right) & \text{otherwise} \end{cases} \newline &={1 \over 3} \times \left({11 \over 12} + {11 \over 13} + {{11 - 3} \over 11}\right) \newline &= 0.83003108003 \newline \end{split} \qquad \text{then} \qquad \begin{split} sim_{w} &= sim_{j} + \ell \times p \times (1 - sim_{j}) \newline &= 0.83003108003 + 4 \times 0.1 \times (1 - 0.83003108003) \newline &= 0.89801864801 \newline \end{split} $$

Rounded to two decimals (the default), $sim_{w}$ is $0.9$.
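
The arithmetic of this example can be checked with exact fractions. The sketch below only evaluates the two formulas from the values of m, t, and p given above; the function names are hypothetical and are not part of this module's API.

```python
from fractions import Fraction


def common_prefix_length(s1: str, s2: str, cap: int = 4) -> int:
    """Length of the shared prefix, limited to `cap` characters."""
    length = 0
    for a, b in zip(s1, s2):
        if a != b or length == cap:
            break
        length += 1
    return length


def jaro_from_counts(m: int, t: int, len1: int, len2: int) -> Fraction:
    """Jaro similarity from the match and transposition counts."""
    if m == 0:
        return Fraction(0)
    return Fraction(1, 3) * (
        Fraction(m, len1) + Fraction(m, len2) + Fraction(m - t, m)
    )


def winkler_adjust(sim_j: Fraction, prefix_len: int,
                   p: Fraction = Fraction(1, 10)) -> Fraction:
    """Boost the Jaro similarity according to the shared prefix."""
    return sim_j + prefix_len * p * (1 - sim_j)


s1, s2 = "PENNSYLVANIA", "PENNCISYLVNIA"
ell = common_prefix_length(s1, s2)                    # 4
sim_j = jaro_from_counts(11, 3, len(s1), len(s2))     # 4273/5148
sim_w = winkler_adjust(sim_j, ell)                    # 1541/1716

print(ell, sim_j, sim_w)
print(round(float(sim_w), 2))  # 0.9
```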

Benchmark

Function                               Minimum time (1k runs of 10 pairs)
get_jaro_distance(s1, s2)              0.0149s
get_jaro_similarity(s1, s2)            0.0148s
get_jaro_winkler_distance(s1, s2)      0.0176s
get_jaro_winkler_similarity(s1, s2)    0.0172s

Benchmarks were run on a 2024 MacBook Pro with an M4 Pro chip running macOS 26.2.
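
A minimum-time figure of this kind can be measured with the standard timeit module. The harness below is only a sketch: how the numbers above were actually produced is not stated here, and min_time, the sample pairs, and the stand-in comparison function are all assumptions introduced for illustration.

```python
import timeit


def min_time(func, pairs, runs=1000, repeats=5):
    """Hypothetical helper: best wall-clock time, in seconds, for
    `runs` passes over every pair, taken over `repeats` attempts."""

    def run_all_pairs():
        for s1, s2 in pairs:
            func(s1, s2)

    return min(timeit.repeat(run_all_pairs, number=runs, repeat=repeats))


def stand_in(s1, s2):
    """Placeholder for the function under test."""
    return s1 == s2


# Ten illustrative pairs (the same pair repeated).
sample_pairs = [("hello", "haloa")] * 10

print(f"{min_time(stand_in, sample_pairs):.4f}s")
```

Taking the minimum over several repeats reduces the influence of other processes running on the machine. To time this module, pass one of its functions in place of stand_in.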

Usage

from pyjarowinkler import distance

distance.get_jaro_similarity("PENNSYLVANIA", "PENNCISYLVNIA", decimals=12)
# 0.830031080031
distance.get_jaro_winkler_similarity("PENNSYLVANIA", "PENNCISYLVNIA", decimals=12)
# 0.898018648019
distance.get_jaro_distance("hello", "haloa", decimals=4)
# 0.2667
distance.get_jaro_similarity("hello", "haloa", decimals=2)
# 0.73
distance.get_jaro_winkler_distance("hello", "Haloa", scaling=0.1, norm_case=False)
# 0.4
distance.get_jaro_winkler_distance("hello", "HaLoA", scaling=0.1, norm_case=True)
# 0.24
distance.get_jaro_winkler_similarity("café", "cafe\u0301", norm_utf8=True)
# 1.0
distance.get_jaro_winkler_similarity("pаypal", "paypal", norm_ambiguous=True)
# 1.0
distance.get_jaro_winkler_similarity("hello", "haloa", decimals=2)
# 0.76
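
The norm_utf8 and norm_ambiguous examples above rely on the fact that visually identical strings can contain different code points. The sketch below shows both cases with the standard unicodedata module only. It is not this module's implementation: the normalization form (NFC) and the one-entry homoglyph table are assumptions made for illustration.

```python
import unicodedata

composed = "caf\u00e9"      # 'é' stored as a single code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

print(composed == decomposed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == composed)   # True

# A homoglyph is a different letter that merely looks the same.
cyrillic = "p\u0430ypal"
print(unicodedata.name(cyrillic[1]))   # CYRILLIC SMALL LETTER A

# Unicode normalization does not fold it into the Latin letter...
print(unicodedata.normalize("NFKC", cyrillic) == "paypal")    # False

# ...so a separate mapping is needed (hypothetical, one entry only).
HOMOGLYPHS = str.maketrans({"\u0430": "a"})
print(cyrillic.translate(HOMOGLYPHS) == "paypal")             # True
```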

Contribute

You need mise installed on your system. Running the commands below then installs Python, uv, and github-cli.

The typical order of execution is as follows:

$ cd ./jaro-winkler-distance
$ mise install
$ uv venv
$ source .venv/bin/activate
$ uv pip install '.[dev]'

Other helpful commands:

  • uvx --python=3.12 python -m unittest discover -s tests/
  • uvx ruff check --diff
  • uvx ruff format --diff
  • uvx mypy
  • uvx coverage run -m unittest discover -s tests/
  • uvx coverage report

Release

$ ./release.sh help
Usage: release.sh [help|major|minor|patch]
$ PYPI_REPO=main ./release.sh minor
