Finds the Jaro Winkler Distance indicating a distance or similarity score between two strings.
Jaro Winkler Distance
Finds a non-euclidean distance or similarity between two strings.
The Jaro and Jaro-Winkler equations provide a score between two short strings where errors are more likely toward the end of the string. Jaro's measure is the weighted sum of the percentage of matching and transposed characters from each string. Winkler's factor adds weight to Jaro's formula to increase the calculated measure when both strings start with the same sequence of characters (a common prefix).
This version is based on the original C implementation of strcmp95 but does not attempt to normalize homoglyphs (e.g., O vs. 0) by default.
- Impact of the prefix is limited to 4 characters, as originally defined by Winkler.
- Input strings are not modified beyond whitespace trimming.
- In-word whitespace and character case optionally impact the score.
- Supports optional UTF-8 normalization and homoglyph sanitization.
- Returns a floating point number rounded to the desired decimals (defaults to 2) using Python's round.
- Consider usual floating point arithmetic characteristics when working with this module.
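The rounding and floating point caveats in the list above can be seen with Python's built-in round alone. This is a small illustration, not output produced by this module; the first value is borrowed from the worked example and the others are generic floats chosen for the demonstration.

```python
# Python's built-in round() applied to plain floats (illustrative values).
sim_w = 0.898018648019

print(round(sim_w, 2))   # 0.9 (the trailing zero is not displayed)
print(round(sim_w, 4))   # 0.898

# Exact ties are rounded to the nearest even digit.
print(round(0.125, 2))   # 0.12
print(round(0.375, 2))   # 0.38

# Most decimal fractions have no exact binary representation, so an
# apparent tie may not be one: 2.675 is stored slightly below 2.675.
print(round(2.675, 2))   # 2.67
```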
Implementation
The complexity of this algorithm resides in finding the matching and transposed characters, because implementations differ in how they interpret the matching conditions and how they define a transposition. Those two definitions make the score vary between implementations of this algorithm.
Here is how matching and transposed are defined in this module:
- A character of the first string at position N is matching if found at position N or within distance on either side in the second string.
- The distance is half the length of the longest string, rounded down, minus one.
- Characters in the first string are matched only once against characters of the second string.
- Two characters are transposed if they previously matched and aren't at the same position in the matching character subset.
- Decimals are rounded according to the scientific method.
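The definitions above can be sketched in a few lines of Python. This is a minimal illustration and not the module's source: the function names are hypothetical, and the left-to-right scan order, the halving of out-of-order matches (as in the classic Jaro description), and the clamping of the distance at zero are assumptions.

```python
def match_and_transpose(s1: str, s2: str) -> tuple[int, int]:
    """Return (m, t): matching characters and transpositions (sketch)."""
    if not s1 or not s2:
        return 0, 0
    # Half the longest length, rounded down, minus one (clamped at zero,
    # which is an assumption for very short strings).
    distance = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used = [False] * len(s2)      # each character of s2 matches only once
    matched1 = []                 # matched characters, in s1 order
    for i, ch in enumerate(s1):
        lo = max(0, i - distance)
        hi = min(len(s2), i + distance + 1)
        for j in range(lo, hi):   # assumed scan order: left to right
            if not used[j] and s2[j] == ch:
                used[j] = True
                matched1.append(ch)
                break
    matched2 = [c for c, u in zip(s2, used) if u]   # same subset, s2 order
    out_of_order = sum(a != b for a, b in zip(matched1, matched2))
    return len(matched1), out_of_order // 2


def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity built from the (m, t) pair above."""
    m, t = match_and_transpose(s1, s2)
    if m == 0:
        return 0.0
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


print(match_and_transpose("PENNSYLVANIA", "PENNCISYLVNIA"))  # (11, 3)
print(round(jaro_similarity("hello", "haloa"), 2))           # 0.73
```

Under these assumptions the sketch reproduces the match and transposition counts of the PENNSYLVANIA pair and the 0.73 score shown for "hello" and "haloa" in the usage section.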
Example
Calculate the Jaro Winkler similarity ($sim_{w}$) between PENNSYLVANIA and PENNCISYLVNIA:
$$ s_{1}=\text{PENNSYLVANIA} \qquad\text{and}\qquad s_{2}=\text{PENNCISYLVNIA} $$
     P E N N C I S Y L V N I A
  ┌───────────────────────────
P │  1          ╎
E │    1          ╎
N │      1          ╎
N │        1          ╎
S │              1      ╎
Y │ ╎              1      ╎
L │   ╎              1      ╎
V │     ╎              1
A │       ╎                  1
N │         ╎            1
I │           ╎1
A │             ╎
Symbols '╎' represent the sliding window's boundary in the second string where we look for the first string's character. d = 5 in this example.
$$ \begin{split} d &= \left\lfloor {\max(12, 13) \over 2} \right\rfloor - 1 \newline &= 5 \newline \end{split} \qquad \text{ and } \qquad \begin{split} |s_{1}| &= 12 \newline |s_{2}| &= 13 \newline \end{split} \qquad \text{ and } \qquad \begin{split} \ell &= 4 \newline m &= 11 \newline t &= 3 \newline p &= 0.1 \newline \end{split} $$
Considering the input parameters calculated above:
$$ \begin{split} sim_{j} &=\begin{cases} 0 & \text{if } m = 0 \newline {1 \over 3} \times \left({m \over |s_{1}|} + {m \over |s_{2}|} + {{m - t} \over m} \right) & \text{otherwise} \end{cases} \newline &={1 \over 3} \times \left({11 \over 12} + {11 \over 13} + {{11 - 3} \over 11}\right) \newline &= 0.83003108003 \newline \end{split} \qquad \text{then} \qquad \begin{split} sim_{w} &= sim_{j} + \ell \times p \times (1 - sim_{j}) \newline &= 0.83003108003 + 4 \times 0.1 \times (1 - 0.83003108003) \newline &= 0.89801864801 \newline \end{split} $$
Rounded to two decimals, $sim_{w} \approx 0.9$.
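The arithmetic above can be checked with a few lines of Python. The helper name common_prefix_length is hypothetical and only illustrates the 4-character prefix cap; the values of m, t, and p are taken from the example rather than computed here.

```python
def common_prefix_length(s1: str, s2: str, limit: int = 4) -> int:
    """Length of the shared prefix, capped at `limit` characters."""
    length = 0
    for a, b in zip(s1[:limit], s2[:limit]):
        if a != b:
            break
        length += 1
    return length


m, t = 11, 3            # matches and transpositions from the example
len1, len2 = 12, 13     # |s1| and |s2|
p = 0.1                 # scaling factor
ell = common_prefix_length("PENNSYLVANIA", "PENNCISYLVNIA")

sim_j = (m / len1 + m / len2 + (m - t) / m) / 3
sim_w = sim_j + ell * p * (1 - sim_j)

print(ell)               # 4
print(round(sim_j, 6))   # 0.830031
print(round(sim_w, 6))   # 0.898019
print(round(sim_w, 2))   # 0.9
```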
Benchmark
| Function | Minimum Time (1k runs of 10 pairs) |
|---|---|
| get_jaro_distance(s1, s2) | 0.0149s |
| get_jaro_similarity(s1, s2) | 0.0148s |
| get_jaro_winkler_distance(s1, s2) | 0.0176s |
| get_jaro_winkler_similarity(s1, s2) | 0.0172s |
Benchmarks were run on a 2024 MacBook Pro with an M4 Pro chip running macOS 26.2.
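A "minimum time over 1k runs of 10 pairs" figure could be measured along the following lines with the standard timeit module. The harness actually used for the table is not shown in this document, so the helper minimum_time, the stand-in scorer, and the choice of pairs are all assumptions made for illustration.

```python
import timeit
from typing import Callable, Sequence


def minimum_time(func: Callable[[str, str], float],
                 pairs: Sequence[tuple[str, str]],
                 runs: int = 1000,
                 repeat: int = 5) -> float:
    """Best total time, in seconds, of `runs` passes over all pairs."""
    def one_pass() -> None:
        for s1, s2 in pairs:
            func(s1, s2)

    return min(timeit.repeat(one_pass, number=runs, repeat=repeat))


# Stand-in scorer so the sketch runs without the package installed;
# pass e.g. distance.get_jaro_similarity to time the real function.
def stand_in(s1: str, s2: str) -> float:
    return float(s1 == s2)


# Ten pairs, built by repeating two of the pairs used in this document.
pairs = [("hello", "haloa"), ("PENNSYLVANIA", "PENNCISYLVNIA")] * 5
print(f"{minimum_time(stand_in, pairs):.4f}s")
```

Taking the minimum of several repeats, rather than the mean, reduces the influence of unrelated system activity on the reported figure.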
Usage
from pyjarowinkler import distance
distance.get_jaro_similarity("PENNSYLVANIA", "PENNCISYLVNIA", decimals=12)
# 0.830031080031
distance.get_jaro_winkler_similarity("PENNSYLVANIA", "PENNCISYLVNIA", decimals=12)
# 0.898018648019
distance.get_jaro_distance("hello", "haloa", decimals=4)
# 0.2667
distance.get_jaro_similarity("hello", "haloa", decimals=2)
# 0.73
distance.get_jaro_winkler_distance("hello", "Haloa", scaling=0.1, norm_case=False)
# 0.4
distance.get_jaro_winkler_distance("hello", "HaLoA", scaling=0.1, norm_case=True)
# 0.24
distance.get_jaro_winkler_similarity("café", "cafe\u0301", norm_utf8=True)
# 1.0
distance.get_jaro_winkler_similarity("pаypal", "paypal", norm_ambiguous=True)
# 1.0
distance.get_jaro_winkler_similarity("hello", "haloa", decimals=2)
# 0.76
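The norm_utf8 and norm_ambiguous examples above deal with strings that look identical but differ in their code points. The standard library shows the difference; which normalization form the module applies is not stated here, so the use of NFC below is an assumption, as is the guess that the homoglyph example uses the Cyrillic letter U+0430.

```python
import unicodedata

precomposed = "caf\u00e9"     # 'é' as a single code point (U+00E9)
decomposed = "cafe\u0301"     # 'e' followed by a combining acute accent

print(precomposed == decomposed)                                # False
print(len(precomposed), len(decomposed))                        # 4 5
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# A homoglyph is a different letter altogether, so Unicode normalization
# alone does not make the two strings equal.
lookalike = "p\u0430ypal"     # second character is Cyrillic, not Latin 'a'
print(unicodedata.name("\u0430"))                               # CYRILLIC SMALL LETTER A
print(unicodedata.normalize("NFKC", lookalike) == "paypal")     # False
```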
Contribute
You need to have mise installed on your system. Then, running the commands below will install python, uv, and github-cli.
Typical order of execution is as follows:
$ cd ./jaro-winkler-distance
$ mise install
$ uv venv
$ source .venv/bin/activate
$ uv pip install '.[dev]'
Other helpful commands:
$ uvx --python=3.12 python -m unittest discover -s tests/
$ uvx ruff check --diff
$ uvx ruff format --diff
$ uvx mypy
$ uvx coverage run -m unittest discover -s tests/
$ uvx coverage report
Release
$ ./release.sh help
Usage: release.sh [help|major|minor|patch]
$ PYPI_REPO=main ./release.sh minor