library for fast approximate string matching using Jaro and Jaro-Winkler similarity
JaroWinkler
JaroWinkler is a library for calculating the Jaro and Jaro-Winkler similarity. It is easy to use, significantly faster than comparable libraries, and designed to integrate seamlessly with RapidFuzz.
⚡ Quickstart
>>> from jarowinkler import *
>>> jaro_similarity("Johnathan", "Jonathan")
0.8796296296296297
>>> jarowinkler_similarity("Johnathan", "Jonathan")
0.9037037037037037
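To make the two scores above concrete, here is a minimal pure-Python sketch of the textbook Jaro and Jaro-Winkler definitions. It is not the bit-parallel implementation this library uses, and the names `jaro`, `jaro_winkler`, `prefix_weight`, `max_prefix` and `boost_threshold` are illustrative choices for this sketch. The prefix weight of 0.1, the prefix limit of 4 and the boost threshold of 0.7 are the conventional values and are assumed here.

```python
def jaro(s1, s2):
    """Textbook Jaro similarity for two sequences (simple quadratic version)."""
    len1, len2 = len(s1), len(s2)
    # Convention chosen for this sketch: two empty inputs count as identical.
    if len1 == 0 and len2 == 0:
        return 1.0
    if len1 == 0 or len2 == 0:
        return 0.0

    # Elements only count as matching if their positions are this close.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, item in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == item:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Matched elements that line up in a different order; two such
    # out-of-order elements make one transposition.
    seq1 = [x for x, m in zip(s1, matched1) if m]
    seq2 = [x for x, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1, max_prefix=4, boost_threshold=0.7):
    """Jaro score boosted by the length of the common prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    # Assumed convention: only fairly similar inputs receive the prefix boost.
    if score > boost_threshold:
        score += prefix * prefix_weight * (1.0 - score)
    return score
```

For "Johnathan" and "Jonathan" this finds 8 matches and 2 transpositions, giving (8/9 + 8/8 + 6/8) / 3 ≈ 0.8796. The shared prefix "Jo" then adds 2 × 0.1 × (1 − 0.8796) ≈ 0.0241, which gives the Jaro-Winkler score of about 0.9037 shown above.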
🚀 Benchmarks
The implementation is based on a novel approach to calculating the Jaro-Winkler similarity using bit-parallelism. This is significantly faster than the original approach used in other libraries. The following benchmark shows the performance difference compared to jellyfish and python-Levenshtein.
⚙️ Installation
You can install this library from PyPI with pip:
pip install jarowinkler
JaroWinkler provides binary wheels for all common platforms.
Source builds
For a source build (for example from an sdist package) you only need a C++14-compatible compiler. You can also install directly from GitHub:
pip install git+https://github.com/maxbachmann/JaroWinkler.git@main
📖 Usage
All algorithms in JaroWinkler work not only with strings, but with arbitrary sequences of hashable objects:
from jarowinkler import jarowinkler_similarity
jarowinkler_similarity("this is an example".split(), ["this", "is", "a", "example"])
# 0.8666666666666667
As long as two objects have the same hash, they are treated as equal. You can provide a __hash__ method for your own classes:
class MyObject:
    def __init__(self, hash):
        self.hash = hash

    def __hash__(self):
        return self.hash

jarowinkler_similarity([MyObject(1), MyObject(2)], [MyObject(1), MyObject(2), MyObject(3)])
# 0.9111111111111111
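One way to use this hashing behaviour is to wrap elements in a small class whose hash ignores differences you do not care about. The sketch below uses a hypothetical `Token` class and `tokens` helper (neither is part of JaroWinkler) to hash words case-insensitively. It also defines `__eq__` consistently with `__hash__`, which is the usual Python convention.

```python
class Token:
    """Hypothetical wrapper that hashes a word case-insensitively."""

    def __init__(self, text):
        self.text = text

    def __hash__(self):
        # Words that differ only in case share one hash value.
        return hash(self.text.lower())

    def __eq__(self, other):
        # Keep equality consistent with the hash defined above.
        return isinstance(other, Token) and self.text.lower() == other.text.lower()


def tokens(sentence):
    """Split a sentence on whitespace and wrap each word in a Token."""
    return [Token(word) for word in sentence.split()]


left = tokens("This is an Example")
right = tokens("this is an example")

# Element-wise comparison of the hashes that the similarity functions rely on.
same_hashes = [hash(a) == hash(b) for a, b in zip(left, right)]
```

Sequences built this way can be passed to jarowinkler_similarity like the MyObject lists above. Given the hashing behaviour described there, differences in letter case should then no longer affect the score.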
All algorithms provide a score_cutoff parameter, which can be used to filter out bad matches: a similarity below score_cutoff is returned as 0. Internally, this also allows JaroWinkler to select faster implementations in some places:
jaro_similarity("Johnathan", "Jonathan", score_cutoff=0.9)
# 0.0
jaro_similarity("Johnathan", "Jonathan", score_cutoff=0.85)
# 0.8796296296296297
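The visible effect of score_cutoff in the two calls above can be sketched in plain Python. The helpers `apply_score_cutoff` and `keep_good_matches` are hypothetical and only model the returned value, not the internal speed-up. Treating a score exactly equal to the cutoff as a match is an assumption of this sketch.

```python
def apply_score_cutoff(score, score_cutoff=None):
    """Return the score unchanged, or 0.0 if it falls below the cutoff."""
    # Assumption: a score exactly equal to the cutoff still counts as a match.
    if score_cutoff is not None and score < score_cutoff:
        return 0.0
    return score


def keep_good_matches(scored_candidates, score_cutoff):
    """Drop (candidate, score) pairs whose score was cut to 0.0."""
    return [
        (candidate, score)
        for candidate, score in scored_candidates
        if apply_score_cutoff(score, score_cutoff) > 0.0
    ]


# Scores taken from the examples in this README.
scored = [("Jonathan", 0.8796296296296297), ("Johnathan", 1.0)]
strict = keep_good_matches(scored, score_cutoff=0.9)
lenient = keep_good_matches(scored, score_cutoff=0.85)
```

With the stricter cutoff only the exact match is kept, while the more lenient cutoff keeps both candidates.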
JaroWinkler can be used with RapidFuzz, which provides multiple methods to compute string metrics on collections of inputs. JaroWinkler implements the RapidFuzz C-API, which lets RapidFuzz call its functions without the usual Python call overhead and makes these operations even faster.
from rapidfuzz import process
process.cdist(["Johnathan", "Jonathan"], ["Johnathan", "Jonathan"], scorer=jarowinkler_similarity)
# array([[1.       , 0.9037037],
#        [0.9037037, 1.       ]], dtype=float32)
👍 Contributing
PRs are welcome!
- Found a bug? Report it as an issue or, even better, fix it!
- Can make something faster? Great! Just avoid external dependencies and remember that existing functionality should still work.
- Something else that you think is good? Do it! Just make sure that CI passes and everything from the README is still applicable (interface, features, and so on).
- Have no time to code? Tell your friends and subscribers about JaroWinkler. More users, more contributions, more amazing features.
Thank you ❤️
⚠️ License
Copyright 2021 - present maxbachmann. JaroWinkler is free and open-source software licensed under the MIT License.
Hashes for jarowinkler-2.0.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 2c04d8e761caa643eb9801440ccba12498b958f53146f236aa73a884e66ef23c
MD5 | 44d1bd5da4af4299d4ee317ff01b10bb
BLAKE2b-256 | e8efe6a3a716e5f5fbb32a55ab19384e62427907a37574dd75c4502b09146223