names-matcher
Fuzzily biject people's names between two lists.
Let's define an identity as a series of names belonging to the same person. The algorithm is:
- Parse, normalize, and split names in each identity. The result is a set of strings for each identity.
- Define the similarity between identities as max(ratio, token_set_ratio), where ratio and token_set_ratio are inspired by string comparison functions from rapidfuzz.
- Construct the distance matrix between identities in two specified lists.
- Solve the Linear Assignment Problem (LAP) on that matrix.
Our LAP solver scales to thousands of identities.
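The steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the package's implementation: `simple_ratio` and `simple_token_set_ratio` are hypothetical stand-ins built on `difflib`, the similarity of two identities is assumed to be the best score over all pairs of their names, and the assignment is found by brute force over permutations instead of a real LAP solver, so it only suits tiny inputs. The scores it produces will differ from those of names-matcher.

```python
import re
import unicodedata
from difflib import SequenceMatcher
from itertools import permutations


def normalize(identity):
    """Return the set of lowercased, accent-free, punctuation-free names."""
    names = set()
    for name in identity:
        text = unicodedata.normalize("NFKD", name)
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
        # Assumption: keep only Latin letters and digits, collapse the rest to spaces.
        text = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
        if text:
            names.add(text)
    return names


def simple_ratio(a, b):
    """Character-level similarity in [0, 1] (stand-in for a ratio function)."""
    return SequenceMatcher(None, a, b).ratio()


def simple_token_set_ratio(a, b):
    """Simplified token-set comparison: shared tokens count fully."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    common = " ".join(sorted(tokens_a & tokens_b))
    rest_a = " ".join([common] + sorted(tokens_a - tokens_b)).strip()
    rest_b = " ".join([common] + sorted(tokens_b - tokens_a)).strip()
    if not common:
        return simple_ratio(rest_a, rest_b)
    return max(simple_ratio(common, rest_a),
               simple_ratio(common, rest_b),
               simple_ratio(rest_a, rest_b))


def identity_similarity(names_a, names_b):
    """Assumption: two identities are as similar as their best pair of names."""
    return max((max(simple_ratio(x, y), simple_token_set_ratio(x, y))
                for x in names_a for y in names_b), default=0.0)


def match_identities(first, second):
    """Assign each identity in `first` to a distinct identity in `second`."""
    a = [normalize(identity) for identity in first]
    b = [normalize(identity) for identity in second]
    if len(a) > len(b):
        raise ValueError("this sketch expects len(first) <= len(second)")
    # Distance matrix: rows are identities in `first`, columns in `second`.
    dist = [[1.0 - identity_similarity(x, y) for y in b] for x in a]
    # Brute-force assignment: exact, but factorial time. Tiny inputs only.
    best = min(permutations(range(len(b)), len(a)),
               key=lambda cols: sum(dist[r][c] for r, c in enumerate(cols)))
    return list(best), [1.0 - dist[r][c] for r, c in enumerate(best)]


first = [["Vadim Markovtsev", "vmarkovtsev"], ["Long, Waren", "warenlg"]]
second = [["Warren"], ["VMarkovtsev"], ["Eiso Kant"]]
mapping, confidence = match_identities(first, second)
print(mapping, confidence)
```

Brute force keeps the sketch dependency-free. For anything beyond a handful of identities, a dedicated LAP solver is the practical choice.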
Example:
>>> from names_matcher import NamesMatcher
>>> NamesMatcher()([["Vadim Markovtsev", "vmarkovtsev"], ["Long, Waren", "warenlg"]],
...                [["Warren"], ["VMarkovtsev"], ["Eiso Kant"]])
(array([1, 0], dtype=int32), array([0.75 , 0.57142857]))
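The first element of the returned tuple is the index mapping: it has the same length as the first list, and each value is an index into the second list. The second element holds the corresponding confidence values, from 0 to 1. In the example, the first identity is matched to index 1 (["VMarkovtsev"]) with confidence 0.75, and the second identity to index 0 (["Warren"]) with confidence of about 0.57.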
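To make the returned pair easier to read, the mapping and confidences can be zipped back onto the input lists. The sketch below uses plain Python lists holding the values from the example output. The helper `pair_matches` and its `threshold` parameter are hypothetical and not part of names-matcher.

```python
def pair_matches(first, second, mapping, confidence, threshold=0.0):
    """Pair each identity in `first` with its match in `second`.

    Matches whose confidence is below `threshold` are dropped.
    """
    pairs = []
    for i, (j, conf) in enumerate(zip(mapping, confidence)):
        if conf >= threshold:
            pairs.append((first[i], second[int(j)], float(conf)))
    return pairs


first = [["Vadim Markovtsev", "vmarkovtsev"], ["Long, Waren", "warenlg"]]
second = [["Warren"], ["VMarkovtsev"], ["Eiso Kant"]]
# Values copied from the example output above.
mapping = [1, 0]
confidence = [0.75, 0.57142857]

pairs = pair_matches(first, second, mapping, confidence)
for ours, theirs, conf in pairs:
    print(ours, "->", theirs, "confidence", conf)
```

Every identity in the first list always receives some match, so filtering on the confidence value is one way to discard weak pairings. A suitable cutoff depends on the data.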
Installation
pip3 install names-matcher
Command line interface
Given one identity per line in two files, print the matches to standard output:
python3 -m names_matcher path/to/file/1 path/to/file/2
Each identity is several names joined with |, for example:
Vadim Markovtsev|vmarkovtsev|vadim
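For reference, a file in this format can be read into the list-of-lists shape used in the Python example. This is a minimal sketch: `parse_identities` is a hypothetical helper, and trimming whitespace and skipping blank lines are assumptions, not documented behavior of the command line interface.

```python
import io


def parse_identities(lines):
    """Turn lines of |-separated names into a list of identities."""
    identities = []
    for line in lines:
        names = [name.strip() for name in line.strip().split("|")]
        names = [name for name in names if name]  # drop empty fragments
        if names:
            identities.append(names)
    return identities


# In-memory stand-in for an input file; the second line is illustrative.
sample = io.StringIO("Vadim Markovtsev|vmarkovtsev|vadim\nLong, Waren|warenlg\n")
identities = parse_identities(sample)
print(identities)
```

The same function works on an open file object, since it only iterates over lines.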
Contributing
Contributions are very welcome and desired! Please follow the code of conduct and read the contribution guidelines.
License
Apache-2.0, see LICENSE.