names-matcher
Fuzzily biject people's names between two lists.
Let's define an identity as a series of names belonging to the same person. The algorithm is:
- Parse, normalize, and split the names in each identity. The result is one set of strings per identity.
- Define the similarity between identities as `max(ratio, token_set_ratio)`, where `ratio` and `token_set_ratio` are inspired by string comparison functions from FuzzyWuzzy.
- Construct the distance matrix between identities in two specified lists.
- Solve the Linear Assignment Problem (LAP) on that matrix.
Our LAP solution scales up to lists on the order of thousands of identities.
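The steps above can be sketched with the standard library alone. This is a rough illustration, not the package's implementation: `difflib.SequenceMatcher` stands in for the FuzzyWuzzy-inspired `ratio`, the `token_set_ratio` here is a simplified reimplementation, and the assignment is solved by brute force over permutations, which only works for a handful of identities and assumes the first list is no longer than the second. The helper names `tokens`, `similarity`, and `match` are hypothetical, and the confidence values it produces will differ from the package's.

```python
import re
import unicodedata
from difflib import SequenceMatcher
from itertools import permutations


def tokens(identity):
    """Normalize every name of an identity and return the union of its tokens."""
    result = set()
    for name in identity:
        # Strip accents, lowercase, and split on anything that is not a letter or digit.
        plain = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
        result.update(re.findall(r"[a-z0-9]+", plain.lower()))
    return result


def ratio(a, b):
    """Character-level similarity in [0, 1]; empty strings never match."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def token_set_ratio(a, b):
    """Simplified token-set comparison: shared tokens are compared first."""
    common, only_a, only_b = sorted(a & b), sorted(a - b), sorted(b - a)
    t0 = " ".join(common)
    t1 = " ".join(common + only_a)
    t2 = " ".join(common + only_b)
    return max(ratio(t0, t1), ratio(t0, t2), ratio(t1, t2))


def similarity(a, b):
    plain = ratio(" ".join(sorted(a)), " ".join(sorted(b)))
    return max(plain, token_set_ratio(a, b))


def match(first, second):
    """Brute-force assignment (assumption: len(first) <= len(second), tiny inputs)."""
    rows = [tokens(identity) for identity in first]
    cols = [tokens(identity) for identity in second]
    # Distance matrix: 1 - similarity for every pair of identities.
    dist = [[1.0 - similarity(r, c) for c in cols] for r in rows]
    # Try every injective mapping of rows to columns and keep the cheapest one.
    best = min(
        permutations(range(len(cols)), len(rows)),
        key=lambda p: sum(dist[i][j] for i, j in enumerate(p)),
    )
    return list(best), [1.0 - dist[i][j] for i, j in enumerate(best)]


mapping, confidence = match(
    [["Vadim Markovtsev", "vmarkovtsev"], ["Long, Waren", "warenlg"]],
    [["Warren"], ["VMarkovtsev"], ["Eiso Kant"]],
)
print(mapping)  # [1, 0]
```

Brute force is factorial in the list size, which is why a dedicated LAP solver is needed once the lists grow beyond a few identities.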
Example:
>>> from names_matcher import NamesMatcher
>>> NamesMatcher()([["Vadim Markovtsev", "vmarkovtsev"], ["Long, Waren", "warenlg"]],
...                [["Warren"], ["VMarkovtsev"], ["Eiso Kant"]])
(array([1, 0], dtype=int32), array([0.75 , 0.57142857]))
The first element of the returned tuple is the mapping: an array of the same length as the first list, where each value is the index of the matched identity in the second list. The second element holds the corresponding confidence values, from 0 to 1.
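As a small illustration of how the two returned arrays might be consumed, the snippet below pairs each identity from the first list with its match and drops low-confidence pairs. The mapping and confidence values are copied from the example output; the `THRESHOLD` cut-off is a hypothetical choice and not something the package prescribes.

```python
first = [["Vadim Markovtsev", "vmarkovtsev"], ["Long, Waren", "warenlg"]]
second = [["Warren"], ["VMarkovtsev"], ["Eiso Kant"]]

# Values copied from the example output above.
mapping = [1, 0]
confidence = [0.75, 0.57142857]

THRESHOLD = 0.6  # hypothetical cut-off, tune for your data

pairs = []
for i, (j, conf) in enumerate(zip(mapping, confidence)):
    # mapping[i] is the index in `second` matched to first[i].
    if conf >= THRESHOLD:
        pairs.append((first[i][0], second[j][0], conf))

print(pairs)  # [('Vadim Markovtsev', 'VMarkovtsev', 0.75)]
```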
Installation
pip3 install names-matcher
Command line interface
Given one identity per line in two files, print the matches to standard output:
python3 -m names_matcher path/to/file/1 path/to/file/2
Each identity is several names joined with `|`, for example:
Vadim Markovtsev|vmarkovtsev|vadim
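For reference, a file in this format can be turned into the list-of-lists structure used by the Python example with a few lines of code. This is a minimal sketch: the function name `read_identities` is hypothetical, and skipping blank lines and trimming whitespace are assumptions rather than documented behavior of the command line tool.

```python
import io


def read_identities(stream):
    """Read one identity per line; names within a line are separated by '|'."""
    identities = []
    for line in stream:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines carry no identity
        identities.append([name.strip() for name in line.split("|") if name.strip()])
    return identities


sample = io.StringIO("Vadim Markovtsev|vmarkovtsev|vadim\nLong, Waren|warenlg\n")
identities = read_identities(sample)
print(identities)
# [['Vadim Markovtsev', 'vmarkovtsev', 'vadim'], ['Long, Waren', 'warenlg']]
```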
Contributing
Contributions are very welcome and desired! Please follow the code of conduct and read the contribution guidelines.
License
Apache-2.0, see LICENSE.