Fuzzy Name Matching with Machine Learning

HMNI

Fuzzy name matching with machine learning. Perform common fuzzy name matching tasks including similarity scoring, record linkage, deduplication and normalization.

HMNI is trained on an internationally transliterated Latin first-name dataset, and the model prioritizes precision over recall.

Model	Accuracy	Precision	Recall	F1-Score
HMNI-Latin	0.9393	0.9255	0.7548	0.8315
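
As a quick consistency check, the F1-Score in the table can be recomputed from the reported precision and recall with the standard harmonic-mean formula. The variable names below are illustrative only and are not part of HMNI.

```python
# Reported HMNI-Latin metrics, copied from the table above.
precision = 0.9255
recall = 0.7548

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.8315
```

The recomputed value agrees with the reported F1-Score to four decimal places.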

For an introduction to the methodology and research behind HMNI, please refer to my blog post.

Requirements

Python 3.5–3.8

tensorflow
scikit-learn
fuzzywuzzy
abydos
unidecode

QUICK USAGE GUIDE

Installation

Using pip via PyPI

pip install hmni

Initialize a Matcher Object

import hmni
matcher = hmni.Matcher(model='latin')

Single Pair Similarity

matcher.similarity('Alan', 'Al')
# 0.6838303319889133

matcher.similarity('Alan', 'Al', prob=False)
# 1

matcher.similarity('Alan Turing', 'Al Turing', surname_first=False)
# 0.6838303319889133

Record Linkage

import pandas as pd

df1 = pd.DataFrame({'name': ['Al', 'Mark', 'James', 'Harold']})
df2 = pd.DataFrame({'name': ['Mark', 'Alan', 'James', 'Harold']})

merged = matcher.fuzzymerge(df1, df2, how='left', on='name')

Name Deduplication and Normalization

names_list = ['Alan', 'Al', 'Al', 'James']

matcher.dedupe(names_list, keep='longest')
# ['Alan', 'James']

matcher.dedupe(names_list, keep='frequent')
# ['Al', 'James']

matcher.dedupe(names_list, keep='longest', replace=True)
# ['Alan', 'Alan', 'Alan', 'James']

Matcher Parameters

hmni.Matcher(model='latin', prefilter=True, allow_alt_surname=True, allow_initials=True, allow_missing_components=True)

model (str) -- HMNI statistical model (latin by default)
prefilter (bool) -- Should the matcher prefilter unlikely candidates (True by default)
allow_alt_surname (bool) -- Should the matcher consider phonetic matching surnames e.g. Smith, Schmidt (True by default)
allow_initials (bool) -- Should the matcher consider names with initials (True by default)
allow_missing_components (bool) -- Should the matcher consider names with missing components (True by default)

Matcher Methods

similarity(name_a, name_b, prob=True, threshold=0.5, surname_first=False)

name_a (str) -- First name for comparison
name_b (str) -- Second name for comparison
prob (bool) -- If True return a predicted probability, else binary class label
threshold (float) -- Prediction probability threshold for positive match (0.5 by default)
surname_first (bool) -- If name strings start with surname (False by default)
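
The relationship between prob and threshold can be sketched as follows. This is an assumption about how a binary label is derived from a probability, not HMNI's actual code: to_label is a hypothetical helper, and whether the library uses >= or > at the boundary is not documented here.

```python
def to_label(probability, threshold=0.5):
    """Hypothetical helper: turn a match probability into a 0/1 class label."""
    # Assumption: a score at or above the threshold counts as a match.
    return int(probability >= threshold)

score = 0.6838303319889133  # probability shown earlier for ('Alan', 'Al')
print(to_label(score))                 # 1
print(to_label(score, threshold=0.7))  # 0
```

Raising the threshold trades recall for precision: fewer pairs are labelled as matches, but those that are tend to be more reliable.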

fuzzymerge(df1, df2, how='inner', on=None, left_on=None, right_on=None, indicator=False, limit=1, threshold=0.5, allow_exact_matches=True, surname_first=False)

df1 (pandas DataFrame or named Series) -- First/Left object to merge with
df2 (pandas DataFrame or named Series) -- Second/Right object to merge with
how (str) -- Type of merge to be performed
- inner (default): Use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys
- left: Use only keys from left frame, similar to a SQL left outer join; preserve key order
- right: Use only keys from right frame, similar to a SQL right outer join; preserve key order
- outer: Use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically
on (label or list) -- Column or index level names to join on. These must be found in both DataFrames
left_on (label or list) -- Column or index level names to join on in the left DataFrame
right_on (label or list) -- Column or index level names to join on in the right DataFrame
indicator (bool) -- If True, adds a column to output DataFrame called “_merge” with information on the source of each row (False by default)
limit (int) -- Top number of name matches to consider (1 by default)
threshold (float) -- Prediction probability threshold for positive match (0.5 by default)
allow_exact_matches (bool) -- If True allow merging on exact name matches, else do not consider exact matches (True by default)
surname_first (bool) -- If name strings start with surname (False by default)
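
To illustrate how how='left', limit and threshold interact, here is a minimal standard-library sketch of a fuzzy left join. It is not HMNI's implementation: stand_in_score uses difflib.SequenceMatcher in place of the trained model, and fuzzy_left_join is a hypothetical helper operating on plain lists rather than DataFrames.

```python
from difflib import SequenceMatcher

def stand_in_score(a, b):
    """Stand-in scorer (NOT HMNI's model): character-overlap ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_left_join(left, right, threshold=0.5, limit=1):
    """Hypothetical helper: pair each left name with its best right matches."""
    rows = []
    for name in left:
        # Rank every candidate on the right by score, best first.
        scored = sorted(((stand_in_score(name, c), c) for c in right), reverse=True)
        # Keep at most `limit` candidates that clear the threshold.
        matches = [c for s, c in scored[:limit] if s >= threshold]
        if matches:
            rows.extend((name, m) for m in matches)
        else:
            # Left join: unmatched left names are kept with no partner.
            rows.append((name, None))
    return rows

left = ['Al', 'Mark', 'James', 'Harold']
right = ['Mark', 'Alan', 'James', 'Harold']
pairs = fuzzy_left_join(left, right)
print(pairs)
```

With this stand-in scorer, 'Al' pairs with 'Alan' and the other names pair with their exact counterparts. A left name with no candidate above the threshold would be kept with None, mirroring a SQL left outer join.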

dedupe(names, threshold=0.5, keep='longest', reverse=True, limit=3, replace=False, surname_first=False)

names (list) -- List of names to dedupe
threshold (float) -- Prediction probability threshold for positive match (0.5 by default)
keep (str) -- Specifies method for keeping one of multiple alternative names
- longest (default): Keeps longest name
- frequent: Keeps most frequent name in names list
reverse (bool) -- If True, sort matches in descending order, else ascending (True by default)
limit (int) -- Top number of name matches to consider (3 by default)
replace (bool) -- If True return normalized name list, else return deduplicated name list (False by default)
surname_first (bool) -- If name strings start with surname (False by default)
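
The two keep strategies can be sketched for a single group of names that have already been judged to match. This is an illustrative assumption, not HMNI's code: pick_representative is a hypothetical helper, and how the library breaks ties is not covered here.

```python
from collections import Counter

def pick_representative(group, keep='longest'):
    """Hypothetical helper: choose one name to stand for a group of matches."""
    if keep == 'longest':
        # Keep the name with the most characters.
        return max(group, key=len)
    if keep == 'frequent':
        # Keep the name that occurs most often in the group.
        return Counter(group).most_common(1)[0][0]
    raise ValueError("keep must be 'longest' or 'frequent'")

group = ['Alan', 'Al', 'Al']  # names assumed to refer to the same person
print(pick_representative(group, keep='longest'))   # Alan
print(pick_representative(group, keep='frequent'))  # Al

# Normalization (replace=True) swaps every member for the representative.
normalized = [pick_representative(group)] * len(group)
print(normalized)  # ['Alan', 'Alan', 'Alan']
```

These results line up with the dedupe examples in the usage guide, where 'James' forms its own group and is returned unchanged.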

assign_similarity(name_a, name_b, score)

name_a (str) -- First name for similarity score assignment
name_b (str) -- Second name for similarity score assignment
score (float) -- Assigned similarity score for pair of names

Contributing

Pull requests are welcome. For developers wishing to build a model for Latin or non-Latin writing systems (Chinese, Cyrillic, Arabic), Jupyter notebooks that build models using similar methods are shared in the dev folder.

License

MIT

Release history

0.1.8 (this version): Sep 14, 2020
0.1.7: Sep 11, 2020
0.1.6: Aug 27, 2020
0.1.5: Aug 19, 2020

Download files

Source Distribution

hmni-0.1.8.tar.gz (22.2 MB)

Uploaded Sep 14, 2020 Source

Built Distribution

hmni-0.1.8-py3-none-any.whl (22.2 MB)

Uploaded Sep 14, 2020 Python 3

File details

Details for the file hmni-0.1.8.tar.gz.

File metadata

Download URL: hmni-0.1.8.tar.gz
Upload date: Sep 14, 2020
Size: 22.2 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.6.6

File hashes

Hashes for hmni-0.1.8.tar.gz
Algorithm	Hash digest
SHA256	`7d2d339c6848a509ac5bf99b6f925e1eee4cd858bec1d8233c87f375d9ed0063`
MD5	`290e9969eab1fa04649507ed026d4cca`
BLAKE2b-256	`9b860c5b4406c666ad73feef5d8dab012f04a6e4f1a31eeab60b32f530cb4fb7`

File details

Details for the file hmni-0.1.8-py3-none-any.whl.

File metadata

Download URL: hmni-0.1.8-py3-none-any.whl
Upload date: Sep 14, 2020
Size: 22.2 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.6.6

File hashes

Hashes for hmni-0.1.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f262fa842b0d7a6c7e1b5ae643c32e9ac9f61e33e79c7bab1396b7e5ef8aac36`
MD5	`0928b37676cc7a061114a37ba30f45ca`
BLAKE2b-256	`6010ef662c2d9d01f2fc5b13c0a779259b12b3b916bcde6686841496bcd665a4`
