Fuzzy Name Matching with Machine Learning

HMNI

Fuzzy name matching with machine learning. Perform common fuzzy name matching tasks including similarity scoring, record linkage, deduplication and normalization.

HMNI is trained on an internationally transliterated dataset of Latin-script first names, with precision prioritized over recall.

Model        Accuracy   Precision   Recall   F1-Score
HMNI-Latin   0.9393     0.9255      0.7548   0.8315
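
As a quick sanity check, the reported F1-score is consistent with the reported precision and recall, since F1 is their harmonic mean. The snippet below recomputes it from the two published figures; the variable names are illustrative only.

```python
# Recompute F1 from the precision and recall reported for HMNI-Latin.
precision = 0.9255
recall = 0.7548

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))  # 0.8315
```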

For an introduction to the methodology and research behind HMNI, please refer to my blog post.

Requirements

Python 3.5–3.8

  • tensorflow
  • scikit-learn
  • fuzzywuzzy
  • abydos
  • unidecode

QUICK USAGE GUIDE

Installation

Using pip via PyPI

pip install hmni

Initialize a Matcher Object

import hmni
matcher = hmni.Matcher(model='latin')

Single Pair Similarity

matcher.similarity('Alan', 'Al')
# 0.6838303319889133

matcher.similarity('Alan', 'Al', prob=False)
# 1

matcher.similarity('Alan Turing', 'Al Turing', surname_first=False)
# 0.6838303319889133
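
With prob=False the call returns a class label instead of a probability. A minimal sketch of that conversion, assuming the label is simply the probability compared against the threshold. The helper to_label is hypothetical and not part of HMNI, and whether the library's comparison is inclusive is also an assumption.

```python
def to_label(probability: float, threshold: float = 0.5) -> int:
    """Hypothetical helper: map a match probability to a 0/1 class label."""
    # Assumption: a probability equal to the threshold counts as a match.
    return int(probability >= threshold)


score = 0.6838303319889133  # probability shown above for ('Alan', 'Al')

print(to_label(score))                 # 1
print(to_label(score, threshold=0.7))  # 0
```

Raising the threshold trades recall for precision: fewer pairs are labelled as matches, but those that are carry a higher predicted probability.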

Record Linkage

import pandas as pd

df1 = pd.DataFrame({'name': ['Al', 'Mark', 'James', 'Harold']})
df2 = pd.DataFrame({'name': ['Mark', 'Alan', 'James', 'Harold']})

merged = matcher.fuzzymerge(df1, df2, how='left', on='name')
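
Conceptually, a left fuzzy merge pairs every name on the left with its best-scoring candidate on the right and keeps the pair only if the score clears a threshold. The sketch below illustrates that idea with the standard library alone. It is not HMNI's implementation: stand_in_score uses difflib's character ratio in place of the trained model, and fuzzy_left_join is a hypothetical name.

```python
from difflib import SequenceMatcher


def stand_in_score(a: str, b: str) -> float:
    """Stand-in scorer (difflib character ratio), NOT the HMNI model."""
    return SequenceMatcher(None, a, b).ratio()


def fuzzy_left_join(left, right, threshold=0.5):
    """Pair each left name with its best-scoring right name, or None."""
    pairs = []
    for name in left:
        # Pick the right-hand candidate with the highest score.
        best = max(right, key=lambda cand: stand_in_score(name, cand))
        # Keep the pairing only if it clears the threshold.
        matched = best if stand_in_score(name, best) >= threshold else None
        pairs.append((name, matched))
    return pairs


left = ['Al', 'Mark', 'James', 'Harold']
right = ['Mark', 'Alan', 'James', 'Harold']

print(fuzzy_left_join(left, right))
# [('Al', 'Alan'), ('Mark', 'Mark'), ('James', 'James'), ('Harold', 'Harold')]
```

Every left-hand name appears in the result, which mirrors the behaviour of a SQL left outer join; unmatched names would be paired with None.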

Name Deduplication and Normalization

names_list = ['Alan', 'Al', 'Al', 'James']

matcher.dedupe(names_list, keep='longest')
# ['Alan', 'James']

matcher.dedupe(names_list, keep='frequent')
# ['Al', 'James']

matcher.dedupe(names_list, keep='longest', replace=True)
# ['Alan', 'Alan', 'Alan', 'James']
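
The difference between the keep and replace options can be illustrated with a small standalone sketch. It groups names greedily into clusters of likely variants and then picks one representative per cluster. This is an assumption about the general approach, not HMNI's code: stand_in_score uses difflib in place of the trained model, and dedupe_sketch is a hypothetical name.

```python
from collections import Counter
from difflib import SequenceMatcher


def stand_in_score(a, b):
    """Stand-in scorer (difflib character ratio), NOT the HMNI model."""
    return SequenceMatcher(None, a, b).ratio()


def dedupe_sketch(names, threshold=0.5, keep='longest', replace=False):
    clusters = []  # each cluster is a list of names judged to be variants
    for name in names:
        for cluster in clusters:
            # Compare against the first member of each existing cluster.
            if stand_in_score(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])

    def representative(cluster):
        if keep == 'frequent':
            return Counter(cluster).most_common(1)[0][0]
        return max(cluster, key=len)  # keep == 'longest'

    if replace:
        # Normalization: map every input name to its cluster representative.
        canonical = {n: representative(c) for c in clusters for n in c}
        return [canonical[n] for n in names]
    # Deduplication: one representative per cluster.
    return [representative(c) for c in clusters]


names_list = ['Alan', 'Al', 'Al', 'James']

print(dedupe_sketch(names_list, keep='longest'))                # ['Alan', 'James']
print(dedupe_sketch(names_list, keep='frequent'))               # ['Al', 'James']
print(dedupe_sketch(names_list, keep='longest', replace=True))  # ['Alan', 'Alan', 'Alan', 'James']
```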

Matcher Parameters

hmni.Matcher(model='latin', prefilter=True, allow_alt_surname=True, allow_initials=True, allow_missing_components=True)

  • model (str) -- HMNI statistical model (latin by default)
  • prefilter (bool) -- Whether the matcher should prefilter unlikely candidates (True by default)
  • allow_alt_surname (bool) -- Whether the matcher should consider phonetically matching surnames, e.g. Smith and Schmidt (True by default)
  • allow_initials (bool) -- Whether the matcher should consider names with initials (True by default)
  • allow_missing_components (bool) -- Whether the matcher should consider names with missing components (True by default)

Matcher Methods

similarity(name_a, name_b, prob=True, threshold=0.5, surname_first=False)

  • name_a (str) -- First name for comparison
  • name_b (str) -- Second name for comparison
  • prob (bool) -- If True, return the predicted probability; otherwise return a binary class label (True by default)
  • threshold (float) -- Prediction probability threshold for positive match (0.5 by default)
  • surname_first (bool) -- If name strings start with surname (False by default)

fuzzymerge(df1, df2, how='inner', on=None, left_on=None, right_on=None, indicator=False, limit=1, threshold=0.5, allow_exact_matches=True, surname_first=False)

  • df1 (pandas DataFrame or named Series) -- First/Left object to merge with
  • df2 (pandas DataFrame or named Series) -- Second/Right object to merge with
  • how (str) -- Type of merge to be performed
    • inner (default): Use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys
    • left: Use only keys from left frame, similar to a SQL left outer join; preserve key order
    • right: Use only keys from right frame, similar to a SQL right outer join; preserve key order
    • outer: Use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically
  • on (label or list) -- Column or index level names to join on. These must be found in both DataFrames
  • left_on (label or list) -- Column or index level names to join on in the left DataFrame
  • right_on (label or list) -- Column or index level names to join on in the right DataFrame
  • indicator (bool) -- If True, adds a column to output DataFrame called “_merge” with information on the source of each row (False by default)
  • limit (int) -- Top number of name matches to consider (1 by default)
  • threshold (float) -- Prediction probability threshold for positive match (0.5 by default)
  • allow_exact_matches (bool) -- If True allow merging on exact name matches, else do not consider exact matches (True by default)
  • surname_first (bool) -- If name strings start with surname (False by default)
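
The interaction of limit, threshold and allow_exact_matches can be pictured as a candidate-selection step: score every candidate, optionally drop exact matches, discard scores below the threshold, and keep the top few. The sketch below is an assumption about that logic rather than HMNI's implementation; top_matches is a hypothetical helper and stand_in_score substitutes difflib for the trained model.

```python
from difflib import SequenceMatcher


def stand_in_score(a, b):
    """Stand-in scorer (difflib character ratio), NOT the HMNI model."""
    return SequenceMatcher(None, a, b).ratio()


def top_matches(name, candidates, limit=1, threshold=0.5,
                allow_exact_matches=True):
    """Return up to `limit` candidate names scoring at least `threshold`."""
    scored = []
    for cand in candidates:
        if cand == name and not allow_exact_matches:
            continue  # skip identical strings when exact matches are disallowed
        score = stand_in_score(name, cand)
        if score >= threshold:
            scored.append((cand, score))
    # Highest-scoring candidates first, then truncate to the limit.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [cand for cand, _ in scored[:limit]]


candidates = ['Mark', 'Alan', 'Al', 'James', 'Harold']

print(top_matches('Al', candidates, limit=2))
# ['Al', 'Alan']
print(top_matches('Al', candidates, limit=2, allow_exact_matches=False))
# ['Alan']
```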

dedupe(names, threshold=0.5, keep='longest', reverse=True, limit=3, replace=False, surname_first=False)

  • names (list) -- List of names to dedupe
  • threshold (float) -- Prediction probability threshold for positive match (0.5 by default)
  • keep (str) -- Specifies method for keeping one of multiple alternative names
    • longest (default): Keeps longest name
    • frequent: Keeps most frequent name in names list
  • reverse (bool) -- If True, sort matches in descending order, else ascending (True by default)
  • limit (int) -- Top number of name matches to consider (3 by default)
  • replace (bool) -- If True return normalized name list, else return deduplicated name list (False by default)
  • surname_first (bool) -- If name strings start with surname (False by default)

assign_similarity(name_a, name_b, score)

  • name_a (str) -- First name for similarity score assignment
  • name_b (str) -- Second name for similarity score assignment
  • score (float) -- Assigned similarity score for pair of names
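
One plausible way to think about manual score assignment is as an override table that is consulted before any model-based score. The sketch below shows that pattern in isolation. It is an illustration only: OverrideScorer is a hypothetical class, the base scorer is a trivial placeholder, and treating the pair as unordered is an assumption rather than documented HMNI behaviour.

```python
class OverrideScorer:
    """Hypothetical wrapper: manually assigned scores take precedence."""

    def __init__(self, base_score):
        self.base_score = base_score  # fallback scoring function
        self.overrides = {}           # manually assigned pair scores

    def assign_similarity(self, name_a, name_b, score):
        # Assumption: the pair is unordered, so (a, b) and (b, a) share a key.
        self.overrides[frozenset((name_a, name_b))] = score

    def similarity(self, name_a, name_b):
        key = frozenset((name_a, name_b))
        if key in self.overrides:
            return self.overrides[key]
        return self.base_score(name_a, name_b)


# Placeholder base scorer: 1.0 for identical strings, 0.0 otherwise.
scorer = OverrideScorer(base_score=lambda a, b: 1.0 if a == b else 0.0)
scorer.assign_similarity('Bill', 'William', 0.95)

print(scorer.similarity('William', 'Bill'))  # 0.95
print(scorer.similarity('Bill', 'Bob'))      # 0.0
```

An override of this kind is useful for nickname pairs that share few characters and that a string-based model is therefore likely to miss.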

Contributing

Pull requests are welcome. For developers who want to build a model for Latin or non-Latin writing systems (e.g. Chinese, Cyrillic, Arabic), Jupyter notebooks in the dev folder show how to build models using similar methods.

License

MIT
