This is a library used to make fuzzy name comparisons across census records.

Name Comparator

This package is used for fuzzy name comparisons.

Problem

With most historic records, there is the problem of messy data: nicknames are common, abbreviations are prevalent, mishearings are not rare, and misspellings are everywhere. Because of these and other factors, it is very difficult to automate name comparisons.

Solution

This package attempts to minimize that difficulty. By tokenizing names into their individual words, names can be compared using simple algorithms. Not only that, but this package cleans common indexing errors, understands most common nicknames, takes into account diverse spelling and pronunciation rules, and more in order to better compare messy name data.
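As a rough illustration of the tokenizing idea only, the sketch below splits a name into lowercase words and drops punctuation. The `tokenize` function is a hypothetical stand-in written for this example; it is not the package's own cleaning code, which does considerably more (nicknames, indexing errors, spelling rules).

```python
def tokenize(name):
    """Hypothetical helper: lowercase a name and split it into words.

    Any character that is not a letter (commas, periods, digits) is
    treated as a separator.
    """
    cleaned = ''.join(ch if ch.isalpha() else ' ' for ch in name.lower())
    return cleaned.split()


print(tokenize('Christian, Jean'))
print(tokenize('Johnny Christians'))
```

Once both names are lists of words, the word order no longer has to agree, which is what allows 'Christian, Jean' to be compared against a name written given-name first.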

The Code

from NameComparator.NameComparator import NameComparator

nc = NameComparator()
# it is recommended to reuse this object as it is expensive to initialize

results = nc.compareTwoNames('Johnny Christians', 'Christian, Jean')

print(results)
# {'match': True, 'tooGeneric': False, 'tooShort': False, 'attempt1': ('jean christians', 'christian jean', [('0', '1', 100), ('1', '0', 100)]), 'attempt2': None, 'attempt3': None, 'attempt4': None}

The snippet above shows example usage of the package. The results variable is a dictionary. The keys relevant to most users are 'match', 'tooGeneric', and 'tooShort': 'match' indicates whether the two names were judged a match, 'tooGeneric' indicates whether either name was too generic (e.g. 'john smith'), and 'tooShort' indicates whether either name had too few words (e.g. 'justin').
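One way to consume the three main keys is sketched below. The dictionary is copied from the example output above so the snippet runs without the package installed; the `summarize` helper and its labels are hypothetical, and the choice to flag short or generic names before reporting a match is an assumption about how a caller might want to treat them.

```python
# Copied from the example output above; in real use this would come from
# nc.compareTwoNames(...).
results = {
    'match': True,
    'tooGeneric': False,
    'tooShort': False,
    'attempt1': ('jean christians', 'christian jean',
                 [('0', '1', 100), ('1', '0', 100)]),
    'attempt2': None,
    'attempt3': None,
    'attempt4': None,
}


def summarize(results):
    """Hypothetical helper: collapse the three main flags into one label."""
    if results['tooShort']:
        return 'review: too short'
    if results['tooGeneric']:
        return 'review: too generic'
    return 'match' if results['match'] else 'no match'


print(summarize(results))
```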

If you are interested in the debugging portion of the dictionary, each attempt applies a different method to decide whether the names match. Two names might fail one or two methods but eventually be proven a match; for example, a simple spelling comparison would fail for 'Maurice' and 'Morris'. The first attempt compares lightly cleaned tokens by spelling. The second attempt edits the tokens more heavily to get a cleaner spelling comparison. The third attempt checks whether the tokens from the first attempt match by pronunciation. The fourth and last attempt checks whether the modified tokens from the second attempt match by pronunciation.
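The idea of falling back from spelling to pronunciation can be sketched with the standard library alone. This is not the package's implementation: `difflib.SequenceMatcher` stands in for the spelling comparison, a simplified Soundex stands in for the pronunciation comparison, the threshold of 80 is an arbitrary value chosen for illustration, and only two of the four stages are shown.

```python
from difflib import SequenceMatcher

# Simplified Soundex letter groups (stand-in for a pronunciation comparison).
_CODES = {}
for letters, digit in (('bfpv', '1'), ('cgjkqsxz', '2'), ('dt', '3'),
                       ('l', '4'), ('mn', '5'), ('r', '6')):
    for ch in letters:
        _CODES[ch] = digit


def soundex(word):
    """Return a four-character Soundex key such as 'M620'."""
    word = ''.join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ''
    key = word[0].upper()
    prev = _CODES.get(word[0], '')
    for ch in word[1:]:
        digit = _CODES.get(ch, '')
        if digit and digit != prev:
            key += digit
        if ch not in 'hw':  # h and w do not separate repeated codes
            prev = digit
    return (key + '000')[:4]


def compare_words(a, b, threshold=80):
    """Try spelling first, then pronunciation; return which stage matched."""
    spelling = round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())
    if spelling >= threshold:
        return 'spelling'
    if soundex(a) == soundex(b):
        return 'pronunciation'
    return None


print(compare_words('Maurice', 'Morris'))  # pronunciation
```

'Maurice' and 'Morris' share too few letters to pass the spelling stage here, but both reduce to the same Soundex key, so the second stage reports a match.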

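In the example output above, an attempt's value is either None or a tuple holding the two processed names followed by a list of tuples. Such a list looks like this: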

[('0', '0', 80), ('1', '1', 100), ('3', '2', 100)]

Each tuple represents the best pairing of one word in the first provided name, with another word in the second provided name. Each tuple has three values: a string of the index number of the word in the first provided name, a string of the index number of the word in the second provided name, and a score of how well they matched (0-100). In the above example:

  • the 1st word in the 1st provided name matched with the 1st word in the 2nd provided name, with a score of 80.
  • the 2nd word in the 1st provided name matched with the 2nd word in the 2nd provided name, with a score of 100.
  • the 4th word in the 1st provided name matched with the 3rd word in the 2nd provided name, with a score of 100.
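To make the index strings easier to read, the pairs can be mapped back to words. The sketch below uses the 'attempt1' value from the example output earlier and assumes the indices refer to the whitespace-separated words of the two processed name strings in that tuple; `describe_pairs` is a hypothetical helper, not part of the package.

```python
# 'attempt1' value copied from the example output earlier in this document.
attempt1 = ('jean christians', 'christian jean',
            [('0', '1', 100), ('1', '0', 100)])


def describe_pairs(name1, name2, pairs):
    """Hypothetical helper: turn index pairs into (word1, word2, score)."""
    words1 = name1.split()
    words2 = name2.split()
    # The indices are stored as strings, so convert them before indexing.
    return [(words1[int(i)], words2[int(j)], score) for i, j, score in pairs]


for word1, word2, score in describe_pairs(*attempt1):
    print(f'{word1!r} <-> {word2!r}: {score}')
```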

The algorithm finds all possible word pairs and chooses the set of pairs that gives the highest overall score for the comparison. Names can match even if each is only one word long, but how these lists of tuples are turned into the boolean 'match' depends heavily on the number of words in the shorter name. For example, if the shorter name has three or more words, the score threshold required for a match is lower than if it had only two words. This is because there is a much lower chance of a false positive when more words are present that are decent matches. Initials are also taken into account.
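The pairing step described above can be illustrated with a brute-force search. This is a sketch under stated assumptions, not the package's code: `difflib.SequenceMatcher` is a stand-in word scorer, `best_pairing` is a hypothetical function, every ordering of the longer name's words is tried (fine for names of a few words), and no match threshold is applied because the package's actual thresholds are not given here.

```python
from difflib import SequenceMatcher
from itertools import permutations


def word_score(a, b):
    """Stand-in scorer: similarity of two words on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_pairing(words1, words2):
    """Pair each word of the shorter name with a distinct word of the longer
    name so that the summed score is as high as possible."""
    swapped = len(words1) > len(words2)
    short, long_ = (words2, words1) if swapped else (words1, words2)
    best, best_total = [], -1
    for perm in permutations(range(len(long_)), len(short)):
        pairs = [(i, j, word_score(short[i], long_[j]))
                 for i, j in enumerate(perm)]
        total = sum(score for _, _, score in pairs)
        if total > best_total:
            best, best_total = pairs, total
    if swapped:
        # Put the first name's index first again.
        best = sorted((j, i, score) for i, j, score in best)
    # Mirror the string indices used in the package's output.
    return [(str(i), str(j), score) for i, j, score in best]


print(best_pairing(['jean', 'christians'], ['christian', 'jean']))
```

With these inputs the search pairs 'jean' with 'jean' and 'christians' with 'christian', the same index pattern as the example output shown earlier, although the stand-in scorer will not reproduce the package's scores.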

Enjoy!
