
This is a library used to compare name similarity.


Name Comparator

This package is used for fuzzy name comparisons.

Problem

With most historic records, there is the problem of messy data: nicknames are common, abbreviations are prevalent, mishearings are not rare, and misspellings are everywhere. Because of these and other factors, it is very difficult to automate name comparisons.

Solution

This package attempts to minimize that difficulty. By tokenizing names into their individual words, names can be compared using simple algorithms. Not only that, but this package cleans common indexing errors, understands most common nicknames, takes into account diverse spelling and pronunciation rules, and more in order to better compare messy name data.
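
To make the idea of tokenizing concrete, here is a minimal sketch of splitting a name into lowercase words. It is not the package's own cleaning code, which does far more (nicknames, indexing errors, spelling rules); the function name tokenize_name is hypothetical and exists only for this illustration.

```python
import re


def tokenize_name(name: str) -> list[str]:
    """Lowercase a name, turn punctuation into spaces, and split it into words.

    Hypothetical helper for illustration; the package's real cleaning is richer.
    """
    without_punctuation = re.sub(r"[^\w\s]", " ", name.lower())
    return without_punctuation.split()


print(tokenize_name("Christian, Jean"))
# ['christian', 'jean']
```

Once both names are lists of words, each word in one name can be scored against each word in the other, regardless of the order in which the words were written.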

The Code

from NameComparator import NameComparator

results = NameComparator.compare_two_names(name_one='Johnny Christians', name_two='Christian, Jean')

print(results)
# ResultsOfNameComparison(
#     name_one='Johnny Christians',
#     name_two='Christian, Jean',
#     match=True,
#     uniqueness=100,
#     too_short=False,
#     attempt_one=Attempt(name_one='jean christians', name_two='christian jean', word_combo=[('0', '1', 100.0), ('1', '0', 100.0)]),
#     attempt_two=None,
#     attempt_three=None,
#     attempt_four=None
# )

The above code snippet shows example usage of the package. The results variable is a ResultsOfNameComparison object with several attributes. The attributes relevant to most users are match, uniqueness, and too_short.

  • match identifies whether the comparison was a match
  • uniqueness scores, out of 100, how unique the two names are when compared to one another (e.g. 'john smith' compared to 'j smith' would have a very low uniqueness score).
  • too_short identifies whether either name has too few words (e.g. 'justin').
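
One way to combine these three attributes is sketched below. The helper is_confident_match and its min_uniqueness cutoff of 50 are assumptions made for this example, not part of the package; the SimpleNamespace object only stands in for a real results object so the snippet runs without the package installed.

```python
from types import SimpleNamespace


def is_confident_match(results, min_uniqueness: int = 50) -> bool:
    """Accept a match only if neither name is too short or too generic.

    Hypothetical helper; the cutoff of 50 is an arbitrary illustrative value.
    """
    return (
        bool(results.match)
        and not results.too_short
        and results.uniqueness >= min_uniqueness
    )


# Stand-in carrying the same attribute values as the printed results above.
example = SimpleNamespace(match=True, uniqueness=100, too_short=False)
print(is_confident_match(example))
# True
```

The right cutoff depends on how costly a false match is in your data, so treat the default as a starting point only.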

If you are interested in debugging, or in looking deeper at which factors decided whether the comparison was a match, see the attempt attributes. Each attempt applies a different method to decide whether the names match. These methods include cleaning the names in reference to one another and to spelling rules, and comparing the names' pronunciation instead of their spelling. The attempts run in order because two names might fail one or two methods and still prove to be a match. An attempt is None if a previous attempt already found a match, as there is no reason to continue the comparison.

Let's look at the example of the name comparison of 'Maurice' and 'Morris'.

  • attempt_one is a simple spelling comparison after minimal cleaning. This would fail to identify a match.
  • attempt_two is a heavier edit of the spelling in reference to the other name and to spelling rules. This would also prove ineffective.
  • attempt_three checks whether the modified tokens from attempt_two match according to pronunciation. This would work!
  • attempt_four, the last attempt (not reached in this scenario), checks whether the original tokens from attempt_one match according to pronunciation.
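
To show why a pronunciation comparison can succeed where spelling fails, here is a sketch using classic American Soundex. This is a stand-in technique chosen for the example; this description does not say which phonetic method the package actually uses, and the soundex function below is not part of its API.

```python
# Soundex digit for each consonant; vowels, h, w and y have no digit.
SOUNDEX_CODES = {
    letter: digit
    for letters, digit in (
        ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
        ("l", "4"), ("mn", "5"), ("r", "6"),
    )
    for letter in letters
}


def soundex(word: str) -> str:
    """Return the four-character American Soundex code of a word."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters that share a code
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != previous:
            encoded += code
        previous = code  # a vowel resets this, so a repeated code counts again
    return (encoded + "000")[:4]


print(soundex("Maurice"), soundex("Morris"))
# M620 M620
```

The two spellings differ in most of their letters, yet both reduce to the same code, which is the kind of agreement a pronunciation-based attempt looks for.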

Finally, when debugging, it is important to understand that no attempt after attempt_one is undertaken if the names have no chance of matching. This is decided by the is_worth_continuing function. If the comparison fails this check, or gets through all four attempts without passing any of them, match is False.
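
The control flow described above can be sketched as follows. This is a simplified model, not the package's source: run_attempts is a hypothetical name, each attempt is reduced to a callable returning True or False (the real attempts are Attempt objects), and the is_worth_continuing argument is a stand-in callable for the function of the same name.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def run_attempts(
    attempts: Sequence[Callable[[], bool]],
    is_worth_continuing: Callable[[], bool],
) -> Tuple[bool, List[Optional[bool]]]:
    """Run attempts in order and stop at the first one that finds a match.

    Attempts that never ran are left as None in the returned list.
    """
    outcomes: List[Optional[bool]] = [None] * len(attempts)
    for position, attempt in enumerate(attempts):
        # Assumed gate: after the first attempt, give up on hopeless pairs.
        if position == 1 and not is_worth_continuing():
            break
        outcomes[position] = attempt()
        if outcomes[position]:
            return True, outcomes
    return False, outcomes


# 'Maurice' vs 'Morris': the first two attempts fail, the third succeeds.
match, outcomes = run_attempts(
    [lambda: False, lambda: False, lambda: True, lambda: False],
    is_worth_continuing=lambda: True,
)
print(match, outcomes)
# True [False, False, True, None]
```

The trailing None mirrors what you see in the results object: attempt_four is never filled in because attempt_three already settled the question.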

Each attempt's word combo is a list of tuples:

[('0', '0', 80), ('1', '1', 100), ('3', '2', 100)]

Each tuple in the list represents the best pairing of one word in name_one with one word in name_two. Each tuple has three values: a string of the index number of the word in the first provided name, a string of the index number of the word in the second provided name, and a score of how well they matched (0-100). In the above example:

  • the 1st word in name_one matched with the 1st word in name_two, with a score of 80.
  • the 2nd word in name_one matched with the 2nd word in name_two, with a score of 100.
  • the 4th word in name_one matched with the 3rd word in name_two, with a score of 100.
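
When reading a word combo while debugging, it can help to translate the index strings back into the words they point to. The helper below is a hypothetical convenience, not part of the package; it assumes the indexes refer to the whitespace-separated words of the cleaned names stored on the attempt, which is consistent with the attempt_one output shown earlier.

```python
def describe_word_combo(name_one, name_two, word_combo):
    """Replace each pair of index strings with the words they refer to."""
    words_one = name_one.split()
    words_two = name_two.split()
    return [
        (words_one[int(index_one)], words_two[int(index_two)], score)
        for index_one, index_two, score in word_combo
    ]


# Values taken from attempt_one in the example output above.
pairs = describe_word_combo(
    "jean christians",
    "christian jean",
    [("0", "1", 100.0), ("1", "0", 100.0)],
)
print(pairs)
# [('jean', 'jean', 100.0), ('christians', 'christian', 100.0)]
```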

The algorithm finds all possible word pairs and chooses the set of pairs that results in the highest overall score for the comparison. match can be True even if one of the names is only one word long. It is important to note, though, that the requirements for match to be evaluated as True change depending on the number of words in the shorter name. For example, if both names have three or more words, the threshold for a good word pair is lower than if the shorter name had only two words. This is because there is a much lower chance of a false positive when more words are present that are decent matches. Initials are also taken into account.
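
The pairing step can be sketched as a brute-force search over every way of assigning the words of the shorter name to distinct words of the longer one. Everything here is an assumption made for illustration: best_word_combo and word_score are hypothetical names, and the score comes from difflib.SequenceMatcher in the standard library rather than from the package's own scoring, which also considers nicknames, initials, and pronunciation.

```python
from difflib import SequenceMatcher
from itertools import permutations


def word_score(word_one: str, word_two: str) -> float:
    """Stand-in similarity score from 0 to 100 based on difflib."""
    return round(100 * SequenceMatcher(None, word_one, word_two).ratio(), 1)


def best_word_combo(words_one, words_two):
    """Return the one-to-one word pairing with the highest total score."""
    swapped = len(words_one) > len(words_two)
    shorter, longer = (words_two, words_one) if swapped else (words_one, words_two)
    best_total, best_combo = -1.0, []
    # Try every assignment of the shorter name's words to distinct words
    # of the longer name, keeping the assignment with the best total.
    for chosen in permutations(range(len(longer)), len(shorter)):
        combo = [
            (i, j, word_score(shorter[i], longer[j]))
            for i, j in enumerate(chosen)
        ]
        total = sum(score for _, _, score in combo)
        if total > best_total:
            best_total, best_combo = total, combo
    if swapped:
        # Put the index for the first name back in the first position.
        best_combo = sorted((j, i, score) for i, j, score in best_combo)
    return [(str(i), str(j), score) for i, j, score in best_combo]


combo = best_word_combo(["jean", "christians"], ["christian", "jean"])
print([(index_one, index_two) for index_one, index_two, _ in combo])
# [('0', '1'), ('1', '0')]
```

Trying every assignment is only practical because names rarely have more than a handful of words; the number of assignments grows factorially with name length.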

Enjoy!
