This is a library used to make fuzzy name comparisons across census records.

Project description

Name Comparator

This package is used for fuzzy name comparisons.

Problem

With most historical records, there is the problem of messy data: nicknames are common, abbreviations are prevalent, mishearings are not rare, and misspellings are everywhere. Because of these and other factors, it is very difficult to automate name comparisons.

Solution

This package attempts to minimize that difficulty. It tokenizes names into their individual words so that they can be compared using simple algorithms. It also cleans common indexing errors, understands most common nicknames, and takes into account diverse spelling and pronunciation rules in order to better compare messy name data.

The Code

from NameComparator import NameComparator

results = NameComparator.compareTwoNames(nameA='Johnny Christians', nameB='Christian, Jean')

print(results)
# ResultsOfNameComparison(nameA='Johnny Christians', nameB='Christian, Jean', 
# tooShort=False, tooGeneric=False, match=True,
# attempt1WordCombo=[('0', '1', 100.0), ('1', '0', 100.0)], attempt1NameA='jean christians', attempt1NameB='christian jean', 
# attempt2WordCombo=None, attempt2NameA=None, attempt2NameB=None,
# attempt3WordCombo=None, attempt3NameA=None, attempt3NameB=None,
# attempt4WordCombo=None, attempt4NameA=None, attempt4NameB=None)

The code snippet above shows example usage of the package. The results variable is a namedtuple (ResultsOfNameComparison) with various attributes. The attributes relevant to most users are match, tooGeneric, and tooShort.

match identifies whether the comparison was a match.
tooGeneric identifies whether either of the names was too generic (e.g. 'john smith').
tooShort identifies whether either name had too few words (e.g. 'justin').
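As a rough illustration of how these three flags might be combined, here is a minimal sketch. It uses a stand-in namedtuple (HeadlineFlags, a hypothetical name) with only the three fields above, so it runs without the package installed. The classify helper and its "review" policy are assumptions for illustration, not part of NameComparator.

```python
from collections import namedtuple

# Stand-in for the real result object: only the three headline fields.
# (Hypothetical; the real ResultsOfNameComparison has more fields.)
HeadlineFlags = namedtuple("HeadlineFlags", ["match", "tooGeneric", "tooShort"])


def classify(result):
    """Map the three flags to one label (one possible policy, not the library's)."""
    if not result.match:
        return "no match"
    # A match on a very short or very generic name may deserve a second look.
    if result.tooShort or result.tooGeneric:
        return "review"
    return "match"


# With the real package this would be something like:
#   results = NameComparator.compareTwoNames(nameA=..., nameB=...)
#   print(classify(results))
print(classify(HeadlineFlags(match=True, tooGeneric=False, tooShort=False)))
print(classify(HeadlineFlags(match=True, tooGeneric=True, tooShort=False)))
```

Because classify only reads attributes by name, it should work on any object exposing match, tooGeneric, and tooShort.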

If you are interested in the debugging portion of the returned namedtuple, each attempt applies a different method to decide whether the names are a match. Two names might fail one or two methods but eventually be proven to be a match; for example, a simple spelling comparison would fail for 'Maurice' and 'Morris'. The first attempt is this simple spelling comparison after minimal cleaning. The second attempt is a heavier edit of the tokens in order to try to get a closer spelling comparison. The third attempt checks whether the modified tokens from attempt two are a match according to pronunciation. The fourth and last attempt checks whether the original tokens from attempt one are a match according to pronunciation. (Attempts 2 through 4 are skipped if the names have no chance of matching.)
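To make the spelling-then-pronunciation idea concrete, here is a small sketch for a single pair of words. It is not the package's implementation: it swaps in difflib's similarity ratio for the spelling check and classic Soundex for the pronunciation check, and the threshold of 80 is an arbitrary assumption. The function names (soundex, spelling_score, words_match) are hypothetical.

```python
from difflib import SequenceMatcher

# Classic Soundex digit groups (a stand-in for the library's pronunciation rules).
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character Soundex code of a word ('' for no letters)."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    code = word[0].upper()
    previous = _CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate repeated codes
        digit = _CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        previous = digit  # vowels reset this, so repeats across vowels count
    return (code + "000")[:4]


def spelling_score(a, b):
    """Similarity of two words on a 0-100 scale."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def words_match(a, b, threshold=80):
    """Try spelling first, then fall back to pronunciation."""
    if spelling_score(a, b) >= threshold:
        return "spelling"
    if soundex(a) == soundex(b):
        return "pronunciation"
    return None


print(words_match("Maurice", "Morris"))
print(words_match("Christian", "Christians"))
```

Here 'Maurice' and 'Morris' fail the spelling check but share a Soundex code, which mirrors the fallback from spelling to pronunciation described above.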

Each attempt's word combo is a list of tuples, for example:

[('0', '0', 80), ('1', '1', 100), ('3', '2', 100)]

Each tuple in the list represents the best pairing of one word in nameA with one word in nameB. Each tuple has three values: the zero-based index of the word in the first provided name (as a string), the zero-based index of the word in the second provided name (as a string), and a score of how well they matched (0-100). In the above example:

the 1st word in nameA matched with the 1st word in nameB, with a score of 80.
the 2nd word in nameA matched with the 2nd word in nameB, with a score of 100.
the 4th word in nameA matched with the 3rd word in nameB, with a score of 100.
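A word combo can be turned back into readable word pairs with a few lines of code. The sketch below assumes that the indices refer to the whitespace-separated words of the cleaned names for that attempt (attempt1NameA and attempt1NameB in the output shown earlier); that reading fits the sample output but is an assumption. The helper name describe_word_combo is hypothetical.

```python
def describe_word_combo(word_combo, name_a, name_b):
    """Resolve (indexA, indexB, score) tuples into (wordA, wordB, score)."""
    words_a = name_a.split()
    words_b = name_b.split()
    pairs = []
    for index_a, index_b, score in word_combo:
        # Indices arrive as strings, so convert before looking up the word.
        pairs.append((words_a[int(index_a)], words_b[int(index_b)], score))
    return pairs


# Values copied from the attempt 1 fields of the example output.
combo = [('0', '1', 100.0), ('1', '0', 100.0)]
for word_a, word_b, score in describe_word_combo(
        combo, 'jean christians', 'christian jean'):
    print(f"{word_a!r} paired with {word_b!r} (score {score})")
```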

The algorithm finds all possible word pairs and chooses the word pairs that result in the highest overall score for the comparison. match can be True even if one of the names is only one word long. It is important to note, though, that the requirements for match to be evaluated as True change depending on the number of words in the name with the fewest words. For example, if each name has three or more words, the threshold for a good word pair is lower than if the shorter name had only two words. This is because there is a much lower chance of a false negative when more words are present that are decent matches. Initials are also taken into account.
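The idea of choosing the word pairs with the highest overall score can be sketched with a brute-force search. This is only an illustration under stated assumptions: it scores words with difflib's ratio rather than the package's own scoring (so the scores will differ from the package's), tries every assignment of the shorter name's words to the longer name's words, and ignores thresholds, nicknames, and initials. The function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import permutations


def word_score(a, b):
    """Similarity of two words on a 0-100 scale (stand-in scorer)."""
    return round(100 * SequenceMatcher(None, a, b).ratio(), 1)


def best_word_pairs(name_a, name_b):
    """Return the (indexA, indexB, score) pairing with the highest total score."""
    words_a = name_a.lower().split()
    words_b = name_b.lower().split()
    # Assign each word of the shorter name to a distinct word of the longer one.
    swapped = len(words_a) > len(words_b)
    short, long_ = (words_b, words_a) if swapped else (words_a, words_b)
    best_total, best_pairs = -1.0, []
    for chosen in permutations(range(len(long_)), len(short)):
        pairs = [(i, j, word_score(short[i], long_[j]))
                 for i, j in enumerate(chosen)]
        total = sum(score for _, _, score in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    if swapped:
        # Put the nameA index first again and restore nameA order.
        best_pairs = sorted((j, i, score) for i, j, score in best_pairs)
    # Indices are returned as strings to mirror the word combo format.
    return [(str(i), str(j), score) for i, j, score in best_pairs]


print(best_word_pairs("jean christians", "christian jean"))
```

Trying every permutation is fine for names of a few words but grows factorially with name length; it was chosen here for clarity, not speed.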

Enjoy!

Project details

Release history

1.0.21 (Mar 25, 2026)
1.0.20 (Mar 17, 2025)
1.0.19 (Mar 7, 2025)
1.0.17b2 pre-release (Mar 7, 2025)
1.0.16 (Feb 25, 2025)
1.0.5 (Oct 1, 2024), this version
1.0.0 (Jun 28, 2024)

Download files

Source Distribution

namecomparator-1.0.5.tar.gz (426.9 kB view details)

Uploaded Oct 1, 2024 Source

Built Distribution

NameComparator-1.0.5-py3-none-any.whl (429.6 kB view details)

Uploaded Oct 1, 2024 Python 3

File details

Details for the file namecomparator-1.0.5.tar.gz.

File metadata

Download URL: namecomparator-1.0.5.tar.gz
Upload date: Oct 1, 2024
Size: 426.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.6

File hashes

Hashes for namecomparator-1.0.5.tar.gz
Algorithm	Hash digest
SHA256	`7ce3f2200029075a0534fa57a5275b21d96b712c2c3029df14e7213f2b1b30c5`
MD5	`6e84bbf9060eee3cd9e75af4058894eb`
BLAKE2b-256	`5f18a9f11c44defc82e971ebdcc58c1211574e7315a3379649421b76303451ca`

File details

Details for the file NameComparator-1.0.5-py3-none-any.whl.

File metadata

Download URL: NameComparator-1.0.5-py3-none-any.whl
Upload date: Oct 1, 2024
Size: 429.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.6

File hashes

Hashes for NameComparator-1.0.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bf9a7acfb5a2f69cef62c31524d7c8e0a37b720a7a8ccfed54734a01f3bdea39`
MD5	`2f40ead5fc31fa7cd61434f0bce912df`
BLAKE2b-256	`c08225a4d1def268ea9a7b9442d8df3067161eea0daf4277088e15c6d1338bf8`
