Skip to main content

A library for measuring similarity between strings

Project description

CompareStrings

CompareStrings accepts either two strings or two Pandas Series' containing strings, as inputs, and provides a simple way to tell how similar or dissimilar two strings are.

By default, the compare_strings function returns the Levenshtein Distance between the strings, divided by the length of the first string, where 0 represents absolute similiarity, higher values represent increasing dissimilarity.

Optional argument method allows selection of alternative methods of calculation, such as the absolute Levenshtein distance - method = lev_abs, or the cosine distance (not yet released).

CompareStrings by default stripes numeric and punctuation characters from the string before performing the calculation.

The optional email argument takes 1 or 2 as values, and indicates to the function that either string (or series) 1 or 2 contain an email address. When this argument is used, the input that contains an email address is split on the '@' and the email domain is discarded before the calculation is performed.

The precision argument is used to determine the number of decimals returns in the resulting float.

Installation

pip install CompareStrings

Usage

Strings:

compare_strings supports indivdual strings as inputs. Examples:

from CompareStrings import compare_strings

method='lev_abs'

# Levenshtein Distance

compare_strings('string one','string', method='lev_abs')

Out[1]: 4

There were 4 additions, deletions or substitutions required to change the first string into the second

method='lev_props'

# Levenshtein Distance as a proportion of the length of the first string

compare_strings('string one','string', method='lev_props')

Out[1]: 0.4

There were 4 additions, deletions or substitutions required to change the first string into the second string, and 10 characters in the first string.

Pandas Series:

compare_strings also accepts pandas series as inputs. It will return a new DataFrame containing the inputs and a new column with the output.

The email argument can be used to tell the function if one of the inputs contains an email address, and performs some preprocessing to remove the domain - for example:

Without email set:

email full_name levenshtein_proportions
6203 tom_johnson1@hotmail.com Tom Jonhson 0.46
8990 suzanne_stevenson54@hotmail.com Suzanne stevenson 0.43
6769 marie.eriksson99@hotmail.com Ann Eriksson 0.62
2552 elisabeth.henriksson8@hotmail.com Elisabeth Henriksson 0.38

With email = 1 set:

email full_name levenshtein_proportions
6203 tom_jonson1@hotmail.com Tom Jonson 0
8990 suzanne_stevenson54@hotmail.com Suzanne Stevenson 0
6769 marie.eriksson99@hotmail.com Ann Eriksson 0.29
2552 elisabeth.henriksson8@hotmail.com Elisabeth Henriksson 0

Passing 1 to the email argument tells the function to ignore the characters after and including the '@' in the first column when performing the calculation.

check_names:

The check_names argument is intended to be used in conjunction with the email argument. It adds another column to the returned DataFrame with a True or False value, indicating whether any part of the string was found in the big_names_list. For example, it may be useful to ignore the similarity score if the email address passed into the function does not contain anything recognised as a name.

Disclaimer the names list currently contains 7.6k first and surnames from a number of nationalities, but is in no way exhaustive. It also contains some names that are quite short, and may return false positives if those short strings are found in the inputs.

Coming soon:

  • Support for additional alternative measures of similarity/dissimilarity
  • Support for lists as inputs
  • Probably other stuff - want to help? See below

Contribution

This is my very first python package so contributions are very much welcome. Suggestions include:

  • Documentation incl. tidying up docstrings and comments
  • Additions to the big_names_list
  • Support for names in other languages
  • New similarity measures
  • Support or suggestions for other use cases

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

compare_strings-0.0.3-py3-none-any.whl (7.8 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page