Skip to main content

Methods to assess string similarity.

Project description

strcompare

A library of string similarity assessment functions.

Properties

Each string assessment score judges the similarity of two strings as a floating point number. Lower numbers indicate dissimilarity, and higher numbers indicate similarity. As such, 0.0 would indicate completely different strings, while 1.0 would indicate exactly equal strings.

Every string assessment score adheres to the following rules/properties, given comparison function $func$ and strings $x$ and $y$.

  • $0.0 <= func(x, y) <= 1.0$ for all valid $x$ and $y$
  • $func(x, y) = 1.0$ if $x = y$
  • $func(x, y) = func(y, x)$ for all valid $x$ and $y$
  • $func(x, y) = 0$ if $x$ and $y$ share no common characters.
    • As a corollary, $func(x, y) = 0$ if exactly one of $x$ and $y$ are empty.

Additionally, several different scoring methodologies come with "major" and "minor" variants. Minor variants additionally exhibit the following property:

  • $func(x, y) = 1$ if $x \subseteq y$, or $y \subseteq x$


Scoring Functions

Each scoring method derives it score by calculating some metric, then scaling that metric by the length of one or both of the strings. Note that "major" variants will scale the metric by the longer string's length, and "minor" variants by the shorter string's length.

Character Distribution

Metric Number of individually matching characters between the two strings

Scoring methods

cdist_score(x, y) Scales metric the the sum of both lengths

Example:
HELLO vs HELP

H E L O
1 1 2 1

H E L P
1 1 1 1

The number of matching characters is 3 (H, E, and L). Since they appear in both, the resulting metric is 2 * 3 = 6. 
As len(HELLO) = 5 and len(HELP) = 4, the scale factor here is 5 + 4 = 9.

cdist_score = 6 / 9 ~= 0.667

Longest Substring

Metric The length of the longest sequence of characters appearing in the same order in both strings.

Scoring methods

lss_major(x, y) Scales metric by the length of the longer string.
lss_minor(x, y) Scales metric by the length of the shorter string.

Example:
STRESSED | DESSERT
___ESSE_ | _ESSE__
--------------------
The scoring metric here is `4`, as the longest common substring is "ESSE"
lss_major = 4 / 8  = 0.5
lss_minor = 4 / 7 ~= 0.5714

Fragmented Substring

Metric The length of the longest sequence of characters appearing in the same relative order in both strings
Adjusted Metric - Minor Calculates a metric based on the length of the longer string, assessing "point" penalties for differences in distances between relative character matches between the two strings
Adjusted Metric - Major Calculates a metric based on the length of the shorter string, assessing "point" penalties for differences in distances between relative character matches between the two strings

Scoring methods

fss_major(x, y) Scales $metric$ by the length of the longer string
fss_minor(x, y) Scales $metric$ by the length of the shorter string
adjusted_fss_major(x, y) Scales major adjusted metric by the product of the lengths of the two strings
adjusted_fss_minor(x, y) Scales minor adjusted metric by the product of the lengths of the two strings

Example:
BRAKE | BERATE
BRA_E | B_RA_E
----------------
The scoring metric here is 4, as the longest fragmentd substring is "BRAE"
fss_major = 4 / 6 ~= 0.667
fss_minor = 4 / 5  = 0.8
----------------
The adjusted major metric here is 19, as characters BRAE match, but R is 1 character away from B in BRAKE, but 2 characters away from B in BERATE
Thus, adjusted major metric = 5 + (5 - 1) + 5 + 5 = 19  ( 5 is the length of the shorter string BRAKE )
adjusted_fss_major = 19 / (5 * 6) = 19 / 30 ~= 0.633

The adjusted minor metric here is 23, as character BRAE match, but R is 1 character away from B in BRAKE, but 2 characters away from B in BERATE
Thus, adjusted minor metric = 6 + (6 - 1) + 6 + 6 = 23  ( 6 is the length of the longer string BERATE )
adjusted_fss_minor = 23 / (5 * 6) = 23 / 30 ~= 0.767

Levenshtein

Metric The levenshtein distance between the two strings subtracted from the longer of the two strings (thus representing the number of already matching characers $i.e.$ non-edits needed).

  • The levenshtein distance is defined as the number of single character edits (removal, addition, replacement) needed to convert one string into the other.
Scoring methods

levenshtein_major(x, y) Scales $metric$ by the length of the longer string
levenshtein_minor(x, y) Scales $metric$ by the length of the shorter string

Example
GUMBO vs GAMBOL
_______________
GUMBO  | GAMBOL | 0 - start
GAMBO  | GAMBOL | 1 - modify U to A
GAMBOL | GAMBOL | 2 - add L to right side
------------------------------------------
The scoring metric here is 4
2 is the levenshtein distance, as two character edits will convert GAMBO to GAMBOL. The longer string is GAMBOL, whose length is 6
Thus, metric = 6 - 2 = 4
levenshtein_major = 4 / 6 ~= 0.667
levenshtein_minor = 4 / 5  = 0.8

Extra Utilities

A few extra utilities relating to the above scoring functions are also provided

substring_length(x, y) Returns the integer length of the longest common substring between strings $x$ and $y$
longest_substring(x, y) Returns a string equal to the longest common substring between strings $x$ and $y$, returning an empty string if there are no common substrings.
levenshtein_distance(x, y) Returns an integer representing the levenshtein distance (as defined above) between strings $x$ and $y$.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

strcompare-2.0.1-cp311-cp311-win_amd64.whl (13.6 kB view details)

Uploaded CPython 3.11Windows x86-64

strcompare-2.0.1-cp310-cp310-win_amd64.whl (14.7 kB view details)

Uploaded CPython 3.10Windows x86-64

strcompare-2.0.1-cp38-cp38-win_amd64.whl (13.6 kB view details)

Uploaded CPython 3.8Windows x86-64

File details

Details for the file strcompare-2.0.1-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: strcompare-2.0.1-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 13.6 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.1

File hashes

Hashes for strcompare-2.0.1-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 12e973f076b82b7f17b8dbae11546a261eafe7d35024c869ae417583aee788df
MD5 67a1da79a20de9bc105537dc3f22b8cf
BLAKE2b-256 545a7ea2b559d0035792d065f992b53cd75cfddf00b2bcefbe9f7c5a025321aa

See more details on using hashes here.

File details

Details for the file strcompare-2.0.1-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: strcompare-2.0.1-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 14.7 kB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.1

File hashes

Hashes for strcompare-2.0.1-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 c2764f40847e51e24d886e1b322da64f2d0e9eca5a597f1f3ac72f808dbd5f63
MD5 7cd0a7bbeed0352f94d66fa78c22e36b
BLAKE2b-256 d47fe01bd9274734067721eda8fcfc1081d87cc44415e618d9f4a8e2e691ae2d

See more details on using hashes here.

File details

Details for the file strcompare-2.0.1-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: strcompare-2.0.1-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 13.6 kB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.1

File hashes

Hashes for strcompare-2.0.1-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 473ffc7325f116daede9b34d823d02fc1362cf555c9532e6e883aa877543f245
MD5 30314a4cfb5cab1899723103908f62c7
BLAKE2b-256 7060b11084d07823dcb259b0720cfc6c479037c12facf8d4b83ad84d08d2220c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page