Methods to assess string similarity.
strcompare
A library of string similarity assessment functions.
Properties
Each string assessment score judges the similarity of two strings as a floating point number. Lower numbers indicate dissimilarity, and higher numbers indicate similarity. As such, 0.0 would indicate completely different strings, while 1.0 would indicate exactly equal strings.
Every string assessment score adheres to the following rules/properties, given comparison function $func$ and strings $x$ and $y$.
- $0.0 <= func(x, y) <= 1.0$ for all valid $x$ and $y$
- $func(x, y) = 1.0$ if $x = y$
- $func(x, y) = func(y, x)$ for all valid $x$ and $y$
- $func(x, y) = 0$ if $x$ and $y$ share no common characters.
- As a corollary, $func(x, y) = 0$ if exactly one of $x$ and $y$ is empty.
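The properties above can be checked mechanically for any scoring function. The following is a minimal sketch, not part of the library: `charset_score` is a hypothetical stand-in scorer (Jaccard overlap of character sets) used only so the example runs on its own, and `check_properties` is an illustrative helper name.

```python
from itertools import product


def charset_score(x: str, y: str) -> float:
    """Hypothetical stand-in scorer: Jaccard overlap of the two character sets."""
    if x == y:
        return 1.0
    a, b = set(x), set(y)
    # x != y means at least one string is non-empty, so the union is non-empty.
    return len(a & b) / len(a | b)


def check_properties(func, samples) -> bool:
    """Assert the listed properties for every ordered pair of sample strings."""
    for x, y in product(samples, repeat=2):
        score = func(x, y)
        assert 0.0 <= score <= 1.0          # bounded
        assert score == func(y, x)          # symmetric
        if x == y:
            assert score == 1.0             # identity
        elif not set(x) & set(y):
            assert score == 0.0             # no shared characters
    return True


check_properties(charset_score, ["", "a", "abc", "xyz", "STRESSED", "DESSERT"])
```

Any of the scoring functions described below could presumably be passed in place of `charset_score`, provided it accepts two strings and returns a float.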
Scoring Functions
cdist_score
Character Distribution Score. Generates a score by comparing the difference in the distribution of characters between the two strings.
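The description does not give the exact formula, so the following is only one plausible interpretation, not the library's implementation: count each character in both strings, sum the absolute differences of the counts, and normalise by the combined length. The name `cdist_sketch` and the normalisation are assumptions made for illustration.

```python
from collections import Counter


def cdist_sketch(x: str, y: str) -> float:
    """Assumed formula: 1 - (sum of per-character count differences) / (total length)."""
    if not x and not y:
        return 1.0  # two empty strings are equal
    cx, cy = Counter(x), Counter(y)
    # Counter returns 0 for characters it has not seen.
    diff = sum(abs(cx[ch] - cy[ch]) for ch in cx.keys() | cy.keys())
    return 1.0 - diff / (len(x) + len(y))


# STRESSED and DESSERT differ by a single extra "S", so the score is 14/15.
print(cdist_sketch("STRESSED", "DESSERT"))
```

Under this interpretation, anagrams score 1.0 because character order is ignored, and strings with no shared characters score 0.0.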
lcs_score
Longest Common Substring score. Returns the ratio of the length of the longest substring common to both strings to the length of the shorter string.
Example:
STRESSED | DESSERT
STR____D | D____RT => ESSE
---------------------------
Substring length = 4
Short string length = 7
lcs_score = 4/7 ~= 0.57
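The ratio in the example can be reproduced with a standard dynamic-programming pass over both strings. This is a minimal sketch of the technique, not the library's code; `lcs_score_sketch` is a hypothetical name.

```python
def lcs_score_sketch(x: str, y: str) -> float:
    """Length of the longest common substring divided by the shorter length."""
    if x == y:
        return 1.0
    if not x or not y:
        return 0.0
    best = 0
    # prev[j] = length of the common suffix ending at the previous char of x and y[j-1]
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0] * (len(y) + 1)
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best / min(len(x), len(y))


print(lcs_score_sketch("STRESSED", "DESSERT"))  # 4/7, matching the example above
```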
naive_lcs_score
Naive Longest Common Substring Score
Calculates the same score as lcs_score using a naive algorithm.
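What "naive" means here is not specified. One brute-force reading, sketched below under that assumption, tries every substring of the shorter string from longest to shortest and stops at the first one found in the longer string. `naive_lcs_score_sketch` is a hypothetical name.

```python
def naive_lcs_score_sketch(x: str, y: str) -> float:
    """Brute-force longest common substring ratio (assumed meaning of 'naive')."""
    if x == y:
        return 1.0
    short, long_ = sorted((x, y), key=len)
    if not short:
        return 0.0
    # Try candidate lengths from longest to shortest; the first hit is the maximum.
    for size in range(len(short), 0, -1):
        for start in range(len(short) - size + 1):
            if short[start:start + size] in long_:
                return size / len(short)
    return 0.0


print(naive_lcs_score_sketch("STRESSED", "DESSERT"))  # 4/7, same as the example above
```

This version is easier to verify by eye but does far more repeated work on long inputs, which is the usual trade-off against a dynamic-programming approach.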
fss_score
Fractured Substring Score
Assesses similarity by comparing groups of characters in the same relative order between the two strings. The greater the number of relative order character matches, the higher the score.
Example:
STRESSED | DESSERT
__RESSED | DE_SER_ => ST
___ESSED | DE_SE__ => R
_____S_D | D______ => ESE
_____S__ | _______ => D
naive_fss_score
Naive Fractured Substring Score
Performs the above assessment using a naive algorithm.
adjusted_fss_score
Adjusted Fractured Substring Score
Identifies "fractured substrings" between the two strings and applies a penalty when the matched characters sit at different offsets in each string.
In the example above, the substring ST would incur a penalty because S and T sit at indices 0 and 1 in the first string but at indices 2 and 6 in the second (an offset of 1 versus an offset of 4).
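The offsets quoted for ST can be reproduced with a few lines of Python. The sketch below only locates the characters and measures the gaps; how the library turns those gaps into a penalty is not stated, so no penalty formula is attempted. `in_order_indices` is a hypothetical helper.

```python
def in_order_indices(sub: str, text: str) -> list[int]:
    """Greedily locate each character of `sub` in `text`, scanning left to right."""
    indices, start = [], 0
    for ch in sub:
        pos = text.index(ch, start)  # raises ValueError if ch is missing
        indices.append(pos)
        start = pos + 1
    return indices


first = in_order_indices("ST", "STRESSED")   # S at 0, T at 1
second = in_order_indices("ST", "DESSERT")   # S at 2, T at 6

gap_first = first[1] - first[0]      # offset of 1
gap_second = second[1] - second[0]   # offset of 4
print(first, second, gap_first, gap_second)
```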
naive_adjusted_fss_score
Naive Adjusted Fractured Substring Score
Performs the above assessment using a naive algorithm.
levenshtein_score
Levenshtein Score
Assigns a score based on the Levenshtein distance between the two strings. The final score is calculated by comparing the calculated Levenshtein distance to the maximum possible Levenshtein distance for the given string lengths.
Let $m$ be the maximum possible Levenshtein distance and $s$ be the calculated distance. The final score is calculated as $\frac{m - s}{m}$.
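The formula can be sketched as follows. This is an illustration, not the library's code: it assumes $m$ is the length of the longer string (the upper bound on the Levenshtein distance between two strings) and treats two empty strings as equal to avoid dividing by zero. Both function names are hypothetical.

```python
def levenshtein_distance(x: str, y: str) -> int:
    """Classic two-row dynamic-programming edit distance."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_score_sketch(x: str, y: str) -> float:
    """(m - s) / m, with m assumed to be the length of the longer string."""
    m = max(len(x), len(y))
    if m == 0:
        return 1.0  # assumption: two empty strings are equal
    s = levenshtein_distance(x, y)
    return (m - s) / m


# "kitten" -> "sitting" takes 3 edits and m = 7, so the score is 4/7.
print(levenshtein_score_sketch("kitten", "sitting"))
```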