Skip to main content

Package designed to match strings by similarity

Project description

Static Badge GitHub License PyPI - Downloads Ruff

StringPairFinder is a Python package designed to simplify the process of finding similarities between strings.

drawing

Installation

pip install stringpairfinder

Computing String Similarity

import stringpairfinder as spf

spf.get_similarity("Munich", "Munchen")
>> 0.23809523809523808

Finding the Nearest String

spf.get_nearest_string(
    string="Naples",
    string_list=["Munchen", "Napoli", "Warszawa"]
)
>> 'Napoli'

Mapping Strings to Their Nearest Counterparts

spf.match_strings(
    source_strings=["Naples", "Munich", "Warsaw"],
    target_strings=["Munchen", "Napoli", "Warszawa"]
)
>>  {
        'Naples': 'Napoli',
        'Munich': 'Munchen',
        'Warsaw': 'Warszawa'
    }

Evaluation

Evaluation on a dataset of 1000 observations (see notebooks/StringPairFinder vs FuzzyWuzzy.ipynb) :

  • FuzzyWuzzy algorithm : 85.2 % of success rate
  • StringPairFinder algorithm : 94.0 % of success rate

What is the algorithm ?

The similarity search between two strings consists of a matrix comparison of each character in those strings. Let"s assume we want to compare the strings "Munich" and "Bayern Munich".

  1. The first step is to construct a table $T$ containing the first string in the column and the second in the row. The value of a cell is 1 if the character in the row is the same as the one in the column, and 0 otherwise.

drawing

  1. The second step aims at highlighting the fact that several characters correspond consecutively. Thus, for each row $i$ and column $j$, if cell $T[i-1, j-1] > 0$, then $T[i, j]$ is twice the value of $T[i-1, j-1]$.

drawing

  1. The third step is simply to calculate the similarity score, equal to the sum of all the cells in the $T$ divided by the size of the table.

$$ Score = \frac{\sum_{i=1}^{n_{row}}\sum_{j=1}^{n_{col}} T_{i,j}}{n_{row} * n_{col}} = \frac{1+1+2+4+8+16+32}{78} \approx 0.82 $$

In this example, we obtain a similarity score of 64.

To connect the peers two by two, StringPairFinder calculates the similarity score of all (list1, list2) combinations and returns the association between each character string in list 1 with the character string in list 2 with the highest similarity score.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

stringpairfinder-2.0.1.tar.gz (4.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

stringpairfinder-2.0.1-py3-none-any.whl (5.1 kB view details)

Uploaded Python 3

File details

Details for the file stringpairfinder-2.0.1.tar.gz.

File metadata

  • Download URL: stringpairfinder-2.0.1.tar.gz
  • Upload date:
  • Size: 4.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for stringpairfinder-2.0.1.tar.gz
Algorithm Hash digest
SHA256 7ef4b38dd427cbfb50ff9ef7404492c6fdde2599076b771016f399de76d1244e
MD5 0a1f0e277394788c29a56af40bafcfe9
BLAKE2b-256 70d29733f199ac1d4fc59c0918b591991b46e831a177b9c75621955d6df1c977

See more details on using hashes here.

File details

Details for the file stringpairfinder-2.0.1-py3-none-any.whl.

File metadata

File hashes

Hashes for stringpairfinder-2.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 6e33b00b366779e26cf68242530bb30659b3f5ec6d4f0f0f72ddf9a15b8e441b
MD5 3d808fa1620a6a6b0ca405bf6d6d4c4f
BLAKE2b-256 2bf16d75e3952d4d840fb10ce208bcb628856ccaa267e9bbb2cf77dedbf0c755

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page