stringpairfinder

Package designed to match strings by similarity

These details have not been verified by PyPI

Project links

Project description

Static Badge GitHub License PyPI - Downloads Ruff

StringPairFinder is a Python package designed to simplify the process of finding similarities between strings.

drawing

Installation

pip install stringpairfinder

Computing String Similarity

import stringpairfinder as spf

spf.get_similarity("Munich", "Munchen")

>> 0.23809523809523808

Finding the Nearest String

spf.get_nearest_string(
    string="Naples",
    string_list=["Munchen", "Napoli", "Warszawa"]
)

>> 'Napoli'

Mapping Strings to Their Nearest Counterparts

spf.match_strings(
    source_strings=["Naples", "Munich", "Warsaw"],
    target_strings=["Munchen", "Napoli", "Warszawa"]
)

>>  {
        'Naples': 'Napoli',
        'Munich': 'Munchen',
        'Warsaw': 'Warszawa'
    }

Evaluation

Evaluation on a dataset of 1000 observations (see notebooks/StringPairFinder vs FuzzyWuzzy.ipynb) :

FuzzyWuzzy algorithm : 85.2 % of success rate
StringPairFinder algorithm : 94.0 % of success rate

What is the algorithm ?

The similarity search between two strings consists of a matrix comparison of each character in those strings. Let"s assume we want to compare the strings "Munich" and "Bayern Munich".

The first step is to construct a table $T$ containing the first string in the column and the second in the row. The value of a cell is 1 if the character in the row is the same as the one in the column, and 0 otherwise.

drawing

The second step aims at highlighting the fact that several characters correspond consecutively. Thus, for each row $i$ and column $j$, if cell $T[i-1, j-1] > 0$, then $T[i, j]$ is twice the value of $T[i-1, j-1]$.

drawing

The third step is simply to calculate the similarity score, equal to the sum of all the cells in the $T$ divided by the size of the table.

$$ Score = \frac{\sum_{i=1}^{n_{row}}\sum_{j=1}^{n_{col}} T_{i,j}}{n_{row} * n_{col}} = \frac{1+1+2+4+8+16+32}{78} \approx 0.82 $$

In this example, we obtain a similarity score of 64.

To connect the peers two by two, StringPairFinder calculates the similarity score of all (list1, list2) combinations and returns the association between each character string in list 1 with the character string in list 2 with the highest similarity score.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

2.0.1

Jan 26, 2025

2.0.0

Jan 26, 2025

1.0.0

Feb 4, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

stringpairfinder-2.0.1.tar.gz (4.8 kB view details)

Uploaded Jan 26, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

stringpairfinder-2.0.1-py3-none-any.whl (5.1 kB view details)

Uploaded Jan 26, 2025 Python 3

File details

Details for the file stringpairfinder-2.0.1.tar.gz.

File metadata

Download URL: stringpairfinder-2.0.1.tar.gz
Upload date: Jan 26, 2025
Size: 4.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for stringpairfinder-2.0.1.tar.gz
Algorithm	Hash digest
SHA256	`7ef4b38dd427cbfb50ff9ef7404492c6fdde2599076b771016f399de76d1244e`
MD5	`0a1f0e277394788c29a56af40bafcfe9`
BLAKE2b-256	`70d29733f199ac1d4fc59c0918b591991b46e831a177b9c75621955d6df1c977`

See more details on using hashes here.

File details

Details for the file stringpairfinder-2.0.1-py3-none-any.whl.

File metadata

Download URL: stringpairfinder-2.0.1-py3-none-any.whl
Upload date: Jan 26, 2025
Size: 5.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for stringpairfinder-2.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6e33b00b366779e26cf68242530bb30659b3f5ec6d4f0f0f72ddf9a15b8e441b`
MD5	`3d808fa1620a6a6b0ca405bf6d6d4c4f`
BLAKE2b-256	`2bf16d75e3952d4d840fb10ce208bcb628856ccaa267e9bbb2cf77dedbf0c755`

See more details on using hashes here.

stringpairfinder 2.0.1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Project description

Installation

Computing String Similarity

Finding the Nearest String

Mapping Strings to Their Nearest Counterparts

Evaluation

What is the algorithm ?

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes