A library to match and compare strings.
Project description
stringmatch
stringmatch is a small, lightweight string matching library written in Python, based on the Levenshtein distance and the Levenshtein Python C Extension.
Inspired by libraries like seatgeek/thefuzz, which did not quite fit my needs. And so I am building this library for myself, primarily.
Table of Contents
- 🎯 Key Features
- 📋 Requirements
- ⚙️ Installation
- 🔨 Basic Usage
- 🛠️ Advanced Usage
- 🌟 Contributing
- 🔗 Links
Key Features
This library matches compares and strings to each other based mainly on, among others, the Levenshtein distance.
What makes stringmatch special compared to other libraries with similar functions:
- 💨 Lightweight, straightforward and easy to use
- ⚡ Extremely fast, up to 10x faster than comparable libraries
- 🧰 Allows for highly customisable searches
- 📚 Lots of utility functions to make your life easier
- 🌍 Handles special unicode characters, like emojis or characters from other languages, like ジャパニーズ
Requirements
- Python 3.9 or later.
Installation
Install the latest stable version with pip:
pip install stringmatch
Or install the newest version via git (Might be unstable or unfinished):
pip install -U git+https://github.com/atomflunder/stringmatch
Basic Usage
Matching
The match functions allow you to compare 2 strings and check if they are "similar enough" to each other, or get the best match(es) from a list of strings:
from stringmatch import Match
match = Match()
# Checks if the strings are similar.
match.match("stringmatch", "strngmach") # returns True
match.match("stringmatch", "something else") # returns False
# Returns the best match(es) found in the list.
searches = ["stringmat", "strinma", "strings", "mtch", "whatever", "s"]
match.get_best_match("stringmatch", searches) # returns "stringmat"
match.get_best_matches("stringmatch", searches) # returns ["stringmat", "strinma"]
Ratios
The "ratio of similarity" describes how similar the strings are to each other. It ranges from 100 being an exact match to 0 being something completely different.
You can get the ratio between strings like this:
from stringmatch import Ratio
ratio = Ratio()
# Getting the ratio between the two strings.
ratio.ratio("stringmatch", "stringmatch") # returns 100
ratio.ratio("stringmatch", "strngmach") # returns 90
ratio.ratio("stringmatch", "eh") # returns 15
# Getting the ratio between the first string and the list of strings at once.
searches = ["stringmatch", "strngmach", "eh"]
ratio.ratio_list("searchlib", searches) # returns [100, 90, 15]
Matching & Ratios
You can also get both the match and the ratio together in a tuple using these functions:
from stringmatch import Match
match = Match()
match.match_with_ratio("stringmatch", "strngmach") # returns (True, 90)
searches = ["test", "nope", "tset"]
match.get_best_match_with_ratio("test", searches) # returns ("test", 100)
match.get_best_matches_with_ratio("test", searches) # returns [("test", 100), ("tset", 75)]
Distances
Instead of the ratio, you can also get the Levenshtein distance between strings directly:
from stringmatch import Distance
distance = Distance()
distance.distance("kitten", "sitting") # returns 3
searches = ["sitting", "kitten"]
distance.distance_list("kitten", searches) # returns [3, 0]
Strings
This is primarily meant for internal usage, but you can also use this library to modify strings:
from stringmatch import Strings
strings = Strings()
strings.latinise("Héllö, world!") # returns "Hello, world!"
strings.remove_punctuation("wh'at;, ever") # returns "what ever"
strings.only_letters("Héllö, world!") # returns "Hll world"
strings.ignore_case("test test!", lower=False) # returns "TEST TEST!"
Advanced Usage
Keyword Arguments
You can pass in these optional arguments for the Match()
and Ratio()
functions to customise your search further:
score
Name | Description | Type | Default | Required? |
---|---|---|---|---|
score | The score cutoff for matching. Only available for Match() functions. |
Integer | 70 | No |
# Example:
match("stringmatch", "strngmach", score=95) # returns False
match("stringmatch", "strngmach", score=70) # returns True
limit
Name | Description | Type | Default | Required? |
---|---|---|---|---|
limit | The limit of how many matches to return. If you want to return every match set this to 0 or None. Only available for the get_best_matches() funcion. |
Integer | 5 | No |
# Example:
searches = ["limit 5", "limit 4", "limit 3", "limit 2", "limit 1", "limit 0"]
get_best_matches("limit 5", searches, limit=2) # returns ["limit 5", "limit 4"]
get_best_matches("limit 5", searches, limit=1) # returns ["limit 5"]
get_best_matches("limit 5", searches, limit=None) # returns ["limit 5", "limit 4", "limit 3", "limit 2", "limit 1", "limit 0"]
latinise
Name | Description | Type | Default | Required? |
---|---|---|---|---|
latinise | Replaces special unicode characters with their latin alphabet equivalents. Examples: Ǽ -> AE , ノース -> nosu |
Boolean | False | No |
# Example:
match("séärçh", "search", latinise=True) # returns True
match("séärçh", "search", latinise=False) # returns False
ignore_case
Name | Description | Type | Default | Required? |
---|---|---|---|---|
ignore_case | If you want to ignore case sensitivity while searching. | Boolean | False | No |
# Example:
match("test", "TEST", ignore_case=True) # returns True
match("test", "TEST", ignore_case=False) # returns False
remove_punctuation
Name | Description | Type | Default | Required? |
---|---|---|---|---|
remove_punctuation | Removes commonly used punctuation symbols from the strings, like .,;:!? and so on. |
Boolean | False | No |
# Example:
match("test,---....", "test", remove_punctuation=True) # returns True
match("test,---....", "test", remove_punctuation=False) # returns False
only_letters
Name | Description | Type | Default | Required? |
---|---|---|---|---|
only_letters | Removes every character that is not in the latin alphabet, a more extreme version of remove_punctuation . |
Boolean | False | No |
# Example:
match("»»ᅳtestᅳ►", "test", only_letters=True) # returns True
match("»»ᅳtestᅳ►", "test", only_letters=False) # returns False
Scoring Algorithms
You can pass in different scoring algorithms when initialising the Match()
and Ratio()
classes.
The available options are: LevenshteinScorer
, JaroScorer
, JaroWinklerScorer
.
Different algorithms will produce different results, obviously. By default set to LevenshteinScorer
.
from stringmatch import Match, LevenshteinScorer, JaroWinklerScorer
levenshtein_matcher = Match(scorer=LevenshteinScorer)
jaro_winkler_matcher = Match(scorer=JaroWinklerScorer)
levenshtein_matcher.match("test", "th test") # returns True (score = 73)
jaro_winkler_matcher.match("test", "th test") # returns False (score = 60)
Contributing
Contributions to this library are always appreciated! If you have any sort of feedback, or are interested in contributing, head on over to the Contributing Guidelines.
Additionally, if you like this library, leaving a star and spreading the word would be appreciated a lot!
Thanks in advance for taking the time to do so.
Links
Packages used:
Related packages:
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distributions
Built Distribution
Hashes for stringmatch-0.6.4-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 86e900b90e545bdb3de2a8a384fbdec4bb2d5c487e9bca664742c7c7d10054d6 |
|
MD5 | dbb6df414c5e9ccc5ed3090f58a6c66d |
|
BLAKE2b-256 | a0140306bd690b152f320da8160616b21bb78101319b94287d70e169773f2102 |