Humanized String Distance calculator
Humanized String Distance Algorithm
This project is created and maintained by Inventives, Inc., and is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.
About
The Humanized String Distance (HSD) algorithm compares two strings using a modified dynamic-time-warping approach. It accounts for how similar characters look in handwritten or extracted (OCR) text. For example, i looks similar to j, and a handwriting recognition system may mistake one for the other depending on the writing style. Other easily confused pairs include B and 8, S and 5, and the period (.) and comma (,). The HSD algorithm tolerates these confusions, which improves the matching of extracted text against a known set of values.
The HSD algorithm takes the extracted text and the expected (desired) text as arguments and returns a modified string distance score.
The expected string may include lowercase letters, digits, and the following special characters:
- Space ( )
- Period (.)
- Comma (,)
- Hyphen (-)
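To make the idea of character closeness concrete, here is a minimal sketch of a similarity-weighted edit distance. It is not the pyhsd implementation: it swaps in a plain Levenshtein-style dynamic program in place of the modified dynamic-time-warping approach, and the SIMILARITY table, its values, and the function names are assumptions made up for illustration.

```python
# Hypothetical similarity scores (0 to 1) for visually confusable pairs.
# These values are illustrative assumptions, not pyhsd's actual table.
SIMILARITY = {
    ("1", "l"): 0.8,
    ("0", "o"): 0.8,
    ("8", "b"): 0.6,
    ("5", "s"): 0.6,
}


def similarity(a, b):
    """Return 1.0 for identical characters, else the table score (default 0.0)."""
    if a == b:
        return 1.0
    return SIMILARITY.get((a, b), SIMILARITY.get((b, a), 0.0))


def weighted_distance(extracted, expected):
    """Edit distance where a substitution costs 1 - similarity(a, b)."""
    rows, cols = len(extracted) + 1, len(expected) + 1
    dp = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = float(i)  # delete every extracted character
    for j in range(1, cols):
        dp[0][j] = float(j)  # insert every expected character
    for i in range(1, rows):
        for j in range(1, cols):
            sub = 1.0 - similarity(extracted[i - 1], expected[j - 1])
            dp[i][j] = min(
                dp[i - 1][j] + 1.0,      # deletion
                dp[i][j - 1] + 1.0,      # insertion
                dp[i - 1][j - 1] + sub,  # substitution, cheaper for look-alikes
            )
    return dp[-1][-1]
```

Under these assumed scores, 'he110' lands much closer to 'hello' than to 'world', because each look-alike substitution costs only a fraction of a full edit.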
Installation
Install from the pip package manager.
pip install pyhsd
Or, install from source.
pip install setuptools pybind11 wheel
pip install -e .
Usage
import pyhsd
Calculate HSD distance between two strings
d = pyhsd.distance('he110', 'hello')
Find closest match from a list of options
numMatches = 1
matches = pyhsd.match('he110', [ 'hello', 'world' ], numMatches)
Each match is an instance of the Match class, which has two properties: value, the string that was matched, and distance, the HSD distance for that match.
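The shape of the match results can be illustrated with a small stand-in that does not depend on pyhsd. Everything here is hypothetical: the Match dataclass only mirrors the two properties described above, rank_matches is an invented helper, and a plain edit distance is used as a placeholder for the HSD score.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Match:
    # Hypothetical stand-in mirroring the two properties described above.
    value: str       # the option that was matched
    distance: float  # the distance score for that option


def edit_distance(a: str, b: str) -> float:
    """Plain Levenshtein distance, used only as a placeholder score."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return float(prev[-1])


def rank_matches(
    extracted: str,
    options: List[str],
    num_matches: int,
    distance_fn: Callable[[str, str], float] = edit_distance,
) -> List[Match]:
    """Score every option and return the num_matches closest ones."""
    scored = [Match(value=o, distance=distance_fn(extracted, o)) for o in options]
    scored.sort(key=lambda m: m.distance)
    return scored[:num_matches]


best = rank_matches("he110", ["hello", "world"], 1)
for m in best:
    print(m.value, m.distance)
```

With the placeholder distance, the three digit-for-letter substitutions each count as a full edit; a character-aware score such as HSD would be expected to rate the same pair as closer.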
Custom transitions file
To match with custom transitions, you may pass a CSV file which maps possible extracted characters (rows) to desired characters (columns). Each cell holds a score from 0 to 1 indicating how similar the row character is to the column character. For instance, q and v are rarely confused, so they have a low score (0), whereas b and h are easily confused, so they have a higher score (0.3). If the row and column characters are the same, the cell value is 1, representing an exact match.
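As a rough illustration of such a table, the sketch below parses a tiny four-character example with Python's standard csv module. The exact file layout pyhsd expects is not specified here, so the header row of desired characters, the empty top-left cell, and the load_transitions helper are all assumptions; only the scores for q/v, b/h, and identical characters come from the description above.

```python
import csv
import io

# Assumed layout: first row lists desired characters, first column lists
# extracted characters. This layout is a guess, not pyhsd's documented format.
SAMPLE_CSV = """\
,b,h,q,v
b,1,0.3,0,0
h,0.3,1,0,0
q,0,0,1,0
v,0,0,0,1
"""


def load_transitions(text):
    """Parse CSV text into {extracted_char: {desired_char: score}}."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    desired_chars = header[1:]  # skip the empty top-left cell
    table = {}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        extracted_char, scores = row[0], row[1:]
        table[extracted_char] = {
            d: float(s) for d, s in zip(desired_chars, scores)
        }
    return table


transitions = load_transitions(SAMPLE_CSV)
print(transitions["b"]["h"])  # similarity of extracted 'b' to desired 'h'
```

Loading the table into a nested dictionary makes it easy to sanity-check a custom file before use, for example confirming that every diagonal entry is 1.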