Humanized String Distance calculator

Project description

Humanized String Distance Algorithm

This project is created and maintained by Inventives, Inc., and is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.

About

The Humanized String Distance (HSD) algorithm compares two strings using a modified dynamic time warping approach. It accounts for how closely characters resemble each other in handwritten or OCR-extracted text. For example, the character i looks similar to j, and a handwriting recognition system may mistake one for the other depending on the writing style. Other easily confused pairs include B and 8, S and 5, and the period (.) and comma (,), among many more. The HSD algorithm tolerates these confusions, which improves string distance calculation when matching extracted text against a known set of values.
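To make the idea of character closeness concrete, here is a minimal sketch of a distance in which visually similar characters are cheap to substitute. It is not pyhsd's implementation: it uses a plain Levenshtein-style dynamic program rather than the modified dynamic time warping described above, and the names `SIMILARITY`, `similarity`, and `weighted_distance`, as well as every score in the table, are assumptions made up for illustration.

```python
# Illustrative only: NOT the pyhsd algorithm. A Levenshtein-style dynamic
# program in which substituting look-alike characters costs less than 1.
# All similarity scores below are invented for the example.
SIMILARITY = {
    frozenset("1l"): 0.8,
    frozenset("0o"): 0.8,
    frozenset("b8"): 0.7,
    frozenset("s5"): 0.7,
    frozenset("ij"): 0.6,
    frozenset(".,"): 0.6,
}


def similarity(a: str, b: str) -> float:
    """Return 1.0 for identical characters, a table score for look-alikes, else 0.0."""
    if a == b:
        return 1.0
    return SIMILARITY.get(frozenset((a, b)), 0.0)


def weighted_distance(extracted: str, expected: str) -> float:
    """Edit distance where a substitution costs 1 - similarity(a, b)."""
    # prev[j] holds the distance between the processed prefix of `extracted`
    # and the first j characters of `expected`.
    prev = [float(j) for j in range(len(expected) + 1)]
    for i, a in enumerate(extracted, start=1):
        curr = [float(i)]
        for j, b in enumerate(expected, start=1):
            substitute = prev[j - 1] + (1.0 - similarity(a, b))
            delete = prev[j] + 1.0
            insert = curr[j - 1] + 1.0
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]
```

Under these made-up scores, `weighted_distance('he110', 'hello')` is much smaller than `weighted_distance('he110', 'world')`, because 1 and l, and 0 and o, are treated as near matches.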

The HSD algorithm takes the extracted text and the expected (desired) text as arguments and returns a modified string distance score.

The expected/desired string may include lowercase letters, digits, and various special characters, including:

  • Space ( )
  • Period (.)
  • Comma (,)
  • Hyphen (-)
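A small pre-check along these lines can flag expected strings that contain characters outside the documented set before they are passed to the library. This is a sketch under an assumption: the list above says "including", so the real set of supported characters may be larger than the one hard-coded here. The names `ALLOWED` and `unsupported_characters` are hypothetical helpers and are not part of pyhsd.

```python
import string

# Assumption: only the characters named in the list above are allowed.
# The actual set supported by the library may be larger.
ALLOWED = set(string.ascii_lowercase + string.digits + " .,-")


def unsupported_characters(expected: str) -> list[str]:
    """Return the distinct characters outside ALLOWED, in order of first appearance."""
    seen = []
    for ch in expected:
        if ch not in ALLOWED and ch not in seen:
            seen.append(ch)
    return seen
```

An empty result means the string uses only the documented characters. Upper-case input can be lowered with `str.lower()` before checking.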

Installation

Install from the pip package manager.

pip install pyhsd

Or, install from source.

pip install setuptools pybind11 wheel
pip install -e .

Usage

import pyhsd

Calculate HSD distance between two strings

d = pyhsd.distance('he110', 'hello')

Find closest match from a list of options

numMatches = 1
matches = pyhsd.match('he110', [ 'hello', 'world' ], numMatches)

Each match is an instance of the Match class, which has two properties: value, the string that was matched, and distance, the HSD distance for that match.
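The following sketch shows one way to consume such results: pick the closest match and reject it if it is too far away. It assumes that a lower distance means a closer match and relies only on the value and distance properties described above. The function `accept_best` and the `max_distance` cut-off are hypothetical, and the `SimpleNamespace` objects are stand-ins for real Match instances so the example runs without pyhsd installed.

```python
from types import SimpleNamespace


def accept_best(matches, max_distance):
    """Return the value of the closest match, or None if none is close enough.

    Assumes each match exposes `.value` and `.distance`, and that a lower
    distance means a closer match.
    """
    if not matches:
        return None
    best = min(matches, key=lambda m: m.distance)
    return best.value if best.distance <= max_distance else None


# Stand-in objects with made-up distances, used here in place of real
# Match instances returned by pyhsd.match.
candidates = [
    SimpleNamespace(value="world", distance=4.2),
    SimpleNamespace(value="hello", distance=0.6),
]
```

With these stand-ins, `accept_best(candidates, 1.0)` returns `'hello'`, while a stricter cut-off of `0.5` returns `None`.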

Custom transitions file

To match with custom transitions, you may pass a CSV file which maps possible extracted characters (rows) to desired characters (columns). The cell at each row-column intersection holds a score from 0 to 1 indicating how similar the two characters are. For instance, q and v are rarely confused, so they have a low score (0), while b and h are confused more easily, giving them a higher score (0.3). If the row and column characters are the same, the cell value is 1, representing an exact match.
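A transitions table of this shape can be generated with Python's standard csv module, as sketched below. The exact layout pyhsd expects is not specified here, so the header row with an empty corner cell, the symmetric b/h entries, and the helper name `build_transitions_csv` are all assumptions. How the resulting file is passed to the library is likewise not shown.

```python
import csv
import io


def build_transitions_csv(chars, similar_pairs):
    """Build CSV text with extracted characters as rows and desired ones as columns.

    `similar_pairs` maps (extracted, desired) tuples to a score in [0, 1].
    Identical characters score 1.0; unlisted pairs score 0.0.
    Assumed layout: a header row whose first cell is empty.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([""] + list(chars))
    for extracted in chars:
        row = [extracted]
        for desired in chars:
            if extracted == desired:
                score = 1.0
            else:
                score = similar_pairs.get((extracted, desired), 0.0)
            row.append(score)
        writer.writerow(row)
    return buffer.getvalue()


# Scores taken from the description above: b/h are somewhat similar (0.3),
# q/v are not similar (0).
csv_text = build_transitions_csv("bhqv", {("b", "h"): 0.3, ("h", "b"): 0.3})
```

Writing `csv_text` to a file gives a four-by-four table with 1.0 on the diagonal.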

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyhsd-1.1.1.tar.gz (11.4 kB)

Uploaded Source

Built Distribution

pyhsd-1.1.1-cp39-cp39-win_amd64.whl (125.2 kB)

Uploaded CPython 3.9 Windows x86-64
