Align strings and compute evaluation metrics

Project description

Stringalign

Two cute caterpillars dancing under bunting with the letters 'STRING ALIGN'

A string comparison library that adheres to the quirks of Unicode.

What is this?

Stringalign is a library for comparing strings. At its base, Stringalign takes two strings, a reference string and a predicted string, and aligns them. Based on this alignment, we can then compute many interesting performance metrics, such as the edit distance (Levenshtein), error rates and much more.

For more information, see Stringalign's extensive documentation at http://stringalign.com/.

A little example

For example, if the reference string is 'Banana pancakes' and the predicted string is 'bananana pancake', then Stringalign will align them as follows:

 Reference: B--anana pancakes
Prediction: bananana pancake-

This alignment is stored as a collection of replacements, insertions, deletions and keeps that describe what we need to do with the predicted string to make it equal to the reference string. For the strings above, we get

[Replaced('B', 'b'), Inserted('a'), Inserted('n'), Kept('a'), Kept('n'), Kept('a'), Kept('n'), Kept('a'), Kept(' '), Kept('p'), Kept('a'), Kept('n'), Kept('c'), Kept('a'), Kept('k'), Kept('e'), Deleted('s')]

or, if we join consecutive Deleted, Inserted and Replaced operations:

[Replaced('B', 'ban'), Kept('anana pancake'), Deleted('s')]
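The joining step can be sketched in plain Python. This is only an illustration of the idea, not Stringalign's implementation: the operations are modelled as hypothetical `(kind, reference_text, predicted_text)` tuples rather than the library's own classes, and `join_operations` is a name invented for this sketch.

```python
from itertools import groupby

# Hypothetical stand-in for the alignment operations shown above:
# (kind, reference_text, predicted_text)
Operation = tuple[str, str, str]


def join_operations(operations: list[Operation]) -> list[Operation]:
    """Merge runs of kept operations and runs of non-kept operations."""
    joined = []
    for is_kept, group in groupby(operations, key=lambda op: op[0] == "kept"):
        run = list(group)
        reference_text = "".join(op[1] for op in run)
        predicted_text = "".join(op[2] for op in run)
        if is_kept:
            kind = "kept"
        elif reference_text and predicted_text:
            kind = "replaced"  # the run touches both strings
        elif predicted_text:
            kind = "inserted"  # only the prediction has extra text
        else:
            kind = "deleted"  # only the reference has extra text
        joined.append((kind, reference_text, predicted_text))
    return joined


operations = (
    [("replaced", "B", "b"), ("inserted", "", "a"), ("inserted", "", "n")]
    + [("kept", char, char) for char in "anana pancake"]
    + [("deleted", "s", "")]
)
print(join_operations(operations))
```

With the operations from the example, the sketch yields one replacement of 'B' by 'ban', one kept run 'anana pancake' and one deletion of 's', mirroring the joined list above.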

Based on these alignments, we can compute standard string comparison metrics such as the Levenshtein distance and character error rate. However, Stringalign also contains functions for more in-depth analysis of the types of errors that occur when you have a whole collection of reference and predicted strings. Examples include the most common character confusions, the letters most often omitted in the prediction and the letters most often incorrectly included in the prediction. See our gallery of examples for more information.
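To make the link between an alignment and the metrics concrete, here is a small sketch that counts operations directly from the two aligned rows shown earlier. It does not use Stringalign. `count_operations` is a hypothetical helper, and it assumes '-' marks a gap, so it would misread strings that contain a literal hyphen. The character error rate is taken here as the number of edits divided by the reference length, which is the usual definition.

```python
from collections import Counter


def count_operations(reference_row: str, prediction_row: str, gap: str = "-") -> Counter:
    """Count kept/replaced/inserted/deleted columns of two aligned rows."""
    if len(reference_row) != len(prediction_row):
        raise ValueError("aligned rows must have the same length")
    counts = Counter()
    for ref_char, pred_char in zip(reference_row, prediction_row):
        if ref_char == gap:
            counts["inserted"] += 1  # extra character in the prediction
        elif pred_char == gap:
            counts["deleted"] += 1  # reference character missing from the prediction
        elif ref_char == pred_char:
            counts["kept"] += 1
        else:
            counts["replaced"] += 1
    return counts


counts = count_operations("B--anana pancakes", "bananana pancake-")
edit_distance = counts["replaced"] + counts["inserted"] + counts["deleted"]
reference_length = counts["kept"] + counts["replaced"] + counts["deleted"]
character_error_rate = edit_distance / reference_length

print(counts)
print(edit_distance, round(character_error_rate, 3))  # 4 0.267
```

The edit count equals the Levenshtein distance only because the alignment above is an optimal one; a suboptimal alignment would give a larger number.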

What's the point?

Stringalign might sound similar to other Python libraries, like Jiwer and Levenshtein (which both use Rapidfuzz behind the scenes). However, what sets Stringalign apart is that it handles Unicode "correctly" and provides easy-to-use tools for in-depth analysis.

Take this example:

import Levenshtein

print(Levenshtein.distance('ñ', 'ñ'))
2

What happened here? The first 'ñ' consists of two code points: an n and a "put a tilde on the previous character" code point, while the second 'ñ' only consists of the single code point 'ñ'. Let's try it with Stringalign instead:

from stringalign.align import levenshtein_distance

print(levenshtein_distance('ñ', 'ñ'))
0

We see the expected behaviour. By default, Stringalign will normalize your text and segment it into Unicode extended grapheme clusters before aligning. An extended grapheme cluster is essentially what a computer should display as one letter, and while a grapheme cluster is usually just one code point, that is not always the case. Since tools like Jiwer and Levenshtein work directly on code points, they will miss these edge cases.
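The difference between the two spellings of 'ñ' can be inspected with the standard library alone. The sketch below writes both forms with explicit escapes and uses NFC as an illustrative normalization form; it is an assumption made for this example and not a statement about which form Stringalign applies by default.

```python
import unicodedata

decomposed = "n\u0303"  # 'n' followed by COMBINING TILDE
precomposed = "\u00f1"  # LATIN SMALL LETTER N WITH TILDE

print(len(decomposed), len(precomposed))  # 2 1
print(decomposed == precomposed)  # False

# After normalization both spellings are the same single code point.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == precomposed)  # True
```

A library that compares raw code points sees two different sequences here, which is why the distance reported earlier is 2 rather than 0.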

Emojis

Often, we don't need to worry about the distinction between code points and grapheme clusters. However, the moment emojis come into the picture, this changes. Many emojis are just a single code point. However, some are created by combining two or more other emojis, like '🏳️‍🌈', which is created by combining the white-flag emoji, '🏳', a variation selector '\uFE0F', a zero width joiner '\u200D' and the rainbow emoji, '🌈'. Because Stringalign takes care of aligning normalized grapheme clusters automatically, it will also work correctly with emojis:

import Levenshtein
from stringalign.align import levenshtein_distance

print("Levenshtein", Levenshtein.distance('🏳️‍🌈', '🌈'))
print("Stringalign", levenshtein_distance('🏳️‍🌈', '🌈'))
Levenshtein 3
Stringalign 1
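The code points behind the rainbow flag can be listed with the standard library. This sketch only inspects the strings and does not call Stringalign; the emoji are written with explicit escapes so the sequence is unambiguous.

```python
import unicodedata

# White flag + variation selector + zero width joiner + rainbow
rainbow_flag = "\U0001F3F3\uFE0F\u200D\U0001F308"
rainbow = "\U0001F308"

code_points = [f"U+{ord(char):04X}" for char in rainbow_flag]
print(len(rainbow_flag), code_points)

# Print the Unicode name of each code point, if the local database knows it.
for char in rainbow_flag:
    print(f"U+{ord(char):04X}", unicodedata.name(char, "<unnamed>"))

print(len(rainbow))  # 1
```

At the code-point level, turning the flag into the rainbow means deleting three code points, which matches the distance of 3. At the grapheme-cluster level, one cluster is replaced by another, giving a distance of 1.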

How does it work?

Stringalign works in a two-step process: first, the input strings are tokenized into normalised extended grapheme clusters; then, the resulting tokens are aligned using the Needleman-Wunsch algorithm. You can customise this if you want, e.g. by swapping the tokenizer for one that casefolds all extended grapheme clusters to get a case-insensitive alignment, or for one that splits the text into words to compute the word error rate.
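The effect of swapping the tokenizer can be sketched without Stringalign. Everything below is hypothetical and written for this illustration: the tokenizer functions, `edit_distance` and `error_rate` are not the library's API, NFC code points stand in for extended grapheme clusters (the standard library has no grapheme segmentation), and the word tokenizer simply splits on whitespace.

```python
import unicodedata
from collections.abc import Callable, Sequence


def char_tokens(text: str) -> list[str]:
    """NFC-normalise and split into code points (stand-in for grapheme clusters)."""
    return list(unicodedata.normalize("NFC", text))


def casefolded_char_tokens(text: str) -> list[str]:
    """Like char_tokens, but casefolded for case-insensitive comparison."""
    return [token.casefold() for token in char_tokens(text)]


def word_tokens(text: str) -> list[str]:
    """Split normalised text on whitespace."""
    return unicodedata.normalize("NFC", text).split()


def edit_distance(reference: Sequence[str], predicted: Sequence[str]) -> int:
    """Unit-cost edit distance between two token sequences (two-row variant)."""
    previous = list(range(len(predicted) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, pred_token in enumerate(predicted, start=1):
            substitution = previous[j - 1] + int(ref_token != pred_token)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def error_rate(reference: str, predicted: str, tokenizer: Callable[[str], list[str]]) -> float:
    """Edits per reference token for the chosen tokenizer."""
    reference_tokens = tokenizer(reference)
    if not reference_tokens:
        raise ValueError("reference must contain at least one token")
    return edit_distance(reference_tokens, tokenizer(predicted)) / len(reference_tokens)


reference, predicted = "Banana pancakes", "bananana pancake"
print(edit_distance(char_tokens(reference), char_tokens(predicted)))  # 4
print(edit_distance(casefolded_char_tokens(reference), casefolded_char_tokens(predicted)))  # 3
print(error_rate(reference, predicted, word_tokens))  # 1.0
```

The same alignment routine serves all three cases; only the tokenizer changes. Casefolding removes the 'B'/'b' mismatch, and word tokens turn the comparison into a word error rate in which both words count as wrong.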

We use an extension module written in Rust for two important parts of Stringalign: grouping Unicode code points into extended grapheme clusters (with the unicode_segmentation crate) and assembling the Needleman-Wunsch cost matrix (which has O(n²) time and memory complexity).
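To show where the quadratic cost comes from, here is a pure-Python sketch of a Needleman-Wunsch-style cost matrix with a traceback. It is not the Rust implementation: it assumes unit costs for replacements, insertions and deletions, and the function names and the `(kind, reference_token, predicted_token)` tuples are invented for the example. When several alignments share the minimal cost, the traceback may pick a different one than the alignment shown earlier.

```python
def cost_matrix(reference: list[str], predicted: list[str]) -> list[list[int]]:
    """Fill the (len(reference) + 1) x (len(predicted) + 1) matrix of minimal costs."""
    rows, cols = len(reference) + 1, len(predicted) + 1
    cost = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        cost[i][0] = i  # delete every reference token so far
    for j in range(cols):
        cost[0][j] = j  # insert every predicted token so far
    for i in range(1, rows):
        for j in range(1, cols):
            mismatch = int(reference[i - 1] != predicted[j - 1])
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch,  # keep or replace
                cost[i - 1][j] + 1,  # delete
                cost[i][j - 1] + 1,  # insert
            )
    return cost


def traceback(reference: list[str], predicted: list[str], cost: list[list[int]]) -> list[tuple[str, str, str]]:
    """Walk back from the bottom-right corner to recover one optimal alignment."""
    i, j = len(reference), len(predicted)
    operations = []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + int(reference[i - 1] != predicted[j - 1]):
            kind = "kept" if reference[i - 1] == predicted[j - 1] else "replaced"
            operations.append((kind, reference[i - 1], predicted[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            operations.append(("deleted", reference[i - 1], ""))
            i -= 1
        else:
            operations.append(("inserted", "", predicted[j - 1]))
            j -= 1
    operations.reverse()
    return operations


reference = list("Banana pancakes")
predicted = list("bananana pancake")
matrix = cost_matrix(reference, predicted)
alignment = traceback(reference, predicted, matrix)

print(matrix[-1][-1])  # 4
print(sum(len(row) for row in matrix))  # 272 cells for 15 and 16 tokens
print(alignment)
```

The matrix holds one cell per pair of prefixes, so both the time to fill it and the memory to store it grow with the product of the two sequence lengths.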

Citing Stringalign

If you use Stringalign for your research, then please cite this repo. For example:

Moe, Y. M., & Roald, M. (2024). Stringalign [Computer software]. https://github.com/yngvem/stringalign
