
Align strings and compute evaluation metrics


Stringalign

Two cute caterpillars dancing under bunting with the letters 'STRING ALIGN'

A string comparison library that adheres to the quirks of Unicode.

What is this?

Stringalign is a library for comparing strings. At its base, Stringalign takes two strings, a reference string and a predicted string, and aligns them. Based on this alignment, we can then compute many interesting performance metrics, such as the edit distance (Levenshtein), error rates and much more.

For more information, see Stringalign's extensive documentation at http://stringalign.com/.

A little example

For example, if the reference string is Banana pancakes and the predicted string is bananana pancake, then Stringalign will align them as

 Reference: B--anana pancakes
Prediction: bananana pancake-

This alignment is stored as a collection of replacements, insertions, deletions and keeps, that describe what we need to do with the predicted string to make it equal to the reference string. For the string above, we get

[Replaced('B', 'b'), Inserted('a'), Inserted('n'), Kept('a'), Kept('n'), Kept('a'), Kept('n'), Kept('a'), Kept(' '), Kept('p'), Kept('a'), Kept('n'), Kept('c'), Kept('a'), Kept('k'), Kept('e'), Deleted('s')]

or, if we join consecutive Deleted, Inserted and Replaced operations:

[Replaced('B', 'ban'), Kept('anana pancake'), Deleted('s')]

Based on these alignments, we can compute standard string comparison metrics such as the Levenshtein distance and character error rate. However, Stringalign also contains functions for more in-depth analysis of the types of errors that occur when you have a whole collection of reference and predicted strings. Examples include the most common character confusions, the letters most often omitted in the prediction, and the letters most often incorrectly included in the prediction. See our gallery of examples for more information.
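To make the link between the alignment and the metrics concrete, here is a minimal sketch in plain Python. It does not use Stringalign's own classes: each operation is a hypothetical (kind, reference_part, predicted_part) tuple that mirrors the banana example above, and the helper names are made up for illustration.

```python
from itertools import groupby

# Hypothetical stand-in for the alignment above, NOT Stringalign's own classes.
# Each operation is (kind, reference_part, predicted_part).
ops = (
    [("replaced", "B", "b"), ("inserted", "", "a"), ("inserted", "", "n")]
    + [("kept", ch, ch) for ch in "anana pancake"]
    + [("deleted", "s", "")]
)


def edit_distance(ops):
    """Count every single-character operation that is not a keep."""
    return sum(1 for kind, _, _ in ops if kind != "kept")


def character_error_rate(ops):
    """Edit distance divided by the reference length (assumes a non-empty reference)."""
    reference_length = sum(len(ref) for _, ref, _ in ops)
    return edit_distance(ops) / reference_length


def join_consecutive(ops):
    """Merge runs of kept operations and runs of non-kept operations."""
    joined = []
    for is_kept, group in groupby(ops, key=lambda op: op[0] == "kept"):
        group = list(group)
        ref = "".join(r for _, r, _ in group)
        pred = "".join(p for _, _, p in group)
        if is_kept:
            kind = "kept"
        elif not ref:
            kind = "inserted"
        elif not pred:
            kind = "deleted"
        else:
            kind = "replaced"
        joined.append((kind, ref, pred))
    return joined


print(edit_distance(ops))         # 4
print(character_error_rate(ops))  # 4 / 15
print(join_consecutive(ops))
```

The joined result has three entries that correspond to Replaced('B', 'ban'), Kept('anana pancake') and Deleted('s') in the example above.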

What's the point?

Stringalign might sound similar to other Python libraries, like Jiwer and Levenshtein (which both use Rapidfuzz behind the scenes). However, what sets Stringalign apart is that it handles Unicode "correctly" and provides easy-to-use tools for in-depth analysis.

Take this example:

import Levenshtein

print(Levenshtein.distance('ñ', 'ñ'))
2

What happened here? The first 'ñ' consists of two code points: an n and a "put a tilde on the previous character" code point, while the second 'ñ' only consists of the single code point 'ñ'. Let's try it with Stringalign instead:

from stringalign.align import levenshtein_distance

print(levenshtein_distance('ñ', 'ñ'))
0

We see the expected behaviour. By default, Stringalign will normalize your text and segment it into Unicode extended grapheme clusters before aligning. An extended grapheme cluster is essentially just what a computer should display as one letter, and while a grapheme cluster usually is just one code-point, it's not always that. Since tools like Jiwer and Levenshtein work directly on the code-points, they will miss these edge cases.
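The two spellings of 'ñ' can be inspected with nothing but the Python standard library. The sketch below uses NFC as the normalization form; the text above does not say which form Stringalign applies, so treat that choice as an assumption made for illustration.

```python
import unicodedata

decomposed = "n\u0303"   # 'n' followed by COMBINING TILDE
precomposed = "\u00f1"   # LATIN SMALL LETTER N WITH TILDE

# The two strings look the same but differ at the code point level.
print(len(decomposed), len(precomposed))  # 2 1
print(decomposed == precomposed)          # False

# After normalization (NFC is assumed here) they compare equal.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == precomposed)                 # True
```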

Emojis

Often, we don't need to worry about the difference between code points and grapheme clusters. However, the moment emojis come into the picture, this changes. Many emojis are just a single code point. However, some are created by combining two or more other emojis, like '🏳️‍🌈', which is created by combining the white flag emoji, '🏳', a variation selector, '\uFE0F', a zero-width joiner, '\u200D', and the rainbow emoji, '🌈'. Because Stringalign takes care of aligning normalized grapheme clusters automatically, it also works correctly with emojis:

import Levenshtein
from stringalign.align import levenshtein_distance

print("Levenshtein", Levenshtein.distance('🏳️‍🌈', '🌈'))
print("Stringalign", levenshtein_distance('🏳️‍🌈', '🌈'))
Levenshtein 3
Stringalign 1
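The distance of 3 reported by a code point based comparison follows directly from how the flag is encoded. A short standard-library sketch, with the emoji written as escapes so that the individual code points are visible:

```python
# The rainbow flag emoji, spelled out as its four code points.
flag = "\U0001F3F3\uFE0F\u200D\U0001F308"

code_points = [f"U+{ord(ch):04X}" for ch in flag]
print(code_points)  # ['U+1F3F3', 'U+FE0F', 'U+200D', 'U+1F308']
print(len(flag))    # 4
```

Turning this four-code-point string into the single code point '🌈' takes three deletions, whereas at the grapheme cluster level it is one flag replaced by one rainbow.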

How does it work?

Stringalign works in a two-step process: first, the input strings are tokenized into normalised extended grapheme clusters, and then the tokens are aligned using the Needleman-Wunsch algorithm. You can customise this if you want, e.g. by swapping the tokenizer for one that casefolds all extended grapheme clusters to get a case-insensitive alignment, or for one that splits the text into words, e.g. to compute the word error rate.

We use an extension module written in Rust for two important parts of Stringalign: grouping Unicode code points into extended grapheme clusters (with the unicode_segmentation crate) and assembling the Needleman-Wunsch cost matrix (which has O(n²) time and memory complexity).
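For intuition about the quadratic cost, here is a rough pure-Python sketch of a unit-cost alignment cost matrix. It is not Stringalign's Rust implementation: the function names are hypothetical, every edit costs 1, and the tokens are whatever sequence you pass in (plain Python strings are compared by code point, not by grapheme cluster).

```python
def cost_matrix(reference, predicted):
    """Fill an (n + 1) x (m + 1) matrix of minimal edit costs between prefixes."""
    n, m = len(reference), len(predicted)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i  # delete i reference tokens
    for j in range(1, m + 1):
        cost[0][j] = j  # insert j predicted tokens
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = reference[i - 1] != predicted[j - 1]
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch,  # keep or replace
                cost[i - 1][j] + 1,             # delete
                cost[i][j - 1] + 1,             # insert
            )
    return cost


def distance(reference, predicted):
    """The bottom-right cell holds the cost of aligning the full sequences."""
    return cost_matrix(reference, predicted)[-1][-1]


# Character-level tokens (code points in this sketch).
print(distance("kitten", "sitting"))  # 3

# Word-level tokens, as one would use for a word error rate.
print(distance("the cat sat".split(), "the cat sat down".split()))  # 1

# Casefolding the input first gives a case-insensitive comparison.
print(distance("Banana".casefold(), "banana".casefold()))  # 0
```

Because the matrix has one cell per pair of prefixes, both the time and the memory grow with the product of the two sequence lengths, which is where the O(n²) figure comes from.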

Citing Stringalign

If you use Stringalign for your research, then please cite this repo. For example:

Moe, Y. M., & Roald, M. (2024). Stringalign [Computer software]. https://github.com/yngvem/stringalign

