Align strings and compute evaluation metrics

Stringalign

Two cute caterpillars dancing under bunting with the letters 'STRING ALIGN'

A string comparison library that adheres to the quirks of Unicode.

What is this?

Stringalign is a library for comparing strings. At its base, Stringalign takes two strings, a reference string and a predicted string, and aligns them. Based on this alignment, we can then compute many interesting performance metrics, such as the edit distance (Levenshtein), error rates and much more.

For more information, see Stringalign's extensive documentation on http://stringalign.com/.

A little example

For example, if the reference string is Banana pancakes and the predicted string is bananana pancake, then Stringalign will align them as follows:

 Reference: B--anana pancakes
Prediction: bananana pancake-

This alignment is stored as a collection of replacements, insertions, deletions and keeps that describe what we need to do with the predicted string to make it equal to the reference string. For the strings above, we get

[Replaced('B', 'b'), Inserted('a'), Inserted('n'), Kept('a'), Kept('n'), Kept('a'), Kept('n'), Kept('a'), Kept(' '), Kept('p'), Kept('a'), Kept('n'), Kept('c'), Kept('a'), Kept('k'), Kept('e'), Deleted('s')]

or, if we join consecutive Deleted, Inserted and Replaced operations:

[Replaced('B', 'ban'), Kept('anana pancake'), Deleted('s')]
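The joining step can be sketched in plain Python. This is only an illustration: the `(kind, reference_part, predicted_part)` tuples and the `merge_operations` helper are hypothetical stand-ins, not Stringalign's own classes or API.

```python
from itertools import groupby

# Hypothetical stand-in for the operations listed above:
# (kind, reference_part, predicted_part)
ops = [
    ("replaced", "B", "b"),
    ("inserted", "", "a"),
    ("inserted", "", "n"),
    *[("kept", ch, ch) for ch in "anana pancake"],
    ("deleted", "s", ""),
]


def merge_operations(operations):
    """Join runs of kept operations and runs of non-kept operations."""
    merged = []
    for is_kept, run in groupby(operations, key=lambda op: op[0] == "kept"):
        run = list(run)
        ref = "".join(op[1] for op in run)
        pred = "".join(op[2] for op in run)
        if is_kept:
            kind = "kept"
        elif ref and pred:
            kind = "replaced"  # both sides contribute text
        elif pred:
            kind = "inserted"  # text only in the prediction
        else:
            kind = "deleted"   # text only in the reference
        merged.append((kind, ref, pred))
    return merged


merged = merge_operations(ops)
print(merged)
# [('replaced', 'B', 'ban'), ('kept', 'anana pancake', 'anana pancake'), ('deleted', 's', '')]
```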

Based on these alignments, we can compute standard string comparison metrics such as the Levenshtein distance and character error rate. However, Stringalign also contains functions for more in-depth analysis of the types of errors that occur when you have a whole collection of reference and predicted strings. Examples include the most common character confusions, the letters most often omitted in the prediction and the letters most often incorrectly included in the prediction. See our gallery of examples for more information.
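To make the link between an alignment and these metrics concrete, here is a minimal sketch that derives them from the operations of the banana example. The tuple representation is a hypothetical stand-in for Stringalign's operation objects, and the character error rate is assumed to be the usual "number of edits divided by reference length"; check the documentation for the exact definitions Stringalign uses.

```python
from collections import Counter

reference = "Banana pancakes"

# Hypothetical stand-in for the alignment: (kind, reference_part, predicted_part)
ops = [
    ("replaced", "B", "b"),
    ("inserted", "", "a"),
    ("inserted", "", "n"),
    *[("kept", ch, ch) for ch in "anana pancake"],
    ("deleted", "s", ""),
]

# With unit costs and an optimal alignment, the number of non-kept
# operations equals the Levenshtein distance.
edit_distance = sum(1 for kind, _, _ in ops if kind != "kept")

# The reference is plain ASCII, so len() counts one unit per character here.
cer = edit_distance / len(reference)

# Simple error statistics of the kind described above.
confusions = Counter((ref, pred) for kind, ref, pred in ops if kind == "replaced")
omitted = Counter(ref for kind, ref, _ in ops if kind == "deleted")
extra = Counter(pred for kind, _, pred in ops if kind == "inserted")

print(edit_distance)       # 4
print(f"{cer:.3f}")        # 0.267
print(confusions.most_common(1))  # [(('B', 'b'), 1)]
```

Summing such counters over many reference/prediction pairs gives the collection-level statistics.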

What's the point?

Stringalign might sound similar to other Python libraries, like Jiwer and Levenshtein (which both use Rapidfuzz behind the scenes). However, what sets Stringalign apart is that it handles Unicode "correctly" and provides easy-to-use tools for in-depth analysis.

Take this example:

import Levenshtein

print(Levenshtein.distance('ñ', 'ñ'))
2

What happened here? The first 'ñ' consists of two code points: an n and a "put a tilde on the previous character" code point, while the second 'ñ' only consists of the single code point 'ñ'. Let's try it with Stringalign instead:

from stringalign.align import levenshtein_distance

print(levenshtein_distance('ñ', 'ñ'))
0

We see the expected behaviour. By default, Stringalign will normalize your text and segment it into Unicode extended grapheme clusters before aligning. An extended grapheme cluster is essentially what a computer should display as one letter, and while a grapheme cluster is usually just one code point, this is not always the case. Since tools like Jiwer and Levenshtein work directly on the code points, they will miss these edge cases.
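The two spellings of 'ñ' are hard to tell apart on screen, so it can help to write them with explicit escapes. The sketch below uses only the standard library and assumes NFC as the normalization form for illustration; it does not call Stringalign.

```python
import unicodedata

decomposed = "n\u0303"   # 'n' followed by U+0303 COMBINING TILDE
precomposed = "\u00f1"   # U+00F1 LATIN SMALL LETTER N WITH TILDE

print(len(decomposed), len(precomposed))  # 2 1
print(decomposed == precomposed)          # False

# After normalization, both spellings are the same single code point.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == precomposed)                 # True

print([unicodedata.name(ch) for ch in decomposed])
# ['LATIN SMALL LETTER N', 'COMBINING TILDE']
```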

Emojis

Often, we don't need to worry about the difference between code points and grapheme clusters. However, the moment emojis come into the picture, this changes. Many emojis are just a single code point. However, some are created by combining two or more other emojis, like '🏳️‍🌈', which is created by combining the white-flag emoji, '🏳', a variation selector, '\uFE0F', a zero-width joiner, '\u200D', and the rainbow emoji, '🌈'. Because Stringalign takes care of aligning normalized grapheme clusters automatically, it will also work correctly with emojis:

import Levenshtein
from stringalign.align import levenshtein_distance

print("Levenshtein", Levenshtein.distance('🏳️‍🌈', '🌈'))
print("Stringalign", levenshtein_distance('🏳️‍🌈', '🌈'))
Levenshtein 3
Stringalign 1
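The code-point distance of 3 follows directly from how the rainbow flag is encoded. A small standard-library sketch (no Stringalign calls) that spells out the sequence:

```python
import unicodedata

# White flag + variation selector + zero-width joiner + rainbow
rainbow_flag = "\U0001F3F3\uFE0F\u200D\U0001F308"
rainbow = "\U0001F308"

for ch in rainbow_flag:
    # Fall back to a placeholder if this Python's Unicode database lacks a name.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))

print(len(rainbow_flag), len(rainbow))  # 4 1

# The flag ends with the rainbow code point, so a code-point based edit
# distance only has to remove the three code points in front of it.
print(rainbow_flag.endswith(rainbow))   # True
```

On grapheme clusters, each string is a single unit, and the two units differ, which gives the distance of 1 reported by Stringalign.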

How does it work?

Stringalign works in a two-step process: first, the input strings are tokenized into normalised extended grapheme clusters, and then the tokens are aligned using the Needleman-Wunsch algorithm. You can customise this if you want, e.g. by swapping the tokenizer for one that casefolds all extended grapheme clusters to get a case-insensitive alignment, or for one that splits the text into words, e.g. to compute the word error rate.
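The effect of changing the tokenizer can be illustrated without Stringalign. In the sketch below, the three tokenizer functions and the `levenshtein` helper are hypothetical, simplified stand-ins: they split on code points and whitespace rather than on normalised extended grapheme clusters, and they are not Stringalign's tokenizer API.

```python
def levenshtein(reference_tokens, predicted_tokens):
    """Unit-cost edit distance between two token sequences (two-row DP)."""
    previous = list(range(len(predicted_tokens) + 1))
    for i, ref_tok in enumerate(reference_tokens, start=1):
        current = [i]
        for j, pred_tok in enumerate(predicted_tokens, start=1):
            current.append(min(
                previous[j] + 1,                          # token missing from prediction
                current[j - 1] + 1,                       # extra token in prediction
                previous[j - 1] + (ref_tok != pred_tok),  # match or substitution
            ))
        previous = current
    return previous[-1]


def char_tokens(text):
    return list(text)  # code points; equal to grapheme clusters for ASCII


def casefold_tokens(text):
    return [ch.casefold() for ch in text]


def word_tokens(text):
    return text.split()


reference = "Banana pancakes"
predicted = "bananana pancake"

char_distance = levenshtein(char_tokens(reference), char_tokens(predicted))
casefold_distance = levenshtein(casefold_tokens(reference), casefold_tokens(predicted))
word_errors = levenshtein(word_tokens(reference), word_tokens(predicted))
wer = word_errors / len(word_tokens(reference))

print(char_distance)      # 4
print(casefold_distance)  # 3, since 'B' and 'b' now match
print(wer)                # 1.0, both words differ
```

The alignment routine stays the same in all three cases; only the definition of a token changes.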

We use an extension module written in Rust for two important parts of Stringalign: grouping Unicode code points into extended grapheme clusters (with the unicode_segmentation crate) and assembling the Needleman-Wunsch cost matrix (which has O(n²) time and memory complexity).
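To show where the quadratic cost comes from, here is a minimal pure-Python sketch of a Needleman-Wunsch style alignment with unit costs. It is an illustration under stated assumptions, not Stringalign's Rust implementation: the cost values, the tie-breaking order in the traceback and the tuple output format are all choices made for this example.

```python
def align(reference, predicted):
    """Return (distance, operations) for two token sequences."""
    n, m = len(reference), len(predicted)

    # cost[i][j] is the cheapest way to align reference[:i] with predicted[:j].
    # The matrix has (n + 1) * (m + 1) entries, hence the quadratic time and memory.
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + (reference[i - 1] != predicted[j - 1]),
                cost[i - 1][j] + 1,
                cost[i][j - 1] + 1,
            )

    # Walk back from the bottom-right corner to recover one optimal alignment.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        ref_tok = reference[i - 1] if i > 0 else None
        pred_tok = predicted[j - 1] if j > 0 else None
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref_tok != pred_tok):
            kind = "kept" if ref_tok == pred_tok else "replaced"
            ops.append((kind, ref_tok, pred_tok))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("deleted", ref_tok, ""))   # in reference, not in prediction
            i -= 1
        else:
            ops.append(("inserted", "", pred_tok))  # in prediction, not in reference
            j -= 1
    ops.reverse()
    return cost[n][m], ops


distance, operations = align("Banana pancakes", "bananana pancake")
print(distance)  # 4
```

Several alignments can share the same minimal cost, so the exact operations returned depend on the tie-breaking order and may differ from the alignment shown earlier.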

Citing Stringalign

If you use Stringalign for your research, then please cite this repo. For example:

Moe, Y. M., & Roald, M. (2024). Stringalign [Computer software]. https://github.com/yngvem/stringalign

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions

stringalign-0.1.2-cp311-abi3-win_amd64.whl (378.0 kB): CPython 3.11+, Windows x86-64
stringalign-0.1.2-cp311-abi3-win32.whl (371.3 kB): CPython 3.11+, Windows x86
stringalign-0.1.2-cp311-abi3-musllinux_1_2_x86_64.whl (731.3 kB): CPython 3.11+, musllinux: musl 1.2+ x86-64
stringalign-0.1.2-cp311-abi3-musllinux_1_2_i686.whl (766.4 kB): CPython 3.11+, musllinux: musl 1.2+ i686
stringalign-0.1.2-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (532.3 kB): CPython 3.11+, manylinux: glibc 2.17+ x86-64
stringalign-0.1.2-cp311-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl (542.3 kB): CPython 3.11+, manylinux: glibc 2.17+ s390x
stringalign-0.1.2-cp311-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl (656.7 kB): CPython 3.11+, manylinux: glibc 2.17+ ppc64le
stringalign-0.1.2-cp311-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl (530.9 kB): CPython 3.11+, manylinux: glibc 2.17+ ARMv7l
stringalign-0.1.2-cp311-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (520.1 kB): CPython 3.11+, manylinux: glibc 2.17+ ARM64
stringalign-0.1.2-cp311-abi3-manylinux_2_5_i686.manylinux1_i686.whl (548.0 kB): CPython 3.11+, manylinux: glibc 2.5+ i686
stringalign-0.1.2-cp311-abi3-macosx_10_12_x86_64.whl (493.7 kB): CPython 3.11+, macOS 10.12+ x86-64
stringalign-0.1.2-cp311-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (774.7 kB): CPython 3.11+, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64
