Align strings and compute evaluation metrics

Stringalign (experimental)

Two cute caterpillars dancing under bunting with the letters 'STRING ALIGN'

A string comparison library that adheres to the quirks of Unicode.

What is this?

Stringalign is a library for comparing strings. At its base, Stringalign takes two strings, a reference string and a predicted string, and aligns them. Based on this alignment, we can then compute many interesting performance metrics, such as the edit distance (Levenshtein), error rates and much more.

A little example

For example, if the reference string is Banana pancakes and the predicted string is bananana pancake, then Stringalign will align them as follows:

B--anana pancakes
bananana pancake-

This alignment is stored as a collection of replacement, insertion, deletion and keep blocks that describe what we need to do with the predicted string to make it equal to the reference string. For the strings above, we get

[Replace('b', 'B'), Delete('a'), Delete('n'), Keep('a'), Keep('n'), Keep('a'), Keep('n'), Keep('a'), Keep(' '), Keep('p'), Keep('a'), Keep('n'), Keep('c'), Keep('a'), Keep('k'), Keep('e'), Insert('s')]

or, if we join the consecutive Delete, Insert and Replace blocks (and the consecutive Keep blocks):

[Replace('ban', 'B'), Keep('anana pancake'), Insert('s')]
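
The joining step can be sketched in plain Python. This is only an illustration: the Keep, Replace, Delete and Insert classes below are hypothetical stand-ins written for this example (not Stringalign's own classes), and merge_blocks is an assumed helper name.

```python
from dataclasses import dataclass
from itertools import groupby


# Hypothetical stand-ins for the alignment blocks shown above.
# They are NOT Stringalign's own classes.
@dataclass(frozen=True)
class Keep:
    substring: str


@dataclass(frozen=True)
class Replace:
    substring: str    # text found in the predicted string
    replacement: str  # text it should become (from the reference)


@dataclass(frozen=True)
class Delete:
    substring: str


@dataclass(frozen=True)
class Insert:
    substring: str


def merge_blocks(blocks):
    """Join each run of Keep blocks and each run of non-Keep blocks."""
    merged = []
    for is_keep, run in groupby(blocks, key=lambda b: isinstance(b, Keep)):
        run = list(run)
        if is_keep:
            merged.append(Keep("".join(b.substring for b in run)))
            continue
        # Text to remove from the prediction, and text to put in its place.
        removed = "".join(b.substring for b in run if not isinstance(b, Insert))
        added = "".join(
            b.replacement if isinstance(b, Replace) else b.substring
            for b in run
            if not isinstance(b, Delete)
        )
        if not added:
            merged.append(Delete(removed))
        elif not removed:
            merged.append(Insert(added))
        else:
            merged.append(Replace(removed, added))
    return merged


blocks = [
    Replace("b", "B"),
    Delete("a"),
    Delete("n"),
    *(Keep(ch) for ch in "anana pancake"),
    Insert("s"),
]
merged = merge_blocks(blocks)
print(merged)
```

With the blocks from the example, this yields a Replace of 'ban' with 'B', a Keep of 'anana pancake' and an Insert of 's'.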

Based on these alignments, we can compute standard string comparison metrics such as the Levenshtein distance and character error rate. However, Stringalign also contains functions for more in-depth analysis of the types of errors that occur when you have a whole collection of reference and predicted strings. Examples include the most common character confusions, the letters most often omitted in the prediction, and the letters most often incorrectly included in the prediction.
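
To make the link between an alignment and these metrics concrete, here is a minimal sketch. It assumes a simplified, hypothetical representation of the alignment as (kind, predicted_text, reference_text) tuples, and it uses the conventional definition of the character error rate (number of edits divided by the reference length). The function names are made up for this example and are not Stringalign's API.

```python
from collections import Counter

# Hypothetical, simplified alignment: (kind, predicted_text, reference_text).
# It mirrors the unmerged "Banana pancakes" example above.
alignment = (
    [("replace", "b", "B"), ("delete", "a", ""), ("delete", "n", "")]
    + [("keep", ch, ch) for ch in "anana pancake"]
    + [("insert", "", "s")]
)


def edit_distance(alignment):
    """Every block that is not a keep counts as one edit."""
    return sum(kind != "keep" for kind, _, _ in alignment)


def character_error_rate(alignment):
    """Edits divided by the reference length (all blocks except deletions)."""
    reference_length = sum(kind != "delete" for kind, _, _ in alignment)
    return edit_distance(alignment) / reference_length


def error_counts(alignments):
    """Tally error types over a collection of alignments."""
    confusions, omitted, spurious = Counter(), Counter(), Counter()
    for single_alignment in alignments:
        for kind, predicted, reference in single_alignment:
            if kind == "replace":
                confusions[(predicted, reference)] += 1
            elif kind == "insert":    # in the reference, missing from the prediction
                omitted[reference] += 1
            elif kind == "delete":    # in the prediction, absent from the reference
                spurious[predicted] += 1
    return confusions, omitted, spurious


distance = edit_distance(alignment)
cer = character_error_rate(alignment)
confusions, omitted, spurious = error_counts([alignment])
print(distance, cer)
print(confusions.most_common(1), omitted.most_common(1), spurious.most_common(2))
```

For the example pair, the four non-keep blocks give an edit distance of 4 against a reference of 15 characters.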

What's the point?

Stringalign might sound similar to other Python libraries, like Jiwer and Levenshtein (which both use Rapidfuzz behind the scenes). However, what sets Stringalign apart is that it handles Unicode "correctly".

Take this example:

import Levenshtein

print(Levenshtein.distance('ñ', 'ñ'))
2

What happened here? The first 'ñ' consists of two code points: an n and a "put a tilde on the previous character" code point, while the second 'ñ' only consists of the single code point 'ñ'. Let's try it with Stringalign instead:

from stringalign.align import levenshtein_distance

print(levenshtein_distance('ñ', 'ñ'))
0

We see the expected behaviour. By default, Stringalign will normalize your text and segment it into Unicode extended grapheme clusters before aligning. An extended grapheme cluster is essentially what a computer should display as a single character. A grapheme cluster is usually just one code point, but not always. Since tools like Jiwer and Levenshtein work directly on code points, they miss these edge cases.
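
The two spellings of 'ñ' can be inspected with Python's standard unicodedata module. This sketch only illustrates Unicode normalization in general. Which normalization form Stringalign applies by default is not stated here, so the use of NFC and NFD below is an assumption made for the illustration.

```python
import unicodedata

decomposed = "n\u0303"   # 'n' followed by U+0303 COMBINING TILDE
precomposed = "\u00f1"   # U+00F1 LATIN SMALL LETTER N WITH TILDE

# Both render as the same letter, but they are different code point sequences.
print(len(decomposed), len(precomposed))   # 2 1
print(decomposed == precomposed)           # False

# After normalization the two spellings compare equal.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True
```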

Emojis

Often, we don't need to worry about the difference between code points and grapheme clusters. However, the moment emojis come into the picture, this changes. Many emojis are just a single code point. However, some are created by combining two or more code points, like '🏳️‍🌈', which is created by combining the white flag emoji, '🏳', a variation selector, '\uFE0F', a zero width joiner, '\u200D', and the rainbow emoji, '🌈'. Because Stringalign takes care of aligning normalized grapheme clusters automatically, it also works correctly with emojis:

import Levenshtein
from stringalign.align import levenshtein_distance

print("Levenshtein", Levenshtein.distance('🏳️‍🌈', '🌈'))
print("Stringalign", levenshtein_distance('🏳️‍🌈', '🌈'))
Levenshtein 3
Stringalign 1
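
The code point count behind the first number can be checked with the standard library alone. The sketch below spells the rainbow flag with escape sequences so that the four code points are visible in the source. It does not perform grapheme segmentation, which is the part Stringalign handles.

```python
import unicodedata

# White flag, variation selector-16, zero width joiner, rainbow.
rainbow_flag = "\U0001F3F3\uFE0F\u200D\U0001F308"
rainbow = "\U0001F308"

code_points = [ord(ch) for ch in rainbow_flag]
for ch in rainbow_flag:
    # Fall back to a placeholder if the local Unicode database lacks a name.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))

# Python strings are sequences of code points, so the flag has length 4.
print(len(rainbow_flag), len(rainbow))   # 4 1
```

A code point based edit distance has to remove three of those four code points to reach the lone rainbow, which is where the 3 comes from. Treated as grapheme clusters, each string is a single unit, so one replacement suffices.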

How does it work?

Stringalign works in a two-step process: first, the input strings are tokenised into normalised extended grapheme clusters; then, the tokens are aligned using the Needleman-Wunsch algorithm. You can customise this if you want, e.g. by swapping the tokeniser for one that casefolds all extended grapheme clusters to get a case-insensitive alignment, or for one that splits the text into words, e.g. to compute the word error rate.

We use an extension module written in Rust for two important parts of Stringalign: grouping Unicode code points into extended grapheme clusters (with the unicode_segmentation crate) and assembling the Needleman-Wunsch cost matrix (which has O(n²) time and memory complexity).
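
As a rough pure-Python illustration of the two steps (tokenise, then fill a cost matrix), consider the sketch below. It is not Stringalign's implementation: it assumes unit costs for insertions, deletions and substitutions, the function names are invented for this example, and the character tokeniser works on code points as a stand-in for proper extended grapheme clusters.

```python
def cost_matrix(reference, predicted):
    """Fill the (len(reference) + 1) x (len(predicted) + 1) unit-cost matrix.

    Every cell is visited once, hence quadratic time and memory.
    """
    rows, cols = len(reference) + 1, len(predicted) + 1
    cost = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        cost[i][0] = i            # only insertions needed
    for j in range(cols):
        cost[0][j] = j            # only deletions needed
    for i in range(1, rows):
        for j in range(1, cols):
            mismatch = reference[i - 1] != predicted[j - 1]
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch,   # keep or replace
                cost[i - 1][j] + 1,              # insert a reference token
                cost[i][j - 1] + 1,              # delete a predicted token
            )
    return cost


def edit_distance(reference_tokens, predicted_tokens):
    return cost_matrix(reference_tokens, predicted_tokens)[-1][-1]


# Hypothetical tokenisers. Code points stand in for grapheme clusters here.
def character_tokens(text):
    return list(text)


def casefold_tokens(text):
    return [ch.casefold() for ch in text]


def word_tokens(text):
    return text.split()


reference, predicted = "Banana pancakes", "bananana pancake"
case_sensitive = edit_distance(character_tokens(reference), character_tokens(predicted))
case_insensitive = edit_distance(casefold_tokens(reference), casefold_tokens(predicted))

ref_words = word_tokens("the quick brown fox")
word_errors = edit_distance(ref_words, word_tokens("the quick fox"))
word_error_rate = word_errors / len(ref_words)

print(case_sensitive, case_insensitive, word_error_rate)   # 4 3 0.25
```

Swapping the tokeniser changes what counts as one unit of error, while the alignment step stays the same.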

Installing Stringalign

Since Stringalign is still experimental, we don't yet provide wheels, so you need to compile it from source. To do this, you first need to install Rustup, which will give you the necessary Rust tools. Then, you can install Stringalign directly from Git: pip install git+https://github.com/yngvem/stringalign. Alternatively, if you want to use it in a PEP 621-formatted pyproject.toml file: stringalign@git+https://github.com/yngvem/stringalign.

If you want to install a specific commit of stringalign, then you can run pip install https://github.com/yngvem/stringalign/archive/{commit-hash}.zip, or, in a pyproject.toml file: stringalign@https://github.com/yngvem/stringalign/archive/39d8eab113b5eca272c533b5384da3f4dbe29424.zip

Citing Stringalign

If you use Stringalign for your research, then please cite this repo. For example:

Moe, Y. M., & Roald, M. (2024). Stringalign [Computer software]. https://github.com/yngvem/stringalign
