Align strings and compute evaluation metrics

Stringalign (experimental)

Two cute caterpillars dancing under bunting with the letters 'STRING ALIGN'

A string comparison library that adheres to the quirks of Unicode.

What is this?

Stringalign is a library for comparing strings. At its base, Stringalign takes two strings, a reference string and a predicted string, and aligns them. Based on this alignment, we can then compute many interesting performance metrics, such as the edit distance (Levenshtein), error rates and much more.

A little example

For example, if the reference string is Banana pancakes and the predicted string is bananana pancake, then Stringalign will align them as follows:

B--anana pancakes
bananana pancake-

This alignment is stored as a collection of replacement, insertion, deletion and keep blocks that describe what we need to do with the predicted string to make it equal to the reference string. For the strings above, we get

[Replace('b', 'B'), Delete('a'), Delete('n'), Keep('a'), Keep('n'), Keep('a'), Keep('n'), Keep('a'), Keep(' '), Keep('p'), Keep('a'), Keep('n'), Keep('c'), Keep('a'), Keep('k'), Keep('e'), Insert('s')]

or, if we join consecutive Delete, Insert and Replace operations (and consecutive Keep operations):

[Replace('ban', 'B'), Keep('anana pancake'), Insert('s')]
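The joining step can be sketched in plain Python. This is only an illustration: the Op class and the join_consecutive function below are hypothetical stand-ins and are not Stringalign's own classes or API. The sketch assumes that every run of non-keep operations collapses into one block whose type depends on which sides of the alignment it touches.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Op:
    """Hypothetical stand-in for an alignment operation (not Stringalign's classes)."""

    kind: str  # "keep", "replace", "delete" or "insert"
    predicted: str  # text taken from the predicted string ("" for insert)
    reference: str  # text taken from the reference string ("" for delete)


def join_consecutive(ops):
    """Merge each run of keep ops, and each run of non-keep ops, into one block."""
    joined = []
    for is_keep, run in groupby(ops, key=lambda op: op.kind == "keep"):
        run = list(run)
        predicted = "".join(op.predicted for op in run)
        reference = "".join(op.reference for op in run)
        if is_keep:
            kind = "keep"
        elif predicted and reference:
            kind = "replace"  # the run touches both strings
        elif predicted:
            kind = "delete"  # only surplus predicted text
        else:
            kind = "insert"  # only missing reference text
        joined.append(Op(kind, predicted, reference))
    return joined


# The unjoined alignment from the example above.
ops = (
    [Op("replace", "b", "B"), Op("delete", "a", ""), Op("delete", "n", "")]
    + [Op("keep", c, c) for c in "anana pancake"]
    + [Op("insert", "", "s")]
)

for op in join_consecutive(ops):
    print(op)
```

Under these assumptions the three leading operations collapse into a single replace of 'ban' by 'B', which matches the joined list shown above.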

Based on these alignments, we can compute standard string comparison metrics such as the Levenshtein distance and character error rate. However, Stringalign also contains functions for more in-depth analysis of the types of errors that occur when you have a whole collection of reference and predicted strings. Examples of this are the most common character confusions, the letters most often omitted in the prediction and the letters most often incorrectly included in the prediction.
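As a rough illustration of how such metrics follow from an alignment, the sketch below counts the non-keep operations in the example alignment. The (kind, predicted, reference) tuples are a hypothetical representation, not Stringalign's data model, and the character error rate is assumed to be the number of edits divided by the reference length, which is a common definition.

```python
from collections import Counter

# Hypothetical (kind, predicted, reference) tuples mirroring the example alignment.
alignment = (
    [("replace", "b", "B"), ("delete", "a", ""), ("delete", "n", "")]
    + [("keep", c, c) for c in "anana pancake"]
    + [("insert", "", "s")]
)

# Every non-keep operation is one edit.
edits = [op for op in alignment if op[0] != "keep"]
edit_distance = len(edits)

# Each operation with a reference side accounts for one reference character.
reference_length = sum(1 for _, _, ref in alignment if ref)
cer = edit_distance / reference_length

# Tally which predicted character was confused with which reference character.
confusions = Counter((pred, ref) for kind, pred, ref in edits if kind == "replace")
omitted = Counter(ref for kind, _, ref in edits if kind == "insert")
spurious = Counter(pred for kind, pred, _ in edits if kind == "delete")

print(edit_distance, reference_length)  # 4 15
print(confusions.most_common(1))  # [(('b', 'B'), 1)]
```

Summing such counters over many reference and predicted pairs gives the kind of aggregate error statistics described above.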

What's the point?

Stringalign might sound similar to other Python libraries, like Jiwer and Levenshtein (which both use Rapidfuzz behind the scenes). However, what sets Stringalign apart is that it handles Unicode "correctly".

Take this example:

import Levenshtein

# Escapes make the two spellings explicit: 'n' + combining tilde vs. precomposed 'ñ'
print(Levenshtein.distance('n\u0303', '\u00f1'))
2

What happened here? The first 'ñ' consists of two code points: an n followed by a combining tilde (U+0303), which means "put a tilde on the previous character". The second 'ñ' consists only of the single precomposed code point U+00F1. Let's try it with Stringalign instead:

from stringalign.align import levenshtein_distance

# Same two spellings as above: 'n' + combining tilde vs. precomposed 'ñ'
print(levenshtein_distance('n\u0303', '\u00f1'))
0

We see the expected behaviour. By default, Stringalign normalizes your text and segments it into Unicode extended grapheme clusters before aligning. An extended grapheme cluster is essentially what a computer should display as one letter. A grapheme cluster is usually a single code point, but not always. Since tools like Jiwer and Levenshtein work directly on code points, they miss these edge cases.
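The effect of normalization can be checked with Python's standard unicodedata module alone. The sketch below uses NFC as the normalization form; which form Stringalign applies by default is not stated here, so treat that choice as an assumption.

```python
import unicodedata

decomposed = "n\u0303"  # LATIN SMALL LETTER N followed by COMBINING TILDE
precomposed = "\u00f1"  # LATIN SMALL LETTER N WITH TILDE

# The two strings render the same but differ at the code point level.
print(len(decomposed), len(precomposed))  # 2 1
print(decomposed == precomposed)  # False

# After normalization they compare equal.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == precomposed)  # True
```

Normalization alone does not handle every case, which is why the segmentation into grapheme clusters matters as well.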

Emojis

Often, we don't need to worry about the difference between code points and grapheme clusters. However, the moment emojis come into the picture, this changes. Many emojis are a single code point, but some are created by combining two or more other emojis. One example is '🏳️‍🌈', which is created by combining the white flag emoji '🏳', a variation selector '\uFE0F', a zero-width joiner '\u200D' and the rainbow emoji '🌈'. Because Stringalign takes care of aligning normalized grapheme clusters automatically, it also works correctly with emojis:

import Levenshtein
from stringalign.align import levenshtein_distance

print("Levenshtein", Levenshtein.distance('🏳️‍🌈', '🌈'))
print("Stringalign", levenshtein_distance('🏳️‍🌈', '🌈'))
Levenshtein 3
Stringalign 1
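To see where the code-point-level distance of 3 comes from, the rainbow flag can be spelled out with escapes using only the standard library. This is a small sketch for inspection and does not use Stringalign itself.

```python
# The rainbow flag written as its four code points.
flag = "\U0001F3F3\uFE0F\u200D\U0001F308"
rainbow = "\U0001F308"

code_points = [f"U+{ord(ch):04X}" for ch in flag]
print(code_points)  # ['U+1F3F3', 'U+FE0F', 'U+200D', 'U+1F308']
print(len(flag), len(rainbow))  # 4 1

# The flag ends with the rainbow code point, so deleting the first three
# code points turns one string into the other: three code-point edits.
print(flag.endswith(rainbow))  # True
```

Treated as grapheme clusters, each string is a single unit, so one replacement is enough, which is the distance of 1 reported above.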

How does it work?

Stringalign works in a two-step process: first, the input strings are tokenised into normalised extended grapheme clusters, then they are aligned using the Needleman-Wunsch algorithm. You can customise this if you want, e.g. by swapping in a tokeniser that casefolds all extended grapheme clusters to get a case-insensitive alignment, or one that splits the text into words to compute the word error rate.
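The idea of swapping tokenisers can be sketched without Stringalign. The functions below are hypothetical and dependency-free: they split on code points rather than grapheme clusters, assume NFC normalization, and use a plain unit-cost edit distance, so they only illustrate how the choice of tokens changes the result.

```python
import unicodedata


def casefold_characters(text):
    """Hypothetical tokeniser: one casefolded token per code point.

    A grapheme-aware tokeniser would segment clusters first; splitting on
    code points keeps this sketch free of third-party dependencies.
    """
    return list(unicodedata.normalize("NFC", text).casefold())


def words(text):
    """Hypothetical tokeniser: whitespace-separated words."""
    return unicodedata.normalize("NFC", text).split()


def edit_distance(reference_tokens, predicted_tokens):
    """Unit-cost Levenshtein distance between two token sequences."""
    previous = list(range(len(predicted_tokens) + 1))
    for i, ref_token in enumerate(reference_tokens, start=1):
        current = [i]
        for j, pred_token in enumerate(predicted_tokens, start=1):
            substitution = previous[j - 1] + (ref_token != pred_token)
            current.append(min(substitution, previous[j] + 1, current[j - 1] + 1))
        previous = current
    return previous[-1]


reference, predicted = "Banana pancakes", "bananana pancake"

case_sensitive = edit_distance(list(reference), list(predicted))
case_insensitive = edit_distance(
    casefold_characters(reference), casefold_characters(predicted)
)
word_error_rate = edit_distance(words(reference), words(predicted)) / len(
    words(reference)
)

print(case_sensitive)  # 4
print(case_insensitive)  # 3
print(word_error_rate)  # 1.0
```

Casefolding removes the B/b mismatch, and at the word level both words differ, so the word error rate for this pair is 1.0.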

We use an extension module written in Rust for two important parts of Stringalign: grouping Unicode code points into extended grapheme clusters (with the unicode_segmentation crate) and assembling the Needleman-Wunsch cost matrix (which has O(n²) time and memory complexity for two strings of length n).
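For readers unfamiliar with the algorithm, here is a minimal pure-Python sketch of filling a Needleman-Wunsch style cost matrix and backtracking through it. It assumes unit costs for every edit and is not the Rust implementation used by Stringalign; the function names and the (kind, predicted, reference) tuples are hypothetical.

```python
def cost_matrix(reference, predicted):
    """Fill the full (len(reference) + 1) x (len(predicted) + 1) cost table."""
    rows, cols = len(reference) + 1, len(predicted) + 1
    cost = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        cost[i][0] = i  # i reference tokens missing from an empty prediction
    for j in range(cols):
        cost[0][j] = j  # j surplus predicted tokens
    for i in range(1, rows):
        for j in range(1, cols):
            mismatch = reference[i - 1] != predicted[j - 1]
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch,  # keep or replace
                cost[i - 1][j] + 1,  # insert a reference token
                cost[i][j - 1] + 1,  # delete a predicted token
            )
    return cost


def backtrack(cost, reference, predicted):
    """Recover one optimal alignment by walking back from the last cell."""
    i, j = len(reference), len(predicted)
    ops = []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            mismatch = reference[i - 1] != predicted[j - 1]
            if cost[i][j] == cost[i - 1][j - 1] + mismatch:
                kind = "replace" if mismatch else "keep"
                ops.append((kind, predicted[j - 1], reference[i - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("insert", "", reference[i - 1]))
            i -= 1
        else:
            ops.append(("delete", predicted[j - 1], ""))
            j -= 1
    ops.reverse()
    return ops


reference, predicted = "Banana pancakes", "bananana pancake"
cost = cost_matrix(reference, predicted)
ops = backtrack(cost, reference, predicted)
print(cost[-1][-1])  # 4
print(sum(1 for kind, _, _ in ops if kind != "keep"))  # 4
```

The table has one cell per pair of prefixes, which is where the quadratic time and memory cost comes from. Several alignments can share the same minimal cost, so the operations returned here may differ from the ones listed earlier while still having the same number of edits.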

Installing Stringalign

Since Stringalign is still experimental, we don't yet provide wheels, so you need to compile it from source. To do this, you first need to install Rustup, which will give you the necessary Rust tools. Then you can install Stringalign directly from Git: pip install git+https://github.com/yngvem/stringalign. Alternatively, if you want to use it in a PEP 621-formatted pyproject.toml file: stringalign@git+https://github.com/yngvem/stringalign.

If you want to install a specific commit of Stringalign, you can run pip install https://github.com/yngvem/stringalign/archive/{commit-hash}.zip or, in a pyproject.toml file: stringalign@https://github.com/yngvem/stringalign/archive/39d8eab113b5eca272c533b5384da3f4dbe29424.zip

Citing Stringalign

If you use Stringalign for your research, then please cite this repo. For example:

Moe, Y. M., & Roald, M. (2014). Stringalign [Computer software]. https://github.com/yngvem/stringalign

