Align strings and compute evaluation metrics

Stringalign (experimental)

Two cute caterpillars dancing under bunting with the letters 'STRING ALIGN'

A string comparison library that adheres to the quirks of Unicode.

What is this?

Stringalign is a library for comparing strings. At its core, Stringalign takes two strings, a reference string and a predicted string, and aligns them. Based on this alignment, we can then compute performance metrics such as the edit distance (Levenshtein) and various error rates.

A little example

For example, if the reference string is Banana pancakes and the predicted string is bananana pancake, then Stringalign will align them as follows:

B--anana pancakes
bananana pancake-

This alignment is stored as a collection of replace, insert, delete and keep blocks that describe what we need to do with the predicted string to make it equal to the reference string. For the strings above, we get

[Replace('b', 'B'), Delete('a'), Delete('n'), Keep('a'), Keep('n'), Keep('a'), Keep('n'), Keep('a'), Keep(' '), Keep('p'), Keep('a'), Keep('n'), Keep('c'), Keep('a'), Keep('k'), Keep('e'), Insert('s')]

or, if we join consecutive Delete, Insert and Replace blocks (and likewise consecutive Keep blocks):

[Replace('ban', 'B'), Keep('anana pancake'), Insert('s')]
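The joining step can be sketched in plain Python. The tuple layout `(kind, predicted_text, reference_text)` and the `merge_ops` helper are hypothetical stand-ins used only for illustration; they are not Stringalign's own classes or API.

```python
from itertools import groupby

# Hypothetical representation: (kind, predicted_text, reference_text).
# These tuples mirror the per-character operations listed above.
ops = (
    [("replace", "b", "B"), ("delete", "a", ""), ("delete", "n", "")]
    + [("keep", ch, ch) for ch in "anana pancake"]
    + [("insert", "", "s")]
)


def merge_ops(ops):
    """Join each run of keeps, and each run of edits, into a single block."""
    merged = []
    for is_keep, run in groupby(ops, key=lambda op: op[0] == "keep"):
        run = list(run)
        predicted = "".join(op[1] for op in run)
        reference = "".join(op[2] for op in run)
        if is_keep:
            kind = "keep"
        elif predicted and reference:
            kind = "replace"  # text on both sides: swap one for the other
        elif predicted:
            kind = "delete"   # only predicted text: remove it
        else:
            kind = "insert"   # only reference text: add it
        merged.append((kind, predicted, reference))
    return merged


print(merge_ops(ops))
```

A run that mixes replacements and deletions collapses into one replace block, which is why `b`, `a`, `n` on the predicted side end up paired with `B` on the reference side.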

Based on these alignments, we can compute standard string comparison metrics such as the Levenshtein distance and character error rate. However, Stringalign also contains functions for more in-depth analysis of the types of errors that occur across a whole collection of reference and predicted strings. Examples include the most common character confusions, the letters most often omitted in the prediction, and the letters most often incorrectly included in the prediction.
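As a rough illustration of this kind of analysis, the sketch below tallies error types over a few alignments. The `(kind, predicted_text, reference_text)` tuples, the `tally_errors` helper and the toy data are all made up for this example and do not reflect Stringalign's actual API.

```python
from collections import Counter

# Toy alignments in a hypothetical (kind, predicted_text, reference_text) form.
alignments = [
    [("replace", "ban", "B"), ("keep", "anana pancake", "anana pancake"), ("insert", "", "s")],
    [("keep", "he", "he"), ("replace", "1", "l"), ("keep", "lo", "lo")],
    [("keep", "wor", "wor"), ("replace", "1", "l"), ("keep", "d", "d"), ("delete", "!", "")],
]


def tally_errors(alignments):
    """Count confusions, omissions and spurious text across many alignments."""
    confusions = Counter()  # (predicted, reference) pairs from replace blocks
    omitted = Counter()     # reference text missing from the prediction
    spurious = Counter()    # predicted text that should not be there
    for alignment in alignments:
        for kind, predicted, reference in alignment:
            if kind == "replace":
                confusions[(predicted, reference)] += 1
            elif kind == "insert":
                omitted[reference] += 1
            elif kind == "delete":
                spurious[predicted] += 1
    return confusions, omitted, spurious


confusions, omitted, spurious = tally_errors(alignments)
print(confusions.most_common(1))  # [(('1', 'l'), 2)]
print(omitted, spurious)
```

Because the operations describe how to turn the prediction into the reference, an insert block marks text the prediction omitted, and a delete block marks text the prediction wrongly included.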

What's the point?

Stringalign might sound similar to other Python libraries, like Jiwer and Levenshtein (which both use Rapidfuzz behind the scenes). However, what sets Stringalign apart is that it handles Unicode "correctly".

Take this example:

import Levenshtein

print(Levenshtein.distance('ñ', 'ñ'))
2

What happened here? The first 'ñ' consists of two code points: an n and a "put a tilde on the previous character" code point, while the second 'ñ' only consists of the single code point 'ñ'. Let's try it with Stringalign instead:

from stringalign.align import levenshtein_distance

print(levenshtein_distance('ñ', 'ñ'))
0

We see the expected behaviour. By default, Stringalign normalizes your text and segments it into Unicode extended grapheme clusters before aligning. An extended grapheme cluster is essentially what a computer should display as one letter; it is usually a single code point, but not always. Since tools like Jiwer and Levenshtein work directly on code points, they miss these edge cases.
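The difference between the two spellings of 'ñ' can be inspected with Python's standard `unicodedata` module. This is only a sketch of the underlying Unicode behaviour; the choice of NFC here is an assumption made for illustration and is not a statement about which normalization form Stringalign applies.

```python
import unicodedata

decomposed = "n\u0303"  # 'n' followed by U+0303 COMBINING TILDE
composed = "\u00f1"     # U+00F1 LATIN SMALL LETTER N WITH TILDE

# The two strings render the same but have different code points.
print(len(decomposed), len(composed))  # 2 1
print(decomposed == composed)          # False

# After normalization to a common form they compare equal.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)          # True
```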

Emojis

Often, we don't need to worry about distinguishing between code points and grapheme clusters. However, the moment emojis come into the picture, this changes. Many emojis are just a single code point. However, some are created by combining two or more other emojis, like '🏳️‍🌈', which is created by combining the white-flag emoji, '🏳', a variation selector '\uFE0F', a zero width joiner '\u200D' and the rainbow emoji, '🌈'. Because Stringalign takes care of aligning normalized grapheme clusters automatically, it will also work correctly with emojis:

import Levenshtein
from stringalign.align import levenshtein_distance

print("Levenshtein", Levenshtein.distance('🏳️‍🌈', '🌈'))
print("Stringalign", levenshtein_distance('🏳️‍🌈', '🌈'))
Levenshtein 3
Stringalign 1
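To see where the code-point-based result of 3 comes from, it can help to list the code points that make up the rainbow flag. The `code_points` helper below is a hypothetical name introduced for this sketch; it uses only the standard library.

```python
import unicodedata

# The rainbow flag written as explicit escapes: white flag, variation
# selector-16, zero width joiner, rainbow.
rainbow_flag = "\U0001F3F3\uFE0F\u200D\U0001F308"
rainbow = "\U0001F308"


def code_points(text):
    """Return the code points of `text` in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points(rainbow_flag))  # ['U+1F3F3', 'U+FE0F', 'U+200D', 'U+1F308']
for ch in rainbow_flag:
    # Fall back to a placeholder if this Python build has no name for `ch`.
    print(unicodedata.name(ch, "<unnamed>"))
```

Dropping the first three code points turns the flag into the plain rainbow, so a distance computed on code points is 3, while a distance computed on grapheme clusters sees one cluster replaced by another.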

How does it work?

Stringalign works in a two-step process: first, the input strings are tokenised into normalised extended grapheme clusters; then, the resulting tokens are aligned using the Needleman-Wunsch algorithm. You can customise this if you want, e.g. by swapping the tokeniser for one that casefolds all extended grapheme clusters to get a case-insensitive alignment, or for one that splits the text into words to compute the word error rate.
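The idea of swapping tokenisers can be sketched with two small stand-alone functions. Both `casefold_tokenise` and `word_tokenise` are hypothetical names, and for simplicity the first one splits on code points after NFC normalization rather than on extended grapheme clusters, so it is only an approximation of what Stringalign does.

```python
import unicodedata


def casefold_tokenise(text):
    """Hypothetical tokeniser: NFC-normalise, casefold, split into code points.

    Stringalign segments into extended grapheme clusters instead; code
    points are used here only to keep the sketch dependency-free.
    """
    return list(unicodedata.normalize("NFC", text).casefold())


def word_tokenise(text):
    """Hypothetical tokeniser: NFC-normalise and split on whitespace."""
    return unicodedata.normalize("NFC", text).split()


# Case differences disappear at the token level ...
print(casefold_tokenise("Banana") == casefold_tokenise("bANANA"))  # True
# ... and word tokens make each word one unit for the aligner.
print(word_tokenise("Banana pancakes"))  # ['Banana', 'pancakes']
```

Whatever the tokeniser returns, the alignment step only needs to compare tokens for equality, which is what makes the same algorithm usable for character-level and word-level metrics.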

We use an extension module written in Rust for two important parts of Stringalign: grouping Unicode code points into extended grapheme clusters (with the unicode_segmentation crate) and assembling the Needleman-Wunsch cost matrix (which has O(n²) time and memory complexity).
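To make the quadratic cost concrete, here is a pure-Python sketch of a Needleman-Wunsch style cost matrix. It assumes unit costs for substitution, insertion and deletion (which yields the Levenshtein distance); the `cost_matrix` function is illustrative only and is not the Rust implementation used by Stringalign.

```python
def cost_matrix(reference, predicted):
    """Fill the full (len(reference)+1) x (len(predicted)+1) cost matrix.

    Assumes unit costs, so the bottom-right entry is the edit distance.
    """
    rows, cols = len(reference) + 1, len(predicted) + 1
    cost = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        cost[i][0] = i  # reference prefix against empty prediction
    for j in range(cols):
        cost[0][j] = j  # empty reference against prediction prefix
    for i in range(1, rows):
        for j in range(1, cols):
            mismatch = reference[i - 1] != predicted[j - 1]
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch,  # keep or replace
                cost[i - 1][j] + 1,             # token only in reference
                cost[i][j - 1] + 1,             # token only in prediction
            )
    return cost


reference, predicted = "Banana pancakes", "bananana pancake"
matrix = cost_matrix(reference, predicted)
distance = matrix[-1][-1]
print(distance)                   # 4
print(distance / len(reference))  # character error rate over code points
```

Every cell is visited once and stored, so both time and memory grow with the product of the two sequence lengths. The inputs can be any sequences of comparable tokens, such as lists of grapheme clusters or words.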

Installing Stringalign

Since Stringalign is still experimental, we don't yet provide wheels, so you need to compile it from source. To do this, you first need to install Rustup, which will give you the necessary Rust tools. Then, you can install Stringalign directly from Git: pip install git+https://github.com/yngvem/stringalign. Alternatively, if you want to use it in a PEP 621-formatted pyproject.toml file: stringalign@git+https://github.com/yngvem/stringalign.

If you want to install a specific commit of Stringalign, then you can run pip install https://github.com/yngvem/stringalign/archive/{commit-hash}.zip, or, in a pyproject.toml file, use the same URL form with the commit hash filled in, for example: stringalign@https://github.com/yngvem/stringalign/archive/39d8eab113b5eca272c533b5384da3f4dbe29424.zip

Citing Stringalign

If you use Stringalign for your research, then please cite this repo. For example:

Moe, Y. M., & Roald, M. (2024). Stringalign [Computer software]. https://github.com/yngvem/stringalign
