Minimum Word Error Rate Alignment for speech recognition evaluation

mweralign

mweralign is a Python package for aligning a stream of words to a reference segmentation. It is designed for use in speech translation tasks, where system outputs must be aligned to a reference translation in order for standard MT metrics to work. This package is a Python wrapper around the original MWERAlign C++ library, which implements the AS-WER algorithm for automatic sentence segmentation and alignment. The wrapper also includes a modernization of that code and support for modern subword tokenization, which helps with alignment.
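To make the idea concrete, here is a toy sketch of minimum-edit-distance resegmentation: it splits a hypothesis word stream into as many contiguous chunks as there are reference segments, choosing the split points that minimize the total word-level edit distance. This is an illustration only, not the package's implementation; the function names `edit_distance` and `resegment` are hypothetical, the search is brute force, and the return format is chosen for readability rather than to match the CLI output.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # drop a reference word
                            curr[j - 1] + 1,           # extra hypothesis word
                            prev[j - 1] + (r != h)))   # match or substitution
        prev = curr
    return prev[-1]


def resegment(ref_segments, hyp_words):
    """Split hyp_words into len(ref_segments) contiguous chunks with minimal total edits."""
    refs = [segment.split() for segment in ref_segments]
    n = len(hyp_words)
    inf = float("inf")
    # best[k][j]: minimal edits aligning the first k references to hyp_words[:j]
    best = [[inf] * (n + 1) for _ in range(len(refs) + 1)]
    back = [[0] * (n + 1) for _ in range(len(refs) + 1)]
    best[0][0] = 0
    for k, ref in enumerate(refs, start=1):
        for j in range(n + 1):
            for i in range(j + 1):
                if best[k - 1][i] == inf:
                    continue
                cost = best[k - 1][i] + edit_distance(ref, hyp_words[i:j])
                if cost < best[k][j]:
                    best[k][j] = cost
                    back[k][j] = i
    # Walk the back-pointers from the end to recover the chunk boundaries.
    bounds = []
    j = n
    for k in range(len(refs), 0, -1):
        i = back[k][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    return [" ".join(hyp_words[i:j]) for i, j in bounds], best[len(refs)][n]


segments, cost = resegment(["the cat sat", "on the mat"],
                           "the cat sat on a mat".split())
print(segments, cost)
```

The sketch runs in time quadratic in the hypothesis length per reference segment, which is fine for a demonstration but far too slow for real documents.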

Installation

To install the package, you can use pip:

pip install mweralign

Or install from source:

git clone https://github.com/mjpost/mweralign
cd mweralign
pip install .

Usage

You can see usage information by running mweralign with the --help flag:

mweralign --help

The standard use case is to provide a reference file, in which segments (sentences) are listed one per line, and a hypothesis file, which contains the output of a speech translation system, and has no line requirements. The output will be a file with the same number of lines as the hypothesis, where each line contains the index of the segment in the reference that corresponds to that hypothesis line.

mweralign -r ref.txt -h hyp.txt -o aligned.txt

You will want to use a tokenizer. Two kinds are currently supported: "cj", which separates Han characters with whitespace, and any SentencePiece model, supplied as a filesystem path:

mweralign -r ref.zh.txt -h hyp.txt -o aligned.txt -t cj

# download the flores200 SPM model (one time)
sacrebleu -t wmt24 -l en-zh --echo src | sacrebleu -t wmt24 -l en-zh --tok flores200 > /dev/null
# align
mweralign -r ref.txt -h hyp.txt -o aligned.txt -t ~/.sacrebleu/models/flores200sacrebleuspm

You may also wish to supply the ISO 639-1 language code (-l zh). For zh and ja, this tells the underlying AS-WER algorithm not to prevent sentences from starting with the SentencePiece space character. For other languages, it has no effect.

mweralign -r ref.txt -h hyp.txt -o aligned.txt -t cj -l zh
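If you drive the tool from a script, the commands above can be assembled in Python. The sketch below uses only the flags shown in this section (-r, -h, -o, -t, -l); `build_mweralign_cmd` and `run_mweralign` are hypothetical helpers, not part of the package. Because no shell is involved, a leading ~ in a model path is expanded explicitly.

```python
import os
import subprocess


def build_mweralign_cmd(ref, hyp, out, tokenizer=None, lang=None):
    """Assemble the argument list for the mweralign CLI."""
    cmd = ["mweralign", "-r", str(ref), "-h", str(hyp), "-o", str(out)]
    if tokenizer is not None:
        # "cj" passes through unchanged; "~/..." model paths get expanded.
        cmd += ["-t", os.path.expanduser(str(tokenizer))]
    if lang is not None:
        cmd += ["-l", lang]
    return cmd


def run_mweralign(*args, **kwargs):
    """Run the CLI; requires mweralign to be installed and on PATH."""
    subprocess.run(build_mweralign_cmd(*args, **kwargs), check=True)


print(build_mweralign_cmd("ref.txt", "hyp.txt", "aligned.txt",
                          tokenizer="cj", lang="zh"))
```

Passing the arguments as a list rather than a single string avoids quoting problems with file names that contain spaces.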

Project layout

src/                 # C++ core library and standalone CLI
python/
  mweralign/         # Python package (CLI + wrappers)
  bindings/          # pybind11 bindings (mweralign._mweralign)
  tests/             # pytest unit + regression suite
    regression/      # golden-file CLI regression cases
CMakeLists.txt       # builds the standalone C++ `mweralign` binary
setup.py / pyproject.toml  # builds the Python package/extension

Development

Install in editable mode with the development dependencies and run the tests:

pip install -e ".[dev]"
pytest python/tests

Regression suite

The regression suite under python/tests/regression/ runs the mweralign CLI on fixed inputs and compares the output to committed golden files. Each case is a directory containing a cmd file (the CLI arguments), the input files it references, and an expected.txt golden output.

After an intentional change in behavior, regenerate the golden files with:

MWERALIGN_REGEN=1 pytest python/tests/test_regression.py

To add a new case, create a directory under python/tests/regression/, add a cmd file plus its input files, and run the regen command above to produce expected.txt.
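For readers unfamiliar with golden-file testing, the workflow described above could look roughly like the sketch below. It is not the repository's test_regression.py: `run_cli` and `check_case` are hypothetical names, and the sketch assumes that the arguments in cmd make the CLI print its result to standard output.

```python
import os
import shlex
import subprocess
from pathlib import Path


def run_cli(args, cwd):
    """Run mweralign inside the case directory and return its stdout (assumed)."""
    result = subprocess.run(["mweralign", *args], cwd=cwd,
                            capture_output=True, text=True, check=True)
    return result.stdout


def check_case(case_dir, runner=run_cli, regen=None):
    """Compare one case against expected.txt, or rewrite it in regen mode."""
    case_dir = Path(case_dir)
    args = shlex.split((case_dir / "cmd").read_text())
    actual = runner(args, case_dir)
    expected_path = case_dir / "expected.txt"
    if regen is None:
        # Mirrors the MWERALIGN_REGEN=1 switch described above.
        regen = os.environ.get("MWERALIGN_REGEN") == "1"
    if regen:
        expected_path.write_text(actual)
        return True
    return actual == expected_path.read_text()
```

Taking the runner as a parameter keeps the comparison logic testable without the compiled extension installed.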

Building the standalone C++ CLI

The Python package builds its own extension, so this is only needed if you want the standalone mweralign binary:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
# binary at build/mweralign

Citation

If you use this package, please cite the following two papers. We suggest a sentence similar to the following: "To align the text, we used the mweralign package \citep{post-huang-2025-effects}, which implements a variant of the AS-WER algorithm \citep{matusov-etal-2005-evaluating}."

License

This project contains code under multiple licenses:

  • Original C++ alignment code: GNU General Public License v3 (GPL-3.0)
  • Python bindings and wrapper code: Apache License 2.0
  • Build scripts and documentation: Apache License 2.0

The project as a whole is distributed under GPL-3.0 due to the inclusion of GPL-licensed components.

What this means for users:

  • You can use this library in GPL-compatible projects
  • If you distribute software that includes this library, your software must be GPL-compatible
  • The Python wrapper code (separate from the C++ core) is available under Apache License 2.0

Attribution

This software includes original GPL-licensed C++ code for alignment algorithms. Python bindings and packaging by Matt Post (Apache License 2.0).
