Minimum Word Error Rate Alignment for speech recognition evaluation

mweralign

mweralign is a Python package for aligning a stream of words to a reference segmentation. It is designed for use in speech translation tasks, where system outputs must be aligned to a reference translation in order for standard MT metrics to work. This package is a Python wrapper around the original MWERAlign C++ library, which implements the AS-WER algorithm for automatic sentence segmentation and alignment. The wrapper also includes a modernization of that code and support for modern subword tokenization, which helps with alignment.

Installation

To install the package, you can use pip:

pip install mweralign

Or install from source:

git clone https://github.com/mjpost/mweralign
cd mweralign
pip install .

Usage

You can see usage information by running mweralign with the --help flag:

mweralign --help

The standard use case takes a reference file, with one segment (sentence) per line, and a hypothesis file containing the output of a speech translation system, which has no line-break requirements. The output is a file with the same number of lines as the hypothesis, where each line contains the index of the reference segment that corresponds to that hypothesis line.

mweralign -r ref.txt -h hyp.txt -o aligned.txt

You will usually want to use a tokenizer. Two kinds are currently supported: "cj", which separates Han characters with whitespace, and any SentencePiece model, given as a filesystem path:

mweralign -r ref.zh.txt -h hyp.txt -o aligned.txt -t cj

# download the flores200 SPM model (one time)
sacrebleu -t wmt24 -l en-zh --echo src | sacrebleu -t wmt24 -l en-zh --tok flores200 > /dev/null
# align
mweralign -r ref.txt -h hyp.txt -o aligned.txt -t ~/.sacrebleu/models/flores200sacrebleuspm

You may also wish to supply the ISO 639-1 language code (-l zh). For zh and ja, this tells the underlying AS-WER algorithm to allow sentences to start with the SentencePiece space character. For other languages, it has no effect.

mweralign -r ref.txt -h hyp.txt -o aligned.txt -t cj -l zh
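
If you drive the CLI from a Python script, a small helper can assemble the argument list from the flags shown above (-r, -h, -o, -t, -l). This is a minimal sketch: the helper names build_mweralign_command and run_mweralign are hypothetical and not part of the package, and the sketch assumes the mweralign executable is on your PATH.

```python
import subprocess
from typing import List, Optional


def build_mweralign_command(
    ref: str,
    hyp: str,
    out: str,
    tokenizer: Optional[str] = None,
    lang: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list using only the flags documented above."""
    cmd = ["mweralign", "-r", ref, "-h", hyp, "-o", out]
    if tokenizer is not None:
        cmd += ["-t", tokenizer]  # "cj" or a path to a SentencePiece model
    if lang is not None:
        cmd += ["-l", lang]  # ISO 639-1 code, e.g. "zh"
    return cmd


def run_mweralign(*args, **kwargs) -> None:
    """Run the CLI and raise CalledProcessError on a non-zero exit status."""
    subprocess.run(build_mweralign_command(*args, **kwargs), check=True)


# Print the command instead of running it, so this works without the CLI installed.
print(" ".join(build_mweralign_command("ref.txt", "hyp.txt", "aligned.txt", "cj", "zh")))
# mweralign -r ref.txt -h hyp.txt -o aligned.txt -t cj -l zh
```

Passing the arguments as a list avoids shell quoting problems when file paths contain spaces.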

Project layout

src/                 # C++ core library and standalone CLI
python/
  mweralign/         # Python package (CLI + wrappers)
  bindings/          # pybind11 bindings (mweralign._mweralign)
  tests/             # pytest unit + regression suite
    regression/      # golden-file CLI regression cases
CMakeLists.txt       # builds the standalone C++ `mweralign` binary
setup.py / pyproject.toml  # builds the Python package/extension

Development

Install in editable mode with the development dependencies and run the tests:

pip install -e ".[dev]"
pytest python/tests

Regression suite

The regression suite under python/tests/regression/ runs the mweralign CLI on fixed inputs and compares the output to committed golden files. Each case is a directory containing a cmd file (the CLI arguments), the input files it references, and an expected.txt golden output.

After an intentional change in behavior, regenerate the golden files with:

MWERALIGN_REGEN=1 pytest python/tests/test_regression.py

To add a new case, create a directory under python/tests/regression/, add a cmd file plus its input files, and run the regen command above to produce expected.txt.
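
The golden-file pattern described above can be sketched in a few lines of Python. This is not the project's test_regression.py, which is not shown here: the function check_case and the run_cli callable are hypothetical, and how the real suite captures the CLI output is an assumption. Only the file names cmd and expected.txt and the MWERALIGN_REGEN variable come from the description above.

```python
import os
import shlex
import tempfile
from pathlib import Path
from typing import Callable, List


def check_case(case_dir: Path, run_cli: Callable[[List[str]], str], regen: bool = False) -> bool:
    """Run one case and compare its output to the golden file.

    run_cli is a hypothetical stand-in that takes the CLI arguments and
    returns the produced output as a string.
    """
    args = shlex.split((case_dir / "cmd").read_text(encoding="utf-8"))
    actual = run_cli(args)
    golden = case_dir / "expected.txt"
    if regen:
        # Regeneration mode: overwrite the golden file with the current output.
        golden.write_text(actual, encoding="utf-8")
        return True
    return golden.read_text(encoding="utf-8") == actual


def fake_cli(args: List[str]) -> str:
    """Toy runner used only for this demonstration."""
    return " ".join(args) + "\n"


# In a real suite, regeneration would be switched on by the environment.
regen_requested = os.environ.get("MWERALIGN_REGEN") == "1"

with tempfile.TemporaryDirectory() as tmp:
    case = Path(tmp)
    (case / "cmd").write_text("-r ref.txt -h hyp.txt", encoding="utf-8")
    check_case(case, fake_cli, regen=True)  # writes expected.txt
    print(check_case(case, fake_cli))  # True
```

Keeping the comparison and the regeneration in one function means a new case only needs its cmd file and inputs; the first regen run creates expected.txt.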

Building the standalone C++ CLI

The Python package builds its own extension, so this is only needed if you want the standalone mweralign binary:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
# binary at build/mweralign

Citation

If you use this package, please cite the following two papers. We suggest a sentence similar to the following: "To align the text, we used the mweralign package \citep{post-hoang-2025-effects}, which implements a variant of the AS-WER algorithm \citep{matusov2005evaluating}."

@inproceedings{post-hoang-2025-effects,
    title = "Effects of automatic alignment on speech translation metrics",
    author = "Post, Matt and Hoang, Hieu",
    editor = "Salesky, Elizabeth and Federico, Marcello and Anastasopoulos, Antonis",
    booktitle = "Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria (in-person and online)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.iwslt-1.7/",
    doi = "10.18653/v1/2025.iwslt-1.7",
    pages = "84--92",
    ISBN = "979-8-89176-272-5",
}

@inproceedings{matusov2005evaluating,
    title={Evaluating machine translation output with automatic sentence segmentation},
    author={Matusov, Evgeny and Leusch, Gregor and Bender, Oliver and Ney, Hermann},
    booktitle={IWSLT 2005},
    pages={138--144},
    year={2005}
}

License

This project contains code under multiple licenses:

  • Original C++ alignment code: GNU General Public License v3 (GPL-3.0)
  • Python bindings and wrapper code: Apache License 2.0
  • Build scripts and documentation: Apache License 2.0

The project as a whole is distributed under GPL-3.0 due to the inclusion of GPL-licensed components.

What this means for users:

  • You can use this library in GPL-compatible projects
  • If you distribute software that includes this library, your software must be GPL-compatible
  • The Python wrapper code (separate from the C++ core) is available under Apache License 2.0

Attribution

This software includes original GPL-licensed C++ code for alignment algorithms. Python bindings and packaging by Matt Post (Apache License 2.0).
