Minimum Word Error Rate Alignment for speech recognition evaluation
mweralign
mweralign is a Python package for aligning a stream of words to a reference segmentation. It is designed for use in speech translation tasks, where system outputs must be aligned to a reference translation in order for standard MT metrics to work. This package is a Python wrapper around the original MWERAlign C++ library, which implements the AS-WER algorithm for automatic sentence segmentation and alignment. The wrapper also includes a modernization of that code and support for modern subword tokenization, which helps with alignment.
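To make the idea of aligning a word stream to a reference segmentation concrete, here is a minimal sketch in Python. It is not the package's implementation (which is C++ and handles tokenization and other details). It only illustrates the general approach of computing a word-level edit-distance alignment between the concatenated reference and the hypothesis, then cutting the hypothesis wherever the alignment crosses a reference segment boundary. The function name `resegment` is hypothetical.

```python
def resegment(ref_segments, hyp_words):
    """Split hyp_words into len(ref_segments) chunks along a minimum
    edit-distance alignment to the concatenated reference words.

    ref_segments: list of lists of words (one list per reference segment).
    hyp_words: flat list of hypothesis words with no segmentation.
    """
    ref_words = [w for seg in ref_segments for w in seg]
    n, m = len(ref_words), len(hyp_words)

    # dist[i][j] = edit distance between ref_words[:i] and hyp_words[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    # Backtrace one optimal path. cut[i] is the number of hypothesis words
    # consumed when the path reaches reference position i.
    i, j = n, m
    cut = {n: m}
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dist[i][j] == dist[i - 1][j - 1]
                + (ref_words[i - 1] != hyp_words[j - 1])):
            i, j = i - 1, j - 1      # match or substitution
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            i -= 1                   # reference word with no hypothesis word
        else:
            j -= 1                   # extra hypothesis word
        cut.setdefault(i, j)

    # Cut the hypothesis at each reference segment boundary.
    chunks, start, ref_end = [], 0, 0
    for seg in ref_segments:
        ref_end += len(seg)
        end = cut[ref_end]
        chunks.append(hyp_words[start:end])
        start = end
    return chunks


ref = [["the", "cat", "sat"], ["on", "the", "mat"]]
hyp = "the cat sat down on a mat".split()
print(resegment(ref, hyp))
```

With these inputs the sketch prints `[['the', 'cat', 'sat', 'down'], ['on', 'a', 'mat']]`: the extra word "down" is attached to the first segment, and "a" is treated as a substitution for "the" in the second.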
Installation
To install the package, you can use pip:
pip install mweralign
Or install from source:
git clone https://github.com/mjpost/mweralign
cd mweralign
pip install .
Usage
You can see usage information by running mweralign with the --help flag:
mweralign --help
The standard use case is to provide a reference file, in which segments (sentences) are listed one per line, and a hypothesis file, which contains the output of a speech translation system, and has no line requirements. The output will be a file with the same number of lines as the hypothesis, where each line contains the index of the segment in the reference that corresponds to that hypothesis line.
mweralign -r ref.txt -h hyp.txt -o aligned.txt
You will want to use a tokenizer. Currently supported are "cj", which separates Han characters with whitespace, and any SentencePiece model, which is specified by its filesystem path:
mweralign -r ref.zh.txt -h hyp.txt -o aligned.txt -t cj
# download the flores200 SPM model (one time)
sacrebleu -t wmt24 -l en-zh --echo src | sacrebleu -t wmt24 -l en-zh --tok flores200 > /dev/null
# align
mweralign -r ref.txt -h hyp.txt -o aligned.txt -t ~/.sacrebleu/models/flores200sacrebleuspm
You may also wish to supply the ISO 639-1 language code (-l zh). For zh and ja, this tells the underlying AS-WER algorithm to allow sentences to start with the SentencePiece space character. For other languages, it has no effect.
mweralign -r ref.txt -h hyp.txt -o aligned.txt -t cj -l zh
Project layout
src/                        # C++ core library and standalone CLI
python/
  mweralign/                # Python package (CLI + wrappers)
  bindings/                 # pybind11 bindings (mweralign._mweralign)
  tests/                    # pytest unit + regression suite
    regression/             # golden-file CLI regression cases
CMakeLists.txt              # builds the standalone C++ `mweralign` binary
setup.py / pyproject.toml   # builds the Python package/extension
Development
Install in editable mode with the development dependencies and run the tests:
pip install -e ".[dev]"
pytest python/tests
Regression suite
The regression suite under python/tests/regression/ runs the mweralign
CLI on fixed inputs and compares the output to committed golden files. Each
case is a directory containing a cmd file (the CLI arguments), the input
files it references, and an expected.txt golden output.
After an intentional change in behavior, regenerate the golden files with:
MWERALIGN_REGEN=1 pytest python/tests/test_regression.py
To add a new case, create a directory under python/tests/regression/, add a
cmd file plus its input files, and run the regen command above to produce
expected.txt.
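The golden-file pattern described above can be sketched as follows. This is not the project's test_regression.py. The helper `run_case`, its `base_cmd` parameter, and the assumption that the command writes its result to standard output are all hypothetical, chosen only to show how a cmd file, an expected.txt file, and the MWERALIGN_REGEN switch could fit together.

```python
import os
import shlex
import subprocess
import sys
from pathlib import Path


def run_case(case_dir, base_cmd, regen=None):
    """Run one golden-file case.

    Reads the arguments from <case_dir>/cmd, runs base_cmd with them inside
    the case directory, and compares stdout to <case_dir>/expected.txt.
    When regen is true (or MWERALIGN_REGEN=1 is set and regen is None),
    expected.txt is overwritten with the new output instead.
    """
    case_dir = Path(case_dir)
    if regen is None:
        regen = os.environ.get("MWERALIGN_REGEN") == "1"

    args = shlex.split((case_dir / "cmd").read_text())
    result = subprocess.run(
        [*base_cmd, *args],
        cwd=case_dir,            # input paths in cmd are relative to the case
        capture_output=True,
        text=True,
        check=True,
    )

    expected = case_dir / "expected.txt"
    if regen:
        expected.write_text(result.stdout)
        return True
    return result.stdout == expected.read_text()
```

In a pytest suite, a function like this would typically be called once per case directory (for example through `pytest.mark.parametrize`) with `base_cmd` set to the command under test, and the test would assert that it returns `True`.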
Building the standalone C++ CLI
The Python package builds its own extension, so this is only needed if you want
the standalone mweralign binary:
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
# binary at build/mweralign
Citation
If you use this package, please cite the following two papers. We suggest a sentence similar to the following: "To align the text, we used the mweralign package \citep{post-huang-2025-effects}, which implements a variant of the AS-WER algorithm \citep{matusov-etal-2005-evaluating}."
License
This project contains code under multiple licenses:
- Original C++ alignment code: GNU General Public License v3 (GPL-3.0)
- Python bindings and wrapper code: Apache License 2.0
- Build scripts and documentation: Apache License 2.0
The project as a whole is distributed under GPL-3.0 due to the inclusion of GPL-licensed components.
What this means for users:
- You can use this library in GPL-compatible projects
- If you distribute software that includes this library, your software must be GPL-compatible
- The Python wrapper code (separate from the C++ core) is available under Apache License 2.0
Attribution
This software includes original GPL-licensed C++ code for alignment algorithms. Python bindings and packaging by Matt Post (Apache License 2.0).