Minimum Word Error Rate Alignment for speech recognition evaluation
mweralign
mweralign is a Python package for aligning a stream of words to a reference segmentation. It is designed for use in speech translation tasks, where system outputs must be aligned to a reference translation in order for standard MT metrics to work. This package is a Python wrapper around the original MWERAlign C++ library, which implements the AS-WER algorithm for automatic sentence segmentation and alignment. The wrapper also includes a modernization of that code and support for modern subword tokenization, which helps with alignment.
Installation
To install the package, you can use pip:
pip install mweralign
Or install from source:
git clone https://github.com/mjpost/mweralign
cd mweralign
pip install .
Usage
You can see usage information by running mweralign with the --help flag:
mweralign --help
The core flags are:
| Flag | Long form | Meaning |
|---|---|---|
| -r | --ref-file | Reference file: the target segmentation, one segment per line. Required. |
| -t | --hyp-file | Hypothesis file: the system output to re-segment. Required. |
| -o | --output | Output file (default: stdout). |
| -m | --tokenizer | Tokenizer/segmenter to use (default: spm32k; see below). |
| -l | --language | ISO 639-1 language code (e.g. de, zh). |
| -w | --no-whitespace | The language does not delimit words with whitespace (CJK); see Language code. |
| -d | --docid-file | Document ids, one per reference line (see Document-aware alignment). |
| | --score | Scoring mode: report WER instead of re-segmenting (see Scoring mode). |
| -V | --version | Print the installed version and exit. |
Aligning a hypothesis to a reference segmentation
The standard use case provides a reference file (one segment per line) and a hypothesis file (the output of a speech-translation system, with no constraints on how it is split into lines). mweralign concatenates the hypothesis into a single word stream and re-splits it to match the reference segmentation. The output has the same number of lines as the reference; each line is the slice of the hypothesis aligned to the corresponding reference segment:
mweralign -r ref.txt -t hyp.txt -o aligned.txt
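The re-splitting step can be illustrated with a toy dynamic program. The sketch below is not the package's implementation (the real aligner is the C++ AS-WER code and also handles tokenization); the names `edit_distance` and `resegment` are hypothetical and exist only for this illustration. It chooses the split points that minimize the summed word-level edit distance between each reference segment and its slice of the hypothesis stream.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # match or substitution
        prev = curr
    return prev[-1]


def resegment(ref_segments, hyp_lines):
    """Re-split the hypothesis into one slice per reference segment (toy version)."""
    hyp = " ".join(hyp_lines).split()      # concatenate into a single word stream
    refs = [seg.split() for seg in ref_segments]
    n = len(hyp)
    inf = float("inf")
    # best[k][j]: minimal cost of aligning the first k reference segments to hyp[:j]
    best = [[inf] * (n + 1) for _ in range(len(refs) + 1)]
    back = [[0] * (n + 1) for _ in range(len(refs) + 1)]
    best[0][0] = 0
    for k, ref in enumerate(refs, start=1):
        for j in range(n + 1):
            for i in range(j + 1):
                if best[k - 1][i] == inf:
                    continue
                cost = best[k - 1][i] + edit_distance(ref, hyp[i:j])
                if cost < best[k][j]:
                    best[k][j] = cost
                    back[k][j] = i
    # Walk the back-pointers from the end of the stream to recover the boundaries.
    bounds = []
    j = n
    for k in range(len(refs), 0, -1):
        i = back[k][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    return [" ".join(hyp[i:j]) for i, j in bounds]
```

For the reference segments "the cat sat" and "on the mat" and the hypothesis lines "the cat" and "sat on the mat", this sketch returns the two slices "the cat sat" and "on the mat". Its running time is far worse than a production aligner's, so it is only suitable for small examples.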
Tokenization
For good alignment you should use a tokenizer, selected with -m. The default is
spm32k (see the recommendation below). Supported values are:
- none (or whitespace): no tokenizer; split on whitespace only;
- cj: segments Han characters with whitespace (dependency-free, no model needed);
- a named, on-demand SentencePiece model: spm32k, spm64k, spm128k, or spm256k (spm is an alias for spm256k);
- a filesystem path to any SentencePiece .model file.
The named models are character-preserving (identity-normalization) models that ship with
the project. They are downloaded on demand the first time you request them, fetched from
the project's GitHub Release, verified against a checksum, and cached under
~/.cache/mweralign/models (override with MWERALIGN_SPM_DIR):
mweralign -r ref.txt -t hyp.txt -o aligned.txt -m spm32k # the default
mweralign -r ref.txt -t hyp.txt -o aligned.txt -m none # plain whitespace
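The cache-and-verify behavior described above can be sketched in a few lines. This is an illustration, not the package's code: the helper names `model_cache_dir` and `sha256_matches` are hypothetical, and SHA-256 is an assumption, since the text only says the download is verified against a checksum.

```python
import hashlib
import os
from pathlib import Path


def model_cache_dir(environ=None):
    """Return the model cache directory, honoring MWERALIGN_SPM_DIR if it is set."""
    environ = os.environ if environ is None else environ
    override = environ.get("MWERALIGN_SPM_DIR")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "mweralign" / "models"


def sha256_matches(path, expected_hex, chunk_size=1 << 20):
    """Stream a file through SHA-256 (assumed algorithm) and compare digests."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()
```

Reading the file in chunks keeps memory use flat regardless of model size.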
To pre-fetch the models (e.g. for offline use):
python -m mweralign.models --all # all sizes
python -m mweralign.models spm32k spm256k # specific ones
Recommendation: for the best segmentation quality, use a character-preserving
(identity-normalization) SentencePiece model for all languages, including CJK. In our
WMT24 experiments an identity SPM model restored the original segmentation far more
accurately than whitespace tokenization on every language pair, and on the CJK pairs
(en-ja, en-zh, ja-zh) it clearly outperformed the cj character segmenter (~94% vs. ~69%
boundary accuracy): per-character tokenization gives the aligner too much freedom, whereas
subword pieces constrain boundaries to sensible word edges. Vocabulary size has little
effect (32k is sufficient; 128k is marginally best), so a small model is a fine default.
The cj segmenter remains available as a dependency-free fallback.
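To make the comparison concrete, here is a minimal sketch of what per-character Han segmentation looks like. It is an assumption about the general technique, not the cj segmenter's actual code: the function name `segment_han` is hypothetical, and the character ranges cover only the CJK Unified Ideographs block and Extension A.

```python
import re

# Assumed definition of "Han character" for this sketch:
# CJK Unified Ideographs Extension A (U+3400..U+4DBF) and the main block (U+4E00..U+9FFF).
HAN = re.compile(r"([\u3400-\u4dbf\u4e00-\u9fff])")


def segment_han(text):
    """Surround every Han character with spaces, then collapse repeated whitespace."""
    spaced = HAN.sub(r" \1 ", text)
    return " ".join(spaced.split())
```

Every Han character becomes its own token, so every position between two characters is a candidate segment boundary, which is the extra freedom the recommendation above warns about.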
Note that the flores200 SPM model (e.g. from sacrebleu) applies NMT-style normalization
that rewrites characters, so it is unsuitable when you need the original text restored
verbatim; use an identity-normalization model such as spm32k for that.
Language code
You may also supply the ISO 639-1 language code with -l. For zh and ja, this tells
the underlying AS-WER algorithm not to prevent sentences from starting with the
SentencePiece space character. For other languages it has no effect.
mweralign -r ref.zh.txt -t hyp.txt -o aligned.txt -m spm256k -l zh
Equivalently, you can pass --no-whitespace (-w) for any language whose script does not
delimit words with whitespace (e.g. Chinese, Japanese), without naming a specific language:
mweralign -r ref.zh.txt -t hyp.txt -o aligned.txt -m spm256k -w
If the reference looks like a CJK script but neither -l zh/-l ja nor -w was given,
mweralign prints a one-line suggestion to add the flag.
When a SentencePiece model is used to tokenize a non-CJK language, the aligner also forbids
mid-word segment boundaries: no output segment may begin on a word-internal sub-word piece
(one lacking the leading ▁ marker), so re-segmenting never splits a word across two segments.
This is automatic and requires no flag. Pure-punctuation pieces are exempt, since they
legitimately attach to the preceding token.
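The boundary rule can be expressed as a small predicate over SentencePiece pieces. The sketch below is one reading of the rule as described here, not the aligner's code: `may_start_segment` is a hypothetical name, and "pure punctuation" is approximated with Unicode general categories beginning with P.

```python
import unicodedata

WORD_START = "\u2581"  # the SentencePiece "▁" marker that opens a new word


def is_pure_punctuation(piece):
    """True if the piece is non-empty and consists only of punctuation characters."""
    return bool(piece) and all(
        unicodedata.category(ch).startswith("P") for ch in piece
    )


def may_start_segment(piece):
    """A segment may begin on a word-initial piece or on a pure-punctuation piece."""
    return piece.startswith(WORD_START) or is_pure_punctuation(piece)
```

For the pieces `▁Hel`, `lo`, `,`, `▁world`, only `lo` is rejected as a segment start, because it continues the word begun by `▁Hel`.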
By default the re-segmented output is detokenized back to plain text. Pass --no-detok to
emit the tokenized pieces instead.
Document-aware alignment
If your hypothesis is split per document (rather than one big stream), pass a docid file
with -d. It lists one document id per reference line; reference lines sharing a docid
form a document, and the hypothesis file must contain one line per distinct document (in
order). Each document's hypothesis is then aligned independently to its own reference
segments:
mweralign -r ref.txt -t hyp.txt -d docids.txt -o aligned.txt
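The relationship between the three files can be sketched as follows. This is an illustration of the documented contract only: `group_by_document` is a hypothetical helper, and ordering documents by the first appearance of each docid is an assumption.

```python
def group_by_document(ref_lines, docids, hyp_lines):
    """Pair each document's reference segments with its single hypothesis line."""
    if len(ref_lines) != len(docids):
        raise ValueError("docid file must have one id per reference line")
    documents = {}  # dicts preserve insertion order, i.e. first appearance of each docid
    for docid, ref in zip(docids, ref_lines):
        documents.setdefault(docid, []).append(ref)
    if len(documents) != len(hyp_lines):
        raise ValueError("hypothesis file must have one line per distinct document")
    return [
        (docid, refs, hyp)
        for (docid, refs), hyp in zip(documents.items(), hyp_lines)
    ]
```

Each returned triple is then an independent alignment problem: the hypothesis line is re-split into as many segments as that document has reference lines.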
Scoring mode
With --score, mweralign skips alignment and instead computes word error rate on already
parallel input: ref.txt and hyp.txt must have the same number of lines, compared
line-by-line. It prints a per-segment breakdown and a corpus total:
mweralign --score -r ref.txt -t hyp.txt
segment 1: WER=150.00 (S=12 I=6 D=0 N=12)
segment 2: WER=100.00 (S=11 I=0 D=7 N=18)
...
TOTAL: WER=42.50 (errors=85 ref_words=200)
A tokenizer (-m) may be combined with --score to score on tokenized text.
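The figures in the sample output are consistent with the usual definition WER = 100 * (S + I + D) / N, where N is the number of reference words. The sketch below reproduces that arithmetic and counts the edit operations for one segment pair; `wer_percent` and `edit_counts` are hypothetical names, and the tie-breaking between equally cheap edit scripts is an assumption that may differ from the tool's.

```python
def wer_percent(substitutions, insertions, deletions, ref_words):
    """Word error rate as a percentage of the reference length."""
    return 100.0 * (substitutions + insertions + deletions) / ref_words


def edit_counts(ref, hyp):
    """Return (S, I, D) for one minimal word-level edit script from ref to hyp."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # table[i][j] = (cost, S, I, D) for ref[:i] versus hyp[:j]
    table = [[None] * cols for _ in range(rows)]
    table[0][0] = (0, 0, 0, 0)
    for j in range(1, cols):
        table[0][j] = (j, 0, j, 0)          # only insertions
    for i in range(1, rows):
        table[i][0] = (i, 0, 0, i)          # only deletions
        for j in range(1, cols):
            c, s, ins, d = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (c, s, ins, d)            # match
            else:
                diagonal = (c + 1, s + 1, ins, d)    # substitution
            c, s, ins, d = table[i - 1][j]
            deletion = (c + 1, s, ins, d + 1)
            c, s, ins, d = table[i][j - 1]
            insertion = (c + 1, s, ins + 1, d)
            table[i][j] = min(diagonal, deletion, insertion, key=lambda t: t[0])
    return table[-1][-1][1:]
```

Plugging in the sample values gives `wer_percent(12, 6, 0, 12) == 150.0` for segment 1 and `wer_percent(11, 0, 7, 18) == 100.0` for segment 2, which also shows that WER can exceed 100 when the hypothesis contains more errors than the reference has words.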
Inspecting the segmentation scores
The aligner chooses where to split the hypothesis stream with a dynamic program. You can dump
the competing segment-boundary costs it considered with --trace-file. Pass - to write the
trace to stdout, or a path to write it to a file. It is off by default and adds no cost when
unused.
printf 'the cat sat\non the mat\n' > /tmp/ref.txt
printf 'the cat\nsat on the mat\n' > /tmp/hyp.txt
mweralign -r /tmp/ref.txt -t /tmp/hyp.txt -o /dev/null --trace-file - 2>/dev/null
Or for a longer example:
mweralign \
-r test/data/wmt22.en-de.en \
-t test/data/wmt22.en-de.sys \
-m spm256k \
-l de \
-o /dev/null \
--trace-file -
For each segment, the trace lists the chosen end position and every candidate end position with
its cost and the previous segment's end (prev_end):
# docid range 0 (segments 0-2)
segment 1: chosen end j=3 (cost=0)
end j= 0 cost= 0 prev_end=0
end j= 3 cost= 0 prev_end=0 <- chosen
end j= 2 cost= 1 prev_end=0
...
segment 2: chosen end j=6 (cost=0)
end j= 6 cost= 0 prev_end=3 <- chosen
...
Here j is a position in the (tokenized) hypothesis stream, so segment 1 covers hyp tokens
1..3 and segment 2 covers 4..6. The alignment output itself still goes to -o (sent to
/dev/null above so only the trace is shown); 2>/dev/null suppresses the AS-WER line.
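If you want to post-process a trace, a small parser is enough. The sketch below is based solely on the sample shown above: the regular expressions, the integer costs, and the names `parse_trace` and `token_ranges` are assumptions, so check them against real output before relying on them.

```python
import re

# Patterns inferred from the sample trace; spacing and integer costs are assumptions.
SEGMENT_RE = re.compile(r"segment\s+(\d+):\s+chosen end j=\s*(\d+)\s+\(cost=\s*(\d+)\)")
CANDIDATE_RE = re.compile(r"end j=\s*(\d+)\s+cost=\s*(\d+)\s+prev_end=\s*(\d+)(\s+<- chosen)?")

SAMPLE = """\
# docid range 0 (segments 0-2)
segment 1: chosen end j=3 (cost=0)
end j= 0 cost= 0 prev_end=0
end j= 3 cost= 0 prev_end=0 <- chosen
end j= 2 cost= 1 prev_end=0
segment 2: chosen end j=6 (cost=0)
end j= 6 cost= 0 prev_end=3 <- chosen
"""


def parse_trace(text):
    """Collect the chosen end and all candidate ends for every segment."""
    segments = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        seg = SEGMENT_RE.match(line)
        if seg:
            current = int(seg.group(1))
            segments[current] = {
                "chosen_end": int(seg.group(2)),
                "cost": int(seg.group(3)),
                "candidates": [],
            }
            continue
        cand = CANDIDATE_RE.match(line)
        if cand and current is not None:
            segments[current]["candidates"].append({
                "end": int(cand.group(1)),
                "cost": int(cand.group(2)),
                "prev_end": int(cand.group(3)),
                "chosen": cand.group(4) is not None,
            })
    return segments


def token_ranges(segments):
    """Map each segment to its 1-based (first, last) hypothesis token positions."""
    ranges = {}
    for number, info in segments.items():
        chosen = next(c for c in info["candidates"] if c["chosen"])
        ranges[number] = (chosen["prev_end"] + 1, chosen["end"])
    return ranges
```

Applied to `SAMPLE`, `token_ranges` yields positions 1 to 3 for segment 1 and 4 to 6 for segment 2, matching the reading of j given above. An empty segment would appear as a range whose first position is one past its last.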
The trace above is the boundary-cost table (cheap to record). Finer-grained per-cell edit costs
are available only through the Python API, align_texts_traced(..., cells=True), since they grow
with the full alignment grid and are impractical to print for long inputs.
Project layout
src/                        # C++ core library and standalone CLI
python/
  mweralign/                # Python package (CLI + wrappers)
  bindings/                 # pybind11 bindings (mweralign._mweralign)
  tests/                    # pytest unit + regression suite
    regression/             # golden-file CLI regression cases
CMakeLists.txt              # builds the standalone C++ `mweralign` binary
setup.py / pyproject.toml   # builds the Python package/extension
Development
Install in editable mode with the development dependencies and run the tests:
pip install -e ".[dev]"
pytest python/tests
Regression suite
The regression suite under python/tests/regression/ runs the mweralign
CLI on fixed inputs and compares the output to committed golden files. Each
case is a directory containing a cmd file (the CLI arguments), the input
files it references, and an expected.txt golden output.
After an intentional change in behavior, regenerate the golden files with:
MWERALIGN_REGEN=1 pytest python/tests/test_regression.py
To add a new case, create a directory under python/tests/regression/, add a
cmd file plus its input files, and run the regen command above to produce
expected.txt.
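The golden-file pattern behind the suite can be sketched as below. This is a generic illustration, not the project's test harness: `check_golden` is a hypothetical helper, and treating the value "1" as the only way to enable regeneration is an assumption.

```python
import os
from pathlib import Path


def check_golden(case_dir, actual_output, regen=None):
    """Compare output with expected.txt, or rewrite the golden file when regenerating."""
    if regen is None:
        # Assumption: regeneration is enabled exactly when MWERALIGN_REGEN is "1".
        regen = os.environ.get("MWERALIGN_REGEN") == "1"
    golden = Path(case_dir) / "expected.txt"
    if regen:
        golden.write_text(actual_output, encoding="utf-8")
        return True
    return golden.read_text(encoding="utf-8") == actual_output
```

Because regeneration overwrites the golden file unconditionally, review the resulting diff before committing it.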
Building the standalone C++ CLI
The Python package builds its own extension, so this is only needed if you want
the standalone mweralign binary:
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
# binary at build/mweralign
Citation
If you use this package, please cite the following two papers. We suggest a sentence similar to the following: "To align the text, we used the mweralign package \citep{post-huang-2025-effects}, which implements a variant of the AS-WER algorithm \citep{matusov-etal-2005-evaluating}."
License
This project contains code under multiple licenses:
- Original C++ alignment code: GNU General Public License v3 (GPL-3.0)
- Python bindings and wrapper code: Apache License 2.0
- Build scripts and documentation: Apache License 2.0
The project as a whole is distributed under GPL-3.0 due to the inclusion of GPL-licensed components.
What this means for users:
- You can use this library in GPL-compatible projects
- If you distribute software that includes this library, your software must be GPL-compatible
- The Python wrapper code (separate from the C++ core) is available under Apache License 2.0
Attribution
This software includes original GPL-licensed C++ code for alignment algorithms. Python bindings and packaging by Matt Post (Apache License 2.0).