Minimum Word Error Rate Alignment for speech recognition evaluation

mweralign

mweralign is a Python package for aligning a stream of words to a reference segmentation. It is designed for speech translation, where system output must be re-segmented to match a reference translation before standard MT metrics can be applied. The package wraps the original MWERAlign C++ library, which implements the AS-WER algorithm for automatic sentence segmentation and alignment. The wrapper also modernizes that code and adds support for subword tokenization, which improves alignment quality.

Installation

To install the package, you can use pip:

pip install mweralign

Or install from source:

git clone https://github.com/mjpost/mweralign
cd mweralign
pip install .

Usage

You can see usage information by running mweralign with the --help flag:

mweralign --help

The core flags are:

| Flag | Long form | Meaning |
|------|-----------|---------|
| -r | --ref-file | Reference file: the target segmentation, one segment per line. Required. |
| -t | --hyp-file | Hypothesis file: the system output to re-segment. Required. |
| -o | --output | Output file (default: stdout). |
| -m | --tokenizer | Tokenizer/segmenter to use (default: spm32k; see Tokenization). |
| -l | --language | ISO 639-1 language code (e.g. de, zh). |
| -w | --no-whitespace | The language does not delimit words with whitespace (CJK); see Language code. |
| -d | --docid-file | Document ids, one per reference line (see Document-aware alignment). |
| (none) | --score | Scoring mode: report WER instead of re-segmenting (see Scoring mode). |

Aligning a hypothesis to a reference segmentation

The standard use case provides a reference file (segments listed one per line) and a hypothesis file (the output of a speech-translation system, with no line requirements). mweralign concatenates the hypothesis into a single word stream and re-splits it to match the reference segmentation. The output has the same number of lines as the reference, where each line is the slice of the hypothesis aligned to the corresponding reference segment:

mweralign -r ref.txt -t hyp.txt -o aligned.txt
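
To make the re-splitting step concrete, here is a minimal sketch of the underlying idea: pick segment boundaries in the hypothesis word stream so that the total word-level edit distance to the reference segments is as small as possible. This is a brute-force toy for illustration only, not the package's implementation (the real aligner is the C++ AS-WER code); the function names `edit_distance` and `resegment` are hypothetical.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # drop a word of `a`
                           cur[j - 1] + 1,           # extra word in `b`
                           prev[j - 1] + (x != y)))  # match or substitute
        prev = cur
    return prev[-1]


def resegment(ref_segments, hyp_words):
    """Split hyp_words into len(ref_segments) contiguous slices that
    minimize the summed edit distance to the reference segments."""
    refs = [seg.split() for seg in ref_segments]
    n, inf = len(hyp_words), float("inf")
    # best[k][j]: cheapest way to cover the first j hyp words with k segments
    best = [[inf] * (n + 1) for _ in range(len(refs) + 1)]
    back = [[0] * (n + 1) for _ in range(len(refs) + 1)]
    best[0][0] = 0
    for k, ref in enumerate(refs, 1):
        for j in range(n + 1):
            for i in range(j + 1):
                if best[k - 1][i] == inf:
                    continue
                cost = best[k - 1][i] + edit_distance(ref, hyp_words[i:j])
                if cost < best[k][j]:
                    best[k][j], back[k][j] = cost, i
    # Walk the back-pointers from the end to recover the boundaries.
    cuts, j = [n], n
    for k in range(len(refs), 0, -1):
        j = back[k][j]
        cuts.append(j)
    cuts.reverse()
    segments = [" ".join(hyp_words[a:b]) for a, b in zip(cuts, cuts[1:])]
    return segments, best[len(refs)][n]


print(resegment(["the cat sat", "on the mat"],
                "the cat sat on the mat".split()))
# (['the cat sat', 'on the mat'], 0)
```

The output always has exactly one slice per reference segment, which mirrors the guarantee that the aligned file has the same number of lines as the reference.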

Tokenization

For good alignment you should use a tokenizer, selected with -m. The default is spm32k (see the recommendation below). Supported values are:

  • none (or whitespace) — no tokenizer; split on whitespace only;
  • cj — segments Han characters with whitespace (dependency-free, no model needed);
  • a named, on-demand SentencePiece model: spm32k, spm64k, spm128k, or spm256k (spm is an alias for spm256k);
  • a filesystem path to any SentencePiece .model file.
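
As a rough illustration of how the first two options differ, the sketch below contrasts plain whitespace splitting with per-character splitting of Han text. It is a toy under stated assumptions, not the package's cj segmenter: "Han characters" is approximated here by the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), and both function names are hypothetical.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block counts as "Han".
# The real cj segmenter may cover additional ranges.
_HAN = re.compile(r"([\u4e00-\u9fff])")


def whitespace_tokens(text):
    """Tokenize by splitting on whitespace only."""
    return text.split()


def han_char_tokens(text):
    """Surround each Han character with spaces, then split on whitespace."""
    return _HAN.sub(r" \1 ", text).split()


sample = "我喜欢 machine translation"
print(whitespace_tokens(sample))  # ['我喜欢', 'machine', 'translation']
print(han_char_tokens(sample))    # ['我', '喜', '欢', 'machine', 'translation']
```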

The named models are character-preserving (identity-normalization) models that ship with the project. They are downloaded on demand the first time you request them, fetched from the project's GitHub Release, verified against a checksum, and cached under ~/.cache/mweralign/models (override with MWERALIGN_SPM_DIR):

mweralign -r ref.txt -t hyp.txt -o aligned.txt -m spm32k   # the default

mweralign -r ref.txt -t hyp.txt -o aligned.txt -m none     # plain whitespace
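
The cache lookup and checksum step described above can be sketched as follows. This is only an illustration of the behavior, not the package's code: the helper names are hypothetical, and SHA-256 is an assumption, since the hash algorithm is not stated here.

```python
import hashlib
import os
from pathlib import Path


def model_cache_dir():
    """Return MWERALIGN_SPM_DIR if set, else ~/.cache/mweralign/models."""
    override = os.environ.get("MWERALIGN_SPM_DIR")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "mweralign" / "models"


def sha256_matches(path, expected_hex):
    """Check a file against an expected digest (SHA-256 is an assumption)."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so large model files are not loaded at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()


print(model_cache_dir())
```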

To pre-fetch the models (e.g. for offline use):

python -m mweralign.models --all          # all sizes
python -m mweralign.models spm32k spm256k # specific ones

Recommendation: for the best segmentation quality, use a character-preserving (identity-normalization) SentencePiece model for all languages, including CJK. In our WMT24 experiments an identity SPM model restored the original segmentation far more accurately than whitespace tokenization on every language pair, and on the CJK pairs (en-ja, en-zh, ja-zh) it clearly outperformed the cj character segmenter (~94% vs. ~69% boundary accuracy): per-character tokenization gives the aligner too much freedom, whereas subword pieces constrain boundaries to sensible word edges. Vocabulary size has little effect (32k is sufficient; 128k is marginally best), so a small model is a fine default. The cj segmenter remains available as a dependency-free fallback.

Note that the flores200 SPM model (e.g. from sacrebleu) applies NMT-style normalization that rewrites characters, so it is unsuitable when you need the original text restored verbatim; use an identity-normalization model such as spm32k for that.
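
To see why a normalizing tokenizer cannot restore text verbatim, the sketch below applies Unicode NFKC normalization from the Python standard library. NFKC is used here only as a stand-in for that kind of character rewriting; it is not the exact rule set that any particular SentencePiece model applies, and `is_lossy` is a hypothetical helper name.

```python
import unicodedata


def is_lossy(text):
    """True if NFKC normalization changes the text, so the original
    characters cannot be recovered from the normalized form."""
    return unicodedata.normalize("NFKC", text) != text


for original in ["ｗｍｔ２４", "ﬁne", "①", "plain ascii"]:
    normalized = unicodedata.normalize("NFKC", original)
    print(repr(original), "->", repr(normalized), "lossy:", is_lossy(original))
```

Fullwidth letters, ligatures, and enclosed digits are all rewritten, whereas an identity-normalization model leaves every character untouched.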

Language code

You may also supply the ISO 639-1 language code with -l. For zh and ja, this tells the underlying AS-WER algorithm not to prevent sentences from starting with the SentencePiece space character. For other languages it has no effect.

mweralign -r ref.zh.txt -t hyp.txt -o aligned.txt -m spm256k -l zh

Equivalently, you can pass --no-whitespace (-w) for any language whose script does not delimit words with whitespace (e.g. Chinese, Japanese), without naming a specific language:

mweralign -r ref.zh.txt -t hyp.txt -o aligned.txt -m spm256k -w

If the reference looks like a CJK script but neither -l zh/-l ja nor -w was given, mweralign prints a one-line suggestion to add the flag.

When a SentencePiece model is used to tokenize a non-CJK language, the aligner also forbids mid-word segment boundaries: no output segment may begin on a word-internal sub-word piece (one lacking the leading marker), so re-segmenting never splits a word across two segments. This is automatic and requires no flag. Pure-punctuation pieces are exempt, since they legitimately attach to the preceding token.
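
The boundary rule above can be sketched as a filter over piece positions. This is a minimal illustration of the stated rule, not the aligner's actual code; the function names are hypothetical, and treating a piece as "pure punctuation" when every character has a Unicode category starting with P is an assumption.

```python
import unicodedata

MARKER = "\u2581"  # the SentencePiece word-start marker


def is_punctuation(piece):
    """Assumption: a piece is pure punctuation if every character's
    Unicode category starts with 'P'."""
    return bool(piece) and all(
        unicodedata.category(ch).startswith("P") for ch in piece
    )


def allowed_segment_starts(pieces):
    """Indices at which an output segment may begin: word-initial pieces
    (carrying the marker) and pure-punctuation pieces."""
    return [
        i for i, piece in enumerate(pieces)
        if piece.startswith(MARKER) or is_punctuation(piece)
    ]


pieces = ["\u2581Hel", "lo", ",", "\u2581wor", "ld", "."]
print(allowed_segment_starts(pieces))  # [0, 2, 3, 5]
```

Positions 1 and 4 are word-internal pieces, so a boundary there would split "Hello" or "world" across two segments.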

By default the re-segmented output is detokenized back to plain text. Pass --no-detok to emit the tokenized pieces instead.

Document-aware alignment

If your hypothesis is split per document (rather than one big stream), pass a docid file with -d. It lists one document id per reference line; reference lines sharing a docid form a document, and the hypothesis file must contain one line per distinct document (in order). Each document's hypothesis is then aligned independently to its own reference segments:

mweralign -r ref.txt -t hyp.txt -d docids.txt -o aligned.txt
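
The relationship between the three files can be sketched as follows. This is an illustration of the file layout, not the package's code; `split_by_document` is a hypothetical name, and the sketch assumes that reference lines belonging to one document are contiguous.

```python
from itertools import groupby


def split_by_document(ref_lines, docids, hyp_lines):
    """Pair each document's reference segments with its hypothesis line.

    Assumes lines of the same document are adjacent in the reference file.
    """
    if len(ref_lines) != len(docids):
        raise ValueError("need exactly one docid per reference line")
    groups = [
        (docid, [ref for _, ref in rows])
        for docid, rows in groupby(zip(docids, ref_lines), key=lambda r: r[0])
    ]
    if len(groups) != len(hyp_lines):
        raise ValueError("need exactly one hypothesis line per document")
    return [(docid, refs, hyp) for (docid, refs), hyp in zip(groups, hyp_lines)]


docs = split_by_document(
    ref_lines=["a1", "a2", "b1"],
    docids=["docA", "docA", "docB"],
    hyp_lines=["hypothesis for A", "hypothesis for B"],
)
for docid, refs, hyp in docs:
    print(docid, len(refs), "reference segments <-", repr(hyp))
```

Each tuple is then an independent alignment problem: one hypothesis stream re-split into as many segments as that document has reference lines.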

Scoring mode

With --score, mweralign skips alignment and instead computes word error rate on already parallel input: ref.txt and hyp.txt must have the same number of lines, compared line-by-line. It prints a per-segment breakdown and a corpus total:

mweralign --score -r ref.txt -t hyp.txt

segment 1: WER=150.00 (S=12 I=6 D=0 N=12)
segment 2: WER=100.00 (S=11 I=0 D=7 N=18)
...
TOTAL: WER=42.50 (errors=85 ref_words=200)

A tokenizer (-m) may be combined with --score to score on tokenized text.
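
The sample output above is consistent with the usual definition WER = 100 * (S + I + D) / N, where N is the number of reference words; for segment 1, 100 * (12 + 6 + 0) / 12 = 150. The sketch below computes those counts for one line pair. It is a minimal illustration with a hypothetical function name, not the package's scorer, and when several minimal alignments exist the split between S, I, and D may differ from what the tool reports.

```python
def wer_counts(ref, hyp):
    """Return (errors, S, I, D, N) for one whitespace-tokenized line pair."""
    r, h = ref.split(), hyp.split()
    # Each cell holds (errors, substitutions, insertions, deletions).
    prev = [(j, 0, j, 0) for j in range(len(h) + 1)]
    for i in range(1, len(r) + 1):
        cur = [(i, 0, 0, i)]
        for j in range(1, len(h) + 1):
            e, s, ins, d = prev[j - 1]
            miss = r[i - 1] != h[j - 1]
            diag = (e + miss, s + miss, ins, d)  # match or substitution
            e, s, ins, d = prev[j]
            dele = (e + 1, s, ins, d + 1)        # reference word dropped
            e, s, ins, d = cur[j - 1]
            inse = (e + 1, s, ins + 1, d)        # extra hypothesis word
            cur.append(min(diag, dele, inse, key=lambda c: c[0]))
        prev = cur
    errors, s, ins, d = prev[-1]
    return errors, s, ins, d, len(r)


errors, s, ins, d, n = wer_counts("the cat sat on the mat", "the cat sit on mat")
print(f"WER={100 * errors / n:.2f} (S={s} I={ins} D={d} N={n})")
# WER=33.33 (S=1 I=0 D=1 N=6)
```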

Inspecting the segmentation scores

The aligner chooses where to split the hypothesis stream with a dynamic program. You can dump the competing segment-boundary costs it considered with --trace-file. Pass - to write the trace to stdout, or a path to write it to a file. It is off by default and adds no cost when unused.

printf 'the cat sat\non the mat\n' > /tmp/ref.txt
printf 'the cat\nsat on the mat\n' > /tmp/hyp.txt
mweralign -r /tmp/ref.txt -t /tmp/hyp.txt -o /dev/null --trace-file - 2>/dev/null

Or for a longer example:

mweralign \
  -r test/data/wmt22.en-de.en \
  -t test/data/wmt22.en-de.sys \
  -m spm256k \
  -l de \
  -o /dev/null \
  --trace-file -

For each segment, the trace lists the chosen end position and every candidate end position with its cost and the previous segment's end (prev_end):

# docid range 0 (segments 0-2)
segment 1: chosen end j=3 (cost=0)
    end j=   0  cost=     0  prev_end=0
    end j=   3  cost=     0  prev_end=0  <- chosen
    end j=   2  cost=     1  prev_end=0
    ...
segment 2: chosen end j=6 (cost=0)
    end j=   6  cost=     0  prev_end=3  <- chosen
    ...

Here j is a position in the (tokenized) hypothesis stream, so segment 1 covers hyp tokens 1..3 and segment 2 covers 4..6. The alignment output itself still goes to -o (sent to /dev/null above so only the trace is shown); 2>/dev/null suppresses the AS-WER line.
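
If you want to post-process a trace, a small parser along these lines may help. It is a sketch that assumes the line format matches the sample shown above (the format may differ between versions), and `parse_trace` is a hypothetical helper, not part of the package.

```python
import re

SEGMENT = re.compile(r"^segment\s+(\d+): chosen end j=(\d+) \(cost=([^)]+)\)")
CANDIDATE = re.compile(
    r"^\s+end j=\s*(\d+)\s+cost=\s*(\S+)\s+prev_end=(\d+)(\s+<- chosen)?"
)


def parse_trace(text):
    """Parse trace text into one dict per segment with its candidates."""
    segments = []
    for line in text.splitlines():
        m = SEGMENT.match(line)
        if m:
            segments.append({"segment": int(m[1]), "end": int(m[2]),
                             "cost": float(m[3]), "candidates": []})
            continue
        m = CANDIDATE.match(line)
        if m and segments:
            segments[-1]["candidates"].append(
                {"end": int(m[1]), "cost": float(m[2]),
                 "prev_end": int(m[3]), "chosen": m[4] is not None})
    return segments


sample = """# docid range 0 (segments 0-2)
segment 1: chosen end j=3 (cost=0)
    end j=   0  cost=     0  prev_end=0
    end j=   3  cost=     0  prev_end=0  <- chosen
    end j=   2  cost=     1  prev_end=0
segment 2: chosen end j=6 (cost=0)
    end j=   6  cost=     0  prev_end=3  <- chosen
"""
for seg in parse_trace(sample):
    chosen = next(c for c in seg["candidates"] if c["chosen"])
    print(f"segment {seg['segment']}: hyp tokens "
          f"{chosen['prev_end'] + 1}..{chosen['end']}")
# segment 1: hyp tokens 1..3
# segment 2: hyp tokens 4..6
```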

The trace above is the boundary-cost table (cheap to record). Finer-grained per-cell edit costs are available only through the Python API, align_texts_traced(..., cells=True), since they grow with the full alignment grid and are impractical to print for long inputs.

Project layout

src/                 # C++ core library and standalone CLI
python/
  mweralign/         # Python package (CLI + wrappers)
  bindings/          # pybind11 bindings (mweralign._mweralign)
  tests/             # pytest unit + regression suite
    regression/      # golden-file CLI regression cases
CMakeLists.txt       # builds the standalone C++ `mweralign` binary
setup.py / pyproject.toml  # builds the Python package/extension

Development

Install in editable mode with the development dependencies and run the tests:

pip install -e ".[dev]"
pytest python/tests

Regression suite

The regression suite under python/tests/regression/ runs the mweralign CLI on fixed inputs and compares the output to committed golden files. Each case is a directory containing a cmd file (the CLI arguments), the input files it references, and an expected.txt golden output.

After an intentional change in behavior, regenerate the golden files with:

MWERALIGN_REGEN=1 pytest python/tests/test_regression.py

To add a new case, create a directory under python/tests/regression/, add a cmd file plus its input files, and run the regen command above to produce expected.txt.
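
The golden-file workflow can be pictured with the sketch below. It is not the project's actual test code: the helper names are hypothetical, and it assumes that the cmd file holds shell-style arguments, that the CLI runs inside the case directory, and that the output under test is whatever the command writes to stdout.

```python
import os
import shlex
import subprocess
from pathlib import Path


def run_cli(args, cwd):
    """Run the mweralign CLI in the case directory and return its stdout."""
    result = subprocess.run(["mweralign", *args], cwd=cwd,
                            capture_output=True, text=True, check=True)
    return result.stdout


def check_case(case_dir, run=run_cli):
    """Compare CLI output to expected.txt, or rewrite the golden file
    when MWERALIGN_REGEN=1 is set in the environment."""
    case_dir = Path(case_dir)
    args = shlex.split((case_dir / "cmd").read_text())
    actual = run(args, case_dir)
    golden = case_dir / "expected.txt"
    if os.environ.get("MWERALIGN_REGEN") == "1":
        golden.write_text(actual)
        return True
    return actual == golden.read_text()
```

Passing the runner as a parameter keeps the comparison logic testable without invoking the real binary.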

Building the standalone C++ CLI

The Python package builds its own extension, so this is only needed if you want the standalone mweralign binary:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
# binary at build/mweralign

Citation

If you use this package, please cite the following two papers. We suggest a sentence similar to the following: "To align the text, we used the mweralign package \citep{post-huang-2025-effects}, which implements a variant of the AS-WER algorithm \citep{matusov-etal-2005-evaluating}."

License

This project contains code under multiple licenses:

  • Original C++ alignment code: GNU General Public License v3 (GPL-3.0)
  • Python bindings and wrapper code: Apache License 2.0
  • Build scripts and documentation: Apache License 2.0

The project as a whole is distributed under GPL-3.0 due to the inclusion of GPL-licensed components.

What this means for users:

  • You can use this library in GPL-compatible projects
  • If you distribute software that includes this library, your software must be GPL-compatible
  • The Python wrapper code (separate from the C++ core) is available under Apache License 2.0

Attribution

This software includes original GPL-licensed C++ code for alignment algorithms. Python bindings and packaging by Matt Post (Apache License 2.0).
