Minimum Word Error Rate Alignment for speech recognition evaluation

mweralign

mweralign is a Python package for aligning a stream of words to a reference segmentation. It is designed for speech translation tasks, where system output must be re-segmented to match a reference translation before standard MT metrics can be applied. The package wraps the original MWERAlign C++ library, which implements the AS-WER algorithm for automatic sentence segmentation and alignment. The wrapper also modernizes that code and adds support for subword tokenization, which improves alignment quality (see Tokenization).

Installation

To install the package, you can use pip:

pip install mweralign

Or install from source:

git clone https://github.com/mjpost/mweralign
cd mweralign
pip install .

Usage

You can see usage information by running mweralign with the --help flag:

mweralign --help

The core flags are:

Flag  Long form        Meaning
-r    --ref-file       Reference file: the target segmentation, one segment per line. Required.
-t    --hyp-file       Hypothesis file: the system output to re-segment. Required.
-o    --output         Output file (default: stdout).
-m    --tokenizer      Tokenizer/segmenter to use (default: spm32k; see below).
-l    --language       ISO 639-1 language code (e.g. de, zh).
-w    --no-whitespace  The language does not delimit words with whitespace (CJK); see Language code.
-d    --docid-file     Document ids, one per reference line (see Document-aware alignment).
      --score          Scoring mode: report WER instead of re-segmenting (see Scoring mode).
-V    --version        Print the installed version and exit.

Aligning a hypothesis to a reference segmentation

The standard use case provides a reference file (one segment per line) and a hypothesis file (the output of a speech-translation system, with no constraints on how it is split into lines). mweralign concatenates the hypothesis into a single word stream and re-splits it to match the reference segmentation. The output has the same number of lines as the reference; each line is the slice of the hypothesis aligned to the corresponding reference segment:

mweralign -r ref.txt -t hyp.txt -o aligned.txt
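
To make the re-splitting step concrete, here is a toy sketch of the idea in Python: a dynamic program that picks hypothesis boundaries so that the total word-level edit distance to the reference segments is minimal. It is an illustration only, not the package's C++ implementation. The function names (edit_distance_to_prefixes, resegment) are made up for this example, and the sketch ignores tokenization and any boundary constraints.

```python
def edit_distance_to_prefixes(ref, hyp):
    """Return d where d[n] is the word-level edit distance between ref and hyp[:n]."""
    prev = list(range(len(hyp) + 1))  # empty ref vs hyp[:n]: n insertions
    for m, r in enumerate(ref, start=1):
        cur = [m] + [0] * len(hyp)    # ref[:m] vs empty hyp: m deletions
        for n, h in enumerate(hyp, start=1):
            cur[n] = min(
                prev[n] + 1,             # reference word r has no counterpart
                cur[n - 1] + 1,          # hypothesis word h is extra
                prev[n - 1] + (r != h),  # match (cost 0) or substitution (cost 1)
            )
        prev = cur
    return prev


def resegment(ref_segments, hyp_words):
    """Split hyp_words into len(ref_segments) slices with minimal total edit distance.

    ref_segments is a list of word lists; hyp_words is one flat word list.
    Returns (list of re-segmented hypothesis strings, total cost).
    """
    K, H = len(ref_segments), len(hyp_words)
    INF = float("inf")
    # cost[k][j]: best cost of aligning the first k reference segments
    # to the first j hypothesis words; back[k][j]: where segment k started.
    cost = [[INF] * (H + 1) for _ in range(K + 1)]
    back = [[0] * (H + 1) for _ in range(K + 1)]
    cost[0][0] = 0
    for k, ref in enumerate(ref_segments, start=1):
        for i in range(H + 1):
            if cost[k - 1][i] == INF:
                continue
            dists = edit_distance_to_prefixes(ref, hyp_words[i:])
            for n, dist in enumerate(dists):
                j = i + n
                total = cost[k - 1][i] + dist
                if total < cost[k][j]:
                    cost[k][j] = total
                    back[k][j] = i
    # Walk the back-pointers from the end of the stream to recover boundaries.
    bounds, j = [], H
    for k in range(K, 0, -1):
        i = back[k][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    return [" ".join(hyp_words[i:j]) for i, j in bounds], cost[K][H]


refs = [line.split() for line in ["the cat sat", "on the mat"]]
hyp_stream = " ".join(["the cat", "sat on the mat"]).split()
segments, total_cost = resegment(refs, hyp_stream)
print(segments, total_cost)
```

The hypothesis line breaks play no role: only the concatenated word stream and the reference segments enter the computation.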

Tokenization

For good alignment you should use a tokenizer, selected with -m. The default is spm32k (see the recommendation below). Supported values are:

  • none (or whitespace) — no tokenizer; split on whitespace only;
  • cj — segments Han characters with whitespace (dependency-free, no model needed);
  • a named, on-demand SentencePiece model: spm32k, spm64k, spm128k, or spm256k (spm is an alias for spm256k);
  • a filesystem path to any SentencePiece .model file.

The named models are character-preserving (identity-normalization) models that ship with the project. They are downloaded on demand the first time you request them, fetched from the project's GitHub Release, verified against a checksum, and cached under ~/.cache/mweralign/models (override with MWERALIGN_SPM_DIR):

mweralign -r ref.txt -t hyp.txt -o aligned.txt -m spm32k   # the default

mweralign -r ref.txt -t hyp.txt -o aligned.txt -m none     # plain whitespace
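
The cache lookup and checksum verification described above can be pictured with a short Python sketch. This is not the package's code: the helper names (model_cache_dir, sha256_matches) are hypothetical, and SHA-256 is assumed as the checksum algorithm, which the text above does not specify.

```python
import hashlib
import os
from pathlib import Path


def model_cache_dir(environ=None):
    """Directory where models are cached; MWERALIGN_SPM_DIR overrides the default."""
    env = os.environ if environ is None else environ
    override = env.get("MWERALIGN_SPM_DIR")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "mweralign" / "models"


def sha256_matches(path, expected_hex):
    """Return True if the file's SHA-256 digest equals expected_hex (assumed algorithm)."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so large model files need not fit in memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()


print(model_cache_dir({"MWERALIGN_SPM_DIR": "/data/spm"}))
```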

To pre-fetch the models (e.g. for offline use):

python -m mweralign.models --all          # all sizes
python -m mweralign.models spm32k spm256k # specific ones

Recommendation: for the best segmentation quality, use a character-preserving (identity-normalization) SentencePiece model for all languages, including CJK. In our WMT24 experiments an identity SPM model restored the original segmentation far more accurately than whitespace tokenization on every language pair, and on the CJK pairs (en-ja, en-zh, ja-zh) it clearly outperformed the cj character segmenter (~94% vs. ~69% boundary accuracy): per-character tokenization gives the aligner too much freedom, whereas subword pieces constrain boundaries to sensible word edges. Vocabulary size has little effect (32k is sufficient; 128k is marginally best), so a small model is a fine default. The cj segmenter remains available as a dependency-free fallback.

Note that the flores200 SPM model (e.g. from sacrebleu) applies NMT-style normalization that rewrites characters, so it is unsuitable when you need the original text restored verbatim; use an identity-normalization model such as spm32k for that.
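
The difference between identity normalization and a character-rewriting normalization can be seen with the Python standard library alone. The sketch below uses Unicode NFKC as a stand-in for "normalization that rewrites characters"; it does not reproduce the exact rule set of the flores200 model, and the helper names are made up for illustration.

```python
import unicodedata


def identity(text):
    """Identity normalization: the text is passed through unchanged."""
    return text


def nfkc(text):
    """A character-rewriting normalization (Unicode NFKC), used here as a stand-in."""
    return unicodedata.normalize("NFKC", text)


def survives_verbatim(text, normalize):
    """True if normalization leaves every character of the text untouched."""
    return normalize(text) == text


# Full-width Latin letters, an ideographic space, and the "fi" ligature.
sample = "\uff21\uff22\uff23\u3000\ufb01ne"

print(repr(nfkc(sample)))                   # the rewritten text
print(survives_verbatim(sample, identity))  # identity keeps the original
print(survives_verbatim(sample, nfkc))      # NFKC does not
```

Once characters have been rewritten this way, the original string cannot be recovered from the pieces, which is why an identity-normalization model is needed for verbatim restoration.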

Language code

You may also supply the ISO 639-1 language code with -l. For zh and ja, this tells the underlying AS-WER algorithm to allow a sentence to start with the SentencePiece space character. For other languages it has no effect.

mweralign -r ref.zh.txt -t hyp.txt -o aligned.txt -m spm256k -l zh

Equivalently, you can pass --no-whitespace (-w) for any language whose script does not delimit words with whitespace (e.g. Chinese, Japanese), without naming a specific language:

mweralign -r ref.zh.txt -t hyp.txt -o aligned.txt -m spm256k -w

If the reference looks like a CJK script but neither -l zh/-l ja nor -w was given, mweralign prints a one-line suggestion to add the flag.

When a SentencePiece model is used to tokenize a non-CJK language, the aligner also forbids mid-word segment boundaries: no output segment may begin on a word-internal subword piece (one lacking the leading ▁ marker), so re-segmenting never splits a word across two segments. This is automatic and requires no flag. Pure-punctuation pieces are exempt, since they legitimately attach to the preceding token.
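
As a rough illustration of this constraint, the following Python sketch marks which positions in a piece sequence could start a segment. It reflects one reading of the rule above (word-initial pieces and pure-punctuation pieces are allowed starts), not the aligner's actual code; the helper names are hypothetical, and "punctuation" is approximated by the Unicode P* categories.

```python
import unicodedata

WORD_START = "\u2581"  # the character SentencePiece puts on word-initial pieces


def is_pure_punctuation(piece):
    """True if the piece, ignoring any leading marker, consists only of punctuation."""
    core = piece.lstrip(WORD_START)
    return bool(core) and all(unicodedata.category(ch).startswith("P") for ch in core)


def allowed_segment_starts(pieces):
    """Indices of pieces on which an output segment may begin."""
    return [
        i
        for i, piece in enumerate(pieces)
        if piece.startswith(WORD_START) or is_pure_punctuation(piece)
    ]


pieces = ["\u2581the", "\u2581cat", "s", "\u2581sat", ".", "\u2581on"]
print(allowed_segment_starts(pieces))
```

In this example the piece "s" is word-internal (it continues "cat"), so a boundary just before it would be rejected, while the "." piece is exempt.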

By default the re-segmented output is detokenized back to plain text. Pass --no-detok to emit the tokenized pieces instead.

Document-aware alignment

If your hypothesis is split per document (rather than one big stream), pass a docid file with -d. It lists one document id per reference line; reference lines sharing a docid form a document, and the hypothesis file must contain one line per distinct document (in order). Each document's hypothesis is then aligned independently to its own reference segments:

mweralign -r ref.txt -t hyp.txt -d docids.txt -o aligned.txt
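
The relationship between the three files can be sketched in Python as follows. The function name is hypothetical and the sketch assumes that the lines of each document are contiguous in the reference file, which is not stated explicitly above.

```python
from itertools import groupby


def group_by_document(ref_lines, docids, hyp_lines):
    """Pair each document's reference segments with its hypothesis line.

    Assumes lines belonging to the same document are contiguous.
    Returns a list of (docid, reference_segments, hypothesis) tuples.
    """
    if len(ref_lines) != len(docids):
        raise ValueError("need exactly one docid per reference line")
    groups = [
        (docid, [ref for _, ref in items])
        for docid, items in groupby(zip(docids, ref_lines), key=lambda pair: pair[0])
    ]
    if len(groups) != len(hyp_lines):
        raise ValueError("need exactly one hypothesis line per document")
    return [(docid, refs, hyp) for (docid, refs), hyp in zip(groups, hyp_lines)]


docs = group_by_document(
    ref_lines=["the cat sat", "on the mat", "good morning"],
    docids=["doc1", "doc1", "doc2"],
    hyp_lines=["the cat sat on the mat", "good morning"],
)
for docid, refs, hyp in docs:
    print(docid, len(refs), repr(hyp))
```

Each tuple is then an independent alignment problem: the document's hypothesis is re-split against that document's reference segments only.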

Scoring mode

With --score, mweralign skips alignment and instead computes word error rate on already parallel input: ref.txt and hyp.txt must have the same number of lines, compared line-by-line. It prints a per-segment breakdown and a corpus total:

mweralign --score -r ref.txt -t hyp.txt

segment 1: WER=150.00 (S=12 I=6 D=0 N=12)
segment 2: WER=100.00 (S=11 I=0 D=7 N=18)
...
TOTAL: WER=42.50 (errors=85 ref_words=200)

A tokenizer (-m) may be combined with --score to score on tokenized text.
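
The figures in the sample output are consistent with the usual definition WER = 100 * (S + I + D) / N, where N is the number of reference words, and with a corpus total computed from summed errors and summed reference words. A minimal Python sketch of that computation follows; it is not the package's implementation, the function names are made up, and the split of a given error count into S, I, and D can differ between implementations when several minimum-cost alignments exist.

```python
def count_edits(ref, hyp):
    """Return (S, I, D) for one minimum-cost word alignment of hyp against ref."""
    R, H = len(ref), len(hyp)
    # table[r][h] = (cost, S, I, D) for ref[:r] versus hyp[:h]
    table = [[None] * (H + 1) for _ in range(R + 1)]
    table[0][0] = (0, 0, 0, 0)
    for h in range(1, H + 1):
        table[0][h] = (h, 0, h, 0)          # only insertions
    for r in range(1, R + 1):
        table[r][0] = (r, 0, 0, r)          # only deletions
        for h in range(1, H + 1):
            c, s, i, d = table[r - 1][h - 1]
            if ref[r - 1] == hyp[h - 1]:
                best = (c, s, i, d)          # match
            else:
                best = (c + 1, s + 1, i, d)  # substitution
            c, s, i, d = table[r][h - 1]
            best = min(best, (c + 1, s, i + 1, d))  # insertion
            c, s, i, d = table[r - 1][h]
            best = min(best, (c + 1, s, i, d + 1))  # deletion
            table[r][h] = best
    return table[R][H][1:]


def score(ref_lines, hyp_lines):
    """Per-segment (WER, S, I, D, N) tuples plus the corpus-level WER."""
    if len(ref_lines) != len(hyp_lines):
        raise ValueError("scoring needs the same number of lines in both files")
    rows, errors, ref_words = [], 0, 0
    for ref_line, hyp_line in zip(ref_lines, hyp_lines):
        ref, hyp = ref_line.split(), hyp_line.split()
        s, i, d = count_edits(ref, hyp)
        n = len(ref)
        wer = 100.0 * (s + i + d) / n if n else 0.0
        rows.append((wer, s, i, d, n))
        errors += s + i + d
        ref_words += n
    total = 100.0 * errors / ref_words if ref_words else 0.0
    return rows, total


rows, total = score(["the cat sat", "on the mat"], ["the cat", "sat on the mat"])
for k, (wer, s, i, d, n) in enumerate(rows, start=1):
    print(f"segment {k}: WER={wer:.2f} (S={s} I={i} D={d} N={n})")
print(f"TOTAL: WER={total:.2f}")
```

Because insertions count as errors but do not add to N, a segment's WER can exceed 100, as in the first line of the sample output.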

Inspecting the segmentation scores

The aligner chooses where to split the hypothesis stream with a dynamic program. You can dump the competing segment-boundary costs it considered with --trace-file. Pass - to write the trace to stdout, or a path to write it to a file. It is off by default and adds no cost when unused.

printf 'the cat sat\non the mat\n' > /tmp/ref.txt
printf 'the cat\nsat on the mat\n' > /tmp/hyp.txt
mweralign -r /tmp/ref.txt -t /tmp/hyp.txt -o /dev/null --trace-file - 2>/dev/null

Or for a longer example:

mweralign \
  -r test/data/wmt22.en-de.en \
  -t test/data/wmt22.en-de.sys \
  -m spm256k \
  -l de \
  -o /dev/null \
  --trace-file -

For each segment, the trace lists the chosen end position and every candidate end position with its cost and the previous segment's end (prev_end):

# docid range 0 (segments 0-2)
segment 1: chosen end j=3 (cost=0)
    end j=   0  cost=     0  prev_end=0
    end j=   3  cost=     0  prev_end=0  <- chosen
    end j=   2  cost=     1  prev_end=0
    ...
segment 2: chosen end j=6 (cost=0)
    end j=   6  cost=     0  prev_end=3  <- chosen
    ...

Here j is a position in the (tokenized) hypothesis stream, so segment 1 covers hyp tokens 1..3 and segment 2 covers 4..6. The alignment output itself still goes to -o (sent to /dev/null above so that only the trace is shown); in the first example, 2>/dev/null discards stderr, which suppresses the AS-WER line.
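
If you want to post-process a trace, the text format shown above is simple to parse. The following Python sketch reads it into dictionaries and derives the token span of each segment from the chosen candidate. It is based only on the sample output shown here, so the regular expressions are assumptions about the format, and the function names are made up for illustration.

```python
import re

SEGMENT_RE = re.compile(r"^segment\s+(\d+):\s+chosen end j=\s*(\d+)\s+\(cost=\s*([\d.]+)\)")
CANDIDATE_RE = re.compile(
    r"^\s*end j=\s*(\d+)\s+cost=\s*([\d.]+)\s+prev_end=\s*(\d+)(\s+<- chosen)?"
)


def parse_trace(text):
    """Parse trace text into a list of segments, each with its candidate ends."""
    segments = []
    for line in text.splitlines():
        m = SEGMENT_RE.match(line)
        if m:
            segments.append(
                {"segment": int(m[1]), "end": int(m[2]), "cost": float(m[3]), "candidates": []}
            )
            continue
        m = CANDIDATE_RE.match(line)
        if m and segments:
            segments[-1]["candidates"].append(
                {"end": int(m[1]), "cost": float(m[2]), "prev_end": int(m[3]),
                 "chosen": m[4] is not None}
            )
        # Comment lines ("# docid range ...") and "..." lines match neither pattern.
    return segments


def token_spans(segments):
    """1-based inclusive (first, last) hypothesis token span of each segment."""
    spans = []
    for seg in segments:
        chosen = next(c for c in seg["candidates"] if c["chosen"])
        spans.append((chosen["prev_end"] + 1, chosen["end"]))
    return spans


SAMPLE = """\
# docid range 0 (segments 0-2)
segment 1: chosen end j=3 (cost=0)
    end j=   0  cost=     0  prev_end=0
    end j=   3  cost=     0  prev_end=0  <- chosen
    end j=   2  cost=     1  prev_end=0
segment 2: chosen end j=6 (cost=0)
    end j=   6  cost=     0  prev_end=3  <- chosen
"""

print(token_spans(parse_trace(SAMPLE)))
```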

The trace above is the boundary-cost table (cheap to record). Finer-grained per-cell edit costs are available only through the Python API, align_texts_traced(..., cells=True), since they grow with the full alignment grid and are impractical to print for long inputs.

Project layout

src/                 # C++ core library and standalone CLI
python/
  mweralign/         # Python package (CLI + wrappers)
  bindings/          # pybind11 bindings (mweralign._mweralign)
  tests/             # pytest unit + regression suite
    regression/      # golden-file CLI regression cases
CMakeLists.txt       # builds the standalone C++ `mweralign` binary
setup.py / pyproject.toml  # builds the Python package/extension

Development

Install in editable mode with the development dependencies and run the tests:

pip install -e ".[dev]"
pytest python/tests

Regression suite

The regression suite under python/tests/regression/ runs the mweralign CLI on fixed inputs and compares the output to committed golden files. Each case is a directory containing a cmd file (the CLI arguments), the input files it references, and an expected.txt golden output.

After an intentional change in behavior, regenerate the golden files with:

MWERALIGN_REGEN=1 pytest python/tests/test_regression.py

To add a new case, create a directory under python/tests/regression/, add a cmd file plus its input files, and run the regen command above to produce expected.txt.
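
For readers unfamiliar with golden-file testing, the compare-or-regenerate pattern can be sketched in a few lines of Python. This is a generic illustration, not the code in python/tests/test_regression.py: the function name check_golden is hypothetical, and only the expected.txt file name and the MWERALIGN_REGEN variable are taken from the description above.

```python
import os
from pathlib import Path


def check_golden(case_dir, actual_output, environ=None):
    """Compare actual_output with expected.txt, or rewrite it when regenerating.

    Returns True if the output matches the golden file (or was just regenerated).
    """
    env = os.environ if environ is None else environ
    golden = Path(case_dir) / "expected.txt"
    if env.get("MWERALIGN_REGEN") == "1":
        # Regeneration mode: accept the current output as the new golden file.
        golden.write_text(actual_output, encoding="utf-8")
        return True
    return golden.read_text(encoding="utf-8") == actual_output
```

Regenerated golden files should be reviewed before they are committed, since regeneration accepts whatever the CLI currently prints.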

Building the standalone C++ CLI

The Python package builds its own extension, so this is only needed if you want the standalone mweralign binary:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
# binary at build/mweralign

Citation

If you use this package, please cite the following two papers. We suggest a sentence similar to the following: "To align the text, we used the mweralign package \citep{post-huang-2025-effects}, which implements a variant of the AS-WER algorithm \citep{matusov-etal-2005-evaluating}."

License

This project contains code under multiple licenses:

  • Original C++ alignment code: GNU General Public License v3 (GPL-3.0)
  • Python bindings and wrapper code: Apache License 2.0
  • Build scripts and documentation: Apache License 2.0

The project as a whole is distributed under GPL-3.0 due to the inclusion of GPL-licensed components.

What this means for users:

  • You can use this library in GPL-compatible projects
  • If you distribute software that includes this library, your software must be GPL-compatible
  • The Python wrapper code (separate from the C++ core) is available under Apache License 2.0

Attribution

This software includes original GPL-licensed C++ code for alignment algorithms. Python bindings and packaging by Matt Post (Apache License 2.0).
