Minimum Word Error Rate Alignment for speech recognition evaluation

mweralign

mweralign is a Python package for aligning a stream of words to a reference segmentation. It is designed for speech translation tasks, where system outputs must be aligned to a reference translation before standard MT metrics can be applied. The package is a Python wrapper around the original MWERAlign C++ library, which implements the AS-WER algorithm for automatic sentence segmentation and alignment. The wrapper also modernizes that code and adds support for subword tokenization, which improves alignment quality.

Installation

To install the package, you can use pip:

pip install mweralign

Or install from source:

git clone https://github.com/mjpost/mweralign
cd mweralign
pip install .

Usage

You can see usage information by running mweralign with the --help flag:

mweralign --help

The core flags are:

Flag	Long form	Meaning
`-r`	`--ref-file`	Reference file: the target segmentation, one segment per line. Required.
`-t`	`--hyp-file`	Hypothesis file: the system output to re-segment. Required.
`-o`	`--output`	Output file (default: stdout).
`-m`	`--tokenizer`	Tokenizer/segmenter to use (default: `spm32k`; see below).
`-l`	`--language`	ISO 639-1 language code (e.g. `de`, `zh`).
`-w`	`--no-whitespace`	The language does not delimit words with whitespace (CJK); see Language code.
`-d`	`--docid-file`	Document ids, one per reference line (see Document-aware alignment).
	`--score`	Scoring mode: report WER instead of re-segmenting (see Scoring mode).

Aligning a hypothesis to a reference segmentation

The standard use case provides a reference file (one segment per line) and a hypothesis file (the output of a speech-translation system, with no constraints on where its lines break). mweralign concatenates the hypothesis into a single word stream and re-splits it to match the reference segmentation. The output has the same number of lines as the reference; each line is the slice of the hypothesis aligned to the corresponding reference segment:

mweralign -r ref.txt -t hyp.txt -o aligned.txt
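
To make the re-splitting concrete, here is a minimal sketch of the underlying idea: choose segment boundaries in the hypothesis word stream so that the total word-level edit distance to the reference segments is as small as possible. This is a naive illustration on whitespace tokens, not the package's implementation; the names edit_distance and resegment are hypothetical, and the real aligner is much more efficient and applies additional constraints.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # match or substitute
        prev = cur
    return prev[-1]


def resegment(ref_lines, hyp_lines):
    """Split the hypothesis stream into len(ref_lines) segments (hypothetical helper)."""
    hyp = " ".join(hyp_lines).split()          # one flat word stream
    refs = [line.split() for line in ref_lines]
    m, k_max = len(hyp), len(refs)
    inf = float("inf")
    # best[k][j]: lowest cost of aligning the first k references to hyp[:j]
    best = [[inf] * (m + 1) for _ in range(k_max + 1)]
    back = [[0] * (m + 1) for _ in range(k_max + 1)]
    best[0][0] = 0
    for k in range(1, k_max + 1):
        for j in range(m + 1):
            for i in range(j + 1):             # segment k covers hyp[i:j]
                if best[k - 1][i] == inf:
                    continue
                cost = best[k - 1][i] + edit_distance(refs[k - 1], hyp[i:j])
                if cost < best[k][j]:
                    best[k][j], back[k][j] = cost, i
    # Follow the back-pointers from the end of the stream.
    segments, j = [], m
    for k in range(k_max, 0, -1):
        i = back[k][j]
        segments.append(" ".join(hyp[i:j]))
        j = i
    return segments[::-1], best[k_max][m]


segments, cost = resegment(["the cat sat", "on the mat"],
                           ["the cat", "sat on the mat"])
print(segments, cost)  # ['the cat sat', 'on the mat'] 0
```

The result always has one segment per reference line, which is the property that lets line-based MT metrics run on the re-segmented output.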

Tokenization

For good alignment you should use a tokenizer, selected with -m. The default is spm32k (see the recommendation below). Supported values are:

none (or whitespace) — no tokenizer; split on whitespace only;
cj — segments Han characters with whitespace (dependency-free, no model needed);
a named, on-demand SentencePiece model: spm32k, spm64k, spm128k, or spm256k (spm is an alias for spm256k);
a filesystem path to any SentencePiece .model file.

The named models are character-preserving (identity-normalization) models published with the project. Each one is downloaded from the project's GitHub Release the first time you request it, verified against a checksum, and cached under ~/.cache/mweralign/models (override with MWERALIGN_SPM_DIR):

mweralign -r ref.txt -t hyp.txt -o aligned.txt -m spm32k   # the default

mweralign -r ref.txt -t hyp.txt -o aligned.txt -m none     # plain whitespace
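
The cache lookup and checksum step described above can be sketched as follows. This is only an illustration of the pattern: the helper names model_cache_dir and sha256_matches are hypothetical, and the use of SHA-256 is an assumption, since the text only says that a checksum is verified.

```python
import hashlib
import os
from pathlib import Path


def model_cache_dir(env=None):
    """Directory for cached models: MWERALIGN_SPM_DIR if set, else the default."""
    env = os.environ if env is None else env
    override = env.get("MWERALIGN_SPM_DIR")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "mweralign" / "models"


def sha256_matches(path, expected_hex):
    """Stream a file through SHA-256 and compare against the expected digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so large model files are not loaded at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()
```

Verifying before caching means a truncated or corrupted download is never picked up by a later run.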

To pre-fetch the models (e.g. for offline use):

python -m mweralign.models --all          # all sizes
python -m mweralign.models spm32k spm256k # specific ones

Recommendation: for the best segmentation quality, use a character-preserving (identity-normalization) SentencePiece model for all languages, including CJK. In our WMT24 experiments an identity SPM model restored the original segmentation far more accurately than whitespace tokenization on every language pair, and on the CJK pairs (en-ja, en-zh, ja-zh) it clearly outperformed the cj character segmenter (~94% vs. ~69% boundary accuracy): per-character tokenization gives the aligner too much freedom, whereas subword pieces constrain boundaries to sensible word edges. Vocabulary size has little effect (32k is sufficient; 128k is marginally best), so a small model is a fine default. The cj segmenter remains available as a dependency-free fallback.

Note that the flores200 SPM model (e.g. from sacrebleu) applies NMT-style normalization that rewrites characters, so it is unsuitable when you need the original text restored verbatim; use an identity-normalization model such as spm32k for that.

Language code

You may also supply the ISO 639-1 language code with -l. For zh and ja, this tells the underlying AS-WER algorithm not to prevent sentences from starting with the SentencePiece space character. For other languages it has no effect.

mweralign -r ref.zh.txt -t hyp.txt -o aligned.txt -m spm256k -l zh

Equivalently, you can pass --no-whitespace (-w) for any language whose script does not delimit words with whitespace (e.g. Chinese, Japanese), without naming a specific language:

mweralign -r ref.zh.txt -t hyp.txt -o aligned.txt -m spm256k -w

If the reference looks like a CJK script but neither -l zh/-l ja nor -w was given, mweralign prints a one-line suggestion to add the flag.
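
How mweralign decides that a reference "looks like" CJK is not documented here. One plausible heuristic, shown purely as an assumption, is to measure the share of Han and kana characters; the function names and the 0.5 threshold below are hypothetical.

```python
# Unicode blocks treated as CJK for this sketch.
CJK_RANGES = [
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
]


def cjk_ratio(text):
    """Fraction of non-space characters that fall in the ranges above."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(any(lo <= ord(c) <= hi for lo, hi in CJK_RANGES) for c in chars)
    return hits / len(chars)


def should_suggest_no_whitespace(ref_lines, language=None, no_whitespace=False,
                                 threshold=0.5):
    """True when the reference looks CJK but neither -l zh/ja nor -w was given."""
    if no_whitespace or language in ("zh", "ja"):
        return False
    return cjk_ratio("".join(ref_lines)) >= threshold
```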

When a SentencePiece model is used to tokenize a non-CJK language, the aligner also forbids mid-word segment boundaries: no output segment may begin on a word-internal sub-word piece (one lacking the leading ▁ marker), so re-segmenting never splits a word across two segments. This is automatic and requires no flag. Pure-punctuation pieces are exempt, since they legitimately attach to the preceding token.
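
The boundary rule can be illustrated with a small predicate over SentencePiece pieces. This sketch reads the punctuation exemption as "a boundary is still allowed before a punctuation-only piece"; the function names are hypothetical and the check inside the aligner may differ.

```python
import unicodedata

MARK = "\u2581"  # the SentencePiece word-start marker


def is_punctuation(piece):
    """True if the piece consists only of Unicode punctuation characters."""
    text = piece.replace(MARK, "")
    return bool(text) and all(unicodedata.category(c).startswith("P") for c in text)


def may_start_segment(piece):
    """True if a segment boundary may be placed immediately before this piece."""
    return piece.startswith(MARK) or is_punctuation(piece)


pieces = [MARK + "re", "seg", "ment", "ing", ",", MARK + "then"]
allowed = [i for i, p in enumerate(pieces) if may_start_segment(p)]
print(allowed)  # [0, 4, 5]
```

Positions 1 to 3 are word-internal continuations of "resegmenting", so no segment may begin there.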

By default the re-segmented output is detokenized back to plain text. Pass --no-detok to emit the tokenized pieces instead.

Document-aware alignment

If your hypothesis is split per document (rather than one big stream), pass a docid file with -d. It lists one document id per reference line; reference lines sharing a docid form a document, and the hypothesis file must contain one line per distinct document (in order). Each document's hypothesis is then aligned independently to its own reference segments:

mweralign -r ref.txt -t hyp.txt -d docids.txt -o aligned.txt
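
The pairing of reference lines, docids, and per-document hypothesis lines can be sketched like this. The helper pair_documents is hypothetical, and it assumes that all lines of a document are contiguous in the reference file.

```python
from itertools import groupby


def pair_documents(ref_lines, docids, hyp_lines):
    """Group reference lines by docid and attach each document's hypothesis line."""
    if len(ref_lines) != len(docids):
        raise ValueError("need exactly one docid per reference line")
    docs = []
    # groupby merges runs of equal docids, preserving file order.
    for docid, group in groupby(zip(docids, ref_lines), key=lambda pair: pair[0]):
        docs.append((docid, [ref for _, ref in group]))
    if len(docs) != len(hyp_lines):
        raise ValueError(
            f"{len(docs)} documents but {len(hyp_lines)} hypothesis lines")
    return [(docid, refs, hyp) for (docid, refs), hyp in zip(docs, hyp_lines)]


docs = pair_documents(
    ref_lines=["a b", "c d", "e f"],
    docids=["doc1", "doc1", "doc2"],
    hyp_lines=["a b c d", "e f"],
)
for docid, refs, hyp in docs:
    print(docid, len(refs), repr(hyp))
```

Each resulting triple is an independent alignment problem: the document's single hypothesis line is re-split into as many segments as that document has reference lines.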

Scoring mode

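With --score, mweralign skips alignment and instead computes word error rate (WER) on already-parallel input: ref.txt and hyp.txt must have the same number of lines, which are compared line by line. It prints a per-segment breakdown, in which S, I, and D count substitutions, insertions, and deletions and N is the number of reference words (so WER = 100 × (S + I + D) / N), followed by a corpus total: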

mweralign --score -r ref.txt -t hyp.txt

segment 1: WER=150.00 (S=12 I=6 D=0 N=12)
segment 2: WER=100.00 (S=11 I=0 D=7 N=18)
...
TOTAL: WER=42.50 (errors=85 ref_words=200)

A tokenizer (-m) may be combined with --score to score on tokenized text.
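
For readers who want to check the numbers, the following sketch reproduces the report format with a plain word-level Levenshtein alignment on whitespace tokens. It is not the package's scorer; the function names are hypothetical, and when several minimum-cost alignments exist, the split between S, I, and D depends on tie-breaking.

```python
def edit_counts(ref, hyp):
    """Return (S, I, D) for one minimum-cost alignment of hyp to ref."""
    # Each cell holds (errors, S, I, D) for ref[:i] against hyp[:j].
    prev = [(j, 0, j, 0) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, 1):
        cur = [(i, 0, 0, i)]
        for j, h in enumerate(hyp, 1):
            c, s, n_ins, d = prev[j - 1]
            diag = (c, s, n_ins, d) if r == h else (c + 1, s + 1, n_ins, d)
            c, s, n_ins, d = prev[j]
            deletion = (c + 1, s, n_ins, d + 1)    # reference word missing
            c, s, n_ins, d = cur[j - 1]
            insertion = (c + 1, s, n_ins + 1, d)   # extra hypothesis word
            cur.append(min(diag, deletion, insertion))
        prev = cur
    return prev[-1][1:]


def score(ref_lines, hyp_lines):
    """Build per-segment and corpus-level WER lines for parallel input."""
    if len(ref_lines) != len(hyp_lines):
        raise ValueError("reference and hypothesis must have the same number of lines")
    report, errors, ref_words = [], 0, 0
    for k, (ref, hyp) in enumerate(zip(ref_lines, hyp_lines), 1):
        ref_w, hyp_w = ref.split(), hyp.split()
        s, i, d = edit_counts(ref_w, hyp_w)
        n = len(ref_w)
        wer = 100.0 * (s + i + d) / n if n else 0.0
        report.append(f"segment {k}: WER={wer:.2f} (S={s} I={i} D={d} N={n})")
        errors += s + i + d
        ref_words += n
    total = 100.0 * errors / ref_words if ref_words else 0.0
    report.append(f"TOTAL: WER={total:.2f} (errors={errors} ref_words={ref_words})")
    return report


for line in score(["the cat sat on the mat", "hello world"],
                  ["the cat sat on mat", "hello there world"]):
    print(line)
# segment 1: WER=16.67 (S=0 I=0 D=1 N=6)
# segment 2: WER=50.00 (S=0 I=1 D=0 N=2)
# TOTAL: WER=25.00 (errors=2 ref_words=8)
```

Because insertions count as errors but do not add to N, a segment's WER can exceed 100, as in the first segment of the sample output above.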

Inspecting the segmentation scores

The aligner chooses where to split the hypothesis stream with a dynamic program. You can dump the competing segment-boundary costs it considered with --trace-file. Pass - to write the trace to stdout, or a path to write it to a file. It is off by default and adds no cost when unused.

printf 'the cat sat\non the mat\n' > /tmp/ref.txt
printf 'the cat\nsat on the mat\n' > /tmp/hyp.txt
mweralign -r /tmp/ref.txt -t /tmp/hyp.txt -o /dev/null --trace-file - 2>/dev/null

Or for a longer example:

mweralign \
  -r test/data/wmt22.en-de.en \
  -t test/data/wmt22.en-de.sys \
  -m spm256k \
  -l de \
  -o /dev/null \
  --trace-file -

For each segment, the trace lists the chosen end position and every candidate end position with its cost and the previous segment's end (prev_end):

# docid range 0 (segments 0-2)
segment 1: chosen end j=3 (cost=0)
    end j=   0  cost=     0  prev_end=0
    end j=   3  cost=     0  prev_end=0  <- chosen
    end j=   2  cost=     1  prev_end=0
    ...
segment 2: chosen end j=6 (cost=0)
    end j=   6  cost=     0  prev_end=3  <- chosen
    ...

Here j is a position in the (tokenized) hypothesis stream, so segment 1 covers hyp tokens 1..3 and segment 2 covers 4..6. The alignment output itself still goes to -o (sent to /dev/null above so that only the trace is shown); in the first example, 2>/dev/null suppresses the AS-WER line, which is written to stderr.

The trace above is the boundary-cost table (cheap to record). Finer-grained per-cell edit costs are available only through the Python API, align_texts_traced(..., cells=True), since they grow with the full alignment grid and are impractical to print for long inputs.
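
If you want to post-process the boundary-cost trace, a small parser is enough. The sketch below assumes the text layout shown in the sample trace earlier in this section, with integer costs; the function names are hypothetical and the format may change between versions.

```python
import re

SEGMENT_RE = re.compile(
    r"^segment\s+(\d+):\s+chosen end j=\s*(\d+)\s+\(cost=\s*(\d+)\)")
CANDIDATE_RE = re.compile(
    r"^\s+end j=\s*(\d+)\s+cost=\s*(\d+)\s+prev_end=\s*(\d+)(\s+<- chosen)?")


def parse_trace(text):
    """Parse trace text into a list of segments with their candidate ends."""
    segments = []
    for line in text.splitlines():
        m = SEGMENT_RE.match(line)
        if m:
            segments.append({"segment": int(m[1]), "end": int(m[2]),
                             "cost": int(m[3]), "candidates": []})
            continue
        m = CANDIDATE_RE.match(line)
        if m and segments:
            segments[-1]["candidates"].append({
                "end": int(m[1]), "cost": int(m[2]),
                "prev_end": int(m[3]), "chosen": m[4] is not None})
    return segments


def chosen_spans(segments):
    """1-based (first, last) hypothesis token positions for each segment."""
    spans = []
    for seg in segments:
        chosen = next(c for c in seg["candidates"] if c["chosen"])
        spans.append((chosen["prev_end"] + 1, chosen["end"]))
    return spans


SAMPLE = """# docid range 0 (segments 0-2)
segment 1: chosen end j=3 (cost=0)
    end j=   0  cost=     0  prev_end=0
    end j=   3  cost=     0  prev_end=0  <- chosen
    end j=   2  cost=     1  prev_end=0
    ...
segment 2: chosen end j=6 (cost=0)
    end j=   6  cost=     0  prev_end=3  <- chosen
    ...
"""

print(chosen_spans(parse_trace(SAMPLE)))  # [(1, 3), (4, 6)]
```

Comment lines and the elided "..." rows are skipped because they match neither pattern.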

Project layout

src/                 # C++ core library and standalone CLI
python/
  mweralign/         # Python package (CLI + wrappers)
  bindings/          # pybind11 bindings (mweralign._mweralign)
  tests/             # pytest unit + regression suite
    regression/      # golden-file CLI regression cases
CMakeLists.txt       # builds the standalone C++ `mweralign` binary
setup.py / pyproject.toml  # builds the Python package/extension

Development

Install in editable mode with the development dependencies and run the tests:

pip install -e ".[dev]"
pytest python/tests

Regression suite

The regression suite under python/tests/regression/ runs the mweralign CLI on fixed inputs and compares the output to committed golden files. Each case is a directory containing a cmd file (the CLI arguments), the input files it references, and an expected.txt golden output.

After an intentional change in behavior, regenerate the golden files with:

MWERALIGN_REGEN=1 pytest python/tests/test_regression.py

To add a new case, create a directory under python/tests/regression/, add a cmd file plus its input files, and run the regen command above to produce expected.txt.
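
The golden-file pattern behind the suite looks roughly like the sketch below. It is not the repository's actual test code: run_case and run_cli are hypothetical names, and it assumes that each cmd file holds shell-style arguments and that the case writes its result to stdout.

```python
import os
import shlex
import subprocess
from pathlib import Path


def run_cli(argv, cwd):
    """Run the installed mweralign CLI inside the case directory."""
    result = subprocess.run(["mweralign", *argv], cwd=cwd,
                            capture_output=True, text=True, check=True)
    return result.stdout


def run_case(case_dir, run=run_cli, regen=None):
    """Compare one case against expected.txt, or rewrite it in regen mode."""
    case_dir = Path(case_dir)
    argv = shlex.split((case_dir / "cmd").read_text())
    if regen is None:
        regen = os.environ.get("MWERALIGN_REGEN") == "1"
    actual = run(argv, case_dir)
    golden = case_dir / "expected.txt"
    if regen:
        golden.write_text(actual)   # accept the new behavior as the golden output
        return True
    return actual == golden.read_text()
```

Passing the runner as a parameter keeps the comparison logic testable without the compiled extension installed.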

Building the standalone C++ CLI

The Python package builds its own extension, so this is only needed if you want the standalone mweralign binary:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
# binary at build/mweralign

Citation

If you use this package, please cite the following two papers. We suggest a sentence similar to the following: "To align the text, we used the mweralign package \citep{post-huang-2025-effects}, which implements a variant of the AS-WER algorithm \citep{matusov-etal-2005-evaluating}."

License

This project contains code under multiple licenses:

Original C++ alignment code: GNU General Public License v3 (GPL-3.0)
Python bindings and wrapper code: Apache License 2.0
Build scripts and documentation: Apache License 2.0

The project as a whole is distributed under GPL-3.0 due to the inclusion of GPL-licensed components.

What this means for users:

You can use this library in GPL-compatible projects
If you distribute software that includes this library, your software must be GPL-compatible
The Python wrapper code (separate from the C++ core) is available under Apache License 2.0

Attribution

This software includes original GPL-licensed C++ code for alignment algorithms. Python bindings and packaging by Matt Post (Apache License 2.0).

Release history

Version	Released
1.4.1	Jun 12, 2026 (this version)
1.4.0	Jun 12, 2026
1.3.0	Jun 12, 2026
1.2.0	Jun 5, 2026
1.1.1	Jun 5, 2026
1.1.0	Jun 5, 2026
1.0.1	Jul 30, 2025
1.0.0	Jul 23, 2025
