Minimum Word Error Rate Alignment for speech recognition evaluation

mweralign

mweralign is a Python package for aligning a stream of words to a reference segmentation. It is designed for use in speech translation tasks, where system outputs must be aligned to a reference translation in order for standard MT metrics to work. This package is a Python wrapper around the original MWERAlign C++ library, which implements the AS-WER algorithm for automatic sentence segmentation and alignment. The wrapper also modernizes that code and adds support for subword tokenization, which improves alignment quality.

Installation

To install the package, you can use pip:

pip install mweralign

Or install from source:

git clone https://github.com/mjpost/mweralign
cd mweralign
pip install .

Usage

You can see usage information by running mweralign with the --help flag:

mweralign --help

The standard use case is to provide a reference file, in which segments (sentences) are listed one per line, and a hypothesis file, which contains the output of a speech translation system and has no line requirements. The output will be a file with the same number of lines as the reference, where each line contains the hypothesis words aligned to the corresponding reference segment.

mweralign -r ref.txt -h hyp.txt -o aligned.txt

You will want to use a tokenizer. Currently supported are "cj", which separates Han characters with whitespace, and any SentencePiece model, which is given as a filesystem path:

mweralign -r ref.zh.txt -h hyp.txt -o aligned.txt -t cj

The package ships with pre-trained, character-preserving (identity-normalization) SentencePiece models that are downloaded on demand the first time you request them by name. Pass spm32k, spm64k, spm128k, or spm256k (spm is an alias for 256k):

mweralign -r ref.txt -h hyp.txt -o aligned.txt -t spm32k

The model is fetched from the project's GitHub Release, verified against a checksum, and cached under ~/.cache/mweralign/models (override with MWERALIGN_SPM_DIR). To pre-fetch all sizes (e.g. for offline use):

python -m mweralign.models --all
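The cache lookup and checksum step described above can be pictured roughly as follows. This is a sketch, not the package's code: the helper names are hypothetical, and SHA-256 is an assumption, since the text only says "a checksum".

```python
import hashlib
import os
from pathlib import Path


def model_cache_dir() -> Path:
    """Return the model cache directory, honoring MWERALIGN_SPM_DIR if set."""
    override = os.environ.get("MWERALIGN_SPM_DIR")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "mweralign" / "models"


def sha256_matches(path: Path, expected_hex: str) -> bool:
    """Hash the file in 1 MiB chunks and compare with the expected digest.

    Assumption: the digest algorithm is SHA-256.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()
```

Hashing in chunks keeps memory use flat even for the larger vocabulary models.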

Recommendation: for the best segmentation quality, use a character-preserving (identity-normalization) SentencePiece model for all languages, including CJK. In our WMT24 experiments, an identity SPM model restored the original segmentation far more accurately than whitespace tokenization on every language pair. On the CJK pairs (en-ja, en-zh, ja-zh) it also clearly outperformed the cj character segmenter (~94% vs. ~69% boundary accuracy): per-character tokenization gives the aligner too much freedom, whereas subword pieces constrain boundaries to sensible word edges. Vocabulary size has little effect (32k is sufficient; 128k is marginally best), so a small model is a fine default. The cj segmenter remains available as a dependency-free fallback.

Note that the flores200 SPM model applies NMT-style normalization that rewrites characters, so it is unsuitable when you need the original text restored verbatim; use an identity-normalization model for that. If you still want to use it, it can be fetched through sacrebleu:

# download the flores200 SPM model (one time)
sacrebleu -t wmt24 -l en-zh --echo src | sacrebleu -t wmt24 -l en-zh --tok flores200 > /dev/null
# align
mweralign -r ref.txt -h hyp.txt -o aligned.txt -t ~/.sacrebleu/models/flores200sacrebleuspm

You may also wish to supply the ISO 639-1 language code (-l zh). For zh and ja, this tells the underlying AS-WER algorithm to allow sentences to start with the SentencePiece space character. For other languages, it has no effect.

mweralign -r ref.txt -h hyp.txt -o aligned.txt -t cj -l zh

When a SentencePiece model is used to tokenize a non-CJK language, the aligner also forbids mid-word segment boundaries: no output segment may begin on a word-internal subword piece (one lacking the leading ▁ marker), so re-segmenting never splits a word across two segments. This is automatic and requires no flag. Pure-punctuation pieces are exempt, since they legitimately attach to the preceding token.
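As an illustration of this constraint, the sketch below marks which positions in a list of SentencePiece pieces could start a segment. The function names are hypothetical, and the punctuation test (every character in a Unicode "P" category) is an assumption about what "pure-punctuation" means.

```python
import unicodedata

SPM_SPACE = "\u2581"  # "▁", the marker SentencePiece puts on word-initial pieces


def is_punctuation(piece: str) -> bool:
    """True if the piece is non-empty and consists only of punctuation."""
    return bool(piece) and all(
        unicodedata.category(ch).startswith("P") for ch in piece
    )


def allowed_segment_starts(pieces: list[str]) -> list[int]:
    """Indices where a segment may begin: word-initial or punctuation pieces."""
    return [
        i
        for i, piece in enumerate(pieces)
        if piece.startswith(SPM_SPACE) or is_punctuation(piece)
    ]


pieces = ["▁re", "seg", "ment", "ing", "▁works", ".", "▁Next"]
print(allowed_segment_starts(pieces))  # [0, 4, 5, 6]
```

Positions 1 to 3 are excluded because they fall inside the word "resegmenting".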

Inspecting the segmentation scores

The aligner chooses where to split the hypothesis stream with a dynamic program. You can dump the competing segment-boundary costs it considered with --trace-file. Pass - to write the trace to stdout, or a path to write it to a file. It is off by default and adds no cost when unused.

printf 'the cat sat\non the mat\n' > /tmp/ref.txt
printf 'the cat\nsat on the mat\n' > /tmp/hyp.txt
mweralign -r /tmp/ref.txt -t /tmp/hyp.txt -o /dev/null --trace-file - 2>/dev/null

Or for a longer example:

mweralign \
  -r test/data/wmt22.en-de.en \
  -t test/data/wmt22.en-de.sys \
  -m spm256k \
  -l de \
  -o /dev/null \
  --trace-file -

For each segment, the trace lists the chosen end position and every candidate end position with its cost and the previous segment's end (prev_end):

# docid range 0 (segments 0-2)
segment 1: chosen end j=3 (cost=0)
    end j=   0  cost=     0  prev_end=0
    end j=   3  cost=     0  prev_end=0  <- chosen
    end j=   2  cost=     1  prev_end=0
    ...
segment 2: chosen end j=6 (cost=0)
    end j=   6  cost=     0  prev_end=3  <- chosen
    ...

Here j is a position in the (tokenized) hypothesis stream, so segment 1 covers hyp tokens 1..3 and segment 2 covers 4..6. The alignment output itself still goes to -o (sent to /dev/null above so only the trace is shown); 2>/dev/null suppresses the AS-WER line.

The trace above is the boundary-cost table (cheap to record). Finer-grained per-cell edit costs are available only through the Python API, align_texts_traced(..., cells=True), since they grow with the full alignment grid and are impractical to print for long inputs.
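For readers who want a feel for the dynamic program behind the boundary costs, here is a toy version. It is a naive sketch under stated assumptions (unit insertion, deletion, and substitution costs; every reference segment tried against every hypothesis span), not the optimized AS-WER implementation in the C++ core. The names edit_distance and resegment are illustrative and are not part of the package API.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # ref token left unmatched
                           cur[j - 1] + 1,           # extra hyp token
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1]


def resegment(ref_segments, hyp_tokens):
    """Return (segment end positions, total cost) minimizing summed edit distance."""
    n, inf = len(hyp_tokens), float("inf")
    best = [[inf] * (n + 1) for _ in range(len(ref_segments) + 1)]
    back = [[0] * (n + 1) for _ in range(len(ref_segments) + 1)]
    best[0][0] = 0
    for k, ref in enumerate(ref_segments, start=1):
        for j in range(n + 1):          # candidate end of segment k
            for i in range(j + 1):      # candidate end of segment k-1
                if best[k - 1][i] == inf:
                    continue
                cost = best[k - 1][i] + edit_distance(ref, hyp_tokens[i:j])
                if cost < best[k][j]:
                    best[k][j], back[k][j] = cost, i
    ends, j = [], n
    for k in range(len(ref_segments), 0, -1):  # follow back-pointers
        ends.append(j)
        j = back[k][j]
    return ends[::-1], best[-1][n]


ref = [["the", "cat", "sat"], ["on", "the", "mat"]]
hyp = "the cat sat on the mat".split()
print(resegment(ref, hyp))  # ([3, 6], 0)
```

On the two-line example, the sketch picks end positions 3 and 6 at cost 0. Here best[k][j] plays the role of the cost for a candidate end j, and back[k][j] corresponds to prev_end.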

Project layout

src/                 # C++ core library and standalone CLI
python/
  mweralign/         # Python package (CLI + wrappers)
  bindings/          # pybind11 bindings (mweralign._mweralign)
  tests/             # pytest unit + regression suite
    regression/      # golden-file CLI regression cases
CMakeLists.txt       # builds the standalone C++ `mweralign` binary
setup.py / pyproject.toml  # builds the Python package/extension

Development

Install in editable mode with the development dependencies and run the tests:

pip install -e ".[dev]"
pytest python/tests

Regression suite

The regression suite under python/tests/regression/ runs the mweralign CLI on fixed inputs and compares the output to committed golden files. Each case is a directory containing a cmd file (the CLI arguments), the input files it references, and an expected.txt golden output.

After an intentional change in behavior, regenerate the golden files with:

MWERALIGN_REGEN=1 pytest python/tests/test_regression.py

To add a new case, create a directory under python/tests/regression/, add a cmd file plus its input files, and run the regen command above to produce expected.txt.
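The golden-file pattern described here can be sketched in a few lines. This is a hypothetical runner, not the contents of test_regression.py: it assumes the cmd file holds shell-style arguments, that the output to compare is whatever the program prints to stdout, and that the program to run is passed in as a list.

```python
import os
import shlex
import subprocess
from pathlib import Path


def run_case(case_dir: Path, program: list[str]) -> bool:
    """Run one case directory; return True if stdout matches expected.txt.

    Assumptions: `cmd` contains shell-style arguments, and the compared
    output is the program's stdout.
    """
    args = shlex.split((case_dir / "cmd").read_text())
    result = subprocess.run(
        program + args, cwd=case_dir, capture_output=True, text=True, check=True
    )
    golden = case_dir / "expected.txt"
    if os.environ.get("MWERALIGN_REGEN") == "1":
        golden.write_text(result.stdout)  # intentional change: refresh golden file
        return True
    return result.stdout == golden.read_text()
```

Running with cwd set to the case directory lets the cmd file refer to its input files by relative name, which keeps each case self-contained.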

Building the standalone C++ CLI

The Python package builds its own extension, so this is only needed if you want the standalone mweralign binary:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
# binary at build/mweralign

Citation

If you use this package, please cite the following two papers. We suggest a sentence similar to the following: "To align the text, we used the mweralign package \citep{post-huang-2025-effects}, which implements a variant of the AS-WER algorithm \citep{matusov-etal-2005-evaluating}."

License

This project contains code under multiple licenses:

  • Original C++ alignment code: GNU General Public License v3 (GPL-3.0)
  • Python bindings and wrapper code: Apache License 2.0
  • Build scripts and documentation: Apache License 2.0

The project as a whole is distributed under GPL-3.0 due to the inclusion of GPL-licensed components.

What this means for users:

  • You can use this library in GPL-compatible projects
  • If you distribute software that includes this library, your software must be GPL-compatible
  • The Python wrapper code (separate from the C++ core) is available under Apache License 2.0

Attribution

This software includes original GPL-licensed C++ code for alignment algorithms. Python bindings and packaging by Matt Post (Apache License 2.0).
