Minimum Word Error Rate Alignment for speech recognition evaluation

mweralign

mweralign is a Python package for aligning a stream of words to a reference segmentation. It is designed for speech translation tasks, where system outputs must be aligned to a reference translation before standard MT metrics can be applied. The package is a Python wrapper around the original MWERAlign C++ library, which implements the AS-WER algorithm for automatic sentence segmentation and alignment. The wrapper also modernizes that code and adds support for subword tokenization, which helps with alignment.
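To make the "word error rate" in the name concrete, here is a minimal sketch of word-level edit distance, the quantity a minimum-WER alignment tries to keep small. This is a textbook Levenshtein computation, not the package's C++ implementation, and the function names are illustrative only.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists (all edits cost 1)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # reference word r has no counterpart (deletion)
                curr[j - 1] + 1,     # hypothesis word h is extra (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref_words, hyp_words) / len(ref_words)


if __name__ == "__main__":
    print(wer("the cat sat", "the dog sat down"))
```

An aligner in this family searches over possible segment boundaries in the hypothesis and prefers the segmentation whose segments have the lowest total edit distance to the reference segments.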

Installation

To install the package, you can use pip:

pip install mweralign

Or install from source:

git clone https://github.com/mjpost/mweralign
cd mweralign
pip install .

Usage

You can see usage information by running mweralign with the --help flag:

mweralign --help

The standard use case takes two files: a reference file, with one segment (sentence) per line, and a hypothesis file, which contains the output of a speech translation system and has no requirements on line breaks. The output is a file with the same number of lines as the hypothesis, where each line contains the index of the reference segment that corresponds to that hypothesis line.

mweralign -r ref.txt -h hyp.txt -o aligned.txt
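If you drive the aligner from a Python script, one option is to shell out to the CLI. The sketch below only assembles the flags shown in this README (-r, -h, -o, plus the optional -t and -l described further down); the helper names build_align_command and run_align are hypothetical and not part of the package.

```python
import shutil
import subprocess


def build_align_command(ref_path, hyp_path, out_path, tokenizer=None, lang=None):
    """Assemble the mweralign argument list from the flags documented in the README."""
    cmd = ["mweralign", "-r", str(ref_path), "-h", str(hyp_path), "-o", str(out_path)]
    if tokenizer is not None:
        cmd += ["-t", str(tokenizer)]  # "cj" or a path to a SentencePiece model
    if lang is not None:
        cmd += ["-l", lang]            # ISO 639-1 code such as "zh"
    return cmd


def run_align(ref_path, hyp_path, out_path, **kwargs):
    """Run the CLI, failing early if it is not installed."""
    if shutil.which("mweralign") is None:
        raise RuntimeError("mweralign is not on PATH; install it with pip first")
    subprocess.run(build_align_command(ref_path, hyp_path, out_path, **kwargs), check=True)


if __name__ == "__main__":
    print(" ".join(build_align_command("ref.txt", "hyp.txt", "aligned.txt")))
```

Keeping the command construction separate from the subprocess call makes it easy to log or unit-test the exact invocation.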

You will want to use a tokenizer. Two options are currently supported: "cj", which separates Han characters with whitespace, and any SentencePiece model, which is given as a filesystem path:

mweralign -r ref.zh.txt -h hyp.txt -o aligned.txt -t cj

# download the flores200 SPM model (one time)
sacrebleu -t wmt24 -l en-zh --echo src | sacrebleu -t wmt24 -l en-zh --tok flores200 > /dev/null
# align
mweralign -r ref.txt -h hyp.txt -o aligned.txt -t ~/.sacrebleu/models/flores200sacrebleuspm

You may also wish to supply the ISO 639-1 language code (for example, -l zh). For zh and ja, this tells the underlying AS-WER algorithm to allow sentences to start with the SentencePiece space character. For other languages, it has no effect.

mweralign -r ref.txt -h hyp.txt -o aligned.txt -t cj -l zh
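For readers unfamiliar with the SentencePiece space character: SentencePiece replaces each space with "▁" (U+2581) and attaches it to the piece that follows. The sketch below uses hand-written pieces, not the output of a real model, and it does not reproduce how mweralign treats the marker; the helper names are illustrative.

```python
SPM_SPACE = "\u2581"  # "▁", SentencePiece's stand-in for a space


def pieces_to_text(pieces):
    """Undo the marker: concatenate pieces and turn each "▁" back into a space."""
    return "".join(pieces).replace(SPM_SPACE, " ").strip()


def starts_with_space_marker(pieces):
    """For each piece, report whether it begins with the space marker."""
    return [piece.startswith(SPM_SPACE) for piece in pieces]


if __name__ == "__main__":
    example = ["\u2581Hello", "\u2581wor", "ld", "."]  # hand-written example pieces
    print(pieces_to_text(example))
    print(starts_with_space_marker(example))
```

In whitespace-delimited languages the marker lines up with word boundaries. Chinese and Japanese text has few or no spaces, so the marker carries different information there, which is why the behavior is language-dependent.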

Project layout

src/                 # C++ core library and standalone CLI
python/
  mweralign/         # Python package (CLI + wrappers)
  bindings/          # pybind11 bindings (mweralign._mweralign)
  tests/             # pytest unit + regression suite
    regression/      # golden-file CLI regression cases
CMakeLists.txt       # builds the standalone C++ `mweralign` binary
setup.py / pyproject.toml  # builds the Python package/extension

Development

Install in editable mode with the development dependencies and run the tests:

pip install -e ".[dev]"
pytest python/tests

Regression suite

The regression suite under python/tests/regression/ runs the mweralign CLI on fixed inputs and compares the output to committed golden files. Each case is a directory containing a cmd file (the CLI arguments), the input files it references, and an expected.txt golden output.

After an intentional change in behavior, regenerate the golden files with:

MWERALIGN_REGEN=1 pytest python/tests/test_regression.py

To add a new case, create a directory under python/tests/regression/, add a cmd file plus its input files, and run the regen command above to produce expected.txt.
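The golden-file pattern described above can be sketched as follows. This is an assumption about the general shape of such a harness, not a copy of python/tests/test_regression.py: check_case and the run_cli callable are hypothetical, and run_cli stands in for whatever actually invokes the CLI and returns its output as a string.

```python
import os
import shlex
from pathlib import Path


def check_case(case_dir, run_cli, regen=None):
    """Run one regression case; return True if the output matches expected.txt.

    When regeneration is requested, expected.txt is overwritten instead.
    """
    case_dir = Path(case_dir)
    # The cmd file holds the CLI arguments for this case.
    args = shlex.split((case_dir / "cmd").read_text(encoding="utf-8"))
    actual = run_cli(args, cwd=case_dir)  # hypothetical runner returning a str
    expected_path = case_dir / "expected.txt"
    if regen is None:
        regen = os.environ.get("MWERALIGN_REGEN") == "1"
    if regen:
        expected_path.write_text(actual, encoding="utf-8")
        return True
    return actual == expected_path.read_text(encoding="utf-8")
```

Passing the runner in as a callable keeps the comparison logic testable without the compiled extension, and the same function covers both the "compare" and "regenerate" modes.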

Building the standalone C++ CLI

The Python package builds its own extension, so this is only needed if you want the standalone mweralign binary:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
# binary at build/mweralign

Citation

If you use this package, please cite the two papers referenced below. We suggest a sentence similar to the following: "To align the text, we used the mweralign package \citep{post-huang-2025-effects}, which implements a variant of the AS-WER algorithm \citep{matusov-etal-2005-evaluating}."

License

This project contains code under multiple licenses:

  • Original C++ alignment code: GNU General Public License v3 (GPL-3.0)
  • Python bindings and wrapper code: Apache License 2.0
  • Build scripts and documentation: Apache License 2.0

The project as a whole is distributed under GPL-3.0 due to the inclusion of GPL-licensed components.

What this means for users:

  • You can use this library in GPL-compatible projects
  • If you distribute software that includes this library, your software must be GPL-compatible
  • The Python wrapper code (separate from the C++ core) is available under Apache License 2.0

Attribution

This software includes original GPL-licensed C++ code for alignment algorithms. Python bindings and packaging by Matt Post (Apache License 2.0).

Download files

Download the file for your platform.

Source Distribution

  • mweralign-1.2.0.tar.gz (37.4 kB): source

Built Distributions

  • mweralign-1.2.0-cp314-cp314-musllinux_1_2_x86_64.whl (1.2 MB): CPython 3.14, musllinux (musl 1.2+), x86-64
  • mweralign-1.2.0-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (185.2 kB): CPython 3.14, manylinux (glibc 2.17+), x86-64
  • mweralign-1.2.0-cp313-cp313-musllinux_1_2_x86_64.whl (1.2 MB): CPython 3.13, musllinux (musl 1.2+), x86-64
  • mweralign-1.2.0-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (185.1 kB): CPython 3.13, manylinux (glibc 2.17+), x86-64
  • mweralign-1.2.0-cp312-cp312-musllinux_1_2_x86_64.whl (1.2 MB): CPython 3.12, musllinux (musl 1.2+), x86-64
  • mweralign-1.2.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (185.1 kB): CPython 3.12, manylinux (glibc 2.17+), x86-64
  • mweralign-1.2.0-cp311-cp311-musllinux_1_2_x86_64.whl (1.2 MB): CPython 3.11, musllinux (musl 1.2+), x86-64
  • mweralign-1.2.0-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (185.5 kB): CPython 3.11, manylinux (glibc 2.17+), x86-64
  • mweralign-1.2.0-cp310-cp310-musllinux_1_2_x86_64.whl (1.2 MB): CPython 3.10, musllinux (musl 1.2+), x86-64
  • mweralign-1.2.0-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (184.3 kB): CPython 3.10, manylinux (glibc 2.17+), x86-64

File details

mweralign-1.2.0.tar.gz

  • Size: 37.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12
  • SHA256: b4a7bf119b1c57a1efd5d7c64bfffc31073a6255c29c4516a05d5e7c2d90f3ab
  • MD5: aa1e0ccb0765297b3e5ede3d140d41fe
  • BLAKE2b-256: 6566c0b05ec4acde6355404e1e9c7d4b73033183176e4c5dcd71168640434d58

mweralign-1.2.0-cp314-cp314-musllinux_1_2_x86_64.whl

  • SHA256: 619a848f42e770b89f918d253704d51ddb69105faf132dc4e201ab0affbdd828
  • MD5: 9f23e8355158b002e934f17182b96175
  • BLAKE2b-256: 909cd8981cc038744a7edcf02183fe9dd4f4433891c0f4b103d673b3c3412b3c

mweralign-1.2.0-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.whl

  • SHA256: 3e6a854d2ba4245339eb06cf496f9591bf29dd61696b008f95bdccebc820e5ac
  • MD5: 024a5582d36a81394b48901e24dee722
  • BLAKE2b-256: 79750a8a0fc67728d08060505707ba7dfa5d4ba70a96c70f12eea8d3da61bbf7

mweralign-1.2.0-cp313-cp313-musllinux_1_2_x86_64.whl

  • SHA256: 992da5fc2310938de02ef37994e213dde87d4753535192a8870bf56adf9cb3f5
  • MD5: 8ffbc9a7a889e264bb4c97cf6ab0f96a
  • BLAKE2b-256: 79d61bed2390f5dd2e9e4cb062929caa94982b05d510f1ddd19ec2c7c3baaa95

mweralign-1.2.0-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl

  • SHA256: 9b2e183339d3cf2371af5a55a783e72c60b57155eb85f931efaa4e0c2beba09b
  • MD5: 856ed36c9cc732bfd3070a78415b0e5a
  • BLAKE2b-256: 98f4b0805f11849d7a30ba4364489aeb4fbcc4ef5f786820e45c67a5cb5e8036

mweralign-1.2.0-cp312-cp312-musllinux_1_2_x86_64.whl

  • SHA256: fc7e09dcbd9fee85b6cce1314b0fb1363556552029e52bebcdb03a44101a6c69
  • MD5: e3394d9372c307b76e463a3a2dbe7810
  • BLAKE2b-256: 2fad8551027d6fef89de7902990119a61f3a979e868eb348bb5861bf6ac487a8

mweralign-1.2.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl

  • SHA256: 56da9bfdd6d3d11696a8e691c4aca50fde45831a35ea45fca79f77895d4d9163
  • MD5: e22102620522a93ee6d89478046e7853
  • BLAKE2b-256: 9cf38efeb61b014d94bb72f6d5a2be0b65f188f2e9cd56f32a27797fc5a0dbc0

mweralign-1.2.0-cp311-cp311-musllinux_1_2_x86_64.whl

  • SHA256: cdfee0355116bfe63634244a2dce32804ec69a52155970a594fd2215d8b6630a
  • MD5: ea1b0e65504c608629a702898d856a29
  • BLAKE2b-256: 3ab3cfc89474e7f264259c9919faa2a8b01e93a56cb2f0d1e7dbef69baf77f28

mweralign-1.2.0-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl

  • SHA256: 385829082b970fd55186f0550f66f7b45624657d9867557074f3c2d2ea6d9754
  • MD5: bcb8ca3725881078cd2f80e15a03ee7c
  • BLAKE2b-256: 75afee8cc755cfbd96b96caf9907bc0970682ba9151c6d65ad3e478c76a5622e

mweralign-1.2.0-cp310-cp310-musllinux_1_2_x86_64.whl

  • SHA256: 637625efaeca001ec7e71e3b888573764aa72cd2a0a04da5abdfad666aab0a7d
  • MD5: 908ec4425009fdd0c2fad5de3822ca05
  • BLAKE2b-256: f01ede5f9a7fcb5ba358d7d8bd7b76035591bae3d82665e0bf71c4a19fe22bcd

mweralign-1.2.0-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl

  • SHA256: 6b6484fac1bc4f6ceb55d3e7d7bf03cc22f5abb8daf97cce9075737ad6f57864
  • MD5: 47a8425a392896c8b5e5637ee0d8354f
  • BLAKE2b-256: 72d05d74fa51af347181d0f3a8e50b5847c27ceedea5f61dcc10094b8b96fec0
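The SHA256 digests listed above can be used to check a downloaded file. The following is a minimal sketch using Python's standard hashlib module; the helper names sha256_of and matches_published_hash are illustrative, and the local path you pass in is whatever location you saved the file to.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, published_hex):
    """Compare a file's digest against a published hex string (case-insensitive)."""
    return sha256_of(path) == published_hex.strip().lower()
```

For example, matches_published_hash("mweralign-1.2.0.tar.gz", "b4a7bf119b1c57a1efd5d7c64bfffc31073a6255c29c4516a05d5e7c2d90f3ab") should return True for an intact copy of the source distribution.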
