REMERGE is a Multi-Word Expression (MWE) discovery algorithm derived from the MERGE algorithm.

These details have not been verified by PyPI

Project description

REMERGE - Multi-Word Expression discovery algorithm

REMERGE is a Multi-Word Expression (MWE) discovery algorithm, which started as a re-implementation and simplification of a similar algorithm called MERGE, detailed in a publication and PhD thesis[^2][^3]. The primary benefit of this algorithm is that it's non-parametric in regards to the size of the n-grams that constitute a MWE—you do not need to specify a priori how many n-grams comprise a MWE—you only need to specify the number of iterations you want the algorithm to run.

The code was originally derived from an existing implementation from the original author[^1] that I reviewed, converted from python 2 to 3, then modified and updated with the following:

a correction of the log-likelihood calculation; previously it was not using the correct values for the contingency table
the removal of gapsize / discontinuous bigrams (see below for issues with the prior implementation)
an overall reduction in codebase size and complexity
- ~60% reduction in loc
- removed pandas and nltk dependencies
type annotations
the inclusion of additional metrics (Frequency, NPMI[^4]) for selecting the winning bigram.
corrections for merging sequential bigrams greedily and completely.
- e.g. 'ya ya ya ya' -> '(ya ya) (ya ya)' -> '(ya ya ya ya)'. Previously the merge order was non-deterministic, and you could end up with 'ya (ya ya) ya'
An overall simplification of the algorithm.
- As a tradeoff, this version may be less efficient. After a bigram is merged into a single lexeme in the original implementation, new bigrams and conflicting (old) bigrams were respectively added and subtracted from a mutable counter of bigrams. The counts of this object were difficult to track and validate, and resulted in errors in certain cases, so I removed this step. Instead, only the lexeme data is updated with the new merged lexemes. Then, we track which lines contain the merged lexeme and create an update counter that subtracts the old bigrams from the new bigrams and updates the bigram data using that counter.
Clarified license with the original author and licensed as MIT.

Usage

import remerge

corpus = [
    "a list of already tokenized texts",
    "where each item is a document string",
    "isn't this API nice"
]

winners = remerge.run(
    corpus, iterations=1, method=remerge.SelectionMethod.frequency
)
# winners[0].merged_lexeme.word == ('a', 'list')

There are 3 bigram winner selection methods: Log-Likelihood (G²)[^5], NPMI[^4], and raw frequency. They are available under the SelectionMethod enum. The default is log-likelihood, which was used in the original implementation.

Tie-breaking is deterministic only: candidates are ranked by score, then frequency, then lexicographic merged-token order.

If using NPMI (SelectionMethod.npmi), you likely want to provide a min_count parameter, "as infrequent word pairs tend to dominate the top of bigramme lists that are ranked after PMI". (p. 4[^4])

winners = remerge.run(corpus, 100, method=remerge.SelectionMethod.npmi, min_count=25)

Migration (`0.4.0`)

WinnerInfo.bigram_locations has been removed. Use:

winner.merge_token_count for the number of non-overlapping merges applied in that iteration.
winner.score for the score that selected the winning bigram (frequency, log-likelihood, or NPMI based on method).

API - `remerge.run`

Argument	Type	Description
corpus	`List[str]`	A corpus of document strings. Documents are split into segments according to `splitter`, then each segment is tokenized with whitespace splitting in Rust.
iterations	`int`	The maximum number of iterations to run the algorithm. Papers typically use >500.
method	`SelectionMethod`, optional	One of "frequency", "log_likelihood", or "npmi". Defaults to "log_likelihood".
min_count	`int`, optional	The minimum count required for a bigram to be included in winner calculations for all methods. If choosing NPMI ("npmi"), prefer using min_count because this measure is biased towards infrequent word pairs. Defaults to 0.
splitter	`Splitter`, optional	Segmenter used before tokenization: `"delimiter"` (default) or `"sentencex"`.
line_delimiter	`Optional[str]`, optional	Delimiter used when `splitter="delimiter"`. Defaults to `\"\\n\"`. Set to `None` to treat each document as a single segment. Ignored when `splitter="sentencex"`.
sentencex_language	`str`, optional	Language code passed to `sentencex` when `splitter="sentencex"` (for example `"en"`). Defaults to `"en"`.
rescore_interval	`int`, optional	Number of iterations between full candidate rescoring for LL/NPMI methods. `1` forces full rescoring every iteration; higher values trade exactness for speed. Defaults to `25`.
on_exhausted	`ExhaustionPolicy`, optional	Behavior when no candidate passes filters (or threshold): "stop" returns winners collected so far, "raise" raises `NoCandidateBigramError`. Defaults to "stop".
min_score	`Optional[float]`, optional	Optional minimum score threshold for the selected winner. If the best candidate is below this threshold, behavior follows `on_exhausted`. Defaults to None.

run() returns list[WinnerInfo]. Each WinnerInfo contains:

bigram
merged_lexeme
score
merge_token_count

API - `remerge.annotate`

annotate() returns (winners, annotated_docs, labels) and uses the same WinnerInfo shape as run().

Tokenization and annotation output normalization:

Input text is tokenized with Rust split_whitespace().
Original whitespace formatting is not preserved.
annotate() reconstructs output with normalized single-space token joins within each segment.

Performance and scaling notes:

Internal bigram/lexeme location tracking is memory-intensive by design.
Prefer setting min_count above 0 for large corpora and especially for npmi.
Keep iterations practical for corpus size and use case.
rescore_interval=1 gives exact LL/NPMI rescoring each iteration; larger values trade some exactness for speed.

Install

Latest release:

pip install -U remerge-mwe

For latest from github:

pip install git+https://github.com/pmbaumgartner/remerge-mwe.git

Development

Use uv for local project and dependency management. This package now builds a Rust extension via PyO3 + maturin, so a local Rust toolchain is required.

Create/sync the environment with all dependency groups:

uv sync --all-groups

Build/install the extension for the current environment:

uv run --no-sync maturin develop

Run tests:

uv run --no-sync pytest -v -m "not corpus and not parity"

Run tests with Python coverage (wrapper/API layer):

uv run --no-sync pytest -v -m "not corpus and not parity" --cov=src/remerge --cov-report=term-missing --cov-report=xml

Run full corpus/parity checks (slower, intended for CI/mainline validation):

uv run --no-sync pytest -v -m "corpus or parity"

Regenerate the parity fixture after intentional algorithm changes:

bin/regenerate-parity-fixture.sh

Add a runtime dependency:

uv add <pkg>

Add a development dependency:

uv add --dev <pkg>

If you make changes under rust/, run uv run --no-sync maturin develop again before testing.

PyO3 troubleshooting:

# Print PyO3 interpreter/build config and stop.
PYO3_PRINT_CONFIG=1 uv run --no-sync maturin develop

# Force the Python interpreter PyO3 should inspect.
PYO3_PYTHON=.venv/bin/python uv run --no-sync maturin develop

If cargo test fails with unresolved Python symbols, confirm pyo3/extension-module is not forced in Cargo.toml, then run:

cargo clean
cargo test

Releasing

Releases are automated with GitHub Actions in .github/workflows/release.yml.

Update both version fields to the same value:
- pyproject.toml -> [project].version
- Cargo.toml -> [package].version Recommended flow with uv version:

# Check current package version.
uv version
uv version --short

# Preview a minor bump without writing changes.
uv version --bump minor --dry-run

# Apply the bump to pyproject.toml without lock/sync changes.
uv version --bump minor --frozen

Then set Cargo.toml [package].version to match the new pyproject.toml version. Important: pushing a matching git tag (for example v0.4.0) triggers .github/workflows/release.yml; for tag-triggered runs, the workflow will publish to PyPI after build and validation succeed. 2. Commit and push to main. 3. Create and push a release tag matching that version:

git tag vX.Y.Z
git push origin vX.Y.Z

Verify the Release workflow completes successfully:
- wheels built for Linux/macOS/Windows
- sdist built
- wheel and sdist smoke tests pass
- publish job succeeds via uv publish
Confirm artifacts on PyPI:
- https://pypi.org/project/remerge-mwe/

Manual dry-run/backfill:

Trigger Release via workflow_dispatch.
Leave publish=false for build/validation only.
Set publish=true and release_tag=vX.Y.Z only when you intend to upload.

Local package build/publish commands (same toolchain used in CI):

# Build wheel + sdist from a clean source tree snapshot.
uv build --no-sources

# Publish previously built artifacts in dist/ (requires credentials/OIDC).
uv publish dist/*

Trusted Publisher (PyPI OIDC) setup:

In PyPI project settings for remerge-mwe, add a Trusted Publisher.
Set owner/repo to this GitHub repository.
Set workflow file to release.yml.
If using GitHub environments, set environment to pypi.
In GitHub, optionally add protection rules/required reviewers for the pypi environment.

Rust/PyO3 Backlog

Planned follow-up after stabilization: split into a pure Rust core crate plus a thin PyO3 bindings crate.

Trigger for this split:

Parity tests remain stable across 2 consecutive PRs.
CI remains green across 2 consecutive PRs.
Benchmark baseline from bin/benchmark-remerge.sh --build release --runs 1 --iterations 5 is stable across 2 consecutive PRs.

How it works

The algorithm operates iteratively in two stages: first, it collects all bigrams of co-occurring lexemes in the corpus. A measure is calculated on the set of all bigrams to determine a winner. The two lexemes that comprise the winning bigram are merged into a single lexeme. Instances of that bigram (lexeme pair) in the corpus are replaced with the merged lexeme. Outdated bigrams, i.e. those that don't exist anymore because one of their elements is now a merged lexeme, are subtracted from the bigram data. New bigrams, i.e. those where one element is now a merged lexeme, are added to the bigram data. With this new set of bigram data, the process repeats and a new winner is selected.

At initialization, a lexeme consists of only a single token, but as the algorithm iterates lexemes become multi-word expressions formed from the winning bigrams. Lexemes contain two parts: a word which is a tuple of strings, and an index which represents the position of that specific token in a MWE. For example, if the winning bigram is (you, know), occurrences of that sequence of lexemes will be replaced with [(you, know), 0] and [(you, know), 1] in the corpus. When bigrams are counted, only a root lexeme (where the index is 0) can form a bigram, so merged tokens don't get double counted. For a more visual explanation of a few iterations assuming specific winners, see the image below.

Limitations

This implementation is still a greedy agglomerative procedure, so local winner choices can influence later merges. Different selection methods (frequency, log_likelihood, npmi) can lead to materially different MWE inventories depending on corpus size and domain.

Issues with Original Algorithm

Single Bigrams with discontinuities forming from distinct Lexeme positions

One issue with discontinuities or gaps in the original algorithm is that it did not distinguish the position of a satellite lexeme occurring to the left or right of a bigram with a gap.

Take for example these two example sentences, using - to represent an arbitrary token:

a b c -
a - c b

Assume in a prior iteration, a winning bigram was (a, _, c), representing the token a, a gap of 1, and then the token c. with a gapsize of 1. The past algorithm, run on the above corpus, would count the token b twice towards the same n-gram (a, b, c), despite there being two distinct n-grams represented here: (a, b, c) and (a, _, c, b).

I think the algorithm is counting on the fact that it would be very rare to encounter this sequence of lexemes in a realistic corpus, where the same word would appear within the gap and after the gap. I think this is more of an artifact of this specific example with an unrealistically small vocabulary.

References

[^1]: awahl1, MERGE. 2017. Accessed: Jul. 11, 2022. [Online]. Available: https://github.com/awahl1/MERGE

[^2]: A. Wahl and S. Th. Gries, “Multi-word Expressions: A Novel Computational Approach to Their Bottom-Up Statistical Extraction,” in Lexical Collocation Analysis, P. Cantos-Gómez and M. Almela-Sánchez, Eds. Cham: Springer International Publishing, 2018, pp. 85–109. doi: 10.1007/978-3-319-92582-0_5.

[^3]: A. Wahl, “The Distributional Learning of Multi-Word Expressions: A Computational Approach,” p. 190.

[^4]: G. Bouma, “Normalized (Pointwise) Mutual Information in Collocation Extraction,” p. 11.

[^5]: T. Dunning, “Accurate Methods for the Statistics of Surprise and Coincidence,” Computational Linguistics, vol. 19, no. 1, p. 14.

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

0.4.1

Feb 18, 2026

This version

0.4.0

Feb 16, 2026

0.3.1

Feb 16, 2026

0.3.0

Feb 16, 2026

0.2.1

Oct 12, 2022

0.2.0

Oct 4, 2022

0.1.1

Sep 30, 2022

0.1.0

Sep 30, 2022

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

remerge_mwe-0.4.0.tar.gz (218.4 kB view details)

Uploaded Feb 16, 2026 Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

remerge_mwe-0.4.0-cp312-abi3-win_amd64.whl (1.0 MB view details)

Uploaded Feb 16, 2026 CPython 3.12+Windows x86-64

remerge_mwe-0.4.0-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB view details)

Uploaded Feb 16, 2026 CPython 3.12+manylinux: glibc 2.17+ x86-64

remerge_mwe-0.4.0-cp312-abi3-macosx_11_0_arm64.whl (1.1 MB view details)

Uploaded Feb 16, 2026 CPython 3.12+macOS 11.0+ ARM64

File details

Details for the file remerge_mwe-0.4.0.tar.gz.

File metadata

Download URL: remerge_mwe-0.4.0.tar.gz
Upload date: Feb 16, 2026
Size: 218.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.10.3 {"installer":{"name":"uv","version":"0.10.3","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for remerge_mwe-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`e51188dcad36a937d04673c9cc97261879eed2c7bb82e45e7fbea16278871060`
MD5	`34e219a3ed169442498c5e69c8bb3289`
BLAKE2b-256	`131faea01161475e037db3e5621823181f1bd7936abd1140e54a7e8993c8170d`

See more details on using hashes here.

File details

Details for the file remerge_mwe-0.4.0-cp312-abi3-win_amd64.whl.

File metadata

Download URL: remerge_mwe-0.4.0-cp312-abi3-win_amd64.whl
Upload date: Feb 16, 2026
Size: 1.0 MB
Tags: CPython 3.12+, Windows x86-64
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.10.3 {"installer":{"name":"uv","version":"0.10.3","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for remerge_mwe-0.4.0-cp312-abi3-win_amd64.whl
Algorithm	Hash digest
SHA256	`e340c95553e98da02ff4811dcc9b91f8fcd98286babc6902ba315f51657d9793`
MD5	`b400814879730aa1819a154912068153`
BLAKE2b-256	`2b148e13e9a489ac3e46d381b0e6811495e5d869dd899da5f350e60ec2c535b9`

See more details on using hashes here.

File details

Details for the file remerge_mwe-0.4.0-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

Download URL: remerge_mwe-0.4.0-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Upload date: Feb 16, 2026
Size: 1.4 MB
Tags: CPython 3.12+, manylinux: glibc 2.17+ x86-64
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.10.3 {"installer":{"name":"uv","version":"0.10.3","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for remerge_mwe-0.4.0-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`a64afe01ef94b937c0d716a23b9d9f60128564e4b9ea880ca44aefc95c2a2917`
MD5	`43914bd73278785a986d762d7161d01b`
BLAKE2b-256	`dac6c3206d5255044f56a53323a09b55deb10c6ad46c4adad5d794a5e362aa25`

See more details on using hashes here.

File details

Details for the file remerge_mwe-0.4.0-cp312-abi3-macosx_11_0_arm64.whl.

File metadata

Download URL: remerge_mwe-0.4.0-cp312-abi3-macosx_11_0_arm64.whl
Upload date: Feb 16, 2026
Size: 1.1 MB
Tags: CPython 3.12+, macOS 11.0+ ARM64
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.10.3 {"installer":{"name":"uv","version":"0.10.3","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for remerge_mwe-0.4.0-cp312-abi3-macosx_11_0_arm64.whl
Algorithm	Hash digest
SHA256	`e1c35c208256546dbad802ec8af4d1337200046d453045922840cca96328a2c9`
MD5	`66061dae520f229a87de2171455d1615`
BLAKE2b-256	`6b9cc5807b1133e4142595f8ccbd43d6ed797ee93aa74d23cbc5ee251d4c1880`

See more details on using hashes here.

remerge-mwe 0.4.0

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

REMERGE - Multi-Word Expression discovery algorithm

Usage

Migration (0.4.0)

API - remerge.run

API - remerge.annotate

Install

Development

Releasing

Rust/PyO3 Backlog

How it works

Limitations

Issues with Original Algorithm

Single Bigrams with discontinuities forming from distinct Lexeme positions

References

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distributions

File details

File metadata

File hashes

File details

File metadata

File hashes

File details

File metadata

File hashes

File details

File metadata

File hashes

Migration (`0.4.0`)

API - `remerge.run`

API - `remerge.annotate`