REMERGE - Multi-Word Expression discovery algorithm

REMERGE is a greedy, bottom-up Multi-Word Expression (MWE) discovery algorithm derived from MERGE[^1][^2][^3].

Current implementation:

Rust core engine (PyO3 extension) for fast iteration on large corpora
Python API (remerge.run, remerge.annotate) with typed ergonomic results
Deterministic tie-breaking and configurable scoring methods

The algorithm is non-parametric with respect to final MWE length: you set iterations, and merged expressions can grow as long as the corpus supports.

Install

Latest release:

pip install -U remerge-mwe

Latest from GitHub:

pip install git+https://github.com/pmbaumgartner/remerge-mwe.git

Quickstart

import remerge

corpus = [
    "a list of already tokenized texts",
    "where each item is a document string",
    "isn't this API nice",
]

winners = remerge.run(corpus, iterations=1, method="frequency")

winner = winners[0]
print(winner)                    # merged phrase text (via WinnerInfo.__str__)
print(winner.merged_lexeme.word) # ('a', 'list')
print(winner.score)              # score used for winner selection

Selection methods

Available winner-selection methods:

"log_likelihood" (default; Log-Likelihood / G²[^5])
"npmi" (Normalized PMI[^4])
"frequency"

You can pass either enum values (remerge.SelectionMethod.frequency) or strings ("frequency").
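
For readers unfamiliar with the log-likelihood score, here is a minimal sketch of Dunning's G² statistic[^5] for a single bigram, computed from a 2x2 contingency table of counts. The function name `g2` and the cell names `k11` to `k22` are illustrative and not part of the remerge API. The source does not say that the Rust engine computes the score in exactly this form.

```python
import math


def g2(k11: int, k12: int, k21: int, k22: int) -> float:
    """G² for a 2x2 table of bigram counts (illustrative, not remerge's code).

    k11: left and right token occur together
    k12: left token followed by something else
    k21: something else followed by the right token
    k22: neither token in its position
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))

    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # an empty cell contributes nothing to the sum
                expected = rows[i] * cols[j] / total
                g += o * math.log(o / expected)
    return 2.0 * g
```

A table whose cells match the counts expected under independence scores 0, and the score grows as the pair co-occurs more often than chance predicts.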

Tie-breaking is deterministic: score, then frequency, then lexicographic merged-token order.
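
The tie-breaking order can be pictured as a composite sort key. The sketch below assumes that a higher score and then a higher frequency are preferred, and that the lexicographically smallest merged tokens win any remaining tie. The text above gives the order of the criteria but not their direction, so treat the directions as assumptions. `Candidate` and `pick_winner` are hypothetical names, not remerge types.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """Hypothetical stand-in for a scored bigram candidate."""

    tokens: tuple[str, ...]
    score: float
    frequency: int


def pick_winner(candidates: list[Candidate]) -> Candidate:
    # Assumed directions: highest score, then highest frequency,
    # then lexicographically smallest merged tokens.
    return min(candidates, key=lambda c: (-c.score, -c.frequency, c.tokens))
```

Because the key is a total order over distinct candidates, the same input always yields the same winner regardless of input ordering.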

NPMI guidance

NPMI can saturate near 1.0 for rare but exclusive pairs, especially on large corpora. That can surface low-frequency artifacts unless min_count is set high enough.

Practical guidance:

For discovery and ranking stability, prefer log_likelihood (the default).
For NPMI, tune min_count aggressively and sweep upward until top results stabilize.
On real corpora, values significantly above toy defaults are often needed.

# Example starting point for larger corpora (tune further as needed)
winners = remerge.run(corpus, 100, method="npmi", min_count=200)

This aligns with the PMI caveat that infrequent pairs can dominate rankings[^4].
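
To see the saturation concretely, the sketch below computes NPMI from raw counts using the standard definition[^4]. It is a simplified illustration: the helper name `npmi` and the example counts are made up, and it uses one shared total for unigram and bigram probabilities, which may differ from how remerge counts.

```python
import math


def npmi(pair_count: int, left_count: int, right_count: int, total: int) -> float:
    """Normalized PMI from counts (illustrative; assumes pair_count < total)."""
    p_xy = pair_count / total
    p_x = left_count / total
    p_y = right_count / total
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)


# A pair seen once, whose tokens never occur elsewhere, in a corpus of one million positions.
rare = npmi(pair_count=1, left_count=1, right_count=1, total=1_000_000)

# A frequent pair whose tokens also occur in many other contexts.
common = npmi(pair_count=5_000, left_count=20_000, right_count=30_000, total=1_000_000)

print(rare, common)  # rare is approximately 1.0, common is about 0.40
```

The single-occurrence pair outranks the pair seen 5,000 times, which is why a `min_count` floor matters for this method.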

Live progress output

For long runs, set progress=True to print live merge progress to stderr:

winners = remerge.run(corpus, 500, progress=True)

API - `remerge.run`

Argument	Type	Description
`corpus`	`list[str]`	Corpus of document strings. Documents are split into segments by `splitter`, then tokenized with Rust whitespace splitting.
`iterations`	`int`	Maximum number of iterations.
`method`	`SelectionMethod | str`, optional	Winner-selection method: `"log_likelihood"`, `"npmi"`, or `"frequency"`. Default: `"log_likelihood"`.
`min_count`	`int`, optional	Minimum bigram frequency required to be considered for winner selection. Default: `0`.
`splitter`	`Splitter | str`, optional	Strategy used to split documents into segments, for example `"sentencex"`.
`line_delimiter`	`str | None`, optional
`sentencex_language`	`str`, optional	Language code for `splitter="sentencex"`. Default: `"en"`.
`rescore_interval`	`int`, optional	Full-rescore interval for LL/NPMI. `1` means full rescore every iteration; larger values trade exactness for speed. Default: `25`.
`on_exhausted`	`ExhaustionPolicy | str`, optional
`min_score`	`float | None`, optional
`progress`	`bool`, optional	If `True`, prints live merge progress to `stderr`. Default: `False`.

run() returns list[WinnerInfo].

Each WinnerInfo contains:

bigram
merged_lexeme
score
merge_token_count

Convenience helpers:

str(winner) and winner.text for merged phrase text
winner.token_count (alias of winner.n_lexemes) for merged token count
str(winner.merged_lexeme) / winner.merged_lexeme.text for lexeme text
winner.merged_lexeme.token_count for lexeme token count

API - `remerge.annotate`

annotate() runs the same merge process as run(), then returns:

(winners, annotated_docs, labels)

Where:

winners: list[WinnerInfo]
annotated_docs: list[str] of annotated output documents
labels: sorted unique list of annotation labels generated

Arguments shared with run():

corpus, iterations, method, min_count, splitter, line_delimiter
sentencex_language, rescore_interval, on_exhausted, min_score, progress

annotate()-specific arguments:

mwe_prefix: str = "<mwe:"
mwe_suffix: str = ">"
token_separator: str = "_"
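
As a rough picture of how these three arguments could combine, the sketch below builds one annotation label from a merged expression's tokens. This is an assumption about the output shape, inferred only from the argument names and defaults above. `format_mwe` is a hypothetical helper, not a remerge function.

```python
from collections.abc import Sequence


def format_mwe(
    tokens: Sequence[str],
    mwe_prefix: str = "<mwe:",
    mwe_suffix: str = ">",
    token_separator: str = "_",
) -> str:
    # Assumed composition: prefix + tokens joined by the separator + suffix.
    return f"{mwe_prefix}{token_separator.join(tokens)}{mwe_suffix}"


print(format_mwe(("a", "list")))  # <mwe:a_list>
```

Under this assumption, choosing a prefix and suffix that cannot appear in ordinary tokens keeps labels easy to find in the annotated documents.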

Tokenization and output normalization

Tokenization uses Rust split_whitespace().

Implications:

Original whitespace formatting is not preserved.
Annotated output reconstructs segments using normalized single-space joins.
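
The effect on whitespace can be previewed in pure Python. The sketch below uses `str.split()` as an approximate analogue of Rust's `split_whitespace()`. Both split on runs of whitespace and drop empty tokens, though the exact set of characters treated as whitespace may differ slightly. `normalize` is an illustrative helper, not part of remerge.

```python
def normalize(segment: str) -> str:
    # Split on any run of whitespace, then rejoin with single spaces.
    tokens = segment.split()
    return " ".join(tokens)


print(normalize("  isn't   this\tAPI \n nice "))  # isn't this API nice
```

Leading and trailing whitespace, tabs, newlines, and repeated spaces all collapse, so offsets into the original text do not carry over to the output.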

Performance and scaling notes

Internal location tracking is intentionally memory-intensive.
For large corpora, tune min_count and keep iterations practical.
rescore_interval=1 gives exact LL/NPMI rescoring each iteration; larger values trade exactness for speed.

Development

This project uses uv, ruff, and ty.

# Sync environment
uv sync --all-groups

# Build/install Rust extension into the active env
uv run --no-sync maturin develop

# Python checks
uv run ruff format src tests
uv run ruff check src tests
uv run ty check src tests
uv run --no-sync pytest -v -m "not corpus and not parity"

# Slower corpus/parity suite
uv run --no-sync pytest -v -m "corpus or parity"

If you change files under rust/, rebuild the extension before running Python tests:

uv run --no-sync maturin develop

Releasing (maintainers)

Releases are automated by .github/workflows/release.yml.

At a minimum:

Keep pyproject.toml and Cargo.toml versions aligned.
Push to main.
Tag with vX.Y.Z to trigger release publication.

Use bin/pypi-smoke.py to validate the newest published package from PyPI.

How it works

Each iteration:

Score candidate bigrams.
Select the winner.
Merge winner occurrences into a new lexeme.
Update internal bigram/lexeme state.

Lexemes use (word, ix) semantics, where ix=0 is the root position and only root lexemes participate in bigram formation.
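
The four steps above can be sketched in pure Python using raw frequency as the score. This is a toy model under stated assumptions, not the Rust engine. It recounts every bigram on each iteration instead of updating state incrementally, and it represents each lexeme as a tuple of words rather than with `(word, ix)` entries. It also breaks ties by the lexicographically smallest pair. `merge_once` and `greedy_merge` are hypothetical names.

```python
from collections import Counter

Lexeme = tuple[str, ...]


def merge_once(docs: list[list[Lexeme]]):
    """Run one greedy iteration; return (winning pair or None, updated docs)."""
    # 1. Score candidate bigrams (here: raw adjacent-pair frequency).
    counts: Counter = Counter()
    for doc in docs:
        counts.update(zip(doc, doc[1:]))
    if not counts:
        return None, docs

    # 2. Select the winner: highest count, ties to the smallest pair.
    winner = min(counts, key=lambda pair: (-counts[pair], pair))
    merged = winner[0] + winner[1]

    # 3. Merge winner occurrences left to right, without overlaps.
    new_docs = []
    for doc in docs:
        out, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) == winner:
                out.append(merged)
                i += 2
            else:
                out.append(doc[i])
                i += 1
        new_docs.append(out)

    # 4. The "state update" is the rebuilt docs, recounted next iteration.
    return winner, new_docs


def greedy_merge(texts: list[str], iterations: int) -> list[Lexeme]:
    docs = [[(word,) for word in text.split()] for text in texts]
    winners = []
    for _ in range(iterations):
        winner, docs = merge_once(docs)
        if winner is None:
            break
        winners.append(winner[0] + winner[1])
    return winners


print(greedy_merge(["new york city", "new york times"], iterations=2))
# [('new', 'york'), ('new', 'york', 'city')]
```

The second winner extends the first, which illustrates how merged expressions can grow past two words without a preset length limit.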

Limitations

REMERGE is greedy/agglomerative: early winner choices can influence later merges.
Different methods (frequency, log_likelihood, npmi) can produce materially different inventories depending on corpus/domain.

Notes on the original MERGE gapsize behavior

This implementation intentionally excludes discontinuous/gapped bigram merging. The old gapsize path could conflate distinct positional configurations in edge cases, which made behavior harder to reason about and validate.

References

[^1]: awahl1, MERGE. 2017. Accessed: Jul. 11, 2022. [Online]. Available: https://github.com/awahl1/MERGE

[^2]: A. Wahl and S. Th. Gries, “Multi-word Expressions: A Novel Computational Approach to Their Bottom-Up Statistical Extraction,” in Lexical Collocation Analysis, P. Cantos-Gómez and M. Almela-Sánchez, Eds. Cham: Springer International Publishing, 2018, pp. 85–109. doi: 10.1007/978-3-319-92582-0_5.

[^3]: A. Wahl, “The Distributional Learning of Multi-Word Expressions: A Computational Approach,” p. 190.

[^4]: G. Bouma, “Normalized (Pointwise) Mutual Information in Collocation Extraction,” p. 11.

[^5]: T. Dunning, “Accurate Methods for the Statistics of Surprise and Coincidence,” Computational Linguistics, vol. 19, no. 1, pp. 61–74, 1993.
