REMERGE - Multi-Word Expression discovery algorithm
REMERGE is a greedy, bottom-up Multi-Word Expression (MWE) discovery algorithm derived from MERGE[^1][^2][^3].
Current implementation:
- Rust core engine (PyO3 extension) for fast iteration on large corpora
- Python API (`remerge.run`, `remerge.annotate`) with typed, ergonomic results
- Deterministic tie-breaking and configurable scoring methods
The algorithm is non-parametric with respect to final MWE length: you set iterations, and merged expressions can grow as long as the corpus supports.
Install
Latest release:
```bash
pip install -U remerge-mwe
```
Latest from GitHub:
```bash
pip install git+https://github.com/pmbaumgartner/remerge-mwe.git
```
Quickstart
```python
import remerge

corpus = [
    "a list of already tokenized texts",
    "where each item is a document string",
    "isn't this API nice",
]

winners = remerge.run(corpus, iterations=1, method="frequency")
winner = winners[0]

print(winner)  # merged phrase text (via WinnerInfo.__str__)
print(winner.merged_lexeme.word)  # ('a', 'list')
print(winner.score)  # score used for winner selection
```
Selection methods
Available winner-selection methods:
- `"log_likelihood"` (default; Log-Likelihood / G²[^5])
- `"npmi"` (Normalized PMI[^4])
- `"frequency"`
You can pass either enum values (remerge.SelectionMethod.frequency) or strings ("frequency").
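For intuition about the default method, the sketch below computes Dunning's G² for a single bigram from a 2x2 contingency table of counts. The function name `g2` and its count-based signature are illustrative assumptions, not part of the `remerge` API, and the exact counts the engine feeds into its score are not specified here.

```python
import math


def g2(pair: int, left: int, right: int, total: int) -> float:
    """G² for a bigram, given pair count, each word's count, and total bigrams."""
    # Observed 2x2 table: rows = left word present/absent,
    # columns = right word present/absent.
    observed = [
        [pair, left - pair],
        [right - pair, total - left - right + pair],
    ]
    row = [sum(r) for r in observed]
    col = [sum(c) for c in zip(*observed)]
    score = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                expected = row[i] * col[j] / total
                score += o * math.log(o / expected)
    return 2.0 * score


# Words that co-occur exactly as often as chance predicts score about 0.
print(g2(pair=10, left=100, right=100, total=1000))
# Words that co-occur far more often than chance score much higher.
print(g2(pair=50, left=100, right=100, total=1000))
```

Because G² grows with the amount of evidence, a pair seen only a handful of times rarely outranks a well-attested one, which is one reason it tends to rank more stably than NPMI.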
Tie-breaking is deterministic: score, then frequency, then lexicographic merged-token order.
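A minimal sketch of that ordering, assuming higher scores and higher frequencies win and that the lexicographically smallest token tuple breaks any remaining tie. `Candidate` and `pick_winner` are hypothetical names for illustration; the direction of the lexicographic step is an assumption.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """Hypothetical candidate record; not a type from the remerge package."""

    tokens: tuple[str, ...]  # merged-token sequence
    score: float
    frequency: int


def pick_winner(candidates: list[Candidate]) -> Candidate:
    # Assumed order: highest score, then highest frequency,
    # then smallest token tuple. Negating the numeric fields lets
    # a single min() express all three criteria.
    return min(candidates, key=lambda c: (-c.score, -c.frequency, c.tokens))


candidates = [
    Candidate(("new", "york"), 2.0, 5),
    Candidate(("ice", "cream"), 2.0, 5),
    Candidate(("a", "b"), 1.0, 100),
]
print(pick_winner(candidates).tokens)
```

Because the key is a total order over candidates, the result does not depend on the order in which candidates are enumerated.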
NPMI guidance
NPMI can saturate near 1.0 for rare but exclusive pairs, especially on large corpora. That can surface low-frequency artifacts unless min_count is set high enough.
Practical guidance:
- For discovery and ranking stability, prefer `log_likelihood` (the default).
- For NPMI, tune `min_count` aggressively and sweep upward until top results stabilize.
- On real corpora, values significantly above toy defaults are often needed.

```python
# Example starting point for larger corpora (tune further as needed)
winners = remerge.run(corpus, 100, method="npmi", min_count=200)
```
This aligns with the PMI caveat that infrequent pairs can dominate rankings[^4].
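The saturation effect can be reproduced from the NPMI definition alone. The sketch below uses the standard formula (PMI divided by the negative log of the joint probability); the helper `npmi` and the counts are made up for illustration and are independent of how the engine computes its score.

```python
import math


def npmi(pair_count: int, left_count: int, right_count: int, total: int) -> float:
    """Normalized PMI from raw counts; `total` is the number of bigram positions."""
    p_xy = pair_count / total
    p_x = left_count / total
    p_y = right_count / total
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)


total = 1_000_000
# A pair seen twice whose words never occur anywhere else.
rare = npmi(pair_count=2, left_count=2, right_count=2, total=total)
# A frequent pair whose words also appear in many other contexts.
common = npmi(pair_count=5_000, left_count=20_000, right_count=15_000, total=total)
print(f"rare exclusive pair: {rare:.3f}")
print(f"frequent pair:       {common:.3f}")
```

The exclusive pair reaches the maximum score no matter how few times it occurs, so only a frequency floor such as `min_count` keeps it out of the top ranks.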
Live progress output
For long runs, set progress=True to print live merge progress to stderr:
```python
winners = remerge.run(corpus, 500, progress=True)
```
API - remerge.run
| Argument | Type | Description |
|---|---|---|
| `corpus` | `list[str]` | Corpus of document strings. Documents are split into segments by `splitter`, then tokenized with Rust whitespace splitting. |
| `iterations` | `int` | Maximum number of iterations. |
| `method` | `SelectionMethod \| str`, optional | Winner-selection method (see Selection methods). Default: `"log_likelihood"`. |
| `min_count` | `int`, optional | Minimum bigram frequency required to be considered for winner selection. Default: `0`. |
| `splitter` | `Splitter \| str`, optional | Strategy used to split documents into segments before tokenization, for example `"sentencex"`. |
| `line_delimiter` | `str \| None`, optional | |
| `sentencex_language` | `str`, optional | Language code for `splitter="sentencex"`. Default: `"en"`. |
| `rescore_interval` | `int`, optional | Full-rescore interval for LL/NPMI. `1` means full rescore every iteration; larger values trade exactness for speed. Default: `25`. |
| `on_exhausted` | `ExhaustionPolicy \| str`, optional | |
| `min_score` | `float \| None`, optional | |
| `progress` | `bool`, optional | If `True`, prints live merge progress to stderr. Default: `False`. |
run() returns list[WinnerInfo].
Each WinnerInfo contains:
- `bigram`
- `merged_lexeme`
- `score`
- `merge_token_count`
Convenience helpers:
- `str(winner)` and `winner.text` for merged phrase text
- `winner.token_count` (alias of `winner.n_lexemes`) for merged token count
- `str(winner.merged_lexeme)` / `winner.merged_lexeme.text` for lexeme text
- `winner.merged_lexeme.token_count` for lexeme token count
API - remerge.annotate
annotate() runs the same merge process as run(), then returns:
(winners, annotated_docs, labels)
Where:
- `winners`: `list[WinnerInfo]`
- `annotated_docs`: `list[str]` of annotated output documents
- `labels`: sorted unique list of annotation labels generated
Arguments shared with run():
- `corpus`, `iterations`, `method`, `min_count`, `splitter`, `line_delimiter`
- `sentencex_language`, `rescore_interval`, `on_exhausted`, `min_score`, `progress`
annotate()-specific arguments:
- `mwe_prefix: str = "<mwe:"`
- `mwe_suffix: str = ">"`
- `token_separator: str = "_"`
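One plausible way these three settings combine into a label is sketched below. This is an assumption inferred from the default values, not a description of the engine's output format, and `format_mwe` is a hypothetical helper rather than part of the `remerge` API.

```python
def format_mwe(
    tokens,
    mwe_prefix: str = "<mwe:",
    mwe_suffix: str = ">",
    token_separator: str = "_",
) -> str:
    # Assumed layout: prefix + tokens joined by the separator + suffix.
    return f"{mwe_prefix}{token_separator.join(tokens)}{mwe_suffix}"


print(format_mwe(("in", "spite", "of")))
```

Under that assumption, picking a prefix and suffix that cannot occur in the corpus keeps annotated spans easy to find with a simple string search.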
Tokenization and output normalization
Tokenization uses Rust split_whitespace().
Implications:
- Original whitespace formatting is not preserved.
- Annotated output reconstructs segments using normalized single-space joins.
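The effect can be previewed in plain Python, whose `str.split()` with no arguments behaves much like Rust's `split_whitespace()` for typical text (the two differ slightly in which rare characters count as whitespace). `normalize_segment` is an illustrative helper, not a `remerge` function.

```python
def normalize_segment(segment: str) -> str:
    # Split on runs of whitespace, dropping leading and trailing whitespace,
    # then rejoin with single spaces as the annotated output does.
    tokens = segment.split()
    return " ".join(tokens)


print(normalize_segment("  multi-word\texpressions   are\nfun  "))
# multi-word expressions are fun
```

If downstream code needs character offsets into the original text, compute them before tokenization, since tabs, newlines, and repeated spaces are collapsed.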
Performance and scaling notes
- Internal location tracking is intentionally memory-intensive.
- For large corpora, tune `min_count` and keep `iterations` practical.
- `rescore_interval=1` gives exact LL/NPMI rescoring each iteration; larger values trade exactness for speed.
Development
This project uses uv, ruff, and ty.
```bash
# Sync environment
uv sync --all-groups

# Build/install Rust extension into the active env
uv run --no-sync maturin develop

# Python checks
uv run ruff format src tests
uv run ruff check src tests
uv run ty check src tests
uv run --no-sync pytest -v -m "not corpus and not parity"

# Slower corpus/parity suite
uv run --no-sync pytest -v -m "corpus or parity"
```
If you change files under rust/, rebuild the extension before running Python tests:
```bash
uv run --no-sync maturin develop
```
Releasing (maintainers)
Releases are automated by .github/workflows/release.yml.
At a minimum:
- Keep `pyproject.toml` and `Cargo.toml` versions aligned.
- Push to `main`.
- Tag with `vX.Y.Z` to trigger release publication.
Use bin/pypi-smoke.py to validate the newest published package from PyPI.
How it works
Each iteration:
- Score candidate bigrams.
- Select the winner.
- Merge winner occurrences into a new lexeme.
- Update internal bigram/lexeme state.
Lexemes use (word, ix) semantics, where ix=0 is the root position and only root lexemes participate in bigram formation.
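The loop above can be illustrated with a deliberately simplified pure-Python toy. It assumes raw frequency as the score, recounts every bigram from scratch on each pass instead of updating state incrementally, and models a lexeme as a plain tuple of words rather than with `(word, ix)` positions. All names here are hypothetical; this is not the Rust engine's implementation.

```python
from collections import Counter

# A lexeme is modeled as a tuple of words; single words are 1-tuples.
Lexeme = tuple[str, ...]


def count_bigrams(docs: list[list[Lexeme]]) -> Counter:
    """Count adjacent lexeme pairs across all documents."""
    counts: Counter = Counter()
    for doc in docs:
        counts.update(zip(doc, doc[1:]))
    return counts


def merge_pair(doc: list[Lexeme], left: Lexeme, right: Lexeme) -> list[Lexeme]:
    """Replace each non-overlapping (left, right) occurrence with one lexeme."""
    merged: list[Lexeme] = []
    i = 0
    while i < len(doc):
        if i + 1 < len(doc) and doc[i] == left and doc[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(doc[i])
            i += 1
    return merged


def toy_merge(corpus: list[str], iterations: int) -> list[Lexeme]:
    docs = [[(word,) for word in text.split()] for text in corpus]
    winners: list[Lexeme] = []
    for _ in range(iterations):
        counts = count_bigrams(docs)  # step 1: score (raw frequency here)
        if not counts:
            break
        # Step 2: highest count wins; ties go to the smallest merged tuple.
        left, right = min(counts, key=lambda pair: (-counts[pair], pair[0] + pair[1]))
        # Step 3: merge every occurrence of the winner.
        docs = [merge_pair(doc, left, right) for doc in docs]
        # Step 4: bigram state is rebuilt from scratch on the next pass.
        winners.append(left + right)
    return winners


corpus = ["in spite of the rain", "in spite of it all", "in spite of everything"]
print(toy_merge(corpus, iterations=2))
# [('in', 'spite'), ('in', 'spite', 'of')]
```

The second winner shows how expressions grow: once `in spite` is a single lexeme, it can pair with `of` on the next iteration.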
Limitations
- REMERGE is greedy/agglomerative: early winner choices can influence later merges.
- Different methods (`frequency`, `log_likelihood`, `npmi`) can produce materially different inventories depending on corpus/domain.
Notes on the original MERGE gapsize behavior
This implementation intentionally excludes discontinuous/gapped bigram merging. The old gapsize path could conflate distinct positional configurations in edge cases, which made behavior harder to reason about and validate.
References
[^1]: awahl1, MERGE. 2017. Accessed: Jul. 11, 2022. [Online]. Available: https://github.com/awahl1/MERGE
[^2]: A. Wahl and S. Th. Gries, “Multi-word Expressions: A Novel Computational Approach to Their Bottom-Up Statistical Extraction,” in Lexical Collocation Analysis, P. Cantos-Gómez and M. Almela-Sánchez, Eds. Cham: Springer International Publishing, 2018, pp. 85–109. doi: 10.1007/978-3-319-92582-0_5.
[^3]: A. Wahl, “The Distributional Learning of Multi-Word Expressions: A Computational Approach,” p. 190.
[^4]: G. Bouma, “Normalized (Pointwise) Mutual Information in Collocation Extraction,” p. 11.
[^5]: T. Dunning, “Accurate Methods for the Statistics of Surprise and Coincidence,” Computational Linguistics, vol. 19, no. 1, p. 14.