REMERGE - Multi-Word Expression discovery algorithm

REMERGE is a greedy, bottom-up Multi-Word Expression (MWE) discovery algorithm derived from MERGE[^1][^2][^3].

Current implementation:

  • Rust core engine (PyO3 extension) for fast iteration on large corpora
  • Python API (remerge.run, remerge.annotate) with typed ergonomic results
  • Deterministic tie-breaking and configurable scoring methods

The algorithm is non-parametric with respect to final MWE length: you set iterations, and merged expressions can grow as long as the corpus supports.

Install

Latest release:

pip install -U remerge-mwe

Latest from GitHub:

pip install git+https://github.com/pmbaumgartner/remerge-mwe.git

Quickstart

import remerge

corpus = [
    "a list of already tokenized texts",
    "where each item is a document string",
    "isn't this API nice",
]

winners = remerge.run(corpus, iterations=1, method="frequency")

winner = winners[0]
print(winner)                    # merged phrase text (via WinnerInfo.__str__)
print(winner.merged_lexeme.word) # ('a', 'list')
print(winner.score)              # score used for winner selection

Selection methods

Available winner-selection methods:

  • "log_likelihood" (default; Log-Likelihood / G²[^5])
  • "npmi" (Normalized PMI[^4])
  • "frequency"

You can pass either enum values (remerge.SelectionMethod.frequency) or strings ("frequency").
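To give a rough sense of what a log-likelihood (G²) score measures for a candidate bigram, the sketch below computes the statistic from a 2×2 contingency table of bigram counts. The function name `g2_bigram` and its arguments are hypothetical and not part of the remerge API; the Rust engine's exact computation may differ.

```python
import math


def g2_bigram(pair_count: int, left_count: int, right_count: int, total: int) -> float:
    """G^2 for a bigram (left, right), computed from bigram counts.

    pair_count:  occurrences of the bigram (left, right)
    left_count:  bigrams whose first element is `left`
    right_count: bigrams whose second element is `right`
    total:       all bigrams in the corpus (must be > 0)
    """
    # Observed cells of the 2x2 table: (left, right), (left, other),
    # (other, right), (other, other).
    observed = [
        pair_count,
        left_count - pair_count,
        right_count - pair_count,
        total - left_count - right_count + pair_count,
    ]
    # Row and column marginals matching each observed cell.
    rows = [left_count, left_count, total - left_count, total - left_count]
    cols = [right_count, total - right_count, right_count, total - right_count]

    g2 = 0.0
    for o, r, c in zip(observed, rows, cols):
        if o > 0:  # an empty cell contributes nothing to the sum
            expected = r * c / total
            g2 += o * math.log(o / expected)
    return 2.0 * g2


independent = g2_bigram(25, 50, 50, 100)   # counts match independence exactly
associated = g2_bigram(10, 10, 10, 100)    # the two words only occur together
```

When observed counts match what independence predicts, the score is zero; the more the counts deviate, the larger it grows.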

Tie-breaking is deterministic: score, then frequency, then lexicographic merged-token order.
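A minimal sketch of such an ordering, assuming that a higher score and a higher frequency win and that remaining ties go to the lexicographically smallest token tuple. The direction of the lexicographic comparison is an assumption, and the `Candidate` type and `pick_winner` function are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    tokens: tuple[str, ...]
    score: float
    frequency: int


def pick_winner(candidates: list[Candidate]) -> Candidate:
    # Negating score and frequency makes min() prefer the highest score,
    # then the highest frequency; any remaining tie goes to the smallest
    # token tuple (assumed direction).
    return min(candidates, key=lambda c: (-c.score, -c.frequency, c.tokens))


candidates = [
    Candidate(("new", "york"), 3.0, 5),
    Candidate(("a", "list"), 3.0, 5),
    Candidate(("of", "the"), 3.0, 9),
    Candidate(("in", "a"), 2.0, 20),
]
winner = pick_winner(candidates)
print(winner.tokens)  # ('of', 'the')
```

Because the key is a total order over distinct candidates, the same input always yields the same winner regardless of iteration order.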

NPMI guidance

NPMI can saturate near 1.0 for rare but exclusive pairs, especially on large corpora. That can surface low-frequency artifacts unless min_count is set high enough.

Practical guidance:

  • For discovery and ranking stability, prefer log_likelihood (the default).
  • For NPMI, tune min_count aggressively and sweep upward until top results stabilize.
  • On real corpora, min_count values well above the default of 0 are often needed.

# Example starting point for larger corpora (tune further as needed)
winners = remerge.run(corpus, iterations=100, method="npmi", min_count=200)

This aligns with the PMI caveat that infrequent pairs can dominate rankings[^4].
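The saturation effect can be reproduced with a few lines of arithmetic. The sketch below uses the standard NPMI definition, ln(p(x,y) / (p(x)p(y))) / -ln p(x,y), on made-up counts; the `npmi` helper is hypothetical and is not the library's implementation.

```python
import math


def npmi(pair_count: int, left_count: int, right_count: int, total: int) -> float:
    """Normalized PMI from raw counts; assumes 0 < pair_count < total."""
    p_xy = pair_count / total
    p_x = left_count / total
    p_y = right_count / total
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)


total = 1_000_000  # made-up corpus size

# A pair seen once, whose two words never occur elsewhere.
rare = npmi(1, 1, 1, total)
# A pair seen 500 times, whose words also occur in other contexts.
frequent = npmi(500, 1000, 800, total)

print(round(rare, 3), round(frequent, 3))

# A frequency floor in the spirit of min_count removes the hapax pair
# before ranking.
counts = {("zyx", "qwv"): 1, ("machine", "learning"): 500}
min_count = 200
eligible = {pair: n for pair, n in counts.items() if n >= min_count}
```

The one-off pair scores at the top of the scale even though it carries almost no evidence, which is why a frequency floor matters for this method.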

Live progress output

For long runs, set progress=True to print live merge progress to stderr:

winners = remerge.run(corpus, 500, progress=True)

API - remerge.run

Argument            Type                               Description
corpus              list[str]                          Corpus of document strings. Documents are split into segments by splitter, then tokenized with Rust whitespace splitting.
iterations          int                                Maximum number of iterations.
method              SelectionMethod | str, optional    Winner-selection method: "log_likelihood", "npmi", or "frequency". Default: "log_likelihood".
min_count           int, optional                      Minimum bigram frequency required to be considered for winner selection. Default: 0.
splitter            Splitter | str, optional
line_delimiter      str | None, optional
sentencex_language  str, optional                      Language code for splitter="sentencex". Default: "en".
rescore_interval    int, optional                      Full-rescore interval for LL/NPMI. 1 means full rescore every iteration; larger values trade exactness for speed. Default: 25.
on_exhausted        ExhaustionPolicy | str, optional
min_score           float | None, optional
progress            bool, optional                     If True, prints live merge progress to stderr. Default: False.

run() returns list[WinnerInfo].

Each WinnerInfo contains:

  • bigram
  • merged_lexeme
  • score
  • merge_token_count

Convenience helpers:

  • str(winner) and winner.text for merged phrase text
  • winner.token_count (alias of winner.n_lexemes) for merged token count
  • str(winner.merged_lexeme) / winner.merged_lexeme.text for lexeme text
  • winner.merged_lexeme.token_count for lexeme token count

API - remerge.annotate

annotate() runs the same merge process as run(), then returns:

(winners, annotated_docs, labels)

Where:

  • winners: list[WinnerInfo]
  • annotated_docs: list[str] of annotated output documents
  • labels: sorted unique list of annotation labels generated

Arguments shared with run():

  • corpus, iterations, method, min_count, splitter, line_delimiter
  • sentencex_language, rescore_interval, on_exhausted, min_score, progress

annotate()-specific arguments:

  • mwe_prefix: str = "<mwe:"
  • mwe_suffix: str = ">"
  • token_separator: str = "_"
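As an illustration of how these three arguments could combine, the sketch below assumes an annotated MWE is rendered as prefix + separator-joined tokens + suffix. That rendering, and the helper names `format_mwe` and `annotate_tokens`, are assumptions for illustration rather than the library's confirmed behavior.

```python
def format_mwe(
    tokens: tuple[str, ...],
    mwe_prefix: str = "<mwe:",
    mwe_suffix: str = ">",
    token_separator: str = "_",
) -> str:
    # Assumed rendering: prefix, tokens joined by the separator, suffix.
    return f"{mwe_prefix}{token_separator.join(tokens)}{mwe_suffix}"


def annotate_tokens(tokens: list[str], mwe: tuple[str, ...]) -> str:
    """Replace each left-to-right occurrence of `mwe` with its label."""
    if not mwe:
        raise ValueError("mwe must contain at least one token")
    out: list[str] = []
    i = 0
    while i < len(tokens):
        if tuple(tokens[i : i + len(mwe)]) == mwe:
            out.append(format_mwe(mwe))
            i += len(mwe)  # skip past the matched tokens
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


annotated = annotate_tokens(
    "thanks in spite of the rain".split(), ("in", "spite", "of")
)
print(annotated)  # thanks <mwe:in_spite_of> the rain
```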

Tokenization and output normalization

Tokenization uses Rust split_whitespace().

Implications:

  • Original whitespace formatting is not preserved.
  • Annotated output reconstructs segments using normalized single-space joins.
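The effect on whitespace can be approximated in pure Python. `str.split()` with no arguments behaves much like Rust's `split_whitespace()` for ordinary spaces, tabs, and newlines, although the two do not agree on every Unicode character, so treat this as an approximation. The `normalize_segment` helper is hypothetical.

```python
def normalize_segment(text: str) -> str:
    # split() with no arguments splits on runs of whitespace and drops
    # leading/trailing whitespace; joining with a single space gives the
    # normalized form.
    return " ".join(text.split())


original = "  tabs\tand   double  spaces\n"
normalized = normalize_segment(original)
print(repr(normalized))  # 'tabs and double spaces'
```

If downstream code needs the original character offsets, keep the source text alongside the annotated output, since the normalized form cannot be mapped back.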

Performance and scaling notes

  • Internal location tracking is intentionally memory-intensive.
  • For large corpora, tune min_count and keep iterations practical.
  • rescore_interval=1 gives exact LL/NPMI rescoring each iteration; larger values trade exactness for speed.

Development

This project uses uv, ruff, and ty.

# Sync environment
uv sync --all-groups

# Build/install Rust extension into the active env
uv run --no-sync maturin develop

# Python checks
uv run ruff format src tests
uv run ruff check src tests
uv run ty check src tests
uv run --no-sync pytest -v -m "not corpus and not parity"

# Slower corpus/parity suite
uv run --no-sync pytest -v -m "corpus or parity"

If you change files under rust/, rebuild the extension before running Python tests:

uv run --no-sync maturin develop

Releasing (maintainers)

Releases are automated by .github/workflows/release.yml.

At a minimum:

  1. Keep pyproject.toml and Cargo.toml versions aligned.
  2. Push to main.
  3. Tag with vX.Y.Z to trigger release publication.

Use bin/pypi-smoke.py to validate the newest published package from PyPI.

How it works

Each iteration:

  1. Score candidate bigrams.
  2. Select the winner.
  3. Merge winner occurrences into a new lexeme.
  4. Update internal bigram/lexeme state.

Lexemes use (word, ix) semantics, where ix=0 is the root position and only root lexemes participate in bigram formation.
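The loop above can be sketched as a toy pure-Python version that scores by raw frequency. It is a simplification for illustration only: it recounts every bigram on each iteration, represents lexemes as plain word tuples rather than (word, ix) entries, and assumes ties go to the lexicographically smallest pair. None of the function names below belong to the remerge API.

```python
from collections import Counter

Lexeme = tuple[str, ...]


def count_bigrams(docs: list[list[Lexeme]]) -> Counter:
    # Step 1 (score): count adjacent lexeme pairs in every document.
    counts: Counter = Counter()
    for doc in docs:
        counts.update(zip(doc, doc[1:]))
    return counts


def merge_pair(doc: list[Lexeme], left: Lexeme, right: Lexeme) -> list[Lexeme]:
    # Step 3 (merge): replace non-overlapping occurrences, left to right.
    merged: list[Lexeme] = []
    i = 0
    while i < len(doc):
        if i + 1 < len(doc) and doc[i] == left and doc[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(doc[i])
            i += 1
    return merged


def toy_remerge(corpus: list[str], iterations: int) -> list[Lexeme]:
    docs = [[(tok,) for tok in text.split()] for text in corpus]
    winners: list[Lexeme] = []
    for _ in range(iterations):
        counts = count_bigrams(docs)
        if not counts:
            break  # nothing left to merge
        # Step 2 (select): highest count, then smallest pair (assumed).
        (left, right), _freq = min(
            counts.items(), key=lambda kv: (-kv[1], kv[0])
        )
        winners.append(left + right)
        # Step 4 (update): here the state is simply rebuilt next iteration.
        docs = [merge_pair(doc, left, right) for doc in docs]
    return winners


corpus = ["in spite of rain", "in spite of snow", "in spite of it"]
result = toy_remerge(corpus, 2)
print(result)  # [('in', 'spite'), ('in', 'spite', 'of')]
```

The second winner builds on the first, which shows how merged expressions can grow beyond two words without a preset length limit.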

[Figure: An explanation of the remerge algorithm]

Limitations

  • REMERGE is greedy/agglomerative: early winner choices can influence later merges.
  • Different methods (frequency, log_likelihood, npmi) can produce materially different inventories depending on corpus/domain.

Notes on the original MERGE gapsize behavior

This implementation intentionally excludes discontinuous/gapped bigram merging. The old gapsize path could conflate distinct positional configurations in edge cases, which made behavior harder to reason about and validate.

References

[^1]: awahl1, MERGE. 2017. Accessed: Jul. 11, 2022. [Online]. Available: https://github.com/awahl1/MERGE

[^2]: A. Wahl and S. Th. Gries, “Multi-word Expressions: A Novel Computational Approach to Their Bottom-Up Statistical Extraction,” in Lexical Collocation Analysis, P. Cantos-Gómez and M. Almela-Sánchez, Eds. Cham: Springer International Publishing, 2018, pp. 85–109. doi: 10.1007/978-3-319-92582-0_5.

[^3]: A. Wahl, “The Distributional Learning of Multi-Word Expressions: A Computational Approach,” p. 190.

[^4]: G. Bouma, “Normalized (Pointwise) Mutual Information in Collocation Extraction,” p. 11.

[^5]: T. Dunning, “Accurate Methods for the Statistics of Surprise and Coincidence,” Computational Linguistics, vol. 19, no. 1, p. 14.
