Cluster business names into entity groups; Python + Rust
name-cluster
Cluster business names into groups representing the same legal entity.
Given customs-form-style data where the same exporter appears as
"IBM USA", "International Business Machines Inc", "00 IBM", etc., the
library produces stable cluster IDs grouping these as one entity, plus a
canonical name per cluster. Designed for millions of names on CPU-only
deployments with limited memory; no GPU, no model downloads, no network.
Rust core (PyO3 binding via maturin) + Python adapter over narwhals, so the same call works on polars, pandas, and pyarrow inputs.
Install
PyPI publishing is pending. To install from source (requires the Rust toolchain):
git clone https://github.com/jessetweedle/name-cluster && cd name-cluster
uv venv .venv && source .venv/bin/activate
uv pip install maturin
maturin develop --release
Runtime deps (resolved automatically): narwhals, pyarrow. The lib has
no model weights, no data files, no network calls.
Quickstart
A runnable end-to-end demo lives at examples/quickstart.py. It exercises
every public symbol on synthetic data, plus an optional real-data section
that activates if you've populated the dev cache via
scripts/download_*.py. Run it with python examples/quickstart.py.
The minimal version:
import namecluster as nc
import polars as pl
df = pl.DataFrame({
    "name": [
        "Acme Corporation",
        "ACME Corp",
        "Acme Corporation Inc",
        "Apple Computer Co.",
        "Apple Inc",
        None,
    ],
})
result = nc.cluster(df, name_col="name")
shape: (6, 3)
┌──────────────────────┬────────────┬────────────────┐
│ name ┆ cluster_id ┆ canonical_name │
╞══════════════════════╪════════════╪════════════════╡
│ Acme Corporation ┆ 0 ┆ acme │
│ ACME Corp ┆ 0 ┆ acme │
│ Acme Corporation Inc ┆ 0 ┆ acme │
│ Apple Computer Co. ┆ 2 ┆ apple computer │
│ Apple Inc ┆ 1 ┆ apple │
│ null ┆ null ┆ null │
└──────────────────────┴────────────┴────────────────┘
(Cluster IDs are assigned 0..N-1 sorted by canonical name ascending, so
acme < apple < apple computer. Deterministic given the same seed.)
Same call works on a pandas DataFrame or a pyarrow Table — narwhals
detects the input type and returns the same type.
Public API
import namecluster as nc
# Main entry — cluster a name column on any dataframe
nc.cluster(data, name_col="name", threshold=0.85, seed=0, ...)
# REPL/notebook convenience for a flat list of strings
nc.cluster_names(["IBM Corp", "IBM Inc", "Apple Inc"]) # -> [0, 0, 1]
# Single-name normalization (debug what the lib actually compares)
nc.normalize("00 IBM Corp.") # -> "ibm"
# Synthetic data + cluster-quality metrics for evaluation
ds = nc.generate_examples(n_entities=100, difficulty="medium", seed=0)
metrics = nc.score_clusters(predicted, true) # ARI, F1, precision, recall
# LSH config picker for a target Jaccard + recall
nc.lsh_calibrate(target_jaccard=0.6, target_recall=0.95)
# -> {"bands": ..., "rows": ..., "num_perm": ..., "p_at_target": ..., "p_at_fp": ...}
# Debug: what pairs did LSH propose, and at what cosine score?
nc.candidates(df, name_col="name", min_score=0.5)
# -> df with name_a, name_b, normalized_a, normalized_b, score
# Debug: what's inside one cluster — members, edges, hub eccentricity?
nc.explain(result, cluster_id=42)
# -> {"canonical": "...", "members": [...], "edges": [(a, b, score), ...],
# "hub_radius": int, "size": int}
# Discover acronym↔expansion candidates from the corpus (feeds `aliases=`)
nc.acronym_map(df, name_col="name")
# -> df with (acronym, expansion_count, expansions, acronym_examples)
Common patterns
Block by country (recommended for cross-country corpora)
The library doesn't take a country_col kwarg — users do hard blocking
themselves to keep the API tight. Polars idiom:
result = (
    df.group_by("country", maintain_order=True)
    .map_groups(lambda g: nc.cluster(g, name_col="name"))
)
# Note: cluster_ids are call-local (each group starts at 0). If you
# concatenate groups, offset cluster_ids per group to make them unique.
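The offsetting described in the note above can be sketched in plain Python. This is only an illustration: `offset_cluster_ids` is a hypothetical helper, not part of namecluster, and it assumes each group's IDs are dense (0..N-1) with None standing in for null names.

```python
from typing import Optional


def offset_cluster_ids(
    groups: dict[str, list[Optional[int]]],
) -> dict[str, list[Optional[int]]]:
    """Shift each group's call-local cluster IDs so they are unique overall.

    Hypothetical helper, not part of namecluster. Assumes IDs within a group
    are dense (0..N-1) and that None marks a row with no cluster.
    """
    out: dict[str, list[Optional[int]]] = {}
    offset = 0
    for key, ids in groups.items():  # dicts iterate in insertion order
        out[key] = [None if i is None else i + offset for i in ids]
        present = [i for i in ids if i is not None]
        # The next group starts right after this group's largest ID.
        offset += (max(present) + 1) if present else 0
    return out


# Per-country results as they would come back from separate cluster() calls.
local = {"CA": [0, 0, 1, None], "US": [0, 1, 1], "MX": [None]}
global_ids = offset_cluster_ids(local)
print(global_ids)
```

The same running-offset idea carries over to a dataframe: add the group's offset to cluster_id before concatenating, and leave nulls as nulls.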
Tune the threshold
Higher threshold = more clusters (precision-favoring); lower = fewer
(recall-favoring). Default 0.85 is high precision. Sweep on a labeled
sample to find your domain's sweet spot:
for t in [0.75, 0.80, 0.85, 0.90, 0.95]:
    r = nc.cluster(df, name_col="name", threshold=t)
    metrics = nc.score_clusters(r["cluster_id"], known_labels)
    print(t, metrics)
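To make the precision/recall trade-off concrete, here is a minimal pairwise scorer in plain Python. It assumes the common pairwise definition (a pair is positive when both items share a label); nc.score_clusters may define its metrics differently, and `pairwise_scores` is a name made up for this sketch. It also assumes there are no null labels.

```python
from itertools import combinations


def pairwise_scores(predicted, true):
    """Pairwise precision/recall/F1 over all unordered pairs of items.

    Assumption: a pair counts as positive when both items share a label.
    This is one common definition, not necessarily the library's.
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(predicted)), 2):
        same_pred = predicted[i] == predicted[j]
        same_true = true[i] == true[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1  # merged by the clustering, but different entities
        elif same_true:
            fn += 1  # same entity, but split by the clustering
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


truth = [0, 0, 1, 1]
# Over-merged (threshold too low): everything lands in one cluster.
loose = pairwise_scores([0, 0, 0, 0], truth)
# Over-split (threshold too high): every name is its own cluster.
strict = pairwise_scores([0, 1, 2, 3], truth)
print(loose, strict)
```

The loop is quadratic in the number of names, so it is only suitable for a small labeled sample.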
Generator + round-trip evaluation
# Cap n_entities <= 50 (the embedded toy-canonical pool size) to avoid
# the wrap-around disambig suffix; or pass canonicals=[...] to use your own.
ds = nc.generate_examples(n_entities=40, difficulty="easy", seed=42)
result = nc.cluster(ds, name_col="variant_name")
metrics = nc.score_clusters(
    result["cluster_id"].to_pylist(),
    ds["true_entity_id"].to_pylist(),
)
# Easy difficulty target: ARI > 0.85, recall > 0.95
Difficulty levels (per ARCHITECTURE.md):
| level | edits per variant | typical cosine |
|---|---|---|
| easy | case + punct + suffix swap | ≥ 0.95 |
| medium | + abbr expansion + leading garbage + accent + THE toggle | ≥ 0.85 |
| hard | + char typos + word drop + geo suffix + spacing oddities | ≥ 0.70 |
Acronym / expansion aliases
Pass aliases={canonical: [alias, ...]} to force-merge an acronym with
its expansion (or any other variant pair the lib's char-n-gram cosine
won't bridge by itself):
nc.cluster(
    df, name_col="name",
    aliases={"International Business Machines": ["IBM", "I.B.M."]},
)
# All four — "IBM Corp", "I.B.M. Inc", "International Business Machines",
# "International Business Machines Inc" — collapse into one cluster with
# canonical_name="intl business machines".
Each canonical and each alias is normalized; post-normalize, every alias form is rewritten to the canonical's form before MinHash/LSH/TF-IDF run. If two canonicals map the same alias, the last one wins.
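The rewrite step described above can be sketched as a reverse lookup table. Everything here is illustrative: `toy_normalize` is a stand-in for the library's normalizer (it only lowercases and drops dots), and whole-string matching is an assumption about how aliases are applied.

```python
def toy_normalize(name: str) -> str:
    """Stand-in normalizer: lowercase, drop dots, collapse whitespace."""
    return " ".join(name.lower().replace(".", "").split())


def build_alias_lookup(aliases: dict[str, list[str]]) -> dict[str, str]:
    """Map each normalized alias to its normalized canonical form.

    A later canonical overwrites an earlier one for a shared alias,
    which gives the "last one wins" behavior.
    """
    lookup: dict[str, str] = {}
    for canonical, alias_list in aliases.items():
        target = toy_normalize(canonical)
        for alias in alias_list:
            lookup[toy_normalize(alias)] = target
    return lookup


def apply_aliases(normalized: str, lookup: dict[str, str]) -> str:
    """Rewrite a normalized name if the whole string is a known alias."""
    return lookup.get(normalized, normalized)


lookup = build_alias_lookup(
    {"International Business Machines": ["IBM", "I.B.M."]}
)
rewritten = [apply_aliases(toy_normalize(n), lookup) for n in ["I.B.M.", "IBM", "Apple"]]
print(rewritten)

# Two canonicals claiming the same alias: the later entry wins.
clash = build_alias_lookup({"Alpha Co": ["AC"], "Atlantic Co": ["AC"]})
print(clash["ac"])
```

Because the rewrite happens before any similarity scoring, the alias and its expansion become identical strings and merge regardless of the threshold.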
Custom suffix lists / abbreviations
Pass extra_suffixes=[...] and extra_canonical={...} to extend the
shipped lists for niche jurisdictions or industry abbreviations. Defaults
already cover ~45 international legal-form suffixes plus 15 descriptor
canonicalizations (see ARCHITECTURE.md § Normalization).
Note: the extra_suffixes / extra_canonical kwargs are spec'd in ARCHITECTURE.md but not yet plumbed through the public API. Tracked as a v1.x follow-up.
Configuration
All knobs are flat kwargs on cluster():
| kwarg | default | what it controls |
|---|---|---|
| threshold | 0.85 | cosine cutoff for the TF-IDF rerank |
| seed | 0 | deterministic RNG seed (MinHash + LSH bucket hashing) |
| ngram_size | 3 | char-n-gram window for vectors |
| lsh_bands | 32 | LSH band count |
| lsh_rows | 4 | LSH rows/band; num_perm = bands × rows |
| hub_radius_max | 2 | per-cluster diameter check threshold for the hub-radius split |
| diameter_check_min_size | 5 | skip diameter check on small clusters |
| max_name_length | 256 | truncate raw input names beyond this many bytes |
| aliases | None | acronym/expansion override map: {canonical: [alias, ...]} |
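To get a feel for what lsh_bands and lsh_rows do, the standard banded-MinHash collision probability can be computed directly. This is the textbook formula, not code taken from the library, and `lsh_candidate_probability` is a name invented for this sketch.

```python
def lsh_candidate_probability(jaccard: float, bands: int, rows: int) -> float:
    """Chance that two sets collide in at least one LSH band.

    Standard banded-MinHash result: 1 - (1 - s**rows) ** bands,
    where s is the Jaccard similarity of the two shingle sets.
    """
    return 1.0 - (1.0 - jaccard**rows) ** bands


bands, rows = 32, 4      # the documented defaults for lsh_bands / lsh_rows
num_perm = bands * rows  # number of MinHash permutations

for s in (0.2, 0.4, 0.6, 0.8):
    p = lsh_candidate_probability(s, bands, rows)
    print(f"jaccard={s:.1f}  P(candidate)={p:.3f}")
```

More bands raise the curve (more candidates, higher recall, more rerank work); more rows per band make it steeper. Note that this Jaccard similarity governs only candidate generation; the threshold kwarg applies later, to the TF-IDF cosine.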
Scope
- Input language: English with light non-ASCII accents (Café→cafe). Non-Latin-script names (CJK, Arabic, Cyrillic, Hebrew, etc.) post-normalize to empty strings and are returned with cluster_id=null.
- Performance target: millions of names per CPU-only k8s notebook, < 16 GB peak RAM at 10M-name scale with country blocking.
- Determinism: same input + same seed → byte-identical output, on the same wheel. Guaranteed within a lib version; cluster IDs may differ across 0.x→0.y releases.
- Out of v1: soft-scoring side-features (country / products as signals rather than block keys); incremental fit/predict; cross-language synonym translation (e.g. acronym ↔ expansion). See ARCHITECTURE.md § Future work for the parked-task list.
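The input-language behavior in the first bullet follows from ASCII folding. A simplified sketch using only the standard library shows why accented Latin names survive while non-Latin scripts end up empty. `ascii_fold` is a hypothetical stand-in: the real normalizer also strips legal-form suffixes and applies other rules not reproduced here.

```python
import unicodedata


def ascii_fold(name: str) -> str:
    """NFKD-decompose, drop non-ASCII, lowercase, keep alphanumerics and spaces.

    Simplified illustration only; not the library's normalizer.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    # Accented letters split into base letter + combining mark; the mark
    # (and any character with no ASCII base) is dropped here.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    kept = "".join(ch if ch.isalnum() else " " for ch in ascii_only.lower())
    return " ".join(kept.split())


for raw in ["Café", "Société Générale", "株式会社トヨタ"]:
    print(repr(raw), "->", repr(ascii_fold(raw)))
```

A name that folds to the empty string has nothing left to compare, which is consistent with such rows coming back with a null cluster_id.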
How it works (brief)
input names
│
▼
normalize NFKD → lower → punct policy → bidirectional legal-form strip
│ → multi-token compound canonicalize → descriptor abbreviate
▼
char-n-gram default n=3, byte-level on ASCII-guaranteed normalized strings
│
▼
MinHash universal hashing (Mersenne-prime 2^61-1 reduction),
│ deterministic via rand_chacha
▼
LSH banded bucketing, default 32×4
│
▼
candidates pairs that collide in any band
│
▼
TF-IDF sparse char-n-gram vectors, L2-normalized at build time
rerank cosine on sorted-merge dot product
│
▼
threshold drop pairs below `threshold`
│
▼
union-find connected components
│
▼
hub max-degree node per CC = canonical
│
▼
sort + relabel cluster IDs assigned by canonical name asc
Full design rationale, audit findings, and the decision history live in
ARCHITECTURE.md.
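The last four stages of the pipeline (threshold, union-find, hub, relabel) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Rust core: `cluster_from_pairs` is a made-up name, the tie-break between equal-degree hubs is an assumption, and the hub-radius split is left out.

```python
from collections import defaultdict


def cluster_from_pairs(names, scored_pairs, threshold=0.85):
    """Toy threshold -> union-find -> hub -> relabel.

    names: normalized names; scored_pairs: (i, j, cosine) index tuples.
    """
    parent = list(range(len(names)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    degree = [0] * len(names)
    for i, j, score in scored_pairs:
        if score < threshold:
            continue  # drop pairs below the threshold
        parent[find(i)] = find(j)
        degree[i] += 1
        degree[j] += 1

    components = defaultdict(list)
    for idx in range(len(names)):
        components[find(idx)].append(idx)

    # Hub = max-degree member; ties broken by name (assumed tie-break).
    hubs = {
        root: names[min(members, key=lambda m: (-degree[m], names[m]))]
        for root, members in components.items()
    }
    # Cluster IDs follow canonical names in ascending order.
    by_name = sorted(hubs, key=lambda r: hubs[r])
    order = {root: cid for cid, root in enumerate(by_name)}
    return [(order[find(i)], hubs[find(i)]) for i in range(len(names))]


names = ["acme widgets", "acme widget", "acme widgetts", "apple", "apple computer"]
pairs = [(0, 1, 0.91), (0, 2, 0.88), (3, 4, 0.62)]  # made-up scores
result = cluster_from_pairs(names, pairs)
for name, (cid, canonical) in zip(names, result):
    print(cid, canonical, "<-", name)
```

The 0.62 pair falls below the threshold, so "apple" and "apple computer" stay in separate clusters, mirroring the Quickstart output.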
Development
Run all tests (rust + python integration):
cargo test --lib # 78 rust unit tests
maturin develop --release # rebuild + reinstall extension
pytest tests/test_public_api.py # 21 python integration tests
Pre-push hook (local CI)
scripts/ci.sh runs ruff (check + format), cargo fmt, cargo clippy -D warnings, cargo test --lib, maturin develop (only if rust changed),
and pytest. Wire it as a pre-push hook once:
git config core.hooksPath .githooks
Subsequent git push runs the script and aborts on failure. Bypass with
git push --no-verify. The script can also be run manually: scripts/ci.sh.
Re-run the normalization audit against real corpora (downloads ~600 MB on first run; auth required for SAM.gov):
uv run scripts/download_corpora.py all
uv run scripts/download_sam.py # requires SAM_API_KEY
uv run scripts/validate_normalization.py
License
MIT — see LICENSE.