Cluster business names into entity groups; Python + Rust

These details have not been verified by PyPI

Project description

name-cluster

Cluster business names into groups representing the same legal entity.

Given customs-form-style data where the same exporter appears as "IBM USA", "International Business Machines Inc", "00 IBM", etc., the library produces stable cluster IDs grouping these as one entity, plus a canonical name per cluster. Designed for millions of names on CPU-only deployments with limited memory; no GPU, no model downloads, no network.

Rust core (PyO3 binding via maturin) + Python adapter over narwhals, so the same call works on polars, pandas, and pyarrow inputs.

Install

PyPI publishing is pending. To install from source (requires the Rust toolchain):

git clone https://github.com/jessetweedle/name-cluster && cd name-cluster
uv venv .venv && source .venv/bin/activate
uv pip install maturin
maturin develop --release

Runtime deps (resolved automatically): narwhals, pyarrow. The lib has no model weights, no data files, no network calls.

Quickstart

A runnable end-to-end demo lives at examples/quickstart.py. It exercises every public symbol on synthetic data and includes an optional real-data section that activates if you've populated the dev cache via scripts/download_*.py. Run it with python examples/quickstart.py.

The minimal version:

import namecluster as nc
import polars as pl

df = pl.DataFrame({
    "name": [
        "Acme Corporation",
        "ACME Corp",
        "Acme Corporation Inc",
        "Apple Computer Co.",
        "Apple Inc",
        None,
    ],
})

result = nc.cluster(df, name_col="name")

shape: (6, 3)
┌──────────────────────┬────────────┬────────────────┐
│ name                 ┆ cluster_id ┆ canonical_name │
╞══════════════════════╪════════════╪════════════════╡
│ Acme Corporation     ┆ 0          ┆ acme           │
│ ACME Corp            ┆ 0          ┆ acme           │
│ Acme Corporation Inc ┆ 0          ┆ acme           │
│ Apple Computer Co.   ┆ 2          ┆ apple computer │
│ Apple Inc            ┆ 1          ┆ apple          │
│ null                 ┆ null       ┆ null           │
└──────────────────────┴────────────┴────────────────┘

(Cluster IDs are assigned 0..N-1 sorted by canonical name ascending, so acme < apple < apple computer. Deterministic given the same seed.)
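
The relabeling rule in the note above can be sketched in plain Python. This is an illustration of the ordering only, not the library's code; `relabel_by_canonical` is a hypothetical helper that does not exist in namecluster.

```python
def relabel_by_canonical(canonicals):
    """Map each row's canonical name to a dense ID, ordered by name ascending.

    `canonicals` holds one canonical name per row, or None for rows that
    were not clustered. Hypothetical helper, for illustration only.
    """
    ordered = sorted({c for c in canonicals if c is not None})
    id_of = {name: i for i, name in enumerate(ordered)}
    return [None if c is None else id_of[c] for c in canonicals]


rows = ["acme", "acme", "acme", "apple computer", "apple", None]
ids = relabel_by_canonical(rows)
print(ids)  # [0, 0, 0, 2, 1, None]
```

Because "apple" is a prefix of "apple computer", it sorts first and receives the lower ID, which matches the table above.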

Same call works on a pandas DataFrame or a pyarrow Table — narwhals detects the input type and returns the same type.

Public API

import namecluster as nc

# Main entry — cluster a name column on any dataframe
nc.cluster(data, name_col="name", threshold=0.85, seed=0, ...)

# REPL/notebook convenience for a flat list of strings
nc.cluster_names(["IBM Corp", "IBM Inc", "Apple Inc"])  # -> [0, 0, 1]

# Single-name normalization (debug what the lib actually compares)
nc.normalize("00 IBM Corp.")  # -> "ibm"

# Synthetic data + cluster-quality metrics for evaluation
ds = nc.generate_examples(n_entities=100, difficulty="medium", seed=0)
metrics = nc.score_clusters(predicted, true)  # ARI, F1, precision, recall

# LSH config picker for a target Jaccard + recall
nc.lsh_calibrate(target_jaccard=0.6, target_recall=0.95)
# -> {"bands": ..., "rows": ..., "num_perm": ..., "p_at_target": ..., "p_at_fp": ...}

# Debug: what pairs did LSH propose, and at what cosine score?
nc.candidates(df, name_col="name", min_score=0.5)
# -> df with name_a, name_b, normalized_a, normalized_b, score

# Debug: what's inside one cluster — members, edges, hub eccentricity?
nc.explain(result, cluster_id=42)
# -> {"canonical": "...", "members": [...], "edges": [(a, b, score), ...],
#     "hub_radius": int, "size": int}

# Discover acronym↔expansion candidates from the corpus (feeds `aliases=`)
nc.acronym_map(df, name_col="name")
# -> df with (acronym, expansion_count, expansions, acronym_examples)
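
For readers wondering what the LSH stage behind `candidates` is hashing, here is a minimal MinHash sketch over character trigrams. It is a pure-Python approximation under stated assumptions, not the Rust core: `random.Random` stands in for rand_chacha, `stable_hash` is a made-up gram-to-integer mapping, and every function name here is hypothetical. It also assumes each name yields at least one n-gram.

```python
import random

MERSENNE_PRIME = (1 << 61) - 1


def char_ngrams(text, n=3):
    """Set of character n-grams of an already-normalized name."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def make_hashers(num_perm, seed=0):
    """Seeded (a, b) pairs for universal hashes h(x) = (a*x + b) mod p."""
    rng = random.Random(seed)
    return [(rng.randrange(1, MERSENNE_PRIME), rng.randrange(MERSENNE_PRIME))
            for _ in range(num_perm)]


def stable_hash(gram):
    # Python's built-in hash() is salted per process, so derive a
    # reproducible integer from the gram's bytes instead.
    return int.from_bytes(gram.encode("utf-8"), "big") % MERSENNE_PRIME


def minhash(grams, hashers):
    """One minimum per hash function; the list is the signature."""
    xs = [stable_hash(g) for g in grams]
    return [min((a * x + b) % MERSENNE_PRIME for x in xs) for a, b in hashers]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature slots on which the two names agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


hashers = make_hashers(num_perm=128, seed=0)
sig_a = minhash(char_ngrams("acme trading"), hashers)
sig_b = minhash(char_ngrams("acme tradng"), hashers)
est = estimated_jaccard(sig_a, sig_b)
print(f"estimated Jaccard: {est:.2f} (exact value is 7/12)")
```

The estimate fluctuates around the exact Jaccard similarity of the two trigram sets; more permutations tighten it at the cost of memory per name.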

Common patterns

Block by country (recommended for cross-country corpora)

The library doesn't take a country_col kwarg — users do hard blocking themselves to keep the API tight. Polars idiom:

result = (
    df.group_by("country", maintain_order=True)
      .map_groups(lambda g: nc.cluster(g, name_col="name"))
)
# Note: cluster_ids are call-local (each group starts at 0). If you
# concatenate groups, offset cluster_ids per group to make them unique.
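
The offsetting mentioned in the note can be done with a running counter. The sketch below works on plain lists of IDs rather than dataframes so the arithmetic is easy to follow; `offset_cluster_ids` is a hypothetical helper, not part of the library.

```python
def offset_cluster_ids(groups):
    """Make call-local cluster IDs unique across groups.

    `groups` is a list of per-group ID lists, each numbered from 0
    (None marks an unclustered row). Hypothetical helper, illustration only.
    """
    out, offset = [], 0
    for ids in groups:
        out.append([None if i is None else i + offset for i in ids])
        present = [i for i in ids if i is not None]
        # Advance past the largest ID used by this group.
        offset += max(present) + 1 if present else 0
    return out


per_country = [[0, 0, 1], [0, None, 1, 2], [0]]
shifted = offset_cluster_ids(per_country)
print(shifted)  # [[0, 0, 1], [2, None, 3, 4], [5]]
```

Nulls are passed through untouched, so rows that normalized to an empty string stay unclustered after concatenation.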

Tune the threshold

Higher threshold = more clusters (precision-favoring); lower = fewer (recall-favoring). Default 0.85 is high precision. Sweep on a labeled sample to find your domain's sweet spot:

for t in [0.75, 0.80, 0.85, 0.90, 0.95]:
    r = nc.cluster(df, name_col="name", threshold=t)
    metrics = nc.score_clusters(r["cluster_id"], known_labels)
    print(t, metrics)
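
To make the precision/recall trade-off in the sweep concrete, here is a small pairwise scorer. It is an assumption about what such metrics commonly mean (two rows count as a positive pair when they share a label); the README does not spell out how `score_clusters` computes its numbers, so treat `pairwise_scores` as a hypothetical stand-in.

```python
from itertools import combinations


def pairwise_scores(predicted, true):
    """Pairwise precision, recall, and F1 for two labelings of the same rows."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(predicted)), 2):
        same_pred = predicted[i] == predicted[j]
        same_true = true[i] == true[j]
        tp += same_pred and same_true
        fp += same_pred and not same_true
        fn += same_true and not same_pred
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


true = [0, 0, 0, 1, 1]
strict = pairwise_scores([0, 0, 1, 2, 2], true)  # over-split: precision 1.0, recall 0.5
loose = pairwise_scores([0, 0, 0, 0, 0], true)   # over-merged: precision 0.4, recall 1.0
print(strict)
print(loose)
```

The loop is quadratic in the number of rows, so it suits a labeled sample rather than a full corpus.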

Generator + round-trip evaluation

# Cap n_entities <= 50 (the embedded toy-canonical pool size) to avoid
# the wrap-around disambig suffix; or pass canonicals=[...] to use your own.
ds = nc.generate_examples(n_entities=40, difficulty="easy", seed=42)
result = nc.cluster(ds, name_col="variant_name")
metrics = nc.score_clusters(
    result["cluster_id"].to_pylist(),
    ds["true_entity_id"].to_pylist(),
)
# Easy difficulty target: ARI > 0.85, recall > 0.95

Difficulty levels (per ARCHITECTURE.md):

level	edits per variant	typical cosine
easy	case + punct + suffix swap	≥ 0.95
medium	+ abbr expansion + leading garbage + accent + `THE` toggle	≥ 0.85
hard	+ char typos + word drop + geo suffix + spacing oddities	≥ 0.70

Acronym / expansion aliases

Pass aliases={canonical: [alias, ...]} to force-merge an acronym with its expansion (or any other variant pair the lib's char-n-gram cosine won't bridge by itself):

nc.cluster(
    df, name_col="name",
    aliases={"International Business Machines": ["IBM", "I.B.M."]},
)
# All four — "IBM Corp", "I.B.M. Inc", "International Business Machines",
# "International Business Machines Inc" — collapse into one cluster with
# canonical_name="intl business machines".

Each canonical and each alias is normalized; post-normalize, every alias form is rewritten to the canonical's form before MinHash/LSH/TF-IDF run. If two canonicals map the same alias, the last one wins.
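
The rewrite described above amounts to flattening the `aliases` dict into a lookup keyed by normalized alias. The sketch below assumes whole-name matching after normalization and uses a toy normalizer; `build_alias_map` and `toy_normalize` are hypothetical names, and the second canonical exists only to show the last-one-wins rule.

```python
def build_alias_map(aliases, normalize):
    """Flatten {canonical: [alias, ...]} into {normalized_alias: normalized_canonical}."""
    lookup = {}
    for canonical, forms in aliases.items():
        target = normalize(canonical)
        for form in forms:
            # Later canonicals overwrite earlier ones: "last one wins".
            lookup[normalize(form)] = target
    return lookup


def toy_normalize(name):
    # Stand-in for the library's normalizer: lowercase and drop periods.
    return name.lower().replace(".", "").strip()


lookup = build_alias_map(
    {
        "International Business Machines": ["IBM", "I.B.M."],
        "Itty Bitty Machines": ["IBM"],  # claims the same alias, so it wins
    },
    toy_normalize,
)
print(lookup)  # {'ibm': 'itty bitty machines'}

name = toy_normalize("I.B.M.")
rewritten = lookup.get(name, name)
print(rewritten)  # itty bitty machines
```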

Custom suffix lists / abbreviations

extra_suffixes=[...] and extra_canonical={...} are intended to extend the shipped lists for niche jurisdictions or industry abbreviations. Defaults already cover ~45 international legal-form suffixes plus 15 descriptor canonicalizations (see ARCHITECTURE.md § Normalization).

Note: both kwargs are spec'd in ARCHITECTURE.md but not yet plumbed through the public API, so they cannot be passed to cluster() yet. Tracked as a v1.x follow-up.

Configuration

All knobs are flat kwargs on cluster():

kwarg	default	what it controls
`threshold`	0.85	cosine cutoff for the TF-IDF rerank
`seed`	0	deterministic RNG seed (MinHash + LSH bucket hashing)
`ngram_size`	3	char-n-gram window for vectors
`lsh_bands`	32	LSH band count
`lsh_rows`	4	LSH rows/band; `num_perm = bands × rows`
`hub_radius_max`	2	per-cluster diameter check threshold for the hub-radius split
`diameter_check_min_size`	5	skip diameter check on small clusters
`max_name_length`	256	truncate raw input names beyond this many bytes
`aliases`	`None`	acronym/expansion override map: `{canonical: [alias, ...]}`
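
How `lsh_bands` and `lsh_rows` interact follows the standard banded-MinHash S-curve: a pair with Jaccard similarity s becomes a candidate with probability 1 - (1 - s^rows)^bands. The helper below evaluates that textbook formula at the defaults; it is not a call into the library, and `lsh_candidate_probability` is a hypothetical name.

```python
def lsh_candidate_probability(jaccard, bands=32, rows=4):
    """Chance that two sets with this Jaccard similarity collide in at least one band."""
    return 1.0 - (1.0 - jaccard ** rows) ** bands


for s in (0.3, 0.5, 0.6, 0.8):
    p = lsh_candidate_probability(s)
    print(f"jaccard={s:.1f}  P(candidate)={p:.3f}")
```

More rows per band sharpen the curve and cut false candidates; more bands shift it left and raise recall. Note that this curve is in terms of Jaccard similarity of n-gram sets, while `threshold` applies to the later TF-IDF cosine score.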

Scope

Input language: English with light non-ASCII accents (Café → cafe). Non-Latin-script names (CJK, Arabic, Cyrillic, Hebrew, etc.) post-normalize to empty strings and are returned with cluster_id=null.
Performance target: millions of names per CPU-only k8s notebook, < 16 GB peak RAM at 10M-name scale with country blocking.
Determinism: same input + same seed → byte-identical output, on the same wheel. Guaranteed within a lib version; cluster IDs may differ across 0.x → 0.y releases.
Out of v1: soft-scoring side-features (country / products as signals rather than block keys); incremental fit/predict; cross-language synonym translation (e.g. acronym ↔ expansion). See ARCHITECTURE.md § Future work for the parked-task list.
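
The input-language behavior above (accents folded, non-Latin scripts reduced to an empty string) falls out of an NFKD-then-ASCII step. The sketch below is a rough approximation with a six-entry suffix list, not the shipped normalizer; `toy_normalize` and `LEGAL_SUFFIXES` are hypothetical, and steps such as compound canonicalization and descriptor abbreviation are left out.

```python
import re
import unicodedata

# Toy list for illustration; the library ships a much longer one.
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "co", "ltd", "llc"}


def toy_normalize(name):
    """Rough approximation of the first normalization steps."""
    text = unicodedata.normalize("NFKD", name)
    # Drop combining marks and any other character outside ASCII.
    text = text.encode("ascii", "ignore").decode("ascii").lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    # Strip legal-form tokens from both ends ("bidirectional").
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    while tokens and tokens[0] in LEGAL_SUFFIXES:
        tokens.pop(0)
    return " ".join(tokens)


print(toy_normalize("Café Corp."))        # cafe
print(repr(toy_normalize("株式会社トヨタ")))  # ''
```

A name that normalizes to the empty string has no n-grams to hash, which is consistent with such rows coming back with cluster_id=null.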

How it works (brief)

input names
   │
   ▼
 normalize    NFKD → lower → punct policy → bidirectional legal-form strip
   │          → multi-token compound canonicalize → descriptor abbreviate
   ▼
 char-n-gram  default n=3, byte-level on ASCII-guaranteed normalized strings
   │
   ▼
 MinHash      universal hashing (Mersenne-prime 2^61-1 reduction),
   │          deterministic via rand_chacha
   ▼
 LSH          banded bucketing, default 32×4
   │
   ▼
 candidates   pairs that collide in any band
   │
   ▼
 TF-IDF       sparse char-n-gram vectors, L2-normalized at build time
 rerank       cosine on sorted-merge dot product
   │
   ▼
 threshold    drop pairs below `threshold`
   │
   ▼
 union-find   connected components
   │
   ▼
 hub          max-degree node per CC = canonical
   │
   ▼
 sort + relabel  cluster IDs assigned by canonical name asc

Full design rationale, audit findings, and the decision history live in ARCHITECTURE.md.
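
The stages after candidate generation (cosine rerank, threshold, union-find, hub selection, relabel) can be sketched end to end in a few dozen lines. Several simplifications are assumed here: plain trigram counts instead of TF-IDF weights, all pairs scored instead of LSH candidates only, ties between hubs broken by name, no hub-radius split, and every name at least three characters long. `toy_cluster` and its helpers are hypothetical names.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt


def trigram_vector(text, n=3):
    """L2-normalized character n-gram counts (plain TF, no IDF weighting)."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    norm = sqrt(sum(v * v for v in counts.values()))
    return {g: v / norm for g, v in counts.items()} if norm else {}


def cosine(a, b):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * b[g] for g, w in a.items() if g in b)


def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def toy_cluster(names, threshold=0.5):
    """Return one (cluster_id, canonical_name) pair per input name."""
    vectors = [trigram_vector(n) for n in names]
    parent = list(range(len(names)))
    degree = Counter()
    for i, j in combinations(range(len(names)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            parent[find(parent, i)] = find(parent, j)  # union
            degree[i] += 1
            degree[j] += 1
    members = defaultdict(list)
    for i in range(len(names)):
        members[find(parent, i)].append(i)
    # Hub = member with the most surviving edges; ties broken by name (assumption).
    hubs = {root: min(idx, key=lambda i: (-degree[i], names[i]))
            for root, idx in members.items()}
    # Dense IDs assigned by canonical name ascending.
    ordered = sorted(names[h] for h in hubs.values())
    id_of = {name: k for k, name in enumerate(ordered)}
    result = []
    for i in range(len(names)):
        canonical = names[hubs[find(parent, i)]]
        result.append((id_of[canonical], canonical))
    return result


clusters = toy_cluster(["acme", "acme trading", "acme tradng", "globex"], threshold=0.5)
for row in clusters:
    print(row)
```

With these inputs only "acme trading" and "acme tradng" score above 0.5, so they merge while "acme" and "globex" stay singletons.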

Development

Run all tests (rust + python integration):

cargo test --lib                          # 78 rust unit tests
maturin develop --release                  # rebuild + reinstall extension
pytest tests/test_public_api.py            # 21 python integration tests

Pre-push hook (local CI)

scripts/ci.sh runs ruff (check + format), cargo fmt, cargo clippy -D warnings, cargo test --lib, maturin develop (only if rust changed), and pytest. Wire it as a pre-push hook once:

git config core.hooksPath .githooks

Subsequent git push runs the script and aborts on failure. Bypass with git push --no-verify. The script can also be run manually: scripts/ci.sh.

Re-run the normalization audit against real corpora (downloads ~600 MB on first run; auth required for SAM.gov):

uv run scripts/download_corpora.py all
uv run scripts/download_sam.py             # requires SAM_API_KEY
uv run scripts/validate_normalization.py

License

MIT — see LICENSE.

Project details

License
- OSI Approved :: MIT License
Operating System
- POSIX :: Linux
Programming Language

Release history

This version

0.1.0

May 6, 2026

Download files

Source Distributions

No source distribution files available for this release.

Built Distributions

name_cluster-0.1.0-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (447.8 kB)

Uploaded May 6, 2026; CPython 3.12+; manylinux: glibc 2.17+ x86-64

name_cluster-0.1.0-cp312-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (431.2 kB)

Uploaded May 6, 2026; CPython 3.12+; manylinux: glibc 2.17+ ARM64

File details

Details for the file name_cluster-0.1.0-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

Download URL: name_cluster-0.1.0-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Upload date: May 6, 2026
Size: 447.8 kB
Tags: CPython 3.12+, manylinux: glibc 2.17+ x86-64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for name_cluster-0.1.0-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`44eb7bc7d9788fe462fa5d15bbe971281a06ba13ee9967aad48e9656770704f3`
MD5	`30500d3d8c1204e525ac4e3b6d5589e3`
BLAKE2b-256	`e24ef3a74601467626c9014c7e069aa81c72004e430745addfeb162f35c49b66`

Provenance

The following attestation bundles were made for name_cluster-0.1.0-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: publish.yml on twedl/name-cluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: name_cluster-0.1.0-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Subject digest: 44eb7bc7d9788fe462fa5d15bbe971281a06ba13ee9967aad48e9656770704f3
- Sigstore transparency entry: 1449708372
- Sigstore integration time: May 6, 2026
Source repository:
- Permalink: twedl/name-cluster@e56d87bb6bad22607633be79c0089d2e40c7417e
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/twedl
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@e56d87bb6bad22607633be79c0089d2e40c7417e
- Trigger Event: push

File details

Details for the file name_cluster-0.1.0-cp312-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

Download URL: name_cluster-0.1.0-cp312-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Upload date: May 6, 2026
Size: 431.2 kB
Tags: CPython 3.12+, manylinux: glibc 2.17+ ARM64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for name_cluster-0.1.0-cp312-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm	Hash digest
SHA256	`7878373e8772911cbc40fe78d645956dd1f1520dcfdc6c18d723945a446a802b`
MD5	`a219e41c80d51f57187787157aacfa83`
BLAKE2b-256	`6799eb8449381b3897c8828d12c12b5433c1393a9ebde06f26759acebb659db2`

Provenance

The following attestation bundles were made for name_cluster-0.1.0-cp312-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: publish.yml on twedl/name-cluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: name_cluster-0.1.0-cp312-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
- Subject digest: 7878373e8772911cbc40fe78d645956dd1f1520dcfdc6c18d723945a446a802b
- Sigstore transparency entry: 1449708376
- Sigstore integration time: May 6, 2026
Source repository:
- Permalink: twedl/name-cluster@e56d87bb6bad22607633be79c0089d2e40c7417e
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/twedl
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@e56d87bb6bad22607633be79c0089d2e40c7417e
- Trigger Event: push
