Record linkage with dense blocking using text embeddings and LLM matching

Maintainers

caalvaro

denselinkage

Record linkage and deduplication for Python — dense blocking, optional LLM matching, and evaluation built in.

denselinkage finds the records that refer to the same real-world entity, whether they live in two datasets (record linkage) or one (deduplication). It shrinks the impossible all-pairs comparison down to a plausible few with embedding-based blocking, decides each candidate with a pluggable matcher — a fast similarity threshold or a large language model — then clusters and scores the result.

The core runs on numpy + pandas alone. FAISS, sentence-transformers, and LangChain are optional extras you reach for when you need approximate-nearest-neighbour search at scale, semantic embeddings, or LLM-based matching; import denselinkage pulls in none of them until you ask.

Highlights

🪶 Dependency-free core — pip install denselinkage is just numpy + pandas. The heavy ML backends are opt-in extras, and the import graph proves it: CI fails if a backend ever leaks into the core.
🔁 Swap any stage — the embedder, vector index, and matcher are independent components behind small Protocols. Go from lexical → semantic, brute-force → FAISS, threshold → LLM without rewriting your pipeline.
📦 End to end — block → match → cluster → evaluate, with linkage, blocking, and clustering (B³) metrics included.
🧊 Immutable by design — link / dedupe / match_pairs are single calls with no hidden fit/predict state. Build a reference index once and reuse it.
🧪 Typed, tested, stable — strict mypy, a shipped py.typed marker, 100% branch coverage, and a frozen 1.0 API (evolution is extend, never modify).

Installation

pip install denselinkage                           # core — numpy + pandas only

Add extras when you need them (or [all] for everything):

pip install "denselinkage[sentence-transformers]"  # semantic embeddings
pip install "denselinkage[faiss]"                  # FAISS approximate-NN index
pip install "denselinkage[langchain]"              # LLM matcher
pip install "denselinkage[all]"

Requires Python 3.10+.

Quickstart

Link two tables of companies with messy, inconsistent names — no configuration, one call:

import pandas as pd
from denselinkage import DenseLinker, LabeledPairs, Source
from denselinkage.metrics import linkage_metrics

left = pd.DataFrame({
    "id":   ["A1", "A2", "A3"],
    "name": ["Apple Inc", "Microsoft Corp", "Google LLC"],
    "city": ["Cupertino", "Redmond", "Mountain View"],
})
right = pd.DataFrame({
    "id":   ["B1", "B2", "B3"],
    "name": ["Apple Incorporated", "Microsoft", "Google"],
    "city": ["Cupertino", "Redmond", "Mountain View"],
})

linker = DenseLinker.with_defaults()         # lexical stack: embed → index → threshold
result = linker.link(                         # one call — no fit/predict, no mutation
    Source(left, id_column="id"),
    Source(right, id_column="id"),
)

print(result.to_frame().query("match"))       # the decided matches, as a DataFrame
gold = LabeledPairs.from_pairs([("A1", "B1"), ("A2", "B2"), ("A3", "B3")])
m = linkage_metrics(result, gold=gold)
print(f"precision={m.precision:.2f} recall={m.recall:.2f} f1={m.f1:.2f}")

  left_id right_id  similarity  match confidence reason
0      A1       B1    0.762443   True       None   None
3      A2       B2    0.833908   True       None   None
6      A3       B3    0.864126   True       None   None
precision=1.00 recall=1.00 f1=1.00

with_defaults() wires the dependency-free lexical stack: character n-gram embeddings, brute-force nearest-neighbour search, and a similarity threshold. It tolerates abbreviations, punctuation differences, and typos (Apple Inc ↔ Apple Incorporated) out of the box.
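To make the lexical stack concrete, here is a minimal pure-Python sketch of a hashed character n-gram embedding scored by cosine similarity. It is an illustration of the general technique only, not the library's HashedNGramEmbedder: the function names, the trigram size, the 256-bucket dimension, and the CRC32 hash are all assumptions chosen for the example.

```python
import math
import zlib


def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Lower-case the text, pad it with spaces, and slide an n-character window."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def hashed_ngram_vector(text: str, dim: int = 256, n: int = 3) -> list[float]:
    """Count n-grams into `dim` hash buckets, then L2-normalise the counts."""
    vec = [0.0] * dim
    for gram in char_ngrams(text, n):
        # zlib.crc32 is stable across runs, unlike the built-in hash() for str.
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    """For unit vectors the dot product equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


apple = hashed_ngram_vector("Apple Inc")
same = cosine(apple, hashed_ngram_vector("Apple Incorporated"))
other = cosine(apple, hashed_ngram_vector("Microsoft Corp"))
print(same > other)  # shared trigrams push the Apple pair well above the other
```

Because the two Apple spellings share most of their trigrams, their vectors overlap heavily, while unrelated names overlap only through occasional hash collisions.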

How it works

denselinkage is a four-stage pipeline, and every stage is a swappable component:

 Sources ──► Block ──────► Match ──────► Cluster ──────► Evaluate
            (embed +      (threshold    (connected      (P/R/F1,
             top-k NN)     or LLM)        components)     B³, …)

Block — embed each record and retrieve its top-k nearest neighbours, turning an N × M comparison into a handful of candidate pairs.
Match — decide each candidate. ThresholdMatcher gates on similarity; LangChainMatcher asks an LLM and returns a typed decision.
Cluster — group the matches into entities with transitive connected_components.
Evaluate — score against gold labels with linkage, blocking, or clustering (B³) metrics.
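The Block stage can be sketched in a few lines: score every left record against the right side and keep only the k best neighbours. This is a brute-force illustration under assumed names (normalise, top_k_candidates) and toy three-dimensional vectors; it is not the library's DenseBlocker, and a real index would avoid scanning every right record.

```python
import heapq
import math


def normalise(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def top_k_candidates(left, right, k):
    """For each left record, keep its k most similar right records."""
    pairs = []
    for left_id, lvec in left.items():
        scored = [
            (sum(x * y for x, y in zip(lvec, rvec)), right_id)
            for right_id, rvec in right.items()
        ]
        # nlargest sorts by similarity first, so only the best k survive.
        for sim, right_id in heapq.nlargest(k, scored):
            pairs.append((left_id, right_id, sim))
    return pairs


# Toy embeddings; real ones would come from an embedder.
left = {"A1": normalise([1.0, 0.1, 0.0]), "A2": normalise([0.0, 1.0, 0.2])}
right = {
    "B1": normalise([0.9, 0.2, 0.0]),
    "B2": normalise([0.1, 1.0, 0.1]),
    "B3": normalise([0.0, 0.1, 1.0]),
}
candidates = top_k_candidates(left, right, k=1)
print([(l, r) for l, r, _ in candidates])  # [('A1', 'B1'), ('A2', 'B2')]
```

With k=1 the 2 × 3 grid of possible comparisons shrinks to two candidate pairs; the matcher only ever sees those.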

Three verbs cover the common shapes — link (two datasets), dedupe (one dataset against itself), and match_pairs (you already have candidate pairs). index() builds a reusable reference index, so you embed once and query many times.

Scaling up: semantic + LLM matching

The lexical default is fast and free, but it only sees characters — it can't tell that Google and Alphabet are the same company. Swap in the heavy adapters for meaning (semantic embeddings), scale (FAISS), and judgment (an LLM), all behind the same ports:

| Stage | Lexical (default) | Semantic + LLM |
| --- | --- | --- |
| Embed | `HashedNGramEmbedder` | `SentenceTransformerEmbedder` · `[sentence-transformers]` |
| Index | `NumpyFlatIndex` | `FaissFlatIndex` · `[faiss]` |
| Match | `ThresholdMatcher` | `LangChainMatcher` · `[langchain]` |
| Catches | typos, abbreviations | + semantic renames, + judgment calls |

from denselinkage import DenseLinker, Source
from denselinkage.blocking import DenseBlocker
from denselinkage.embedding import SentenceTransformerEmbedder
from denselinkage.indexing import FaissFlatIndex
from denselinkage.matching import LangChainMatcher
from langchain_openai import ChatOpenAI   # reads OPENAI_API_KEY from the environment

linker = DenseLinker(
    # Block: semantic embeddings searched through FAISS
    blocker=DenseBlocker(
        embedder=SentenceTransformerEmbedder("all-MiniLM-L6-v2"),
        vector_index=FaissFlatIndex(),
        top_k=5, similarity_threshold=0.6,
    ),
    # Match: ask an LLM to decide each candidate pair
    matcher=LangChainMatcher(
        llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
        prompt="Are these the same entity?\nA: {record_a}\nB: {record_b}",
    ),
)
# left / right are the DataFrames from the Quickstart; the call is unchanged
result = linker.link(
    Source(left, id_column="id"),
    Source(right, id_column="id"),
)

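Because the score is cosine on both stacks, a similarity_threshold tuned on the lexical stack keeps its meaning here as a cosine cutoff; different embedders spread their scores differently, so re-check the value when you switch. See the Semantic + LLM guide for model selection, the prompt contract, retries, and cost.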

Deduplicate and cluster

from denselinkage import DenseLinker, Source, connected_components

# df: one table that may contain duplicate records, with an "id" column
result   = DenseLinker.with_defaults().dedupe(Source(df, id_column="id"))
clusters = connected_components(result)        # transitive grouping → entities
print(clusters.to_frame())                     # record_id, cluster_id

dedupe links a dataset against itself and suppresses self-pairs internally. Clustering is transitive (A~B and B~C ⇒ one cluster), so a noisy matcher can over-merge; watch for B³ recall ≫ precision.
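The transitive grouping, and the over-merge risk it carries, can be illustrated with a small union-find sketch. The function name group_matches and the record ids are hypothetical; this is a generic connected-components routine, not the library's connected_components.

```python
def group_matches(pairs):
    """Group matched pairs into clusters via union-find (connected components)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in clusters.values())


clean = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]
print(group_matches(clean))
# [['r1', 'r2', 'r3'], ['r4', 'r5']]

# One false match between the groups fuses them into a single entity.
noisy = clean + [("r3", "r4")]
print(group_matches(noisy))
# [['r1', 'r2', 'r3', 'r4', 'r5']]
```

A single spurious pair is enough to merge two otherwise separate entities, which is why matcher precision matters more once clustering is applied.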

Evaluation

Metrics are first-class, split by what they measure:

Linkage — linkage_metrics → precision / recall / F1 over matched pairs (undecidable pairs are surfaced as errors and counted separately, never mixed in).
Blocking — blocking_metrics / pair_completeness_at_k → did blocking even surface the true pairs?
Clustering — clustering_metrics → B³ (Bagga–Baldwin) precision / recall / F1 over the entity clusters.
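As a reference point for the clustering metric, B³ averages a per-record precision and recall: for each record, compare the cluster it was assigned to with the gold entity it belongs to. The sketch below is a plain implementation of that definition with a hypothetical b_cubed function and made-up record ids; it does not reproduce the signature of clustering_metrics.

```python
def b_cubed(predicted, gold):
    """B-cubed precision, recall, F1.

    predicted / gold: dict mapping record_id -> cluster label,
    both covering the same set of records.
    """
    def members_by_record(assignment):
        groups = {}
        for record, label in assignment.items():
            groups.setdefault(label, set()).add(record)
        return {record: groups[label] for record, label in assignment.items()}

    pred_of = members_by_record(predicted)
    gold_of = members_by_record(gold)
    records = list(gold)

    # Per-record overlap between the predicted cluster and the gold entity.
    precision = sum(len(pred_of[r] & gold_of[r]) / len(pred_of[r]) for r in records) / len(records)
    recall = sum(len(pred_of[r] & gold_of[r]) / len(gold_of[r]) for r in records) / len(records)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = {"r1": "e1", "r2": "e1", "r3": "e2", "r4": "e2"}
over_merged = {"r1": "c1", "r2": "c1", "r3": "c1", "r4": "c1"}  # everything fused
p, r, f1 = b_cubed(over_merged, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=0.50 recall=1.00 f1=0.67
```

An over-merged clustering keeps every true co-member together (recall 1.0) but pollutes each cluster with strangers (precision 0.5), which is the recall ≫ precision signature to watch for.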

Plus tune_threshold for a P/R/F1 sweep and mine_hard_negatives for contrastive training material.
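A threshold sweep of this kind amounts to recomputing pairwise precision, recall, and F1 at each cutoff. The following sketch shows the idea with a hypothetical sweep_thresholds function and invented scores; it is not the implementation or signature of tune_threshold, and it treats precision as 0.0 when nothing is predicted.

```python
def sweep_thresholds(scored_pairs, gold, thresholds):
    """Return (threshold, precision, recall, f1) for each cutoff.

    scored_pairs: list of (left_id, right_id, similarity)
    gold: set of true (left_id, right_id) pairs
    """
    rows = []
    for t in thresholds:
        predicted = {(l, r) for l, r, s in scored_pairs if s >= t}
        true_pos = len(predicted & gold)
        precision = true_pos / len(predicted) if predicted else 0.0
        recall = true_pos / len(gold) if gold else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        rows.append((t, precision, recall, f1))
    return rows


scored = [
    ("A1", "B1", 0.9),
    ("A2", "B2", 0.7),
    ("A1", "B2", 0.6),  # a false candidate
    ("A3", "B3", 0.4),
]
gold_pairs = {("A1", "B1"), ("A2", "B2"), ("A3", "B3")}

rows = sweep_thresholds(scored, gold_pairs, thresholds=[0.3, 0.5, 0.8])
for t, p, r, f1 in rows:
    print(f"t={t:.1f} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
best = max(rows, key=lambda row: row[3])
print("best threshold by F1:", best[0])  # 0.3 on this toy data
```

Raising the cutoff trades recall for precision; picking the threshold by F1 is one reasonable default, though the right trade-off depends on what a false merge costs you.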

Design

denselinkage is contract-first (hexagonal / ports-and-adapters). Domain logic talks to small typing.Protocols — Embedder, VectorIndex, Matcher, … — and concrete adapters plug in behind them. Two consequences worth knowing:

The dependency cut is structural. Heavy backends import lazily, inside the methods that use them; a CI job asserts import denselinkage pulls in no FAISS / torch / LangChain.
The 1.0 contract is frozen. Signatures and field types won't change under you; the library evolves by adding (an optional field, a sibling type, a new classmethod), never by modifying. Stateful components follow spec → artifact: a stateless spec's build(...) returns an immutable, fitted artifact.
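The three ideas above (a small Protocol as the port, a lazy import inside the adapter method, and a stateless spec whose build(...) returns an immutable artifact) can be sketched as follows. Every class name here is hypothetical and simplified for illustration, and json merely stands in for a heavy optional backend; none of this is the library's actual code.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class Embedder(Protocol):
    """The port: anything with a matching embed() method satisfies it."""
    def embed(self, texts: Sequence[str]) -> list[list[float]]: ...


class LengthEmbedder:
    """A dependency-free adapter; note that it does not inherit from Embedder."""
    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        return [[float(len(t))] for t in texts]


class LazyJsonEmbedder:
    """An adapter whose backend is imported only when embed() is called."""
    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        import json  # stand-in for a heavy optional dependency
        return [[float(len(json.dumps(t)))] for t in texts]


@dataclass(frozen=True)
class BuiltIndex:
    """The artifact: fitted state, immutable after construction."""
    top_k: int
    vectors: tuple[tuple[float, ...], ...]


@dataclass(frozen=True)
class IndexSpec:
    """The spec: configuration only, no fitted state."""
    top_k: int

    def build(self, vectors: Sequence[Sequence[float]]) -> BuiltIndex:
        return BuiltIndex(top_k=self.top_k, vectors=tuple(tuple(v) for v in vectors))


for adapter in (LengthEmbedder(), LazyJsonEmbedder()):
    assert isinstance(adapter, Embedder)  # structural typing, no base class needed

index = IndexSpec(top_k=2).build(LengthEmbedder().embed(["Apple Inc", "Google"]))
try:
    index.top_k = 5  # frozen dataclasses reject mutation
except FrozenInstanceError:
    print("artifact is immutable")
```

Keeping the import inside the method means that importing the module defining the adapter costs nothing until the adapter is actually used, which is what lets a dependency check on the core succeed.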

See the architecture overview for the full picture.

Documentation

📖 Full documentation →

Tutorial — link two tables stage by stage.
Semantic + LLM matching and Choosing components.
API reference.

Runnable scripts live in examples/ — 00_quickstart.py is the shortest path; 01/02 show the full semantic + LLM assembly.

Development

Requires uv.

uv sync --dev
uv run ruff check . && uv run ruff format --check . && uv run mypy && uv run pytest

CI runs lint, format, strict mypy, and the test suite on Python 3.10–3.13, with a separate job for the optional adapters. See CONTRIBUTING.md.

Changelog

See CHANGELOG.md.

Citing

If you use denselinkage in your research, please cite it — see CITATION.cff.

License

Release history

1.0.0 (this version): Jun 6, 2026
1.0.0b2 pre-release: Jun 6, 2026
1.0.0b1 pre-release: Jun 5, 2026

Download files

Source Distribution

denselinkage-1.0.0.tar.gz (38.3 kB), uploaded Jun 6, 2026, Source

Built Distribution

denselinkage-1.0.0-py3-none-any.whl (62.0 kB), uploaded Jun 6, 2026, Python 3

File details

Details for the file denselinkage-1.0.0.tar.gz.

File metadata

Download URL: denselinkage-1.0.0.tar.gz
Upload date: Jun 6, 2026
Size: 38.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for denselinkage-1.0.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `4ccc61fb90190224ed3f1bb0c900e750b12f2eb5b8dead05feccfa86e803daed` |
| MD5 | `f410aa6e5fba05f436106bf8480c4710` |
| BLAKE2b-256 | `9f2148c93460c8f2cee8f6fe5425d1d3aef17142df720d39e149ab97dba13772` |

Provenance

The following attestation bundles were made for denselinkage-1.0.0.tar.gz:

Publisher: release.yml on caalvaro/denselinkage

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: denselinkage-1.0.0.tar.gz
- Subject digest: 4ccc61fb90190224ed3f1bb0c900e750b12f2eb5b8dead05feccfa86e803daed
- Sigstore transparency entry: 1739872625
- Sigstore integration time: Jun 6, 2026
Source repository:
- Permalink: caalvaro/denselinkage@aad97eece632c4b67ca3d01f2b15e88542b07d26
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/caalvaro
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@aad97eece632c4b67ca3d01f2b15e88542b07d26
- Trigger Event: push

File details

Details for the file denselinkage-1.0.0-py3-none-any.whl.

File metadata

Download URL: denselinkage-1.0.0-py3-none-any.whl
Upload date: Jun 6, 2026
Size: 62.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for denselinkage-1.0.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `d811c643eba32ac55fa925cfe99df642b5da0f8d4963288c81739d46f6f7f475` |
| MD5 | `5dd4eb032ba9617faed0125a3429b09b` |
| BLAKE2b-256 | `4587244749e3832bd5e6615e18efdd7571de0f7df3e9258782f65a0815005898` |

Provenance

The following attestation bundles were made for denselinkage-1.0.0-py3-none-any.whl:

Publisher: release.yml on caalvaro/denselinkage

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: denselinkage-1.0.0-py3-none-any.whl
- Subject digest: d811c643eba32ac55fa925cfe99df642b5da0f8d4963288c81739d46f6f7f475
- Sigstore transparency entry: 1739872640
- Sigstore integration time: Jun 6, 2026
Source repository:
- Permalink: caalvaro/denselinkage@aad97eece632c4b67ca3d01f2b15e88542b07d26
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/caalvaro
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@aad97eece632c4b67ca3d01f2b15e88542b07d26
- Trigger Event: push
