Record linkage with dense blocking using text embeddings and LLM matching
denselinkage
Record linkage and deduplication for Python — dense blocking, optional LLM matching, and evaluation built in.
denselinkage finds the records that refer to the same real-world entity, whether
they live in two datasets (record linkage) or one (deduplication). It
shrinks the impossible all-pairs comparison down to a plausible few with
embedding-based blocking, decides each candidate with a pluggable matcher —
a fast similarity threshold or a large language model — then clusters and scores
the result.
The core runs on numpy + pandas alone. FAISS, sentence-transformers, and
LangChain are optional extras you reach for when you need approximate
nearest-neighbour search at scale, semantic embeddings, or LLM-based matching.
import denselinkage pulls in none of them until you ask.
Highlights
- 🪶 Dependency-free core: pip install denselinkage is just numpy + pandas. The heavy ML backends are opt-in extras, and the import graph proves it: CI fails if a backend ever leaks into the core.
- 🔁 Swap any stage: the embedder, vector index, and matcher are independent components behind small Protocols. Go from lexical → semantic, brute-force → FAISS, threshold → LLM without rewriting your pipeline.
- 📦 End to end: block → match → cluster → evaluate, with linkage, blocking, and clustering (B³) metrics included.
- 🧊 Immutable by design: link / dedupe / match_pairs are single calls with no hidden fit / predict state. Build a reference index once and reuse it.
- 🧪 Typed, tested, stable: strict mypy, a shipped py.typed marker, 100% branch coverage, and a frozen 1.0 API (evolution is extend, never modify).
Installation
pip install denselinkage # core — numpy + pandas only
Add extras when you need them (or [all] for everything):
pip install "denselinkage[sentence-transformers]" # semantic embeddings
pip install "denselinkage[faiss]" # FAISS approximate-NN index
pip install "denselinkage[langchain]" # LLM matcher
pip install "denselinkage[all]"
Requires Python 3.10+.
Quickstart
Link two tables of companies with messy, inconsistent names — no configuration, one call:
import pandas as pd
from denselinkage import DenseLinker, LabeledPairs, Source
from denselinkage.metrics import linkage_metrics
left = pd.DataFrame({
    "id": ["A1", "A2", "A3"],
    "name": ["Apple Inc", "Microsoft Corp", "Google LLC"],
    "city": ["Cupertino", "Redmond", "Mountain View"],
})
right = pd.DataFrame({
    "id": ["B1", "B2", "B3"],
    "name": ["Apple Incorporated", "Microsoft", "Google"],
    "city": ["Cupertino", "Redmond", "Mountain View"],
})
linker = DenseLinker.with_defaults()  # lexical stack: embed → index → threshold
result = linker.link(                 # one call, no fit/predict, no mutation
    Source(left, id_column="id"),
    Source(right, id_column="id"),
)
print(result.to_frame().query("match")) # the decided matches, as a DataFrame
gold = LabeledPairs.from_pairs([("A1", "B1"), ("A2", "B2"), ("A3", "B3")])
m = linkage_metrics(result, gold=gold)
print(f"precision={m.precision:.2f} recall={m.recall:.2f} f1={m.f1:.2f}")
  left_id right_id  similarity  match confidence reason
0      A1       B1    0.762443   True       None   None
3      A2       B2    0.833908   True       None   None
6      A3       B3    0.864126   True       None   None
precision=1.00 recall=1.00 f1=1.00
with_defaults() wires the dependency-free lexical stack: character n-gram
embeddings, brute-force nearest-neighbour search, and a similarity threshold.
Out of the box it matches records across abbreviations, punctuation
differences, and typos (Apple Inc ↔ Apple Incorporated).
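To make the lexical idea concrete, here is a minimal, standard-library sketch of hashed character n-gram embeddings compared by cosine similarity. It is an illustration of the general technique, not the library's HashedNGramEmbedder; the function names, the trigram size, the 256-bucket dimension, and the use of crc32 are all assumptions chosen for the example.

```python
import math
import zlib


def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Lower-case, pad with spaces, and slide a window of n characters."""
    padded = f" {text.lower()} "
    return [padded[i : i + n] for i in range(len(padded) - n + 1)]


def embed(text: str, dim: int = 256, n: int = 3) -> list[float]:
    """Hash each n-gram into one of `dim` buckets, then L2-normalise."""
    vec = [0.0] * dim
    for gram in char_ngrams(text, n):
        # crc32 is stable across runs, unlike the built-in hash() on str.
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    """For unit vectors the dot product is the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


same = cosine(embed("Apple Inc"), embed("Apple Incorporated"))
other = cosine(embed("Apple Inc"), embed("Microsoft Corp"))
print(f"same={same:.3f} other={other:.3f}")
```

The two Apple spellings share most of their trigrams, so they score well above the unrelated pair; a threshold between the two values would separate them.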
How it works
denselinkage is a four-stage pipeline, and every stage is a swappable component:
Sources ──► Block ──────► Match ──────► Cluster ──────► Evaluate
            (embed +      (threshold    (connected      (P/R/F1,
            top-k NN)     or LLM)       components)     B³, …)
- Block: embed each record and retrieve its top-k nearest neighbours, turning an N × M comparison into a handful of candidate pairs.
- Match: decide each candidate. ThresholdMatcher gates on similarity; LangChainMatcher asks an LLM and returns a typed decision.
- Cluster: group the matches into entities with transitive connected_components.
- Evaluate: score against gold labels with linkage, blocking, or clustering (B³) metrics.
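The Block stage can be sketched as a brute-force top-k search over embedding vectors. The snippet below is a toy illustration under stated assumptions (hand-written unit vectors, a hypothetical top_k_candidates helper), not the library's DenseBlocker; it only shows how keeping k neighbours per record shrinks the N × M candidate set.

```python
import heapq


def top_k_candidates(
    left: dict[str, list[float]],
    right: dict[str, list[float]],
    k: int = 2,
    min_similarity: float = 0.0,
) -> list[tuple[str, str, float]]:
    """For each left vector, keep its k most similar right vectors."""
    pairs = []
    for left_id, lvec in left.items():
        # Dot product equals cosine similarity because inputs are unit vectors.
        scored = (
            (sum(a * b for a, b in zip(lvec, rvec)), right_id)
            for right_id, rvec in right.items()
        )
        for score, right_id in heapq.nlargest(k, scored):
            if score >= min_similarity:
                pairs.append((left_id, right_id, score))
    return pairs


# Toy unit vectors standing in for record embeddings.
left = {"A1": [1.0, 0.0], "A2": [0.0, 1.0]}
right = {"B1": [0.8, 0.6], "B2": [0.6, 0.8], "B3": [-1.0, 0.0]}

candidates = top_k_candidates(left, right, k=1)
print(candidates)  # 2 candidate pairs instead of all 2 × 3 = 6
```

Brute force still scores every pair once; an approximate index changes how the neighbours are found, not what the stage returns.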
Three verbs cover the common shapes — link (two datasets), dedupe (one
dataset against itself), and match_pairs (you already have candidate pairs).
index() builds a reusable reference index, so you embed once and query many times.
Scaling up: semantic + LLM matching
The lexical default is fast and free, but it only sees characters — it can't tell that Google and Alphabet are the same company. Swap in the heavy adapters for meaning (semantic embeddings), scale (FAISS), and judgment (an LLM), all behind the same ports:
| Stage | Lexical (default) | Semantic + LLM |
|---|---|---|
| Embed | HashedNGramEmbedder | SentenceTransformerEmbedder · [sentence-transformers] |
| Index | NumpyFlatIndex | FaissFlatIndex · [faiss] |
| Match | ThresholdMatcher | LangChainMatcher · [langchain] |
| Catches | typos, abbreviations | + semantic renames, + judgment calls |
from denselinkage import DenseLinker
from denselinkage.blocking import DenseBlocker
from denselinkage.embedding import SentenceTransformerEmbedder
from denselinkage.indexing import FaissFlatIndex
from denselinkage.matching import LangChainMatcher
from langchain_openai import ChatOpenAI
linker = DenseLinker(
    blocker=DenseBlocker(
        embedder=SentenceTransformerEmbedder("all-MiniLM-L6-v2"),
        vector_index=FaissFlatIndex(),
        top_k=5,
        similarity_threshold=0.6,
    ),
    matcher=LangChainMatcher(
        llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
        prompt="Are these the same entity?\nA: {record_a}\nB: {record_b}",
    ),
)
result = linker.link(left, right) # the call is unchanged
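Because both stacks score candidates with cosine similarity, a
similarity_threshold tuned on the lexical stack stays on the same scale here.
Score distributions can still differ between embedders, so re-check the
cut-off on labeled pairs after switching. See the Semantic + LLM guide for
model selection, the prompt contract, retries, and cost.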
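One way to picture an LLM matcher that "returns a typed decision" is a thin wrapper that formats the prompt, calls a model, and validates the reply before trusting it. The sketch below is hypothetical throughout: the Decision dataclass, the decide helper, the JSON reply format, and the canned fake_llm stand-in are assumptions for illustration, not LangChainMatcher's actual contract.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Decision:
    """Hypothetical typed result for one candidate pair."""
    match: Optional[bool]        # None means the reply could not be interpreted
    confidence: Optional[float]
    reason: Optional[str]


# Assumption: the model is told to answer in JSON so the reply can be validated.
PROMPT = (
    "Are these the same entity?\nA: {record_a}\nB: {record_b}\n"
    "Reply as JSON with keys match, confidence, reason."
)


def decide(ask: Callable[[str], str], record_a: str, record_b: str) -> Decision:
    """Format the prompt, call the model, and validate its JSON reply."""
    reply = ask(PROMPT.format(record_a=record_a, record_b=record_b))
    try:
        data = json.loads(reply)
        if not isinstance(data, dict) or not isinstance(data.get("match"), bool):
            raise ValueError("missing boolean 'match'")
        confidence = min(1.0, max(0.0, float(data.get("confidence", 0.0))))
        return Decision(data["match"], confidence, str(data.get("reason", "")))
    except (ValueError, TypeError) as exc:
        # Undecidable replies are kept separate rather than guessed at.
        return Decision(None, None, f"unparseable reply: {exc}")


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned reply."""
    return '{"match": true, "confidence": 0.9, "reason": "canned reply"}'


print(decide(fake_llm, "Google LLC", "Alphabet Inc"))
print(decide(lambda _: "probably?", "Google LLC", "Alphabet Inc"))
```

Keeping the unparseable case as an explicit third outcome means a malformed reply never silently becomes a match or a non-match.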
Deduplicate and cluster
from denselinkage import DenseLinker, Source, connected_components
# df: one table that may contain duplicate records, with an "id" column
result = DenseLinker.with_defaults().dedupe(Source(df, id_column="id"))
clusters = connected_components(result) # transitive grouping → entities
print(clusters.to_frame()) # record_id, cluster_id
dedupe links a dataset against itself and suppresses self-pairs internally.
Clustering is transitive (A ↔ B and B ↔ C matches put A, B, and C in one
cluster), so a noisy matcher can over-merge: watch for B³ recall ≫ precision.
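The transitive grouping can be illustrated with a small union-find over matched pairs. This is a generic sketch of connected components, not the library's connected_components; the cluster_pairs name and the integer cluster ids are assumptions made for the example.

```python
def cluster_pairs(pairs: list[tuple[str, str]]) -> dict[str, int]:
    """Union-find over matched pairs; returns record_id -> cluster_id."""
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Number the clusters in order of first appearance among sorted record ids.
    roots: dict[str, int] = {}
    return {rec: roots.setdefault(find(rec), len(roots)) for rec in sorted(parent)}


# A and C were never matched directly, yet they end up in the same cluster.
clusters = cluster_pairs([("A", "B"), ("B", "C"), ("D", "E")])
print(clusters)  # {'A': 0, 'B': 0, 'C': 0, 'D': 1, 'E': 1}
```

A single false-positive pair is enough to fuse two otherwise separate entities, which is why matcher precision matters more once clustering is applied.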
Evaluation
Metrics are first-class, split by what they measure:
- Linkage: linkage_metrics → precision / recall / F1 over matched pairs (undecidable pairs are surfaced as errors and counted separately, never mixed in).
- Blocking: blocking_metrics / pair_completeness_at_k → did blocking even surface the true pairs?
- Clustering: clustering_metrics → B³ (Bagga–Baldwin) precision / recall / F1 over the entity clusters.
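For readers new to B³, the following sketch computes it from scratch: for each record, precision is the share of its predicted cluster that truly belongs with it, and recall is the share of its true cluster that was recovered; both are then averaged over records. The b_cubed function and the dict-based cluster assignments are assumptions for illustration, not the library's clustering_metrics.

```python
from collections import defaultdict


def b_cubed(pred: dict[str, int], gold: dict[str, int]) -> tuple[float, float, float]:
    """B-cubed precision, recall, and F1, averaged over records.

    Assumes `pred` and `gold` assign a cluster id to the same set of records.
    """
    def members(assignment: dict[str, int]) -> dict[int, set[str]]:
        groups: dict[int, set[str]] = defaultdict(set)
        for rec, cluster in assignment.items():
            groups[cluster].add(rec)
        return groups

    pred_groups, gold_groups = members(pred), members(gold)
    precision = recall = 0.0
    for rec in pred:
        p_set, g_set = pred_groups[pred[rec]], gold_groups[gold[rec]]
        overlap = len(p_set & g_set)  # always >= 1: the record itself
        precision += overlap / len(p_set)
        recall += overlap / len(g_set)
    precision, recall = precision / len(pred), recall / len(pred)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Gold entities: {A, B} and {C}. The prediction over-merges all three.
gold = {"A": 0, "B": 0, "C": 1}
pred = {"A": 0, "B": 0, "C": 0}
b3_p, b3_r, b3_f1 = b_cubed(pred, gold)
print(f"precision={b3_p:.3f} recall={b3_r:.3f} f1={b3_f1:.3f}")
# precision=0.556 recall=1.000 f1=0.714
```

Over-merging leaves recall perfect while precision drops, which is the signature of a matcher that is too permissive.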
Plus tune_threshold for a P/R/F1 sweep and mine_hard_negatives for contrastive
training material.
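A threshold sweep is simple enough to sketch by hand: score the candidate pairs once, then recompute precision, recall, and F1 at each cut-off. The snippet is an illustrative assumption about what such a sweep involves, not the signature or output of tune_threshold, and the similarity scores are toy values.

```python
def sweep_thresholds(
    scored: dict[tuple[str, str], float],
    gold: set[tuple[str, str]],
    thresholds: list[float],
) -> list[tuple[float, float, float, float]]:
    """Return (threshold, precision, recall, f1) for each cut-off."""
    rows = []
    for t in thresholds:
        predicted = {pair for pair, score in scored.items() if score >= t}
        tp = len(predicted & gold)
        # By convention, precision is 0.0 when nothing is predicted.
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(gold) if gold else 0.0
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        rows.append((t, precision, recall, f1))
    return rows


# Toy similarity scores for five candidate pairs; three are true matches.
scored = {
    ("A1", "B1"): 0.76, ("A2", "B2"): 0.83, ("A3", "B3"): 0.86,
    ("A1", "B2"): 0.41, ("A2", "B3"): 0.55,
}
gold = {("A1", "B1"), ("A2", "B2"), ("A3", "B3")}

rows = sweep_thresholds(scored, gold, [0.4, 0.6, 0.8])
best = max(rows, key=lambda row: row[3])
for t, p, r, f1 in rows:
    print(f"t={t:.1f} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

On this toy data the lowest cut-off admits two false positives and the highest drops a true match, so the middle threshold gives the best F1.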
Design
denselinkage is contract-first (hexagonal / ports-and-adapters). Domain logic
talks to small typing.Protocols — Embedder, VectorIndex, Matcher, … — and
concrete adapters plug in behind them. Two consequences worth knowing:
- The dependency cut is structural. Heavy backends import lazily, inside the methods that use them; a CI job asserts import denselinkage pulls in no FAISS / torch / LangChain.
- The 1.0 contract is frozen. Signatures and field types won't change under you; the library evolves by adding (an optional field, a sibling type, a new classmethod), never by modifying. Stateful components follow spec → artifact: a stateless spec's build(...) returns an immutable, fitted artifact.
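The ports-and-adapters shape and the spec → artifact pattern can be sketched with nothing but the standard library. Every name below (the Embedder method signature, LengthEmbedder, IndexSpec, FittedIndex) is hypothetical and chosen for illustration; the library's real protocols may look different.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@runtime_checkable
class Embedder(Protocol):
    """Port: any object with a matching embed() method satisfies it."""

    def embed(self, texts: list[str]) -> list[list[float]]: ...


@dataclass(frozen=True)
class LengthEmbedder:
    """Toy adapter. A heavy backend would import its library inside embed()."""

    def embed(self, texts: list[str]) -> list[list[float]]:
        return [[float(len(t))] for t in texts]


@dataclass(frozen=True)
class FittedIndex:
    """Immutable artifact: holds the embedded reference records."""

    ids: tuple[str, ...]
    vectors: tuple[tuple[float, ...], ...]


@dataclass(frozen=True)
class IndexSpec:
    """Stateless spec: configuration only, no fitted state."""

    embedder: Embedder

    def build(self, records: dict[str, str]) -> FittedIndex:
        vectors = self.embedder.embed(list(records.values()))
        return FittedIndex(tuple(records), tuple(tuple(v) for v in vectors))


spec = IndexSpec(embedder=LengthEmbedder())
index = spec.build({"A1": "Apple Inc", "A2": "Microsoft Corp"})
print(isinstance(LengthEmbedder(), Embedder), index)
```

LengthEmbedder never inherits from Embedder; it satisfies the port structurally, which is what lets an adapter live in an optional extra without the core importing it.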
See the architecture overview for the full picture.
Documentation
- Tutorial — link two tables stage by stage.
- Semantic + LLM matching and Choosing components.
- API reference.
Runnable scripts live in examples/ — 00_quickstart.py is the
shortest path; 01/02 show the full semantic + LLM assembly.
Development
Requires uv.
uv sync --dev
uv run ruff check . && uv run ruff format --check . && uv run mypy && uv run pytest
CI runs lint, format, strict mypy, and the test suite on Python 3.10–3.13, with a separate job for the optional adapters. See CONTRIBUTING.md.
Changelog
See CHANGELOG.md.
Citing
If you use denselinkage in your research, please cite it — see
CITATION.cff.
License
MIT © 2026 Alvaro