Entity Resolution infrastructure for fragmented, noisy, large-scale datasets
sandx-er
Part of the SandX Lab computational infrastructure ecosystem.
What It Does
sandx-er resolves the identity of real-world entities across datasets where the same entity appears as multiple, inconsistent, or duplicate records. Pipeline:
Raw records → Blocking (LSH, SNM, ANN) → Matching (Jaccard, cosine) → Clustering (connected components, correlation) → Resolved identity graph
Each stage is independently configurable. Every output carries a probabilistic confidence score — not a binary decision.
Status
v0.1 — Phase 2 active development
| Component | Status |
|---|---|
| EntityResolver (pipeline orchestrator) | Working |
| LSHBlocking (MinHash LSH) | Working |
| SortedNeighborhoodBlocking (SNM) | Working |
| EmbeddingANNBlocking (ANN via sandx-embed) | Working |
| JaccardScorer (character shingle Jaccard) | Working |
| CosineSimilarityScorer (embedding cosine) | Working |
| ConnectedComponentsClustering | Working |
| CorrelationClustering (Kwik-Cluster) | Working |
| Abt-Buy benchmark | Working |
| PyPI package | Planned |
Installation
pip install sandx-er
Or from source:
git clone https://github.com/sandxlab/sandx-er
cd sandx-er
pip install -e ".[dev]"
For embedding-based blocking and matching:
pip install "sandx-er[embed]"
Quick Start
import pandas as pd
from sandx_er import EntityResolver

records = pd.DataFrame({
    "name": ["Acme Corp", "Acme Corp.", "GlobalTech Inc", "Global Tech"],
    "city": ["Boston", "Boston", "New York", "New York"],
})

er = EntityResolver(
    blocking="lsh",        # MinHash LSH candidate generation
    similarity="jaccard",  # character Jaccard similarity scoring
    threshold=0.4,
)
result = er.resolve(records)

print(f"Resolved {result.n_records} records → {result.n_clusters} entities")
for cluster in result.clusters:
    print(f" {cluster.canonical_id[:8]} size={cluster.size} conf={cluster.confidence:.2f}")
    print(f" records: {cluster.record_ids}")
Output:
Resolved 4 records → 2 entities
3f2a1b8c size=2 conf=0.81
records: ['0', '1']
7e9d4c2a size=2 conf=0.76
records: ['2', '3']
Pipeline Stages
Blocking
Reduces O(N²) comparisons to a tractable candidate set.
from sandx_er import LSHBlocking, SortedNeighborhoodBlocking, EmbeddingANNBlocking
# MinHash LSH — works on all string fields, no key required
er = EntityResolver(blocking="lsh")
# Sorted Neighborhood Method — fast, requires a sort key
er = EntityResolver(blocking="snm", key_field="name")
# Embedding ANN — semantic similarity (requires sandx-embed)
er = EntityResolver(blocking="embedding", embed_model="sentence-bert")
# Or pass a custom BlockingMethod instance
er = EntityResolver(blocking=LSHBlocking(n_bands=30, n_rows=4))
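To get a feel for what `n_bands` and `n_rows` trade off, the sketch below evaluates the textbook MinHash LSH banding formula: two records with Jaccard similarity s become a candidate pair with probability 1 − (1 − s^r)^b, where b is the number of bands and r the rows per band. This assumes sandx-er's parameters follow the standard banding scheme, which the README does not state; the function names are hypothetical and not part of the library.

```python
def candidate_probability(similarity: float, n_bands: int, n_rows: int) -> float:
    """Chance that two records agree on every row of at least one band."""
    return 1.0 - (1.0 - similarity ** n_rows) ** n_bands


def approximate_threshold(n_bands: int, n_rows: int) -> float:
    """Similarity near which the S-curve rises most steeply: (1/b)^(1/r)."""
    return (1.0 / n_bands) ** (1.0 / n_rows)


# Illustrative values only; 30 bands x 4 rows mirrors the example above.
print(f"approx. threshold: {approximate_threshold(30, 4):.3f}")
for s in (0.2, 0.4, 0.6, 0.8):
    p = candidate_probability(s, n_bands=30, n_rows=4)
    print(f"similarity={s:.1f} -> candidate probability={p:.3f}")
```

More bands raise recall at the cost of more candidate pairs; more rows per band make each band stricter and shrink the candidate set.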
Matching
Scores each candidate pair.
from sandx_er import JaccardScorer, CosineSimilarityScorer
er = EntityResolver(similarity="jaccard") # no deps; fast
er = EntityResolver(similarity="embedding") # requires sandx-embed
er = EntityResolver(similarity=JaccardScorer(shingle_size=2, fields=["name"]))
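For intuition about what a character shingle Jaccard score measures, here is a minimal standalone sketch. It is not the `JaccardScorer` implementation: the lowercasing step, the handling of empty strings, and the helper names are assumptions made for illustration.

```python
def shingles(text: str, size: int = 2) -> set[str]:
    """Return the set of overlapping character n-grams of a string."""
    normalized = text.lower().strip()
    if len(normalized) < size:
        # Strings shorter than one shingle are kept whole (or yield no shingles).
        return {normalized} if normalized else set()
    return {normalized[i:i + size] for i in range(len(normalized) - size + 1)}


def jaccard(a: str, b: str, size: int = 2) -> float:
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 0.0  # assumption: two empty values carry no evidence of a match
    return len(sa & sb) / len(sa | sb)


for left, right in [("Acme Corp", "Acme Corp."), ("GlobalTech Inc", "Global Tech")]:
    print(f"{left!r} vs {right!r}: {jaccard(left, right):.3f}")
```

Smaller shingles tolerate typos better but also make unrelated short strings look more alike, so the shingle size and the threshold are best tuned together.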
Clustering
Reconciles pairwise decisions into globally consistent entity clusters.
er = EntityResolver(clustering="connected_components") # fast; may over-merge
er = EntityResolver(clustering="correlation") # slower; corrects transitivity errors
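The difference between the two strategies shows up on a simple chain of matches. The sketch below is a hedged illustration, not the library's code: `connected_components` uses a plain union-find, and `kwik_cluster` follows the pivot idea of Kwik-Cluster but takes an explicit pivot order (the published algorithm picks pivots at random) so the result is deterministic.

```python
from collections import defaultdict


def connected_components(n_records, matched_pairs):
    """Merge every record reachable through any chain of matched pairs."""
    parent = list(range(n_records))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for record in range(n_records):
        groups[find(record)].append(record)
    return sorted(groups.values())


def kwik_cluster(n_records, matched_pairs, pivot_order):
    """Each pivot claims only its direct, still-unassigned neighbours."""
    neighbours = defaultdict(set)
    for a, b in matched_pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    unassigned = set(range(n_records))
    clusters = []
    for pivot in pivot_order:
        if pivot not in unassigned:
            continue
        cluster = {pivot} | (neighbours[pivot] & unassigned)
        unassigned -= cluster
        clusters.append(sorted(cluster))
    return sorted(clusters)


# Records 0-1 and 1-2 matched, but 0 and 2 were never judged similar.
pairs = [(0, 1), (1, 2)]
print(connected_components(4, pairs))
print(kwik_cluster(4, pairs, pivot_order=[0, 1, 2, 3]))
```

Connected components places records 0, 1 and 2 in one entity even though 0 and 2 never matched directly, while the pivot-based pass keeps record 2 separate.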
Benchmark — Febrl4
python -m benchmarks.abt_buy # LSH + Jaccard, threshold 0.3
python -m benchmarks.abt_buy --blocking snm --key-field surname # SNM + Jaccard
Uses the Febrl4 person record linkage dataset (built into recordlinkage — no download required).
5,000 records per table · 5,000 true 1:1 matches · synthetic Australian person records with realistic noise.
| Config | Precision | Recall | F1 | Time |
|---|---|---|---|---|
| LSH + Jaccard · threshold=0.3 | 1.000 | 0.955 | 0.977 | 1.1s |
| SNM (surname) + Jaccard · threshold=0.3 | 1.000 | 0.384 | 0.555 | 0.4s |
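The F1 column can be checked by hand from precision and recall. The sketch below assumes the usual pairwise definitions (a predicted pair counts as correct when it appears in the ground-truth match set); the benchmark script's exact evaluation code is not shown in this README, so the helper names are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def pair_metrics(predicted: set, truth: set) -> tuple[float, float, float]:
    """Precision, recall and F1 over sets of (id_a, id_b) pairs."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall, f1_score(precision, recall)


# Recompute the F1 column from the reported precision and recall.
print(round(f1_score(1.000, 0.955), 3))
print(round(f1_score(1.000, 0.384), 3))
```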
LSH blocks on all string fields, so it tolerates noise in any single field. SNM depends on one sort key, so its recall drops when that key (here, surname) is noisy.
All results are reproducible: pip install recordlinkage && python -m benchmarks.abt_buy.
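Why a noisy sort key hurts SNM recall can be seen in a toy version of the method. This is a generic sorted-neighborhood sketch with made-up surnames and a hypothetical `window` parameter, not the `SortedNeighborhoodBlocking` implementation.

```python
def sorted_neighborhood_pairs(records, key, window=3):
    """Sort by key, then pair each record with the next (window - 1) records."""
    order = sorted(range(len(records)), key=lambda i: records[i][key])
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


# Illustrative data: records 0, 1 and 2 are the same person with typos.
records = [
    {"surname": "smith"},   # 0: clean
    {"surname": "smyth"},   # 1: typo late in the key, still sorts next to 0
    {"surname": "zmith"},   # 2: typo in the first character, sorts far away
    {"surname": "jones"},
    {"surname": "miller"},
    {"surname": "taylor"},
    {"surname": "wilson"},
]

pairs = sorted_neighborhood_pairs(records, key="surname", window=3)
print((0, 1) in pairs)  # pair survives blocking
print((0, 2) in pairs)  # pair is never compared
```

An error near the start of the key moves a record out of its true neighbours' window, so the pair is lost before matching ever runs.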
Architecture
sandx_er/
├── resolver.py EntityResolver — pipeline orchestrator
├── blocking.py LSHBlocking, SortedNeighborhoodBlocking, EmbeddingANNBlocking
├── matching.py JaccardScorer, CosineSimilarityScorer
└── clustering.py ConnectedComponentsClustering, CorrelationClustering
Optional dependency: sandx-embed for embedding-based blocking and matching.
Benchmark Datasets
| Dataset | Domain | Table A | Table B | Matches |
|---|---|---|---|---|
| Abt-Buy | E-commerce | 1,081 | 1,092 | ~1,097 |
| DBLP-ACM | Academic | 2,616 | 2,294 | 2,224 |
| DBLP-Scholar | Academic | 2,616 | 64,263 | 5,347 |
| Cora | Citations | 1,295 | — | dedup |
All benchmark runs are version-tagged and fully reproducible from public data.
Related
- sandx-embed: shared embedding infrastructure
- sandx-graph: graph intelligence over resolved entities
- sandx.io: project home
License
Apache 2.0 — see LICENSE