Entity Resolution infrastructure for fragmented, noisy, large-scale datasets

sandx-er

Python 3.10+ · License: Apache 2.0

Part of the SandX Lab computational infrastructure ecosystem.


What It Does

sandx-er resolves the identity of real-world entities across datasets where the same entity appears as multiple, inconsistent, or duplicate records. Pipeline:

Raw records  →  Blocking  →  Matching  →  Clustering  →  Resolved identity graph
                 (LSH,          (Jaccard,    (Connected
                  SNM,           cosine)      components,
                  ANN)                        Correlation)

Each stage is independently configurable. Every output carries a probabilistic confidence score — not a binary decision.

Status

v0.1 — Phase 2 active development

Component                                     Status
EntityResolver (pipeline orchestrator)        Working
LSHBlocking (MinHash LSH)                     Working
SortedNeighborhoodBlocking (SNM)              Working
EmbeddingANNBlocking (ANN via sandx-embed)    Working
JaccardScorer (character shingle Jaccard)     Working
CosineSimilarityScorer (embedding cosine)     Working
ConnectedComponentsClustering                 Working
CorrelationClustering (Kwik-Cluster)          Working
Abt-Buy benchmark                             Working
PyPI package                                  Planned

Installation

pip install sandx-er

Or from source:

git clone https://github.com/sandxlab/sandx-er
cd sandx-er
pip install -e ".[dev]"

For embedding-based blocking and matching:

pip install "sandx-er[embed]"

Quick Start

import pandas as pd
from sandx_er import EntityResolver

records = pd.DataFrame({
    "name":  ["Acme Corp", "Acme Corp.", "GlobalTech Inc", "Global Tech"],
    "city":  ["Boston",    "Boston",     "New York",       "New York"],
})

er = EntityResolver(
    blocking="lsh",       # MinHash LSH candidate generation
    similarity="jaccard", # character Jaccard similarity scoring
    threshold=0.4,
)

result = er.resolve(records)

print(f"Resolved {result.n_records} records → {result.n_clusters} entities")
for cluster in result.clusters:
    print(f"  {cluster.canonical_id[:8]}  size={cluster.size}  conf={cluster.confidence:.2f}")
    print(f"    records: {cluster.record_ids}")

Output:

Resolved 4 records → 2 entities
  3f2a1b8c  size=2  conf=0.81
    records: ['0', '1']
  7e9d4c2a  size=2  conf=0.76
    records: ['2', '3']

Pipeline Stages

Blocking

Blocking reduces the O(N²) all-pairs comparison space to a tractable set of candidate pairs.

from sandx_er import EntityResolver, LSHBlocking, SortedNeighborhoodBlocking, EmbeddingANNBlocking

# MinHash LSH — works on all string fields, no key required
er = EntityResolver(blocking="lsh")

# Sorted Neighborhood Method — fast, requires a sort key
er = EntityResolver(blocking="snm", key_field="name")

# Embedding ANN — semantic similarity (requires sandx-embed)
er = EntityResolver(blocking="embedding", embed_model="sentence-bert")

# Or pass a custom BlockingMethod instance
er = EntityResolver(blocking=LSHBlocking(n_bands=30, n_rows=4))
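The n_bands and n_rows arguments read like the usual MinHash LSH banding parameters. Assuming LSHBlocking follows the standard banding scheme (an assumption, not confirmed by the package), a pair with Jaccard similarity s becomes a candidate with probability 1 - (1 - s^r)^b, where b is the number of bands and r the rows per band. The sketch below is standalone; the function names are hypothetical and not part of sandx-er.

```python
def candidate_probability(s: float, n_bands: int, n_rows: int) -> float:
    """Probability that two records with Jaccard similarity `s`
    agree on every row of at least one band (standard LSH banding)."""
    return 1.0 - (1.0 - s ** n_rows) ** n_bands


def approx_threshold(n_bands: int, n_rows: int) -> float:
    """Similarity at which the S-curve rises most steeply: (1/b)^(1/r)."""
    return (1.0 / n_bands) ** (1.0 / n_rows)


if __name__ == "__main__":
    bands, rows = 30, 4  # the values used in the custom LSHBlocking example
    print(f"approximate threshold: {approx_threshold(bands, rows):.3f}")
    for s in (0.2, 0.4, 0.6, 0.8):
        p = candidate_probability(s, bands, rows)
        print(f"similarity {s:.1f} -> candidate probability {p:.3f}")
```

With 30 bands of 4 rows the curve is steepest at a similarity of about 0.43. More bands raise recall at the cost of more candidate pairs; more rows per band do the opposite.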

Matching

Assigns a similarity score to each candidate pair produced by blocking.

from sandx_er import EntityResolver, JaccardScorer, CosineSimilarityScorer

er = EntityResolver(similarity="jaccard")               # no deps; fast
er = EntityResolver(similarity="embedding")             # requires sandx-embed
er = EntityResolver(similarity=JaccardScorer(shingle_size=2, fields=["name"]))
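For intuition about what a character-shingle Jaccard score measures, here is a minimal standalone sketch. It is not the JaccardScorer implementation: the lowercasing step and the handling of empty strings are assumptions made for illustration, and the helper names are hypothetical.

```python
def shingles(text: str, k: int = 2) -> set[str]:
    """Set of overlapping character k-grams of `text` (lowercased here by assumption)."""
    text = text.lower()
    if len(text) < k:
        return {text} if text else set()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a: str, b: str, k: int = 2) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of the two shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 0.0  # assumption: two empty strings carry no evidence of a match
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    for left, right in [("Acme Corp", "Acme Corp."), ("GlobalTech Inc", "Global Tech")]:
        print(f"{left!r} vs {right!r}: {jaccard(left, right):.3f}")
```

On the name field alone, the two Quick Start pairs score 8/9 and 8/15 with 2-character shingles, both above the 0.4 threshold used there.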

Clustering

Reconciles pairwise decisions into globally consistent entity clusters.

from sandx_er import EntityResolver

er = EntityResolver(clustering="connected_components")  # fast; may over-merge
er = EntityResolver(clustering="correlation")           # slower; corrects transitivity errors
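The trade-off between the two strategies can be seen on a chain of matches a-b, b-c, c-d, where no other pair clears the threshold. The sketch below is a generic illustration, assuming a union-find connected-components pass and the textbook pivot-based Kwik-Cluster procedure; it is not the package's code, and the function names are hypothetical.

```python
import random


def connected_components(nodes, edges):
    """Union-find: every chain of matched pairs ends up in one cluster."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())


def kwik_cluster(nodes, edges, rng):
    """Pivot-based correlation clustering: pick a random pivot, group it with
    its still-unclustered direct matches, remove them, and repeat."""
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    remaining = set(nodes)
    clusters = []
    while remaining:
        pivot = rng.choice(sorted(remaining))
        cluster = {pivot} | (neighbors[pivot] & remaining)
        clusters.append(cluster)
        remaining -= cluster
    return clusters


nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d")]  # a chain; a-d never matched directly

cc = connected_components(nodes, edges)
kc = kwik_cluster(nodes, edges, random.Random(0))
print("connected components:", cc)
print("kwik-cluster:        ", kc)
```

Connected components merges all four records even though a and d were never matched. Kwik-Cluster only groups a pivot with its direct matches, so the chain is always split into at least two clusters; which split you get depends on the random pivot order.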

Benchmark — Febrl4

python -m benchmarks.abt_buy                                    # LSH + Jaccard, threshold 0.3
python -m benchmarks.abt_buy --blocking snm --key-field surname # SNM + Jaccard

Uses the Febrl4 person record linkage dataset (built into recordlinkage — no download required). 5,000 records per table · 5,000 true 1:1 matches · synthetic Australian person records with realistic noise.

Config                                     Precision  Recall  F1     Time
LSH + Jaccard · threshold=0.3              1.000      0.955   0.977  1.1s
SNM (surname) + Jaccard · threshold=0.3    1.000      0.384   0.555  0.4s
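The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). A minimal sketch of how pairwise metrics of this kind are typically computed is shown here; the helper names are hypothetical, and the benchmark script itself may compute them differently.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def pair_metrics(predicted: set, truth: set) -> tuple[float, float, float]:
    """Precision, recall and F1 over sets of matched record-id pairs."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall, f1_score(precision, recall)


# Reproduce the F1 column from the reported precision and recall.
for name, p, r in [("LSH + Jaccard", 1.000, 0.955), ("SNM + Jaccard", 1.000, 0.384)]:
    print(f"{name}: F1 = {f1_score(p, r):.3f}")

# Toy example: every predicted pair is correct, but half the true pairs are missed.
predicted = {("a1", "b1"), ("a2", "b2")}
truth = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
print(pair_metrics(predicted, truth))
```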

LSH generalizes across all field variations; SNM recall drops when the blocking key (surname) is noisy. All results are reproducible: pip install recordlinkage && python -m benchmarks.abt_buy.
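Why a noisy key hurts SNM can be shown with a toy sliding-window pass. This is a generic sketch of the Sorted Neighborhood Method, not the SortedNeighborhoodBlocking implementation; the window size, record layout, and function name are assumptions for illustration.

```python
def snm_candidates(records: dict, key_field: str, window: int = 3) -> set:
    """Sort record ids by the key field, then pair each record with the
    next `window - 1` records in sorted order."""
    order = sorted(records, key=lambda rid: records[rid][key_field])
    pairs = set()
    for i, rid in enumerate(order):
        for other in order[i + 1:i + window]:
            pairs.add(tuple(sorted((rid, other))))
    return pairs


records = {
    "a1": {"surname": "anderson"}, "b1": {"surname": "anderson"},
    "a2": {"surname": "baker"},    "b2": {"surname": "vaker"},  # typo in the first letter
    "a3": {"surname": "clark"},    "b3": {"surname": "clark"},
    "a4": {"surname": "davis"},    "b4": {"surname": "davis"},
}

pairs = snm_candidates(records, key_field="surname", window=3)
print(("a1", "b1") in pairs)  # clean key: the true match is a candidate
print(("a2", "b2") in pairs)  # noisy key: the true match is never compared
```

A typo in the first character moves "vaker" to the far end of the sort order, so its true match "baker" never falls inside the same window. No scorer can recover a pair that blocking did not propose, which is consistent with the recall gap in the table.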

Architecture

sandx_er/
├── resolver.py     EntityResolver — pipeline orchestrator
├── blocking.py     LSHBlocking, SortedNeighborhoodBlocking, EmbeddingANNBlocking
├── matching.py     JaccardScorer, CosineSimilarityScorer
└── clustering.py   ConnectedComponentsClustering, CorrelationClustering

Optional dependency: sandx-embed for embedding-based blocking and matching.

Benchmark Datasets

Dataset       Domain      Table A  Table B  Matches
Abt-Buy       E-commerce  1,081    1,092    ~1,097
DBLP-ACM      Academic    2,616    2,294    2,224
DBLP-Scholar  Academic    2,616    64,263   5,347
Cora          Citations   1,295    dedup

All benchmark runs are version-tagged and fully reproducible from public data.

License

Apache 2.0 — see LICENSE
