High-precision record linkage library with multi-pass support
preclink
High-precision record linkage library implementing a 7-step pipeline with multi-pass support.
Installation
pip install preclink
Quick Start
import pandas as pd
from preclink import Pipeline, StringComparison, ExactComparison

df_left = pd.DataFrame({
    "id": [1, 2, 3],
    "first_name": ["John", "Jane", "Bob"],
    "last_name": ["Smith", "Doe", "Johnson"],
    "state": ["CA", "NY", "CA"],
})
df_right = pd.DataFrame({
    "id": [101, 102, 103],
    "first_name": ["Jon", "Jane", "Robert"],
    "last_name": ["Smith", "Doe", "Johnson"],
    "state": ["CA", "NY", "CA"],
})

result = (
    Pipeline()
    .preprocess(normalize_unicode=True, lowercase=True)
    .block(on="state")
    .score(comparisons=[
        StringComparison("first_name", algorithm="jaro_winkler"),
        StringComparison("last_name", algorithm="jaro_winkler"),
    ])
    .filter(min_score=0.7)
    .decide(method="hungarian")
    .build()
    .link(df_left, df_right)
)

print(result.matches)
The 7-Step Pipeline
- Preprocess: Normalize text (unicode, case, whitespace)
- Deduplicate: Remove within-table duplicates
- Block: Reduce comparison space using blocking keys
- Score: Compute pairwise similarity scores
- Filter: Remove low-confidence pairs
- Decide: Apply matching algorithm (Hungarian, Greedy, Row-Sequential)
- Inspect: Generate diagnostics and reports
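To make the first and third steps concrete, here is a minimal sketch in plain Python (standard library only, no pandas). It is not preclink's implementation: `normalize_text` and `candidate_pairs` are hypothetical helper names, and the choice of NFKC as the Unicode form is an assumption. It shows how normalization makes values comparable and how blocking on a key shrinks the set of pairs that need scoring.

```python
import unicodedata
from collections import defaultdict


def normalize_text(value: str) -> str:
    """Apply Unicode NFKC normalization, lowercase, and collapse whitespace."""
    value = unicodedata.normalize("NFKC", value)
    return " ".join(value.lower().split())


def candidate_pairs(left, right, key):
    """Yield (left_id, right_id) only for records that share the blocking key."""
    buckets = defaultdict(list)
    for rec in right:
        buckets[rec[key]].append(rec["id"])
    for rec in left:
        for right_id in buckets.get(rec[key], []):
            yield rec["id"], right_id


# Same ids and states as the Quick Start tables (hypothetical helper data).
left = [{"id": 1, "state": "CA"}, {"id": 2, "state": "NY"}, {"id": 3, "state": "CA"}]
right = [{"id": 101, "state": "CA"}, {"id": 102, "state": "NY"}, {"id": 103, "state": "CA"}]

pairs = list(candidate_pairs(left, right, key="state"))
print(len(left) * len(right), "pairs without blocking;", len(pairs), "with blocking")
# prints: 9 pairs without blocking; 5 with blocking
```

The trade-off is that records whose blocking key disagrees (for example a mistyped state) are never compared at all, so the key should be a field that is reliably recorded.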
Multi-Pass Matching
For complex datasets, use multi-pass matching with progressively relaxed thresholds:
from preclink import MultiPassOrchestrator, StringComparison

orchestrator = MultiPassOrchestrator()
result = orchestrator.run(
    df_left,
    df_right,
    passes=[
        {"min_score": 0.95, "method": "hungarian"},
        {"min_score": 0.85, "method": "hungarian"},
        {"min_score": 0.70, "method": "greedy"},
    ],
    comparisons=[
        StringComparison("first_name"),
        StringComparison("last_name"),
    ],
)
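The idea behind progressively relaxed thresholds can be sketched without the library. The code below is an illustration under stated assumptions, not how `MultiPassOrchestrator` works internally: it assumes that records matched in one pass are removed before the next pass, it uses a simple greedy rule in every pass, and the scores are made-up values. `greedy_match` and `multi_pass` are hypothetical names.

```python
def greedy_match(scores, min_score):
    """Accept pairs from the highest score down; each record is used at most once."""
    matches, used_left, used_right = [], set(), set()
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (left_id, right_id), score in ranked:
        if score < min_score:
            break  # everything after this is below the threshold too
        if left_id not in used_left and right_id not in used_right:
            matches.append((left_id, right_id, score))
            used_left.add(left_id)
            used_right.add(right_id)
    return matches


def multi_pass(scores, thresholds):
    """Run one matching pass per threshold, dropping matched records in between."""
    remaining = dict(scores)
    results = []
    for pass_no, min_score in enumerate(thresholds, start=1):
        found = greedy_match(remaining, min_score)
        results.extend((l_id, r_id, s, pass_no) for l_id, r_id, s in found)
        matched_left = {l_id for l_id, _, _ in found}
        matched_right = {r_id for _, r_id, _ in found}
        remaining = {
            (l_id, r_id): s
            for (l_id, r_id), s in remaining.items()
            if l_id not in matched_left and r_id not in matched_right
        }
    return results


# Illustrative pairwise scores keyed by (left_id, right_id); not real output.
scores = {
    (1, 101): 0.96, (1, 103): 0.40,
    (2, 102): 0.88, (2, 101): 0.86,
    (3, 103): 0.72, (3, 102): 0.71,
}
linked = multi_pass(scores, thresholds=[0.95, 0.85, 0.70])
for left_id, right_id, score, pass_no in linked:
    print(f"pass {pass_no}: {left_id} -> {right_id} ({score:.2f})")
```

Because the strictest pass claims its records first, a weaker candidate such as (3, 102) can no longer compete for record 102 once it has been matched in an earlier pass.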
Decision Rules
- Hungarian: Optimal assignment maximizing total score (recommended for precision)
- Greedy: Accepts the best remaining pair globally, one at a time; fast and precision-oriented
- Row-Sequential: Process left records in order (baseline)
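The three rules can disagree on the same score matrix. The sketch below is a toy illustration, not preclink's code: the optimal assignment is found by brute force over permutations (a stand-in that gives the same result the Hungarian algorithm would on a square matrix, but only scales to tiny inputs), and the scores are invented. Rows are left records, columns are right records.

```python
from itertools import permutations

# Invented similarity scores: scores[i][j] for left record i and right record j.
scores = [
    [0.80, 0.70, 0.10],
    [0.95, 0.20, 0.15],
    [0.30, 0.90, 0.60],
]
n = len(scores)


def total(pairs):
    """Sum of scores over a list of (row, column) pairs."""
    return sum(scores[i][j] for i, j in pairs)


def optimal():
    """Brute-force the 1:1 assignment with the highest total score."""
    best = max(permutations(range(n)), key=lambda cols: total(enumerate(cols)))
    return list(enumerate(best))


def greedy():
    """Repeatedly take the highest-scoring pair whose row and column are free."""
    cells = sorted(
        ((scores[i][j], i, j) for i in range(n) for j in range(n)), reverse=True
    )
    pairs, rows, cols = [], set(), set()
    for _, i, j in cells:
        if i not in rows and j not in cols:
            pairs.append((i, j))
            rows.add(i)
            cols.add(j)
    return sorted(pairs)


def row_sequential():
    """Walk left rows in order; each takes its best still-free column."""
    pairs, cols = [], set()
    for i in range(n):
        j = max((c for c in range(n) if c not in cols), key=lambda c: scores[i][c])
        pairs.append((i, j))
        cols.add(j)
    return pairs


for name, rule in [("optimal", optimal), ("greedy", greedy), ("row-sequential", row_sequential)]:
    pairs = rule()
    print(f"{name}: {pairs} total={total(pairs):.2f}")
# optimal: [(0, 1), (1, 0), (2, 2)] total=2.25
# greedy: [(0, 2), (1, 0), (2, 1)] total=1.95
# row-sequential: [(0, 0), (1, 1), (2, 2)] total=1.60
```

Greedy locks in the 0.95 and 0.90 pairs and is left with a 0.10 pair for the last record, while the optimal assignment accepts slightly lower individual scores to get a better total.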
Features
- Type-safe with full mypy strict mode support
- Extensible via protocols (custom comparisons, blockers, decision rules)
- Native pandas DataFrames throughout
- Crosswalk support for blocking key normalization
- Margin-based filtering for ambiguity removal
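As one possible reading of margin-based filtering, the sketch below keeps a left record's best candidate only when it beats the runner-up by a minimum margin. This interpretation, the function name `margin_filter`, the `min_margin` parameter, and the scores are all assumptions for illustration; preclink's actual behavior and API may differ.

```python
def margin_filter(candidates, min_margin):
    """Keep each left record's best candidate only if it leads the runner-up by min_margin."""
    by_left = {}
    for left_id, right_id, score in candidates:
        by_left.setdefault(left_id, []).append((score, right_id))

    kept = []
    for left_id, scored in by_left.items():
        scored.sort(reverse=True)
        best_score, best_right = scored[0]
        # A record with a single candidate has no competitor, so treat the runner-up as 0.
        runner_up = scored[1][0] if len(scored) > 1 else 0.0
        if best_score - runner_up >= min_margin:
            kept.append((left_id, best_right, best_score))
    return kept


# Invented (left_id, right_id, score) triples.
candidates = [
    (1, 101, 0.93), (1, 103, 0.91),  # two near-equal candidates: ambiguous
    (2, 102, 0.97), (2, 101, 0.55),  # clear winner
    (3, 103, 0.80),                  # only one candidate
]
kept = margin_filter(candidates, min_margin=0.05)
print(kept)
# [(2, 102, 0.97), (3, 103, 0.8)]
```

Record 1 is dropped even though both of its scores are high, because the two candidates are too close to tell apart. That trades recall for precision, which fits the library's stated focus.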
Benchmarks
Comparison against recordlinkage on standard Febrl datasets:
| Dataset | Library | Precision | Recall | F1 |
|---|---|---|---|---|
| febrl1 | preclink | 100.0% | 76.4% | 86.6% |
| febrl1 | recordlinkage | 99.5% | 79.0% | 88.1% |
| febrl2 | preclink | 97.3% | 39.5% | 56.2% |
| febrl2 | recordlinkage | 95.0% | 80.0% | 86.9% |
| febrl3 | preclink | 99.2% | 35.4% | 52.2% |
| febrl3 | recordlinkage | 98.1% | 79.6% | 87.9% |
| febrl4 | preclink | 99.9% | 79.0% | 88.2% |
| febrl4 | recordlinkage | 94.3% | 80.8% | 87.0% |
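febrl4 is the true record linkage scenario (linking two separate tables). On this dataset, preclink reaches 99.9% precision and a higher F1 than recordlinkage (88.2% vs. 87.0%). The other datasets (febrl1 to febrl3) are deduplication tasks where records are split artificially; there preclink keeps precision high but gives up recall, most visibly on febrl2 and febrl3 (39.5% and 35.4% recall).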
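The F1 column is the harmonic mean of precision and recall, so it can be rechecked from the other two columns. A small sketch using the febrl4 rows from the table (the function name `f1` is just for this example):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions in (0, 1])."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall for febrl4, taken from the table above.
rows = {
    "preclink": (0.999, 0.790),
    "recordlinkage": (0.943, 0.808),
}
for library, (precision, recall) in rows.items():
    print(f"{library}: F1 = {f1(precision, recall):.1%}")
# preclink: F1 = 88.2%
# recordlinkage: F1 = 87.0%
```

Because the harmonic mean is pulled toward the smaller value, the low recall on febrl2 and febrl3 drags preclink's F1 down there despite its high precision.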
When to use preclink: when false positives are costly (for example, merging administrative records, survey linking, or research applications) and you need provably optimal 1:1 matching.
Reproduce:
pip install recordlinkage
python examples/benchmark_febrl.py
Documentation
Full documentation at finite-sample.github.io/preclink
License
MIT License