High-precision record linkage library with multi-pass support
preclink
High-precision record linkage library implementing a 7-step pipeline with multi-pass support.
Installation
pip install preclink
Quick Start
import pandas as pd
from preclink import Pipeline, StringComparison, ExactComparison

df_left = pd.DataFrame({
    "id": [1, 2, 3],
    "first_name": ["John", "Jane", "Bob"],
    "last_name": ["Smith", "Doe", "Johnson"],
    "state": ["CA", "NY", "CA"],
})
df_right = pd.DataFrame({
    "id": [101, 102, 103],
    "first_name": ["Jon", "Jane", "Robert"],
    "last_name": ["Smith", "Doe", "Johnson"],
    "state": ["CA", "NY", "CA"],
})

result = (
    Pipeline()
    # Preprocess: normalize text before comparing
    .preprocess(normalize_unicode=True, lowercase=True)
    # Block: only compare records that share a state
    .block(on="state")
    # Score: pairwise string similarity per field
    .score(comparisons=[
        StringComparison("first_name", algorithm="jaro_winkler"),
        StringComparison("last_name", algorithm="jaro_winkler"),
    ])
    # Filter: drop low-confidence pairs
    .filter(min_score=0.7)
    # Decide: optimal assignment maximizing total score
    .decide(method="hungarian")
    .build()
    .link(df_left, df_right)
)
print(result.matches)
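To give a feel for what a `jaro_winkler` score looks like on names such as "Jon" and "John", here is a textbook Jaro-Winkler sketch using only the standard library. It is not preclink's implementation, which is not shown in this README; the function names `jaro` and `jaro_winkler` and the defaults (prefix weight 0.1, prefix length capped at 4) are assumptions taken from the usual definition of the metric.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i in range(len1):
        lo = max(0, i - window)
        hi = min(i + window + 1, len2)
        for j in range(lo, hi):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if not matched1[i]:
            continue
        while not matched2[k]:
            k += 1
        if s1[i] != s2[k]:
            out_of_order += 1
        k += 1
    transpositions = out_of_order // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix (assumed defaults)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1.0 - base)


print(round(jaro_winkler("jon", "john"), 4))
print(round(jaro_winkler("bob", "robert"), 4))
```

Under this definition, near-identical spellings score close to 1.0, while a nickname such as "bob" against "robert" scores much lower, which is why a second field like `last_name` matters for pairs of that kind.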
The 7-Step Pipeline
- Preprocess: Normalize text (unicode, case, whitespace)
- Deduplicate: Remove within-table duplicates
- Block: Reduce comparison space using blocking keys
- Score: Compute pairwise similarity scores
- Filter: Remove low-confidence pairs
- Decide: Apply matching algorithm (Hungarian, Greedy, Row-Sequential)
- Inspect: Generate diagnostics and reports
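The preprocess, block, score, and filter steps listed above can be sketched without preclink at all. The following is a minimal standard-library illustration, not the library's code: the helper names (`preprocess`, `block`, `score`) are hypothetical, `difflib.SequenceMatcher` stands in for a real string comparison, and averaging the per-field similarities is an assumption about how scores might be combined.

```python
import unicodedata
from collections import defaultdict
from difflib import SequenceMatcher


def preprocess(value: str) -> str:
    """Normalize unicode, lowercase, and collapse whitespace."""
    value = unicodedata.normalize("NFKC", value)
    return " ".join(value.lower().split())


def block(left, right, key):
    """Yield only the (left, right) record pairs that share the blocking key."""
    index = defaultdict(list)
    for rec in right:
        index[rec[key]].append(rec)
    for l_rec in left:
        for r_rec in index.get(l_rec[key], []):
            yield l_rec, r_rec


def score(l_rec, r_rec, fields):
    """Average a stand-in string similarity over the given fields."""
    sims = [
        SequenceMatcher(None, preprocess(l_rec[f]), preprocess(r_rec[f])).ratio()
        for f in fields
    ]
    return sum(sims) / len(sims)


left = [
    {"id": 1, "first_name": "John", "last_name": "Smith", "state": "CA"},
    {"id": 2, "first_name": "Jane", "last_name": "Doe", "state": "NY"},
    {"id": 3, "first_name": "Bob", "last_name": "Johnson", "state": "CA"},
]
right = [
    {"id": 101, "first_name": "Jon", "last_name": "Smith", "state": "CA"},
    {"id": 102, "first_name": "Jane", "last_name": "Doe", "state": "NY"},
    {"id": 103, "first_name": "Robert", "last_name": "Johnson", "state": "CA"},
]

# Blocking on state leaves 5 candidate pairs instead of the full 3 x 3 = 9.
candidates = [
    (l["id"], r["id"], score(l, r, ["first_name", "last_name"]))
    for l, r in block(left, right, "state")
]

# Filter: keep only pairs at or above the minimum score.
kept = [c for c in candidates if c[2] >= 0.7]
for left_id, right_id, s in kept:
    print(left_id, right_id, round(s, 3))
```

Blocking is what keeps the comparison space manageable on large tables: the number of scored pairs grows with the size of each block rather than with the product of the two table sizes.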
Multi-Pass Matching
For complex datasets, use multi-pass matching with progressively relaxed thresholds:
from preclink import MultiPassOrchestrator, StringComparison

orchestrator = MultiPassOrchestrator()
result = orchestrator.run(
    df_left,
    df_right,
    passes=[
        {"min_score": 0.95, "method": "hungarian"},
        {"min_score": 0.85, "method": "hungarian"},
        {"min_score": 0.70, "method": "greedy"},
    ],
    comparisons=[
        StringComparison("first_name"),
        StringComparison("last_name"),
    ],
)
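As a rough mental model of progressively relaxed thresholds, the sketch below runs several passes over precomputed scores and removes matched records before the next pass. This is an assumption about how multi-pass matching typically works, not a description of `MultiPassOrchestrator` internals; `greedy_pass` and `multi_pass` are hypothetical helpers, and every pass here uses a simple greedy rule.

```python
def greedy_pass(scores, min_score, used_left, used_right):
    """Accept the best remaining pairs at or above min_score, one per record."""
    accepted = []
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (left_id, right_id), s in ranked:
        if s < min_score:
            break  # everything after this is below the threshold
        if left_id in used_left or right_id in used_right:
            continue  # already matched in this or an earlier pass
        accepted.append((left_id, right_id, s))
        used_left.add(left_id)
        used_right.add(right_id)
    return accepted


def multi_pass(scores, thresholds):
    """Run passes from strictest to loosest, tagging matches with the pass number."""
    used_left, used_right = set(), set()
    results = []
    for pass_no, threshold in enumerate(thresholds, start=1):
        for left_id, right_id, s in greedy_pass(
            scores, threshold, used_left, used_right
        ):
            results.append((pass_no, left_id, right_id, s))
    return results


scores = {
    ("a", "x"): 0.97,
    ("a", "y"): 0.90,
    ("b", "y"): 0.88,
    ("c", "x"): 0.80,
    ("c", "z"): 0.72,
}
for row in multi_pass(scores, thresholds=[0.95, 0.85, 0.70]):
    print(row)
```

With a single greedy rule the passes mainly tag each match with a confidence tier. Passes become materially different when they change the decision method or the comparisons, as in the `passes` list above.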
Decision Rules
- Hungarian: Optimal assignment maximizing total score (recommended for precision)
- Greedy: Takes the globally best-scoring remaining pair first; fast and precision-optimized
- Row-Sequential: Process left records in order (baseline)
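The three rules can disagree on the same score matrix. The sketch below is a standalone illustration, not preclink code: the function names are hypothetical, the matrix is made up, and `optimal` brute-forces every assignment of a small square matrix as a stand-in for the Hungarian algorithm, which reaches the same maximum total far more efficiently.

```python
from itertools import permutations

# Rows are left records, columns are right records (made-up scores).
scores = [
    [0.90, 0.85, 0.10],
    [0.88, 0.20, 0.10],
    [0.10, 0.30, 0.60],
]


def row_sequential(scores):
    """Each row, in order, takes its best still-free column."""
    used, pairs = set(), []
    for i, row in enumerate(scores):
        free = [j for j in range(len(row)) if j not in used]
        if not free:
            break
        j = max(free, key=lambda col: row[col])
        used.add(j)
        pairs.append((i, j))
    return pairs


def greedy(scores):
    """Repeatedly take the highest-scoring cell whose row and column are free."""
    cells = sorted(
        ((s, i, j) for i, row in enumerate(scores) for j, s in enumerate(row)),
        reverse=True,
    )
    used_rows, used_cols, pairs = set(), set(), []
    for s, i, j in cells:
        if i in used_rows or j in used_cols:
            continue
        used_rows.add(i)
        used_cols.add(j)
        pairs.append((i, j))
    return pairs


def optimal(scores):
    """Brute-force the assignment with the highest total (square matrix only)."""
    n = len(scores)
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(scores[i][perm[i]] for i in range(n)),
    )
    return list(enumerate(best))


def total(scores, pairs):
    return sum(scores[i][j] for i, j in pairs)


for rule in (row_sequential, greedy, optimal):
    pairs = rule(scores)
    print(rule.__name__, sorted(pairs), round(total(scores, pairs), 2))
```

Here the greedy rule locks in the single best pair (0, 0) and leaves row 1 with a poor partner, while the optimal assignment gives up that pair to reach a higher total. Row-sequential results additionally depend on the order of the left records.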
Features
- Type-safe with full mypy strict mode support
- Extensible via protocols (custom comparisons, blockers, decision rules)
- Native pandas DataFrames throughout
- Crosswalk support for blocking key normalization
- Margin-based filtering for ambiguity removal
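Margin-based filtering is commonly understood as dropping a left record's best candidate when the runner-up scores almost as well, since the match is then ambiguous. The sketch below illustrates that idea under this assumed definition; `margin_filter` and its `min_margin` parameter are hypothetical names, not preclink's API.

```python
from collections import defaultdict


def margin_filter(scores, min_margin):
    """Keep each left record's best pair only if it beats the runner-up by min_margin."""
    by_left = defaultdict(list)
    for (left_id, right_id), s in scores.items():
        by_left[left_id].append((s, right_id))

    kept = {}
    for left_id, candidates in by_left.items():
        candidates.sort(reverse=True)
        best_score, best_right = candidates[0]
        # A record with a single candidate has no competitor to be confused with.
        runner_up = candidates[1][0] if len(candidates) > 1 else 0.0
        if best_score - runner_up >= min_margin:
            kept[(left_id, best_right)] = best_score
    return kept


scores = {
    ("a", "x"): 0.92,
    ("a", "y"): 0.90,  # nearly ties with ("a", "x"), so "a" is ambiguous
    ("b", "x"): 0.60,
    ("b", "z"): 0.95,  # clear winner for "b"
}
print(margin_filter(scores, min_margin=0.1))
```

A filter like this trades recall for precision: record "a" has two strong candidates, so neither is kept, while "b" keeps its clear best match.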
Documentation
Full documentation at preclink.readthedocs.io
License
MIT License
File details
Details for the file preclink-0.1.0.tar.gz.
File metadata
- Download URL: preclink-0.1.0.tar.gz
- Upload date:
- Size: 25.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4df4b4017ea261500d2e62b190a4680eafa8c08e4f72c85b9e0abf270a42e78e |
| MD5 | a16102a30d2bca14715e6c177fe632f6 |
| BLAKE2b-256 | 8c4128dc0e09adaf605265b7194c844f71f737d140f2ddc48abb242e1530e740 |
File details
Details for the file preclink-0.1.0-py3-none-any.whl.
File metadata
- Download URL: preclink-0.1.0-py3-none-any.whl
- Upload date:
- Size: 30.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 67f3e3034065fbc024e9d4f8b073dc91ca58afd8d86374b2507df6bdc5ccb006 |
| MD5 | 1b9680ce55e6d116a2514173f162e6bb |
| BLAKE2b-256 | 51284dc2b6ac1e303900ed4f74fc94a5dd6f40711bff32f45c48273b941fb2b7 |