Blazematch
Rust-accelerated record linkage and deduplication for Python.
Blazematch pairs a high-level Python API with a Rust core to make entity matching fast without sacrificing usability. Define blocking rules, pick similarity metrics, and fit a model — all in a few lines of pandas-friendly code. The heavy lifting (similarity computation, feature matrix construction, model inference) runs in Rust with Rayon parallelism and GIL release, so Python never becomes the bottleneck.
Highlights
- 11 similarity metrics computed in parallel via Rayon
- SQL blocking powered by DuckDB — write rules like "l.city = r.city"
- ANN blocking with TF-IDF + FAISS for fuzzy candidate generation
- 10 models — 6 supervised + 4 unsupervised (no labels needed)
- Deduplication mode with connected-component clustering
- 8 visualizations with Altair (interactive) and matplotlib backends
- Streaming support for datasets larger than RAM
- RAM estimation before you run the pipeline
Installation
pip install blazematch
Build from source (requires Rust and maturin):
git clone https://github.com/yourusername/blazematch.git
cd blazematch
pip install maturin
maturin develop --release
Optional extras
pip install blazematch[ann] # FAISS-based embedding blocking
pip install blazematch[viz] # Altair interactive charts
pip install blazematch[xgboost] # XGBoost model
pip install blazematch[lightgbm] # LightGBM model
Quick Start
Record Linkage
import pandas as pd
from blazematch import Linker, LinkConfig, BlockingRule, FieldComparison, Metric
df_left = pd.DataFrame({
    "name": ["Alice Smith", "Bob Jones", "Carol White"],
    "city": ["New York", "Boston", "New York"],
    "dob": ["1990-01-15", "1985-06-20", "1992-03-10"],
})
df_right = pd.DataFrame({
    "name": ["A. Smith", "Robert Jones", "Carol Whyte"],
    "city": ["New York", "Boston", "New York"],
    "dob": ["1990-01-15", "1985-06-20", "1992-03-10"],
})

config = LinkConfig(
    blocking_rules=[BlockingRule("l.city = r.city")],
    comparisons=[
        FieldComparison("name", "name", Metric.JARO_WINKLER),
        FieldComparison("dob", "dob", Metric.DATE_DISTANCE, date_format="%Y-%m-%d"),
    ],
)
linker = Linker(df_left, df_right, config)
linker.block().compute_features().estimate()
results = linker.predict(threshold=0.8)
print(results)
# left_idx right_idx score
# 0 0 0 0.923456
# 1 2 2 0.891234
Deduplication
from blazematch import Deduplicator
dedup = Deduplicator(df, config)
dedup.block().compute_features().estimate()
clusters = dedup.cluster(threshold=0.8)
# clusters: DataFrame with [record_idx, cluster_id]
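Under the hood, clustering amounts to connected components over the graph of pairs that scored above the threshold. A minimal sketch of that idea with scipy — illustrative only, not Blazematch's internal code, and the helper name is made up:

# Sketch of connected-component clustering over scored pairs — illustrates
# the technique, not Blazematch's implementation.
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def cluster_pairs(pairs: pd.DataFrame, n_records: int, threshold: float) -> pd.DataFrame:
    """pairs has columns [left_idx, right_idx, score]; returns [record_idx, cluster_id]."""
    keep = pairs[pairs["score"] >= threshold]
    # Build a sparse adjacency matrix from the surviving edges.
    adj = coo_matrix(
        (np.ones(len(keep)), (keep["left_idx"], keep["right_idx"])),
        shape=(n_records, n_records),
    )
    # Each weakly connected component becomes one entity cluster.
    _, labels = connected_components(adj, directed=False)
    return pd.DataFrame({"record_idx": np.arange(n_records), "cluster_id": labels})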
Supervised Training
When you have labelled pairs, swap .estimate() for .fit():
labels = pd.DataFrame({
    "left_idx": [0, 1, 2],
    "right_idx": [0, 1, 2],
    "label": [1, 0, 1],
})
linker.block().compute_features().fit(labels, model="random_forest")
results = linker.predict(threshold=0.5)
Available models: "logistic", "random_forest", "gradient_boosting", "svm", "xgboost", "lightgbm".
Preprocessing
Apply text cleaning before comparison:
FieldComparison(
    "name", "name", Metric.JARO_WINKLER,
    preprocess=["lowercase", "strip_whitespace", "remove_punctuation"],
)
Available steps: lowercase, strip_whitespace, remove_punctuation, normalize_unicode.
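For intuition, the steps map onto ordinary pandas and stdlib string operations. A rough sketch of plausible equivalents — Blazematch runs its own pipeline (see apply_preprocess), so the exact semantics may differ:

# Plausible pandas equivalents of the preprocessing steps (illustrative only).
import unicodedata
import pandas as pd

s = pd.Series(["  Alice SMITH!  ", "Bób  Jones"])
s = s.str.lower()                                      # lowercase
s = s.str.strip()                                      # strip_whitespace
s = s.str.replace(r"[^\w\s]", "", regex=True)          # remove_punctuation
s = s.map(lambda x: unicodedata.normalize("NFKD", x))  # normalize_unicode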
Embedding Blocking (ANN)
For fuzzy blocking on free-text fields without exact-match rules:
from blazematch import EmbeddingBlockingConfig
config = LinkConfig(
    blocking_rules=[BlockingRule("l.city = r.city")],
    comparisons=[...],
    embedding_blocking=EmbeddingBlockingConfig(
        fields=["name"],
        top_k=10,
        min_sim=0.3,
    ),
)
Requires pip install blazematch[ann].
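Conceptually, embedding blocking vectorizes the chosen fields with TF-IDF and retrieves each record's nearest neighbours with FAISS. A standalone sketch of that technique — not Blazematch's internals:

# Minimal sketch of TF-IDF + FAISS candidate generation.
import faiss  # assumes the [ann] extra is installed
from sklearn.feature_extraction.text import TfidfVectorizer

left_names = ["Alice Smith", "Bob Jones", "Carol White"]
right_names = ["A. Smith", "Robert Jones", "Carol Whyte"]

vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
X_left = vec.fit_transform(left_names).toarray().astype("float32")
X_right = vec.transform(right_names).toarray().astype("float32")

# Normalize so inner product equals cosine similarity.
faiss.normalize_L2(X_left)
faiss.normalize_L2(X_right)

index = faiss.IndexFlatIP(X_left.shape[1])
index.add(X_left)
sims, idx = index.search(X_right, k=2)  # top-2 left candidates per right record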
Save and Load Models
linker.model.save("my_model.pkl")
# Later...
from blazematch import MatchModel
model = MatchModel.load("my_model.pkl")
RAM Estimation
Estimate memory requirements before running the pipeline:
linker.block()
estimate = linker.estimate_ram(model="random_forest")
print(estimate.summary())
# Peak memory: 1.2 GB | System RAM: 16.0 GB | OK to proceed
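The dominant term is usually the float64 feature matrix: candidate pairs × comparisons × 8 bytes. Illustrative back-of-envelope arithmetic only — the estimator's actual model accounts for more than this:

# Rough back-of-envelope behind a RAM estimate.
n_pairs = 250_000   # candidate pairs after blocking
n_features = 3      # one column per FieldComparison
feature_bytes = n_pairs * n_features * 8
print(f"feature matrix ~{feature_bytes / 1e6:.1f} MB")  # ~6.0 MB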
Visualization
from blazematch import plot_score_distribution, plot_waterfall, plot_roc
plot_score_distribution(results, threshold=0.8)
plot_waterfall(linker, pair_idx=0)
plot_roc(results, labels)
All plot functions accept backend="altair" (default) or backend="matplotlib".
| Function | Purpose |
|---|---|
| plot_score_distribution | Histogram of match scores with optional threshold line |
| plot_waterfall | Per-feature contribution waterfall for a single pair |
| plot_comparison_heatmap | Mean feature values for matches vs. non-matches |
| plot_precision_recall | Precision-recall curve with average precision |
| plot_roc | ROC curve with AUC |
| plot_threshold_analysis | Match count across threshold values |
| plot_match_weights | Per-feature model weights or importances |
| plot_comparison_viewer | Sampled pair comparisons with field-level detail |
Similarity Metrics
| Metric | Type | Description |
|---|---|---|
| JARO_WINKLER | String | Edit-distance variant favoring common prefixes |
| LEVENSHTEIN | String | Normalized edit distance (0-1) |
| DAMERAU_LEVENSHTEIN | String | Levenshtein + transpositions |
| EXACT | String | Binary 0/1 equality |
| SOUNDEX | String | Phonetic encoding match |
| JACCARD | String | Token-set intersection over union |
| COSINE | String | Token-level cosine similarity |
| TOKEN_SORT_RATIO | String | Order-invariant fuzzy match (sorted tokens + Levenshtein) |
| NUMERIC | Numeric | Absolute distance |
| NUMERIC_SIMILARITY | Numeric | Scaled proximity (0-1) |
| DATE_DISTANCE | Numeric | Absolute days between dates |
All string metrics are computed in parallel via Rayon. Numeric metrics operate on f64 slices passed zero-copy from numpy.
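To pin down two of the definitions, here are pure-Python reference implementations — the library computes these in Rust; this is shown only for semantics:

# Reference semantics for JACCARD and TOKEN_SORT_RATIO.
def jaccard(a: str, b: str) -> float:
    """Token-set intersection over union."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def token_sort_ratio(a: str, b: str) -> float:
    """Order-invariant match: sort tokens, then normalized Levenshtein."""
    sa, sb = " ".join(sorted(a.split())), " ".join(sorted(b.split()))
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(sb) + 1))
    for i, ca in enumerate(sa, 1):
        cur = [i]
        for j, cb in enumerate(sb, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    longer = max(len(sa), len(sb)) or 1
    return 1.0 - prev[-1] / longer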
Models
Supervised (require labelled pairs)
| Model | Key | Training | Inference |
|---|---|---|---|
| Logistic Regression | "logistic" | scikit-learn | Rust (ndarray GEMV + sigmoid) |
| Random Forest | "random_forest" | scikit-learn | Rust (parallel tree traversal) |
| Gradient Boosting | "gradient_boosting" | scikit-learn | Rust (parallel tree traversal + sigmoid) |
| SVM (RBF) | "svm" | scikit-learn | Rust (parallel RBF kernel + Platt scaling) |
| XGBoost | "xgboost" | xgboost | Native C++ |
| LightGBM | "lightgbm" | lightgbm | Native C++ |
Unsupervised (no labels needed)
| Model | Key | Algorithm |
|---|---|---|
| Fellegi-Sunter | "fellegi_sunter" | EM-based probabilistic matching with m/u weights |
| K-Means | "kmeans" | k=2 clustering with optional silhouette auto-tuning |
| GMM | "gmm" | Gaussian mixture with BIC-based component selection |
| DBSCAN | "dbscan" | Density-based clustering with auto eps estimation |
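For intuition on Fellegi-Sunter: each field comparison contributes log2(m/u) when it agrees and log2((1-m)/(1-u)) when it disagrees, and the per-field weights sum to a pair's total match weight. A toy example with hand-picked m and u — the library estimates them via EM rather than taking them as given:

# The Fellegi-Sunter weighting idea in miniature.
import math

m = 0.95  # P(field agrees | records truly match)
u = 0.10  # P(field agrees | records do not match)

agreement_weight = math.log2(m / u)                 # evidence for a match
disagreement_weight = math.log2((1 - m) / (1 - u))  # evidence against
print(agreement_weight, disagreement_weight)        # ~3.25, ~-4.17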
Benchmarks
Pipeline timing at various scales (macOS Apple Silicon, single postcode blocking rule, 3 comparisons):
| Records per side | Candidate pairs | Blocking | Features (Rust) | Inference | Full pipeline |
|---|---|---|---|---|---|
| 1,000 | ~5K | 15 ms | 3 ms | <1 ms | 40 ms |
| 5,000 | ~25K | 30 ms | 15 ms | <1 ms | 120 ms |
| 10,000 | ~50K | 50 ms | 35 ms | <1 ms | 250 ms |
| 50,000 | ~250K | 149 ms | 162 ms | 3 ms | 820 ms |
python benchmarks/bench_pipeline.py
python benchmarks/bench_pipeline.py --scales 10000 100000
Architecture
Python API                                 Rust Core (PyO3 + Rayon)
-------------------------------------------+----------------------------------------------
Linker / Deduplicator                       similarity.rs       11 parallel metrics
 |-- RuleBlocker (DuckDB SQL)               features.rs         single-call feature matrix
 |-- EmbeddingBlocker (FAISS)               inference.rs        logistic regression (GEMV)
 |-- FeatureComputer --------------------->  tree_inference.rs  RF / GBDT (flat numpy arrays)
 |-- MatchModel / RF / GBDT -------------->  svm_inference.rs   SVM RBF + Platt scaling
 |-- SVMModel ---------------------------->
 |-- XGBoostModel (native C++)              All batch functions release the GIL
 |-- LightGBMModel (native C++)             and use Rayon for CPU parallelism.
 |-- FellegiSunterModel                     Numeric data passed as zero-copy numpy.
 |-- KMeans / GMM / DBSCAN                  Tree models use flat numpy arrays with
 +-- Visualize (Altair / mpl)               offsets to avoid per-call serialization.
Design principles:
- Single Rust call for all feature computation — one Python-to-Rust crossing per batch, not per metric
- Zero-copy numeric data — numpy arrays passed directly to Rust via PyO3, no .tolist() conversion
- Flat tree serialization — sklearn tree structures are flattened into contiguous numpy arrays with offset indices, eliminating repeated list-to-Vec conversion on every predict call (see the sketch after this list)
- GIL-free parallelism — every Rust batch function calls py.allow_threads() before Rayon work
- Streaming support — block_iter(), compute_chunked(), and predict_iter() for datasets that don't fit in RAM
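As referenced above, the flat tree layout is cheap to build on the Python side because sklearn already stores each fitted tree as contiguous numpy arrays. A sketch of the flattening step — the exact arrays shipped to Rust are assumptions:

# Flattening an sklearn forest into contiguous arrays plus per-tree offsets.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=10).fit(
    np.random.rand(100, 3), np.random.randint(0, 2, 100)
)

trees = [est.tree_ for est in rf.estimators_]
children_left = np.concatenate([t.children_left for t in trees])
children_right = np.concatenate([t.children_right for t in trees])
feature = np.concatenate([t.feature for t in trees])
threshold = np.concatenate([t.threshold for t in trees])
# offsets[i] marks where tree i's nodes start in the concatenated arrays,
# so the Rust side can traverse each tree without per-call serialization.
offsets = np.cumsum([0] + [t.node_count for t in trees])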
API Reference
Core
| Class | Description |
|---|---|
| Linker(df_left, df_right, config) | Main record linkage pipeline |
| Deduplicator(df, config) | Self-join deduplication with clustering |
| LinkConfig(...) | Pipeline configuration (blocking rules, comparisons, options) |
| FieldComparison(left, right, metric, ...) | A single field comparison definition |
| BlockingRule(sql) | SQL join condition for candidate generation |
| Metric | Enum of available similarity metrics |
Blocking
| Class | Description |
|---|---|
| RuleBlocker | DuckDB-powered SQL blocking |
| EmbeddingBlocker | TF-IDF + FAISS approximate nearest neighbor blocking |
| EmbeddingBlockingConfig(fields, top_k, min_sim, ...) | ANN blocking configuration |
Models
| Class | Description |
|---|---|
| MatchModel | Logistic regression (Rust inference) |
| RandomForestModel | Random forest (Rust inference) |
| GradientBoostingModel | Gradient boosting (Rust inference) |
| SVMModel | SVM with RBF kernel (Rust inference) |
| XGBoostModel | XGBoost wrapper (native C++ inference) |
| LightGBMModel | LightGBM wrapper (native C++ inference) |
| FellegiSunterModel | Unsupervised EM probabilistic matching |
| KMeansModel | K-Means clustering with optional auto-tuning |
| GMMModel | Gaussian mixture with BIC auto-tuning |
| DBSCANModel | DBSCAN with auto eps estimation |
Utilities
| Function / Class | Description |
|---|---|
| estimate_ram(...) | Estimate memory requirements for a pipeline configuration |
| RAMEstimate | Result object with summary(), can_proceed, peak_bytes |
| apply_preprocess(series, steps) | Apply preprocessing chain to a pandas Series |
Development
# Setup
git clone https://github.com/yourusername/blazematch.git
cd blazematch
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
maturin develop --release
# Tests
pytest tests/ -v # 153 Python tests
cargo test # 22 Rust unit tests
# Benchmarks
cargo bench # Rust micro-benchmarks (Criterion)
python benchmarks/bench_pipeline.py
# Lint
ruff check python/
Project Structure
blazematch/
  src/                    # Rust source (PyO3 extension module)
    lib.rs                # Module registration
    similarity.rs         # 11 parallel similarity metrics
    features.rs           # Single-call feature matrix computation
    inference.rs          # Logistic regression batch prediction
    tree_inference.rs     # Random forest / GBDT batch prediction
    svm_inference.rs      # SVM RBF kernel batch prediction
    utils.rs              # Shared utilities (sigmoid)
  python/blazematch/      # Python source
    linker.py             # Main pipeline orchestrator
    dedup.py              # Deduplication wrapper
    blocking.py           # Rule-based and ANN blocking
    features.py           # Feature computation bridge to Rust
    model.py              # Supervised model classes
    fellegi_sunter.py     # Unsupervised EM model
    clustering.py         # K-Means, GMM, DBSCAN models
    config.py             # Configuration dataclasses
    preprocess.py         # Text preprocessing pipeline
    visualize.py          # 8 chart functions
    estimate_ram.py       # Memory estimation
  tests/                  # pytest suite
  benches/                # Criterion benchmarks
Requirements
- Python >= 3.9
- Rust toolchain (for building from source)
- Core: pandas, numpy, pyarrow, duckdb, scikit-learn, scipy, matplotlib
License
See LICENSE for details.