
Blazematch

Rust-accelerated record linkage and deduplication for Python.


Blazematch pairs a high-level Python API with a Rust core to make entity matching fast without sacrificing usability. Define blocking rules, pick similarity metrics, and fit a model — all in a few lines of pandas-friendly code. The heavy lifting (similarity computation, feature matrix construction, model inference) runs in Rust with Rayon parallelism and GIL release, so Python never becomes the bottleneck.


Highlights

  • 11 similarity metrics computed in parallel via Rayon
  • SQL blocking powered by DuckDB — write rules like "l.city = r.city"
  • ANN blocking with TF-IDF + FAISS for fuzzy candidate generation
  • 10 models — 6 supervised + 4 unsupervised (no labels needed)
  • Deduplication mode with connected-component clustering
  • 8 visualizations with Altair (interactive) and matplotlib backends
  • Streaming support for datasets larger than RAM
  • RAM estimation before you run the pipeline

Installation

pip install blazematch

Build from source (requires Rust and maturin):

git clone https://github.com/yourusername/blazematch.git
cd blazematch
pip install maturin
maturin develop --release

Optional extras

pip install blazematch[ann]       # FAISS-based embedding blocking
pip install blazematch[viz]       # Altair interactive charts
pip install blazematch[xgboost]   # XGBoost model
pip install blazematch[lightgbm]  # LightGBM model

Quick Start

Record Linkage

import pandas as pd
from blazematch import Linker, LinkConfig, BlockingRule, FieldComparison, Metric

df_left = pd.DataFrame({
    "name": ["Alice Smith", "Bob Jones", "Carol White"],
    "city": ["New York", "Boston", "New York"],
    "dob":  ["1990-01-15", "1985-06-20", "1992-03-10"],
})

df_right = pd.DataFrame({
    "name": ["A. Smith", "Robert Jones", "Carol Whyte"],
    "city": ["New York", "Boston", "New York"],
    "dob":  ["1990-01-15", "1985-06-20", "1992-03-10"],
})

config = LinkConfig(
    blocking_rules=[BlockingRule("l.city = r.city")],
    comparisons=[
        FieldComparison("name", "name", Metric.JARO_WINKLER),
        FieldComparison("dob",  "dob",  Metric.DATE_DISTANCE, date_format="%Y-%m-%d"),
    ],
)

linker = Linker(df_left, df_right, config)
linker.block().compute_features().estimate()
results = linker.predict(threshold=0.8)
print(results)
#    left_idx  right_idx     score
# 0         0          0  0.923456
# 1         2          2  0.891234
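
To inspect matches next to the source rows, join the index pairs back with plain pandas (ordinary pandas here, not a Blazematch API):

matched = (
    results
    .merge(df_left.add_suffix("_l"),  left_on="left_idx",  right_index=True)
    .merge(df_right.add_suffix("_r"), left_on="right_idx", right_index=True)
)
# matched holds name_l/name_r, city_l/city_r, dob_l/dob_r plus the score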

Deduplication

from blazematch import Deduplicator

dedup = Deduplicator(df, config)
dedup.block().compute_features().estimate()
clusters = dedup.cluster(threshold=0.8)
# clusters: DataFrame with [record_idx, cluster_id]
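
Choosing one canonical record per cluster is then a plain pandas groupby (assuming record_idx indexes into df positionally, per the comment above):

canonical = (
    df.merge(clusters, left_index=True, right_on="record_idx")
      .groupby("cluster_id")
      .first()  # keep the first record in each cluster as the survivor
)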

Supervised Training

When you have labelled pairs, swap .estimate() for .fit():

labels = pd.DataFrame({
    "left_idx":  [0, 1, 2],
    "right_idx": [0, 1, 2],
    "label":     [1, 0, 1],
})

linker.block().compute_features().fit(labels, model="random_forest")
results = linker.predict(threshold=0.5)

Available models: "logistic", "random_forest", "gradient_boosting", "svm", "xgboost", "lightgbm".
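
The boosted-tree keys map to the optional extras from Installation; for example, assuming the xgboost extra is installed:

# Requires: pip install blazematch[xgboost]
linker.block().compute_features().fit(labels, model="xgboost")
results = linker.predict(threshold=0.5)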

Preprocessing

Apply text cleaning before comparison:

FieldComparison(
    "name", "name", Metric.JARO_WINKLER,
    preprocess=["lowercase", "strip_whitespace", "remove_punctuation"],
)

Available steps: lowercase, strip_whitespace, remove_punctuation, normalize_unicode.
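
The same chain is available standalone via the apply_preprocess utility (see Utilities in the API reference); a quick sketch:

import pandas as pd
from blazematch import apply_preprocess

names = pd.Series(["  Alice SMITH! ", "Bob Jones"])
clean = apply_preprocess(names, ["lowercase", "strip_whitespace", "remove_punctuation"])
# cleaned Series, e.g. lowercased with punctuation removed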

Embedding Blocking (ANN)

For fuzzy candidate generation on free-text fields, where exact-match rules would miss spelling variants, add ANN blocking alongside (or in place of) SQL rules:

from blazematch import EmbeddingBlockingConfig

config = LinkConfig(
    blocking_rules=[BlockingRule("l.city = r.city")],
    comparisons=[...],
    embedding_blocking=EmbeddingBlockingConfig(
        fields=["name"],
        top_k=10,
        min_sim=0.3,
    ),
)

Requires pip install blazematch[ann].

Save and Load Models

linker.model.save("my_model.pkl")

# Later...
from blazematch import MatchModel
model = MatchModel.load("my_model.pkl")

RAM Estimation

Estimate memory requirements before running the pipeline:

linker.block()
estimate = linker.estimate_ram(model="random_forest")
print(estimate.summary())
# Peak memory: 1.2 GB | System RAM: 16.0 GB | OK to proceed
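
RAMEstimate also exposes can_proceed and peak_bytes (see Utilities below), so the expensive steps can be gated programmatically:

if estimate.can_proceed:
    linker.compute_features().estimate()
else:
    # Fall back to the streaming APIs described under Design principles
    print(f"Estimated peak {estimate.peak_bytes / 1e9:.1f} GB is too large for this machine")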

Visualization

from blazematch import plot_score_distribution, plot_waterfall, plot_roc

plot_score_distribution(results, threshold=0.8)
plot_waterfall(linker, pair_idx=0)
plot_roc(results, labels)

All plot functions accept backend="altair" (default) or backend="matplotlib".
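
For example, to render a static chart instead of an interactive one:

plot_roc(results, labels, backend="matplotlib")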

Function                  Purpose
plot_score_distribution   Histogram of match scores with optional threshold line
plot_waterfall            Per-feature contribution waterfall for a single pair
plot_comparison_heatmap   Mean feature values for matches vs. non-matches
plot_precision_recall     Precision-recall curve with average precision
plot_roc                  ROC curve with AUC
plot_threshold_analysis   Match count across threshold values
plot_match_weights        Per-feature model weights or importances
plot_comparison_viewer    Sampled pair comparisons with field-level detail

Similarity Metrics

Metric                Type     Description
JARO_WINKLER          String   Edit-distance variant favoring common prefixes
LEVENSHTEIN           String   Normalized edit distance (0-1)
DAMERAU_LEVENSHTEIN   String   Levenshtein + transpositions
EXACT                 String   Binary 0/1 equality
SOUNDEX               String   Phonetic encoding match
JACCARD               String   Token-set intersection over union
COSINE                String   Token-level cosine similarity
TOKEN_SORT_RATIO      String   Order-invariant fuzzy match (sorted tokens + Levenshtein)
NUMERIC               Numeric  Absolute distance
NUMERIC_SIMILARITY    Numeric  Scaled proximity (0-1)
DATE_DISTANCE         Numeric  Absolute days between dates

All string metrics are computed in parallel via Rayon. Numeric metrics operate on f64 slices passed zero-copy from numpy.
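
A configuration mixing string and numeric metrics might look like this (a sketch; "postcode" and "age" are illustrative column names, not from the Quick Start data):

comparisons = [
    FieldComparison("name",     "name",     Metric.TOKEN_SORT_RATIO),
    FieldComparison("postcode", "postcode", Metric.EXACT),
    FieldComparison("age",      "age",      Metric.NUMERIC_SIMILARITY),
]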


Models

Supervised (require labelled pairs)

Model                 Key                   Training       Inference
Logistic Regression   "logistic"            scikit-learn   Rust (ndarray GEMV + sigmoid)
Random Forest         "random_forest"       scikit-learn   Rust (parallel tree traversal)
Gradient Boosting     "gradient_boosting"   scikit-learn   Rust (parallel tree traversal + sigmoid)
SVM (RBF)             "svm"                 scikit-learn   Rust (parallel RBF kernel + Platt scaling)
XGBoost               "xgboost"             xgboost        Native C++
LightGBM              "lightgbm"            lightgbm       Native C++

Unsupervised (no labels needed)

Model            Key                Algorithm
Fellegi-Sunter   "fellegi_sunter"   EM-based probabilistic matching with m/u weights
K-Means          "kmeans"           k=2 clustering with optional silhouette auto-tuning
GMM              "gmm"              Gaussian mixture with BIC-based component selection
DBSCAN           "dbscan"           Density-based clustering with auto eps estimation
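
The examples above call .estimate() with no arguments and the default estimator isn't stated here; if .estimate() accepts a model key by analogy with .fit(), selecting one would look like this (the keyword is an assumption, not confirmed above):

# Assumed signature, mirroring fit(labels, model=...); verify against the docs
linker.block().compute_features().estimate(model="fellegi_sunter")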

Benchmarks

Pipeline timing at various scales (macOS Apple Silicon, single postcode blocking rule, 3 comparisons):

Records per side   Candidate pairs   Blocking   Features (Rust)   Inference   Full pipeline
1,000              ~5K               15 ms      3 ms              <1 ms       40 ms
5,000              ~25K              30 ms      15 ms             <1 ms       120 ms
10,000             ~50K              50 ms      35 ms             <1 ms       250 ms
50,000             ~250K             149 ms     162 ms            3 ms        820 ms
Reproduce with:

python benchmarks/bench_pipeline.py
python benchmarks/bench_pipeline.py --scales 10000 100000

Architecture

Python API                          Rust Core (PyO3 + Rayon)
-----------------------------------+--------------------------------------
Linker / Deduplicator               similarity.rs    11 parallel metrics
  |-- RuleBlocker (DuckDB SQL)      features.rs      single-call feature matrix
  |-- EmbeddingBlocker (FAISS)      inference.rs     logistic regression (GEMV)
  |-- FeatureComputer ------------> tree_inference.rs RF / GBDT (flat numpy arrays)
  |-- MatchModel / RF / GBDT -----> svm_inference.rs  SVM RBF + Platt scaling
  |-- SVMModel ----------------->
  |-- XGBoostModel (native C++)     All batch functions release the GIL
  |-- LightGBMModel (native C++)    and use Rayon for CPU parallelism.
  |-- FellegiSunterModel            Numeric data passed as zero-copy numpy.
  |-- KMeans / GMM / DBSCAN         Tree models use flat numpy arrays with
  +-- Visualize (Altair / mpl)      offsets to avoid per-call serialization.

Design principles:

  • Single Rust call for all feature computation — one Python-to-Rust crossing per batch, not per-metric
  • Zero-copy numeric data — numpy arrays passed directly to Rust via PyO3, no .tolist() conversion
  • Flat tree serialization — sklearn tree structures are flattened into contiguous numpy arrays with offset indices, eliminating repeated list-to-Vec conversion on every predict call
  • GIL-free parallelism — every Rust batch function calls py.allow_threads() before Rayon work
  • Streaming support — block_iter(), compute_chunked(), and predict_iter() for datasets that don't fit in RAM (see the sketch below)
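
A rough shape of the streaming path, using the iterator names from the last bullet (chunk types and exact arguments are assumptions, not a verbatim API):

# predict_iter() is named above; the threshold keyword and per-chunk
# DataFrame yield type are assumptions made for illustration.
for i, chunk in enumerate(linker.predict_iter(threshold=0.8)):
    chunk.to_parquet(f"matches_{i:04d}.parquet")  # persist each chunk as it arrives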

API Reference

Core

Class                                       Description
Linker(df_left, df_right, config)           Main record linkage pipeline
Deduplicator(df, config)                    Self-join deduplication with clustering
LinkConfig(...)                             Pipeline configuration (blocking rules, comparisons, options)
FieldComparison(left, right, metric, ...)   A single field comparison definition
BlockingRule(sql)                           SQL join condition for candidate generation
Metric                                      Enum of available similarity metrics

Blocking

Class                                                  Description
RuleBlocker                                            DuckDB-powered SQL blocking
EmbeddingBlocker                                       TF-IDF + FAISS approximate nearest neighbor blocking
EmbeddingBlockingConfig(fields, top_k, min_sim, ...)   ANN blocking configuration

Models

Class                   Description
MatchModel              Logistic regression (Rust inference)
RandomForestModel       Random forest (Rust inference)
GradientBoostingModel   Gradient boosting (Rust inference)
SVMModel                SVM with RBF kernel (Rust inference)
XGBoostModel            XGBoost wrapper (native C++ inference)
LightGBMModel           LightGBM wrapper (native C++ inference)
FellegiSunterModel      Unsupervised EM probabilistic matching
KMeansModel             K-Means clustering with optional auto-tuning
GMMModel                Gaussian mixture with BIC auto-tuning
DBSCANModel             DBSCAN with auto eps estimation

Utilities

Function / Class                  Description
estimate_ram(...)                 Estimate memory requirements for a pipeline configuration
RAMEstimate                       Result object with summary(), can_proceed, peak_bytes
apply_preprocess(series, steps)   Apply preprocessing chain to a pandas Series

Development

# Setup
git clone https://github.com/yourusername/blazematch.git
cd blazematch
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
maturin develop --release

# Tests
pytest tests/ -v             # 153 Python tests
cargo test                   # 22 Rust unit tests

# Benchmarks
cargo bench                  # Rust micro-benchmarks (Criterion)
python benchmarks/bench_pipeline.py

# Lint
ruff check python/

Project Structure

blazematch/
  src/                       # Rust source (PyO3 extension module)
    lib.rs                   # Module registration
    similarity.rs            # 11 parallel similarity metrics
    features.rs              # Single-call feature matrix computation
    inference.rs             # Logistic regression batch prediction
    tree_inference.rs        # Random forest / GBDT batch prediction
    svm_inference.rs         # SVM RBF kernel batch prediction
    utils.rs                 # Shared utilities (sigmoid)
  python/blazematch/         # Python source
    linker.py                # Main pipeline orchestrator
    dedup.py                 # Deduplication wrapper
    blocking.py              # Rule-based and ANN blocking
    features.py              # Feature computation bridge to Rust
    model.py                 # Supervised model classes
    fellegi_sunter.py        # Unsupervised EM model
    clustering.py            # K-Means, GMM, DBSCAN models
    config.py                # Configuration dataclasses
    preprocess.py            # Text preprocessing pipeline
    visualize.py             # 8 chart functions
    estimate_ram.py          # Memory estimation
  tests/                     # pytest suite
  benches/                   # Criterion benchmarks (Rust)
  benchmarks/                # Python pipeline benchmarks (bench_pipeline.py)

Requirements

  • Python >= 3.9
  • Rust toolchain (for building from source)
  • Core: pandas, numpy, pyarrow, duckdb, scikit-learn, scipy, matplotlib

License

See LICENSE for details.
