
Blazematch

Rust-accelerated record linkage and deduplication for Python.


Blazematch pairs a high-level Python API with a Rust core to make entity matching fast without sacrificing usability. Define blocking rules, pick similarity metrics, and fit a model — all in a few lines of pandas-friendly code. The heavy lifting (similarity computation, feature matrix construction, model inference) runs in Rust with Rayon parallelism and GIL release, so Python never becomes the bottleneck.


Highlights

  • 11 similarity metrics computed in parallel via Rayon
  • SQL blocking powered by DuckDB — write rules like "l.city = r.city"
  • ANN blocking with TF-IDF + FAISS for fuzzy candidate generation
  • 10 models — 6 supervised + 4 unsupervised (no labels needed)
  • Deduplication mode with connected-component clustering
  • 8 visualizations with Altair (interactive) and matplotlib backends
  • Streaming support for datasets larger than RAM
  • RAM estimation before you run the pipeline

Installation

pip install blazematch

Build from source (requires Rust and maturin):

git clone https://github.com/yourusername/blazematch.git
cd blazematch
pip install maturin
maturin develop --release

Optional extras

pip install blazematch[ann]       # FAISS-based embedding blocking
pip install blazematch[viz]       # Altair interactive charts
pip install blazematch[xgboost]   # XGBoost model
pip install blazematch[lightgbm]  # LightGBM model

Quick Start

Record Linkage

import pandas as pd
from blazematch import Linker, LinkConfig, BlockingRule, FieldComparison, Metric

df_left = pd.DataFrame({
    "name": ["Alice Smith", "Bob Jones", "Carol White"],
    "city": ["New York", "Boston", "New York"],
    "dob":  ["1990-01-15", "1985-06-20", "1992-03-10"],
})

df_right = pd.DataFrame({
    "name": ["A. Smith", "Robert Jones", "Carol Whyte"],
    "city": ["New York", "Boston", "New York"],
    "dob":  ["1990-01-15", "1985-06-20", "1992-03-10"],
})

config = LinkConfig(
    blocking_rules=[BlockingRule("l.city = r.city")],
    comparisons=[
        FieldComparison("name", "name", Metric.JARO_WINKLER),
        FieldComparison("dob",  "dob",  Metric.DATE_DISTANCE, date_format="%Y-%m-%d"),
    ],
)

linker = Linker(df_left, df_right, config)
linker.block().compute_features().estimate()
results = linker.predict(threshold=0.8)
print(results)
#    left_idx  right_idx     score
# 0         0          0  0.923456
# 1         2          2  0.891234
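
To inspect matches next to the source rows, join the index pairs back with plain pandas (ordinary pandas here, not a Blazematch API):

matched = (
    results
    .merge(df_left.add_suffix("_l"),  left_on="left_idx",  right_index=True)
    .merge(df_right.add_suffix("_r"), left_on="right_idx", right_index=True)
)
# matched holds name_l/name_r, city_l/city_r, dob_l/dob_r plus the score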

Deduplication

from blazematch import Deduplicator

dedup = Deduplicator(df, config)
dedup.block().compute_features().estimate()
clusters = dedup.cluster(threshold=0.8)
# clusters: DataFrame with [record_idx, cluster_id]
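
Choosing one canonical record per cluster is then a plain pandas groupby (assuming record_idx indexes into df positionally, per the comment above):

canonical = (
    df.merge(clusters, left_index=True, right_on="record_idx")
      .groupby("cluster_id")
      .first()  # keep the first record in each cluster as the survivor
)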

Supervised Training

When you have labelled pairs, swap .estimate() for .fit():

labels = pd.DataFrame({
    "left_idx":  [0, 1, 2],
    "right_idx": [0, 1, 2],
    "label":     [1, 0, 1],
})

linker.block().compute_features().fit(labels, model="random_forest")
results = linker.predict(threshold=0.5)

Available models: "logistic", "random_forest", "gradient_boosting", "svm", "xgboost", "lightgbm".
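
The boosted-tree keys map to the optional extras from Installation; for example, assuming the xgboost extra is installed:

# Requires: pip install blazematch[xgboost]
linker.block().compute_features().fit(labels, model="xgboost")
results = linker.predict(threshold=0.5)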

Preprocessing

Apply text cleaning before comparison:

FieldComparison(
    "name", "name", Metric.JARO_WINKLER,
    preprocess=["lowercase", "strip_whitespace", "remove_punctuation"],
)

Available steps: lowercase, strip_whitespace, remove_punctuation, normalize_unicode.
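
The same chain is available standalone via the apply_preprocess utility (see Utilities in the API reference); a quick sketch:

import pandas as pd
from blazematch import apply_preprocess

names = pd.Series(["  Alice SMITH! ", "Bob Jones"])
clean = apply_preprocess(names, ["lowercase", "strip_whitespace", "remove_punctuation"])
# cleaned Series, e.g. lowercased with punctuation removed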

Embedding Blocking (ANN)

For fuzzy candidate generation on free-text fields, where exact-match rules would miss spelling variants, add ANN blocking alongside (or in place of) SQL rules:

from blazematch import EmbeddingBlockingConfig

config = LinkConfig(
    blocking_rules=[BlockingRule("l.city = r.city")],
    comparisons=[...],
    embedding_blocking=EmbeddingBlockingConfig(
        fields=["name"],
        top_k=10,
        min_sim=0.3,
    ),
)

Requires pip install blazematch[ann].

Save and Load Models

linker.model.save("my_model.pkl")

# Later...
from blazematch import MatchModel
model = MatchModel.load("my_model.pkl")

RAM Estimation

Estimate memory requirements before running the pipeline:

linker.block()
estimate = linker.estimate_ram(model="random_forest")
print(estimate.summary())
# Peak memory: 1.2 GB | System RAM: 16.0 GB | OK to proceed
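
RAMEstimate also exposes can_proceed and peak_bytes (see Utilities below), so the expensive steps can be gated programmatically:

if estimate.can_proceed:
    linker.compute_features().estimate()
else:
    # Fall back to the streaming APIs described under Design principles
    print(f"Estimated peak {estimate.peak_bytes / 1e9:.1f} GB is too large for this machine")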

Visualization

from blazematch import plot_score_distribution, plot_waterfall, plot_roc

plot_score_distribution(results, threshold=0.8)
plot_waterfall(linker, pair_idx=0)
plot_roc(results, labels)

All plot functions accept backend="altair" (default) or backend="matplotlib".
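
For example, to render a static chart instead of an interactive one:

plot_roc(results, labels, backend="matplotlib")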

Function                  Purpose
plot_score_distribution   Histogram of match scores with optional threshold line
plot_waterfall            Per-feature contribution waterfall for a single pair
plot_comparison_heatmap   Mean feature values for matches vs. non-matches
plot_precision_recall     Precision-recall curve with average precision
plot_roc                  ROC curve with AUC
plot_threshold_analysis   Match count across threshold values
plot_match_weights        Per-feature model weights or importances
plot_comparison_viewer    Sampled pair comparisons with field-level detail

Similarity Metrics

Metric                Type     Description
JARO_WINKLER          String   Edit-distance variant favoring common prefixes
LEVENSHTEIN           String   Normalized edit distance (0-1)
DAMERAU_LEVENSHTEIN   String   Levenshtein + transpositions
EXACT                 String   Binary 0/1 equality
SOUNDEX               String   Phonetic encoding match
JACCARD               String   Token-set intersection over union
COSINE                String   Token-level cosine similarity
TOKEN_SORT_RATIO      String   Order-invariant fuzzy match (sorted tokens + Levenshtein)
NUMERIC               Numeric  Absolute distance
NUMERIC_SIMILARITY    Numeric  Scaled proximity (0-1)
DATE_DISTANCE         Numeric  Absolute days between dates

All string metrics are computed in parallel via Rayon. Numeric metrics operate on f64 slices passed zero-copy from numpy.
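
A configuration mixing string and numeric metrics might look like this (a sketch; "postcode" and "age" are illustrative column names, not from the Quick Start data):

comparisons = [
    FieldComparison("name",     "name",     Metric.TOKEN_SORT_RATIO),
    FieldComparison("postcode", "postcode", Metric.EXACT),
    FieldComparison("age",      "age",      Metric.NUMERIC_SIMILARITY),
]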


Models

Supervised (require labelled pairs)

Model                 Key                   Training       Inference
Logistic Regression   "logistic"            scikit-learn   Rust (ndarray GEMV + sigmoid)
Random Forest         "random_forest"       scikit-learn   Rust (parallel tree traversal)
Gradient Boosting     "gradient_boosting"   scikit-learn   Rust (parallel tree traversal + sigmoid)
SVM (RBF)             "svm"                 scikit-learn   Rust (parallel RBF kernel + Platt scaling)
XGBoost               "xgboost"             xgboost        Native C++
LightGBM              "lightgbm"            lightgbm       Native C++

Unsupervised (no labels needed)

Model            Key                Algorithm
Fellegi-Sunter   "fellegi_sunter"   EM-based probabilistic matching with m/u weights
K-Means          "kmeans"           k=2 clustering with optional silhouette auto-tuning
GMM              "gmm"              Gaussian mixture with BIC-based component selection
DBSCAN           "dbscan"           Density-based clustering with auto eps estimation
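
The examples above call .estimate() with no arguments and the default estimator isn't stated here; if .estimate() accepts a model key by analogy with .fit(), selecting one would look like this (the keyword is an assumption, not confirmed above):

# Assumed signature, mirroring fit(labels, model=...); verify against the docs
linker.block().compute_features().estimate(model="fellegi_sunter")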

Benchmarks

Pipeline timing at various scales (macOS Apple Silicon, single postcode blocking rule, 3 comparisons):

Records per side   Candidate pairs   Blocking   Features (Rust)   Inference   Full pipeline
1,000              ~5K               15 ms      3 ms              <1 ms       40 ms
5,000              ~25K              30 ms      15 ms             <1 ms       120 ms
10,000             ~50K              50 ms      35 ms             <1 ms       250 ms
50,000             ~250K             149 ms     162 ms            3 ms        820 ms
Reproduce with:

python benchmarks/bench_pipeline.py
python benchmarks/bench_pipeline.py --scales 10000 100000

Architecture

Python API                          Rust Core (PyO3 + Rayon)
-----------------------------------+--------------------------------------
Linker / Deduplicator               similarity.rs    11 parallel metrics
  |-- RuleBlocker (DuckDB SQL)      features.rs      single-call feature matrix
  |-- EmbeddingBlocker (FAISS)      inference.rs     logistic regression (GEMV)
  |-- FeatureComputer ------------> tree_inference.rs RF / GBDT (flat numpy arrays)
  |-- MatchModel / RF / GBDT -----> svm_inference.rs  SVM RBF + Platt scaling
  |-- SVMModel ----------------->
  |-- XGBoostModel (native C++)     All batch functions release the GIL
  |-- LightGBMModel (native C++)    and use Rayon for CPU parallelism.
  |-- FellegiSunterModel            Numeric data passed as zero-copy numpy.
  |-- KMeans / GMM / DBSCAN         Tree models use flat numpy arrays with
  +-- Visualize (Altair / mpl)      offsets to avoid per-call serialization.

Design principles:

  • Single Rust call for all feature computation — one Python-to-Rust crossing per batch, not per-metric
  • Zero-copy numeric data — numpy arrays passed directly to Rust via PyO3, no .tolist() conversion
  • Flat tree serialization — sklearn tree structures are flattened into contiguous numpy arrays with offset indices, eliminating repeated list-to-Vec conversion on every predict call
  • GIL-free parallelism — every Rust batch function calls py.allow_threads() before Rayon work
  • Streaming support — block_iter(), compute_chunked(), and predict_iter() for datasets that don't fit in RAM (see the sketch below)
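
A rough shape of the streaming path, using the iterator names from the last bullet (chunk types and exact arguments are assumptions, not a verbatim API):

# predict_iter() is named above; the threshold keyword and per-chunk
# DataFrame yield type are assumptions made for illustration.
for i, chunk in enumerate(linker.predict_iter(threshold=0.8)):
    chunk.to_parquet(f"matches_{i:04d}.parquet")  # persist each chunk as it arrives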

API Reference

Core

Class                                       Description
Linker(df_left, df_right, config)           Main record linkage pipeline
Deduplicator(df, config)                    Self-join deduplication with clustering
LinkConfig(...)                             Pipeline configuration (blocking rules, comparisons, options)
FieldComparison(left, right, metric, ...)   A single field comparison definition
BlockingRule(sql)                           SQL join condition for candidate generation
Metric                                      Enum of available similarity metrics

Blocking

Class                                                  Description
RuleBlocker                                            DuckDB-powered SQL blocking
EmbeddingBlocker                                       TF-IDF + FAISS approximate nearest neighbor blocking
EmbeddingBlockingConfig(fields, top_k, min_sim, ...)   ANN blocking configuration

Models

Class                   Description
MatchModel              Logistic regression (Rust inference)
RandomForestModel       Random forest (Rust inference)
GradientBoostingModel   Gradient boosting (Rust inference)
SVMModel                SVM with RBF kernel (Rust inference)
XGBoostModel            XGBoost wrapper (native C++ inference)
LightGBMModel           LightGBM wrapper (native C++ inference)
FellegiSunterModel      Unsupervised EM probabilistic matching
KMeansModel             K-Means clustering with optional auto-tuning
GMMModel                Gaussian mixture with BIC auto-tuning
DBSCANModel             DBSCAN with auto eps estimation

Utilities

Function / Class                  Description
estimate_ram(...)                 Estimate memory requirements for a pipeline configuration
RAMEstimate                       Result object with summary(), can_proceed, peak_bytes
apply_preprocess(series, steps)   Apply preprocessing chain to a pandas Series

Development

# Setup
git clone https://github.com/yourusername/blazematch.git
cd blazematch
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
maturin develop --release

# Tests
pytest tests/ -v             # 153 Python tests
cargo test                   # 22 Rust unit tests

# Benchmarks
cargo bench                  # Rust micro-benchmarks (Criterion)
python benchmarks/bench_pipeline.py

# Lint
ruff check python/

Project Structure

blazematch/
  src/                       # Rust source (PyO3 extension module)
    lib.rs                   # Module registration
    similarity.rs            # 11 parallel similarity metrics
    features.rs              # Single-call feature matrix computation
    inference.rs             # Logistic regression batch prediction
    tree_inference.rs        # Random forest / GBDT batch prediction
    svm_inference.rs         # SVM RBF kernel batch prediction
    utils.rs                 # Shared utilities (sigmoid)
  python/blazematch/         # Python source
    linker.py                # Main pipeline orchestrator
    dedup.py                 # Deduplication wrapper
    blocking.py              # Rule-based and ANN blocking
    features.py              # Feature computation bridge to Rust
    model.py                 # Supervised model classes
    fellegi_sunter.py        # Unsupervised EM model
    clustering.py            # K-Means, GMM, DBSCAN models
    config.py                # Configuration dataclasses
    preprocess.py            # Text preprocessing pipeline
    visualize.py             # 8 chart functions
    estimate_ram.py          # Memory estimation
  tests/                     # pytest suite
  benches/                   # Criterion benchmarks (Rust)
  benchmarks/                # Python pipeline benchmarks (bench_pipeline.py)

Requirements

  • Python >= 3.9
  • Rust toolchain (for building from source)
  • Core: pandas, numpy, pyarrow, duckdb, scikit-learn, scipy, matplotlib

License

See LICENSE for details.
