Composable fuzzy string matching for Polars
polars-stringsim
Composable fuzzy string matching for Polars, implemented as a native Polars plugin (Rust core + Python bindings via PyO3). All scoring runs in Rust — no Python loops over rows — and works in both eager and lazy frames.
Features
- 23 metric expressions: edit distances, Jaro family, token/n-gram, LCS, and phonetic encoders.
- Hybrid scoring: combine multiple algorithms in a single Rust call (hybrid_score), plus pre-built scorers.
- DataFrame helpers: fuzzy_join, deduplicate, pairwise_compare with blocking indexes.
- Explainability: return_breakdown=True returns per-metric scores alongside the combined score.
- Ensembles: weighted_avg, mean, max, min, median, vote.
All similarity functions return Float64 in [0, 1]. Null in either input → null output.
Install
Users (no Rust toolchain required)
Prebuilt wheels are published on PyPI for CPython 3.9–3.13 on Linux x86_64/aarch64, Windows x86_64, and macOS x86_64/arm64:
pip install polars-stringsim
That's it — import polars_stringsim as pf works out of the box.
If a wheel for your platform is missing, pip will fall back to a source build, which does require Rust (see below).
From source / development
For hacking on the plugin itself, or for a platform without a prebuilt wheel:
git clone https://github.com/Pratham-26/rust_helpers.git
cd rust_helpers
# Iterative dev install (rebuild on every change):
pip install maturin
maturin develop --release
# Or: install directly from the repo's source (compiles Rust, needs a toolchain):
pip install git+https://github.com/Pratham-26/rust_helpers.git
Requires Rust stable. Pinned to polars = 0.54.4 / pyo3-polars = 0.27.
See RELEASE.md for how wheels are built and published.
Usage
import polars as pl
import polars_stringsim as pf

customers = pl.DataFrame({"name": ["Robert Smith", "Catherine Jones", "Jon Smyth"]})
db = pl.DataFrame({"name": ["Robert Smyth", "Katherine Jones", "William Brown"]})

# Illustrative frames for examples 2-4 and 6 (sample values only)
df = pl.DataFrame({"a": ["Robert Smith", "Jon Smyth"], "b": ["Robert Smyth", "John Smith"]})
messy_names = pl.DataFrame({"name": ["Jon Smith", "John Smith", "Jon Smyth"]})

# 1. Single metric (the cross join suffixes the right-hand column as "name_right")
customers.join(db, how="cross").with_columns(
    s=pf.jaro_winkler("name", "name_right")
)

# 2. Hybrid: spelling + phonetic + token, fused in one Rust call
df.with_columns(
    hybrid=pf.hybrid_score("a", "b",
        algorithms=["jaro_winkler", "double_metaphone", "trigram_jaccard"],
        weights=[0.5, 0.3, 0.2])
)

# 3. Pre-built scorer
df.with_columns(s=pf.name_default("a", "b"))  # JW + Double Metaphone + trigram

# 4. Per-metric breakdown (explainability)
df.with_columns(
    bd=pf.combine(
        [pf.jaro_winkler("a", "b"), pf.double_metaphone_sim("a", "b")],
        weights=[0.6, 0.4], return_breakdown=True,
    )
)

# 5. fuzzy_join with blocking (avoids O(n*m) cross product)
pf.fuzzy_join(customers, db, left_on="name", right_on="name",
    algorithms=["jaro_winkler", "double_metaphone"], weights=[0.6, 0.4],
    threshold=0.75, top_k=1, block="first_chars", block_n=1)

# 6. deduplicate near-duplicate name variants
pf.deduplicate(messy_names, on="name",
    algorithms=["jaro_winkler"], weights=[1.0],
    composite_threshold=0.8, block="first_chars", block_n=1)

# 7. pairwise_compare for threshold tuning (returns combined score per pair)
pf.pairwise_compare(customers, db, left_on="name", right_on="name",
    algorithms=["jaro_winkler", "trigram_jaccard"], weights=[0.6, 0.4])
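To make the roles of threshold and top_k in example 5 concrete, here is a minimal pure-Python sketch of threshold-plus-top-k matching. It is not the library's implementation: the function names are hypothetical, the scorer is the standard library's difflib.SequenceMatcher used as a stand-in for the Rust metrics, and the assumed semantics (keep at most top_k best candidates per left row, and only those scoring at or above threshold) are a guess at what fuzzy_join does.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in scorer from the standard library; NOT one of the plugin's metrics.
    return SequenceMatcher(None, a, b).ratio()


def fuzzy_match(left, right, threshold=0.75, top_k=1):
    """For each left value, keep up to top_k right values scoring >= threshold."""
    matches = []
    for name in left:
        # Score every right-hand candidate, best first.
        scored = sorted(((similarity(name, r), r) for r in right), reverse=True)
        # Assumed semantics: take the top_k best, then drop those below threshold.
        matches.extend((name, r, s) for s, r in scored[:top_k] if s >= threshold)
    return matches


customers = ["Robert Smith", "Catherine Jones", "Jon Smyth"]
db = ["Robert Smyth", "Katherine Jones", "William Brown"]
for left_name, right_name, score in fuzzy_match(customers, db):
    print(f"{left_name} -> {right_name} ({score:.2f})")
```

With this stand-in scorer, "Jon Smyth" has no candidate at or above 0.75, so it produces no match row. Note that the sketch scores every pair; blocking exists to avoid exactly that.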
API reference
Per-metric expressions (pf.<name>(left, right) → pl.Expr)
jaro, jaro_winkler, levenshtein, levenshtein_norm, damerau_levenshtein,
damerau_levenshtein_norm, osa, hamming, hamming_norm, token_jaccard,
token_sorensen_dice, trigram_jaccard, trigram_sorensen_dice,
qgram_jaccard(left, right, q=3), lcs_sim, soundex_sim, soundex_jw_sim,
metaphone_sim, metaphone_jw_sim, double_metaphone_sim,
double_metaphone_jw_sim, nysiis_sim, nysiis_jw_sim.
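For intuition about what two of these metrics compute, the following pure-Python sketch implements a normalized Levenshtein similarity and a trigram Jaccard similarity. The plugin's Rust code may differ in detail: the normalization by the longer string's length, the absence of padding when extracting trigrams, and the handling of very short strings are all assumptions made here for illustration.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute) with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_norm(a: Optional[str], b: Optional[str]) -> Optional[float]:
    """Similarity in [0, 1]; null in either input gives null output."""
    if a is None or b is None:
        return None
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein(a, b) / longest


def trigrams(s: str) -> set:
    # Assumption: plain character trigrams with no start/end padding.
    return {s[i:i + 3] for i in range(len(s) - 2)}


def trigram_jaccard(a: Optional[str], b: Optional[str]) -> Optional[float]:
    """|A intersect B| / |A union B| over the two trigram sets."""
    if a is None or b is None:
        return None
    ga, gb = trigrams(a), trigrams(b)
    if not ga and not gb:
        return 1.0 if a == b else 0.0  # assumption for strings shorter than 3
    return len(ga & gb) / len(ga | gb)


print(levenshtein_norm("kitten", "sitting"))
print(trigram_jaccard("jones", "joness"))
```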
Combiners / hybrid
- pf.combine(metrics, *, weights=None, method="weighted_avg", threshold=None, return_breakdown=False): fuse pre-built metric expressions.
- pf.hybrid_score(left, right, *, algorithms, weights=None, method="weighted_avg", threshold=None): same, but builds metrics in Rust (no intermediate struct column).
- Pre-built scorers: pf.phonetic_edit, pf.token_char, pf.prefix_ngram, pf.name_default.
DataFrame helpers
- pf.fuzzy_join(left, right, *, left_on, right_on, algorithms, weights, method, threshold, top_k, block, block_n, how, add_breakdown)
- pf.deduplicate(frame, *, on, algorithms, weights, method, composite_threshold, block, block_n)
- pf.pairwise_compare(left, right, *, left_on, right_on, algorithms, weights, method, block, block_n)
- Blocking: pf.block_first_chars(col, n=2), pf.block_char_bag(col)
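The idea behind first-character blocking can be sketched in a few lines of plain Python: index one side by a short prefix key, then only generate candidate pairs that share a key. The helper names below are hypothetical, and the key is assumed to be the raw first n characters with no case folding or other normalization; the plugin's block_first_chars may behave differently.

```python
from collections import defaultdict


def block_key(value, n=2):
    # Assumption: the key is simply the first n characters, unnormalized.
    return value[:n]


def candidate_pairs(left, right, n=2):
    """Yield (left_index, right_index) only for values sharing a block key."""
    index = defaultdict(list)
    for j, value in enumerate(right):
        index[block_key(value, n)].append(j)
    for i, value in enumerate(left):
        for j in index.get(block_key(value, n), []):
            yield i, j


customers = ["Robert Smith", "Catherine Jones", "Jon Smyth"]
db = ["Robert Smyth", "Katherine Jones", "William Brown"]
pairs = list(candidate_pairs(customers, db, n=1))
print(pairs)  # [(0, 0)]: 1 candidate pair instead of the 9 in the cross product
```

The same run also shows the trade-off: "Catherine Jones" and "Katherine Jones" land in different blocks and are never compared, so a short prefix key saves work at the cost of some recall.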
Combine methods
weighted_avg (default; weights normalized to sum to 1), mean, max, min, median, vote (count of metrics ≥ threshold, normalized by N).
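The combine methods listed above can be written out in plain Python as follows. This is a sketch for one row of per-metric scores, not the Rust combine_row itself; the function name is reused only for readability, and the error handling (rejecting mismatched weights or a missing vote threshold) is an assumption.

```python
from statistics import median


def combine_row(scores, method="weighted_avg", weights=None, threshold=None):
    """Fuse per-metric scores in [0, 1] into one combined score."""
    n = len(scores)
    if method == "weighted_avg":
        w = weights if weights is not None else [1.0] * n
        if len(w) != n:
            raise ValueError("weights must match the number of scores")
        total = sum(w)  # dividing by the total normalizes weights to sum to 1
        return sum(s * wi for s, wi in zip(scores, w)) / total
    if method == "mean":
        return sum(scores) / n
    if method == "max":
        return max(scores)
    if method == "min":
        return min(scores)
    if method == "median":
        return median(scores)
    if method == "vote":
        if threshold is None:
            raise ValueError("vote requires a threshold")
        # Fraction of metrics whose score reaches the threshold.
        return sum(s >= threshold for s in scores) / n
    raise ValueError(f"unknown method: {method}")


scores = [0.9, 0.5, 0.8]
print(combine_row(scores, weights=[3, 1, 1]))
print(combine_row(scores, method="vote", threshold=0.75))
```

Because weights are normalized, [3, 1, 1] and [0.6, 0.2, 0.2] give the same weighted average.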
Algorithm name registry
hybrid_score, fuzzy_join, deduplicate, and pairwise_compare accept algorithm names:
jaro, jaro_winkler, levenshtein, levenshtein_norm, damerau_levenshtein, damerau_levenshtein_norm, osa, hamming, hamming_norm, token_jaccard, token_sorensen_dice, trigram_jaccard, trigram_sorensen_dice, lcs_sim, soundex/soundex_jw, metaphone/metaphone_jw, double_metaphone/double_metaphone_jw, nysiis/nysiis_jw.
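A name registry of this kind is essentially a mapping from algorithm name to scoring function. The sketch below shows the pattern in pure Python with two token-based metrics; SKETCH_REGISTRY and hybrid_score_py are hypothetical names, tokens are assumed to be whitespace-separated and case-sensitive, and the real registry maps names to Polars expression builders rather than to Python functions.

```python
def _tokens(s):
    # Assumption: whitespace tokenization, no case folding.
    return set(s.split())


def token_jaccard(a, b):
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0  # assumption: two token-less strings count as identical
    return len(ta & tb) / len(ta | tb)


def token_sorensen_dice(a, b):
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))


# Hypothetical stand-in for the algorithm name -> builder map.
SKETCH_REGISTRY = {
    "token_jaccard": token_jaccard,
    "token_sorensen_dice": token_sorensen_dice,
}


def hybrid_score_py(left, right, algorithms, weights=None):
    """Look up each named metric and return their weighted average."""
    if left is None or right is None:
        return None  # null in either input gives null output
    unknown = [name for name in algorithms if name not in SKETCH_REGISTRY]
    if unknown:
        raise ValueError(f"unknown algorithm(s): {unknown}")
    w = weights if weights is not None else [1.0] * len(algorithms)
    total = sum(w)
    return sum(SKETCH_REGISTRY[name](left, right) * wi
               for name, wi in zip(algorithms, w)) / total


print(hybrid_score_py("Robert Smith", "Robert Smyth",
                      algorithms=["token_jaccard", "token_sorensen_dice"],
                      weights=[0.5, 0.5]))
```

Validating names up front means a typo in an algorithm name fails before any row is scored.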
Tests
cargo test --lib # 30 Rust unit tests
pytest tests/python # 38 Python end-to-end tests
Run the example end-to-end with uv:
maturin build --release
WHL=$(ls target/wheels/polars_stringsim-*.whl | head -n 1)  # resolve the glob to the built wheel's path
uv run --with "$WHL" --with polars --with pytest python -m pytest tests/python/
uv run --with "$WHL" --with polars python examples/record_linkage.py
Architecture
src/
├── algorithms/ # pure Rust: edit, jaro, token, lcs, phonetic
├── combiner.rs # CombineMethod enum + combine_row (weighted_avg/mean/max/min/median/vote)
├── expr/mod.rs # #[polars_expr] wrappers, one per metric
├── expr_combine.rs # combine_expr + combine_breakdown_expr (Struct output)
├── expr_hybrid.rs # hybrid_score_expr (multi-algo in one Rust call)
├── series_util.rs # str-column readers, Float64 builder, null handling
└── lib.rs # #[pymodule]
python/polars_stringsim/
├── _expression.py # per-metric expr builders + combine()
├── _registry.py # algorithm name → builder map
├── hybrid.py # hybrid_score + pre-built scorers
├── frame.py # fuzzy_join, deduplicate, pairwise_compare, blocking
└── __init__.py # public API
Roadmap
- Done (MVP): all algorithms + expressions + combiner.
- Done (Phase 2): fuzzy_join, deduplicate, pairwise_compare, blocking indexes.
- Done (Phase 3): hybrid_score, per-metric explainability (return_breakdown), pre-built hybrid scorers.
- Future: custom combiner registration (user-supplied Rust closures), GPU acceleration, more phonetic encoders (Caverphone, Beider-Morse).