Composable fuzzy string matching for Polars
polars-stringsim
Composable fuzzy string matching for Polars, implemented as a native Polars plugin (Rust core + Python bindings via PyO3). All scoring runs in Rust — no Python loops over rows — and works in both eager and lazy frames.
Features
- 24 metric expressions: edit distances (Levenshtein, Damerau-Levenshtein, OSA, Hamming), the Jaro family, token/n-gram (Jaccard, Sørensen-Dice, q-grams), LCS, and four phonetic encoders (Soundex, Metaphone, DoubleMetaphone, NYSIIS). See ALGORITHMS.md for the math and normalization behind each one.
- Hybrid scoring: combine multiple algorithms in a single Rust call (hybrid_score), plus 4 pre-built scorers (name_default, phonetic_edit, token_char, prefix_ngram).
- DataFrame helpers: fuzzy_join, deduplicate, pairwise_compare with blocking indexes (first-chars, char-bag) to avoid O(n²) cross joins.
- Explainability: return_breakdown=True returns per-metric scores alongside the combined score, so you can see why two strings matched.
- Ensembles: weighted_avg, mean, max, min, median, vote.
All similarity functions return Float64 in [0, 1]. Null in either input → null output.
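To make the output contract concrete, here is a minimal plain-Python sketch of a normalized edit similarity that stays in [0, 1] and propagates nulls. It is not the plugin's Rust code. The normalization used here (1 minus distance divided by the longer length) is one common choice and an assumption; ALGORITHMS.md is the reference for what each metric actually does. The names `levenshtein` and `levenshtein_similarity` are hypothetical helpers for this illustration only.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def levenshtein_similarity(a: Optional[str], b: Optional[str]) -> Optional[float]:
    """Similarity in [0, 1]; None in either input gives None (assumed normalization)."""
    if a is None or b is None:
        return None
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein_similarity("kitten", "sitting"))
print(levenshtein_similarity(None, "sitting"))
```

Because the distance can never exceed the longer string's length, the result cannot leave the [0, 1] interval.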
Install
Users (no Rust toolchain required)
Prebuilt wheels are published on PyPI for CPython 3.9–3.13 on Linux x86_64 and Windows x86_64. macOS and aarch64-linux wheels are planned (see Platform support).
pip install polars-stringsim
That's it — import polars_stringsim as pf works out of the box.
If a wheel for your platform is missing, pip will fall back to a source build, which does require Rust (see below).
From source / development
For hacking on the plugin itself, or for a platform without a prebuilt wheel:
git clone https://github.com/Pratham-26/rust_helpers.git
cd rust_helpers
# Iterative dev install (rebuild on every change):
pip install maturin
maturin develop --release
# Or: install directly from the repo's source (compiles Rust, needs a toolchain):
pip install git+https://github.com/Pratham-26/rust_helpers.git
Requires Rust stable. Pinned to polars = 0.54.4 / pyo3-polars = 0.27.
See RELEASE.md for how wheels are built and published.
Usage
import polars as pl
import polars_stringsim as pf

customers = pl.DataFrame({"name": ["Robert Smith", "Catherine Jones", "Jon Smyth"]})
db = pl.DataFrame({"name": ["Robert Smyth", "Katherine Jones", "William Brown"]})

# 1. Single metric (the cross join suffixes the right-hand column as "name_right")
customers.join(db, how="cross").with_columns(
    s=pf.jaro_winkler("name", "name_right")
)

# Examples 2-4 use df: any frame with two string columns "a" and "b", for instance
df = customers.join(db, how="cross").rename({"name": "a", "name_right": "b"})

# 2. Hybrid: spelling + phonetic + token, fused in one Rust call
df.with_columns(
    hybrid=pf.hybrid_score(
        "a", "b",
        algorithms=["jaro_winkler", "double_metaphone", "trigram_jaccard"],
        weights=[0.5, 0.3, 0.2],
    )
)

# 3. Pre-built scorer
df.with_columns(s=pf.name_default("a", "b"))  # JW + Double Metaphone + trigram

# 4. Per-metric breakdown (explainability)
df.with_columns(
    bd=pf.combine(
        [pf.jaro_winkler("a", "b"), pf.double_metaphone_sim("a", "b")],
        weights=[0.6, 0.4], return_breakdown=True,
    )
)

# 5. fuzzy_join with blocking (avoids O(n*m) cross product)
pf.fuzzy_join(customers, db, left_on="name", right_on="name",
              algorithms=["jaro_winkler", "double_metaphone"], weights=[0.6, 0.4],
              threshold=0.75, top_k=1, block="first_chars", block_n=1)

# 6. deduplicate near-duplicate name variants
# messy_names: any frame with a string column "name"; here, both frames stacked
messy_names = pl.concat([customers, db])
pf.deduplicate(messy_names, on="name",
               algorithms=["jaro_winkler"], weights=[1.0],
               composite_threshold=0.8, block="first_chars", block_n=1)

# 7. pairwise_compare for threshold tuning (returns combined score per pair)
pf.pairwise_compare(customers, db, left_on="name", right_on="name",
                    algorithms=["jaro_winkler", "trigram_jaccard"], weights=[0.6, 0.4])
API reference
Per-metric expressions (pf.<name>(left, right) → pl.Expr)
jaro, jaro_winkler, levenshtein, levenshtein_norm, damerau_levenshtein,
damerau_levenshtein_norm, osa, hamming, hamming_norm, token_jaccard,
token_sorensen_dice, trigram_jaccard, trigram_sorensen_dice,
qgram_jaccard(left, right, q=3), lcs_sim, soundex_sim, soundex_jw_sim,
metaphone_sim, metaphone_jw_sim, double_metaphone_sim,
double_metaphone_jw_sim, nysiis_sim, nysiis_jw_sim.
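For intuition about what one of these expressions computes for a single row, the following is a plain-Python sketch of textbook Jaro and Jaro-Winkler. It is an illustration only, not the plugin's Rust implementation. The prefix scale of 0.1 and the maximum prefix length of 4 are the conventional defaults and are assumptions here; the exact behaviour of pf.jaro_winkler is documented in ALGORITHMS.md.

```python
def jaro(a: str, b: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters count as matching only within this sliding window.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_hit[j] and b[j] == ch:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are half-transpositions.
    a_seq = [a[i] for i in range(len(a)) if a_hit[i]]
    b_seq = [b[j] for j in range(len(b)) if b_hit[j]]
    t = sum(x != y for x, y in zip(a_seq, b_seq)) / 2
    return (matches / len(a) + matches / len(b) + (matches - t) / matches) / 3


def jaro_winkler(a: str, b: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix (assumed conventional defaults)."""
    base = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:max_prefix], b[:max_prefix]):
        if x != y:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


print(round(jaro("MARTHA", "MARHTA"), 4))          # 0.9444
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

The prefix boost is why the Jaro family tends to suit personal names, where typos are less common in the first few characters.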
Combiners / hybrid
- pf.combine(metrics, *, weights=None, method="weighted_avg", threshold=None, return_breakdown=False): fuse pre-built metric expressions.
- pf.hybrid_score(left, right, *, algorithms, weights=None, method="weighted_avg", threshold=None): same, but builds metrics in Rust (no intermediate struct column).
- Pre-built scorers: pf.phonetic_edit, pf.token_char, pf.prefix_ngram, pf.name_default.
DataFrame helpers
- pf.fuzzy_join(left, right, *, left_on, right_on, algorithms, weights, method, threshold, top_k, block, block_n, how, add_breakdown)
- pf.deduplicate(frame, *, on, algorithms, weights, method, composite_threshold, block, block_n)
- pf.pairwise_compare(left, right, *, left_on, right_on, algorithms, weights, method, block, block_n)
- Blocking: pf.block_first_chars(col, n=2), pf.block_char_bag(col)
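The idea behind first-chars blocking can be sketched in plain Python: only rows that share a blocking key become candidate pairs, so the scorer never sees the full cross product. This is a rough illustration rather than the plugin's implementation. The helper names and the case-folding of the key are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def first_chars_key(value: str, n: int = 2) -> str:
    """Blocking key: the first n characters, lower-cased (case-folding is assumed)."""
    return value[:n].lower()


def blocked_pairs(left: List[str], right: List[str], n: int = 2) -> List[Tuple[str, str]]:
    """Candidate pairs restricted to rows whose blocking keys are equal."""
    index: Dict[str, List[str]] = defaultdict(list)
    for r in right:
        index[first_chars_key(r, n)].append(r)
    pairs = []
    for l in left:
        for r in index.get(first_chars_key(l, n), []):
            pairs.append((l, r))
    return pairs


customers = ["Robert Smith", "Catherine Jones", "Jon Smyth"]
db = ["Robert Smyth", "Katherine Jones", "William Brown"]

# 1 candidate pair instead of the 9 a cross join would produce.
print(blocked_pairs(customers, db, n=1))
```

The same data shows the trade-off: "Catherine Jones" and "Katherine Jones" start with different letters, so a first-character block never compares them. Blocking reduces work at the cost of recall, which is worth keeping in mind when choosing block and block_n.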
Combine methods
weighted_avg (default; weights normalized to sum to 1), mean, max, min, median, vote (count of metrics ≥ threshold, normalized by N).
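The combine methods listed above reduce one row of per-metric scores to a single number. A minimal plain-Python sketch of that reduction follows, assuming the semantics stated in the list (weights normalized to sum to 1, vote counted against a threshold and divided by N). The function name `combine_scores` is hypothetical, and error handling in the real combiner may differ.

```python
from statistics import median
from typing import Optional, Sequence


def combine_scores(
    scores: Sequence[float],
    method: str = "weighted_avg",
    weights: Optional[Sequence[float]] = None,
    threshold: Optional[float] = None,
) -> float:
    """Reduce one row of per-metric scores to a single combined score."""
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one score")
    if method == "weighted_avg":
        w = list(weights) if weights is not None else [1.0] * n
        if len(w) != n:
            raise ValueError("weights and scores must have the same length")
        total = sum(w)  # dividing by the total normalizes the weights to sum to 1
        return sum(s * wi for s, wi in zip(scores, w)) / total
    if method == "mean":
        return sum(scores) / n
    if method == "max":
        return max(scores)
    if method == "min":
        return min(scores)
    if method == "median":
        return median(scores)
    if method == "vote":
        if threshold is None:
            raise ValueError("vote needs a threshold")
        # Fraction of metrics that reach the threshold.
        return sum(s >= threshold for s in scores) / n
    raise ValueError(f"unknown method: {method}")


row = [0.9, 0.6, 0.3]
print(combine_scores(row, "weighted_avg", weights=[5, 3, 2]))
print(combine_scores(row, "vote", threshold=0.5))
```

Because the weights are normalized, `[5, 3, 2]` and `[0.5, 0.3, 0.2]` give the same weighted average.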
Parallelism & thread control
hybrid_score parallelizes its row scan across a dedicated worker pool that is
independent of the Polars engine pool (POLARS_MAX_THREADS), so you can
tune them separately. Throughput increases with thread count (≈7–8× on 16
threads vs 1).
import polars_stringsim as pf
pf.get_num_threads() # default = number of logical cores
pf.set_num_threads(8) # use 8 threads for hybrid_score
pf.set_num_threads(0) # restore default
Or set the default at process start: POLARS_STRINGSIM_THREADS=8 python ...
Algorithm name registry
hybrid_score, fuzzy_join, deduplicate, and pairwise_compare accept algorithm names:
jaro, jaro_winkler, levenshtein, levenshtein_norm, damerau_levenshtein, damerau_levenshtein_norm, osa, hamming, hamming_norm, token_jaccard, token_sorensen_dice, trigram_jaccard, trigram_sorensen_dice, lcs_sim, soundex/soundex_jw, metaphone/metaphone_jw, double_metaphone/double_metaphone_jw, nysiis/nysiis_jw.
For what each metric computes and how it's normalized, see ALGORITHMS.md.
Platform support
Prebuilt wheels on PyPI (no Rust toolchain needed):
| Platform | Wheel | Status |
|---|---|---|
| Linux x86_64 | manylinux_2_28_x86_64 | ✅ v0.1.0 |
| Windows x86_64 | win_amd64 | ✅ v0.1.0 |
| macOS x86_64 / arm64 | — | ⏳ planned (sdist fallback works; needs Rust) |
| Linux aarch64 | — | ⏳ planned (sdist fallback works; needs Rust) |
CPython 3.9–3.13 supported wherever a wheel exists. pip install polars-stringsim picks the right wheel automatically; on uncovered platforms it builds from the sdist (requires Rust stable).
Tests
cargo test --lib # 30 Rust unit tests
pytest tests/python # 38 Python end-to-end tests
Run the example end-to-end with uv:
maturin build --release
# Resolve the glob now: a quoted "$WHL" would otherwise pass the literal pattern
WHL=$(ls target/wheels/polars_stringsim-*.whl | head -n 1)
uv run --with "$WHL" --with polars --with pytest python -m pytest tests/python/
uv run --with "$WHL" --with polars python examples/record_linkage.py
Architecture
src/
├── algorithms/ # pure Rust: edit, jaro, token, lcs, phonetic
├── combiner.rs # CombineMethod enum + combine_row (weighted_avg/mean/max/min/median/vote)
├── expr/mod.rs # #[polars_expr] wrappers, one per metric
├── expr_combine.rs # combine_expr + combine_breakdown_expr (Struct output)
├── expr_hybrid.rs # hybrid_score_expr (multi-algo in one Rust call)
├── series_util.rs # str-column readers, Float64 builder, null handling
└── lib.rs # #[pymodule]
python/polars_stringsim/
├── _expression.py # per-metric expr builders + combine()
├── _registry.py # algorithm name → builder map
├── hybrid.py # hybrid_score + pre-built scorers
├── frame.py # fuzzy_join, deduplicate, pairwise_compare, blocking
└── __init__.py # public API
Features & status
Everything in the original PRD is implemented and shipped:
- ✅ 24 metric expressions: edit distances, Jaro family, token/n-gram, LCS, and phonetic encoders. Full reference: ALGORITHMS.md.
- ✅ Composable ensembles: combine() with weighted_avg/mean/max/min/median/vote.
- ✅ Hybrid scoring: hybrid_score() (builds metrics in Rust, no intermediate struct column) + 4 pre-built scorers.
- ✅ DataFrame helpers: fuzzy_join (blocked/cross, threshold, top_k, how), deduplicate (union-find clustering), pairwise_compare, and blocking indexes.
- ✅ Explainability: return_breakdown=True returns a per-metric score breakdown.
- ✅ Native Polars plugin: all scoring in Rust via PyO3; works in eager and lazy frames.
- ✅ Prebuilt wheels on PyPI: pip install polars-stringsim, no Rust toolchain required (see Platform support).
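The union-find clustering mentioned for deduplicate can be pictured as follows: every pair scoring above the threshold links two rows, and rows connected through any chain of links end up in the same cluster. The sketch below is a generic union-find in plain Python, written under that assumption. The function name `cluster` and the choice of the smallest row index as the cluster label are hypothetical, not taken from the plugin.

```python
from typing import Iterable, List, Tuple


def cluster(n_items: int, matched_pairs: Iterable[Tuple[int, int]]) -> List[int]:
    """Return a cluster label per row; rows linked by any chain of matches share a label."""
    parent = list(range(n_items))

    def find(i: int) -> int:
        # Walk to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i: int, j: int) -> None:
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as root so labels are deterministic.
            parent[max(ri, rj)] = min(ri, rj)

    for i, j in matched_pairs:
        union(i, j)
    return [find(i) for i in range(n_items)]


# Rows 0-1 and 1-2 matched, so 0, 1, 2 form one cluster even though 0-2 was never scored.
print(cluster(5, [(0, 1), (1, 2), (3, 4)]))  # [0, 0, 0, 3, 3]
```

Clustering is transitive: two rows can share a cluster without having matched each other directly, which is worth remembering when choosing composite_threshold.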
Not yet implemented
- ⏳ Platform wheels: macOS (x86_64 + arm64) and aarch64-linux — currently sdist-only on those platforms.
- 🔲 Custom combiner registration (user-supplied Rust closures).
- 🔲 GPU acceleration.
- 🔲 More phonetic encoders (Caverphone, Beider-Morse).