Composable fuzzy string matching for Polars

polars-stringsim

Composable fuzzy string matching for Polars, implemented as a native Polars plugin (Rust core + Python bindings via PyO3). All scoring runs in Rust — no Python loops over rows — and works in both eager and lazy frames.

Features

  • 23 metric expressions: edit distances, Jaro family, token/n-gram, LCS, and phonetic encoders.
  • Hybrid scoring: combine multiple algorithms in a single Rust call (hybrid_score), plus pre-built scorers.
  • DataFrame helpers: fuzzy_join, deduplicate, pairwise_compare with blocking indexes.
  • Explainability: return_breakdown=True returns per-metric scores alongside the combined score.
  • Ensembles: weighted_avg, mean, max, min, median, vote.

All similarity functions return Float64 in [0, 1]. Null in either input → null output.
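
To make this output contract concrete, here is a minimal pure-Python sketch of one metric that follows it: a normalized Levenshtein similarity in [0, 1] that returns None when either input is None. It is an illustration only, not the plugin's Rust code. The helper names, the normalization by the longer string's length, and the treatment of two empty strings are assumptions.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (free if the characters match)
            ))
        prev = curr
    return prev[-1]


def levenshtein_sim(a: Optional[str], b: Optional[str]) -> Optional[float]:
    """Similarity in [0, 1]; None if either input is None (null in, null out)."""
    if a is None or b is None:
        return None
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein_sim("Robert Smith", "Robert Smyth"))
print(levenshtein_sim("Jon Smyth", None))
```

Because the distance can never exceed the length of the longer string, the result stays inside [0, 1] without clamping.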

Install

Users (no Rust toolchain required)

Prebuilt wheels are published on PyPI for CPython 3.9–3.13 on Linux x86_64/aarch64, Windows x86_64, and macOS x86_64/arm64:

pip install polars-stringsim

That's it — import polars_stringsim as pf works out of the box.

If a wheel for your platform is missing, pip will fall back to a source build, which does require Rust (see below).

From source / development

For hacking on the plugin itself, or for a platform without a prebuilt wheel:

git clone https://github.com/Pratham-26/rust_helpers.git
cd rust_helpers

# Iterative dev install (re-run maturin develop after each change to rebuild):
pip install maturin
maturin develop --release

# Or: install directly from the repo's source (compiles Rust, needs a toolchain):
pip install git+https://github.com/Pratham-26/rust_helpers.git

Building from source requires the stable Rust toolchain. The crate is pinned to polars = 0.54.4 and pyo3-polars = 0.27. See RELEASE.md for how wheels are built and published.

Usage

import polars as pl
import polars_stringsim as pf

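customers = pl.DataFrame({"name": ["Robert Smith", "Catherine Jones", "Jon Smyth"]})
db = pl.DataFrame({"name": ["Robert Smyth", "Katherine Jones", "William Brown"]})

# Inputs for the examples below, derived from the two frames above:
# df holds every name pair in columns "a"/"b"; messy_names stacks all names in one column.
df = customers.join(db, how="cross").rename({"name": "a", "name_right": "b"})
messy_names = pl.concat([customers, db])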

# 1. Single metric
customers.join(db, how="cross").with_columns(
    s=pf.jaro_winkler("name", "name_right")
)

# 2. Hybrid: spelling + phonetic + token, fused in one Rust call
df.with_columns(
    hybrid=pf.hybrid_score("a", "b",
        algorithms=["jaro_winkler", "double_metaphone", "trigram_jaccard"],
        weights=[0.5, 0.3, 0.2])
)

# 3. Pre-built scorer
df.with_columns(s=pf.name_default("a", "b"))   # JW + Double Metaphone + trigram

# 4. Per-metric breakdown (explainability)
df.with_columns(
    bd=pf.combine(
        [pf.jaro_winkler("a","b"), pf.double_metaphone_sim("a","b")],
        weights=[0.6, 0.4], return_breakdown=True,
    )
)

# 5. fuzzy_join with blocking (avoids O(n*m) cross product)
pf.fuzzy_join(customers, db, left_on="name", right_on="name",
    algorithms=["jaro_winkler", "double_metaphone"], weights=[0.6, 0.4],
    threshold=0.75, top_k=1, block="first_chars", block_n=1)

# 6. deduplicate near-duplicate name variants
pf.deduplicate(messy_names, on="name",
    algorithms=["jaro_winkler"], weights=[1.0],
    composite_threshold=0.8, block="first_chars", block_n=1)

# 7. pairwise_compare for threshold tuning (returns combined score per pair)
pf.pairwise_compare(customers, db, left_on="name", right_on="name",
    algorithms=["jaro_winkler", "trigram_jaccard"], weights=[0.6, 0.4])

API reference

Per-metric expressions (pf.<name>(left, right) → pl.Expr)

jaro, jaro_winkler, levenshtein, levenshtein_norm, damerau_levenshtein, damerau_levenshtein_norm, osa, hamming, hamming_norm, token_jaccard, token_sorensen_dice, trigram_jaccard, trigram_sorensen_dice, qgram_jaccard(left, right, q=3), lcs_sim, soundex_sim, soundex_jw_sim, metaphone_sim, metaphone_jw_sim, double_metaphone_sim, double_metaphone_jw_sim, nysiis_sim, nysiis_jw_sim.
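
As a rough idea of what the n-gram metrics compute, the sketch below implements a q-gram Jaccard similarity in plain Python: split each string into overlapping character q-grams and divide the size of the intersection by the size of the union. The helper names only mirror the expression name above. Padding, case handling, and the short-string rule in the Rust implementation are not documented here, so the choices below are assumptions.

```python
def qgrams(s: str, q: int = 3) -> set:
    """Set of overlapping character q-grams (no padding: an assumption)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def qgram_jaccard(a: str, b: str, q: int = 3) -> float:
    """|A ∩ B| / |A ∪ B| over the q-gram sets of a and b."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    union = ga | gb
    if not union:
        # Both strings are shorter than q; fall back to exact equality (assumption).
        return 1.0 if a == b else 0.0
    return len(ga & gb) / len(union)


print(qgram_jaccard("Robert Smith", "Robert Smyth"))       # trigrams (q=3)
print(qgram_jaccard("Robert Smith", "Robert Smyth", q=2))  # bigrams
```

Smaller q values are more tolerant of single-character edits, while larger q values reward longer shared substrings.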

Combiners / hybrid

  • pf.combine(metrics, *, weights=None, method="weighted_avg", threshold=None, return_breakdown=False) — fuse pre-built metric expressions.
  • pf.hybrid_score(left, right, *, algorithms, weights=None, method="weighted_avg", threshold=None) — same, but builds metrics in Rust (no intermediate struct column).
  • Pre-built scorers: pf.phonetic_edit, pf.token_char, pf.prefix_ngram, pf.name_default.

DataFrame helpers

  • pf.fuzzy_join(left, right, *, left_on, right_on, algorithms, weights, method, threshold, top_k, block, block_n, how, add_breakdown)
  • pf.deduplicate(frame, *, on, algorithms, weights, method, composite_threshold, block, block_n)
  • pf.pairwise_compare(left, right, *, left_on, right_on, algorithms, weights, method, block, block_n)
  • Blocking: pf.block_first_chars(col, n=2), pf.block_char_bag(col)
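
The idea behind blocking can be sketched in plain Python: derive a cheap key per row, index one side by that key, and score only the pairs that share a key instead of the full cross product. This is not the plugin's implementation. The function names, the lowercasing of the key, and the list-based inputs are assumptions made for illustration.

```python
from collections import defaultdict


def first_chars_key(value: str, n: int = 2) -> str:
    """Blocking key: the first n characters, lowercased (lowercasing is an assumption)."""
    return value[:n].lower()


def candidate_pairs(left, right, n=2):
    """Yield (left_value, right_value) pairs whose blocking keys match."""
    index = defaultdict(list)
    for rv in right:
        index[first_chars_key(rv, n)].append(rv)
    for lv in left:
        for rv in index.get(first_chars_key(lv, n), []):
            yield lv, rv


customers = ["Robert Smith", "Catherine Jones", "Jon Smyth"]
db = ["Robert Smyth", "Katherine Jones", "William Brown"]

pairs = list(candidate_pairs(customers, db, n=1))
print(pairs)
print(f"{len(pairs)} candidate pair(s) instead of {len(customers) * len(db)}")
```

The trade-off is recall: with a first-character key, "Catherine Jones" and "Katherine Jones" land in different blocks and are never compared, so the blocking key should be chosen to tolerate the kinds of variation expected in the data.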

Combine methods

weighted_avg (default; weights normalized to sum to 1), mean, max, min, median, vote (count of metrics ≥ threshold, normalized by N).
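
The methods listed above can be written out for a single row of per-metric scores. The following is a plain-Python sketch based only on the descriptions in this section, not a port of the Rust combine_row. Equal weights when none are given, and the lack of null handling, are assumptions.

```python
from statistics import median


def combine_row(scores, method="weighted_avg", weights=None, threshold=None):
    """Fuse per-metric scores (each in [0, 1]) into one score."""
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one score")
    if method == "weighted_avg":
        # Assumption: missing weights mean equal weights.
        w = list(weights) if weights is not None else [1.0] * n
        if len(w) != n:
            raise ValueError("weights and scores must have the same length")
        # Dividing by the total is the same as normalizing weights to sum to 1.
        return sum(s * wi for s, wi in zip(scores, w)) / sum(w)
    if method == "mean":
        return sum(scores) / n
    if method == "max":
        return max(scores)
    if method == "min":
        return min(scores)
    if method == "median":
        return median(scores)
    if method == "vote":
        if threshold is None:
            raise ValueError("vote needs a threshold")
        # Fraction of metrics whose score reaches the threshold.
        return sum(1 for s in scores if s >= threshold) / n
    raise ValueError(f"unknown method: {method}")


scores = [0.9, 0.5, 0.8]
print(combine_row(scores, weights=[0.5, 0.3, 0.2]))
print(combine_row(scores, method="vote", threshold=0.75))
```

Every method maps scores in [0, 1] back into [0, 1], so a combined score can be thresholded the same way as a single metric.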

Algorithm name registry

hybrid_score, fuzzy_join, deduplicate, and pairwise_compare accept algorithm names:

jaro, jaro_winkler, levenshtein, levenshtein_norm, damerau_levenshtein, damerau_levenshtein_norm, osa, hamming, hamming_norm, token_jaccard, token_sorensen_dice, trigram_jaccard, trigram_sorensen_dice, lcs_sim, soundex/soundex_jw, metaphone/metaphone_jw, double_metaphone/double_metaphone_jw, nysiis/nysiis_jw.
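
A name registry of this kind is usually a mapping from string names to scorer builders, with validation up front so that a typo fails before any rows are scored. The sketch below shows the pattern in plain Python. It does not reproduce _registry.py: the `register` and `resolve` helpers and the two stand-in scorers ("exact" and "ratio", the latter built on the standard library's difflib) are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List

Scorer = Callable[[str, str], float]

REGISTRY: Dict[str, Scorer] = {}


def register(name: str):
    """Decorator that stores a scorer under a string name."""
    def decorator(fn: Scorer) -> Scorer:
        REGISTRY[name] = fn
        return fn
    return decorator


@register("exact")  # hypothetical stand-in scorer
def exact(a: str, b: str) -> float:
    return 1.0 if a == b else 0.0


@register("ratio")  # hypothetical stand-in scorer backed by difflib
def ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()


def resolve(names: List[str]) -> List[Scorer]:
    """Look up scorers by name, rejecting unknown names before any scoring happens."""
    unknown = [n for n in names if n not in REGISTRY]
    if unknown:
        raise ValueError(f"unknown algorithm(s): {unknown}; known: {sorted(REGISTRY)}")
    return [REGISTRY[n] for n in names]


scorers = resolve(["exact", "ratio"])
print([fn("Robert Smith", "Robert Smyth") for fn in scorers])
```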

Tests

cargo test --lib          # 30 Rust unit tests
pytest tests/python       # 38 Python end-to-end tests

Run the example end-to-end with uv:

maturin build --release
# Resolve the glob here; it would not expand inside the quoted "$WHL" below.
WHL=$(ls target/wheels/polars_stringsim-*.whl | head -n 1)
uv run --with "$WHL" --with polars --with pytest python -m pytest tests/python/
uv run --with "$WHL" --with polars python examples/record_linkage.py

Architecture

src/
├── algorithms/   # pure Rust: edit, jaro, token, lcs, phonetic
├── combiner.rs   # CombineMethod enum + combine_row (weighted_avg/mean/max/min/median/vote)
├── expr/mod.rs   # #[polars_expr] wrappers, one per metric
├── expr_combine.rs   # combine_expr + combine_breakdown_expr (Struct output)
├── expr_hybrid.rs    # hybrid_score_expr (multi-algo in one Rust call)
├── series_util.rs    # str-column readers, Float64 builder, null handling
└── lib.rs        # #[pymodule]

python/polars_stringsim/
├── _expression.py  # per-metric expr builders + combine()
├── _registry.py    # algorithm name → builder map
├── hybrid.py       # hybrid_score + pre-built scorers
├── frame.py        # fuzzy_join, deduplicate, pairwise_compare, blocking
└── __init__.py     # public API

Roadmap

  • Done (MVP): all algorithms + expressions + combiner.
  • Done (Phase 2): fuzzy_join, deduplicate, pairwise_compare, blocking indexes.
  • Done (Phase 3): hybrid_score, per-metric explainability (return_breakdown), pre-built hybrid scorers.
  • Future: custom combiner registration (user-supplied Rust closures), GPU acceleration, more phonetic encoders (Caverphone, Beider-Morse).

Download files

Source Distribution

polars_stringsim-0.1.0.tar.gz (54.4 kB)

Tags: Source

Built Distributions

polars_stringsim-0.1.0-cp313-cp313-win_amd64.whl (5.2 MB)

Tags: CPython 3.13, Windows x86-64

polars_stringsim-0.1.0-cp312-cp312-manylinux_2_28_x86_64.whl (5.5 MB)

Tags: CPython 3.12, manylinux (glibc 2.28+), x86-64

File details

Details for the file polars_stringsim-0.1.0.tar.gz.

File metadata

  • Download URL: polars_stringsim-0.1.0.tar.gz
  • Upload date:
  • Size: 54.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for polars_stringsim-0.1.0.tar.gz
Algorithm Hash digest
SHA256 1128cffcad440dc70a5f81dfcfeaf4f4b39d2dd8b0b400e1af4c009b8537d400
MD5 3d79dd779a1d628febdb062a6e285178
BLAKE2b-256 617b4a5b0617e26d556f9e6d13bebd4835b77e2640e2fb21f1538009bfba0c37

Provenance

The following attestation bundles were made for polars_stringsim-0.1.0.tar.gz:

Publisher: release.yml on Pratham-26/rust_helpers

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file polars_stringsim-0.1.0-cp313-cp313-win_amd64.whl.

File hashes

Hashes for polars_stringsim-0.1.0-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 a56d8845fedd41e697349964a1d3478d32b0a9e1cebc389603be9b33afb0eb21
MD5 85dc0f8211599a96d01a38bcafc9b03f
BLAKE2b-256 bf78a942c71a8e8502153d5e923b64fc007af56d1c8bbe410b55d053b03acee4

Provenance

The following attestation bundles were made for polars_stringsim-0.1.0-cp313-cp313-win_amd64.whl:

Publisher: release.yml on Pratham-26/rust_helpers

File details

Details for the file polars_stringsim-0.1.0-cp312-cp312-manylinux_2_28_x86_64.whl.

File hashes

Hashes for polars_stringsim-0.1.0-cp312-cp312-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 3fd1a414e90ebd16343d7b6e81c7fa1e15ffdc1d6b2d2dcb70ed4b839f35991d
MD5 8eeeae2f965b5c39b41837a2ff0cc9dc
BLAKE2b-256 bb9f70c509a2036524e6bd45cfaedcdb0376dfd9100aa9c65bb74084d5dea486

Provenance

The following attestation bundles were made for polars_stringsim-0.1.0-cp312-cp312-manylinux_2_28_x86_64.whl:

Publisher: release.yml on Pratham-26/rust_helpers
