polars-stringsim

Composable fuzzy string matching for Polars, implemented as a native Polars plugin (Rust core + Python bindings via PyO3). All scoring runs in Rust — no Python loops over rows — and works in both eager and lazy frames.

Features

  • 24 metric expressions: edit distances (Levenshtein, Damerau-Levenshtein, OSA, Hamming), the Jaro family, token/n-gram (Jaccard, Sørensen-Dice, q-grams), LCS, and four phonetic encoders (Soundex, Metaphone, DoubleMetaphone, NYSIIS). See ALGORITHMS.md for the math and normalization behind each one.
  • Hybrid scoring: combine multiple algorithms in a single Rust call (hybrid_score), plus 4 pre-built scorers (name_default, phonetic_edit, token_char, prefix_ngram).
  • DataFrame helpers: fuzzy_join, deduplicate, and pairwise_compare, with blocking indexes (first-chars, char-bag) to avoid O(n²) cross joins.
  • Explainability: return_breakdown=True returns per-metric scores alongside the combined score, so you can see why two strings matched.
  • Ensembles: weighted_avg, mean, max, min, median, vote.

All similarity functions return Float64 in [0, 1]. Null in either input → null output.
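To make the "[0, 1] with null propagation" contract concrete, here is a minimal pure-Python sketch of a normalized edit-distance similarity. It is not the plugin's Rust code; the helper names and the choice to divide by the longer string's length are assumptions for illustration (ALGORITHMS.md documents the normalization the library actually uses).

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_sim(a: Optional[str], b: Optional[str]) -> Optional[float]:
    """Similarity in [0, 1]; None in either input gives None."""
    if a is None or b is None:
        return None
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein(a, b) / longest
```

Identical strings score 1.0, and a missing value on either side yields None rather than 0.0, so missing data is never mistaken for a poor match.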

Install

Users (no Rust toolchain required)

Prebuilt wheels are published on PyPI for CPython 3.9–3.13 on Linux x86_64 and Windows x86_64. macOS and aarch64-linux wheels are planned (see Platform support).

pip install polars-stringsim

That's it — import polars_stringsim as pf works out of the box.

If a wheel for your platform is missing, pip will fall back to a source build, which does require Rust (see below).

From source / development

For hacking on the plugin itself, or for a platform without a prebuilt wheel:

git clone https://github.com/Pratham-26/rust_helpers.git
cd rust_helpers

# Iterative dev install (rebuild on every change):
pip install maturin
maturin develop --release

# Or: install directly from the repo's source (compiles Rust, needs a toolchain):
pip install git+https://github.com/Pratham-26/rust_helpers.git

Requires Rust stable. Pinned to polars = 0.54.4 / pyo3-polars = 0.27. See RELEASE.md for how wheels are built and published.

Usage

import polars as pl
import polars_stringsim as pf

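customers = pl.DataFrame({"name": ["Robert Smith", "Catherine Jones", "Jon Smyth"]})
db = pl.DataFrame({"name": ["Robert Smyth", "Katherine Jones", "William Brown"]})

# Illustrative frames used by the snippets below (sample data only):
# `df` holds string pairs in columns "a" and "b"; `messy_names` holds name variants.
df = pl.DataFrame({"a": ["Robert Smith", "Jon Smyth"], "b": ["Robert Smyth", "John Smith"]})
messy_names = pl.DataFrame({"name": ["Jon Smith", "John Smith", "Jon Smyth", "Mary Brown"]})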

# 1. Single metric
customers.join(db, how="cross").with_columns(
    s=pf.jaro_winkler("name", "name_right")
)

# 2. Hybrid: spelling + phonetic + token, fused in one Rust call
df.with_columns(
    hybrid=pf.hybrid_score("a", "b",
        algorithms=["jaro_winkler", "double_metaphone", "trigram_jaccard"],
        weights=[0.5, 0.3, 0.2])
)

# 3. Pre-built scorer
df.with_columns(s=pf.name_default("a", "b"))   # JW + Double Metaphone + trigram

# 4. Per-metric breakdown (explainability)
df.with_columns(
    bd=pf.combine(
        [pf.jaro_winkler("a","b"), pf.double_metaphone_sim("a","b")],
        weights=[0.6, 0.4], return_breakdown=True,
    )
)

# 5. fuzzy_join with blocking (avoids O(n*m) cross product)
pf.fuzzy_join(customers, db, left_on="name", right_on="name",
    algorithms=["jaro_winkler", "double_metaphone"], weights=[0.6, 0.4],
    threshold=0.75, top_k=1, block="first_chars", block_n=1)

# 6. deduplicate near-duplicate name variants
pf.deduplicate(messy_names, on="name",
    algorithms=["jaro_winkler"], weights=[1.0],
    composite_threshold=0.8, block="first_chars", block_n=1)

# 7. pairwise_compare for threshold tuning (returns combined score per pair)
pf.pairwise_compare(customers, db, left_on="name", right_on="name",
    algorithms=["jaro_winkler", "trigram_jaccard"], weights=[0.6, 0.4])

API reference

Per-metric expressions (pf.<name>(left, right) → pl.Expr)

jaro, jaro_winkler, levenshtein, levenshtein_norm, damerau_levenshtein, damerau_levenshtein_norm, osa, hamming, hamming_norm, token_jaccard, token_sorensen_dice, trigram_jaccard, trigram_sorensen_dice, qgram_jaccard(left, right, q=3), lcs_sim, soundex_sim, soundex_jw_sim, metaphone_sim, metaphone_jw_sim, double_metaphone_sim, double_metaphone_jw_sim, nysiis_sim, nysiis_jw_sim.
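As a rough illustration of what an n-gram metric with a q parameter computes, the sketch below implements a character q-gram Jaccard similarity in plain Python. It mirrors the qgram_jaccard(left, right, q=3) signature listed above, but the body is an assumption: it uses unpadded, case-sensitive grams, and the library's own tokenization may differ.

```python
from typing import Optional, Set


def qgrams(text: str, q: int = 3) -> Set[str]:
    """Set of overlapping character q-grams (no padding; an assumption)."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def qgram_jaccard(left: Optional[str], right: Optional[str], q: int = 3) -> Optional[float]:
    """Jaccard similarity |A ∩ B| / |A ∪ B| over q-gram sets, in [0, 1]."""
    if left is None or right is None:
        return None  # null in, null out
    a, b = qgrams(left, q), qgrams(right, q)
    if not a and not b:
        # Both strings are shorter than q: fall back to exact equality.
        return 1.0 if left == right else 0.0
    return len(a & b) / len(a | b)
```

With q=3, "abcd" and "abce" share one trigram ("abc") out of three distinct ones, so the score is 1/3.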

Combiners / hybrid

  • pf.combine(metrics, *, weights=None, method="weighted_avg", threshold=None, return_breakdown=False) — fuse pre-built metric expressions.
  • pf.hybrid_score(left, right, *, algorithms, weights=None, method="weighted_avg", threshold=None) — same, but builds metrics in Rust (no intermediate struct column).
  • Pre-built scorers: pf.phonetic_edit, pf.token_char, pf.prefix_ngram, pf.name_default.

DataFrame helpers

  • pf.fuzzy_join(left, right, *, left_on, right_on, algorithms, weights, method, threshold, top_k, block, block_n, how, add_breakdown)
  • pf.deduplicate(frame, *, on, algorithms, weights, method, composite_threshold, block, block_n)
  • pf.pairwise_compare(left, right, *, left_on, right_on, algorithms, weights, method, block, block_n)
  • Blocking: pf.block_first_chars(col, n=2), pf.block_char_bag(col)
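The interaction between blocking, threshold, and top_k can be sketched in plain Python. The code below is not the library's implementation: first_chars_key and blocked_top_k are hypothetical helpers, and difflib.SequenceMatcher stands in for the real metrics. It only compares strings that share a first-character block, drops pairs below the threshold, and keeps the best top_k per left row.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def first_chars_key(value: str, n: int = 1) -> str:
    """Blocking key: the first n characters, lowercased."""
    return value[:n].lower()


def blocked_top_k(left, right, *, n=1, threshold=0.75, top_k=1):
    """For each left string, return up to top_k right strings from the same
    block whose score is >= threshold, best first."""
    blocks = defaultdict(list)
    for r in right:
        blocks[first_chars_key(r, n)].append(r)

    result = {}
    for l in left:
        candidates = blocks.get(first_chars_key(l, n), [])
        scored = [(SequenceMatcher(None, l, r).ratio(), r) for r in candidates]
        kept = sorted((s for s in scored if s[0] >= threshold), reverse=True)
        result[l] = [r for _, r in kept[:top_k]]
    return result


customers = ["Robert Smith", "Catherine Jones", "Jon Smyth"]
db = ["Robert Smyth", "Katherine Jones", "William Brown"]
matches = blocked_top_k(customers, db)
```

Here "Robert Smith" finds "Robert Smyth", while "Catherine Jones" finds nothing because "Katherine Jones" lands in a different first-character block. That is the usual blocking trade-off: far fewer comparisons, at the cost of missing pairs whose blocking keys differ.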

Combine methods

weighted_avg (default; weights normalized to sum to 1), mean, max, min, median, vote (count of metrics ≥ threshold, normalized by N).
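The combine methods described above can be written out as a small pure-Python function. This is a sketch of the stated semantics for a single row of per-metric scores, not the Rust combine_row itself; the function name and error handling are assumptions.

```python
from statistics import median


def combine_scores(scores, method="weighted_avg", weights=None, threshold=None):
    """Fuse per-metric scores (each in [0, 1]) into one score in [0, 1]."""
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one score")
    if method == "weighted_avg":
        w = [1.0] * n if weights is None else list(weights)
        if len(w) != n:
            raise ValueError("weights must match scores in length")
        total = sum(w)  # dividing by the total normalizes weights to sum to 1
        return sum(s * wi for s, wi in zip(scores, w)) / total
    if method == "mean":
        return sum(scores) / n
    if method == "max":
        return max(scores)
    if method == "min":
        return min(scores)
    if method == "median":
        return median(scores)
    if method == "vote":
        if threshold is None:
            raise ValueError("vote requires a threshold")
        # fraction of metrics that clear the threshold
        return sum(1 for s in scores if s >= threshold) / n
    raise ValueError(f"unknown method: {method}")
```

For example, scores [0.9, 0.5, 0.8] with method="vote" and threshold=0.75 give 2/3, because two of the three metrics clear the threshold.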

Parallelism & thread control

hybrid_score parallelizes its row scan across a dedicated worker pool that is independent of the Polars engine pool (POLARS_MAX_THREADS), so you can tune the two separately. Throughput scales well with thread count (≈7–8× on 16 threads vs 1).

import polars_stringsim as pf

pf.get_num_threads()         # default = number of logical cores
pf.set_num_threads(8)        # use 8 threads for hybrid_score
pf.set_num_threads(0)        # restore default

Or set the default at process start: POLARS_STRINGSIM_THREADS=8 python ...
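One way to picture how the three settings could interact (explicit call, environment variable, logical core count) is the sketch below. The precedence order is an assumption, and resolve_num_threads is a hypothetical helper rather than part of the package; only the POLARS_STRINGSIM_THREADS variable name and the "0 restores the default" convention come from the text above.

```python
import os


def resolve_num_threads(requested: int = 0, env=None) -> int:
    """Pick a worker count: explicit request, then env var, then core count."""
    env = os.environ if env is None else env
    if requested > 0:
        return requested  # an explicit positive request wins
    raw = env.get("POLARS_STRINGSIM_THREADS", "")
    if raw.isdigit() and int(raw) > 0:
        return int(raw)  # process-level default from the environment
    return os.cpu_count() or 1  # fall back to logical cores


print(resolve_num_threads(8, {}))                                 # explicit request
print(resolve_num_threads(0, {"POLARS_STRINGSIM_THREADS": "4"}))  # env default
```

Passing 0 means "no explicit request", which matches the role set_num_threads(0) plays in the snippet above.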

Algorithm name registry

hybrid_score, fuzzy_join, deduplicate, and pairwise_compare accept algorithm names:

jaro, jaro_winkler, levenshtein, levenshtein_norm, damerau_levenshtein, damerau_levenshtein_norm, osa, hamming, hamming_norm, token_jaccard, token_sorensen_dice, trigram_jaccard, trigram_sorensen_dice, lcs_sim, soundex/soundex_jw, metaphone/metaphone_jw, double_metaphone/double_metaphone_jw, nysiis/nysiis_jw.

For what each metric computes and how it's normalized, see ALGORITHMS.md.
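A name registry of this kind is essentially a dictionary from algorithm name to scoring function. The sketch below shows the pattern with two stand-in metrics; it is illustrative only. "sequence_ratio" is a hypothetical name that does not exist in the library's registry, and the token_jaccard here is a simplified whitespace-token version.

```python
from difflib import SequenceMatcher


def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercased whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def sequence_ratio(a: str, b: str) -> float:
    """Stand-in character-level metric from the standard library."""
    return SequenceMatcher(None, a, b).ratio()


# name -> scoring function; unknown names are rejected up front
REGISTRY = {"token_jaccard": token_jaccard, "sequence_ratio": sequence_ratio}


def score_by_name(a, b, algorithms, weights):
    """Look up each algorithm by name and return the weighted average."""
    unknown = [name for name in algorithms if name not in REGISTRY]
    if unknown:
        raise KeyError(f"unknown algorithm(s): {unknown}")
    if len(weights) != len(algorithms):
        raise ValueError("weights must match algorithms in length")
    total = sum(weights)
    return sum(REGISTRY[name](a, b) * w for name, w in zip(algorithms, weights)) / total
```

Validating names before scoring means a typo in an algorithm list fails immediately instead of partway through a large join.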

Platform support

Prebuilt wheels on PyPI (no Rust toolchain needed):

Platform               Wheel                    Status
Linux x86_64           manylinux_2_28_x86_64    ✅ v0.1.0
Windows x86_64         win_amd64                ✅ v0.1.0
macOS x86_64 / arm64   -                        ⏳ planned (sdist fallback works; needs Rust)
Linux aarch64          -                        ⏳ planned (sdist fallback works; needs Rust)

CPython 3.9–3.13 supported wherever a wheel exists. pip install polars-stringsim picks the right wheel automatically; on uncovered platforms it builds from the sdist (requires Rust stable).

Tests

cargo test --lib          # 30 Rust unit tests
pytest tests/python       # 38 Python end-to-end tests

Run the example end-to-end with uv:

maturin build --release
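# Expand the glob at assignment time; a quoted "$WHL" would otherwise pass the literal pattern.
WHL=$(ls target/wheels/polars_stringsim-*.whl | head -n 1)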
uv run --with "$WHL" --with polars --with pytest python -m pytest tests/python/
uv run --with "$WHL" --with polars python examples/record_linkage.py

Architecture

src/
├── algorithms/   # pure Rust: edit, jaro, token, lcs, phonetic
├── combiner.rs   # CombineMethod enum + combine_row (weighted_avg/mean/max/min/median/vote)
├── expr/mod.rs   # #[polars_expr] wrappers, one per metric
├── expr_combine.rs   # combine_expr + combine_breakdown_expr (Struct output)
├── expr_hybrid.rs    # hybrid_score_expr (multi-algo in one Rust call)
├── series_util.rs    # str-column readers, Float64 builder, null handling
└── lib.rs        # #[pymodule]

python/polars_stringsim/
├── _expression.py  # per-metric expr builders + combine()
├── _registry.py    # algorithm name → builder map
├── hybrid.py       # hybrid_score + pre-built scorers
├── frame.py        # fuzzy_join, deduplicate, pairwise_compare, blocking
└── __init__.py     # public API

Features & status

Everything in the original PRD is implemented and shipped:

  • 24 metric expressions: edit distances, Jaro family, token/n-gram, LCS, and phonetic encoders. Full reference: ALGORITHMS.md.
  • Composable ensembles: combine() with weighted_avg / mean / max / min / median / vote.
  • Hybrid scoring: hybrid_score() (builds metrics in Rust, no intermediate struct column) + 4 pre-built scorers.
  • DataFrame helpers: fuzzy_join (blocked/cross, threshold, top_k, how), deduplicate (union-find clustering), pairwise_compare, and blocking indexes.
  • Explainability: return_breakdown=True returns a per-metric score breakdown.
  • Native Polars plugin: all scoring in Rust via PyO3; works in eager and lazy frames.
  • Prebuilt wheels on PyPI: pip install polars-stringsim, no Rust toolchain required (see Platform support).
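The union-find clustering mentioned for deduplicate can be sketched as follows. This is a pure-Python illustration under stated assumptions, not the package's code: cluster_duplicates is a hypothetical helper, difflib.SequenceMatcher stands in for the configured metrics, and every pair is compared (no blocking) to keep the example short.

```python
from difflib import SequenceMatcher
from itertools import combinations


def cluster_duplicates(names, threshold=0.8):
    """Return one cluster label per name; names linked by any chain of
    pairs scoring >= threshold share a label (the smallest member index)."""
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)  # smaller index becomes the root

    for i, j in combinations(range(len(names)), 2):
        if SequenceMatcher(None, names[i], names[j]).ratio() >= threshold:
            union(i, j)
    return [find(i) for i in range(len(names))]


labels = cluster_duplicates(["Jon Smith", "John Smith", "Jon Smyth", "Mary Brown"])
```

The three Smith variants end up with the same label and "Mary Brown" keeps its own. Because clusters are transitive, two names can be grouped even when their direct score is below the threshold, as long as a chain of close matches connects them.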

Not yet implemented

  • Platform wheels: macOS (x86_64 + arm64) and aarch64-linux — currently sdist-only on those platforms.
  • 🔲 Custom combiner registration (user-supplied Rust closures).
  • 🔲 GPU acceleration.
  • 🔲 More phonetic encoders (Caverphone, Beider-Morse).
