Composable fuzzy string matching for Polars

Project description

polars-stringsim

Composable fuzzy string matching for Polars, implemented as a native Polars plugin (Rust core + Python bindings via PyO3). All scoring runs in Rust — no Python loops over rows — and works in both eager and lazy frames.

Features

24 metric expressions — edit distances (Levenshtein, Damerau-Levenshtein, OSA, Hamming), the Jaro family, token/n-gram (Jaccard, Sørensen-Dice, q-grams), LCS, and four phonetic encoders (Soundex, Metaphone, DoubleMetaphone, NYSIIS). See ALGORITHMS.md for the math and normalization behind each one.
Hybrid scoring — combine multiple algorithms in a single Rust call (hybrid_score), plus 4 pre-built scorers (name_default, phonetic_edit, token_char, prefix_ngram).
DataFrame helpers — fuzzy_join, deduplicate, pairwise_compare with blocking indexes (first-chars, char-bag) to avoid O(n²) cross joins.
Explainability — return_breakdown=True returns per-metric scores alongside the combined score, so you can see why two strings matched.
Ensembles — weighted_avg, mean, max, min, median, vote.

All similarity functions return Float64 in [0, 1]. Null in either input → null output.
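To make the [0, 1] range and the null rule concrete, here is a minimal pure-Python sketch of a normalized edit similarity. It is not the plugin's Rust code: the normalization (1 minus distance over the longer length) and the helper names `levenshtein` and `levenshtein_sim` are assumptions for illustration, and ALGORITHMS.md remains the reference for what the library actually computes.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_sim(a: Optional[str], b: Optional[str]) -> Optional[float]:
    """Similarity in [0, 1]; None in either input gives None (the null rule)."""
    if a is None or b is None:
        return None
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein_sim("Robert Smith", "Robert Smyth"))
print(levenshtein_sim("Robert Smith", None))
```

One substitution across 12 characters gives a score of 1 - 1/12, and the second call returns None rather than raising.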

Install

Users (no Rust toolchain required)

Prebuilt wheels are published on PyPI for CPython 3.9–3.13 on Linux x86_64 and Windows x86_64. macOS and aarch64-linux wheels are planned (see Platform support).

pip install polars-stringsim

That's it — import polars_stringsim as pf works out of the box.

If a wheel for your platform is missing, pip will fall back to a source build, which does require Rust (see below).

From source / development

For hacking on the plugin itself, or for a platform without a prebuilt wheel:

git clone https://github.com/Pratham-26/rust_helpers.git
cd rust_helpers

# Iterative dev install (rebuild on every change):
pip install maturin
maturin develop --release

# Or: install directly from the repo's source (compiles Rust, needs a toolchain):
pip install git+https://github.com/Pratham-26/rust_helpers.git

Requires Rust stable. Pinned to polars = 0.54.4 / pyo3-polars = 0.27. See RELEASE.md for how wheels are built and published.

Usage

import polars as pl
import polars_stringsim as pf

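customers = pl.DataFrame({"name": ["Robert Smith", "Catherine Jones", "Jon Smyth"]})
db = pl.DataFrame({"name": ["Robert Smyth", "Katherine Jones", "William Brown"]})
# Illustrative two-column frame used by examples 2-4 (columns "a" and "b").
df = pl.DataFrame({"a": ["Robert Smith", "Jon Smyth"], "b": ["Robert Smyth", "John Smith"]})
# Illustrative frame of near-duplicate names used by example 6.
messy_names = pl.DataFrame({"name": ["Robert Smith", "Robert Smyth", "Rob Smith", "William Brown"]})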

# 1. Single metric
customers.join(db, how="cross").with_columns(
    s=pf.jaro_winkler("name", "name_right")
)

# 2. Hybrid: spelling + phonetic + token, fused in one Rust call
df.with_columns(
    hybrid=pf.hybrid_score("a", "b",
        algorithms=["jaro_winkler", "double_metaphone", "trigram_jaccard"],
        weights=[0.5, 0.3, 0.2])
)

# 3. Pre-built scorer
df.with_columns(s=pf.name_default("a", "b"))   # JW + Double Metaphone + trigram

# 4. Per-metric breakdown (explainability)
df.with_columns(
    bd=pf.combine(
        [pf.jaro_winkler("a","b"), pf.double_metaphone_sim("a","b")],
        weights=[0.6, 0.4], return_breakdown=True,
    )
)

# 5. fuzzy_join with blocking (avoids O(n*m) cross product)
pf.fuzzy_join(customers, db, left_on="name", right_on="name",
    algorithms=["jaro_winkler", "double_metaphone"], weights=[0.6, 0.4],
    threshold=0.75, top_k=1, block="first_chars", block_n=1)

# 6. deduplicate near-duplicate name variants
pf.deduplicate(messy_names, on="name",
    algorithms=["jaro_winkler"], weights=[1.0],
    composite_threshold=0.8, block="first_chars", block_n=1)

# 7. pairwise_compare for threshold tuning (returns combined score per pair)
pf.pairwise_compare(customers, db, left_on="name", right_on="name",
    algorithms=["jaro_winkler", "trigram_jaccard"], weights=[0.6, 0.4])

API reference

Per-metric expressions (`pf.<name>(left, right) → pl.Expr`)

jaro, jaro_winkler, levenshtein, levenshtein_norm, damerau_levenshtein, damerau_levenshtein_norm, osa, hamming, hamming_norm, token_jaccard, token_sorensen_dice, trigram_jaccard, trigram_sorensen_dice, qgram_jaccard(left, right, q=3), lcs_sim, soundex_sim, soundex_jw_sim, metaphone_sim, metaphone_jw_sim, double_metaphone_sim, double_metaphone_jw_sim, nysiis_sim, nysiis_jw_sim.
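As a rough picture of what an n-gram metric such as trigram_jaccard measures, the sketch below computes Jaccard overlap on character trigrams in plain Python. The helper names are hypothetical, and the choices made here (no padding, no case folding, short strings compared by equality) are assumptions that may differ from the plugin; see ALGORITHMS.md for the actual definitions.

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Set of overlapping character n-grams; empty if the text is shorter than n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_jaccard(a: str, b: str, n: int = 3) -> float:
    """|A ∩ B| / |A ∪ B| over the two n-gram sets."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        # Assumption: strings too short to form any n-gram fall back to equality.
        return 1.0 if a == b else 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# "katherine" and "catherine" share 6 of 8 distinct trigrams.
print(ngram_jaccard("katherine", "catherine"))
```

A single changed character in a short string can remove every shared trigram, which is one reason to pair an n-gram metric with an edit or phonetic metric.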

Combiners / hybrid

pf.combine(metrics, *, weights=None, method="weighted_avg", threshold=None, return_breakdown=False) — fuse pre-built metric expressions.
pf.hybrid_score(left, right, *, algorithms, weights=None, method="weighted_avg", threshold=None) — same, but builds metrics in Rust (no intermediate struct column).
Pre-built scorers: pf.phonetic_edit, pf.token_char, pf.prefix_ngram, pf.name_default.

DataFrame helpers

pf.fuzzy_join(left, right, *, left_on, right_on, algorithms, weights, method, threshold, top_k, block, block_n, how, add_breakdown)
pf.deduplicate(frame, *, on, algorithms, weights, method, composite_threshold, block, block_n)
pf.pairwise_compare(left, right, *, left_on, right_on, algorithms, weights, method, block, block_n)
Blocking: pf.block_first_chars(col, n=2), pf.block_char_bag(col)
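The interplay of blocking, threshold, and top_k can be sketched without the library. The code below is an illustration under stated assumptions: `first_chars_key` and `blocked_top_k` are hypothetical names, the blocking key is simply the first n characters, and the standard library's `difflib.SequenceMatcher` stands in for the plugin's Rust scorers.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def first_chars_key(value: str, n: int = 1) -> str:
    """Blocking key: the first n characters of the value."""
    return value[:n]


def blocked_top_k(left, right, threshold=0.75, top_k=1, n=1):
    """Score only pairs that share a blocking key, then keep the best top_k per left row."""
    index = defaultdict(list)
    for r in right:
        index[first_chars_key(r, n)].append(r)

    matches = []
    for l in left:
        candidates = index.get(first_chars_key(l, n), [])
        scored = [(SequenceMatcher(None, l, r).ratio(), r) for r in candidates]
        kept = sorted((s for s in scored if s[0] >= threshold), reverse=True)[:top_k]
        matches.extend((l, r, score) for score, r in kept)
    return matches


customers = ["Robert Smith", "Catherine Jones", "Jon Smyth"]
db = ["Robert Smyth", "Katherine Jones", "William Brown"]
for row in blocked_top_k(customers, db):
    print(row)
```

Only "Robert Smith" finds a candidate here. "Catherine Jones" and "Katherine Jones" start with different letters, so first-character blocking never compares them: blocking trades recall for fewer comparisons, and the key should be chosen with the expected spelling variation in mind.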

Combine methods

weighted_avg (default; weights normalized to sum to 1), mean, max, min, median, vote (count of metrics ≥ threshold, normalized by N).
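The combine methods listed above reduce one row of per-metric scores to a single number. A minimal pure-Python sketch follows; `combine_row` here is an illustrative function written for this example (not the Rust implementation), and treating weights=None as equal weights is an assumption.

```python
from statistics import mean, median


def combine_row(scores, method="weighted_avg", weights=None, threshold=None):
    """Reduce one row of per-metric scores in [0, 1] to a single score."""
    if method == "weighted_avg":
        # Assumption: missing weights mean equal weights.
        weights = weights if weights is not None else [1.0] * len(scores)
        total = sum(weights)  # normalizing so the weights sum to 1
        return sum(s * w / total for s, w in zip(scores, weights))
    if method == "mean":
        return mean(scores)
    if method == "max":
        return max(scores)
    if method == "min":
        return min(scores)
    if method == "median":
        return median(scores)
    if method == "vote":
        if threshold is None:
            raise ValueError("vote needs a threshold")
        # Fraction of metrics whose score reaches the threshold.
        return sum(s >= threshold for s in scores) / len(scores)
    raise ValueError(f"unknown method: {method}")


row = [0.9, 0.5, 0.7]
print(combine_row(row, "weighted_avg", weights=[3, 1, 1]))
print(combine_row(row, "vote", threshold=0.6))
```

Because the weights are normalized, [3, 1, 1] and [0.6, 0.2, 0.2] give the same result.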

Parallelism & thread control

hybrid_score parallelizes its row scan across a dedicated worker pool that is independent of the Polars engine pool (POLARS_MAX_THREADS), so you can tune them separately. Throughput rises with thread count, though below linearly: roughly 7–8× on 16 threads vs 1.

import polars_stringsim as pf

pf.get_num_threads()         # default = number of logical cores
pf.set_num_threads(8)        # use 8 threads for hybrid_score
pf.set_num_threads(0)        # restore default

Or set the default at process start: POLARS_STRINGSIM_THREADS=8 python ...

Algorithm name registry

hybrid_score, fuzzy_join, deduplicate, and pairwise_compare accept algorithm names:

jaro, jaro_winkler, levenshtein, levenshtein_norm, damerau_levenshtein, damerau_levenshtein_norm, osa, hamming, hamming_norm, token_jaccard, token_sorensen_dice, trigram_jaccard, trigram_sorensen_dice, lcs_sim, soundex/soundex_jw, metaphone/metaphone_jw, double_metaphone/double_metaphone_jw, nysiis/nysiis_jw.

For what each metric computes and how it's normalized, see ALGORITHMS.md.

Platform support

Prebuilt wheels on PyPI (no Rust toolchain needed):

Platform	Wheel	Status
Linux x86_64	`manylinux_2_28_x86_64`	✅ v0.1.0
Windows x86_64	`win_amd64`	✅ v0.1.0
macOS x86_64 / arm64	—	⏳ planned (sdist fallback works; needs Rust)
Linux aarch64	—	⏳ planned (sdist fallback works; needs Rust)

CPython 3.9–3.13 supported wherever a wheel exists. pip install polars-stringsim picks the right wheel automatically; on uncovered platforms it builds from the sdist (requires Rust stable).

Tests

cargo test --lib          # 30 Rust unit tests
pytest tests/python       # 38 Python end-to-end tests

Run the example end-to-end with uv:

maturin build --release
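# Resolve the glob to a concrete path; a quoted "$WHL" would otherwise pass the literal pattern.
WHL=$(ls target/wheels/polars_stringsim-*.whl | head -n 1)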
uv run --with "$WHL" --with polars --with pytest python -m pytest tests/python/
uv run --with "$WHL" --with polars python examples/record_linkage.py

Architecture

src/
├── algorithms/   # pure Rust: edit, jaro, token, lcs, phonetic
├── combiner.rs   # CombineMethod enum + combine_row (weighted_avg/mean/max/min/median/vote)
├── expr/mod.rs   # #[polars_expr] wrappers, one per metric
├── expr_combine.rs   # combine_expr + combine_breakdown_expr (Struct output)
├── expr_hybrid.rs    # hybrid_score_expr (multi-algo in one Rust call)
├── series_util.rs    # str-column readers, Float64 builder, null handling
└── lib.rs        # #[pymodule]

python/polars_stringsim/
├── _expression.py  # per-metric expr builders + combine()
├── _registry.py    # algorithm name → builder map
├── hybrid.py       # hybrid_score + pre-built scorers
├── frame.py        # fuzzy_join, deduplicate, pairwise_compare, blocking
└── __init__.py     # public API

Features & status

Everything in the original PRD is implemented and shipped:

✅ 24 metric expressions — edit distances, Jaro family, token/n-gram, LCS, and phonetic encoders. Full reference: ALGORITHMS.md.
✅ Composable ensembles — combine() with weighted_avg / mean / max / min / median / vote.
✅ Hybrid scoring — hybrid_score() (builds metrics in Rust, no intermediate struct column) + 4 pre-built scorers.
✅ DataFrame helpers — fuzzy_join (blocked/cross, threshold, top_k, how), deduplicate (union-find clustering), pairwise_compare, and blocking indexes.
✅ Explainability — return_breakdown=True returns a per-metric score breakdown.
✅ Native Polars plugin — all scoring in Rust via PyO3; works in eager and lazy frames.
✅ Prebuilt wheels on PyPI — pip install polars-stringsim, no Rust toolchain required (see Platform support).
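The union-find clustering mentioned for deduplicate can be pictured as follows: every pair that scores above the threshold is merged, and each row ends up labeled by the representative of its group. This is a generic sketch of the technique, not the library's code; `cluster_pairs` and the convention of using the smallest row index as the cluster label are assumptions.

```python
def cluster_pairs(n_items, matched_pairs):
    """Group row indices connected by matched pairs; return one cluster label per row."""
    parent = list(range(n_items))

    def find(x):
        # Walk to the root, shortening the path along the way (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # The smaller index becomes the root, so labels are stable.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return [find(i) for i in range(n_items)]


# Rows 0-1 and 1-2 matched, so 0, 1, 2 form one cluster even though 0-2 was never compared.
print(cluster_pairs(5, [(0, 1), (1, 2), (3, 4)]))  # [0, 0, 0, 3, 3]
```

Matches are transitive under this scheme, so a chain of borderline pairs can join strings that are not similar to each other directly; a stricter threshold limits that effect.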

Not yet implemented

⏳ Platform wheels: macOS (x86_64 + arm64) and aarch64-linux — currently sdist-only on those platforms.
🔲 Custom combiner registration (user-supplied Rust closures).
🔲 GPU acceleration.
🔲 More phonetic encoders (Caverphone, Beider-Morse).

Project details

Maintainers

PrathamK2602

Release history

0.2.0 (this version), Jul 4, 2026

0.1.0, Jul 4, 2026

Download files

Source Distribution

polars_stringsim-0.2.0.tar.gz (66.0 kB)

Uploaded Jul 4, 2026 Source

Built Distributions

polars_stringsim-0.2.0-cp313-cp313-win_amd64.whl (5.2 MB)

Uploaded Jul 4, 2026 CPython 3.13, Windows x86-64

polars_stringsim-0.2.0-cp312-cp312-manylinux_2_28_x86_64.whl (5.5 MB)

Uploaded Jul 4, 2026 CPython 3.12, manylinux: glibc 2.28+ x86-64

File details

Details for the file polars_stringsim-0.2.0.tar.gz.

File metadata

Download URL: polars_stringsim-0.2.0.tar.gz
Upload date: Jul 4, 2026
Size: 66.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for polars_stringsim-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`26cbda80212d65523a7b0d3e07f6c3a9f7a3539889fb5dd773ae9f60c603e055`
MD5	`5fc4143eab5bc4a2a0f3031071689a99`
BLAKE2b-256	`3c6dc09fb5ef54c54420773ff5d33582be79962bf02ece62a3a62ba07a54b64e`
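
A downloaded file can be checked against the SHA256 digest listed above with the standard library alone. The sketch below is one way to do it; the helper names `sha256_of` and `matches_expected` are illustrative, and the local file path in the commented usage is an assumption about where the download was saved.

```python
import hashlib
import hmac


def sha256_of(path, chunk_size=1 << 20):
    """Hex SHA256 digest of a file, read in chunks so large wheels are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """True if the file's SHA256 equals the published digest (case-insensitive)."""
    return hmac.compare_digest(sha256_of(path), expected_hex.lower())


# Usage, assuming the sdist was saved to the current directory:
# EXPECTED = "26cbda80212d65523a7b0d3e07f6c3a9f7a3539889fb5dd773ae9f60c603e055"
# print(matches_expected("polars_stringsim-0.2.0.tar.gz", EXPECTED))
```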

Provenance

The following attestation bundles were made for polars_stringsim-0.2.0.tar.gz:

Publisher: release.yml on Pratham-26/rust_helpers

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: polars_stringsim-0.2.0.tar.gz
- Subject digest: 26cbda80212d65523a7b0d3e07f6c3a9f7a3539889fb5dd773ae9f60c603e055
- Sigstore transparency entry: 2070171739
- Sigstore integration time: Jul 4, 2026
Source repository:
- Permalink: Pratham-26/rust_helpers@416d42a8d6706c9bfe39c19ccf88fcd3624f4de0
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/Pratham-26
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@416d42a8d6706c9bfe39c19ccf88fcd3624f4de0
- Trigger Event: push

File details

Details for the file polars_stringsim-0.2.0-cp313-cp313-win_amd64.whl.

File metadata

Download URL: polars_stringsim-0.2.0-cp313-cp313-win_amd64.whl
Upload date: Jul 4, 2026
Size: 5.2 MB
Tags: CPython 3.13, Windows x86-64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for polars_stringsim-0.2.0-cp313-cp313-win_amd64.whl
Algorithm	Hash digest
SHA256	`7614f5b609201c7f4d4380d9281b4b8429d3320a11a9e81e0a6f2626fea459d1`
MD5	`711d2c4941284910821e33a063af8866`
BLAKE2b-256	`0028fbdb8a9a5d16047f668793f54dc6ed5fbd03d31c4fec534d471f7b3204f4`

Provenance

The following attestation bundles were made for polars_stringsim-0.2.0-cp313-cp313-win_amd64.whl:

Publisher: release.yml on Pratham-26/rust_helpers

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: polars_stringsim-0.2.0-cp313-cp313-win_amd64.whl
- Subject digest: 7614f5b609201c7f4d4380d9281b4b8429d3320a11a9e81e0a6f2626fea459d1
- Sigstore transparency entry: 2070171861
- Sigstore integration time: Jul 4, 2026
Source repository:
- Permalink: Pratham-26/rust_helpers@416d42a8d6706c9bfe39c19ccf88fcd3624f4de0
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/Pratham-26
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@416d42a8d6706c9bfe39c19ccf88fcd3624f4de0
- Trigger Event: push

File details

Details for the file polars_stringsim-0.2.0-cp312-cp312-manylinux_2_28_x86_64.whl.

File metadata

Download URL: polars_stringsim-0.2.0-cp312-cp312-manylinux_2_28_x86_64.whl
Upload date: Jul 4, 2026
Size: 5.5 MB
Tags: CPython 3.12, manylinux: glibc 2.28+ x86-64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for polars_stringsim-0.2.0-cp312-cp312-manylinux_2_28_x86_64.whl
Algorithm	Hash digest
SHA256	`e8f2acaf93019be3b0138af5653c652989d59dbcc26890a70fe824ffd475d8d4`
MD5	`16a702ff36117ee0bb807b3cd5e82e14`
BLAKE2b-256	`442db388220f4fc59fbe8c16c1423b817f4d68b5a2bc8bf7570aac951a0621a8`

Provenance

The following attestation bundles were made for polars_stringsim-0.2.0-cp312-cp312-manylinux_2_28_x86_64.whl:

Publisher: release.yml on Pratham-26/rust_helpers

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: polars_stringsim-0.2.0-cp312-cp312-manylinux_2_28_x86_64.whl
- Subject digest: e8f2acaf93019be3b0138af5653c652989d59dbcc26890a70fe824ffd475d8d4
- Sigstore transparency entry: 2070172353
- Sigstore integration time: Jul 4, 2026
Source repository:
- Permalink: Pratham-26/rust_helpers@416d42a8d6706c9bfe39c19ccf88fcd3624f4de0
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/Pratham-26
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@416d42a8d6706c9bfe39c19ccf88fcd3624f4de0
- Trigger Event: push
