
rapid fuzzy string matching



[!WARNING] 🚧 Under Heavy Construction

This library is under active development and APIs may change between releases. We're shipping fast, so expect frequent updates, new features, and occasional breaking changes. Pin your version if stability matters to you: pip install rustfuzz==0.1.12


🤖 This project was built entirely by AI.

The idea was simple: could an AI agent beat RapidFuzz, one of the fastest fuzzy matching libraries in the world, by writing a Rust-backed Python library from scratch, guided only by benchmarks?

The development loop was: Research → Build → Benchmark → Repeat.


rustfuzz is a blazing-fast fuzzy string matching library for Python, implemented entirely in Rust. 🚀

Zero Python overhead. Memory safe. Pre-compiled wheels for every major platform.

The Challenge: Beat RapidFuzz

flowchart LR
    R["🔍 Research<br>Profiler output<br>& algorithm gaps"]
    B["🦀 Build<br>Rust implementation<br>via PyO3"]
    T["✅ Test<br>All tests must pass<br>before proceeding"]
    BM["📊 Benchmark<br>vs RapidFuzz<br>Numbers don't lie"]
    RP["🔁 Repeat<br>Find the next<br>bottleneck"]

    R --> B --> T --> BM --> RP --> R

    style R fill:#6366f1,color:#fff,stroke:none
    style B fill:#a855f7,color:#fff,stroke:none
    style T fill:#ef4444,color:#fff,stroke:none
    style BM fill:#22c55e,color:#fff,stroke:none
    style RP fill:#f59e0b,color:#fff,stroke:none

The goal: match or exceed RapidFuzz's throughput on ratio, partial_ratio, token_sort_ratio, and process.extract, all from Python. Each iteration starts with profiling, identifies the hottest path, and rewrites it deeper into Rust.
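A minimal timing harness in the spirit of this loop might look like the following. It is a sketch, not the project's benchmark suite: `best_of`, `bench_scorer`, and the `difflib`-based stand-in scorer are illustrative names chosen here. To compare real implementations, pass `rustfuzz.fuzz.ratio` or `rapidfuzz.fuzz.ratio` as the scorer instead.

```python
import difflib
import time
from typing import Callable, Sequence


def best_of(fn: Callable[[], object], repeats: int = 3) -> float:
    """Return the fastest wall-clock time (seconds) over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def bench_scorer(
    scorer: Callable[[str, str], float],
    query: str,
    corpus: Sequence[str],
    repeats: int = 3,
) -> float:
    """Time one full pass of scorer(query, choice) over the corpus."""
    return best_of(lambda: [scorer(query, c) for c in corpus], repeats)


def difflib_ratio(a: str, b: str) -> float:
    # Stand-in scorer from the standard library, scaled to 0-100.
    return 100.0 * difflib.SequenceMatcher(None, a, b).ratio()


corpus = [f"record {i:05d}" for i in range(2_000)]
elapsed = bench_scorer(difflib_ratio, "record 01234", corpus)
print(f"difflib stand-in: {elapsed * 1000:.1f} ms for {len(corpus)} rows")
```

Taking the minimum over several runs reduces noise from other processes, which matters when the gap between two libraries is only a few percent.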

The Results: RustFuzz is Faster 🏆

We benchmarked process.extract on a 1,000,000-row corpus. With Rayon parallelization, lock-free global threshold shrinking (AtomicU64), and native query token caching, rustfuzz outperforms RapidFuzz by roughly 5% on ratio and 10% on WRatio.

| Benchmark (1M rows) | RapidFuzz | RustFuzz (Parallel) |
| --- | --- | --- |
| Raw Characters (ratio) | 5506 ms | 5253 ms |
| Complex Tokens (WRatio) | 3032 ms | 2716 ms |
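The threshold-shrinking idea can be illustrated sequentially in plain Python. This is a sketch under stated assumptions, not rustfuzz's implementation: it uses a min-heap on one thread instead of an AtomicU64 shared across Rayon workers, and `difflib.SequenceMatcher` stands in for the Rust scorer. The function name `extract_top_k` is hypothetical.

```python
import difflib
import heapq


def extract_top_k(query: str, choices: list[str], k: int):
    """Return (top-k hits, number of candidates skipped by the threshold)."""
    if k <= 0:
        return [], 0
    heap: list[tuple[float, int, str]] = []  # min-heap; heap[0] is the weakest kept hit
    skipped = 0
    for idx, choice in enumerate(choices):
        matcher = difflib.SequenceMatcher(None, query, choice, autojunk=False)
        if len(heap) == k:
            threshold = heap[0][0]
            # Length-only upper bound: cheap, and never below the true ratio.
            if 100.0 * matcher.real_quick_ratio() <= threshold:
                skipped += 1
                continue
        score = 100.0 * matcher.ratio()
        entry = (score, -idx, choice)  # -idx keeps the earlier choice on ties
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif score > heap[0][0]:
            heapq.heapreplace(heap, entry)  # the threshold rises from here on
    hits = [(c, s, -neg_idx) for s, neg_idx, c in sorted(heap, reverse=True)]
    return hits, skipped


choices = ["New York", "New Orleans", "Newark", "Los Angeles", "NY"]
hits, skipped = extract_top_k("New York", choices, k=2)
for choice, score, index in hits:
    print(f"{choice!r} score={score:.1f} index={index}")
print(f"skipped without scoring: {skipped}")
```

Once the heap is full, its weakest score becomes the bar every later candidate must clear, so candidates whose cheap upper bound falls below it are never scored at all. In a parallel setting the same bar would be shared between workers, which is presumably where the atomic comes in.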

But that's not all. Using the built-in BM25 Hybrid Pipeline, rustfuzz completes the same extraction task in 97 ms, roughly 30x faster than the WRatio runs above.

Features

| Feature | Description |
| --- | --- |
| ⚡ Blazing Fast | Core algorithms written in Rust: no Python overhead, no GIL bottlenecks |
| 🧠 Smart Matching | Ratio, partial ratio, token sort/set, Levenshtein, Jaro-Winkler, and more |
| 🔒 Memory Safe | Rust's borrow checker guarantees: no segfaults, no buffer overflows |
| 🐍 Pythonic API | Clean, typed Python interface. Import and go |
| 📦 Zero Build Step | Pre-compiled wheels on PyPI for Python 3.10–3.14 on all major platforms |
| 🏔️ Big Data Ready | Excels in 1 Billion Row Challenge benchmarks, crushing high-throughput tasks |
| 🔍 3-Way Hybrid Search | BM25 + Fuzzy + Dense embeddings via RRF: 25 ms at 1M docs, all in Rust |
| 🔎 Filter & Sort | Meilisearch-style filtering and sorting with Rust-level performance |
| 📄 Document Objects | First-class Document(content, metadata) + LangChain compatibility |
| 🧩 Ecosystem Integrations | BM25, Hybrid Search, and LangChain Retrievers for Vector DBs (Qdrant, LanceDB, FAISS, etc.) |

Installation

pip install rustfuzz
# or, with uv (recommended, much faster):
uv pip install rustfuzz

Quick Start

import rustfuzz.fuzz as fuzz
from rustfuzz.distance import Levenshtein

# Fuzzy ratio
print(fuzz.ratio("hello world", "hello wrold"))          # ~90.9 (Indel-based ratio, as in RapidFuzz)

# Partial ratio (substring match)
print(fuzz.partial_ratio("hello", "say hello world"))    # 100.0

# Token-order-insensitive match
print(fuzz.token_sort_ratio("fuzzy wuzzy", "wuzzy fuzzy")) # 100.0

# Levenshtein distance
print(Levenshtein.distance("kitten", "sitting"))         # 3

# Normalised similarity [0.0 – 1.0]
print(Levenshtein.normalized_similarity("kitten", "kitten")) # 1.0
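To make the scores less of a black box, here is a pure-Python sketch of an Indel-based ratio and a token-sort variant. It assumes rustfuzz follows the RapidFuzz definitions (ratio as normalized Indel similarity, token sort as ratio over whitespace-split, sorted tokens); the helper names are illustrative and the Rust implementation is certainly more optimised.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, other in enumerate(b, start=1):
            if ch == other:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a: str, b: str) -> float:
    """Normalized Indel similarity on a 0-100 scale: 200 * LCS / (len(a) + len(b))."""
    total = len(a) + len(b)
    return 100.0 if total == 0 else 200.0 * lcs_length(a, b) / total


def token_sort_ratio(a: str, b: str) -> float:
    """Sort whitespace-separated tokens before comparing, so word order is ignored."""
    return indel_ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))


print(round(indel_ratio("kitten", "sitting"), 2))       # 61.54
print(token_sort_ratio("fuzzy wuzzy", "wuzzy fuzzy"))   # 100.0
```

The Indel distance only allows insertions and deletions, so a substitution costs 2 rather than 1. That is why this ratio differs from a Levenshtein-normalized score on the same pair.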

Batch extraction

from rustfuzz import process

choices = ["New York", "New Orleans", "Newark", "Los Angeles"]
print(process.extractOne("new york", choices))
# ('New York', 100.0, 0)

print(process.extract("new", choices, limit=3))
# [('Newark', ...), ('New York', ...), ('New Orleans', ...)]

3-Way Hybrid Search (BM25 + Fuzzy + Dense)

from rustfuzz.search import Document, HybridSearch

# Create documents with metadata
docs = [
    Document("Apple iPhone 15 Pro Max 256GB", {"brand": "Apple", "price": 1199}),
    Document("Samsung Galaxy S24 Ultra", {"brand": "Samsung", "price": 1299}),
    Document("Google Pixel 8 Pro", {"brand": "Google", "price": 699}),
]

# Optional: add dense embeddings for semantic search
embeddings = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]

hs = HybridSearch(docs, embeddings=embeddings)

# Handles typos via fuzzy, keywords via BM25, meaning via dense, all in Rust
results = hs.search("appel iphon", query_embedding=[1.0, 0.0, 0.0], n=1)
text, score, meta = results[0]
print(f"{text} - ${meta['price']}")
# Apple iPhone 15 Pro Max 256GB - $1199

Also works with LangChain Document objects: no dependency required, they are auto-detected via duck typing.
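The three rankings are combined with Reciprocal Rank Fusion (RRF). A minimal sketch of the standard formula follows; the constant `k=60` is the conventional default from the RRF literature and is an assumption here, as are the function name and the toy rankings. rustfuzz's internal constant and weighting may differ.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-retriever rankings for one query, best hit first.
bm25_rank = ["iphone", "pixel", "galaxy"]
fuzzy_rank = ["iphone", "galaxy", "pixel"]
dense_rank = ["galaxy", "iphone", "pixel"]

fused = reciprocal_rank_fusion([bm25_rank, fuzzy_rank, dense_rank])
print([doc_id for doc_id, _ in fused])  # ['iphone', 'galaxy', 'pixel']
```

RRF uses only rank positions, so BM25 scores, fuzzy ratios, and cosine similarities never need to be put on a common scale.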

With Real Embeddings (FastEmbed)

Use FastEmbed for lightweight, local, ONNX-based embeddings (no GPU needed):

from fastembed import TextEmbedding
from rustfuzz.search import Document, HybridSearch

model = TextEmbedding("BAAI/bge-small-en-v1.5")  # ~33 MB, CPU-only

docs = [
    Document("Apple iPhone 15 Pro Max 256GB", {"brand": "Apple"}),
    Document("Samsung Galaxy S24 Ultra",      {"brand": "Samsung"}),
    Document("Sony WH-1000XM5 Headphones",    {"brand": "Sony"}),
]

embeddings = [e.tolist() for e in model.embed([d.content for d in docs])]
hs = HybridSearch(docs, embeddings=embeddings)

query = "wireless noise cancelling headset"
query_emb = list(model.embed([query]))[0].tolist()

results = hs.search(query, query_embedding=query_emb, n=1)
text, score, meta = results[0]
print(f"{text} - {meta['brand']}")
# Sony WH-1000XM5 Headphones - Sony

With Rust-Native Embeddings (EmbedAnything)

Use EmbedAnything for Rust-native embeddings via Candle (no PyTorch, no ONNX):

import embed_anything
from embed_anything import EmbeddingModel
from rustfuzz.search import Document, HybridSearch

model = EmbeddingModel.from_pretrained_hf(
    model_id="sentence-transformers/all-MiniLM-L6-v2",
)

docs = [
    Document("Apple iPhone 15 Pro Max 256GB", {"brand": "Apple"}),
    Document("Samsung Galaxy S24 Ultra",      {"brand": "Samsung"}),
    Document("Sony WH-1000XM5 Headphones",    {"brand": "Sony"}),
]

# Embed corpus with EmbedAnything
embed_data = embed_anything.embed_query([d.content for d in docs], embedder=model)
embeddings = [item.embedding for item in embed_data]

hs = HybridSearch(docs, embeddings=embeddings)

query = "wireless noise cancelling headset"
query_emb = embed_anything.embed_query([query], embedder=model)[0].embedding

text, score, meta = hs.search(query, query_embedding=query_emb, n=1)[0]
print(f"{text} - {meta['brand']}")
# Sony WH-1000XM5 Headphones - Sony

Or use the callback pattern for fully automatic query embedding:

def embed_fn(texts: list[str]) -> list[list[float]]:
    return [r.embedding for r in embed_anything.embed_query(texts, embedder=model)]

hs = HybridSearch(docs, embeddings=embed_fn)
results = hs.search("wireless headset", n=1)  # query auto-embedded!
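When testing code that uses the callback pattern, a deterministic stand-in with the same `list[str] -> list[list[float]]` signature avoids downloading a model. The sketch below is hypothetical and carries no semantic meaning: `toy_embed_fn` hashes character trigrams into a small fixed-size vector and L2-normalises it. Whether HybridSearch places further requirements on the callback (dimension, dtype) is not stated in this README.

```python
import math
import zlib


def toy_embed_fn(texts: list[str], dim: int = 16) -> list[list[float]]:
    """Map each text to a unit-length vector of hashed character-trigram counts."""
    vectors = []
    for text in texts:
        vec = [0.0] * dim
        padded = f"  {text.lower()} "
        for i in range(len(padded) - 2):
            trigram = padded[i : i + 3]
            # crc32 is stable across runs, unlike the built-in hash() for strings.
            vec[zlib.crc32(trigram.encode("utf-8")) % dim] += 1.0
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


corpus_vectors = toy_embed_fn(["wireless headset", "noise cancelling headphones"])
print(len(corpus_vectors), len(corpus_vectors[0]))  # 2 16
```

Because the function is pure and stable across runs, the same query always maps to the same vector, which keeps test assertions reproducible.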

Filtering & Sorting (Meilisearch-style)

from rustfuzz import Document
from rustfuzz.search import BM25

docs = [
    Document("Apple iPhone 15 Pro Max",  {"brand": "Apple",   "category": "phone",  "price": 1199, "in_stock": True}),
    Document("Samsung Galaxy S24 Ultra", {"brand": "Samsung", "category": "phone",  "price": 1299, "in_stock": True}),
    Document("Google Pixel 8 Pro",       {"brand": "Google",  "category": "phone",  "price": 699,  "in_stock": False}),
    Document("Apple MacBook Pro M3",     {"brand": "Apple",   "category": "laptop", "price": 2499, "in_stock": True}),
]

bm25 = BM25(docs)

# Fluent builder: filter → sort → match (executes immediately)
results = (
    bm25
    .filter('brand = "Apple" AND price > 500')
    .sort("price:asc")
    .match("pro", n=10)
)

for text, score, meta in results:
    print(f"  {text} - ${meta['price']}")

# Supports: =, !=, >, <, >=, <=, TO (range), IN, EXISTS, IS NULL, AND, OR, NOT
# Works with BM25, BM25L, BM25Plus, BM25T, and HybridSearch
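For intuition, the filter and sort stages of the example above correspond roughly to the following plain Python over the metadata dictionaries. This is an illustrative equivalent, not how rustfuzz evaluates filters: the text-relevance step (`.match("pro")`) is left out, and `matches` is a hand-written predicate rather than a parsed expression.

```python
docs = [
    ("Apple iPhone 15 Pro Max",  {"brand": "Apple",   "category": "phone",  "price": 1199, "in_stock": True}),
    ("Samsung Galaxy S24 Ultra", {"brand": "Samsung", "category": "phone",  "price": 1299, "in_stock": True}),
    ("Google Pixel 8 Pro",       {"brand": "Google",  "category": "phone",  "price": 699,  "in_stock": False}),
    ("Apple MacBook Pro M3",     {"brand": "Apple",   "category": "laptop", "price": 2499, "in_stock": True}),
]


def matches(meta: dict) -> bool:
    """Hand-written equivalent of: brand = "Apple" AND price > 500"""
    return meta.get("brand") == "Apple" and meta.get("price", 0) > 500


# Equivalent of .filter(...) followed by .sort("price:asc")
filtered = [(text, meta) for text, meta in docs if matches(meta)]
ordered = sorted(filtered, key=lambda item: item[1]["price"])

for text, meta in ordered:
    print(f"{text} - ${meta['price']}")
# Apple iPhone 15 Pro Max - $1199
# Apple MacBook Pro M3 - $2499
```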

Filter and sort also work with HybridSearch (BM25 + Fuzzy + Dense):

from rustfuzz import Document
from rustfuzz.search import HybridSearch

docs = [
    Document("Apple iPhone 15 Pro Max", {"brand": "Apple", "price": 1199}),
    Document("Samsung Galaxy S24 Ultra", {"brand": "Samsung", "price": 1299}),
    Document("Google Pixel 8 Pro",       {"brand": "Google", "price": 699}),
]

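# Toy 3-dimensional vectors, one per document (same values as the first hybrid example)
embeddings = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
query_emb = [1.0, 0.0, 0.0]

hs = HybridSearch(docs, embeddings=embeddings)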

# Filter + sort + semantic search
results = (
    hs
    .filter('brand = "Apple"')
    .sort("price:asc")
    .match("iphone pro", n=5, query_embedding=query_emb)
)

Supported Algorithms

| Module | Algorithms |
| --- | --- |
| rustfuzz.fuzz | ratio, partial_ratio, token_sort_ratio, token_set_ratio, token_ratio, WRatio, QRatio, partial_token_* |
| rustfuzz.distance | Levenshtein, Hamming, Indel, Jaro, JaroWinkler, LCSseq, OSA, DamerauLevenshtein, Prefix, Postfix |
| rustfuzz.process | extract, extractOne, extract_iter, cdist |
| rustfuzz.search | BM25, BM25L, BM25Plus, BM25T, HybridSearch, Document |
| rustfuzz.filter | Meilisearch-style filter parser & evaluator |
| rustfuzz.sort | Multi-key sort with dot notation |
| rustfuzz.query | Fluent SearchQuery builder (.filter().sort().search().collect()) |
| rustfuzz.utils | default_process |

The BM25 Search Engines

rustfuzz.search implements several variants of the BM25 ranking function. The core differences:

  • BM25 (Okapi): The industry standard. Combines term-frequency saturation (each repeated occurrence of a term adds less to the score) with document-length normalisation.
  • BM25L: Corrects BM25's over-penalisation of long documents. Adds a constant shift delta to the length-normalised term frequency before saturation, so a matching term keeps a minimum baseline score even in very long documents where normalisation would otherwise suppress it.
  • BM25Plus: Also lower-bounds the score of any matching term, but adds the shift after term saturation. Widely considered a strong default for corpora with highly mixed document lengths.
  • BM25T: Uses information-gain adjustments to compute the saturation parameter $k_1$ per term rather than globally, which restrains the influence of dominant terms. rustfuzz optimises this by pre-computing the per-term values inside the inverted index.
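The difference between the first three variants shows up in the term-frequency component alone. The sketch below uses the textbook formulas with common default parameters (`k1=1.2`, `b=0.75`, `delta=0.5` for BM25L, `delta=1.0` for BM25Plus). These defaults and function names are assumptions for illustration, not values taken from rustfuzz. Each component is multiplied by the term's IDF to produce the final score. BM25T is omitted because its per-term $k_1$ requires corpus statistics.

```python
def length_norm(doc_len: float, avg_len: float, b: float = 0.75) -> float:
    """Document-length normalisation factor shared by all variants."""
    return 1.0 - b + b * doc_len / avg_len


def bm25_tf(tf: float, doc_len: float, avg_len: float, k1: float = 1.2, b: float = 0.75) -> float:
    """Okapi BM25: saturating term frequency with length normalisation."""
    return (k1 + 1.0) * tf / (tf + k1 * length_norm(doc_len, avg_len, b))


def bm25l_tf(tf: float, doc_len: float, avg_len: float,
             k1: float = 1.2, b: float = 0.75, delta: float = 0.5) -> float:
    """BM25L: shift the normalised term frequency by delta before saturation."""
    c = tf / length_norm(doc_len, avg_len, b)
    return (k1 + 1.0) * (c + delta) / (k1 + c + delta)


def bm25plus_tf(tf: float, doc_len: float, avg_len: float,
                k1: float = 1.2, b: float = 0.75, delta: float = 1.0) -> float:
    """BM25Plus: add delta after saturation, so any match scores at least delta."""
    return bm25_tf(tf, doc_len, avg_len, k1, b) + delta


# One occurrence of a term in a document 100x longer than average.
for name, fn in [("BM25", bm25_tf), ("BM25L", bm25l_tf), ("BM25Plus", bm25plus_tf)]:
    print(f"{name:9s} {fn(1, doc_len=10_000, avg_len=100):.4f}")
# BM25      0.0241
# BM25L     0.6591
# BM25Plus  1.0241
```

In plain BM25 the match in the very long document contributes almost nothing, while both corrected variants keep it above a fixed floor.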

An end-to-end benchmark comparing these algorithms on the BEIR SciFact dataset is available in examples/bench_retrieval.py.

Documentation

Full cookbook with interactive examples and benchmark results: 👉 bmsuisse.github.io/rustfuzz

License

MIT © BM Suisse
