rapid fuzzy string matching


[!WARNING] 🚧 Under Heavy Construction

This library is under active development, and APIs may change between releases. We're shipping fast, so expect frequent updates, new features, and occasional breaking changes. Pin your version if stability matters to you: pip install rustfuzz==0.1.12


🤖 This project was built entirely by AI.

The idea was simple: could an AI agent beat RapidFuzz, one of the fastest fuzzy matching libraries in the world, by writing a Rust-backed Python library from scratch, guided only by benchmarks?

The development loop was: Research → Build → Benchmark → Repeat.


rustfuzz is a blazing-fast fuzzy string matching library for Python, implemented entirely in Rust. 🚀

Zero Python overhead. Memory safe. Pre-compiled wheels for every major platform.

The Challenge: Beat RapidFuzz

flowchart LR
    R["🔍 Research<br>Profiler output<br>& algorithm gaps"]
    B["🦀 Build<br>Rust implementation<br>via PyO3"]
    T["✅ Test<br>All tests must pass<br>before proceeding"]
    BM["📊 Benchmark<br>vs RapidFuzz<br>Numbers don't lie"]
    RP["🔁 Repeat<br>Find the next<br>bottleneck"]

    R --> B --> T --> BM --> RP --> R

    style R fill:#6366f1,color:#fff,stroke:none
    style B fill:#a855f7,color:#fff,stroke:none
    style T fill:#ef4444,color:#fff,stroke:none
    style BM fill:#22c55e,color:#fff,stroke:none
    style RP fill:#f59e0b,color:#fff,stroke:none

The goal: match or exceed RapidFuzz's throughput on ratio, partial_ratio, token_sort_ratio, and process.extract, all from Python. Each iteration starts with profiling, identifies the hottest path, and rewrites it deeper into Rust.

The Results: RustFuzz is Faster 🏆

We benchmarked process.extract on a 1,000,000-row corpus. With Rayon parallelization, lock-free global threshold shrinking (AtomicU64), and native query token caching, rustfuzz finished ahead of RapidFuzz in both runs, by about 5% on ratio and about 10% on WRatio.

| Benchmark (1M rows) | RapidFuzz | RustFuzz (Parallel) |
| --- | --- | --- |
| Raw Characters (ratio) | 5506 ms | 5253 ms |
| Complex Tokens (WRatio) | 3032 ms | 2716 ms |

But that's not all. With the built-in BM25 Hybrid Pipeline, rustfuzz can complete the same extraction task in 97 ms (a ~30x speedup over the WRatio timings above).
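
The threshold-shrinking idea mentioned above can be illustrated without any Rust. The sketch below is a single-threaded, pure-Python stand-in, not rustfuzz's implementation: it assumes the standard library's difflib.SequenceMatcher as the scorer and uses its real_quick_ratio() length-based upper bound to skip candidates that cannot beat the current k-th best score. The function name top_k_with_threshold and the sample data are hypothetical.

```python
import heapq
from difflib import SequenceMatcher


def top_k_with_threshold(query, choices, k):
    """Return the k best (choice, score, index) tuples plus the number of skipped candidates."""
    if k <= 0:
        return [], 0
    heap = []  # min-heap of (score, index); heap[0] holds the current k-th best
    skipped = 0
    for index, choice in enumerate(choices):
        matcher = SequenceMatcher(None, query, choice, autojunk=False)
        # Once k results exist, the smallest kept score is a threshold that only
        # ever rises. real_quick_ratio() is a cheap upper bound on ratio(), so a
        # candidate whose bound does not exceed the threshold is skipped unscored.
        if len(heap) == k and matcher.real_quick_ratio() <= heap[0][0]:
            skipped += 1
            continue
        score = matcher.ratio()
        if len(heap) < k:
            heapq.heappush(heap, (score, index))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, index))
    best = sorted(heap, key=lambda item: (-item[0], item[1]))
    return [(choices[i], round(s * 100, 1), i) for s, i in best], skipped


# Hypothetical sample data, loosely following the README's city example.
choices = ["New York", "new york", "new yorkk", "x", "a much longer string that cannot match"]
results, skipped = top_k_with_threshold("new york", choices, k=2)
print(results)
print("skipped without full scoring:", skipped)
```

In a parallel setting the same threshold would have to be shared between workers, which is presumably where an atomic value such as the AtomicU64 named above comes in; this sketch leaves that part out.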

Features

  • ⚡ Blazing Fast: Core algorithms written in Rust; no Python overhead, no GIL bottlenecks
  • 🧠 Smart Matching: Ratio, partial ratio, token sort/set, Levenshtein, Jaro-Winkler, and more
  • 🔒 Memory Safe: Guaranteed by Rust's borrow checker; no segfaults, no buffer overflows
  • 🐍 Pythonic API: Clean, typed Python interface. Import and go
  • 📦 Zero Build Step: Pre-compiled wheels on PyPI for Python 3.10–3.14 on all major platforms
  • 🏔️ Big Data Ready: Excels in 1 Billion Row Challenge benchmarks and other high-throughput tasks
  • 🔍 3-Way Hybrid Search: BM25 + Fuzzy + Dense embeddings via RRF; 25 ms at 1M docs, all in Rust
  • 🔎 Filter & Sort: Meilisearch-style filtering and sorting with Rust-level performance
  • 📄 Document Objects: First-class Document(content, metadata) + LangChain compatibility
  • 🧩 Ecosystem Integrations: BM25, Hybrid Search, and LangChain Retrievers for Vector DBs (Qdrant, LanceDB, FAISS, etc.)

Installation

pip install rustfuzz
# or, with uv (recommended, much faster):
uv pip install rustfuzz

Quick Start

import rustfuzz.fuzz as fuzz
from rustfuzz.distance import Levenshtein

# Fuzzy ratio
print(fuzz.ratio("hello world", "hello wrold"))          # ~96.0

# Partial ratio (substring match)
print(fuzz.partial_ratio("hello", "say hello world"))    # 100.0

# Token-order-insensitive match
print(fuzz.token_sort_ratio("fuzzy wuzzy", "wuzzy fuzzy")) # 100.0

# Levenshtein distance
print(Levenshtein.distance("kitten", "sitting"))         # 3

# Normalised similarity [0.0 – 1.0]
print(Levenshtein.normalized_similarity("kitten", "kitten")) # 1.0
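
For readers who want to see what the distance calls above compute, here is a minimal pure-Python reference. It assumes the textbook definitions: unit-cost Levenshtein distance, similarity normalised by the longer string's length, and token sorting by whitespace splitting. The function names are illustrative only; rustfuzz's Rust implementation and its exact normalisation may differ.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Unit-cost edit distance using two rolling rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """1.0 for identical strings, 0.0 when every position differs."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def token_sort_key(text: str) -> str:
    """Sort whitespace-separated tokens so that word order no longer matters."""
    return " ".join(sorted(text.split()))


print(levenshtein_distance("kitten", "sitting"))
print(normalized_similarity("kitten", "kitten"))
print(token_sort_key("wuzzy fuzzy") == token_sort_key("fuzzy wuzzy"))
```

Comparing the sorted keys instead of the raw strings is what makes a token-sort score insensitive to word order.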

Batch extraction

from rustfuzz import process

choices = ["New York", "New Orleans", "Newark", "Los Angeles"]
print(process.extractOne("new york", choices))
# ('New York', 100.0, 0)

print(process.extract("new", choices, limit=3))
# [('Newark', ...), ('New York', ...), ('New Orleans', ...)]

3-Way Hybrid Search (BM25 + Fuzzy + Dense)

from rustfuzz.search import Document, HybridSearch

# Create documents with metadata
docs = [
    Document("Apple iPhone 15 Pro Max 256GB", {"brand": "Apple", "price": 1199}),
    Document("Samsung Galaxy S24 Ultra", {"brand": "Samsung", "price": 1299}),
    Document("Google Pixel 8 Pro", {"brand": "Google", "price": 699}),
]

# Optional: add dense embeddings for semantic search
embeddings = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]

hs = HybridSearch(docs, embeddings=embeddings)

# Handles typos via fuzzy, keywords via BM25, meaning via dense; all in Rust
results = hs.search("appel iphon", query_embedding=[1.0, 0.0, 0.0], n=1)
text, score, meta = results[0]
print(f"{text} - ${meta['price']}")
# Apple iPhone 15 Pro Max 256GB - $1199

Also works with LangChain Document objects: no dependency required, auto-detected via duck typing!
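
The feature list names RRF (reciprocal rank fusion) as the way the three signals are combined. The sketch below shows the general technique in plain Python, assuming the conventional constant k=60 and 1-based ranks; the constant and any weighting rustfuzz uses internally are not stated in this README, and the function name and sample rankings are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list of (doc_id, score)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); a high rank in any list helps.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical per-signal rankings over three documents (ids 0, 1, 2).
bm25_ranking = [0, 2, 1]
fuzzy_ranking = [0, 1, 2]
dense_ranking = [1, 0, 2]

fused = reciprocal_rank_fusion([bm25_ranking, fuzzy_ranking, dense_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Because only ranks enter the formula, the raw BM25, fuzzy, and dense scores never need to be put on a common scale.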

With Real Embeddings (FastEmbed)

Use FastEmbed for lightweight, local, ONNX-based embeddings; no GPU needed:

from fastembed import TextEmbedding
from rustfuzz.search import Document, HybridSearch

model = TextEmbedding("BAAI/bge-small-en-v1.5")  # ~33 MB, CPU-only

docs = [
    Document("Apple iPhone 15 Pro Max 256GB", {"brand": "Apple"}),
    Document("Samsung Galaxy S24 Ultra",      {"brand": "Samsung"}),
    Document("Sony WH-1000XM5 Headphones",    {"brand": "Sony"}),
]

embeddings = [e.tolist() for e in model.embed([d.content for d in docs])]
hs = HybridSearch(docs, embeddings=embeddings)

query = "wireless noise cancelling headset"
query_emb = list(model.embed([query]))[0].tolist()

results = hs.search(query, query_embedding=query_emb, n=1)
text, score, meta = results[0]
print(f"{text} - {meta['brand']}")
# Sony WH-1000XM5 Headphones - Sony
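
The dense part of the ranking depends on a vector similarity between the query embedding and each document embedding. The sketch below assumes cosine similarity, a common choice for these models; the metric rustfuzz actually applies is not stated here. It reuses the toy three-dimensional vectors from the first hybrid example, and the function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_cosine(query_vec, doc_vecs):
    """Return (doc_index, similarity) pairs, most similar first."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors from the first hybrid search example.
embeddings = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
ranking = rank_by_cosine([1.0, 0.0, 0.0], embeddings)
print([doc_index for doc_index, _ in ranking])
```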

With Rust-Native Embeddings (EmbedAnything)

Use EmbedAnything for Rust-native embeddings via Candle; no PyTorch, no ONNX:

import embed_anything
from embed_anything import EmbeddingModel
from rustfuzz.search import Document, HybridSearch

model = EmbeddingModel.from_pretrained_hf(
    model_id="sentence-transformers/all-MiniLM-L6-v2",
)

docs = [
    Document("Apple iPhone 15 Pro Max 256GB", {"brand": "Apple"}),
    Document("Samsung Galaxy S24 Ultra",      {"brand": "Samsung"}),
    Document("Sony WH-1000XM5 Headphones",    {"brand": "Sony"}),
]

# Embed corpus with EmbedAnything
embed_data = embed_anything.embed_query([d.content for d in docs], embedder=model)
embeddings = [item.embedding for item in embed_data]

hs = HybridSearch(docs, embeddings=embeddings)

query = "wireless noise cancelling headset"
query_emb = embed_anything.embed_query([query], embedder=model)[0].embedding

text, score, meta = hs.search(query, query_embedding=query_emb, n=1)[0]
print(f"{text} - {meta['brand']}")
# Sony WH-1000XM5 Headphones - Sony

Or use the callback pattern for fully automatic query embedding:

def embed_fn(texts: list[str]) -> list[list[float]]:
    return [r.embedding for r in embed_anything.embed_query(texts, embedder=model)]

hs = HybridSearch(docs, embeddings=embed_fn)
results = hs.search("wireless headset", n=1)  # query auto-embedded!

Filtering & Sorting (Meilisearch-style)

from rustfuzz import Document
from rustfuzz.search import BM25

docs = [
    Document("Apple iPhone 15 Pro Max",  {"brand": "Apple",   "category": "phone",  "price": 1199, "in_stock": True}),
    Document("Samsung Galaxy S24 Ultra", {"brand": "Samsung", "category": "phone",  "price": 1299, "in_stock": True}),
    Document("Google Pixel 8 Pro",       {"brand": "Google",  "category": "phone",  "price": 699,  "in_stock": False}),
    Document("Apple MacBook Pro M3",     {"brand": "Apple",   "category": "laptop", "price": 2499, "in_stock": True}),
]

bm25 = BM25(docs)

# Fluent builder: filter → sort → match (executes immediately)
results = (
    bm25
    .filter('brand = "Apple" AND price > 500')
    .sort("price:asc")
    .match("pro", n=10)
)

for text, score, meta in results:
    print(f"  {text} - ${meta['price']}")

# Supports: =, !=, >, <, >=, <=, TO (range), IN, EXISTS, IS NULL, AND, OR, NOT
# Works with BM25, BM25L, BM25Plus, BM25T, and HybridSearch
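
To make the filter and sort semantics concrete, the following plain-Python sketch evaluates the same condition by hand. It covers only AND-combined comparisons and a single sort key, skips the text parsing step entirely, and does not model how rustfuzz combines sorting with relevance; OPS, matches_all, and filter_and_sort are hypothetical names.

```python
import operator

# Comparison operators from the filter syntax, mapped to Python callables.
OPS = {
    "=": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


def matches_all(meta, conditions):
    """True when every (field, op, value) condition holds (AND semantics only)."""
    return all(
        field in meta and OPS[op](meta[field], value)
        for field, op, value in conditions
    )


def filter_and_sort(records, conditions, sort_field, descending=False):
    """Keep the (text, meta) records that match, then order them by one metadata field."""
    kept = [(text, meta) for text, meta in records if matches_all(meta, conditions)]
    return sorted(kept, key=lambda record: record[1][sort_field], reverse=descending)


records = [
    ("Apple iPhone 15 Pro Max", {"brand": "Apple", "price": 1199}),
    ("Samsung Galaxy S24 Ultra", {"brand": "Samsung", "price": 1299}),
    ("Google Pixel 8 Pro", {"brand": "Google", "price": 699}),
    ("Apple MacBook Pro M3", {"brand": "Apple", "price": 2499}),
]

# Hand-parsed form of: brand = "Apple" AND price > 500, sorted by price:asc
conditions = [("brand", "=", "Apple"), ("price", ">", 500)]
selected = filter_and_sort(records, conditions, sort_field="price")
for text, meta in selected:
    print(text, meta["price"])
```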

Filter and sort also work with HybridSearch (BM25 + Fuzzy + Dense):

from rustfuzz import Document
from rustfuzz.search import HybridSearch

docs = [
    Document("Apple iPhone 15 Pro Max", {"brand": "Apple", "price": 1199}),
    Document("Samsung Galaxy S24 Ultra", {"brand": "Samsung", "price": 1299}),
    Document("Google Pixel 8 Pro",       {"brand": "Google", "price": 699}),
]

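# `embeddings` (one vector per document) and `query_emb` are assumed to be
# computed beforehand, as in the embedding examples above
hs = HybridSearch(docs, embeddings=embeddings)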

# Filter + sort + semantic search
results = (
    hs
    .filter('brand = "Apple"')
    .sort("price:asc")
    .match("iphone pro", n=5, query_embedding=query_emb)
)

Supported Algorithms

| Module | Algorithms |
| --- | --- |
| rustfuzz.fuzz | ratio, partial_ratio, token_sort_ratio, token_set_ratio, token_ratio, WRatio, QRatio, partial_token_* |
| rustfuzz.distance | Levenshtein, Hamming, Indel, Jaro, JaroWinkler, LCSseq, OSA, DamerauLevenshtein, Prefix, Postfix |
| rustfuzz.process | extract, extractOne, extract_iter, cdist |
| rustfuzz.search | BM25, BM25L, BM25Plus, BM25T, HybridSearch, Document |
| rustfuzz.filter | Meilisearch-style filter parser & evaluator |
| rustfuzz.sort | Multi-key sort with dot notation |
| rustfuzz.query | Fluent SearchQuery builder (.filter().sort().search().collect()) |
| rustfuzz.utils | default_process |

The BM25 Search Engines

rustfuzz.search implements several variants of the BM25 ranking function. The core differences:

  • BM25 (Okapi): The industry standard. Applies term-frequency saturation (each additional occurrence of a term adds less, with the curve controlled by $k_1$) and document-length normalization.
  • BM25L: Corrects BM25's over-penalization of long documents. Adds a constant shift delta to the length-normalized term frequency, so matching terms keep a minimum baseline score even in very long documents where normalisation would otherwise suppress them.
  • BM25Plus: Also sets a lower bound for every matching term, but applies the shift after term saturation. Often a good default for corpora with highly mixed document lengths.
  • BM25T: Uses information-gain adjustments to compute the saturation limit $k_1$ per term rather than globally. rustfuzz pre-computes these per-term limits inside the inverted index.
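
The difference between the first three variants is easiest to see in the term-frequency component of the score. The sketch below follows the commonly published formulas and leaves out the IDF weight, which multiplies each component. The parameter values (k1=1.5, b=0.75, delta=0.5 for BM25L, delta=1.0 for BM25Plus) are conventional choices assumed for illustration, not rustfuzz's documented defaults, and the function names are hypothetical. BM25T is omitted because its per-term $k_1$ estimation needs corpus statistics.

```python
def length_norm(doc_len, avg_len, b=0.75):
    """Document-length normalization factor; equals 1.0 for an average-length document."""
    return 1.0 - b + b * doc_len / avg_len


def bm25_tf(tf, doc_len, avg_len, k1=1.5, b=0.75):
    """Okapi BM25: saturating term frequency, bounded above by k1 + 1."""
    return tf * (k1 + 1.0) / (tf + k1 * length_norm(doc_len, avg_len, b))


def bm25l_tf(tf, doc_len, avg_len, k1=1.5, b=0.75, delta=0.5):
    """BM25L: shift the length-normalized frequency by delta before saturation."""
    c = tf / length_norm(doc_len, avg_len, b)
    return (k1 + 1.0) * (c + delta) / (k1 + c + delta)


def bm25plus_tf(tf, doc_len, avg_len, k1=1.5, b=0.75, delta=1.0):
    """BM25Plus: add delta after saturation, giving every match a fixed floor."""
    return bm25_tf(tf, doc_len, avg_len, k1, b) + delta


# One occurrence of a term in a document 100 times longer than average.
for name, fn in [("BM25", bm25_tf), ("BM25L", bm25l_tf), ("BM25Plus", bm25plus_tf)]:
    print(name, round(fn(1, doc_len=10_000, avg_len=100), 4))
```

For the very long document in the example, plain BM25 gives the match almost no weight, while BM25L and BM25Plus keep it above their respective floors.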

You can see an end-to-end benchmark comparison of these algorithms resolving the BEIR SciFact dataset in examples/bench_retrieval.py.

Documentation

Full cookbook with interactive examples and benchmark results: 👉 bmsuisse.github.io/rustfuzz

License

MIT © BM Suisse
