
rapid fuzzy string matching

Project description



[!WARNING] 🚧 Under Heavy Construction

This library is actively being developed and APIs may change between releases. We're shipping fast, so expect frequent updates, new features, and occasional breaking changes. Pin your version if stability matters to you: pip install rustfuzz==0.1.12


🤖 This project was built entirely by AI.

The idea was simple: could an AI agent beat RapidFuzz, one of the fastest fuzzy matching libraries in the world, by writing a Rust-backed Python library from scratch, guided only by benchmarks?

The development loop was: Research → Build → Benchmark → Repeat.


rustfuzz is a blazing-fast fuzzy string matching library for Python, implemented entirely in Rust. 🚀

Zero Python overhead. Memory safe. Pre-compiled wheels for every major platform.

The Challenge: Beat RapidFuzz

flowchart LR
    R["🔍 Research<br>Profiler output<br>& algorithm gaps"]
    B["🦀 Build<br>Rust implementation<br>via PyO3"]
    T["✅ Test<br>All tests must pass<br>before proceeding"]
    BM["📊 Benchmark<br>vs RapidFuzz<br>Numbers don't lie"]
    RP["🔁 Repeat<br>Find the next<br>bottleneck"]

    R --> B --> T --> BM --> RP --> R

    style R fill:#6366f1,color:#fff,stroke:none
    style B fill:#a855f7,color:#fff,stroke:none
    style T fill:#ef4444,color:#fff,stroke:none
    style BM fill:#22c55e,color:#fff,stroke:none
    style RP fill:#f59e0b,color:#fff,stroke:none

The goal: match or exceed RapidFuzz's throughput on ratio, partial_ratio, token_sort_ratio, and process.extract, all from Python. Each iteration starts with profiling, identifies the hottest path, and rewrites it deeper into Rust.
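The benchmark step of this loop could be approximated with a small timing harness like the sketch below. It is not the project's actual benchmark code: `best_of`, `compare`, and the `difflib`-based stand-in scorers are hypothetical names chosen for illustration. To compare the real libraries, pass `rustfuzz.fuzz.ratio` and RapidFuzz's `fuzz.ratio` as the scorers instead.

```python
import time
from difflib import SequenceMatcher
from typing import Callable

Scorer = Callable[[str, str], float]


def best_of(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the fastest wall-clock time in seconds over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare(scorers: dict[str, Scorer], query: str, corpus: list[str]) -> dict[str, float]:
    """Score the whole corpus with each scorer; return milliseconds per scorer."""
    return {
        name: 1000 * best_of(lambda s=scorer: [s(query, c) for c in corpus])
        for name, scorer in scorers.items()
    }


# Stand-in scorers (assumption: any callable taking two strings works here).
def full_ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio() * 100


def quick_ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).quick_ratio() * 100


corpus = [f"customer record {i}" for i in range(500)]
timings = compare({"full": full_ratio, "quick": quick_ratio}, "customer record 42", corpus)
for name, ms in timings.items():
    print(f"{name}: {ms:.1f} ms")
```

Taking the minimum over several runs reduces noise from other processes; absolute numbers will still vary from machine to machine.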

The Results: RustFuzz is Faster 🏆

We benchmarked process.extract on a 1,000,000-row corpus. With Rayon parallelization, lock-free global threshold shrinking (AtomicU64), and native query token caching, rustfuzz finished about 5% faster than RapidFuzz on ratio and about 10% faster on WRatio:

Benchmark (1M rows)        RapidFuzz    RustFuzz (Parallel)
Raw Characters (ratio)     5506 ms      5253 ms
Complex Tokens (WRatio)    3032 ms      2716 ms
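Threshold shrinking can be pictured with a single-threaded sketch: once the top-k list is full, its lowest score becomes a cutoff, and any candidate whose cheap upper bound cannot beat that cutoff is skipped without full scoring. The code below is an illustration under assumptions, not the Rust implementation: `top_k` is a hypothetical helper, `difflib.SequenceMatcher` stands in for the real scorer, and the parallel version described above would share the cutoff between workers instead of keeping it in a local variable.

```python
import heapq
from difflib import SequenceMatcher


def top_k(query: str, choices: list[str], k: int = 3) -> list[tuple[str, float, int]]:
    """Return the k best (choice, score, index) tuples, best first."""
    if k <= 0:
        return []
    heap: list[tuple[float, int, str]] = []  # min-heap keyed on (score, -index)
    cutoff = 0.0  # rises as better matches are found
    for idx, choice in enumerate(choices):
        matcher = SequenceMatcher(None, query, choice)
        # real_quick_ratio() is a cheap upper bound on ratio(); if even the
        # bound cannot beat the current cutoff, skip the expensive scoring.
        if len(heap) == k and matcher.real_quick_ratio() * 100 <= cutoff:
            continue
        score = matcher.ratio() * 100
        if len(heap) < k:
            heapq.heappush(heap, (score, -idx, choice))
        elif score > cutoff:
            heapq.heapreplace(heap, (score, -idx, choice))
        if len(heap) == k:
            cutoff = heap[0][0]  # shrink the window of acceptable scores
    return [(choice, score, -neg_idx) for score, neg_idx, choice in sorted(heap, reverse=True)]


choices = ["New York", "New Orleans", "Newark", "Los Angeles"]
for choice, score, index in top_k("new york", choices, k=2):
    print(f"{index}: {choice} ({score:.1f})")
```

The higher the cutoff climbs, the more candidates are rejected by the cheap bound alone, which is why this kind of pruning pays off most on large corpora.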

But that's not all. Using the built-in BM25 Hybrid Pipeline, rustfuzz completes the same extraction task in 97 ms (a ~30x speedup over state-of-the-art fuzzy matching).

Features

⚡ Blazing Fast: Core algorithms written in Rust; no Python overhead, no GIL bottlenecks
🧠 Smart Matching: Ratio, partial ratio, token sort/set, Levenshtein, Jaro-Winkler, and more
🔒 Memory Safe: Rust's borrow checker guarantees; no segfaults, no buffer overflows
🐍 Pythonic API: Clean, typed Python interface. Import and go
📦 Zero Build Step: Pre-compiled wheels on PyPI for Python 3.10–3.14 on all major platforms
🏔️ Big Data Ready: Excels in 1 Billion Row Challenge benchmarks, crushing high-throughput tasks
🔍 3-Way Hybrid Search: BM25 + Fuzzy + Dense embeddings via RRF; 25 ms at 1M docs, all in Rust
🔎 Filter & Sort: Meilisearch-style filtering and sorting with Rust-level performance
📄 Document Objects: First-class Document(content, metadata) + LangChain compatibility
🧩 Ecosystem Integrations: BM25, Hybrid Search, and LangChain Retrievers for Vector DBs (Qdrant, LanceDB, FAISS, etc.)

Installation

pip install rustfuzz
# or, with uv (recommended, much faster):
uv pip install rustfuzz

Quick Start

import rustfuzz.fuzz as fuzz
from rustfuzz.distance import Levenshtein

# Fuzzy ratio
print(fuzz.ratio("hello world", "hello wrold"))          # ~96.0

# Partial ratio (substring match)
print(fuzz.partial_ratio("hello", "say hello world"))    # 100.0

# Token-order-insensitive match
print(fuzz.token_sort_ratio("fuzzy wuzzy", "wuzzy fuzzy")) # 100.0

# Levenshtein distance
print(Levenshtein.distance("kitten", "sitting"))         # 3

# Normalised similarity in [0.0, 1.0]
print(Levenshtein.normalized_similarity("kitten", "kitten")) # 1.0
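To make the numbers above less opaque, here is a pure-Python sketch of two of the ideas involved: the classic dynamic-programming Levenshtein distance, and the token sorting that makes a comparison insensitive to word order. The function names are hypothetical, and dividing by the longer string's length is an assumed normalisation (it is the usual convention for unit-cost Levenshtein); rustfuzz's Rust code may differ in detail.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit costs, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """1.0 for identical strings, 0.0 when every position must change."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def sort_tokens(text: str) -> str:
    """Canonical form used by token-sort comparisons: words in sorted order."""
    return " ".join(sorted(text.split()))


print(levenshtein("kitten", "sitting"))                 # 3
print(normalized_similarity("kitten", "kitten"))        # 1.0
print(sort_tokens("wuzzy fuzzy") == sort_tokens("fuzzy wuzzy"))  # True
```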

Batch extraction

from rustfuzz import process

choices = ["New York", "New Orleans", "Newark", "Los Angeles"]
print(process.extractOne("new york", choices))
# ('New York', 100.0, 0)

print(process.extract("new", choices, limit=3))
# [('Newark', ...), ('New York', ...), ('New Orleans', ...)]

3-Way Hybrid Search (BM25 + Fuzzy + Dense)

from rustfuzz.search import Document, HybridSearch

# Create documents with metadata
docs = [
    Document("Apple iPhone 15 Pro Max 256GB", {"brand": "Apple", "price": 1199}),
    Document("Samsung Galaxy S24 Ultra", {"brand": "Samsung", "price": 1299}),
    Document("Google Pixel 8 Pro", {"brand": "Google", "price": 699}),
]

# Optional: add dense embeddings for semantic search
embeddings = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]

hs = HybridSearch(docs, embeddings=embeddings)

# Handles typos via fuzzy, keywords via BM25, meaning via dense; all in Rust
results = hs.search("appel iphon", query_embedding=[1.0, 0.0, 0.0], n=1)
text, score, meta = results[0]
print(f"{text} - ${meta['price']}")
# Apple iPhone 15 Pro Max 256GB - $1199

Also works with LangChain Document objects: no dependency required, they are auto-detected via duck typing.
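The three rankings are combined with RRF (Reciprocal Rank Fusion). The sketch below shows the general technique: each document earns 1 / (k + rank) from every ranking it appears in, and the sums decide the final order. The function name `rrf`, the document ids, and the constant k = 60 (a common choice in the literature) are assumptions for illustration; the constant rustfuzz uses internally is not stated here.

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several best-first rankings into one list of (doc_id, score)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_rank = ["iphone", "pixel", "galaxy"]   # keyword view
fuzzy_rank = ["iphone", "galaxy", "pixel"]  # typo-tolerant view
dense_rank = ["galaxy", "iphone", "pixel"]  # semantic view

fused = rrf([bm25_rank, fuzzy_rank, dense_rank])
print([doc_id for doc_id, _ in fused])
# ['iphone', 'galaxy', 'pixel']
```

Because RRF only looks at ranks, the three scorers do not need comparable score scales, which is what makes it convenient for mixing BM25, fuzzy, and embedding similarity.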

With Real Embeddings (FastEmbed)

Use FastEmbed for lightweight, local, ONNX-based embeddings (no GPU needed):

from fastembed import TextEmbedding
from rustfuzz.search import Document, HybridSearch

model = TextEmbedding("BAAI/bge-small-en-v1.5")  # ~33 MB, CPU-only

docs = [
    Document("Apple iPhone 15 Pro Max 256GB", {"brand": "Apple"}),
    Document("Samsung Galaxy S24 Ultra",      {"brand": "Samsung"}),
    Document("Sony WH-1000XM5 Headphones",    {"brand": "Sony"}),
]

embeddings = [e.tolist() for e in model.embed([d.content for d in docs])]
hs = HybridSearch(docs, embeddings=embeddings)

query = "wireless noise cancelling headset"
query_emb = list(model.embed([query]))[0].tolist()

results = hs.search(query, query_embedding=query_emb, n=1)
text, score, meta = results[0]
print(f"{text} - {meta['brand']}")
# Sony WH-1000XM5 Headphones - Sony

With Rust-Native Embeddings (EmbedAnything)

Use EmbedAnything for Rust-native embeddings via Candle (no PyTorch, no ONNX):

import embed_anything
from embed_anything import EmbeddingModel
from rustfuzz.search import Document, HybridSearch

model = EmbeddingModel.from_pretrained_hf(
    model_id="sentence-transformers/all-MiniLM-L6-v2",
)

docs = [
    Document("Apple iPhone 15 Pro Max 256GB", {"brand": "Apple"}),
    Document("Samsung Galaxy S24 Ultra",      {"brand": "Samsung"}),
    Document("Sony WH-1000XM5 Headphones",    {"brand": "Sony"}),
]

# Embed corpus with EmbedAnything
embed_data = embed_anything.embed_query([d.content for d in docs], embedder=model)
embeddings = [item.embedding for item in embed_data]

hs = HybridSearch(docs, embeddings=embeddings)

query = "wireless noise cancelling headset"
query_emb = embed_anything.embed_query([query], embedder=model)[0].embedding

text, score, meta = hs.search(query, query_embedding=query_emb, n=1)[0]
print(f"{text} - {meta['brand']}")
# Sony WH-1000XM5 Headphones - Sony

Or use the callback pattern for fully automatic query embedding:

def embed_fn(texts: list[str]) -> list[list[float]]:
    return [r.embedding for r in embed_anything.embed_query(texts, embedder=model)]

hs = HybridSearch(docs, embeddings=embed_fn)
results = hs.search("wireless headset", n=1)  # query auto-embedded!
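For tests or offline experiments it can help to have a callback with the same shape (a list of strings in, a list of float vectors out) that needs no model download. The sketch below is such a stand-in: `toy_embed_fn` is a hypothetical helper that hashes tokens into a small vector, so it captures word overlap only and no real semantics. Whether HybridSearch accepts it unchanged is an assumption based on the callback signature shown above.

```python
import hashlib
import math


def toy_embed_fn(texts: list[str], dim: int = 16) -> list[list[float]]:
    """Hash each lowercase token into one of `dim` buckets, then L2-normalise."""
    vectors = []
    for text in texts:
        vec = [0.0] * dim
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            vec[digest[0] % dim] += 1.0  # stable across runs, unlike hash()
        norm = math.sqrt(sum(x * x for x in vec))
        vectors.append([x / norm for x in vec] if norm else vec)
    return vectors


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; a plain dot product because inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


doc_vecs = toy_embed_fn(["Sony WH-1000XM5 Headphones", "Samsung Galaxy S24 Ultra"])
query_vec = toy_embed_fn(["sony headphones"])[0]
print([round(cosine(query_vec, v), 3) for v in doc_vecs])
```

Using `hashlib` instead of the built-in `hash()` keeps the vectors identical between interpreter runs, which matters if embeddings are cached or compared in tests.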

Filtering & Sorting (Meilisearch-style)

from rustfuzz import Document
from rustfuzz.search import BM25

docs = [
    Document("Apple iPhone 15 Pro Max",  {"brand": "Apple",   "category": "phone",  "price": 1199, "in_stock": True}),
    Document("Samsung Galaxy S24 Ultra", {"brand": "Samsung", "category": "phone",  "price": 1299, "in_stock": True}),
    Document("Google Pixel 8 Pro",       {"brand": "Google",  "category": "phone",  "price": 699,  "in_stock": False}),
    Document("Apple MacBook Pro M3",     {"brand": "Apple",   "category": "laptop", "price": 2499, "in_stock": True}),
]

bm25 = BM25(docs)

# Fluent builder: filter → sort → match (executes immediately)
results = (
    bm25
    .filter('brand = "Apple" AND price > 500')
    .sort("price:asc")
    .match("pro", n=10)
)

for text, score, meta in results:
    print(f"  {text} - ${meta['price']}")

# Supports: =, !=, >, <, >=, <=, TO (range), IN, EXISTS, IS NULL, AND, OR, NOT
# Works with BM25, BM25L, BM25Plus, BM25T, and HybridSearch
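Conceptually, a filter expression compiles to a predicate over each document's metadata, and a sort spec to a sort key. The sketch below illustrates that idea for a small subset of the syntax (comparisons joined by AND, plus `field:asc` / `field:desc`). It is not rustfuzz's parser: `compile_filter`, `sort_by`, and the naive split on " AND " are simplifications made for illustration.

```python
import operator
import re

OPS = {"=": operator.eq, "!=": operator.ne, ">": operator.gt,
       "<": operator.lt, ">=": operator.ge, "<=": operator.le}
CLAUSE = re.compile(r'^\s*(\w+)\s*(!=|>=|<=|=|>|<)\s*(.+?)\s*$')


def parse_value(raw: str):
    """Quoted text becomes a string, anything else is read as a number."""
    if raw.startswith('"') and raw.endswith('"'):
        return raw[1:-1]
    return float(raw)


def compile_filter(expr: str):
    """Turn 'a = "x" AND b > 1' into a predicate over a metadata dict."""
    clauses = []
    for part in expr.split(" AND "):  # naive: breaks on quoted ' AND '
        match = CLAUSE.match(part)
        if match is None:
            raise ValueError(f"cannot parse clause: {part!r}")
        field, op, raw = match.groups()
        clauses.append((field, OPS[op], parse_value(raw)))
    return lambda meta: all(f in meta and op(meta[f], v) for f, op, v in clauses)


def sort_by(items, spec: str):
    """Sort (text, metadata) pairs by a 'field:asc' or 'field:desc' spec."""
    field, _, direction = spec.partition(":")
    return sorted(items, key=lambda item: item[1][field], reverse=direction == "desc")


docs = [
    ("Apple iPhone 15 Pro Max", {"brand": "Apple", "price": 1199}),
    ("Samsung Galaxy S24 Ultra", {"brand": "Samsung", "price": 1299}),
    ("Google Pixel 8 Pro", {"brand": "Google", "price": 699}),
    ("Apple MacBook Pro M3", {"brand": "Apple", "price": 2499}),
]
keep = compile_filter('brand = "Apple" AND price > 500')
hits = sort_by([d for d in docs if keep(d[1])], "price:asc")
print([text for text, _ in hits])
# ['Apple iPhone 15 Pro Max', 'Apple MacBook Pro M3']
```

Filtering before scoring shrinks the candidate set, so the text match only has to rank documents that can appear in the result.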

Filter and sort also work with HybridSearch (BM25 + Fuzzy + Dense):

from rustfuzz import Document
from rustfuzz.search import HybridSearch

docs = [
    Document("Apple iPhone 15 Pro Max", {"brand": "Apple", "price": 1199}),
    Document("Samsung Galaxy S24 Ultra", {"brand": "Samsung", "price": 1299}),
    Document("Google Pixel 8 Pro",       {"brand": "Google", "price": 699}),
]

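# Toy 3-dimensional embeddings, one per document (use a real model in practice)
embeddings = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
query_emb = [1.0, 0.0, 0.0]

hs = HybridSearch(docs, embeddings=embeddings)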

# Filter + sort + semantic search
results = (
    hs
    .filter('brand = "Apple"')
    .sort("price:asc")
    .match("iphone pro", n=5, query_embedding=query_emb)
)

Supported Algorithms

Module Algorithms
rustfuzz.fuzz ratio, partial_ratio, token_sort_ratio, token_set_ratio, token_ratio, WRatio, QRatio, partial_token_*
rustfuzz.distance Levenshtein, Hamming, Indel, Jaro, JaroWinkler, LCSseq, OSA, DamerauLevenshtein, Prefix, Postfix
rustfuzz.process extract, extractOne, extract_iter, cdist
rustfuzz.search BM25, BM25L, BM25Plus, BM25T, HybridSearch, Document
rustfuzz.filter Meilisearch-style filter parser & evaluator
rustfuzz.sort Multi-key sort with dot notation
rustfuzz.query Fluent SearchQuery builder (.filter().sort().search().collect())
rustfuzz.utils default_process

The BM25 Search Engines

rustfuzz.search implements several variants of the BM25 text-retrieval ranking function. The core differences:

  • BM25 (Okapi): The industry standard. Applies term-frequency saturation (each additional occurrence of a term adds less to the score, controlled by $k_1$) and document-length normalization (controlled by $b$).
  • BM25L: Corrects BM25's over-penalization of long documents. Adds a constant shift delta to the length-normalized term frequency, so a matching term still yields a baseline score even in very long documents where normalization would otherwise suppress it.
  • BM25Plus: Also sets a lower bound for any matching term, but adds the shift after term saturation. Often a good default for corpora with highly mixed document lengths.
  • BM25T: Uses information-gain adjustments to compute the saturation parameter $k_1$ per term instead of using one global value. rustfuzz pre-computes these per-term values natively within the inverted index.
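The difference between the first three variants is easiest to see in the per-term weight, before it is multiplied by the term's IDF. The sketch below uses the textbook formulas with commonly cited defaults (k1 = 1.5, b = 0.75, delta = 0.5 for BM25L and 1.0 for BM25Plus); these values and the function names are assumptions for illustration, not rustfuzz's defaults or API. BM25T is left out because its per-term $k_1$ estimation does not fit in a few lines.

```python
def length_norm(doc_len: float, avg_len: float, b: float = 0.75) -> float:
    """1.0 for an average-length document, larger for longer ones."""
    return 1.0 - b + b * doc_len / avg_len


def bm25(tf: float, doc_len: float, avg_len: float, k1: float = 1.5, b: float = 0.75) -> float:
    """Okapi BM25 term weight: saturates towards k1 + 1 as tf grows."""
    return tf * (k1 + 1) / (tf + k1 * length_norm(doc_len, avg_len, b))


def bm25l(tf: float, doc_len: float, avg_len: float, k1: float = 1.5,
          b: float = 0.75, delta: float = 0.5) -> float:
    """BM25L: shift the length-normalised tf by delta *before* saturation."""
    if tf == 0:
        return 0.0
    c = tf / length_norm(doc_len, avg_len, b)
    return (k1 + 1) * (c + delta) / (k1 + c + delta)


def bm25plus(tf: float, doc_len: float, avg_len: float, k1: float = 1.5,
             b: float = 0.75, delta: float = 1.0) -> float:
    """BM25Plus: add delta *after* saturation, a floor for any matching term."""
    if tf == 0:
        return 0.0
    return bm25(tf, doc_len, avg_len, k1, b) + delta


# One occurrence of a term in a document 50x longer than average.
for name, fn in [("BM25", bm25), ("BM25L", bm25l), ("BM25Plus", bm25plus)]:
    print(f"{name:8s} {fn(1, 500, 10):.3f}")
```

In the long-document case plain BM25 drives the weight close to zero, while BM25L and BM25Plus keep it clearly above zero, which is the lower-bound behaviour the list above describes.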

You can see an end-to-end benchmark comparison of these algorithms on the BEIR SciFact dataset in examples/bench_retrieval.py.

Documentation

Full cookbook with interactive examples and benchmark results: 👉 bmsuisse.github.io/rustfuzz

License

MIT © BM Suisse

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rustfuzz-0.1.16.tar.gz (31.1 MB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

rustfuzz-0.1.16-pp311-pypy311_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (70.8 MB view details)

Uploaded PyPy manylinux: glibc 2.17+ x86-64

rustfuzz-0.1.16-pp311-pypy311_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (70.6 MB view details)

Uploaded PyPy manylinux: glibc 2.17+ ARM64

rustfuzz-0.1.16-cp310-abi3-win_amd64.whl (70.4 MB view details)

Uploaded CPython 3.10+ Windows x86-64

rustfuzz-0.1.16-cp310-abi3-musllinux_1_2_x86_64.whl (71.0 MB view details)

Uploaded CPython 3.10+ musllinux: musl 1.2+ x86-64

rustfuzz-0.1.16-cp310-abi3-musllinux_1_2_aarch64.whl (70.8 MB view details)

Uploaded CPython 3.10+ musllinux: musl 1.2+ ARM64

rustfuzz-0.1.16-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (70.8 MB view details)

Uploaded CPython 3.10+ manylinux: glibc 2.17+ x86-64

rustfuzz-0.1.16-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (70.6 MB view details)

Uploaded CPython 3.10+ manylinux: glibc 2.17+ ARM64

rustfuzz-0.1.16-cp310-abi3-macosx_11_0_arm64.whl (71.0 MB view details)

Uploaded CPython 3.10+ macOS 11.0+ ARM64

rustfuzz-0.1.16-cp310-abi3-macosx_10_12_x86_64.whl (70.6 MB view details)

Uploaded CPython 3.10+ macOS 10.12+ x86-64

File details

Details for the file rustfuzz-0.1.16.tar.gz.

File metadata

  • Download URL: rustfuzz-0.1.16.tar.gz
  • Upload date:
  • Size: 31.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for rustfuzz-0.1.16.tar.gz
Algorithm Hash digest
SHA256 90a5ab6af72872fcd02180f88aa587076699eb8fd622cfcee9cacbd66e1d9341
MD5 b52878dd5152c49dbb0321364d43a124
BLAKE2b-256 4622aed0bf3fdaf93e125026d7ba1c969d5f7bdb30b72c013c44fccb69e374ea

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustfuzz-0.1.16.tar.gz:

Publisher: ci.yml on bmsuisse/rustfuzz

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustfuzz-0.1.16-pp311-pypy311_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for rustfuzz-0.1.16-pp311-pypy311_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 3bc87b446e5105f679564cd20135f6ac1454b7d7659f25d0f76824709edbefb8
MD5 d3ad874ea88a3856846f08493852bb4e
BLAKE2b-256 7729a260cd7a4c1bb2054d65f4c2c9052f7983b5d94418a7abcf9002d107c2a1

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustfuzz-0.1.16-pp311-pypy311_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: ci.yml on bmsuisse/rustfuzz

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustfuzz-0.1.16-pp311-pypy311_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for rustfuzz-0.1.16-pp311-pypy311_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 361370c396f95365c454c71f7100cc69fc94758ca0d19806fe7b0b420c4af64d
MD5 ba1eb38b73db46f97bbc69f3cdae8559
BLAKE2b-256 d281cb86a2994e88144765111a660251d65782ba987c3102e052ca918a177f35

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustfuzz-0.1.16-pp311-pypy311_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: ci.yml on bmsuisse/rustfuzz

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustfuzz-0.1.16-cp310-abi3-win_amd64.whl.

File metadata

  • Download URL: rustfuzz-0.1.16-cp310-abi3-win_amd64.whl
  • Upload date:
  • Size: 70.4 MB
  • Tags: CPython 3.10+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for rustfuzz-0.1.16-cp310-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 513e9a4b375789513bab93ea20b245febf777180a362991284581e5cbc07bda2
MD5 8d2835e1b861a9cabdb10ac4dc202325
BLAKE2b-256 b08f246102d0b3ceadae7db963e90c84c890831cadba326221e5bf3eab59bf57

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustfuzz-0.1.16-cp310-abi3-win_amd64.whl:

Publisher: ci.yml on bmsuisse/rustfuzz

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustfuzz-0.1.16-cp310-abi3-musllinux_1_2_x86_64.whl.

File metadata

File hashes

Hashes for rustfuzz-0.1.16-cp310-abi3-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 6a38095ee30d01a78552f8a9caa9913f7e2a1d480de737934efb60ceebf32f16
MD5 25abb1f01359b737a7d88da286e706d3
BLAKE2b-256 c25151954f3fa80f0b59f10bd1d6449942ab94f9faf1460d23a76a88b44e8fcd

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustfuzz-0.1.16-cp310-abi3-musllinux_1_2_x86_64.whl:

Publisher: ci.yml on bmsuisse/rustfuzz

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustfuzz-0.1.16-cp310-abi3-musllinux_1_2_aarch64.whl.

File metadata

File hashes

Hashes for rustfuzz-0.1.16-cp310-abi3-musllinux_1_2_aarch64.whl
Algorithm Hash digest
SHA256 ca7869c0c51dc93b0c42af1944a60e7a72aa0731c0f7115751e7a56eb517fea7
MD5 6ada5fde9d191c9a9d404275a83812d5
BLAKE2b-256 0a977b5974276641525e9ffbe1f3e028674bed52e3d5d47897192f1469f2eb25

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustfuzz-0.1.16-cp310-abi3-musllinux_1_2_aarch64.whl:

Publisher: ci.yml on bmsuisse/rustfuzz

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustfuzz-0.1.16-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for rustfuzz-0.1.16-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 6505f16acab6f69c587009238c1eb01e3204605f5ea98d4d5f80bddf964187e5
MD5 2fc5d9231df17198bcf0f185a88893f9
BLAKE2b-256 f78d2d3c13ce338986c4197f1ea08674679bc40182956ef1a49be13132e6d661

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustfuzz-0.1.16-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: ci.yml on bmsuisse/rustfuzz

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustfuzz-0.1.16-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for rustfuzz-0.1.16-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 d508fb86bf5b62de88cd376d0b7162d711a7753e2ea92d464b401a3ddcd21f55
MD5 d85e888af376f6ec79be936047b70b5e
BLAKE2b-256 07004f3b0ed4c274fe7db2160dc6129a76a3b687f6888b901ee0743accd96faf

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustfuzz-0.1.16-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: ci.yml on bmsuisse/rustfuzz

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustfuzz-0.1.16-cp310-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for rustfuzz-0.1.16-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 cc52f1404a1284d92d6494a2a3cd6aa4fd5a4a5e3add1d56f6d4645179f14810
MD5 709c3a4acef9924aa40ab851e42e8053
BLAKE2b-256 16d64787134519cc66e22a804515d95c46049a33e156594d7895ebd0f25ea395

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustfuzz-0.1.16-cp310-abi3-macosx_11_0_arm64.whl:

Publisher: ci.yml on bmsuisse/rustfuzz

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustfuzz-0.1.16-cp310-abi3-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for rustfuzz-0.1.16-cp310-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 40b1a89c3e20303a61df2b8c139ca9d9040b64a9bd94894853175d7bc24e56d6
MD5 5ae2a210db15c473b38716b0d611e85e
BLAKE2b-256 442955488c9fe6c4fa8c9b467bbf6f2300e12a7938276b68783d6c9e227db8a7

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustfuzz-0.1.16-cp310-abi3-macosx_10_12_x86_64.whl:

Publisher: ci.yml on bmsuisse/rustfuzz

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
