
Neofuzz


Blazing fast, lightweight and customizable fuzzy and semantic text search in Python.

Introduction (Documentation)

Neofuzz is a fuzzy search library based on vectorization and approximate nearest neighbour search techniques.

New in version 0.3.0

Now you can reorder your search results using Levenshtein distance! Sometimes n-gram or other vectorized processes don't order the results quite right. In these cases you can retrieve a larger number of candidates from the indexed corpus, then refine their ranking with Levenshtein distance.

from neofuzz import char_ngram_process

process = char_ngram_process()
process.index(corpus)

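# Pull 30 candidates with the fast vector search, then re-rank them with Levenshtein distance.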
process.extract("your query", limit=30, refine_levenshtein=True)

Why is Neofuzz fast?

Most fuzzy search libraries rely on optimizing the hell out of the same couple of fuzzy search algorithms (Hamming distance, Levenshtein distance). Unfortunately, due to the complexity of these algorithms, sometimes no amount of optimization will get you the speed you want.

Neofuzz starts from the realization that you can't get past a certain speed limit by relying on traditional algorithms, and instead uses text vectorization and approximate nearest neighbour search in the vector space to speed up the process.

When it comes to the dilemma of speed versus accuracy, Neofuzz goes full-on speed.
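
To make this concrete, here is a minimal sketch of the vectorize-then-search idea using plain sklearn. It only illustrates the principle: the corpus is made up, and sklearn's NearestNeighbors does exact search, whereas Neofuzz uses an approximate index.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

corpus = ["fuzzer", "fizz", "buzz", "fuzzy wuzzy"]

# Vectorize the corpus once with character n-gram tf-idf features.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
index = NearestNeighbors(metric="cosine").fit(vectorizer.fit_transform(corpus))

# A query costs one vectorization plus one index lookup; no pairwise
# Levenshtein computation against the whole corpus is needed.
distances, indices = index.kneighbors(vectorizer.transform(["fuz"]), n_neighbors=2)
print([corpus[i] for i in indices[0]])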

When should I choose Neofuzz?

  • You need to do repeated searches in the same corpus.
  • Levenshtein and Hamming distance are simply not fast enough.
  • You are willing to sacrifice the quality of the results for speed.
  • You don’t mind that the up-front computation to index a corpus might take time.
  • You have very long strings, where other methods would be impractical.
  • You want to rely on semantic content.
  • You need a drop-in replacement for TheFuzz (see the sketch after this list).
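
To illustrate the drop-in claim, here is a rough sketch of the two APIs side by side. The corpus and query are made up; both calls return a list of (option, score) tuples.

from thefuzz import process as fuzz_process
from neofuzz import char_ngram_process

corpus = ["New York", "New Jersey", "Newark"]

# TheFuzz scans the whole corpus on every call.
fuzz_process.extract("new york", corpus, limit=2)

# Neofuzz indexes once, then runs many fast searches.
neo_process = char_ngram_process()
neo_process.index(corpus)
neo_process.extract("new york", limit=2)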

When should I NOT choose Neofuzz?

  • The corpus changes all the time, or you only want to do one search in a corpus. (It might still give you a speed-up in that case, though.)
  • You value the quality of the results over speed.
  • You don’t mind slower searches in favor of no indexing.
  • You have a small corpus with short strings.

Usage

You can install Neofuzz from PyPI:

pip install neofuzz

If you want a plug-and-play experience, you can create a generally good, quick-and-dirty process with char_ngram_process().

from neofuzz import char_ngram_process

# We create a process that takes character 1 to 5-grams as features for
# vectorization and uses a tf-idf weighting scheme.
# We will use cosine distance for the nearest neighbour search.
process = char_ngram_process(ngram_range=(1,5), metric="cosine", tf_idf=True)

# We index the options that we are going to search in
process.index(options)

# Then we can extract the ten most similar items the same way as in
# thefuzz
process.extract("fuzz", limit=10)
---------------------------------
[('fuzzer', 67),
 ('Januzzi', 30),
 ('Figliuzzi', 25),
 ('Fun', 20),
 ('Erika_Petruzzi', 20),
 ('zu', 20),
 ('Zo', 18),
 ('blog_BuzzMachine', 18),
 ('LW_Todd_Bertuzzi', 18),
 ('OFU', 17)]

Custom Processes

You can customize Neofuzz’s behaviour by making a custom process. Under the hood every Neofuzz Process relies on the same two components:

  • A vectorizer, which turns texts into a vectorized form, and can be fully customized.
  • Approximate Nearest Neighbour search, which indexes the vector space and can find neighbours of a given vector very quickly.

Words as Features

If you’re more interested in the words/semantic content of the text, you can also use words as features. This can be especially useful with longer texts, such as literary works.

from neofuzz import Process
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorization with words is the default in sklearn.
vectorizer = TfidfVectorizer()

# We use cosine distance because it's way better for high-dimensional spaces.
process = Process(vectorizer, metric="cosine")
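
As with every Neofuzz process, you still have to index before you can search; documents below is just a stand-in name for your own corpus of longer texts:

# Index the longer documents, then search by their word content.
process.index(documents)
process.extract("whaling voyages in the nineteenth century", limit=5)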

Dimensionality Reduction

You might find that the speed of your fuzzy search process is not sufficient. In this case it might be desirable to reduce the dimensionality of the produced vectors with some matrix decomposition method or topic model.

Here, for example, I use NMF (an excellent topic model, and an incredibly fast one too) to speed up my fuzzy search pipeline.

from neofuzz import Process
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.pipeline import make_pipeline

# Vectorization with tokens again
vectorizer = TfidfVectorizer()
# Dimensionality reduction to 20 dimensions
nmf = NMF(n_components=20)
# Create a pipeline of the two
pipeline = make_pipeline(vectorizer, nmf)

process = Process(pipeline, metric="cosine")
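
Indexing and searching work the same as before; the difference is that queries are now matched in the reduced 20-dimensional space, which makes the neighbour lookups much cheaper:

process.index(options)
process.extract("your query", limit=10)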

Semantic Search/Large Language Models

With Neofuzz you can easily use semantic embeddings to your advantage: attention-based language models (BERT), simple neural word or document embeddings (Word2Vec, Doc2Vec, FastText, etc.), or even OpenAI’s LLMs.

We recommend you try embetter, which has a lot of built-in sklearn-compatible vectorizers.

pip install embetter

from embetter.text import SentenceEncoder
from neofuzz import Process

# Here we will use a pretrained transformer sentence encoder as our vectorizer
vectorizer = SentenceEncoder("all-distilroberta-v1")
# Then we make a process with the language model
process = Process(vectorizer, metric="cosine")

# Remember that the options STILL have to be indexed even though you have a pretrained vectorizer
process.index(options)
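
Searching then works exactly as before, but matches are based on meaning rather than shared characters. The query below is only an illustration:

# A semantic query can surface options that share few or no characters with it.
process.extract("canine companion", limit=3)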
