Blazing fast fuzzy text search for Python.
neofuzz
Introduction
neofuzz is a fuzzy search library based on vectorization and approximate nearest neighbour search techniques.
What neofuzz is good at:
- Being hella fast.
- Repeated searches in the same space of options.
- Compatibility with already existing TheFuzz code.
- Incredible flexibility in the vectorization process.
- Complete control over the nearest neighbour algorithm's speed and accuracy.
If you need a scalable way to search the same large dataset repeatedly, and can accept lower-quality results in exchange for incredible speed, neofuzz is what you're looking for.
What neofuzz is not good at:
- Exact, guaranteed-correct results.
- Searching different databases in succession.
- Match quality: it does not use the best fuzzy search algorithm.
If you're looking for a library that's great for fuzzy searching small amounts of data with a good fuzzy algorithm like Levenshtein or Hamming distance, neofuzz is probably not the thing for you.
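To make the trade-off above more concrete, here is a minimal pure-Python sketch of the general idea behind vectorization-based fuzzy search. It is not neofuzz's implementation: the helper names (char_ngrams, cosine_similarity, build_index, search) are made up for illustration, the vectors are plain n-gram counts without tf-idf weighting, and the neighbour search is exact brute force rather than approximate.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Count all character n-grams of length n_min..n_max in text."""
    text = text.lower()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(options):
    """Vectorize every option once so repeated queries can reuse the work."""
    return [(option, char_ngrams(option)) for option in options]


def search(index, query, limit=3):
    """Brute-force nearest neighbours: score every option, keep the best."""
    query_vector = char_ngrams(query)
    scored = [
        (option, cosine_similarity(query_vector, vector))
        for option, vector in index
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Hypothetical toy data, indexed once and then queried.
options = ["fuzzer", "buzz", "apple", "banana"]
index = build_index(options)
best = search(index, "fuzz", limit=2)
print(best)  # 'fuzzer' ranks first, then 'buzz'
```

The index is built once and reused for every query, which is why this style of search pays off for repeated searches in the same space of options. A library like neofuzz would replace the brute-force loop with an approximate nearest neighbour index, trading some accuracy for speed.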
Usage
You can install neofuzz from PyPI:
pip install neofuzz
The base abstraction of neofuzz is the Process, which is a class aimed at replicating TheFuzz's API.
A Process takes a vectorizer, which turns strings into vectors, and additional parameters for fine-tuning the indexing process.
If you want a plug-and-play experience, you can create a generally good, quick and dirty process with the char_ngram_process() function.
from neofuzz import char_ngram_process
# We create a process that takes character 1 to 5-grams as features for
# vectorization and uses a tf-idf weighting scheme.
# We will use cosine distance for the nearest neighbour search.
process = char_ngram_process(ngram_range=(1, 5), metric="cosine", tf_idf=True)
# We index the options that we are going to search in
# (options is assumed to be a list of strings)
process.index(options)
# Then we can extract the ten most similar items the same way as in
# thefuzz
process.extract("fuzz", limit=10)
---------------------------------
[('fuzzer', 67),
('Januzzi', 30),
('Figliuzzi', 25),
('Fun', 20),
('Erika_Petruzzi', 20),
('zu', 20),
('Zo', 18),
('blog_BuzzMachine', 18),
('LW_Todd_Bertuzzi', 18),
('OFU', 17)]
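As the output above shows, the result is a list of (option, score) tuples, the same shape thefuzz returns. A small sketch of post-processing such a list follows; the first four pairs are copied from the output above, while MIN_SCORE is a hypothetical cut-off chosen only for illustration.

```python
# (option, score) pairs copied from the example output above.
results = [("fuzzer", 67), ("Januzzi", 30), ("Figliuzzi", 25), ("Fun", 20)]

# Hypothetical threshold: discard anything scoring below it.
MIN_SCORE = 25

good_matches = [option for option, score in results if score >= MIN_SCORE]
print(good_matches)  # ['fuzzer', 'Januzzi', 'Figliuzzi']
```

Because the search is approximate, a threshold like this may be worth tuning on your own data rather than carrying over from thefuzz unchanged.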
If you want to use a custom vectorization process, with dimensionality reduction for example,
you are more than free to do so by creating your own custom Process:
from neofuzz import Process
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
# Let's say we have a list of sentences instead of words,
# Then we can use token ngrams as features
tf_idf = TfidfVectorizer(analyzer="word", stop_words="english", ngram_range=(1,3))
# We use NMF for reducing the dimensionality of the vectors to 20
# This could improve speed but takes more time to set up the index
nmf = NMF(n_components=20)
# Our vectorizer is going to be a pipeline
vectorizer = make_pipeline(tf_idf, nmf)
# We create a process and index it with our corpus.
# (sentences is assumed to be a list of strings)
process = Process(vectorizer, metric="cosine")
process.index(sentences)
# Then you can extract results
process.extract("she ate the cat", limit=3)
-------------------------------------------
[('She ate the Apple.', 65),
('The dog at the cat.', 42),
('She loves that cat', 30)]
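To get a feel for what token n-grams with ngram_range=(1,3) look like as features, here is a simplified pure-Python sketch. It is not scikit-learn's tokenizer: word_ngrams is a made-up helper, it splits on whitespace only, and STOP_WORDS is a tiny hypothetical stand-in for a real English stop word list. It assumes stop words are dropped before the n-grams are formed.

```python
def word_ngrams(sentence, n_min=1, n_max=3, stop_words=frozenset()):
    """Return word n-grams of length n_min..n_max, after removing stop words."""
    tokens = [t for t in sentence.lower().split() if t not in stop_words]
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


# Tiny hypothetical stop word list, for illustration only.
STOP_WORDS = frozenset({"she", "the"})

features = word_ngrams("she ate the cat", stop_words=STOP_WORDS)
print(features)  # ['ate', 'cat', 'ate cat']

all_features = word_ngrams("she ate the cat")
print(len(all_features))  # 9
```

With stop words removed, a short query keeps only a handful of features, which may help explain why sentences sharing just one or two content words can still show up among the results.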
Documentation
TODO