Python library for fast fuzzy search over a big file leveraging C++ and mbleven algorithm
About The Project
Fastzy is a library written in C++ that searches a file for text based on Levenshtein distance. It uses the mbleven algorithm for k-bounded Levenshtein distance measurement. When the requested maximum distance is above 3, where mbleven is expected to be slower, the distance algorithm is replaced with Wagner–Fischer. The library first loads the whole file into memory and creates a lightweight index based on line length. This index narrows each lookup down to only the lines whose length makes a match possible.
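The length-based index described above can be sketched in pure Python. This is only an illustration of the idea, not fastzy's C++ implementation: the class name `LengthIndexedSearcher` and the helper `levenshtein` are hypothetical, and the distance is computed with plain Wagner–Fischer instead of mbleven. The pruning relies on the fact that two strings whose lengths differ by more than `max_distance` cannot be within that distance of each other.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Wagner-Fischer dynamic programming, keeping one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


class LengthIndexedSearcher:
    """Hypothetical stand-in for the idea of indexing lines by length."""

    def __init__(self, lines):
        # Group lines by their length so a lookup can skip whole groups.
        self.by_length = defaultdict(list)
        for line in lines:
            self.by_length[len(line)].append(line)

    def lookup(self, pattern, max_distance):
        results = []
        # Only lengths within max_distance of the pattern length can match.
        low = max(0, len(pattern) - max_distance)
        high = len(pattern) + max_distance
        for length in range(low, high + 1):
            for candidate in self.by_length.get(length, ()):
                if levenshtein(pattern, candidate) <= max_distance:
                    results.append(candidate)
        return results


sample = LengthIndexedSearcher(['test', 'texts', 'next', 'textbook', 'a'])
matches = sample.lookup('text', 1)
```

In this sketch, `'textbook'` and `'a'` are never compared against the pattern at all, because their lengths fall outside the range 3 to 5.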
Built With
Performance
Library | Text Size | Function | Time | #Results | Improvement Factor |
---|---|---|---|---|---|
python-Levenshtein | 500 MB | Levenshtein.distance('text') | 24.2 s | 1249 | 1.0x |
fastzy | 500 MB | fastzy.lookup('text') | 22.2 ms | 1249 | 1090.0x |
Prerequisites
To compile this package, you need GCC and the Python development package installed.
- Fedora
sudo dnf install python3-devel gcc-c++
- Ubuntu 18.04
sudo apt install python3-dev g++-9
Installation
pip3 install fastzy
Usage
import fastzy

# open a file and index it in memory
searcher = fastzy.Searcher(
    input_file_path='input_text_file.txt',
    separator='',
)

# look up the input text 'text' with a maximum distance of 1
searcher.lookup(
    pattern='text',
    max_distance=1,
)
['test', 'texts', 'next']
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Gal Ben David - gal@intsights.com
Project Link: https://github.com/Intsights/fastzy