Python library for fast fuzzy search over a big file leveraging C++ and mbleven algorithm

About The Project

Fastzy is a library written in C++ that searches a file for lines within a given Levenshtein distance of a pattern. It uses the mbleven algorithm for k-bounded Levenshtein distance measurement. When the requested maximum distance is above 3, where mbleven would be slower, the distance algorithm is replaced with Wagner–Fischer. The library first loads the whole file into memory and creates a lightweight index based on line length. The index narrows each lookup down to the lines whose length makes them potential matches.
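The length-based index described above can be illustrated with a minimal pure-Python sketch. It is not Fastzy's C++ implementation: the names levenshtein, build_length_index and lookup are hypothetical, and plain Wagner–Fischer is used for every distance instead of mbleven. It relies on the fact that two strings within distance k of each other differ in length by at most k, so only a few length buckets need to be scanned.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Wagner-Fischer edit distance, keeping only two rows of the table."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def build_length_index(lines):
    """Group lines by their length (hypothetical helper, not Fastzy's API)."""
    index = defaultdict(list)
    for line in lines:
        index[len(line)].append(line)
    return index


def lookup(index, pattern, max_distance):
    """Scan only the buckets whose length is within max_distance of the pattern."""
    results = []
    n = len(pattern)
    for length in range(max(0, n - max_distance), n + max_distance + 1):
        for candidate in index.get(length, ()):
            if levenshtein(pattern, candidate) <= max_distance:
                results.append(candidate)
    return results


lines = ['test', 'texts', 'next', 'testing', 'tex', 'a']
index = build_length_index(lines)
matches = lookup(index, 'text', 1)
print(matches)
```

With a maximum distance of 1, the lines 'testing' and 'a' are never compared against the pattern, because their lengths fall outside the 3 to 5 range.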

Performance

Library             Text Size  Function                      Time     #Results  Improvement Factor
python-Levenshtein  500 MB     Levenshtein.distance('text')  24.2 s   1249      1.0x
fastzy              500 MB     fastzy.lookup('text')         22.2 ms  1249      1090.0x

Prerequisites

To compile this package, you need GCC and the Python development package installed.

  • Fedora
sudo dnf install python3-devel gcc-c++
  • Ubuntu 18.04
sudo apt install python3-dev g++-9

Installation

pip3 install fastzy

Usage

import fastzy

# open a file and index it in memory
searcher = fastzy.Searcher(
    input_file_path='input_text_file.txt',
    separator='',
)

# look up lines within a Levenshtein distance of 1 from 'text'
searcher.lookup(
    pattern='text',
    max_distance=1,
)
# example result: ['test', 'texts', 'next']

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Gal Ben David - gal@intsights.com

Project Link: https://github.com/Intsights/fastzy
