Fast and flexible search in a dictionary using Levenshtein distance

These details have not been verified by PyPI

Project links

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

Efficient and Flexible Searching Within Levenshtein Distance

Coverage

Introduction

Welcome to Leven-Search, a library designed for efficient and fast searching of words within a specified Levenshtein distance.

This library is designed with Kaggle developers and researchers in mind as well as all others who deal with natural language processing, text analysis, and similar domains where the closeness of strings is a pivotal aspect.

What is Levenshtein Distance?

Levenshtein distance measures the difference between two sequences. In the context of strings, it is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into another.

For example, the Levenshtein distance between "table" and "marble" is 2:

table → mable (substitution of t for `m')
mable → marble (insertion of r)

Design Goals

The library is designed with the following goals in mind:

Efficient indexing of large datasets of words. Indexing about 40k words from the Brown corpus takes about 300ms on a modern laptop, and the index takes about 53MB of RAM.
Flexibility in searching. The library allows searching for words within a specified Levenshtein distance also allows configuring specific edit costs for each operation. It allows to configure other distances, like a keyboard distance.
Extensibility. The library is designed to be easily extensible to support other edit distances, such as Damerau-Levenshtein distance or Jaro-Winkler distance.

Performance

Example performance of the library on a Brown corpus (only words larger than 2 characters) and a modern laptop:

Distance	Time per 1000 searches (in seconds)
0	0.0146
1	0.3933
1 (*)	0.4154
2	7.9556

(*) with the per-letter cost granularity

Installation

To install the library, simply run:

pip install leven-search

Usage

First, import the library:

import leven_search as lev

Then, create a LevenSearch object:

searcher = lev.LevenSearch()

Next, add words to the searcher:

searcher.insert("hello")
searcher.insert("world")

Finally, search for words within a specified Levenshtein distance:

searcher.find_dist("mello", 1)

Result:
	hello: ResultItem(word='hello', dist=1, updates=[m -> h])

Example

The following example shows how to use the library to search for words within a Brown corpus:

import nltk
import leven_search as lev

# Download the Brown corpus
nltk.download('brown')

# Create a LevenSearch object
searcher = lev.LevenSearch()

for w in nltk.corpus.brown.words():
  if len(w) > 2:
    searcher.insert(w)

# Search for words within a Levenshtein distance
searcher.find_dist('komputer', 1)

Result:
	computer: ResultItem(word='computer', dist=1, updates=[k -> c])

Search for words within a Levenshtein distance with custom costs

cost = lev.GranularEditCostConfig(default_cost=2, edit_costs=[lev.EditCost('k', 'c', 0.1)])
searcher.find_dist('komputer', 2, cost)

Result:
      computer: ResultItem(word='computer', dist=0.1, updates=[k -> c])

searcher.find_dist('yomputer', 2, cost)

Result:
      computer: ResultItem(word='computer', dist=2, updates=[y -> c])

searcher.find_dist('yomputer', 1, cost)

Result:
      None

Project details

These details have not been verified by PyPI

Project links

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Release history Release notifications | RSS feed

This version

0.2.1

Jan 30, 2024

0.2.0

Jan 16, 2024

0.1.0

Jan 15, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

leven-search-0.2.1.tar.gz (8.4 kB view hashes)

Uploaded Jan 30, 2024 Source

Built Distribution

leven_search-0.2.1-py3-none-any.whl (8.2 kB view hashes)

Uploaded Jan 30, 2024 Python 3

Hashes for leven-search-0.2.1.tar.gz

Hashes for leven-search-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`84d33c3ac799df90a01683dfae9fc7c3e1dced856a9a73572d9d267115adbca6`
MD5	`70e3caabe576e4362cf5ff0c76d02f79`
BLAKE2b-256	`57b584d1b5b479e4949e13a7dd3c85501db465d238ab20766ea2d9be90f32e2d`

Hashes for leven_search-0.2.1-py3-none-any.whl

Hashes for leven_search-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3ccd646496af7bcc736c79c66807e224469ae5cfccbf4a50e17e86dba900b865`
MD5	`91d6a9d1cbc842476ab446c9f7ad44b1`
BLAKE2b-256	`6fd3e106348cca8e9f23219bfd93a9cfe159c55cc2d71a835f1b93a01cf93c80`