Empirical determination of approximate values for levenshtein distances between random strings.
Project description
expectedlevenshtein
This repository contains empirically determined approximate expected Levenshtein distances between random strings over alphabets of different sizes, as well as simple python code to generate them.
Dependencies
To use the code, you will need numpy
and numba
.
Installing
Simply clone this repo:
git clone https://github.com/nickmachnik/expectedlevenshtein.git [TARGET DIR]
and then install via pip
pip install [TARGET DIR]
or install directly from PyPI (this won't include unreleased changes as specified in the changelog):
pip install expectedlevenshtein
Testing
Test the cloned package:
cd [TARGET DIR]
python m unittest
Geting started
Use precomputed models
This package comes with precomputed models for certain alphabet sizes k and string lengths n. Currently the following models are available:
 k = 20, 25 ≤ n ≤ 6000
Note: A model for a specific value of n only fits values for m (the length of the second string) such that m ≤ n.
The following example shows how a models can be loaded and used to compute the expected levenshtein distances for k = 20, n = 5000:
import expected_levenshtein.fit as efit import numpy as np # load all models for k = 20 row_indices, coefficients, mean_squared_deviations = efit.load_precomputed(20) # get the specific model for n = 5000. Here we consider an index row offset. coeff_5k = coefficients[5000  row_indices[0]] # predict expected distance for n=5000, m=876 single_distance = efit.poly(876, coeff_5k) # predict expected distances for n=5000, m ≤ 5000 range_distances = efit.poly(np.arange(5000), coeff_5k)
Computing average levenshtein distances
To compute the approximate expected Levenshtein distances of random strings of lengths 1 ≤ lengths ≤ n, use random_average_levenshtein
in sample.py
.
This example shows how to compute the distances of random strings up to length 100 over a 4letter alphabet, averaged over 1000 replicates.
from sample import random_average_levenshtein import numpy as np random_average_levenshtein(100, 1000, np.arange(4))
Generating models for expected distances
For long sequences, the distance matrix returned by random_average_levenshtein
can get quite large.
If you prefer not to load and query a large matrix object every time you need an expected distance,
fit.model_average_levenshtein
generates a polynomial model for each row in
the distance matrix. That way, the information that needs to be stored to compute approximate
expected levenshtein distances is reduced to the coefficients of the polynomials. Once computed,
these can be used to predict expected distances with fit.poly
.
This example shows how to generate and use such models for random strings from length 25 to length 50.
from sample import random_average_levenshtein from fit import poly, model_average_levenshtein import numpy as np # sample distances average_distances = random_average_levenshtein(50, 1000, np.arange(4)) # make models row_indices, coefficients, mean_squared_deviations = model_average_levenshtein( average_distances, model_rows=np.arange(25, 51)) # predict expected distance for n=50, m=44 coeff_n_50 = coefficients[1] predicted_expected_distance = poly(44, coeff_n_50)
License
MIT license (LICENSE or https://opensource.org/licenses/MIT)
Project details
Release history Release notifications  RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Filename, size  File type  Python version  Upload date  Hashes 

Filename, size expected_levenshtein0.1.2py3noneany.whl (398.2 kB)  File type Wheel  Python version py3  Upload date  Hashes View 
Filename, size expectedlevenshtein0.1.2.tar.gz (402.2 kB)  File type Source  Python version None  Upload date  Hashes View 
Hashes for expected_levenshtein0.1.2py3noneany.whl
Algorithm  Hash digest  

SHA256  7706399088192cb3f7db45bc73d437da93e238c279d30c925a0dcb2a469040ca 

MD5  40385e0a83d3dd118d5a941cb96b7326 

BLAKE2256  1cb04ee90921e84f57a6c5eb8585eea52dee68303b0df36edd4d5b6e01d2ed02 
Hashes for expectedlevenshtein0.1.2.tar.gz
Algorithm  Hash digest  

SHA256  12f2f9a93775e88d9dea9646d512eb335d0f765402b20bb95a25c9e71fb048f7 

MD5  1085a8c49a3a2c11f7c7d0155c33b9d5 

BLAKE2256  6331a626279db8b0cf8af00390c4c5fd4b5d32017b0f784481b6e85891c6f400 