Cython implementation of true Damerau-Levenshtein algorithm.
Project description
fastDamerauLevenshtein
Cython implementation of true Damerau-Levenshtein edit distance which allows one item to be edited more than once. More information from Wikipedia:
In information theory and computer science, the Damerau-Levenshtein distance (named after Frederick J. Damerau and Vladimir I. Levenshtein) is a string metric for measuring the edit distance between two sequences. Informally, the Damerau-Levenshtein distance between two words is the minimum number of operations (consisting of insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other.
The Damerau-Levenshtein distance differs from the classical Levenshtein distance by including transpositions among its allowable operations in addition to the three classical single-character edit operations (insertions, deletions and substitutions).
The implementation is based on James M. Jensen II explanation and it allows specifying the cost of every operation.
Requirements
This code requires Python 2.7 or 3.5+ and a C compiler such as GCC.
Install
fastDamerauLevenshtein is available on PyPI at https://pypi.python.org/pypi/fastDamerauLevenshtein.
Install using pip:
pip install fastDamerauLevenshtein
Install from source:
python setup.py install
or
pip install .
Usage
The available method it's called damerauLevenshtein
and can compute the distance on two objects that are hashable(strings, list of strings etc.). The method provides the following parameters:
-
firstObject
-
secondObject
-
similarity
- If this parameter value is
False
, it will return the total cost of edit, otherwise it will return a score from 0.0 to 1.0 denoting how similar the two objects are. It isTrue
by default.
- If this parameter value is
-
deleteWeight
- Cost of delete operation.
-
insertWeight
- Cost of insert operation.
-
replaceWeight
- Cost of replace operation.
-
swapWeight
- Cost of swap operation.
The provided weights of operations must be int
values. All these values are 1
by default.
Basic use:
from fastDamerauLevenshtein import damerauLevenshtein
damerauLevenshtein('ca', 'abc', similarity=False) # expected result: 2.0
damerauLevenshtein('car', 'cars', similarity=True) # expected result: 0.75
damerauLevenshtein(['ab', 'bc'], ['ab'], similarity=False) # expected result: 1.0
damerauLevenshtein(['ab', 'bc'], ['ab'], similarity=True) # expected result: 0.5
Benchmark
Other Python Damerau-Levenshtein and OSA implementations:
- pyxDamerauLevenshtein (restricted edit distance and no custom weights)
- jellyfish (true Damerau-Levenshtein but no custom weights)
- editdistance (restricted edit distance and no custom weights)
- textdistance (true Damerau-Levenshtein but no custom weights)
Python 3.7 (on Intel i5 6500):
>>> import timeit
>>> #fastDamerauLevenshtein:
... timeit.timeit(setup="import fastDamerauLevenshtein; text1='afwafghfdowbihgp'; text2='goagumkphfwifawpte'", stmt="fastDamerauLevenshtein.damerauLevenshtein(text1, text2)", number=100000)
0.43
>>> #pyxDamerauLevenshtein:
... timeit.timeit(setup="from pyxdameraulevenshtein import normalized_damerau_levenshtein_distance; text1='afwafghfdowbihgp'; text2='goagumkphfwifawpte'", stmt="normalized_damerau_levenshtein_distance(text1, text2)", number=100000)
2.44
>>> #jellyfish
... timeit.timeit(setup="import jellyfish; text1='afwafghfdowbihgp'; text2='goagumkphfwifawpte'", stmt="jellyfish.damerau_levenshtein_distance(text1, text2)", number=100000)
0.20
>>> #editdistance
... timeit.timeit(setup="import editdistance; text1='afwafghfdowbihgp'; text2='goagumkphfwifawpte'", stmt="editdistance.eval(text1, text2)", number=100000)
0.22
>>> #textdistance
... timeit.timeit(setup="import textdistance; text1='afwafghfdowbihgp'; text2='goagumkphfwifawpte'", stmt="textdistance.damerau_levenshtein.distance(text1, text2)", number=100000)
0.70
License
It is released under the MIT license.
Copyright (c) 2019 Robert Grigoroiu
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file fastDamerauLevenshtein-1.0.7.tar.gz
.
File metadata
- Download URL: fastDamerauLevenshtein-1.0.7.tar.gz
- Upload date:
- Size: 36.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/44.0.0 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.8.3
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 0186ccf45f3312a72d3b52665083f0fefb14fb510b71daeba2483292970834c1 |
|
MD5 | 7eed419ec18590ff215ada73d565aec1 |
|
BLAKE2b-256 | 42371d3f799161bdc4aebea549f3d782f55107e1d9988c60ed85a30618782d0c |
File details
Details for the file fastDamerauLevenshtein-1.0.7-cp38-cp38-win_amd64.whl
.
File metadata
- Download URL: fastDamerauLevenshtein-1.0.7-cp38-cp38-win_amd64.whl
- Upload date:
- Size: 19.8 kB
- Tags: CPython 3.8, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/44.0.0 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.8.3
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 874ac3ce7a1c9ba2645629f999fac74815bd069f71d3c77c01095400f0bcc9c1 |
|
MD5 | daff763359b7743f3591caa6dd445645 |
|
BLAKE2b-256 | 3f9e294ca40c287c8107706da44627c4e07dd192141466a702a1621430f52a26 |