Edit distance implementations in cython
Project description
Python Edit Distances
Copyright (C) 20192020  Benjamin Paassen
Machine Learning Research Group
Center of Excellence Cognitive Interaction Technology (CITEC)
Bielefeld University
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, see http://www.gnu.org/licenses/.
Introduction
This library contains several edit distance and alignment algorithms for sequences and trees of arbitrary node type. Additionally, this library contains multiple backtracing mechanisms for every algorithm in order to facilitate more detailed interpretation and subsequent processing. Finally, this library provides a reference implementation for embedding edit distance learning (BEDL; Paaßen et al., 2018), which enables users to learn edit distance parameters instead of specifying them manually.
Refer to the Quickstart Guide for how to use the library and refer to the list below for a full list of the enclosed algorithms. The detailed API documentation is available at readthedocs.org.
If you use this library in academic work, please cite:
 Paaßen, B., Mokbel, B., & Hammer, B. (2015). A Toolbox for Adaptive Sequence Dissimilarity Measures for Intelligent Tutoring Systems. In O. C. Santos, J. G. Boticario, C. Romero, M. Pechenizkiy, A. Merceron, P. Mitros, J. M. Luna, et al. (Eds.), Proceedings of the 8th International Conference on Educational Data Mining (pp. 632632). International Educational Datamining Society. (Link)
This library is historically based on its Java version, the TCS Alignment Toolbox.
Installation
This package is available on pypi as edist
. You can install
it via
pip install edist
If you wish to build this project from source, you need to first install cython and then execute the following commands in this directory:
python3 cython_setup.py build_ext inplace
cp *so edist/.
Quickstart Guide
There are multiple example cases illustrated in our demo notebooks. In particular:
sed_demo.ipynb
illustrates the Levenshtein distance (Levenshtein, 1965) and affine edit distance (Gotoh, 1982) as well as its backtracing,dtw_demo.ipynb
illustrates dynamic time warping (Vintsyuk, 1968) as well as its backtracing and speedup measures,ted_demo.ipynb
illustrates the tree edit distance (Zhang and Shasha, 1989) as well as its backtracing and support for edit functions, andbedl_demo.ipynb
illustrates embedding edit distance learning (Paaßen et al., 2018).
In general, applying this library works as follows. First, you select the
edit distance function that best fits for your data and your setting
(see below for an overview of all available functions). Let's say your
function is called distfun
. Then, you can compute the distance between two
lists/trees x
and y
via distfun(x, y)
. If you wish to compute the matrix
of all pairwise distances for an entire dataset of lists/trees X
, then you
can use the multiprocess
module as follows.
from edist.multiprocess import pairwise_distances_symmetric
D = pairwise_distances_symmetric(X, distfun)
If you wish to compute the matrix of all pairwise distances between one
dataset X
and another dataset Y
, you can use the following function.
from edist.multiprocess import pairwise_distances
D = pairwise_distances(X, Y, distfun)
If you wish to use a custom local distance function delta
, you can supply
it as additional argument to either distfun
itself, to
pairwise_distances_symmetric
, or to pairwise_distances
.
If you wish to compute the optimal alignment between two lists/trees x
and y
according to distfun
, you can use the function
distfun_backtrace(x, y)
. Note that, in case of multiple possible optimal
alignments, this function will always return the option that uses replacements
as early as possible. If you instead wish to sample a random optimal alignment,
you can use distfun_backtrace_stochastic(x, y)
. Unfortunately, it is
infeasible to enumerate the entire set of cooptimal alignments because this
set may be exponentially large. However, it is possible to characterize the
distribution of cooptimal alignments concisely by describing with which
probability each node in x
is paired with each node in y
. This probability
matrix is computed by the function distfun_backtrace_matrix(x, y)
and follows
the forwardbackward algorithm developed by Paaßen (2018).
List of Algorithms and Functions
The following edit distance algorithms and functions are contained in this library.
 The Levenshtein distance/sequence edit distance (Levenshtein, 1965):
edist.sed.standard_sed(x, y)
for edit distance computation between sequencesx
andy
with a cost of 1 for each replacement, deletion, and insertion.edist.sed.sed_string(x, y)
for the same, but specifically designed for strings and thus considerably faster (~factor 3).edist.sed.standard_sed_backtrace(x, y)
for backtracing for the standard edit distance.edist.sed.standard_sed_backtrace_stochastic(x, y)
for the same, but returning a random optimal alignment instead of a fixed one.edist.sed.standard_sed_backtrace_matrix(x, y)
for the same, but returning a probability distribution over all pairings between elements ofx
andy
.edist.sed.sed(x, y, delta)
for edit distance computation with a custom element distance functiondelta
.edist.sed.sed_backtrace(x, y, delta)
for backtracing for the edit distance with a custom element distance functiondelta
.edist.sed.sed_backtrace_stochastic(x, y, delta)
for the same, but returning a random optimal alignment instead of a fixed one.edist.sed.sed_backtrace_matrix(x, y, delta)
for the same, but returning a probability distribution over all pairings between elements ofx
andy
.
 The dynamic time warping distance (DTW; Vintsyuk, 1968):
edist.dtw.dtw_numeric(x, y)
for DTW computation between two time seriesx
andy
, each given as a double array.edist.dtw.dtw_manhattan(x, y)
for DTW computation between two time seriesx
andy
, each given as a double matrix. The distance between two frames is defined as the Manhattan distance.edist.dtw.dtw_euclidean(x, y)
for DTW computation between two time seriesx
andy
, each given as a double matrix. The distance between two frames is defined as the Euclidean distance.edist.dtw.dtw_string(x, y)
for DTW computation between two stringsx
andy
.edist.dtw.dtw(x, y, delta)
for DTW computation between two arbitrary sequencesx
andy
with a custom element distance functiondelta
.edist.dtw.dtw_backtrace(x, y, delta)
for backtracing for DTW with a custom element distance functiondelta
.edist.dtw.dtw_backtrace_stochastic(x, y, delta)
for the same, but returning a random optimal alignment instead of a fixed one.edist.dtw.dtw_backtrace_matrix(x, y, delta)
for the same, but returning a probability distribution over all pairings between elements ofx
andy
.
 The affine edit distance (Gotoh, 1982):
edist.aed.aed(x, y, rep, gap, skip)
for affine edit distance computation between two arbitrary sequencesx
andy
, where each frame replacement is scored with the functionrep
, each deletion and insertion is scored with the functiongap
, and each deletion and insertion extension is scored with the functionskip
.edist.aed.aed_backtrace(x, y, rep, gap, skip)
for backtracing for the affine edit distance with the replacement cost functionrep
, the deletion/insertion cost functiongap
, and the gap extension cost functionskip
.edist.aed.aed_backtrace_stochastic(x, y, delta)
for the same, but returning a random optimal alignment instead of a fixed one.edist.aed.aed_backtrace_matrix(x, y, delta)
for the same, but returning a probability distribution over all pairings between elements ofx
andy
.
 The tree edit distance (TED; Zhang and Shasha, 1989):
edist.ted.standard_ted(x_nodes, x_adj, y_nodes, y_adj)
for edit distance computation between the treesx
andy
, which are both given in a node list/adjacency list format. Both lists are supposed to be in depthfirstsearch order, e.g. a tree a(b, c) is supposed to be represented as the two lists['a', 'b', 'c']
and[[1, 2], [], []]
. The cost for replacements, deletions, and insertions is fixed to 1.edist.ted.standard_ted_backtrace(x_nodes, x_adj, y_nodes, y_adj)
for backtracing for the tree edit distance.edist.sed.standard_sed_backtrace_matrix(x_nodes, x_adj, y_nodes, y_adj)
for the same, but returning a probability distribution over all pairings between elements ofx
andy
.edist.ted.ted(x_nodes, x_adj, y_nodes, delta)
for tree edit distance computation with a custom node distance functiondelta
.edist.ted.ted_backtrace(x_nodes, x_adj, y_nodes, delta)
for backtracing for the tree edit distance with a custom element distance functiondelta
.edist.ted.ted_backtrace_matrix(x_nodes, x_adj, y_nodes, delta)
for the same, but returning a probability distribution over all pairings between elements ofx
andy
.
 The set edit distance (SetED; unpublished, but using the Hungarian algorithm
of Kuhn, 1955 at its core):
edist.seted.standard_seted(x, y)
for set edit distance computation between the setsx
andy
, which are both given as lists for convenience. The cost for replacements, deletions, and insertions is fixed to 1.edist.seted.standard_seted_backtrace(x, y)
for backtracing for the set edit distance.edist.seted.seted(x, y, delta)
for set edit distance computation with a custom element distance functiondelta
.edist.seted.seted_backtrace(x, y, delta)
for backtracing for the set edit distance with a custom element distance functiondelta
.
Additionally, this library contains a few helper modules, namely:
edist.adp
contains functions to compute arbitrary sequence edit distances that can be defined by a regular grammar. This is based on the framework of algebraic dynamic programming (ADP; Giegerich, Meyer, and Steffen, 2004), as applied by Paaßen, Mokbel, and Hammer (2016). In particular:edist.adp.edit_distance(x, y, grammar, deltas)
computes the sequence edit distance defined by the regular grammargrammar
and the cost functionsdeltas
between sequencesx
andy
,edist.adp.backtrace(x, y, grammar, deltas)
computes the backtracing for said edit distance,edist.adp.backtrace_stochastic(x, y, grammar, deltas)
does the same, but returns a random optimal alignment instead of a fixed one, andedist.adp.backtrace_matrix(x, y, grammar, deltas)
does the same, but returns a probability distribution over all pairings between elements ofx
andy
.
edist.alignment
models backtraces/alignments between sequences or trees. Instances of classedist.alignment.Alignment
are returned by every backtrace function (except forbacktrace_matrix
functions).edist.bedl
supports embedding edit distance learning (BEDL; Paaßen et al., 2018) to learn parameters for edit distance instead of learning them manually. Please refer to thebedl_demo
for more information.edist.edits
supports objects that model sequence edits, in particular replacements, deletions, and insertions, and provides the functionalignment_to_script(alignment, x, y)
, which transforms the alignmentalignment
between the sequencesx
andy
into a list of edits that transformx
intoy
.edist.tree_edits
supports objects that model tree edits, in particular replacements, deletions, and insertions, and provides the functionalignment_to_script(alignment, x_nodes, y_adj, y_nodes, y_adj)
, which transforms the alignmentalignment
between the treesx
andy
into a list of edits that transformx
intoy
.edist.tree_utils
is a collection of supporting functions for tree handling used by the library. Interesting for users of the library may be the following:edist.tree_utils.to_json
writes a tree to a JSON file.edist.tree_utils.from_json
reads a tree from a JSON file.edist.tree_utils.dataset_from_json
reads a list of trees from a directory containing JSON files.edist.tree_utils.tree_to_string
formats a tree as a string.
Background
The background for all algorithms covered in this library by far exceeds the scope of this Readme (we list material for further reading below). However, there are a few general points that are worth noting in short:
 Edit distance algorithms heavily rely on dynamic programming to be
efficient, i.e. to decompose the overall edit distance computation into
subtasks and to tabulate the results of these subtasks. In this way, we
can search an exponentially large space of possible alignments in polynomial
time. However, these decompositions rely on the critical assumption that
the element distance function $
\delta
$ is a metric, especially that this function is nonnegative, zero for selfdistances, and fulfills the triangular inequality. If any of these assumptions is broken, there may exist cheaper alignments that are not covered by the dynamic programming computation. So this is critical when generating your own $\delta
$ functions.  Interestingly, if $
\delta
$ is a metric, its metric properties translate to the overall edit distance (refer, e.g., to Theorem 3.2. in Paaßen, 2019). This can make handling edit distances quite appealing, mathematically. However, this does not hold for dynamic time warping, which always violates the triangular inequality.  Even though dynamic programming makes edit distances polynomial, computing
them can become prohibitively expensive for long sequences/large trees. In
particular, any sequence edit distance lies in $
\mathcal{O}(m \cdot n)
$, where $m
$ and $n
$ are the lengths of the input sequences, the tree edit distance lies in $\mathcal{O}(m^2 \cdot n^2)
$, and the set edit distance in $\mathcal{O}((m+n)^3)
$. Fortunately, the cython implementation provided in this library is relatively fast and thus can cope with $m, n
$ even up to a few thousand elements (at least for sequence edits). Still, it is key that you choose the edit distance function that is best fitting to your case. For example,edist.sed.sed_string
is about factor 15 faster compared to the more generaledist.sed.sed
.
For more background on the algorithms, we refer to the Wikipedia articles for the Levenshtein distance and dynamic time warping, to the paper of Gotoh (1982) with respect to the affine edit distance, to the review paper of Paaßen (2018) with respect to the tree edit distance and its backtracing, to Section 2.3.2 of the dissertation of Paaßen (2019) with respect to algebraic dynamic programming, and to Chapter 4 of the same dissertation with respect to embedding edit distance learning.
Licensing
This library is licensed under the GNU General Public License Version 3.
Dependencies
This library depends on NumPy for matrix operations, and on cython
for the effective Cinterface. Further, the bedl.py
module depends on
scikitlearn for the base interfaces and on SciPy for
optimization. Finally, the seted.pyx
module depends on SciPy for
an implementation of the Hungarian algorithm (Kuhn, 1955).
Literature
 Giegerich, R., Meyer, C., & Steffen, P. (2004). A discipline of dynamic programming over sequence data. Science of Computer Programming, 51(3), 215263. doi:10.1016/j.scico.2003.12.005
 Gotoh, O. (1982). An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162(3), 705708. doi:10.1016/00222836(82)903989
 Kuhn, H. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(12), 8397. doi:10.1002/nav.3800020109
 Levenshtein, V. (1965). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8), 707710.
 Paaßen, B., Mokbel, B., & Hammer, B. (2015). A Toolbox for Adaptive Sequence Dissimilarity Measures for Intelligent Tutoring Systems. In O. C. Santos, J. G. Boticario, C. Romero, M. Pechenizkiy, A. Merceron, P. Mitros, J. M. Luna, et al. (Eds.), Proceedings of the 8th International Conference on Educational Data Mining (pp. 632632). International Educational Datamining Society. Link
 Paaßen, B., Mokbel, B., & Hammer, B. (2016). Adaptive structure metrics for automated feedback provision in intelligent tutoring systems. Neurocomputing, 192, 313. doi:10.1016/j.neucom.2015.12.108. Link
 Paaßen, B., Gallicchio, C., Micheli, A., & Hammer, B. (2018). Tree Edit Distance Learning via Adaptive Symbol Embeddings. Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 39733982. Link
 Paaßen, B. (2018). Revisiting the tree edit distance and its backtracing: A tutorial. arXiv:1805.06869.
 Paaßen, B. (2019). Metric Learning for Structured Data. Dissertation. Bielefeld University. doi:10.4119/unibi/2935545
 Vintsyuk, T.K. (1968). Speech discrimination by dynamic programming. Cybernetics, 4(1), 5257. doi:10.1007/BF01074755
 Zhang, K., & Shasha, D. (1989). Simple Fast Algorithms for the Editing Distance between Trees and Related Problems. SIAM Journal on Computing, 18(6), 12451262. doi:10.1137/0218082
Project details
Release history Release notifications  RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Filename, size  File type  Python version  Upload date  Hashes 

Filename, size edist1.1.0cp37cp37mmanylinux1_x86_64.whl (3.5 MB)  File type Wheel  Python version cp37  Upload date  Hashes View 
Filename, size edist1.1.0cp37cp37mmanylinux2010_x86_64.whl (3.5 MB)  File type Wheel  Python version cp37  Upload date  Hashes View 
Filename, size edist1.1.0cp37cp37mmanylinux2014_x86_64.whl (3.3 MB)  File type Wheel  Python version cp37  Upload date  Hashes View 
Filename, size edist1.1.0cp38cp38manylinux1_x86_64.whl (4.2 MB)  File type Wheel  Python version cp38  Upload date  Hashes View 
Filename, size edist1.1.0cp38cp38manylinux2010_x86_64.whl (4.2 MB)  File type Wheel  Python version cp38  Upload date  Hashes View 
Filename, size edist1.1.0cp38cp38manylinux2014_x86_64.whl (3.8 MB)  File type Wheel  Python version cp38  Upload date  Hashes View 
Filename, size edist1.1.0.tar.gz (898.6 kB)  File type Source  Python version None  Upload date  Hashes View 
Hashes for edist1.1.0cp37cp37mmanylinux1_x86_64.whl
Algorithm  Hash digest  

SHA256  95641c6f963f466f07a54ec60fa2897c0eb2e732876af222abc80d6f77c7b031 

MD5  ecd4f9be1aaaa9a498fd5336501aeb29 

BLAKE2256  f5e92fdd4691fa32e0ee20afa0fabadc63341088803e496516e336dea59a7807 
Hashes for edist1.1.0cp37cp37mmanylinux2010_x86_64.whl
Algorithm  Hash digest  

SHA256  57aa4c76fb989dac94c6b1298a1b1552244b80cb12b867d459fd1bb948cf2646 

MD5  148281b6055de91b00480f7942d7c895 

BLAKE2256  af17fbe4b9741c00dc184dda4c898fe57617810664e432a4b435f930da1ed42e 
Hashes for edist1.1.0cp37cp37mmanylinux2014_x86_64.whl
Algorithm  Hash digest  

SHA256  3d83e0ec5004e5c4a952527df2c3300d976a18ff31250f3917b9cd97f0742238 

MD5  a6bf7f36a21a05604521449f2a35737d 

BLAKE2256  c7e97d0f82e2e9b3714b27a336bcbf037d22b0ef0ec625eb552ed7e796421390 
Hashes for edist1.1.0cp38cp38manylinux1_x86_64.whl
Algorithm  Hash digest  

SHA256  da37ef0fff4855c9c4178a700907d059773a67e99b50b58f45834a26b5d0fd2c 

MD5  b1a4bba6563f7196dfd5c7b66980cda4 

BLAKE2256  d76e014f6fd682ffc8ff362ae14d83ce58c5dd71245db0f27645fa24967493f4 
Hashes for edist1.1.0cp38cp38manylinux2010_x86_64.whl
Algorithm  Hash digest  

SHA256  5954f781475ef97cbc30bbf172e56b1eab580a146277601bdc99bdfe2785f500 

MD5  dfe6b8473c743f4ed2c24c42c695f859 

BLAKE2256  b6f7064f60cbfa80cd074ab278aa2c2ad139e148a827a8edd1828ae89b43d505 
Hashes for edist1.1.0cp38cp38manylinux2014_x86_64.whl
Algorithm  Hash digest  

SHA256  afea331acf5d7932da3625ce8799b766fd3f9798ce5aea43b94e9d3fa3aa6bab 

MD5  9e4900273346b9ee647190c6b8aa81bd 

BLAKE2256  83605a3d7345e392724254d678c1cc8ee03f1e208f8f4ed05a80c50dc9273f5c 