Edit distance implementations in cython
Project description
Python Edit Distances
Copyright (C) 2019  Benjamin Paassen
Machine Learning Research Group
Center of Excellence Cognitive Interaction Technology (CITEC)
Bielefeld University
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, see http://www.gnu.org/licenses/.
Introduction
This library contains several edit distance and alignment algorithms for sequences and trees of arbitrary node type. Additionally, this library contains multiple backtracing mechanisms for every algorithm in order to facilitate more detailed interpretation and subsequent processing. Finally, this library provides a reference implementation for embedding edit distance learning (BEDL; Paaßen et al., 2018) which enables users to learn edit distance parameters instead of specifying them manually.
If you use this library in academic work, please cite:
 Paaßen, B., Mokbel, B., & Hammer, B. (2015). A Toolbox for Adaptive Sequence Dissimilarity Measures for Intelligent Tutoring Systems. In O. C. Santos, J. G. Boticario, C. Romero, M. Pechenizkiy, A. Merceron, P. Mitros, J. M. Luna, et al. (Eds.), Proceedings of the 8th International Conference on Educational Data Mining (pp. 632632). International Educational Datamining Society. (Link)
This library is historically based on its Java version, the TCS Alignment Toolbox.
Installation
This package is available on pypi as edist
. You can install
it via
pip install edist
If you wish to build this project from source, you need to first install cython and then execute the following commands in this directory:
python3 cython_setup.py build_ext inplace
cp *so edist/.
Quickstart Guide
There are multiple example cases illustrated in our demo notebooks. In particular:
sed_demo.ipynb
illustrates the Levenshtein distance (Levenshtein, 1965) and affine edit distance (Gotoh, 1982) as well as its backtracing,dtw_demo.ipynb
illustrates dynamic time warping (Vintsyuk, 1968) as well as its backtracing and speedup measures,ted_demo.ipynb
illustrates the tree edit distance (Zhang and Shasha, 1989) as well as its backtracing and support for edit functions, andbedl_demo.ipynb
illustrates embedding edit distance learning (Paaßen et al., 2018).
In general, applying this library works as follows. First, you select the
function that best fits for your data and your setting (see below for an
overview of all available functions). Let's say your function is called
distfun
. Then, you can compute the distance between two lists/trees
x
and y
via distfun(x, y)
. If you wish to compute the matrix of all
pairwise distances for an entire dataset of lists/trees X
, then you can
use the multiprocess
module as follows.
from edist.multiprocess import pairwise_distances_symmetric
D = pairwise_distances_symmetric(X, distfun)
If you wish to compute the matrix of all pairwise distances between one
dataset X
and another dataset Y
, you can use the following function.
from edist.multiprocess import pairwise_distances
D = pairwise_distances(X, Y, distfun)
If you wish to use a custom local distance function delta
, you can supply
it as additional argument to all these functions.
If you wish to compute the optimal alignment between two lists/trees x
and y
according to distfun
, you can use the function
distfun_backtrace(x, y)
. Note that, in case of multiple possible optimal
alignments, this function will always return the option that uses replacements
as early as possible. If you instead wish to sample a random optimal alignment,
you can use distfun_backtrace_stochastic(x, y)
. Unfortunately, it is
infeasible to enumerate the entire set of cooptimal alignments because this
set may be exponentially large. However, it is possible to characterize the
distribution of cooptimal alignments concisely by describing with which
probability each node in x
is paired with each node in y
. This probability
matrix is computed by the function distfun_backtrace_matrix(x, y)
and follows
the forwardbackward algorithm developed by Paaßen (2018).
List of Algorithms and Functions
The following edit distance algorithms and functions are contained in this library.
 The Levenshtein distance/sequence edit distance (Levenshtein, 1965):
edist.sed.standard_sed(x, y)
for edit distance computation between sequencesx
andy
with a cost of 1 for each replacement, deletion, and insertion.edist.sed.sed_string(x, y)
for the same, but specifically designed for strings and thus considerably faster (~factor 3).edist.sed.standard_sed_backtrace(x, y)
for backtracing for the standard edit distance.edist.sed.standard_sed_backtrace_stochastic(x, y)
for the same, but returning a random optimal alignment instead of a fixed one.edist.sed.standard_sed_backtrace_matrix(x, y)
for the same, but returning a probability distribution over all pairings between elements ofx
andy
.edist.sed.sed(x, y, delta)
for edit distance computation with a custom element distance functiondelta
.edist.sed.sed_backtrace(x, y, delta)
for backtracing for the edit distance with a custom element distance functiondelta
.edist.sed.sed_backtrace_stochastic(x, y, delta)
for the same, but returning a random optimal alignment instead of a fixed one.edist.sed.sed_backtrace_matrix(x, y, delta)
for the same, but returning a probability distribution over all pairings between elements ofx
andy
.
 The dynamic time warping distance (DTW; Vintsyuk, 1968):
edist.dtw.dtw_numeric(x, y)
for DTW computation between two time seriesx
andy
, each given as a double array.edist.dtw.dtw_manhattan(x, y)
for DTW computation between two time seriesx
andy
, each given as a double matrix. The distance between two frames is defined as the Manhattan distance.edist.dtw.dtw_euclidean(x, y)
for DTW computation between two time seriesx
andy
, each given as a double matrix. The distance between two frames is defined as the Euclidean distance.edist.dtw.dtw_string(x, y)
for DTW computation between two stringsx
andy
.edist.dtw.dtw(x, y, delta)
for DTW computation between two arbitrary sequencesx
andy
with a custom element distance functiondelta
.edist.dtw.dtw_backtrace(x, y, delta)
for backtracing for DTW with a custom element distance functiondelta
.edist.dtw.dtw_backtrace_stochastic(x, y, delta)
for the same, but returning a random optimal alignment instead of a fixed one.edist.dtw.dtw_backtrace_matrix(x, y, delta)
for the same, but returning a probability distribution over all pairings between elements ofx
andy
.
 The affine edit distance (Gotoh, 1982):
edist.aed.aed(x, y, rep, gap, skip)
for affine edit distance computation between two arbitrary sequencesx
andy
, where each frame replacement is scored with the functionrep
, each deletion and insertion is scored with the functiongap
, and each deletion and insertion extension is scored with the functionskip
.edist.aed.aed_backtrace(x, y, rep, gap, skip)
for backtracing for the affine edit distance with the replacement cost functionrep
, the deletion/insertion cost functiongap
, and the gap extension cost functionskip
.edist.aed.aed_backtrace_stochastic(x, y, delta)
for the same, but returning a random optimal alignment instead of a fixed one.edist.aed.aed_backtrace_matrix(x, y, delta)
for the same, but returning a probability distribution over all pairings between elements ofx
andy
.
 The tree edit distance (TED; Zhang and Shasha, 1989):
edist.ted.standard_ted(x_nodes, x_adj, y_nodes, y_adj)
for edit distance computation between the treesx
andy
, which are both given in a node list/adjacency list format. Both lists are supposed to be in depthfirstsearch order, e.g. a tree a(b, c) is supposed to be represented as the two lists['a', 'b', 'c']
and[[1, 2], [], []]
. The cost for replacements, deletions, and insertions is fixed to 1.edist.ted.standard_ted_backtrace(x_nodes, x_adj, y_nodes, y_adj)
for backtracing for the tree edit distance.edist.sed.standard_sed_backtrace_matrix(x_nodes, x_adj, y_nodes, y_adj)
for the same, but returning a probability distribution over all pairings between elements ofx
andy
.edist.ted.ted(x_nodes, x_adj, y_nodes, delta)
for tree edit distance computation with a custom node distance functiondelta
.edist.ted.ted_backtrace(x_nodes, x_adj, y_nodes, delta)
for backtracing for the tree edit distance with a custom element distance functiondelta
.edist.ted.ted_backtrace_matrix(x_nodes, x_adj, y_nodes, delta)
for the same, but returning a probability distribution over all pairings between elements ofx
andy
.
Additionally, this library contains a few helper modules, namely:
edist.adp
contains functions to compute arbitrary sequence edit distances that can be defined by a regular grammar. This is based on the framework of algebraic dynamic programming (ADP; Giegerich, Meyer, and Steffen, 2004), as applied by Paaßen, Mokbel, and Hammer (2016). In particular:edist.adp.edit_distance(x, y, grammar, deltas)
computes the sequence edit distance defined by the regular grammargrammar
and the cost functionsdeltas
between sequencesx
andy
,edist.adp.backtrace(x, y, grammar, deltas)
computes the backtracing for said edit distance,edist.adp.backtrace_stochastic(x, y, grammar, deltas)
does the same, but returns a random optimal alignment instead of a fixed one, andedist.adp.backtrace_matrix(x, y, grammar, deltas)
does the same, but returns a probability distribution over all pairings between elements ofx
andy
.
edist.alignment
models backtraces/alignments between sequences or trees. Instances of classedist.alignment.Alignment
are returned by every backtrace function (except forbacktrace_matrix
functions).edist.bedl
supports embedding edit distance learning (BEDL; Paaßen et al., 2018) to learn parameters for edit distance instead of learning them manually. Please refer to thebedl_demo
for more information.edist.edits
supports objects that model sequence edits, in particular replacements, deletions, and insertions, and provides the functionalignment_to_script(alignment, x, y)
, which transforms the alignmentalignment
between the sequencesx
andy
into a list of edits that transformx
intoy
.edist.tree_edits
supports objects that model tree edits, in particular replacements, deletions, and insertions, and provides the functionalignment_to_script(alignment, x_nodes, y_adj, y_nodes, y_adj)
, which transforms the alignmentalignment
between the treesx
andy
into a list of edits that transformx
intoy
.edist.tree_utils
is a collection of supporting functions for tree handling used by the library. Interesting for users of the library may be the following:edist.tree_utils.to_json
writes a tree to a JSON file.edist.tree_utils.from_json
reads a tree from a JSON file.edist.tree_utils.dataset_from_json
reads a list of trees from a directory containing JSON files.edist.tree_utils.tree_to_string
formats a tree as a string.
Background
The background for all algorithms covered in this library by far exceeds the scope of this Readme (we list material for further reading below). However, there are a few general points that are worth noting in short:
 Edit distance algorithms heavily rely on dynamic programming to be
efficient, i.e. to decompose the overall edit distance computation into
subtasks and to tabulate the results of these subtasks. In this way, we
can search an exponentially large space of possible alignments in polynomial
time. However, these decompositions rely on the critical assumption that
the element distance function $
\delta
$ is a metric, especially that this function is nonnegative, zero for selfdistances, and fulfills the triangular inequality. If any of these assumptions is broken, there may exist cheaper alignments that are not covered by the dynamic programming computation. So this is critical when generating your own $\delta
$ functions.  Interestingly, if $
\delta
$ is a metric, its metric properties translate to the overall edit distance (refer, e.g., to Theorem 3.2. in Paaßen, 2019). This can make handling edit distances quite appealing, mathematically. However, this does not hold for dynamic time warping, which always violates the triangular inequality.  Even though dynamic programming makes edit distances polynomial, computing
them can become prohibitively expensive for long sequences/large trees. In
particular, any sequence edit distance lies in $
\mathcal{O}(m \cdot n)
$, where $m
$ and $n
$ are the lengths of the input sequences, and the tree edit distance lies in $\mathcal{O}(m^2 \cdot n^2)
$. Fortunately, the cython implementation provided in this library is relatively fast and thus can cope with $m, n
$ even up to a few thousand elements. Still, it is key that you choose the edit distance function that is best fitting to your case. For example,edist.sed.sed_string
is about factor 15 faster compared to the more generaledist.sed.sed
.
For more background on the algorithms, we refer to the Wikipedia articles for the Levenshtein distance and dynamic time warping, to the paper of Gotoh (1982) with respect to the affine edit distance, to the review paper of Paaßen (2018) with respect to the tree edit distance and its backtracing, to Section 2.3.2 of the dissertation of Paaßen (2019) with respect to algebraic dynamic programming, and to Chapter 4 of the same dissertation with respect to embedding edit distance learning.
Licensing
This library is licensed under the GNU General Public License Version 3.
Dependencies
This library depends on NumPy for matrix operations, and on cython
for the effective Cinterface. Further, the bedl.py
module depends on
scikitlearn for the base interfaces and on SciPy for
optimization.
Literature
 Giegerich, R., Meyer, C., & Steffen, P. (2004). A discipline of dynamic programming over sequence data. Science of Computer Programming, 51(3), 215263. doi:10.1016/j.scico.2003.12.005
 Gotoh, O. (1982). An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162(3), 705708. doi:10.1016/00222836(82)903989
 Paaßen, B., Mokbel, B., & Hammer, B. (2015). A Toolbox for Adaptive Sequence Dissimilarity Measures for Intelligent Tutoring Systems. In O. C. Santos, J. G. Boticario, C. Romero, M. Pechenizkiy, A. Merceron, P. Mitros, J. M. Luna, et al. (Eds.), Proceedings of the 8th International Conference on Educational Data Mining (pp. 632632). International Educational Datamining Society. Link
 Paaßen, B., Mokbel, B., & Hammer, B. (2016). Adaptive structure metrics for automated feedback provision in intelligent tutoring systems. Neurocomputing, 192, 313. doi:10.1016/j.neucom.2015.12.108. Link
 Paaßen, B., Gallicchio, C., Micheli, A., & Hammer, B. (2018). Tree Edit Distance Learning via Adaptive Symbol Embeddings. Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 39733982. Link
 Paaßen, B. (2018). Revisiting the tree edit distance and its backtracing: A tutorial. arXiv:1805.06869.
 Paaßen, B. (2019). Metric Learning for Structured Data. Dissertation. Bielefeld University. doi:10.4119/unibi/2935545
 Vintsyuk, T.K. (1968). Speech discrimination by dynamic programming. Cybernetics, 4(1), 5257. doi:10.1007/BF01074755
 Zhang, K., & Shasha, D. (1989). Simple Fast Algorithms for the Editing Distance between Trees and Related Problems. SIAM Journal on Computing, 18(6), 12451262. doi:10.1137/0218082
Project details
Release history Release notifications
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Filename, size  File type  Python version  Upload date  Hashes 

Filename, size edist1.0.0cp37cp37mmanylinux1_x86_64.whl (3.1 MB)  File type Wheel  Python version cp37  Upload date  Hashes View hashes 
Filename, size edist1.0.0cp37cp37mmanylinux2010_x86_64.whl (3.1 MB)  File type Wheel  Python version cp37  Upload date  Hashes View hashes 
Filename, size edist1.0.0.tar.gz (757.6 kB)  File type Source  Python version None  Upload date  Hashes View hashes 
Hashes for edist1.0.0cp37cp37mmanylinux1_x86_64.whl
Algorithm  Hash digest  

SHA256  4218e2cbe9a6ef8928090eaaf3a1da412b07571bec86f512b995a68ec17734c4 

MD5  b53b80bc9eab5f63b79dc17de441cdbd 

BLAKE2256  12d6bf5a32b3df4e1eda46119e5536bf4888f2c18e438216b584a8b3ead6b6ae 
Hashes for edist1.0.0cp37cp37mmanylinux2010_x86_64.whl
Algorithm  Hash digest  

SHA256  ed339e6b0857ff8cdd896e72c5c84c3ab9f30416eef7ad61765572d79345560d 

MD5  f82346df633d65fe565338fe10c432ab 

BLAKE2256  f674e88ddac93dcaa7a0808b55c249e102d79dec469d2cf0c18036f29a9d7107 