Skip to main content

Edit distance implementations in cython

Project description

Python Edit Distances

Copyright (C) 2019 - Benjamin Paassen
Machine Learning Research Group
Center of Excellence Cognitive Interaction Technology (CITEC)
Bielefeld University

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, see http://www.gnu.org/licenses/.

Introduction

This library contains several edit distance and alignment algorithms for sequences and trees of arbitrary node type. Additionally, this library contains multiple backtracing mechanisms for every algorithm in order to facilitate more detailed interpretation and subsequent processing. Finally, this library provides a reference implementation for embedding edit distance learning (BEDL; Paaßen et al., 2018) which enables users to learn edit distance parameters instead of specifying them manually.

If you use this library in academic work, please cite:

  • Paaßen, B., Mokbel, B., & Hammer, B. (2015). A Toolbox for Adaptive Sequence Dissimilarity Measures for Intelligent Tutoring Systems. In O. C. Santos, J. G. Boticario, C. Romero, M. Pechenizkiy, A. Merceron, P. Mitros, J. M. Luna, et al. (Eds.), Proceedings of the 8th International Conference on Educational Data Mining (pp. 632-632). International Educational Datamining Society. (Link)

This library is historically based on its Java version, the TCS Alignment Toolbox.

Installation

This package is available on pypi as edist. You can install it via

pip install edist

If you wish to build this project from source, you need to first install cython and then execute the following commands in this directory:

python3 cython_setup.py build_ext --inplace
cp *so edist/.

Quickstart Guide

There are multiple example cases illustrated in our demo notebooks. In particular:

  • sed_demo.ipynb illustrates the Levenshtein distance (Levenshtein, 1965) and affine edit distance (Gotoh, 1982) as well as its backtracing,
  • dtw_demo.ipynb illustrates dynamic time warping (Vintsyuk, 1968) as well as its backtracing and speedup measures,
  • ted_demo.ipynb illustrates the tree edit distance (Zhang and Shasha, 1989) as well as its backtracing and support for edit functions, and
  • bedl_demo.ipynb illustrates embedding edit distance learning (Paaßen et al., 2018).

In general, applying this library works as follows. First, you select the function that best fits for your data and your setting (see below for an overview of all available functions). Let's say your function is called distfun. Then, you can compute the distance between two lists/trees x and y via distfun(x, y). If you wish to compute the matrix of all pairwise distances for an entire dataset of lists/trees X, then you can use the multiprocess module as follows.

from edist.multiprocess import pairwise_distances_symmetric
D = pairwise_distances_symmetric(X, distfun)

If you wish to compute the matrix of all pairwise distances between one dataset X and another dataset Y, you can use the following function.

from edist.multiprocess import pairwise_distances
D = pairwise_distances(X, Y, distfun)

If you wish to use a custom local distance function delta, you can supply it as additional argument to all these functions.

If you wish to compute the optimal alignment between two lists/trees x and y according to distfun, you can use the function distfun_backtrace(x, y). Note that, in case of multiple possible optimal alignments, this function will always return the option that uses replacements as early as possible. If you instead wish to sample a random optimal alignment, you can use distfun_backtrace_stochastic(x, y). Unfortunately, it is infeasible to enumerate the entire set of co-optimal alignments because this set may be exponentially large. However, it is possible to characterize the distribution of co-optimal alignments concisely by describing with which probability each node in x is paired with each node in y. This probability matrix is computed by the function distfun_backtrace_matrix(x, y) and follows the forward-backward algorithm developed by Paaßen (2018).

List of Algorithms and Functions

The following edit distance algorithms and functions are contained in this library.

  • The Levenshtein distance/sequence edit distance (Levenshtein, 1965):
    • edist.sed.standard_sed(x, y) for edit distance computation between sequences x and y with a cost of 1 for each replacement, deletion, and insertion.
    • edist.sed.sed_string(x, y) for the same, but specifically designed for strings and thus considerably faster (~factor 3).
    • edist.sed.standard_sed_backtrace(x, y) for backtracing for the standard edit distance.
    • edist.sed.standard_sed_backtrace_stochastic(x, y) for the same, but returning a random optimal alignment instead of a fixed one.
    • edist.sed.standard_sed_backtrace_matrix(x, y) for the same, but returning a probability distribution over all pairings between elements of x and y.
    • edist.sed.sed(x, y, delta) for edit distance computation with a custom element distance function delta.
    • edist.sed.sed_backtrace(x, y, delta) for backtracing for the edit distance with a custom element distance function delta.
    • edist.sed.sed_backtrace_stochastic(x, y, delta) for the same, but returning a random optimal alignment instead of a fixed one.
    • edist.sed.sed_backtrace_matrix(x, y, delta) for the same, but returning a probability distribution over all pairings between elements of x and y.
  • The dynamic time warping distance (DTW; Vintsyuk, 1968):
    • edist.dtw.dtw_numeric(x, y) for DTW computation between two time series x and y, each given as a double array.
    • edist.dtw.dtw_manhattan(x, y) for DTW computation between two time series x and y, each given as a double matrix. The distance between two frames is defined as the Manhattan distance.
    • edist.dtw.dtw_euclidean(x, y) for DTW computation between two time series x and y, each given as a double matrix. The distance between two frames is defined as the Euclidean distance.
    • edist.dtw.dtw_string(x, y) for DTW computation between two strings x and y.
    • edist.dtw.dtw(x, y, delta) for DTW computation between two arbitrary sequences x and y with a custom element distance function delta.
    • edist.dtw.dtw_backtrace(x, y, delta) for backtracing for DTW with a custom element distance function delta.
    • edist.dtw.dtw_backtrace_stochastic(x, y, delta) for the same, but returning a random optimal alignment instead of a fixed one.
    • edist.dtw.dtw_backtrace_matrix(x, y, delta) for the same, but returning a probability distribution over all pairings between elements of x and y.
  • The affine edit distance (Gotoh, 1982):
    • edist.aed.aed(x, y, rep, gap, skip) for affine edit distance computation between two arbitrary sequences x and y, where each frame replacement is scored with the function rep, each deletion and insertion is scored with the function gap, and each deletion and insertion extension is scored with the function skip.
    • edist.aed.aed_backtrace(x, y, rep, gap, skip) for backtracing for the affine edit distance with the replacement cost function rep, the deletion/insertion cost function gap, and the gap extension cost function skip.
    • edist.aed.aed_backtrace_stochastic(x, y, delta) for the same, but returning a random optimal alignment instead of a fixed one.
    • edist.aed.aed_backtrace_matrix(x, y, delta) for the same, but returning a probability distribution over all pairings between elements of x and y.
  • The tree edit distance (TED; Zhang and Shasha, 1989):
    • edist.ted.standard_ted(x_nodes, x_adj, y_nodes, y_adj) for edit distance computation between the trees x and y, which are both given in a node list/adjacency list format. Both lists are supposed to be in depth-first-search order, e.g. a tree a(b, c) is supposed to be represented as the two lists ['a', 'b', 'c'] and [[1, 2], [], []]. The cost for replacements, deletions, and insertions is fixed to 1.
    • edist.ted.standard_ted_backtrace(x_nodes, x_adj, y_nodes, y_adj) for backtracing for the tree edit distance.
    • edist.sed.standard_sed_backtrace_matrix(x_nodes, x_adj, y_nodes, y_adj) for the same, but returning a probability distribution over all pairings between elements of x and y.
    • edist.ted.ted(x_nodes, x_adj, y_nodes, delta) for tree edit distance computation with a custom node distance function delta.
    • edist.ted.ted_backtrace(x_nodes, x_adj, y_nodes, delta) for backtracing for the tree edit distance with a custom element distance function delta.
    • edist.ted.ted_backtrace_matrix(x_nodes, x_adj, y_nodes, delta) for the same, but returning a probability distribution over all pairings between elements of x and y.

Additionally, this library contains a few helper modules, namely:

  • edist.adp contains functions to compute arbitrary sequence edit distances that can be defined by a regular grammar. This is based on the framework of algebraic dynamic programming (ADP; Giegerich, Meyer, and Steffen, 2004), as applied by Paaßen, Mokbel, and Hammer (2016). In particular:
    • edist.adp.edit_distance(x, y, grammar, deltas) computes the sequence edit distance defined by the regular grammar grammar and the cost functions deltas between sequences x and y,
    • edist.adp.backtrace(x, y, grammar, deltas) computes the backtracing for said edit distance,
    • edist.adp.backtrace_stochastic(x, y, grammar, deltas) does the same, but returns a random optimal alignment instead of a fixed one, and
    • edist.adp.backtrace_matrix(x, y, grammar, deltas) does the same, but returns a probability distribution over all pairings between elements of x and y.
  • edist.alignment models backtraces/alignments between sequences or trees. Instances of class edist.alignment.Alignment are returned by every backtrace function (except for backtrace_matrix functions).
  • edist.bedl supports embedding edit distance learning (BEDL; Paaßen et al., 2018) to learn parameters for edit distance instead of learning them manually. Please refer to the bedl_demo for more information.
  • edist.edits supports objects that model sequence edits, in particular replacements, deletions, and insertions, and provides the function alignment_to_script(alignment, x, y), which transforms the alignment alignment between the sequences x and y into a list of edits that transform x into y.
  • edist.tree_edits supports objects that model tree edits, in particular replacements, deletions, and insertions, and provides the function alignment_to_script(alignment, x_nodes, y_adj, y_nodes, y_adj), which transforms the alignment alignment between the trees x and y into a list of edits that transform x into y.
  • edist.tree_utils is a collection of supporting functions for tree handling used by the library. Interesting for users of the library may be the following:
    • edist.tree_utils.to_json writes a tree to a JSON file.
    • edist.tree_utils.from_json reads a tree from a JSON file.
    • edist.tree_utils.dataset_from_json reads a list of trees from a directory containing JSON files.
    • edist.tree_utils.tree_to_string formats a tree as a string.

Background

The background for all algorithms covered in this library by far exceeds the scope of this Readme (we list material for further reading below). However, there are a few general points that are worth noting in short:

  • Edit distance algorithms heavily rely on dynamic programming to be efficient, i.e. to decompose the overall edit distance computation into subtasks and to tabulate the results of these subtasks. In this way, we can search an exponentially large space of possible alignments in polynomial time. However, these decompositions rely on the critical assumption that the element distance function $\delta$ is a metric, especially that this function is non-negative, zero for self-distances, and fulfills the triangular inequality. If any of these assumptions is broken, there may exist cheaper alignments that are not covered by the dynamic programming computation. So this is critical when generating your own $\delta$ functions.
  • Interestingly, if $\delta$ is a metric, its metric properties translate to the overall edit distance (refer, e.g., to Theorem 3.2. in Paaßen, 2019). This can make handling edit distances quite appealing, mathematically. However, this does not hold for dynamic time warping, which always violates the triangular inequality.
  • Even though dynamic programming makes edit distances polynomial, computing them can become prohibitively expensive for long sequences/large trees. In particular, any sequence edit distance lies in $\mathcal{O}(m \cdot n)$, where $m$ and $n$ are the lengths of the input sequences, and the tree edit distance lies in $\mathcal{O}(m^2 \cdot n^2)$. Fortunately, the cython implementation provided in this library is relatively fast and thus can cope with $m, n$ even up to a few thousand elements. Still, it is key that you choose the edit distance function that is best fitting to your case. For example, edist.sed.sed_string is about factor 15 faster compared to the more general edist.sed.sed.

For more background on the algorithms, we refer to the Wikipedia articles for the Levenshtein distance and dynamic time warping, to the paper of Gotoh (1982) with respect to the affine edit distance, to the review paper of Paaßen (2018) with respect to the tree edit distance and its backtracing, to Section 2.3.2 of the dissertation of Paaßen (2019) with respect to algebraic dynamic programming, and to Chapter 4 of the same dissertation with respect to embedding edit distance learning.

Licensing

This library is licensed under the GNU General Public License Version 3.

Dependencies

This library depends on NumPy for matrix operations, and on cython for the effective C-interface. Further, the bedl.py module depends on scikit-learn for the base interfaces and on SciPy for optimization.

Literature

  • Giegerich, R., Meyer, C., & Steffen, P. (2004). A discipline of dynamic programming over sequence data. Science of Computer Programming, 51(3), 215-263. doi:10.1016/j.scico.2003.12.005
  • Gotoh, O. (1982). An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162(3), 705-708. doi:10.1016/0022-2836(82)90398-9
  • Paaßen, B., Mokbel, B., & Hammer, B. (2015). A Toolbox for Adaptive Sequence Dissimilarity Measures for Intelligent Tutoring Systems. In O. C. Santos, J. G. Boticario, C. Romero, M. Pechenizkiy, A. Merceron, P. Mitros, J. M. Luna, et al. (Eds.), Proceedings of the 8th International Conference on Educational Data Mining (pp. 632-632). International Educational Datamining Society. Link
  • Paaßen, B., Mokbel, B., & Hammer, B. (2016). Adaptive structure metrics for automated feedback provision in intelligent tutoring systems. Neurocomputing, 192, 3-13. doi:10.1016/j.neucom.2015.12.108. Link
  • Paaßen, B., Gallicchio, C., Micheli, A., & Hammer, B. (2018). Tree Edit Distance Learning via Adaptive Symbol Embeddings. Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 3973-3982. Link
  • Paaßen, B. (2018). Revisiting the tree edit distance and its backtracing: A tutorial. arXiv:1805.06869.
  • Paaßen, B. (2019). Metric Learning for Structured Data. Dissertation. Bielefeld University. doi:10.4119/unibi/2935545
  • Vintsyuk, T.K. (1968). Speech discrimination by dynamic programming. Cybernetics, 4(1), 52-57. doi:10.1007/BF01074755
  • Zhang, K., & Shasha, D. (1989). Simple Fast Algorithms for the Editing Distance between Trees and Related Problems. SIAM Journal on Computing, 18(6), 1245-1262. doi:10.1137/0218082

Project details


Release history Release notifications

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Files for edist, version 1.0.0
Filename, size File type Python version Upload date Hashes
Filename, size edist-1.0.0-cp37-cp37m-manylinux1_x86_64.whl (3.1 MB) File type Wheel Python version cp37 Upload date Hashes View hashes
Filename, size edist-1.0.0-cp37-cp37m-manylinux2010_x86_64.whl (3.1 MB) File type Wheel Python version cp37 Upload date Hashes View hashes
Filename, size edist-1.0.0.tar.gz (757.6 kB) File type Source Python version None Upload date Hashes View hashes

Supported by

Elastic Elastic Search Pingdom Pingdom Monitoring Google Google BigQuery Sentry Sentry Error logging AWS AWS Cloud computing DataDog DataDog Monitoring Fastly Fastly CDN DigiCert DigiCert EV certificate StatusPage StatusPage Status page