Lightweight, super fast library for sequence alignment using edit (Levenshtein) distance.

edlib.align("hello", "world")

Edlib is actually a C/C++ library, and this package is its Python wrapper. Python Edlib has mostly the same API as C/C++ Edlib, so check out the C/C++ Edlib docs for more code examples, API details, and an explanation of how Edlib works.

Features

  • Calculates edit distance.

  • It can find the optimal alignment path (instructions on how to transform the first sequence into the second sequence).

  • It can find just the start and/or end locations of the alignment path, which is useful when speed matters more than having the exact alignment path.

  • Supports multiple alignment methods: global (NW), prefix (SHW) and infix (HW), each of them useful for different scenarios.

  • You can extend the definition of character equality, enabling e.g. wildcard characters, case-insensitive alignment, or degenerate nucleotides.

  • It can easily handle small or very large sequences, even when finding alignment path.

  • Super fast thanks to Myers’s bit-vector algorithm.
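
To make the difference between the three alignment methods concrete, here is a minimal pure-Python reference sketch based on the textbook dynamic-programming recurrence. It is not how Edlib computes the result (Edlib uses Myers's bit-vector algorithm), and the name reference_distance is hypothetical. It assumes NW aligns the whole query to the whole target, SHW lets the alignment end anywhere in the target, and HW lets it both start and end anywhere in the target.

```python
def reference_distance(query, target, mode="NW"):
    """Edit distance of query against target under the NW, SHW or HW method.

    Slow O(len(query) * len(target)) reference, for illustration only.
    """
    if mode not in ("NW", "SHW", "HW"):
        raise ValueError("mode must be 'NW', 'SHW' or 'HW'")
    m = len(target)
    # Row 0: cost of aligning an empty query prefix so that it ends at target[:j].
    # HW makes skipping the start of the target free, so the whole row is 0.
    prev = [0] * (m + 1) if mode == "HW" else list(range(m + 1))
    for i in range(1, len(query) + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if query[i - 1] == target[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # query character left unmatched
                          curr[j - 1] + 1,     # target character left unmatched
                          prev[j - 1] + cost)  # match or mismatch
        prev = curr
    # NW must consume the whole target; SHW and HW may stop at any target position.
    return prev[m] if mode == "NW" else min(prev)


for mode in ("NW", "SHW", "HW"):
    print(mode, reference_distance("ACTG", "CACTGT", mode))
# NW 2
# SHW 1
# HW 0
```

For the same inputs, the editDistance reported by edlib.align with the corresponding mode should agree with this reference, only much faster.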

Installation

pip install edlib

API

Edlib has only one function:

align(query, target, [mode], [task], [k], [additionalEqualities])

To learn more about it, type help(edlib.align) in your Python interpreter.

Usage

import edlib

result = edlib.align("elephant", "telephone")
print(result["editDistance"])  # 3
print(result["alphabetLength"])  # 8
print(result["locations"])  # [(None, 8)]
print(result["cigar"])  # None

result = edlib.align("ACTG", "CACTRT", mode="HW", task="path", additionalEqualities=[("R", "A"), ("R", "G")])
print(result["editDistance"])  # 0
print(result["alphabetLength"])  # 5
print(result["locations"])  # [(1, 4)]
print(result["cigar"])  # "4="
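
The cigar value returned with task="path" can be broken down with a few lines of standard-library Python. This is a sketch, not part of Edlib: parse_cigar and summarize are hypothetical helper names, and it assumes the string uses the extended CIGAR operations seen above ("=" for match, plus "X" for mismatch, "I" for insertion and "D" for deletion).

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([=XID])")


def parse_cigar(cigar):
    """Split an extended CIGAR string such as "3=1X2I" into (count, op) pairs."""
    ops = [(int(count), op) for count, op in _CIGAR_OP.findall(cigar)]
    # Rebuild the string to make sure nothing was silently skipped.
    if "".join(f"{count}{op}" for count, op in ops) != cigar:
        raise ValueError(f"unexpected CIGAR string: {cigar!r}")
    return ops


def summarize(cigar):
    """Total number of positions per operation."""
    totals = {"=": 0, "X": 0, "I": 0, "D": 0}
    for count, op in parse_cigar(cigar):
        totals[op] += count
    return totals


print(parse_cigar("4="))      # [(4, '=')]
print(summarize("3=1X2I1D"))  # {'=': 3, 'X': 1, 'I': 2, 'D': 1}
```

The mismatch, insertion and deletion totals add up to the edit distance of the alignment. Note that cigar is None unless task="path" was requested, so check for that before parsing.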

Benchmark

I ran a simple benchmark on 7 Feb 2017 (using timeit, on Python 3) to get a feel for how Edlib compares to other Python libraries: editdistance and python-Levenshtein.

As input data I used pairs of DNA sequences of different lengths, where each pair has about 90% similarity.

#1: query length: 30, target length: 30
edlib.align(query, target): 1.88µs
editdistance.eval(query, target): 1.26µs
Levenshtein.distance(query, target): 0.43µs

#2: query length: 100, target length: 100
edlib.align(query, target): 3.64µs
editdistance.eval(query, target): 3.86µs
Levenshtein.distance(query, target): 14.1µs

#3: query length: 1000, target length: 1000
edlib.align(query, target): 0.047ms
editdistance.eval(query, target): 5.4ms
Levenshtein.distance(query, target): 1.9ms

#4: query length: 10000, target length: 10000
edlib.align(query, target): 0.0021s
editdistance.eval(query, target): 0.56s
Levenshtein.distance(query, target): 0.2s

#5: query length: 50000, target length: 50000
edlib.align(query, target): 0.031s
editdistance.eval(query, target): 13.8s
Levenshtein.distance(query, target): 5.0s
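
A comparison along these lines could be reproduced with a small timeit harness like the sketch below. The original benchmark script and its input data are not shown here, so everything in it is an assumption: make_pair builds a random DNA query and derives the target by substituting roughly 10% of the bases, and the helper names are hypothetical. Libraries that are not installed are skipped.

```python
import importlib
import random
import timeit

# (module, function) pairs taken from the benchmark listing above.
CANDIDATES = [("edlib", "align"), ("editdistance", "eval"), ("Levenshtein", "distance")]


def make_pair(length, similarity=0.9, seed=0):
    """Random DNA query plus a target that keeps each base with probability `similarity`."""
    rng = random.Random(seed)
    query = [rng.choice("ACGT") for _ in range(length)]
    target = [
        base if rng.random() < similarity
        else rng.choice([b for b in "ACGT" if b != base])
        for base in query
    ]
    return "".join(query), "".join(target)


def load_candidates():
    """Map "module.function" names to callables, skipping missing libraries."""
    funcs = {}
    for module_name, func_name in CANDIDATES:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue
        funcs[f"{module_name}.{func_name}"] = getattr(module, func_name)
    return funcs


def seconds_per_call(func, query, target, number=100):
    """Average wall-clock time of func(query, target) over `number` calls."""
    return timeit.timeit(lambda: func(query, target), number=number) / number


if __name__ == "__main__":
    query, target = make_pair(1000)
    for name, func in load_candidates().items():
        elapsed = seconds_per_call(func, query, target)
        print(f"{name}(query, target): {elapsed * 1e6:.2f}µs")
```

Absolute numbers depend on the machine and library versions, so only the relative ordering is comparable with the figures above.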

More

Check out C/C++ Edlib docs for more information about Edlib!

Development

Run make build to generate the extension module as a .so file. You can then test it by running import edlib in the Python interpreter and calling edlib.align(...) (you must be in the directory where the .so file was built). This is useful for testing while developing.

Run make sdist to create a source distribution without publishing it. The result is a tarball in dist/, the same one that gets uploaded to PyPI on publish. Use this to check that the tarball is well structured and contains all needed files before you publish. A good way to test it is to run sudo pip install dist/edlib-*.tar.gz, which tries to install edlib from the tarball the same way pip will once it is published.

Run make publish to create a source distribution and publish it to PyPI. Use this to publish a new version of the package. Make sure to bump the version in setup.py before publishing, if needed.

make clean removes all generated files.
