ScienceBeam Alignment

License: MIT

ScienceBeam Alignment provides generic low-level sequence alignment utility functions, similar to Python's SequenceMatcher.

This project is currently used mainly for training data generation in the ScienceBeam project. However, it has no ScienceBeam dependency and can be considered a standalone sequence alignment library. It is targeted at document-size sequences rather than massive gene sequences.

Pre-requisites

  • Python 2 or 3

API

SequenceMatcher

A mostly drop-in replacement for Python's SequenceMatcher is provided by fuzzywuzzy's StringMatcher.

In that respect, sciencebeam-alignment merely provides a wrapper with fallback.
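
The wrapper-with-fallback idea can be sketched as follows. This is a minimal illustration rather than the library's actual code: it assumes that fuzzywuzzy's StringMatcher module (which needs python-Levenshtein) may or may not be importable, and falls back to difflib's SequenceMatcher otherwise. The helper name get_matching_blocks is hypothetical.

```python
try:
    # fuzzywuzzy's StringMatcher module requires python-Levenshtein
    from fuzzywuzzy.StringMatcher import StringMatcher as SequenceMatcher
except ImportError:
    # fall back to the standard library implementation
    from difflib import SequenceMatcher


def get_matching_blocks(a, b):
    """Hypothetical helper: matching blocks as (a_index, b_index, size) tuples."""
    matcher = SequenceMatcher(None, a, b)
    return [tuple(block) for block in matcher.get_matching_blocks()]


blocks = get_matching_blocks('a word1 b', 'x word1 y')
print(blocks)
```

With the difflib fallback this prints [(1, 1, 7), (9, 9, 0)]. The last block is the zero-length sentinel that SequenceMatcher always appends.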

WordSequenceMatcher

A wrapper around the aforementioned SequenceMatcher that matches on word-level tokens only.

It currently only implements get_matching_blocks.

The main advantage is that it is much faster for long texts, because it does not have to match individual characters. It is not recommended for short texts, where character-level alignment is probably more desirable.

Example match results:

>>> from sciencebeam_alignment.word_sequence_matcher import (
...     WordSequenceMatcher
... )
>>> WordSequenceMatcher(a='word1', b='word2').get_matching_blocks()
[]
>>> WordSequenceMatcher(a='a word1 b', b='x word1 y').get_matching_blocks()
[(2, 2, 5)]
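
The word-level approach can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: it tokenizes on whitespace, runs difflib's SequenceMatcher on the two word lists, and maps each matching word back to character offsets. The function name word_level_matching_blocks is hypothetical.

```python
import re
from difflib import SequenceMatcher


def word_level_matching_blocks(a, b):
    """Hypothetical sketch: match whole words, report character offsets."""
    # each token is (word, start offset in the original string)
    tokens_a = [(m.group(), m.start()) for m in re.finditer(r'\S+', a)]
    tokens_b = [(m.group(), m.start()) for m in re.finditer(r'\S+', b)]
    matcher = SequenceMatcher(
        None,
        [word for word, _ in tokens_a],
        [word for word, _ in tokens_b],
        autojunk=False
    )
    blocks = []
    for a_index, b_index, size in matcher.get_matching_blocks():
        # the final sentinel block has size 0 and contributes nothing
        for offset in range(size):
            word, a_start = tokens_a[a_index + offset]
            _, b_start = tokens_b[b_index + offset]
            blocks.append((a_start, b_start, len(word)))
    return blocks


print(word_level_matching_blocks('word1', 'word2'))
print(word_level_matching_blocks('a word1 b', 'x word1 y'))
```

Because the comparison works on a handful of words instead of every character, the number of items to align shrinks considerably for long texts, which is where the speed-up comes from.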

GlobalSequenceMatcher and LocalSequenceMatcher

GlobalSequenceMatcher and LocalSequenceMatcher implement the Needleman-Wunsch global alignment and the Smith-Waterman local alignment algorithms, respectively. The implementation is somewhat inspired by python-alignment.

Both implement get_matching_blocks to match Python's SequenceMatcher.

By passing in a scoring object, the results can be influenced (e.g. gaps can be penalized more than mismatches).
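
To see how the scoring can change the result, here is a minimal pure Python sketch of the Needleman-Wunsch score calculation. It is not the library's code. The parameter names mirror the SimpleScoring keyword arguments, while the function name global_alignment_score is hypothetical.

```python
def global_alignment_score(a, b, match_score=3, mismatch_score=-1, gap_score=-2):
    """Hypothetical sketch: best global alignment score (Needleman-Wunsch)."""
    # previous[j] is the best score for aligning a[:i - 1] with b[:j]
    previous = [j * gap_score for j in range(len(b) + 1)]
    for i, item_a in enumerate(a, start=1):
        current = [i * gap_score]
        for j, item_b in enumerate(b, start=1):
            pair_score = match_score if item_a == item_b else mismatch_score
            current.append(max(
                previous[j - 1] + pair_score,  # align item_a with item_b
                previous[j] + gap_score,       # gap in b
                current[j - 1] + gap_score     # gap in a
            ))
        previous = current
    return previous[-1]


# cheap gaps: aligning the shared 'a' between two gaps wins (-2 + 3 - 2)
print(global_alignment_score('ab', 'ba'))                # -1
# expensive gaps: two mismatches are now the better option (-1 - 1)
print(global_alignment_score('ab', 'ba', gap_score=-5))  # -2
```

With the default values a gapped alignment is preferred, whereas a stronger gap penalty makes the alignment fall back to mismatches.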

An optimized implementation using Cython is also provided. The level of optimization depends on the types of the sequences and the scoring passed in; the fastest combination is integer sequences with simple scoring. The potential speed-ups can be significant, especially with longer sequences.
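
One way to obtain integer sequences is to map every distinct token to an integer id using a vocabulary shared by both inputs. The sketch below is an assumption about how one might prepare such input, not part of the library. The function name encode_as_integers is hypothetical, and passing the encoded lists to the matchers as a and b is presumed from the description above rather than confirmed.

```python
def encode_as_integers(*sequences):
    """Hypothetical helper: encode tokens as integer ids with a shared vocabulary."""
    vocabulary = {}
    encoded = []
    for sequence in sequences:
        # setdefault assigns the next free id the first time a token is seen
        encoded.append([
            vocabulary.setdefault(token, len(vocabulary))
            for token in sequence
        ])
    return encoded, vocabulary


(a_ids, b_ids), vocabulary = encode_as_integers(
    'a word1 b'.split(),
    'x word1 y'.split()
)
print(a_ids)  # [0, 1, 2]
print(b_ids)  # [3, 1, 4]
```

Equal tokens receive equal ids, so any alignment of the integer lists corresponds directly to an alignment of the original tokens.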

>>> from sciencebeam_alignment.align import LocalSequenceMatcher, SimpleScoring
>>> DEFAULT_SCORING = SimpleScoring(match_score=3, mismatch_score=-1, gap_score=-2)
>>> LocalSequenceMatcher(a='a word1 b', b='x word2 y', scoring=DEFAULT_SCORING).get_matching_blocks()
[(1, 1, 5), (7, 7, 1), (9, 9, 0)]

In addition, get_multiple_matching_blocks can be used to retrieve multiple alternative sets of matching blocks with the same score:

>>> from sciencebeam_alignment.align import GlobalSequenceMatcher, SimpleScoring
>>> DEFAULT_SCORING = SimpleScoring(match_score=3, mismatch_score=-1, gap_score=-2)
>>> matcher = GlobalSequenceMatcher(a='xyzabc', b='abcxyz', scoring=DEFAULT_SCORING)
>>> list(matcher.get_multiple_matching_blocks(limit=2))
[[(3, 0, 3)], [(0, 3, 3)]]

get_multiple_matching_blocks returns a generator. The number of variations can be limited using the limit argument or by simply stopping early.
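
The generator pattern can be illustrated with a stand-in that does not use the library at all. The hypothetical iter_longest_common_blocks below yields every longest common block as its own single-block result, and the caller stops early with itertools.islice. It is a much simpler technique than the alignment traceback, and the order of its results may differ from the library's.

```python
from itertools import islice


def iter_longest_common_blocks(a, b):
    """Hypothetical stand-in: yield each longest common block as [(i, j, size)]."""
    # lengths[i][j] is the length of the common suffix of a[:i] and b[:j]
    lengths = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                lengths[i][j] = lengths[i - 1][j - 1] + 1
                best = max(best, lengths[i][j])
    if best == 0:
        return
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if lengths[i][j] == best:
                yield [(i - best, j - best, best)]


all_variations = list(iter_longest_common_blocks('xyzabc', 'abcxyz'))
print(all_variations)  # [[(0, 3, 3)], [(3, 0, 3)]]

# stopping early: only consume the first variation
first_only = list(islice(iter_longest_common_blocks('xyzabc', 'abcxyz'), 1))
print(first_only)  # [[(0, 3, 3)]]
```

A generator such as the one returned by get_multiple_matching_blocks can be consumed in the same way, either with islice or by breaking out of a for loop.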

The GlobalSequenceMatcher can also be used to calculate the Levenshtein distance (or edit distance). An example is provided in sciencebeam_alignment.levenshtein:

>>> from sciencebeam_alignment.levenshtein import get_levenshtein_distance
>>> get_levenshtein_distance('kitten', 'sitting')
3
>>> from sciencebeam_alignment.levenshtein import get_levenshtein_ratio
>>> get_levenshtein_ratio('kitten', 'sitting')
0.5714285714285714

Calculating the Levenshtein distance is mainly provided as an example. You might want to consider using python-Levenshtein.
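
The link between global alignment and the edit distance can be sketched in pure Python: with a match score of 0 and a mismatch and gap score of -1, the negated best global alignment score equals the Levenshtein distance. The code below is an illustration, not the contents of sciencebeam_alignment.levenshtein. The function names are hypothetical, and the ratio definition (1 minus the distance divided by the longer length) is an assumption that is consistent with the value shown above.

```python
def levenshtein_distance(a, b):
    """Hypothetical sketch: edit distance via global alignment scoring."""
    # scores: match 0, mismatch -1, gap -1
    previous = [-j for j in range(len(b) + 1)]
    for i, item_a in enumerate(a, start=1):
        current = [-i]
        for j, item_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if item_a == item_b else -1)
            deletion = previous[j] - 1
            insertion = current[j - 1] - 1
            current.append(max(substitution, deletion, insertion))
        previous = current
    # the best score is never positive; negate it to get the distance
    return -previous[-1]


def levenshtein_ratio(a, b):
    """Assumed definition: 1 - distance / length of the longer sequence."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / float(longest)


print(levenshtein_distance('kitten', 'sitting'))  # 3
print(levenshtein_ratio('kitten', 'sitting'))
```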

To check whether the fast implementation is enabled:

>>> from sciencebeam_alignment.align import native_enabled
>>> native_enabled
True
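
In a script it may be useful to report the active implementation at start-up. The following sketch only relies on the native_enabled flag shown above. The function name describe_alignment_backend and the messages are hypothetical, and the import is guarded so the snippet also runs where the package is not installed.

```python
def describe_alignment_backend():
    """Hypothetical helper: describe whether the fast implementation is enabled."""
    try:
        from sciencebeam_alignment.align import native_enabled
    except ImportError:
        return 'sciencebeam_alignment is not installed'
    if native_enabled:
        return 'fast implementation enabled'
    return 'fast implementation not enabled'


print(describe_alignment_backend())
```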

Development

Development can be done either using Docker (default) or a virtual environment.

All commands are available via make.

Development using Docker

Build and run tests:

make build test

Or, for use in CI:

make ci-build-and-test

Development using a virtual environment

make targets with the dev- prefix are intended for use with the virtual environment.

This requires that you already have Python installed.

Setup (virtual environment)

make dev-venv

To update the dependencies:

make dev-install

Cython (virtual environment)

Compile code using Cython:

make dev-cython-clean dev-cython-compile

Tests (virtual environment)

make dev-test

Or:

make dev-watch
