ScienceBeam Alignment
ScienceBeam Alignment provides generic low-level sequence alignment utility functions, similar to Python's SequenceMatcher.
This project is currently used mainly for training data generation related to the ScienceBeam project. However, the project itself has no ScienceBeam dependency and can be considered a standalone sequence alignment library. It is targeted at document-sized sequences rather than massive gene sequences.
Pre-requisites
- Python 2 or 3
API
SequenceMatcher
A mostly drop-in replacement for Python's SequenceMatcher is provided by fuzzywuzzy's StringMatcher.
In that respect, sciencebeam-alignment merely provides a wrapper with a fallback.
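A wrapper with a fallback of this kind might look like the sketch below. It assumes that the fallback target is the standard library's difflib.SequenceMatcher, which this README does not state explicitly. The import path fuzzywuzzy.StringMatcher is the one fuzzywuzzy itself uses, and it only imports successfully when python-Levenshtein is installed.

```python
# Sketch of an import-time fallback (assumption: the standard library's
# difflib.SequenceMatcher is the fallback; not confirmed by this README).
try:
    # Fast matcher; importing it fails unless python-Levenshtein is installed.
    from fuzzywuzzy.StringMatcher import StringMatcher as SequenceMatcher
except ImportError:
    from difflib import SequenceMatcher

a, b = 'a word1 b', 'x word1 y'
# Both classes accept (isjunk, seq1, seq2) as positional arguments.
blocks = SequenceMatcher(None, a, b).get_matching_blocks()

# Every block (i, j, n) should mark equal slices: a[i:i + n] == b[j:j + n].
for i, j, n in blocks:
    print((i, j, n), repr(a[i:i + n]))
```

The exact blocks may differ between the two implementations, which is why the replacement is described as only mostly drop-in.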
WordSequenceMatcher
A wrapper around the aforementioned SequenceMatcher, but matching on word-level tokens only.
It currently only implements get_matching_blocks.
The main advantage is that it is much faster for long texts, because it won't have to match individual characters. It isn't recommended for short texts, where character level alignment is probably more desirable.
Example match results:
>>> from sciencebeam_alignment.word_sequence_matcher import (
...     WordSequenceMatcher
... )
>>> WordSequenceMatcher(a='word1', b='word2').get_matching_blocks()
[]
>>> WordSequenceMatcher(a='a word1 b', b='x word1 y').get_matching_blocks()
[(2, 2, 5)]
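To illustrate why word-level matching is cheaper, here is a minimal sketch of the idea, not the library's implementation. It assumes tokens are whitespace-separated and uses difflib.SequenceMatcher on the token lists; the function name word_matching_blocks is hypothetical. The matcher compares a handful of tokens instead of every character, and the matches are then mapped back to character offsets.

```python
import re
from difflib import SequenceMatcher


def word_matching_blocks(a, b):
    """Match whitespace-separated tokens, returning character-level blocks."""
    # Remember each token's start offset so matches map back to characters.
    a_tokens = [(m.start(), m.group()) for m in re.finditer(r'\S+', a)]
    b_tokens = [(m.start(), m.group()) for m in re.finditer(r'\S+', b)]
    matcher = SequenceMatcher(
        None,
        [text for _, text in a_tokens],
        [text for _, text in b_tokens],
        autojunk=False
    )
    blocks = []
    for ai, bi, size in matcher.get_matching_blocks():
        # Emit one character-level block per matched token pair; the
        # zero-length terminator block from difflib produces nothing here.
        for k in range(size):
            a_start, text = a_tokens[ai + k]
            b_start, _ = b_tokens[bi + k]
            blocks.append((a_start, b_start, len(text)))
    return blocks


print(word_matching_blocks('word1', 'word2'))          # []
print(word_matching_blocks('a word1 b', 'x word1 y'))  # [(2, 2, 5)]
```

Under these assumptions the sketch reproduces the two results shown above: 'word1' and 'word2' are different tokens, so nothing matches, even though they share four characters.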
GlobalSequenceMatcher and LocalSequenceMatcher
GlobalSequenceMatcher and LocalSequenceMatcher implement the Needleman-Wunsch global alignment and the Smith-Waterman local alignment algorithms, respectively. The implementation is somewhat inspired by python-alignment.
Both implement get_matching_blocks to match Python's SequenceMatcher.
By passing in a scoring object, the results can be influenced (e.g. gaps can be penalized more than mismatches).
An optimized implementation using Cython is also provided. The level of optimization depends on the types of the sequences and scoring passed in; the fastest combination is integer sequences with simple scoring. With longer sequences in particular, the potential speed-ups can be significant.
>>> from sciencebeam_alignment.align import LocalSequenceMatcher, SimpleScoring
>>> DEFAULT_SCORING = SimpleScoring(match_score=3, mismatch_score=-1, gap_score=-2)
>>> LocalSequenceMatcher(a='a word1 b', b='x word2 y', scoring=DEFAULT_SCORING).get_matching_blocks()
[(1, 1, 5), (7, 7, 1), (9, 9, 0)]
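The dynamic programming behind both algorithms can be sketched in a few lines of pure Python. This is an illustrative sketch only, not the library's code: the function alignment_score is hypothetical, its keyword arguments merely mirror the SimpleScoring parameters used above, and it returns just the best score rather than matching blocks.

```python
def alignment_score(a, b, match_score=3, mismatch_score=-1, gap_score=-2,
                    local=False):
    """Best alignment score: Needleman-Wunsch, or Smith-Waterman if local."""
    # Row 0: a prefix of b aligned against nothing (free when local).
    prev = [0 if local else j * gap_score for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap_score]
        for j in range(1, len(b) + 1):
            pair_score = match_score if a[i - 1] == b[j - 1] else mismatch_score
            score = max(
                prev[j - 1] + pair_score,  # align a[i-1] with b[j-1]
                prev[j] + gap_score,       # gap in b
                curr[j - 1] + gap_score    # gap in a
            )
            if local:
                # A local alignment may restart anywhere, so never go below 0.
                score = max(score, 0)
                best = max(best, score)
            curr.append(score)
        prev = curr
    # Global: score of aligning everything; local: best cell anywhere.
    return best if local else prev[-1]


print(alignment_score('abc', 'abd'))                          # 5
print(alignment_score('a word1 b', 'x word2 y', local=True))  # 17
```

With these scores, the best local alignment of the example strings spans ' word', the mismatched digit, and the following space (5 matches, 1 mismatch, 1 match), which is consistent with the non-empty blocks in the output above. Raising the gap penalty relative to the mismatch penalty makes the alignment prefer substitutions over insertions and deletions.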
In addition, get_multiple_matching_blocks can be used to retrieve multiple matching blocks with the same score:
>>> from sciencebeam_alignment.align import GlobalSequenceMatcher, SimpleScoring
>>> DEFAULT_SCORING = SimpleScoring(match_score=3, mismatch_score=-1, gap_score=-2)
>>> matcher = GlobalSequenceMatcher(a='xyzabc', b='abcxyz', scoring=DEFAULT_SCORING)
>>> list(matcher.get_multiple_matching_blocks(limit=2))
[[(3, 0, 3)], [(0, 3, 3)]]
get_multiple_matching_blocks returns a generator. The number of variations can be limited using the limit argument or by simply stopping early.
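Stopping early works the same way as with any Python generator. The sketch below uses a hypothetical stand-in generator (fake_multiple_matching_blocks) that yields the two variations from the example above, so that it runs without the library installed; with the real matcher, the generator returned by get_multiple_matching_blocks would take its place.

```python
from itertools import islice


def fake_multiple_matching_blocks():
    # Hypothetical stand-in for matcher.get_multiple_matching_blocks();
    # it yields the two variations shown in the example above.
    yield [(3, 0, 3)]
    yield [(0, 3, 3)]


# Stop early without a limit argument: take only the first variation.
first_only = list(islice(fake_multiple_matching_blocks(), 1))
print(first_only)  # [[(3, 0, 3)]]

# Or break out of a loop once a suitable variation is found.
for blocks in fake_multiple_matching_blocks():
    if blocks[0][0] == 0:
        break
print(blocks)  # [(0, 3, 3)]
```

Because a generator is lazy, variations that are never requested are never computed.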
The GlobalSequenceMatcher can also be used to calculate the Levenshtein distance (or edit distance). An example is provided in sciencebeam_alignment.levenshtein:
>>> from sciencebeam_alignment.levenshtein import get_levenshtein_distance
>>> get_levenshtein_distance('kitten', 'sitting')
3
>>> from sciencebeam_alignment.levenshtein import get_levenshtein_ratio
>>> get_levenshtein_ratio('kitten', 'sitting')
0.5714285714285714
Calculating the Levenshtein distance is mainly provided as an example. You might want to consider using python-Levenshtein instead.
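The connection to global alignment is that a global alignment scored with 0 for a match and -1 for each mismatch or gap has a best score equal to the negated edit distance. The standalone sketch below computes the distance directly with the same dynamic programming recurrence; it is not the library's code, and the function names are hypothetical. The ratio formula is an assumption, chosen because it is consistent with the value shown above.

```python
def levenshtein_distance(a, b):
    """Edit distance with unit cost for insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,              # delete ca
                curr[j - 1] + 1,          # insert cb
                prev[j - 1] + (ca != cb)  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def levenshtein_ratio(a, b):
    # Assumed normalization: 1 - distance / length of the longer sequence.
    longest = max(len(a), len(b))
    if not longest:
        return 1.0
    return 1.0 - float(levenshtein_distance(a, b)) / longest


print(levenshtein_distance('kitten', 'sitting'))  # 3
print(levenshtein_ratio('kitten', 'sitting'))     # roughly 0.571
```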
To check whether the fast implementation is enabled:
>>> from sciencebeam_alignment.align import native_enabled
>>> native_enabled
True
Development
Development can be done using either Docker (default) or a virtual environment.
All commands are available via make.
Development using Docker
Build and run tests:
make build test
Or, as intended for CI:
make ci-build-and-test
Development using a virtual environment
make targets with the dev- prefix are intended for use with the virtual environment.
This requires that you already have Python installed.
Setup (virtual environment)
make dev-venv
To update the dependencies:
make dev-install
Cython (virtual environment)
Compile code using Cython:
make dev-cython-clean dev-cython-compile
Tests (virtual environment)
make dev-test
Or:
make dev-watch
Download files
Source Distribution
File details
Details for the file sciencebeam_alignment-0.0.5.tar.gz.
File metadata
- Download URL: sciencebeam_alignment-0.0.5.tar.gz
- Upload date:
- Size: 12.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.8.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | 62167d35195e2a44f5186298853b7e4cc09a726eb91c189d0f0c4ee2a18c419c
MD5 | b8be257c8e47b6b2404bfd6715222a8f
BLAKE2b-256 | 584b053c438c812662abdb38c2287872dbe1610abe6c186c57ed1145fa60b800