Skip to main content

FASTA sequence aligner - test

Project description

# **Aligner**

### What
**Aligner** is a tool for [FASTA] (https://en.wikipedia
.org/wiki/FASTA_format) DNA sequence alignment.


### How to use
**Aligner** is fairly simple to use. All you need is a standard FASTA text
file (*.txt).

By the way, included in this package are two example files (fasta_example_1
.txt,
fasta_example_2.txt).

In order to install,

```
pip install sequence_aligner
```

To use aligner from the command line, you run the following commands:

```
aligner [-h] --file_path FILE_PATH [--storage_path STORAGE_PATH] [--results_name
RESULTS_NAME]
```

FILE PATH is the path to the fasta sequence text file.

While FILE PATH is required, STORAGE_PATH and FILE_NAME are optional.

By default, the result file will be stored in the following generated directory:
`/tmp/sequence_results`

And the file name will be generated with the following pattern:
`sequence_read-<datetime here>.txt` for example: `sequence_read-2016-07-01T16.42.21.246183.txt`

Once you run the command line, within seconds, your results will be
generated, and you'll receive a print out of the location and name of your file.

For example:
```Sequence Alignment results stored --> /tmp/sequence_results/sequence_read-2016-07-01T17.59.48.458859.txt```


### How it works
In order to align sequences, first, I created a dictionary of all
subsequences from the size 'greater than half' to full length.

One sequence in the list of sequences was designated the "anchor
sequence".

I gathered that at any given state of the anchor sequence, there
would exist a sequence with the the greatest overlap. Iterating
through the sequence list, the sequence with the
sub-sequence (from the dictionary mentioned above) having the
greatest 'score' is identified. The score is based on the amount of
overlap with the anchor sequence. This subsequence is then merged
into the anchor sequence. This iteration continues until all
sequences in the sequence list have been merged into the anchor
sequence.

## Caveats/Issues
* One issue I dealt with was speed. After many iterations of
refactoring, I was able to bring runtime down from ~45-50 min to ~10-11
min.
* This program assumes that sequences overlap and that there are no
mutations.

## Next Steps
* Improve efficiency
* Improve 'match-iness' algorithm -- right now, my 'score' is merely
based on length of overlap at a given state of the anchor sequence.
More preprocessing could be done before commencing alignment in order
to score the level of 'matchiness' between all pairs of sequences.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

SequenceAligner-0.0.2.tar.gz (5.8 kB view details)

Uploaded Source

File details

Details for the file SequenceAligner-0.0.2.tar.gz.

File metadata

File hashes

Hashes for SequenceAligner-0.0.2.tar.gz
Algorithm Hash digest
SHA256 bbd2cd48d95acb53824ec39fae4e2cba51f398266ba44ad35e318bff073ee310
MD5 bb57fe5bbd82b7193b875c721c504db4
BLAKE2b-256 7f6bac0ae3cc75e024629f03e3916a6a336fea8c7dd016fac589d25f761a73e2

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page