Skip to main content

A Python implementation of the Burros-Wheeler and Huffman coding algorithms

Project description

dnazip

PyPI - Python Version PyPI

ForTheBadge built-with-science forthebadge forthebadge forthebadge forthebadge forthebadge forthebadge

  • A Python implementation of The Burrows-Wheeler Transform (BWT) alongside Huffman compression on DNA sequences.
  • Hosted on GitHub
  • Documentation? here

Demo

  • A demo for installation and using the GUI can be found here

Installation

  • You can install the package either from pip or from the source code hosted on github.

With pip

pip install dnazip-bioinfo

From source

git clone https://github.com/dabane-ghassan/dnazip.git
cd dnazip
sudo python3 setup.py install

Getting started

GUI

  • After installing the package from source or using pip, the interface can be launched simply from the command line:
dnazip
  • If problems occur with the installation, an interface instance can be imported and launched:
from dnazip.interface import Interface
gui = Interface()
gui.main()

Using the library

Generating a random DNA sequence

  • A random DNA sequence with an alphabet of ATCGN can be generated for any given length specified in the parameter.
from dnazip.sequence import Sequence
randseq = Sequence.generate(length=5000)
Sequence('/path/to/new/seq').write(randseq)

Encoding a DNA sequence with Burrows-Wheeler + Huffman Coding

  • To encode a DNA sequence using BWT and Huffman coding, you can use a FullEncoder object that will save two files to the same directory as the sequence, the Burrows-Wheeler transform and the UTF-8 zipped format of the sequence:
from dnazip.encoder import FullEncoder
encode = FullEncoder('/path/to/seq')
encode.full_zip()
  • The attributes of the object can be accessed to see intermediary results:
encode.bw_encoder.rotations # a matrix of string rotations from a sequence
encode.bw_encoder.bwm # The Burrows-Wheeler Matrix
encode.bw_encoder.bwt # The Burrows-Wheeler Transform

encode.huff_encoder.header # The header of the zip file that contains Huffman codes for each character as well as the sequence binary padding
encode.huff_encoder.binary # The binary sequence of the BW transform
encode.huff_encoder.unicode # 8-bits encoded binary sequence
  • A random sequence of size 1kB was compressed efficiently to 549 bytes.

Decoding a DNA sequence with Huffman decoding + Reversing Burrows-Wheeler transform

  • To decode a zipped DNA sequence using Huffman decoding and the inverse of BWT, you can use a FullDecoder object that will work in the same manner as the FullEncoder object:
from dnazip.decoder import FullDecoder
decode = FullDecoder('path/to/seq')
decode.full_unzip()
  • The attributes can also be accessed to see intermediary results:
decode.huff_decoder.header # The header of the zip file that contains Huffman codes for each character as well as the sequence binary padding that where saved when the Huffman tree was created
decode.huff_decoder.unicode # 8-bits encoded sequence
decode.huff_decoder.binary # The binary sequence

decode.bw_decoder.bwm # The Burrows-Wheeler reconstructed matrix
decode.bw_decoder.original # The original sequence

Building the Burrows-Wheeler transform using the advanced algorithm

  • The BWT can be constructed using a Suffix Array (SA) based algorithm that has a better time and space complexities:
from dnazip.sequence import Sequence
from dnazip.burros_wheeler import BurrosWheeler

seq = Sequence('/path/to/seq').read()
BurrosWheeler.bwt_advanced(seq)

Documentation

Detailed documentation on the architecture can be found here

About

:scroll: License

MIT Licensed © Ghassan Dabane, 2021.

forthebadge made-with-python forthebadge ForTheBadge uses-git ForTheBadge uses-badges

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dnazip-bioinfo-0.2.tar.gz (16.3 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page