A specialized data structure for counting short DNA sequences, for use in bioinformatics.

DNA Hash

A Python library for counting short DNA sequences for use in bioinformatics. DNA Hash stores k-mer counts keyed by their up2bit encoding, a two-way hash that works with variable-length sequences. It uses considerably less memory than a lookup table that stores sequences in plaintext. In addition, DNA Hash's novel autoscaling Bloom filter eliminates the need to explicitly store counts for sequences that have been seen only once.

  • Ultra-low memory footprint
  • Embarrassingly parallelizable
  • Open-source and free to use commercially

Note: The maximum sequence length is platform-dependent. On a 64-bit machine, the maximum length is 31 bases. On a 32-bit machine, the maximum length is 15 bases.
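The limits of 31 and 15 are consistent with packing each base into two bits of a signed machine word, with one extra leading bit marking where the sequence starts. The sketch below illustrates that idea only; it is an assumption about how a reversible 2-bit encoding could work, not the library's actual up2bit implementation, and the names `encode` and `decode` are hypothetical.

```python
# Hypothetical 2-bit encoding with a leading sentinel bit (not the library's code).
BASE_TO_BITS = {'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode(sequence: str) -> int:
    """Pack a DNA sequence into an integer, two bits per base."""
    code = 1  # sentinel bit, so leading 'A's (00) are not lost
    for base in sequence:
        code = (code << 2) | BASE_TO_BITS[base]
    return code


def decode(code: int) -> str:
    """Recover the original sequence from its integer encoding."""
    bases = []
    while code > 1:  # stop once only the sentinel bit remains
        bases.append(BITS_TO_BASE[code & 0b11])
        code >>= 2
    return ''.join(reversed(bases))


print(encode('ACGT'))                   # 283
print(decode(encode('AACGT')))          # AACGT
print(encode('A' * 31).bit_length())    # 63
```

Under this scheme a 31-base sequence needs 62 bits plus the sentinel, 63 bits in total, which fits in a signed 64-bit integer. A 15-base sequence needs 31 bits, which fits in a signed 32-bit integer.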

Note: Due to the probabilistic nature of the Bloom filter, DNA Hash may overcount sequences, but only at a bounded, user-defined rate.
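To see where the overcounting comes from, here is a minimal sketch of a counter that lets a Bloom filter remember first sightings and only stores explicit counts from the second sighting onward. It is an illustration under stated assumptions, not DNA Hash's implementation: the filter is fixed-size rather than autoscaling, and the class and method names (`BloomFilter`, `exists_or_insert`, `SingletonAwareCounter`) are hypothetical.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter (illustrative; the library's filter autoscales)."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive independent bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, 'little')
            ).digest()
            yield int.from_bytes(digest, 'little') % self.num_bits

    def exists_or_insert(self, item: str) -> bool:
        """Return True if item was probably seen before; insert it either way."""
        seen = True
        for position in self._positions(item):
            byte, bit = divmod(position, 8)
            if not self.bits[byte] & (1 << bit):
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


class SingletonAwareCounter:
    """Stores explicit counts only for sequences seen at least twice."""

    def __init__(self):
        self.filter = BloomFilter()
        self.counts = {}

    def increment(self, sequence: str) -> None:
        if sequence in self.counts:
            self.counts[sequence] += 1
        elif self.filter.exists_or_insert(sequence):
            # Second sighting, or a false positive on a first sighting.
            self.counts[sequence] = 2


counter = SingletonAwareCounter()
for sequence in ['ACGT', 'ACGT', 'ACGT', 'TTTT']:
    counter.increment(sequence)
print(counter.counts)
```

In this sketch, a false positive makes the filter report a never-seen sequence as already seen, so its first occurrence is recorded with a count of 2. That is the kind of overcount the note describes, and its frequency is governed by the filter's false positive rate.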

Example

from dna_hash import DNAHash, tokenizers

from Bio import SeqIO
from matplotlib import pyplot as plt

# Bound the Bloom filter's false positive rate, and with it the overcount rate.
hash_table = DNAHash(max_false_positive_rate=0.001)

# Split each record into canonical 6-mers.
tokenizer = tokenizers.Canonical(tokenizers.Kmer(6))

# Count every k-mer in every record of the FASTA file.
with open('covid-19-virus.fasta', 'r') as file:
    for record in SeqIO.parse(file, 'fasta'):
        for token in tokenizer.tokenize(str(record.seq)):
            hash_table.increment(token)

# Print the 25 most frequent sequences.
for sequence, count in hash_table.top(25):
    print(f'{sequence}: {count}')

print(f'Total sequences: {hash_table.num_sequences}')
print(f'# of unique sequences: {hash_table.num_unique_sequences}')
print(f'# of singletons: {hash_table.num_singletons}')

# Plot a histogram of the sequence counts.
counts, bins = hash_table.histogram(20)

plt.stairs(counts, bins)
plt.title('Histogram of SARS-CoV-2 Genome')
plt.xlabel('Counts')
plt.ylabel('Frequency')
plt.show()
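
The example wraps a k-mer tokenizer in a canonical one. By the usual bioinformatics convention, the canonical form of a k-mer is the lexicographically smaller of the k-mer and its reverse complement, so both strands of the same site are counted together. The sketch below shows that convention in plain Python. It assumes uppercase A, C, G, T input only, and the function names are hypothetical; the library's tokenizers may behave differently, for example when they meet ambiguous bases.

```python
# Illustrative canonical k-mer tokenization (not the library's tokenizers).
COMPLEMENT = str.maketrans('ACGT', 'TGCA')


def reverse_complement(sequence: str) -> str:
    """Complement each base, then reverse the string."""
    return sequence.translate(COMPLEMENT)[::-1]


def kmers(sequence: str, k: int):
    """Yield every overlapping substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def canonical_kmers(sequence: str, k: int):
    """Yield the smaller of each k-mer and its reverse complement."""
    for kmer in kmers(sequence, k):
        yield min(kmer, reverse_complement(kmer))


print(list(canonical_kmers('ACGTT', 3)))  # ['ACG', 'ACG', 'AAC']
```

Here `CGT` and `ACG` are reverse complements of each other, so both map to `ACG`, and `GTT` maps to its reverse complement `AAC`.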
