DNA Hash

A specialized data structure and tokenization library for counting genetic sequences for use in machine learning and bioinformatics. DNA Hash stores k-mer sequence counts keyed by their up2bit encoding, an efficient two-way hash that works with variable-length sequences. As a result, DNA Hash uses considerably less memory than a lookup table that stores sequences in plaintext. In addition, DNA Hash's novel autoscaling Bloom filter eliminates the need to store singletons explicitly, which makes it suitable for use on streaming data.

  • Ultra-low memory footprint
  • Variable sequence lengths
  • Embarrassingly parallelizable
  • Open-source and free to use commercially
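To illustrate why a two-way bit encoding saves memory compared with plaintext keys, here is a minimal sketch of one way to pack a variable-length DNA sequence into an integer at two bits per base. The layout (a leading sentinel bit followed by 2-bit codes) and the names `encode` and `decode` are assumptions made for this example; they are not taken from DNA Hash's actual up2bit implementation.

```python
# Hypothetical 2-bit packing scheme for illustration only.
# This is NOT DNA Hash's actual up2bit implementation.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode(sequence: str) -> int:
    """Pack a DNA sequence into an integer, two bits per base."""
    code = 1  # Leading sentinel bit so that leading 'A's (00) are not lost.
    for base in sequence:
        code = (code << 2) | BASE_TO_BITS[base]
    return code


def decode(code: int) -> str:
    """Recover the original sequence from its integer code."""
    bases = []
    while code > 1:  # Stop once only the sentinel bit remains.
        bases.append(BITS_TO_BASE[code & 0b11])
        code >>= 2
    return "".join(reversed(bases))


code = encode("TAACAA")
print(code.bit_length())  # Number of bits needed for this 6-mer.
print(decode(code))       # Round-trips back to the original sequence.
```

Under this scheme a 6-mer needs 13 bits, compared with 48 bits as ASCII text, and the sentinel bit keeps sequences of different lengths (such as `A` and `AA`) from colliding.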

Note: Because the Bloom filter is probabilistic, DNA Hash may overcount sequences at a bounded rate.
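The following sketch shows the general idea of using a Bloom filter to avoid storing singletons, and where overcounting can come from. It is a simplified, fixed-size filter written for this example; the class `BloomFilter`, the function `count_tokens`, and their parameters are hypothetical, and the sketch does not reproduce DNA Hash's autoscaling behavior.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """A fixed-size Bloom filter (hypothetical sketch; no autoscaling)."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def exists_or_insert(self, item: str) -> bool:
        """Return True if the item was probably seen before, then record it."""
        seen = True
        for position in self._positions(item):
            byte, bit = divmod(position, 8)
            if not self.bits[byte] & (1 << bit):
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


def count_tokens(tokens):
    """Count tokens, keeping explicit counts only for repeated tokens."""
    seen_once = BloomFilter()
    counts = Counter()
    for token in tokens:
        if token in counts:
            counts[token] += 1
        elif seen_once.exists_or_insert(token):
            # Second sighting, or a false positive that causes an overcount.
            counts[token] = 2
        # Otherwise this is a first sighting; only the filter remembers it.
    return counts


print(count_tokens(["AAA", "AAC", "AAA", "AAA", "AAG"]))
```

In this sketch a token enters the explicit table only on its second sighting, so singletons cost a few bits in the filter rather than a table entry. When the filter reports a false positive, a first sighting is promoted with a count of 2, which is one way a count can end up too high.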

Installation

Install DNA Hash using a Python package manager such as pip:

pip install dnahash

Example Usage

from dna_hash import DNAHash
from dna_hash.tokenizers import Kmer, Canonical

from Bio import SeqIO
from matplotlib import pyplot as plt

# Create the hash table with a cap on the Bloom filter's false positive rate.
hash_table = DNAHash(max_false_positive_rate=0.001)

# Tokenize sequences into canonical 6-mers.
tokenizer = Canonical(Kmer(6))

# Count every token in every record of the FASTA file.
with open('covid-19-virus.fasta', 'r') as file:
    for record in SeqIO.parse(file, 'fasta'):
        for token in tokenizer.tokenize(str(record.seq)):
            hash_table.increment(token)

# Print the 25 most frequent sequences.
for sequence, count in hash_table.top(25):
    print(f'{sequence}: {count}')

print(f'Total sequences: {hash_table.num_sequences}')
print(f'# of unique sequences: {hash_table.num_unique_sequences}')
print(f'# of singletons: {hash_table.num_singletons}')

# Plot the distribution of sequence counts.
plt.hist(list(hash_table.counts.values()), bins=20)
plt.title('SARS-CoV-2 Genome')
plt.xlabel('Counts')
plt.ylabel('Frequency')
plt.show()

TAACAA: 70
TTAAAA: 68
ACAACA: 65
...
CATTAA: 49

Total sequences: 29876
# of unique sequences: 2013
# of singletons: 100

Figure: SARS-CoV-2 Histogram
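The example above wraps `Kmer(6)` in `Canonical`. As a rough illustration of what canonical k-mer tokenization usually means, the sketch below slides a window of length k over a sequence and maps each k-mer to the lexicographically smaller of itself and its reverse complement. The helpers `kmers` and `canonical` are hypothetical and written in plain Python; whether DNA Hash's tokenizers follow exactly this convention, or how they treat ambiguous bases, is an assumption not confirmed here.

```python
# Hypothetical helpers for illustration; not part of the dna_hash package.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def kmers(sequence: str, k: int):
    """Yield every overlapping substring of length k."""
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]


def canonical(kmer: str) -> str:
    """Return the smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


for token in kmers("ACGTT", 3):
    print(token, canonical(token))
```

Because a k-mer and its reverse complement map to the same token, reads from either DNA strand contribute to the same count.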
