Skip to main content

Library for working with biological sequence data as numpy arrays.

Project description

bionumpy

https://img.shields.io/pypi/v/bionumpy.svg https://img.shields.io/travis/knutdrand/bionumpy.svg Documentation Status

Library for working with biological sequence data as numpy arrays.

Features

bioNumpy is a library for handling large biological sequence date sets in Python. The main features of bioNumpy are input and output from common file formats (fastq, fasta, bed, bam, sam…) (IO); Encoding of common data types into numerical arrays (Encoding) and summarizing/counting features in a data sets (kmers, minimizers, gc content, string matching,,,).

IO

The main function for file input and output is the open function, which by default deduces the file type based on the suffix, and creates a buffered reader of a sensible type. For instance:

>>> import bionumpy as bnp
>>>
>>> fastq_entries_stream = bnp.open("example_data/reads.fq")
>>> for fastq_entries in fastq_entries_stream:
...     print(fastq_entries)
...
SequenceEntryWithQuality(name=Sequences(headerishere, anotherheader), sequence=Sequences(CTTGTTGA, CGG), quality=Sequences(!!!!!!!!, ~~~))

A couple of things to note: Firstly the bnp.open function understands that the .fq suffix means it’s a fastq file and reads it as such. It also knowns that usually a fastq file is usually too big to keep in memory so instead of reading in the whole file, it reads in chunks of data and delivers it as a stream (In this particular case, the file is so small that we only get one chunk). Third, it delivers the data for each chunk as a npdataclass where there is one array-like object for each field. This means that if we only care about the acutal DNA-sequences, we can operate on that on it’s own.

Another example would be to read in a vcf-file as such:

>>> variants_stream = bnp.open("example_data/variants.vcf")
>>> for chromosome, variants in variants_stream:
...     print(variants)
...
Variant(chromosome=['chr1' 'chr1' 'chr1'], position=array([883624, 887559, 887800]), ref_seq=Sequences(A, A, A), alt_seq=Sequences(G, C, G))
Variant(chromosome=['chr9' 'chr9' 'chr9'], position=array([883624, 887559, 887800]), ref_seq=Sequences(A, A, A), alt_seq=Sequences(G, C, G))

Again, bnp.open recognizes that this is a vcf file, and again it chooses an appropriate format to output it in. For vcf file this is a stream variant chunks, where each chunk is the variants for one chromosome. This format together with the ChromosomeMap decorator makes it easy to work with genomic data.

Sequences

bnp.Sequences objects are numpy arrays structured in a way that allows them to hold many sequences of unequal lenght. Under the hood they are npstructures.RaggedArray objects that hold byte-arrays, with an encoding that specifies which characters each byte represents. Most of the time, it is not necessary to think about these inner workings, but one can think of them as lists of strings (with the possibility of performing numpy functions on them). The most common way to get Sequences objects is to read a file, but they can also be created from lists of strings using the bnp.as_sequence_array function:

>>> bnp.as_sequence_array(["acgttgta", "gcttca", "gttattc"])
Sequences(acgttgta, gcttca, gttattc)

Encodings

The main point of bioNumpy is to leverage the computational power of numpy for biological data. A key element to this is to represent different types of biological data as numbers (in numpy arrays). The basic mechanism for doing this is by Encoding classes that can encode data types into numbers, and decode them back in to the data type. A driving idea in bionumpy is to make this happen automatically under the hood, so that a user can choose to ignore the inner workings and (in most cases) relate to sequence data sets in the same way as one would with standard numerical numpy data structures.

Summarization

A key application is to extract features from sequence data sets. A large set of interesting features can be computed as functions from sequences to scalar values. Examples are kmer-hashing (kmer->hash-value), minimizers(window->hash-value), string/motif-matching (sequence->bool), Position Weight Matrix scores (sequence->float). bioNumpy provides functionality to apply such functions to rolling windows across large sequence sets, through the RollableFunction class. By specifying a broadcastable function in the __call__ method, the rolling_window method will apply the function to all windows in a sequence set. Take the PositionWeightMatrix class for instance:

class PositionWeightMatrix(RollableFunction):
    def __init__(self, matrix, encoding=ACTGEncoding):
        self._matrix = np.asanyarray(matrix)
        self.window_size = self._matrix.shape[-1]
        self._indices = np.arange(self.window_size)
        self._encoding = ACTGEncoding

    def __call__(self, sequence: Sequence) -> float:
        sequence = as_sequence_array(sequence, self._encoding)
        scores = self._matrix[sequence, self._indices]
        return scores.sum(axis=-1)

It’s __call__ method specifies how to calculate the score of a sequence. Calling it’s rolling_window function will calculate the scores for all windows in a data set:

>>> import numpy as np
>>> sequences = bnp.as_sequence_array(["acgttgta", "gcttca", "gttattc"], encoding=bnp.encodings.ACTGEncoding)
>>>
>>> matrix = np.log([[0.1, 0.2],
...                  [0.2, 0.3],
...                  [0.4, 0.1],
...                  [0.3, 0.4]])
>>> pwm = bnp.position_weight_matrix.PositionWeightMatrix(matrix)
>>> pwm("ac")
-3.506557897319982
>>> pwm(["ac", "cg"])
array([-3.5065579 , -2.52572864])
>>> pwm.rolling_window(sequences)
RaggedArray([[-3.506557897319982, -2.525728644308255, -3.506557897319982, -3.2188758248682006, -1.83258146374831, -3.506557897319982, -2.525728644308255], [-2.4079456086518722, -3.9120230054281455, -3.2188758248682006, -2.120263536200091, -3.2188758248682006], [-3.506557897319982, -3.2188758248682006, -2.525728644308255, -4.605170185988091, -3.2188758248682006, -2.120263536200091]])

Further processing can be done with numpy functions, for instance finding the max score for each sequence in the set:

>>> pwm.rolling_window(sequences).max(axis=-1)
array([-1.83258146, -2.12026354, -2.12026354])

kmers

Another example of this concept is the kmer hashing class:

class KmerEncoding(RollableFunction):

    def __init__(self, k, alphabet_size=4):
        self.window_size = k
        self._k = k
        self._alphabet_size = alphabet_size
        self._convolution = self._alphabet_size ** np.arange(self._k)

    def __call__(self, sequence: Sequence) -> np.ndarray:
        return sequence.dot(self._convolution)

Here, the __call__ function specifies how to hash a kmer into a single number. Calling its rolling_window method will hash all the kmers in a sequence set.

>>> bnp.KmerEncoding(3).rolling_window(sequences)
RaggedArray([[52, 45, 43, 58, 46, 11], [39, 41, 26, 6], [43, 10, 34, 40, 26]])

To count all the 3-mers in the ‘reads.fq’ sequences we can do as follows:

>>> fastq_entries_stream = bnp.open("example_data/reads.fq")
>>> counts = np.zeros(4**3, dtype=int)
>>> kmer_encoding = bnp.KmerEncoding(3)
>>> for fastq_entries in fastq_entries_stream:
...     kmer_hashes = kmer_encoding.rolling_window(fastq_entries.sequence)
...     counts += np.bincount(kmer_hashes.ravel(), minlength=4**3)
...
>>> counts
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0])

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

History

0.1.0 (2021-12-17)

  • First release on PyPI.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bionumpy-0.1.0.tar.gz (39.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

bionumpy-0.1.0-py2.py3-none-any.whl (31.2 kB view details)

Uploaded Python 2Python 3

File details

Details for the file bionumpy-0.1.0.tar.gz.

File metadata

  • Download URL: bionumpy-0.1.0.tar.gz
  • Upload date:
  • Size: 39.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.4.2 requests/2.22.0 setuptools/45.2.0 requests-toolbelt/0.8.0 tqdm/4.30.0 CPython/3.8.10

File hashes

Hashes for bionumpy-0.1.0.tar.gz
Algorithm Hash digest
SHA256 850aa0689343acc43fdbfb9f881d4b142fe9c8a64b1ee5ab0724a621f4dc9157
MD5 39734810b12ea3b54d220c5c3fc77787
BLAKE2b-256 08ffe52bac27f634e00005bee3844156e30c6b3919deac9dfe30b3e677770ea9

See more details on using hashes here.

File details

Details for the file bionumpy-0.1.0-py2.py3-none-any.whl.

File metadata

  • Download URL: bionumpy-0.1.0-py2.py3-none-any.whl
  • Upload date:
  • Size: 31.2 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.4.2 requests/2.22.0 setuptools/45.2.0 requests-toolbelt/0.8.0 tqdm/4.30.0 CPython/3.8.10

File hashes

Hashes for bionumpy-0.1.0-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 c97c57bea7883e19e565ef9c5966f4566c89a794471f01b6832310c6e1162b6c
MD5 197950317f8e9c12d96172aadfd94c3c
BLAKE2b-256 f6b034fecc4074b519bd71db9fc7df5873dacca1894613b387ccafd2234ac30f

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page