
Build keras generators for genomic applications

Project description

Keras_dna: simplifying deep genomics


Description:

Keras_dna is an API for quick experimentation in applying deep learning to genomics. It feeds a keras model (tensorflow) with genomic data directly, without laborious file conversions or storing large amounts of converted data. It reads the most common bioinformatics file formats and creates generators adapted to keras models.

Use Keras_dna if you need a library that:

  • Allows fast use of standard bioinformatics data to feed a keras model (keras is nowadays the standard API for tensorflow).
  • Helps format the data to the model's needs.
  • Facilitates the standard evaluation of a model with genomics data (correlation, AUPRC, AUROC).

Read the documentation at keras_dna.

Keras_dna is compatible with: Python 3.6.


Guiding principles:

  • Providing a simplified API to create generators of genomic data.

  • Reading the DNA sequence directly and efficiently from fasta files, removing the need for conversion.

  • Generating the DNA sequence corresponding to the desired annotation (sparse or continuous), passed as standard bioinformatics files (gff, bed, bigWig, bedGraph).

  • Easily adapting to the type of annotations, their number, the number of different cell types or species.


Getting started:

The core classes of keras_dna are Generator, which feeds the keras model with genomic data, and ModelWrapper, which attaches a keras model to its keras_dna Generator.

Generator creates batches of DNA sequences corresponding to the desired annotation.

First example: a Generator instance that yields DNA sequences corresponding to a given genomic function (here, binding sites) as the positive class and other sequences as the negative class. The genome is supplied as a fasta file and the annotation as a gff file (a bed file would also work). The DNA is one-hot encoded, and the genomic functions to target must be passed in a list.

from keras_dna import Generator

generator = Generator(batch_size=64,
                      fasta_file='species.fa',
                      annotation_files=['annotation.gff'],
                      annotation_list=['binding site'])
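
To make "one-hot encoded" concrete, here is a minimal, library-independent sketch. It is not keras_dna's internal code: the function name one_hot_encode, the A, C, G, T column order and the all-zero row for unknown bases are assumptions made for illustration.

```python
# Illustrative sketch only, not keras_dna's implementation.
# Assumptions: columns are ordered A, C, G, T and any other symbol
# (for example N) is encoded as an all-zero row.
BASES = "ACGT"


def one_hot_encode(sequence):
    """Return one row of four 0/1 values per nucleotide."""
    rows = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        index = BASES.find(base)
        if index >= 0:
            row[index] = 1
        rows.append(row)
    return rows


encoded = one_hot_encode("ACGTN")
```

Each sequence of length n becomes an n x 4 table, which is the kind of input a convolutional keras model typically consumes.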

Second example: a Generator for a continuous annotation. This time the annotation is supplied as a bigWig file (a wig or bedGraph file would also work, but then a file containing the chromosome sizes must be passed as size), and the desired length of the DNA sequences must be passed as window. This Generator instance yields all the DNA sequences of length 100 in the genome and labels each one with the coverage at its central nucleotide.

from keras_dna import Generator

generator = Generator(batch_size=64,
                      fasta_file='species.fa',
                      annotation_files=['annotation.bw'],
                      window=100)
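
The labelling scheme described above (every window of a fixed length, labelled with the coverage at its center) can be sketched without any library. The helper centered_windows is hypothetical, and taking the center as start + window // 2 is an assumption; keras_dna may define it differently for even window lengths.

```python
# Illustrative sketch only, not keras_dna's implementation.
def centered_windows(sequence, coverage, window):
    """Yield (subsequence, label) for every full-length window.

    The label is the coverage value at the window's central position,
    assumed here to be start + window // 2.
    """
    if len(sequence) != len(coverage):
        raise ValueError("sequence and coverage must have the same length")
    for start in range(len(sequence) - window + 1):
        center = start + window // 2
        yield sequence[start:start + window], coverage[center]


# Toy data: a 6-nucleotide sequence with one coverage value per position.
pairs = list(centered_windows("ACGTAC", [0.0, 1.0, 2.0, 3.0, 4.0, 5.0], window=3))
```

With the toy data above, the first pair is ("ACG", 1.0) and the last is ("TAC", 4.0).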

Generator accepts many keyword arguments to adapt the format of the data both to the keras model and to the task at hand (predicting the genomic function of sequences in different cell types, classifying between several different functions, predicting from two different inputs, labelling DNA sequences with both their genomic function and an experimental coverage, ...).

ModelWrapper is a class designed to bind a keras model to its generator in order to simplify further use of the model (prediction, evaluation).

from keras_dna import ModelWrapper, Generator
from tensorflow.keras.models import Sequential

generator = Generator(batch_size=64,
                      fasta_file='species.fa',
                      annotation_files=['annotation.bw'],
                      window=100)

model = Sequential()
# ... add the layers of your model here ...
# the model needs to be compiled
model.compile(loss='mse', optimizer='adam')

wrapper = ModelWrapper(model=model,
                       generator_train=generator)

Train the model with .train()

wrapper.train(epochs=10)

Evaluate the model on a chromosome with .evaluate()

wrapper.evaluate(incl_chromosomes=['chr1'])
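
The description lists correlation among the standard evaluation metrics. As a library-independent sketch of that metric (not what .evaluate() necessarily computes or returns, which is not shown here), Pearson correlation between predicted and observed coverage can be written as follows. The function name pearson is hypothetical.

```python
import math


# Illustrative sketch only: a plain Pearson correlation, independent of
# keras_dna. Whether .evaluate() reports exactly this value is an assumption.
def pearson(predicted, observed):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(predicted)
    if n != len(observed) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    covariance = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(var_p * var_o)


score = pearson([0.1, 0.4, 0.8], [0.2, 0.8, 1.6])
```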

Predict on a chromosome with .predict()

wrapper.predict(incl_chromosomes=['chr1'], chrom_size='species.chrom.sizes')
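
The chrom_size argument above points to a chromosome sizes file. Such files are commonly two-column, tab-separated text (chromosome name, then length in base pairs), following the UCSC chrom.sizes convention; that keras_dna expects exactly this layout is an assumption here. A small sketch of reading that layout, with a hypothetical helper parse_chrom_sizes and made-up lengths:

```python
# Illustrative sketch only: parses the common two-column chrom.sizes layout.
def parse_chrom_sizes(text):
    """Map each chromosome name to its length in base pairs."""
    sizes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, length = line.split()[:2]
        sizes[name] = int(length)
    return sizes


# Made-up lengths, for illustration only.
example = "chr1\t1000\nchr2\t800\n"
sizes = parse_chrom_sizes(example)
```

To read an actual file, pass its contents, for example parse_chrom_sizes(open('species.chrom.sizes').read()).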

Save the wrapper in hdf5 with .save()

wrapper.save(path='./path/to/wrapper', save_model=True)

Installation:

Dependencies:

  • pandas
  • numpy
  • pybedtools
  • pyBigWig
  • kipoiseq
  • tensorflow 2

We also strongly advise installing genomelake for fast reading of fasta files.
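
Before installing, it can help to see which of these packages are already available. The sketch below uses only the standard library; the import names in MODULES are assumptions derived from the package names listed above (for instance, "tensorflow 2" is imported as tensorflow), and missing_modules is a hypothetical helper.

```python
import importlib.util

# Assumed import names for the dependencies listed above, plus genomelake.
MODULES = ["pandas", "numpy", "pybedtools", "pyBigWig", "kipoiseq",
           "tensorflow", "genomelake"]


def missing_modules(names):
    """Return the names that cannot be found by the current interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Prints the dependencies that still need to be installed, if any.
print(missing_modules(MODULES))
```

This only checks that a module can be located; it does not check its version.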

  • Install Keras_dna from PyPI:

Note: These installation steps assume that you are on a Linux or Mac environment. If you are on Windows, you will need to remove sudo to run the commands below.

sudo pip install keras_dna

If you are using a virtualenv, you may want to avoid using sudo:

pip install keras_dna

Note that libcurl (and the curl-config command) are required for installation. They are typically already installed on many Linux and macOS systems. They are also easily available in a conda environment; in practice we advise installing pyBigWig with conda before installing keras_dna.

  • Alternatively: install Keras_dna from the GitHub source:

First, clone Keras_dna using git:

git clone https://github.com/etirouthier/keras_dna.git

Then, cd to the Keras_dna folder and run the install command:

cd keras_dna
sudo python setup.py install
