Keras Genomics Data Generators

# Isolearn

A Python API for automated loading, processing and streaming of genomics data into deep learning models (Keras).
Implements Keras data generators for loading and encoding Pandas dataframes and RNA-Seq count matrices into numerical tensors.

#### When to use Isolearn
- When you want to encode DNA sequence features (e.g. hexamer counts or one-hot encodings) of large genomic datasets, for use in downstream learning algorithms.
- When you want to stream data and encode sequence features on the fly as mini batches.
- When you want seamless integration with Keras as parallelizable genomic data generators.
- When you are building a complex multi-task model composed of several data sets.
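
To make the two feature types named above concrete, here is a minimal plain-Python sketch of one-hot encoding and hexamer counting. The function names `one_hot_encode` and `kmer_counts`, and the `ACGT` channel order, are assumptions made for illustration; they are not Isolearn's API, whose encoders are configured as shown in the Usage section.

```python
from collections import Counter

ALPHABET = "ACGT"  # channel order is an assumption of this sketch


def one_hot_encode(seq):
    """Return a len(seq) x 4 nested list; unknown bases (e.g. 'N') become all zeros."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq.upper()]


def kmer_counts(seq, k=6):
    """Count every overlapping k-mer in seq (k=6 gives hexamer counts)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


encoded = one_hot_encode("ACGT")
# encoded == [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

hexamers = kmer_counts("AAAAAAAC")
# hexamers == {"AAAAAA": 2, "AAAAAC": 1}
```

A one-hot matrix keeps positional information and suits convolutional models, while hexamer counts discard position and suit linear models such as logistic regression.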

### Installation
The latest stable version of Isolearn can be installed with pip:
```sh
pip install isolearn
```

Isolearn can also be installed from the [GitHub repository](https://github.com/johli/isolearn.git):
```sh
git clone https://github.com/johli/isolearn.git
cd isolearn
python setup.py install
```

#### Isolearn requires the following packages to be installed
- Keras >= 2.2.4
- Pandas >= 0.24.2
- Scipy >= 1.2.1
- Numpy >= 1.16.2
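
The minimum versions above can be checked against the current environment. The following is a minimal sketch using only the standard library (`importlib.metadata`, available from Python 3.8); the helper names `version_tuple` and `check_requirements` are hypothetical and not part of Isolearn, and the version parsing is deliberately simplistic.

```python
from importlib import metadata  # standard library, Python 3.8+

# Minimum versions copied from the requirements list.
MINIMUM_VERSIONS = {
    "Keras": "2.2.4",
    "pandas": "0.24.2",
    "scipy": "1.2.1",
    "numpy": "1.16.2",
}


def version_tuple(version):
    """Turn '1.16.2' into (1, 16, 2), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:  # pad '2.3' to (2, 3, 0)
        parts.append(0)
    return tuple(parts)


def check_requirements(minimums=MINIMUM_VERSIONS):
    """Return {package: (installed_version_or_None, meets_minimum)}."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
            continue
        report[name] = (installed, version_tuple(installed) >= version_tuple(minimum))
    return report


for package, (installed, ok) in check_requirements().items():
    print(f"{package}: installed={installed}, meets minimum={ok}")
```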

### Usage
Isolearn is built around data generators. A generator transforms your sequence data (stored in a Pandas dataframe) and the corresponding measurements (e.g. a column of the same dataframe, or an RNA-Seq count matrix) into numerical input features and output targets.

A simple Keras data generator can be built using the isolearn.keras package:
```python
import isolearn.keras as iso
import pandas as pd
import numpy as np

#Create some functional sequence data
#(rows are elided here; real sequences must be at least 200 nt long for the slice used below)

data = pd.DataFrame(
    {
        'seq' : ['ACGTGGGCTTTCAACTCTAAAACGAGA', 'ACGTGGGCTTTCAACTCTAAAACGAGA', ...],
        'enrichment' : [3.2, -5.1, ...]
    }
)

#Construct a data generator
#The generator one-hot encodes the sequences
#It also takes the log of the enrichment targets

gen = iso.DataGenerator(
    data_ids = np.arange(len(data), dtype=int), #np.int was removed in NumPy 1.24; the builtin int is equivalent
    sources = { 'data' : data },
    batch_size = 32,
    inputs = [
        {
            'id' : 'onehot',
            'source_type' : 'dataframe',
            'source' : 'data',
            'extractor' : lambda row, index: row['seq'][100: 200], #100-nt window of each sequence
            'encoder' : iso.OneHotEncoder(seq_length=100),
            'dim' : (100, 4), #100 positions x 4 nucleotides
            'sparsify' : False
        }
    ],
    outputs = [
        {
            'id' : 'log_enrichment',
            'source_type' : 'dataframe',
            'source' : 'data',
            'extractor' : lambda row, index: row['enrichment'],
            'transformer' : lambda v: np.log(v) #np.log is only defined for positive values; a negative enrichment such as -5.1 yields NaN
        }
    ],
    shuffle = True
)

#Now we are all set to feed the data generator into Keras when training a model.
#We can also use the data generator directly as a batch streamer by simply indexing it:

[X], [y] = gen[13] #Generate batch 13 (one input tensor and one output tensor)
```
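
To illustrate what indexing a generator does, the sketch below is a plain-Python stand-in that maps a batch index to a slice of rows and applies an extractor and an encoder on demand. `ToyBatchStreamer` and `one_hot` are hypothetical names invented for this example, not Isolearn's implementation. The sketch also omits shuffling, and keeping the final partial batch is an assumption of the sketch rather than documented Isolearn behaviour.

```python
import math


class ToyBatchStreamer:
    """Plain-Python stand-in for a batch generator (hypothetical, not isolearn's class)."""

    def __init__(self, rows, batch_size, extract_input, encode_input, extract_output):
        self.rows = rows
        self.batch_size = batch_size
        self.extract_input = extract_input
        self.encode_input = encode_input
        self.extract_output = extract_output

    def __len__(self):
        # Number of batches; this sketch keeps the final, possibly smaller, batch.
        return math.ceil(len(self.rows) / self.batch_size)

    def __getitem__(self, batch_index):
        if not 0 <= batch_index < len(self):
            raise IndexError("batch index out of range")
        start = batch_index * self.batch_size
        batch_rows = self.rows[start:start + self.batch_size]
        # Encoding happens here, on demand, so only one batch is built at a time.
        X = [self.encode_input(self.extract_input(row, start + i))
             for i, row in enumerate(batch_rows)]
        y = [self.extract_output(row, start + i) for i, row in enumerate(batch_rows)]
        return [X], [y]


def one_hot(seq):
    return [[1 if base == letter else 0 for letter in "ACGT"] for base in seq]


# Ten toy rows with strictly positive enrichments, so the log is defined.
rows = [{"seq": "ACGT" * 3, "enrichment": float(i + 1)} for i in range(10)]

streamer = ToyBatchStreamer(
    rows,
    batch_size=4,
    extract_input=lambda row, index: row["seq"][2:6],
    encode_input=one_hot,
    extract_output=lambda row, index: math.log(row["enrichment"]),
)

[X], [y] = streamer[2]  # third batch: rows 8 and 9
print(len(streamer), len(X), len(X[0]), y)
```

Objects that expose `__len__` and `__getitem__` in this way follow the same protocol as `keras.utils.Sequence`, which Keras 2.2.4 consumes through `model.fit_generator`. Presumably this is how the Isolearn generator is handed to a model, though the call itself is not shown in this README.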

### Example Notebooks (Alternative Splicing)
These examples show how to build more complex data generators and how to integrate them into Keras or other learning algorithms. The examples are based on Alternative Splicing data from [https://github.com/Alex-Rosenberg/cell-2015](https://github.com/Alex-Rosenberg/cell-2015).

[Notebook 1a: Logistic Regression of Sequence Hexamer Counts](https://nbviewer.jupyter.org/github/johli/isolearn/blob/master/example/splicing_hexamer_regression.ipynb)<br/>
[Notebook 1b: Logistic Regression of Sequence Hexamer Counts (Multiple Cell Types)](https://nbviewer.jupyter.org/github/johli/isolearn/blob/master/example/splicing_hexamer_regression_multicell.ipynb)<br/>
[Notebook 2a: (Keras) Sequence-Convolutional Neural Network](https://nbviewer.jupyter.org/github/johli/isolearn/blob/master/example/splicing_cnn_multicell.ipynb)<br/>
[Notebook 2b: (Keras) Sequence-Convolutional Neural Network (Sampled Splice Junctions)](https://nbviewer.jupyter.org/github/johli/isolearn/blob/master/example/splicing_cnn_perturbed_multicell.ipynb)<br/>


