Skip to main content

Simple and efficient random access to genomic data for deep learning models.

Project description

# genomelake
[![CircleCI](https://circleci.com/gh/kundajelab/genomelake.svg?style=svg)](https://circleci.com/gh/kundajelab/genomelake)[![Coverage Status](https://coveralls.io/repos/github/kundajelab/genomelake/badge.svg)](https://coveralls.io/github/kundajelab/genomelake)

Efficient random access to genomic data for deep learning models.

Supports the following types of input data:

- bigwig
- DNA sequence

genomelake extracts signal from genomic inputs in provided BED intervals.

## Requirements
- python 2.7 or 3.5
- bcolz
- cython
- numpy
- pybedtools
- pysam

## Installation
Clone the repository and run:

`python setup.py install`

## Getting started: training a protein-DNA binding model
Extract genome-wide sequence data into a genomelake data source:
```python
from genomelake.backend import extract_fasta_to_file

genome_fasta = "/mnt/data/annotations/by_release/hg19.GRCh37/hg19.genome.fa"
genome_data_directory = "./hg19_data_directory"
extract_fasta_to_file(genome_fasta, genome_data_directory)
```

Using a BED intervals file with labels, a genome data source, and genomelake's `ArrayExtractor`, generate input DNA sequences and labels:
```python
import pybedtools
from genomelake.extractors import ArrayExtractor
import numpy as np

def batch_iter(iterable, batch_size):
it = iter(iterable)
try:
while True:
values = []
for n in range(batch_size):
values += (next(it),)
yield values
except StopIteration:
yield values

def generate_inputs_and_labels(intervals_file, data_source, batch_size=128):
bt = pybedtools.BedTool(intervals_file)
extractor = ArrayExtractor(data_source)
intervals_generator = batch_iter(bt, batch_size)
for intervals_batch in intervals_generator:
inputs = extractor(intervals_batch)
labels = []
for interval in intervals_batch:
labels.append(float(interval.name))
labels = np.array(labels)
yield inputs, labels
```

Train a keras model of JUND binding to DNA using 101 base pair intervals and labels in `./examples/JUND.HepG2.chr22.101bp_intervals.tsv.gz`:
```python
from keras.models import Sequential
from keras.layers import Conv1D, Flatten, Dense

intervals_file = "./examples/JUND.HepG2.chr22.101bp_intervals.tsv.gz"
inputs_labels_generator = generate_inputs_and_labels(intervals_file, genome_data_directory)

model = Sequential()
model.add(Conv1D(15, 25, input_shape=(101, 4)))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit_generator(inputs_labels_generator, steps_per_epoch=100)
```

Here is the expected result:
```
100/100 [==============================] - 7s - loss: 0.0584 - acc: 0.9905
```

## License
genomelake is released under the BSD-3 license. See ``LICENSE`` for details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

genomelake-0.1.4.tar.gz (88.5 kB view details)

Uploaded Source

File details

Details for the file genomelake-0.1.4.tar.gz.

File metadata

  • Download URL: genomelake-0.1.4.tar.gz
  • Upload date:
  • Size: 88.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for genomelake-0.1.4.tar.gz
Algorithm Hash digest
SHA256 fcec11be1cce15f08de0df34916f946a9470ed911050f5ab7098a1279e892253
MD5 6ce6e2977502c489a7473cc7bc0c7b16
BLAKE2b-256 07f9bfefaffe5478615f126704c385d0f38383d9dd912a65fb090eced83ba01e

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page