DNA repeat annotations
Project description
DeepGRP is a Python package for predicting genomic repetitive elements with a deep learning model consisting of bidirectional gated recurrent units with attention. DeepGRP was initially based on dna-nn, but was re-implemented and extended using TensorFlow 2.1. It was tested on the prediction of HSAT2,3, alphoid, Alu and LINE-1 elements.
Getting Started
Installation
For installation you can use the PyPI version with:
pip install deepgrp
or install from this repository with:
git clone https://github.com/fhausmann/deepgrp
cd deepgrp
pip install .
Additionally, you can install the development version with poetry:
git clone https://github.com/fhausmann/deepgrp
cd deepgrp
poetry install
Data preprocessing
For training and hyperparameter optimization, the data have to be preprocessed. For inference / prediction, the FASTA sequences can be used directly and you can skip this step. The provided script parse_rm extracts repeat annotations from RepeatMasker output into a TAB-separated format:
parse_rm GENOME.fa.out > GENOME.bed
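To illustrate the kind of conversion involved, here is a minimal sketch that reads RepeatMasker `.out` lines and emits BED-like rows. It is not the implementation of parse_rm: the function name `rm_out_to_bed`, the choice of output columns, and the sample lines are assumptions made for this example, and the columns actually written by parse_rm may differ.

```python
# Illustrative sample in RepeatMasker .out layout (three header lines, then hits).
SAMPLE = """\
   SW   perc perc perc  query     position in query     matching   repeat         position in repeat
score   div. del. ins.  sequence  begin  end   (left)   repeat     class/family   begin  end  (left)  ID

 1504    1.3  0.4  1.3  chr1      10001  10468 (248945954) +  (TAACCC)n  Simple_repeat      1   471    (0)   1
  239   29.4  1.9  1.0  chr1      11505  11675 (248944747) C  L1MC5a     LINE/L1       (2382)  5648   5452   2
"""


def rm_out_to_bed(lines):
    """Yield (chrom, start, end, name, family, strand) for each hit line."""
    for line in lines:
        fields = line.split()
        # Header and blank lines do not start with a numeric score: skip them.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        chrom = fields[4]
        start = int(fields[5]) - 1  # .out is 1-based inclusive, BED is 0-based half-open
        end = int(fields[6])
        strand = "-" if fields[8] == "C" else "+"  # "C" marks the complement strand
        yield chrom, start, end, fields[9], fields[10], strand


rows = list(rm_out_to_bed(SAMPLE.splitlines()))
for row in rows:
    print("\t".join(str(value) for value in row))
```

The start coordinate is shifted by one because RepeatMasker reports 1-based inclusive positions, while BED intervals are 0-based and half-open.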
The FASTA sequences have to be converted to a one-hot-encoded representation, which can be done with:
preprocess_sequence FASTAFILE.fa.gz
preprocess_sequence writes the one-hot-encoded representation in NumPy compressed format to the same directory as the input file.
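As a rough picture of what a one-hot-encoded DNA sequence looks like, the sketch below encodes a short string with NumPy. It is not the code behind preprocess_sequence: the row order `ACGT`, the `(4, length)` array orientation, the all-zero encoding of other characters such as `N`, and the function name `one_hot_encode` are assumptions for this example.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed row order; the package may use a different one


def one_hot_encode(sequence):
    """Return a (4, len(sequence)) uint8 array with one row per nucleotide."""
    codes = np.frombuffer(sequence.upper().encode("ascii"), dtype=np.uint8)
    encoded = np.zeros((len(ALPHABET), codes.size), dtype=np.uint8)
    for row, base in enumerate(ALPHABET):
        # Set a 1 wherever the sequence carries this base.
        encoded[row, codes == ord(base)] = 1
    # Characters outside ALPHABET (for example N) stay as all-zero columns.
    return encoded


encoded = one_hot_encode("acgtN")
print(encoded)
```

An array like this could then be stored with `np.savez_compressed`, which is the standard NumPy call for writing compressed `.npz` files.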
Hyperparameter optimization
For hyperparameter optimization, the GitHub repository provides a Jupyter notebook.
Hyperparameter optimization is based on the hyperopt package.
Training
A model can be trained with the provided Jupyter notebook.
Prediction
Prediction is run with the deepgrp command:
deepgrp <modelfile> <fastafile> [<fastafile>, ...]
where <modelfile> contains the trained model in HDF5 format and <fastafile> is a (multi-)FASTA file containing DNA sequences. Several FASTA files can be given at once.
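For readers unfamiliar with the term, a multi-FASTA file simply holds several records, each introduced by a `>` header line. The sketch below splits such input into records using only the standard library. It does not show how deepgrp reads its input: the function name `read_fasta` and the sample records are hypothetical.

```python
SAMPLE_FASTA = """\
>chr1 example record
ACGTAC
GT
>chr2
GGCC
"""


def read_fasta(lines):
    """Yield (record_id, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            parts = line[1:].split()
            header, chunks = (parts[0] if parts else ""), []
        elif header is not None:
            # Sequence lines of one record are concatenated.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


records = list(read_fasta(SAMPLE_FASTA.splitlines()))
for record_id, sequence in records:
    print(record_id, len(sequence))
```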
Requirements
Requirements are listed in pyproject.toml.
Additionally, a C compiler is required to compile the C/Cython code.
Further information
You can find material to reproduce the results in the repository deepgrp_reproducibility.
Project details
Download files
Built Distributions
Hashes for deepgrp-0.2.2-cp37-cp37m-manylinux_2_33_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 9ad809a7fe3e797e30d7766a1f383978102b35e9981d02ca13b17292e2a696c7
MD5 | e5202c66e88313b94ecce10f0bc6fa57
BLAKE2b-256 | b7da990a7774444f633c60484a4fc566593e53a489878b12edabf34032f44ef7
Hashes for deepgrp-0.2.2-cp36-cp36m-manylinux_2_33_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | c53507bd88aa685dde7ed354ab1e9ac6f599f761b3fdd09c2b351019f12151a5
MD5 | 4956822e443d714d6bff3fa16d329448
BLAKE2b-256 | 49b6fc27b57b3bbfe3524ada4d2e35ed062a1825523d08bde8b50eb85ed1b9c6