DNA repeat annotations
DeepGRP is a Python package for predicting genomic repetitive elements with a deep learning model built from bidirectional gated recurrent units with attention. DeepGRP was initially based on dna-nn, but was re-implemented and extended using TensorFlow 2.1. It was tested on the prediction of HSAT2,3, alphoid, Alu and LINE-1 elements.
Getting Started
Installation
For installation you can use the PyPI version with:
pip install deepgrp
or install from this repository with:
git clone https://github.com/fhausmann/deepgrp
cd deepgrp
pip install .
Additionally you can install the development version with poetry:
git clone https://github.com/fhausmann/deepgrp
cd deepgrp
poetry install
Data preprocessing
For training and hyperparameter optimization the data have to be preprocessed. For inference / prediction the FASTA sequences can be used directly and you can skip this step. The provided script parse_rm extracts repeat annotations from RepeatMasker output and writes them in a tab-separated format:
parse_rm GENOME.fa.out > GENOME.bed
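To illustrate the kind of conversion this step performs, here is a minimal sketch that reads RepeatMasker .out lines and yields tab-separable rows. It is not the parse_rm implementation: the function name rm_out_to_rows and the choice of output columns (sequence, 0-based start, end, repeat class) are assumptions made for illustration, and the real script may emit different columns.

```python
def rm_out_to_rows(lines):
    """Yield (sequence, start, end, repeat_class) from RepeatMasker .out lines.

    Hypothetical helper, not part of deepgrp. Column choice is an assumption.
    """
    for line in lines:
        fields = line.split()
        # Header and blank lines do not start with a numeric Smith-Waterman score.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        sequence = fields[4]
        start = int(fields[5]) - 1  # .out coordinates are 1-based, BED starts are 0-based
        end = int(fields[6])
        repeat_class = fields[10]   # the "repeat class/family" column
        yield sequence, start, end, repeat_class


sample = [
    "   SW   perc perc perc  query     position in query    matching  repeat",
    "score   div. del. ins.  sequence  begin end   (left)   repeat    class/family",
    "",
    "  463   1.3  0.6  1.7  chr1  10001 10468 (248945954) +  (TAACCC)n  Simple_repeat  1  463  (0)  1",
    " 2301  10.5  0.3  1.0  chr1  26791 27053 (248929369) C  AluSp      SINE/Alu  (50)  263  1  2",
]

rows = list(rm_out_to_rows(sample))
for row in rows:
    print("\t".join(str(value) for value in row))
```

The sample lines are made up for the example; only the general .out column layout is relied on.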
The FASTA sequences have to be converted to a one-hot-encoded representation, which can be done with:
preprocess_sequence FASTAFILE.fa.gz
preprocess_sequence writes the one-hot-encoded representation in NumPy compressed format (.npz) to the same directory as the input file.
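For readers unfamiliar with one-hot encoding, the idea can be sketched in a few lines of plain Python. This is only an illustration and not how preprocess_sequence is implemented: the channel order A, C, G, T, the all-zero encoding of other symbols such as N, and the one-row-per-position layout are all assumptions.

```python
BASES = "ACGT"  # assumed channel order, not confirmed by the package


def one_hot_encode(sequence):
    """Return one 4-element row per position; unknown symbols become all zeros."""
    encoded = []
    for nucleotide in sequence.upper():
        # Exactly one entry is 1 for A, C, G or T; none for anything else (e.g. N).
        encoded.append([1 if nucleotide == base else 0 for base in BASES])
    return encoded


matrix = one_hot_encode("acgtN")
for row in matrix:
    print(row)
```

The actual tool stores the result as a compressed NumPy array, whereas this sketch uses nested lists to stay dependency-free.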
Hyperparameter optimization
For hyperparameter optimization the GitHub repository provides a Jupyter notebook.
Hyperparameter optimization is based on the hyperopt package.
Training
Training of a model can be performed with:
deepgrp train <parameter.toml> <TRAIN>.fa.gz.npz <VALIDATION>.fa.gz.npz <annotations.bed>
The prefixes <TRAIN> and <VALIDATION> of the file names should match the row identifiers in the first column of <annotations.bed>.
For more fine-grained control of the training process you can also use the provided Jupyter notebook.
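A quick way to check this naming rule before starting a training run is sketched below. The helpers sequence_prefix, bed_identifiers and missing_prefixes are hypothetical and not part of deepgrp; the sketch assumes the prefix is the file name with the .fa.gz.npz suffix removed and that the annotation file is tab-separated.

```python
import os

SUFFIX = ".fa.gz.npz"


def sequence_prefix(path):
    """Return the file name without directory and without the .fa.gz.npz suffix."""
    name = os.path.basename(path)
    return name[: -len(SUFFIX)] if name.endswith(SUFFIX) else name


def bed_identifiers(bed_lines):
    """Collect the values of the first tab-separated column."""
    return {
        line.rstrip("\n").split("\t", 1)[0]
        for line in bed_lines
        if line.strip() and not line.startswith("#")
    }


def missing_prefixes(paths, bed_lines):
    """Return the prefixes that never occur as a row identifier."""
    identifiers = bed_identifiers(bed_lines)
    return [sequence_prefix(p) for p in paths if sequence_prefix(p) not in identifiers]


bed = ["chr1\t10000\t10468\tSimple_repeat\n", "chr2\t500\t800\tSINE/Alu\n"]
missing = missing_prefixes(["data/chr1.fa.gz.npz", "data/chr3.fa.gz.npz"], bed)
print(missing)
```

An empty result means every input file has at least one annotation row under its prefix.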
Prediction
Prediction is run with the deepgrp command:
deepgrp <modelfile> <fastafile> [<fastafile>, ...]
where <modelfile> contains the trained model in HDF5 format and <fastafile> is a (multi-)FASTA file containing DNA sequences. Several FASTA files can be given at once.
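To make the input format concrete, here is a minimal sketch of reading a (multi-)FASTA file, where each record starts with a ">" header line followed by one or more sequence lines. The function read_fasta is a hypothetical helper for illustration and does not reflect how deepgrp parses its input.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from the lines of a (multi-)FASTA file."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example = [">chr1 test", "ACGT", "AC", "", ">chr2", "GG"]
records = list(read_fasta(example))
for name, sequence in records:
    print(name, len(sequence))
```

For a real file the same generator can be fed an open file handle, for example one returned by gzip.open(path, "rt") for compressed input.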
Requirements
Requirements are listed in pyproject.toml.
Additionally for compiling C/Cython code, a C compiler should be installed.
Contribution
First of all, any contributions are very welcome. If you want to contribute, please open a pull request with your changes. Your code should be formatted with yapf using the default settings and should pass all tests without issues. Currently mypy and pylint are used for static checks, while pytest is used for functional tests.
If you’re adding new functionality, please provide corresponding tests in the tests directory.
Feel free to ask in case of any questions.
Further information
You can find material to reproduce the results in the repository deepgrp_reproducibility.