DNA repeat annotations

DeepGRP is a Python package for predicting genomic repetitive elements with a deep learning model consisting of bidirectional gated recurrent units with attention. DeepGRP was initially based on dna-nn, but was re-implemented and extended using TensorFlow 2.1. DeepGRP was tested for the prediction of HSAT2,3, alphoid, Alu and LINE-1 elements.

Getting Started

Installation

For installation you can use the PyPI version with:

pip install deepgrp

or install from this repository with:

git clone https://github.com/fhausmann/deepgrp
cd deepgrp
pip install .

Alternatively, you can install the development version with poetry:

git clone https://github.com/fhausmann/deepgrp
cd deepgrp
poetry install

Data preprocessing

The data must be preprocessed for training and hyperparameter optimization. For inference / prediction, FASTA sequences can be used directly and you can skip this step. The provided script parse_rm extracts repeat annotations from RepeatMasker output and writes them in a TAB-separated format:

parse_rm GENOME.fa.out > GENOME.bed
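
To illustrate what such a conversion involves, here is a minimal sketch in Python. It is not the code of parse_rm: the function names and the choice of output columns (sequence name, start, end, repeat name, repeat class) are assumptions made for this example, and the columns parse_rm actually writes may differ. The sketch relies only on the standard whitespace-separated column layout of RepeatMasker .out files.

```python
def rm_out_to_records(lines):
    """Yield (chrom, start, end, repeat_name, repeat_class) from RepeatMasker .out lines.

    Hypothetical helper for illustration; parse_rm may select other columns.
    """
    for line in lines:
        fields = line.split()
        # Header and blank lines do not start with an integer score, so skip them.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        chrom = fields[4]
        start = int(fields[5]) - 1  # .out is 1-based inclusive, BED is 0-based half-open
        end = int(fields[6])
        yield chrom, start, end, fields[9], fields[10]


def to_tab_line(record):
    """Format one record as a TAB-separated line."""
    return "\t".join(str(value) for value in record)


example = [
    "   SW   perc perc perc  query     position in query    matching  repeat",
    "score   div. del. ins.  sequence  begin end   (left)   repeat    class/family",
    "",
    "  463   1.3  0.6  1.7  chr1  10001 10468 (248945954) +  (TAACCC)n  Simple_repeat  1 471 (0) 1",
    "  239  29.4  1.9  1.0  chr1  11678 11780 (248944642) C  MER5B  DNA/hAT-Charlie  (74) 104 1 2",
]

for record in rm_out_to_records(example):
    print(to_tab_line(record))
```

The coordinate shift matters: RepeatMasker reports 1-based inclusive positions, while BED-style files use 0-based half-open intervals.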

The FASTA sequences have to be converted to a one-hot-encoded representation, which can be done with:

preprocess_sequence FASTAFILE.fa.gz

preprocess_sequence writes the one-hot-encoded representation in NumPy compressed format (.npz) to the same directory as the input file.
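
For readers unfamiliar with one-hot encoding, the following sketch shows the idea with NumPy. It is an illustration only, not the implementation of preprocess_sequence: the row order A, C, G, T, the all-zero columns for other symbols such as N, the array key "onehot" and the file name are all assumptions made for this example.

```python
import os
import tempfile

import numpy as np

ALPHABET = "ACGT"  # assumed row order, not taken from DeepGRP


def one_hot_encode(sequence):
    """Return a (4, len(sequence)) uint8 array; symbols outside ACGT stay all-zero."""
    lookup = np.full(256, -1, dtype=np.int64)
    for row, base in enumerate(ALPHABET):
        lookup[ord(base)] = row
        lookup[ord(base.lower())] = row  # treat soft-masked (lower-case) bases alike
    codes = lookup[np.frombuffer(sequence.encode("ascii"), dtype=np.uint8)]
    encoded = np.zeros((len(ALPHABET), len(sequence)), dtype=np.uint8)
    known = np.flatnonzero(codes >= 0)
    encoded[codes[known], known] = 1
    return encoded


encoded = one_hot_encode("ACGTNacgt")

# Store and reload the array in NumPy compressed format (hypothetical key and file name).
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.fa.gz.npz")
    np.savez_compressed(path, onehot=encoded)
    with np.load(path) as archive:
        restored = archive["onehot"]

print(restored.shape)
```

Each column of the array corresponds to one position in the sequence and contains at most a single 1, in the row of the base found at that position.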

Hyperparameter optimization

For hyperparameter optimization, the GitHub repository provides a Jupyter notebook.

Hyperparameter optimization is based on the hyperopt package.

Training

Training of a model can be performed with:

deepgrp train <parameter.toml> <TRAIN>.fa.gz.npz <VALIDATION>.fa.gz.npz <annotations.bed>

The prefixes <TRAIN> and <VALIDATION> of the file names should appear as row identifiers in the first column of <annotations.bed>.

For more fine-grained control of the training process you can also use the provided Jupyter notebook.
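
A quick consistency check before training can catch mismatched names. The sketch below is a hypothetical helper and not part of DeepGRP; it assumes that the prefix is the file name with the .fa.gz.npz suffix removed and that the annotation file is TAB-separated with the identifier in its first column.

```python
from pathlib import Path

SUFFIX = ".fa.gz.npz"  # suffix used in the training command above


def sequence_prefix(npz_path):
    """Return the file name without the assumed .fa.gz.npz suffix."""
    name = Path(npz_path).name
    return name[: -len(SUFFIX)] if name.endswith(SUFFIX) else name


def bed_identifiers(bed_lines):
    """Collect the first-column values of TAB-separated annotation lines."""
    identifiers = set()
    for line in bed_lines:
        line = line.rstrip("\n")
        if line.strip() and not line.startswith("#"):
            identifiers.add(line.split("\t", 1)[0])
    return identifiers


def missing_prefixes(npz_paths, bed_lines):
    """Return the prefixes that never occur as a row identifier."""
    identifiers = bed_identifiers(bed_lines)
    return [p for p in map(sequence_prefix, npz_paths) if p not in identifiers]


annotations = ["chr1\t10000\t10468\t1\n", "chr1\t11677\t11780\t2\n"]
print(missing_prefixes(["data/chr1.fa.gz.npz", "data/chr11.fa.gz.npz"], annotations))
```

An empty list means every input file has at least one matching annotation row.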

Prediction

Prediction is run with the deepgrp main command:

deepgrp <modelfile> <fastafile> [<fastafile>, ...]

where <modelfile> contains the trained model in HDF5 format and <fastafile> is a (multi-)FASTA file containing DNA sequences. Several FASTA files can be given at once.
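
As a reminder of the input format, a (multi-)FASTA file holds one or more records, each introduced by a header line starting with ">" and followed by one or more sequence lines. The sketch below parses such records in plain Python. It only illustrates the format and is not the reader used by deepgrp; the function name is chosen for this example.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from the lines of a (multi-)FASTA file."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example = [
    ">chr1 example record\n",
    "ACGTAC\n",
    "GTNN\n",
    "\n",
    ">chr2\n",
    "TTGACA\n",
]

for name, sequence in parse_fasta(example):
    print(name, len(sequence))
```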

Requirements

Requirements are listed in pyproject.toml.

Additionally, a C compiler must be installed to compile the C/Cython code.

Contribution

First of all, contributions are very welcome. If you want to contribute, please open a pull request with your changes. Your code should be formatted with yapf using the default settings and should pass all tests without issues. Currently, mypy and pylint are used for static checks, while pytest is used for functional tests.

If you’re adding new functionality, please provide corresponding tests in the tests directory.

Feel free to ask in case of any questions.

Further information

You can find material to reproduce the results in the repository deepgrp_reproducibility.
