Synthetic rule-based biological sequence data generation for architecture evaluation and search

Development Status
- 4 - Beta
Environment
- Console
Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

seqgra: Principled Selection of Neural Network Architectures for Genomics Prediction Tasks

https://kkrismer.github.io/seqgra/

What is seqgra?

Sequence models based on deep neural networks have achieved state-of-the-art performance on regulatory genomics prediction tasks, such as chromatin accessibility and transcription factor binding. Despite their high accuracy, their contribution to a mechanistic understanding of the biology of regulatory elements is often hindered by the complexity of the predictive model and the resulting poor interpretability of its decision boundaries. To address this, we introduce seqgra, a deep learning pipeline that combines the rule-based simulation of biological sequence data with the training and evaluation of models whose decision boundaries mirror the rules of the simulation process. The method can be used to (1) generate data under the assumption of a hypothesized model of genome regulation, (2) identify neural network architectures capable of recovering the rules of said model, and (3) analyze a model's predictive performance as a function of training set size, noise level, and the complexity of the rules behind the simulated data.
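To make the idea of rule-based simulation concrete, here is a toy sketch of a single rule: positive examples contain a motif implanted at a random position in a random DNA background, and negative examples are background only. This is not seqgra's API or its data definition language. The function name, the motif GATAAG, and the sequence length are all assumptions chosen for illustration.

```python
import random

ALPHABET = "ACGT"


def simulate_example(motif, seq_len, has_motif, rng):
    """Return (sequence, label, motif_start) for one toy example.

    Hypothetical helper, not part of seqgra. motif_start is None for
    background-only sequences.
    """
    # Uniform random background over the DNA alphabet.
    seq = [rng.choice(ALPHABET) for _ in range(seq_len)]
    if not has_motif:
        return "".join(seq), 0, None
    # The "rule": the positive class carries the motif at a random position.
    start = rng.randrange(seq_len - len(motif) + 1)
    seq[start:start + len(motif)] = motif
    return "".join(seq), 1, start


rng = random.Random(0)  # fixed seed so the toy data set is reproducible
examples = [simulate_example("GATAAG", 50, i % 2 == 0, rng) for i in range(4)]
for seq, label, start in examples:
    print(label, start, seq)
```

Because the generating rule is known, a model trained on such data can be checked against it. Note that in this toy version a background-only sequence can still contain the motif by chance.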

Installation

seqgra is a Python package available on PyPI, the package repository behind pip.

To install seqgra with pip, run:

pip install seqgra

To install seqgra directly from this repository, run:

git clone https://github.com/gifford-lab/seqgra
cd seqgra
pip install .
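After either installation route, the installed version can be confirmed from Python. The following is a minimal sketch using the standard library's importlib.metadata, which requires Python 3.8 or higher (on Python 3.7 the importlib_metadata backport would be needed). The helper name installed_version is an assumption for illustration.

```python
from importlib import metadata  # standard library since Python 3.8


def installed_version(package):
    """Return the installed version string of a distribution, or None if missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("seqgra:", installed_version("seqgra") or "not installed")
```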

System requirements

Python 3.7 (or higher)
R 3.5 (or higher)
- R package ggplot2 3.3.0 (or higher)
- R package gridExtra 2.3 (or higher)
- R package scales 1.1.0 (or higher)

The tensorflow package is only required if TensorFlow models are used and is not installed automatically by pip install seqgra. The same is true for the torch and pytorch-ignite packages, which are only required if PyTorch models are used.

R is a soft dependency: it is used to create a number of plots (grammar-model-agreement plots, grammar heatmaps, and motif similarity matrix plots), and if R is not available, these plots are skipped.

seqgra depends upon the Python package lxml, which in turn depends on system libraries that are not always present. On a Debian/Ubuntu machine you can satisfy those requirements using:

sudo apt-get install libxml2-dev libxslt-dev
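The optional and system-level dependencies described above can be probed before running an analysis. The sketch below checks whether the relevant Python modules are importable and whether an Rscript executable is on the PATH. Two assumptions apply: pytorch-ignite is imported under the module name ignite, and looking for Rscript is only one plausible way to detect R (how seqgra itself detects R is not stated here). The names OPTIONAL_MODULES, module_available, and dependency_report are hypothetical.

```python
import importlib.util
import shutil

# Module name -> what it is needed for (import names, not pip package names).
OPTIONAL_MODULES = {
    "tensorflow": "TensorFlow models",
    "torch": "PyTorch models",
    "ignite": "PyTorch models (pip package: pytorch-ignite)",
}


def module_available(name):
    """True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def dependency_report():
    """Return a dict mapping each dependency name to True/False."""
    status = {name: module_available(name) for name in ["lxml", *OPTIONAL_MODULES]}
    # Assumption: R is considered available if Rscript is on the PATH.
    status["Rscript"] = shutil.which("Rscript") is not None
    return status


for name, found in dependency_report().items():
    print(f"{name:<12} {'found' if found else 'missing'}")
```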

Usage

Check out the following help pages:

Usage examples: seqgra example analyses with data definitions and model definitions
Command line utilities: argument descriptions for seqgra, seqgras, seqgrae, and seqgraa commands
Data definition: detailed description of the data definition language that is used to formalize grammars
Model definition: detailed description of the model definition language that is used to describe neural network architectures and hyperparameters for the optimizer, the loss, and the training process
Simulators, Learners, Evaluators, Comparators: brief descriptions of the most important classes
seqgra API reference: detailed description of the seqgra API
Source code: seqgra source code repository on GitHub
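The four command names listed above should be on the PATH after installation. A small sketch to confirm this follows. It only locates the executables and does not call them, since their arguments are documented on the command line utilities help page rather than here. The helper name locate_commands is an assumption.

```python
import shutil

# Command names as listed on the command line utilities help page.
SEQGRA_COMMANDS = ("seqgra", "seqgras", "seqgrae", "seqgraa")


def locate_commands(names=SEQGRA_COMMANDS):
    """Map each command name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


for name, path in locate_commands().items():
    print(f"{name}: {path or 'not found on PATH'}")
```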

Citation

If you use seqgra in your work, please cite:

seqgra: Principled Selection of Neural Network Architectures for Genomics Prediction Tasks
Konstantin Krismer, Jennifer Hammelman, and David K. Gifford
bioRxiv 2021.06.14.448415; DOI: https://doi.org/10.1101/2021.06.14.448415

Funding

We gratefully acknowledge funding from NIH grants 1R01HG008754 and 1R01NS109217.

Release history

- 0.0.4 (this version): Jun 16, 2021
- 0.0.3: Jun 14, 2021
- 0.0.2: Jun 14, 2021

Download files

Source Distribution

seqgra-0.0.4.tar.gz (116.6 kB), uploaded Jun 16, 2021, Source

Built Distribution

seqgra-0.0.4-py3-none-any.whl (179.6 kB), uploaded Jun 16, 2021, Python 3

Hashes for seqgra-0.0.4.tar.gz

Algorithm	Hash digest
SHA256	`f13d868fad51d1388a28f5de0a8d9e578a7370593bae1aa798f2dd6dc8744ee1`
MD5	`1abdbf6520c78bd6e0949b2068a35a49`
BLAKE2b-256	`4869b2fb1af341d52c9e572a5049b01f31a8c3985330ae6478942c2a2f2f76a6`

Hashes for seqgra-0.0.4-py3-none-any.whl

Algorithm	Hash digest
SHA256	`0751fc346e3276fcba6e8a6c68e0d3b4f1030421a9e2775b01fd55f0a6176567`
MD5	`49c4162cdb43672c5cd876d0d10e7ffd`
BLAKE2b-256	`523e1b18b68dff0fdcb24865eea38f4fbbe8a4cb1191fd2dbcfd66dd368655b1`
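The SHA256 digests listed above can be used to check a downloaded file before installing it. The following is a minimal sketch using the standard library's hashlib. The digests are copied from the tables above, while the helper names sha256_of and verify and any local file path passed to them are assumptions.

```python
import hashlib

# SHA256 digests as published in the hash tables above.
EXPECTED_SHA256 = {
    "seqgra-0.0.4.tar.gz":
        "f13d868fad51d1388a28f5de0a8d9e578a7370593bae1aa798f2dd6dc8744ee1",
    "seqgra-0.0.4-py3-none-any.whl":
        "0751fc346e3276fcba6e8a6c68e0d3b4f1030421a9e2775b01fd55f0a6176567",
}


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, filename):
    """True if the file at `path` matches the published digest for `filename`."""
    return sha256_of(path) == EXPECTED_SHA256[filename]
```

For example, verify("seqgra-0.0.4.tar.gz", "seqgra-0.0.4.tar.gz") would return True for an intact copy of the source distribution saved in the current directory.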