A Convolutional Neural Network-based Sequence-to-Expression Prediction Tool
Project description
CoNSEPT (Convolutional Neural Network-based Sequence-to-Expression Prediction Tool)
Saurabh Sinha’s Lab, University of Illinois at Urbana-Champaign Sinha Lab
Description
CoNSEPT is a tool to predict gene expression in various cis and trans contexts. Inputs to CoNSEPT are enhancer sequence, transcription factor levels in one or many trans conditions, TF motifs (PWMs), and any prior knowledge of TF-TF interactions.
Note
Please see here for the version we used in modeling the regulation of rhomboid gene in Drosophila.
Getting Started
CoNSEPT can be installed from PyPI:
pip install CoNSEPT
Preparing Input files
To train CoNSEPT, five input files are required (See Example
directory for examples of input files):
- Enhancer Sequence (FASTA format):
>ENHANCER_1
AACCCA...TTACAAT
>ENHANCER_2
CGGACT...TAACATG
- Gene Expression (tab delimited values):
Rows TRANS_1 TRANS_2 ... TRANS_N
ENHANCER_1 EXPR EXPR EXPR
ENHANCER_2 EXPR EXPR EXPR
- Transcription Factor Levels (tab delimited values):
Rows TRANS_1 TRANS_2 ... TRANS_N
FACTOR_1 LEVEL LEVEL ... LEVEL
FACTOR_2 LEVEL LEVEL ... LEVEL
FACTOR_3 LEVEL LEVEL ... LEVEL
- TF-TF Interactions (tab delimited values):
This file specifies any prior knowledge of TF-TF interactions forcing the model to capture them:
FACTOR_1 FACTOR_1
FACTOR_1 FACTOR_2
FACTOR_2 FACTOR_3
- Position Count/Weight Matrices (tab delimited values):
Count matrices are recommended. All motifs are required to have the same length. Shorter motifs should be padded with equal count/weight over four nucleotides.
>FACTOR_1 LENGTH
A_COUNT C_COUNT G_COUNT T_COUNT
A_COUNT C_COUNT G_COUNT T_COUNT
...
A_COUNT C_COUNT G_COUNT T_COUNT
<
>FACTOR_2 LENGTH
A_COUNT C_COUNT G_COUNT T_COUNT
A_COUNT C_COUNT G_COUNT T_COUNT
...
A_COUNT C_COUNT G_COUNT T_COUNT
<
>FACTOR_3 LENGTH
A_COUNT C_COUNT G_COUNT T_COUNT
A_COUNT C_COUNT G_COUNT T_COUNT
...
A_COUNT C_COUNT G_COUNT T_COUNT
<
NOTE: Enhancer sequence and gene expression files should have the same ordering of enhancers.
Usage
Get a quick help on the required flags for running CoNSEPT:
consept -h
Here is the list of arguments for running in training/prediction modes:
- --sf: path to sequence file containing all train, validation, and test sequences
- --ef: path to gene expression file
- --tf: path to TFs' level file
- --cf: path to TF-TF interactions file
- --pwm: path to PWM/PCM file
- --pcount: pseudo count for modifying PWMs | Default: 0.0
- --nb: number of trans conditions present in gene (and TF) expression files
- --nTrain: the first nTrain sequences (in all trans conditions) are used for training
- --nValid: the next nValid sequences (in all trans conditions) are used for validation
- --nTest: the last nTest sequences (in all trans conditions) are used for testing
- --psb: pool size for extract binding sites, specify as x,2 (e.g. 5,2 or 10,2)
- --csc: size of kernels capturing prior TF interaction, specify as x,2 (e.g. 5,2 or 10,2)
- --sc: stride of kernels capturing prior TF interaction, specify as x,2 (e.g. 5,2 or 10,2)
- --nChan_noPrior: number of channels in the convolutional layer capturing general TF interactions
- --nChans: number of channels in the additional layers capturing longer-range interactions, specify one number per layer as x,y,z (e.g. 64,64,16,4 corresponds to four additional convolutional layers)
- --cAct: network activation function, is used for all layers before the output (e.g. relu)
- --oAct: output layer activation function | Default: No activation
- --dr: network dropout rate | Default: 0
- --cl2: weight of L2 regularization for short-range kernels | Default: 0.01
- --fcl2: weight of L2 regularization for long-range kernels | Default: 0.01
- --bs: batch size (Note: a training data of n1 enhancers and n2 trans conditions has a total size of n1*n2)| Default: 32
- --nEpoch: number of training epochs | Default: 100
- --lr: initial learning rate of ADAM optimizer | Default: 0.00001
- --step_LR: a threshold for train and validation error to control their fluctuations by tuning learning rate. A larger threshhold results in more fluctuations | Default: no threshold
- --save_freq: saving frequency (in epochs) | Default: 4
- --o: path to the output directory
- --restore: if specified, training is resumed from the last saved epoch
- --predict: if specified, the model only makes predictions with no training
- --ckpt: [if --predict is specified] the saved checkpoint to be used for prediction
- --pred_dir: [if --predict is specified] folder name inside the output directory to write predictions
Example
We provided an example dataset in example
directory (data obtained from here). The data consists of expressions driven by 52 enhancers in 17 trans conditions regulated by three TFs. Here is an example command-line for training a CoNSEPT model on this data (38 enhancers are used for training, 3 for validation and the last 11 enhancers for testing):
consept --sf {PATH-TO}/example/seq.fa --ef {PATH-TO}/example/expr.tab --tf {PATH-TO}/example/factor_expr.tab --cf {PATH-TO}/example/coop.tab --pwm {PATH-TO}/example/PWMs.txt --nb 17 --nTrain 38 --nValid 3 --nTest 11 --psb 4,2 --csc 4,2 --sc 4,2 --nChan_noPrior 0 --nChans 36,6 --cAct relu --oAct sigmoid --dr 0.5 --o {PATH-TO}/outputs
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file CoNSEPT-0.5.0.tar.gz
.
File metadata
- Download URL: CoNSEPT-0.5.0.tar.gz
- Upload date:
- Size: 13.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.6.1 requests/2.25.0 setuptools/49.6.0.post20201009 requests-toolbelt/0.9.1 tqdm/4.55.1 CPython/3.6.8
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | aa8f1c59eef890af2e917f992a2ad8a9c844eeba2709f5f21114622215ad8cf6 |
|
MD5 | 49a782d7735037d308dcf0c9fa33697e |
|
BLAKE2b-256 | ff68f5788d4186e56edaa3207c681a1fb6cdc8c95e43fb4d7afd1e1d219eb63c |
File details
Details for the file CoNSEPT-0.5.0-py3-none-any.whl
.
File metadata
- Download URL: CoNSEPT-0.5.0-py3-none-any.whl
- Upload date:
- Size: 17.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.6.1 requests/2.25.0 setuptools/49.6.0.post20201009 requests-toolbelt/0.9.1 tqdm/4.55.1 CPython/3.6.8
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 28bf72e19d6de0c7761e1fc5ecebd3e044d93631d0e8c40007c81ce1602912c1 |
|
MD5 | 4cb52de733668fcf4bdc9b723caca721 |
|
BLAKE2b-256 | 432af3f232d0d58856aa45cb3ca3705b5101a27e3de463123962ec058514664d |