Evaluation of Predictive CapabilitY for ranking biomarker candidates.
Citing:
EPCY: Evaluation of Predictive CapabilitY for ranking biomarker gene candidates. Poster at ISMB ECCB 2019: https://f1000research.com/posters/8-1349
Introduction:
This tool was developed to Evaluate the Predictive CapabilitY of each gene (feature) to become a predictive (bio)marker candidate. Documentation is available via Read the Docs.
Requirements:
python3
(Optional) virtualenv
Install:
Using pypi:
pip install epcy
From source:
python3 -m venv $HOME/.virtualenvs/epcy
source $HOME/.virtualenvs/epcy/bin/activate
pip install pip setuptools --upgrade
pip install wheel
cd [your_epcy_folder]
# If needed
# CFLAGS=-std=c99 pip3 install numpy==1.17.0
python3 setup.py install
epcy -h
Usage:
General:
After install:
epcy -h
From source:
cd [your_epcy_folder]
python3 -m epcy -h
Generic case:
EPCY is designed to work on any quantitative data, provided that the values of each feature are comparable between samples (normalized).
To run a comparative analysis, epcy pred needs two tab-delimited files, a design file (-d) and a quantification matrix (-m):
# Run epcy on any normalized quantification data
epcy pred -d ./data/small_for_test/design.tsv -m ./data/small_for_test/exp_matrix.tsv -o ./data/small_for_test/default_condition
# If your data require a log2 transformation, add --log
epcy pred --log -d ./data/small_for_test/design.tsv -m ./data/small_for_test/exp_matrix.tsv -o ./data/small_for_test/default_condition
Results are saved in the predictive_capability.xls file, which is detailed below.
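As a rough illustration of how the result file can be used once it is written, the sketch below ranks features by kernel_mcc. It assumes the file is tab-delimited with a header row containing at least the id and kernel_mcc columns described in the Output section. The table content here is a made-up toy stand-in, not real EPCY output, and rank_by_mcc is a hypothetical helper, not part of EPCY.

```python
import csv
import io

# Toy stand-in for predictive_capability.xls (made-up values, tab-delimited).
TOY_TABLE = (
    "id\tl2fc\tkernel_mcc\n"
    "geneA\t2.1\t0.85\n"
    "geneB\t-0.3\t0.10\n"
    "geneC\t1.4\t0.62\n"
)


def rank_by_mcc(handle, top=None):
    """Return the rows of a tab-delimited table sorted by decreasing kernel_mcc."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    rows.sort(key=lambda row: float(row["kernel_mcc"]), reverse=True)
    return rows[:top] if top else rows


ranked = rank_by_mcc(io.StringIO(TOY_TABLE))
for row in ranked:
    print(row["id"], row["kernel_mcc"])
```

To use it on a real run, replace io.StringIO(TOY_TABLE) with an open file handle on the result file in your output directory.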
You can customize how the design file is used with --condition and --query:
epcy pred_rna -d ./data/small_for_test/design.tsv -m ./data/small_for_test/exp_matrix.tsv -o ./data/small_for_test/condition2 --condition condition2 --query A
Working on RNA sequencing readcounts:
To run EPCY on read counts that are not normalized, use the pred_rna tool as follows:
# To run on non-normalized read counts, add --cpm --log
epcy pred_rna --cpm --log -d ./data/small_for_test/design.tsv -m ./data/small_for_test/exp_matrix.tsv -o ./data/small_for_test/default_condition
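For readers unfamiliar with the transformation that --cpm --log refers to, here is a minimal sketch of counts per million followed by a log2 transform for one sample. The pseudo-count of 1 and the use of the sample's total count as library size are assumptions made for illustration; EPCY's exact choices may differ. log2_cpm is a hypothetical name.

```python
import math


def log2_cpm(counts, pseudo=1.0):
    """Convert the raw read counts of one sample to log2(CPM + pseudo).

    counts: one raw count per feature, all from the same sample.
    pseudo: value added before the log to avoid log2(0) (assumed, not EPCY's).
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return [math.log2(c / total * 1e6 + pseudo) for c in counts]


# Two samples with the same composition but different sequencing depth
# give identical values after the CPM step.
shallow = log2_cpm([10, 30, 60])
deep = log2_cpm([100, 300, 600])
print(shallow)
print(deep)
```

Dividing by the library size is what makes values comparable between samples, which is the requirement stated for the generic case.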
Working on kallisto quantification:
EPCY can work directly on kallisto quantification using h5 files, which gives access to bootstrapped samples. To do so, add a kallisto column to the design file (specifying the directory that contains the abundance.h5 file for each sample) and run epcy pred_rna as follows:
# To run on kallisto quantification, add --kal (+ --cpm --log)
epcy pred_rna --kal --cpm --log -d ./data/small_leucegene/5_inv16_vs_5/design.tsv -o ./data/small_leucegene/5_inv16_vs_5/
# !!! Take care: kallisto quantification is at the transcript level, not the gene level
To run at the gene level, a gff3 file of the genome annotation is needed to map transcripts to genes. This file can be downloaded from Ensembl.
# To run on kallisto quantification at the gene level, add --gene --anno [file.gff] (+ --kal --cpm --log)
epcy pred_rna --kal --cpm --log --gene --anno ./data/small_genome/Homo_sapiens.GRCh38.84.reduce.gff3 -d ./data/small_leucegene/5_inv16_vs_5/design.tsv -o ./data/small_leucegene/5_inv16_vs_5/
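The transcript-to-gene correspondence mentioned above can be sketched as follows. This is not EPCY's own code: it assumes Ensembl-style GFF3 attributes of the form ID=transcript:... and Parent=gene:..., and it assumes gene-level values are obtained by summing transcript-level values. The identifiers T1, T2, T3, G1 and G2 are toy placeholders, and both function names are hypothetical.

```python
from collections import defaultdict

# Toy GFF3 content with placeholder identifiers (not real annotation).
TOY_GFF3 = [
    "##gff-version 3\n",
    "1\ttoy\tgene\t1\t100\t.\t+\t.\tID=gene:G1;Name=toy1\n",
    "1\ttoy\tmRNA\t1\t100\t.\t+\t.\tID=transcript:T1;Parent=gene:G1\n",
    "1\ttoy\tmRNA\t1\t80\t.\t+\t.\tID=transcript:T2;Parent=gene:G1\n",
    "1\ttoy\tmRNA\t200\t300\t.\t+\t.\tID=transcript:T3;Parent=gene:G2\n",
]


def transcript_to_gene(gff3_lines):
    """Build a {transcript_id: gene_id} map from GFF3 lines."""
    mapping = {}
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a regular feature line
        attrs = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        feat_id = attrs.get("ID", "")
        parent = attrs.get("Parent", "")
        if feat_id.startswith("transcript:") and parent.startswith("gene:"):
            mapping[feat_id.split(":", 1)[1]] = parent.split(":", 1)[1]
    return mapping


def sum_by_gene(transcript_values, mapping):
    """Sum transcript-level values into gene-level values (unmapped ids are dropped)."""
    totals = defaultdict(float)
    for transcript_id, value in transcript_values.items():
        gene_id = mapping.get(transcript_id)
        if gene_id is not None:
            totals[gene_id] += value
    return dict(totals)


mapping = transcript_to_gene(TOY_GFF3)
gene_values = sum_by_gene({"T1": 5, "T2": 7, "T3": 1}, mapping)
print(gene_values)
```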
kallisto quantification also allows working on TPM:
# To work on TPM, replace --cpm with --tpm
epcy pred_rna --kal --tpm --log --gene --anno ./data/small_genome/Homo_sapiens.GRCh38.84.reduce.gff3 -d ./data/small_leucegene/5_inv16_vs_5/design.tsv -o ./data/small_leucegene/5_inv16_vs_5/
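For context on how TPM differs from CPM, the sketch below computes transcripts per million from estimated counts and effective lengths, the two quantities kallisto reports per transcript. kallisto already provides TPM values itself, so this is only an illustration of the definition, and the function name tpm is hypothetical.

```python
def tpm(est_counts, eff_lengths):
    """Compute TPM for one sample from estimated counts and effective lengths."""
    # Reads per base of effective length for each transcript.
    rates = [count / length for count, length in zip(est_counts, eff_lengths)]
    total = sum(rates)
    if total == 0:
        raise ValueError("sample has no reads")
    # Scale so that the values of one sample sum to one million.
    return [rate / total * 1e6 for rate in rates]


# Same count on a transcript twice as long gives half the TPM.
values = tpm([100, 100], [1000, 2000])
print(values)
```

Unlike CPM, TPM corrects for transcript length before scaling, so two transcripts with equal counts but different lengths get different values.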
Output:
predictive_capability.xls
This file is the main output, which contains the evaluation of each feature (genes, proteins, …). It is a tab-delimited file with 9 columns:
Default columns:
id: the id of each feature.
l2fc: log2 Fold change.
kernel_mcc: Matthews Correlation Coefficient (MCC) computed by a predictor using KDE.
kernel_mcc_low, kernel_mcc_high: lower and upper bounds of the 90% confidence interval.
mean_query: mean(values) of the samples specified as Query in design.tsv
mean_ref: mean(values) of the samples specified as Ref in design.tsv
bw_query: estimated bandwidth used by KDE to calculate the density of query samples
bw_ref: estimated bandwidth used by KDE to calculate the density of ref samples
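To make the kernel_mcc, bw_query and bw_ref columns more concrete, here is a minimal sketch of a KDE-based predictor scored with MCC. It is an illustration under stated assumptions, not EPCY's implementation: Gaussian kernels, bandwidths passed in by hand, leave-one-out evaluation, and equal class priors are all assumptions, and the function names are hypothetical.

```python
import math


def gaussian_kde(points, bandwidth):
    """Return a function x -> Gaussian kernel density estimate built on points."""
    norm = len(points) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm

    return density


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from a 2x2 confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def kde_mcc(query, ref, bw_query, bw_ref):
    """Predict each sample with the sample itself left out, then score with MCC."""
    tp = fp = tn = fn = 0
    for i, x in enumerate(query):
        rest = query[:i] + query[i + 1:]
        if gaussian_kde(rest, bw_query)(x) > gaussian_kde(ref, bw_ref)(x):
            tp += 1  # query sample predicted as query
        else:
            fn += 1
    for i, x in enumerate(ref):
        rest = ref[:i] + ref[i + 1:]
        if gaussian_kde(query, bw_query)(x) > gaussian_kde(rest, bw_ref)(x):
            fp += 1  # ref sample predicted as query
        else:
            tn += 1
    return mcc(tp, fp, tn, fn)


# Well-separated toy groups: every sample is predicted correctly.
score = kde_mcc([5.0, 5.2, 4.8, 5.1], [1.0, 1.2, 0.9, 1.1], 0.5, 0.5)
print(score)
```

An MCC of 1 means the feature separates query from ref perfectly, 0 means no better than chance, and negative values mean systematic misclassification.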
Using --normal:
Using --auc --utest:
auc: Area Under the Curve
u_pv: p-value computed by a Mann-Whitney rank test
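The auc and u_pv columns are closely related: the area under the ROC curve equals the Mann-Whitney U statistic divided by the number of (query, ref) pairs. The sketch below shows that relation with a direct pair count. It does not compute the p-value (scipy.stats.mannwhitneyu can provide one), and it is not taken from EPCY's code; the function names are hypothetical.

```python
def mann_whitney_u(query, ref):
    """Count (query, ref) pairs where the query value is larger; ties count 0.5."""
    u = 0.0
    for q in query:
        for r in ref:
            if q > r:
                u += 1.0
            elif q == r:
                u += 0.5
    return u


def auc_from_u(query, ref):
    """AUC as the fraction of (query, ref) pairs ranked in the expected order."""
    return mann_whitney_u(query, ref) / (len(query) * len(ref))


# One tie (3 vs 3) out of nine pairs: U = 8.5, AUC = 8.5 / 9.
auc = auc_from_u([3, 4, 5], [1, 2, 3])
print(auc)
```

An AUC of 0.5 corresponds to no separation; values close to 0 or 1 both indicate a strong separation, in opposite directions.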
Using --ttest:
t_pv: p-value computed by ttest_ind
condition_predicted.xls
Using --full, a secondary output file (condition_predicted.xls) specifies, for each feature, whether each sample has been correctly predicted. Building a heatmap from this output can help you explore your data. More details coming soon.
Bagging:
To improve the stability and accuracy of the computed MCC, you can add n bagging iterations (using -b n).
# Take care: this takes n times longer! Using multiprocessing (-t) is a good idea.
epcy pred_rna -b 4 -t 4 --cpm --log -d ./data/small_for_test/design.tsv -m ./data/small_for_test/exp_matrix.tsv -o ./data/small_for_test/default_condition
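For readers new to the idea, bagging (bootstrap aggregating) trains one predictor per bootstrap resample of the training samples and combines their predictions. The sketch below shows the principle with a deliberately simple threshold predictor. How EPCY resamples and aggregates internally is not described here, so the predictor, the voting scheme and all function names are assumptions for illustration only.

```python
import random
from statistics import mean


def train_threshold(query, ref):
    """Toy predictor: cut at the midpoint between the two group means.

    Returns a function x -> True when x looks like a query sample.
    """
    mean_query, mean_ref = mean(query), mean(ref)
    cut = (mean_query + mean_ref) / 2
    if mean_query > mean_ref:
        return lambda x: x > cut
    return lambda x: x < cut


def bagged_predict(x, query, ref, n_bags, seed=0):
    """Fraction of bootstrap-trained predictors that call x a query sample."""
    rng = random.Random(seed)
    votes = 0
    for _ in range(n_bags):
        # Resample each group with replacement, keeping the group sizes.
        boot_query = rng.choices(query, k=len(query))
        boot_ref = rng.choices(ref, k=len(ref))
        votes += train_threshold(boot_query, boot_ref)(x)
    return votes / n_bags


query = [4.0, 5.0, 6.0]
ref = [0.0, 1.0, 2.0]
print(bagged_predict(5.0, query, ref, n_bags=4))
print(bagged_predict(1.0, query, ref, n_bags=4))
```

The loop trains n_bags predictors, which is why the run time grows roughly linearly with the value given to -b.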