
Evaluation of Predictive CapabilitY for ranking biomarker candidates.

Project description


Citing:

Introduction:

This tool was developed to Evaluate the Predictive CapabilitY of each feature to become a biomarker candidate.

Requirements:

  • python3

  • (Optional) virtualenv

Install:

# (Optional) create and activate a virtual environment
python3 -m venv $HOME/.virtualenvs/epcy
source $HOME/.virtualenvs/epcy/bin/activate
# install EPCY from source and check that the command line works
cd [your_epcy_folder]
CFLAGS=-std=c99 pip3 install numpy==1.17.0
python3 setup.py install
epcy -h

Usage:

General:

From source:

cd [your_epcy_folder]
python3 -m epcy -h

After installation:

epcy -h

Generic case:

  • EPCY is designed to work on any quantitative data, provided that the values of each feature are comparable between samples (normalized).

  • To run a comparative analysis, epcy pred needs two tab-separated files:

    • A matrix of normalized quantitative data with one column per sample and an “ID” column to identify each feature.

    • A design table which describes the comparison (see the sketch just below).
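As an illustration, the sketch below (Python with pandas) shows what these two files could look like. Apart from the “ID” column of the matrix and the Query/Ref labels used by EPCY, the column names, sample names and values are assumptions made for this example only.

# Hypothetical sketch of EPCY input files, not taken from the EPCY documentation.
import pandas as pd

# design.tsv: one row per sample; "subgroup" holds the default Query/Ref labels,
# "subgroup2" is an alternative column usable with --subgroup subgroup2 --query A.
# The sample-identifier column name ("sample") is an assumption.
design = pd.DataFrame({
    "sample":    ["S1", "S2", "S3", "S4"],
    "subgroup":  ["Query", "Query", "Ref", "Ref"],
    "subgroup2": ["A", "A", "B", "B"],
})
design.to_csv("design.tsv", sep="\t", index=False)

# exp_matrix.tsv: an "ID" column plus one column of normalized values per sample.
matrix = pd.DataFrame({
    "ID": ["gene1", "gene2", "gene3"],
    "S1": [5.2, 0.1, 7.8],
    "S2": [4.9, 0.3, 8.1],
    "S3": [1.0, 6.4, 7.9],
    "S4": [0.8, 5.9, 8.2],
})
matrix.to_csv("exp_matrix.tsv", sep="\t", index=False)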

# Run epcy on any normalized quantification data
epcy pred -d ./data/small_for_test/design.tsv -m ./data/small_for_test/exp_matrix.tsv -o ./data/small_for_test/default_subgroup
# If your data require a log2 transformation, add --log
epcy pred --log -d ./data/small_for_test/design.tsv -m ./data/small_for_test/exp_matrix.tsv -o ./data/small_for_test/default_subgroup
  • Results will be saved in the predictive_capability.xls file, which is detailed below.

  • You can personalize the design file using --subgroup and --query:

epcy pred_rna -d ./data/small_for_test/design.tsv -m ./data/small_for_test/exp_matrix.tsv -o ./data/small_for_test/subgroup2 --subgroup subgroup2 --query A

Working on RNA sequencing read counts:

  • To run EPCY on read counts that are not normalized, use the pred_rna tool as follows (a rough sketch of the transformation behind --cpm --log is given after the command):

# To run on read counts that are not normalized, add --cpm --log
epcy pred_rna --cpm --log -d ./data/small_for_test/design.tsv -m ./data/small_for_test/exp_matrix.tsv -o ./data/small_for_test/default_subgroup
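For intuition, --cpm --log corresponds roughly to a counts-per-million scaling followed by a log2 transform. The sketch below illustrates that idea in Python; the pseudocount and the exact formula EPCY applies internally are assumptions here.

# Rough illustration of a CPM + log2 transform (not EPCY's actual code).
import numpy as np
import pandas as pd

counts = pd.DataFrame({
    "ID": ["gene1", "gene2"],
    "S1": [120, 3],
    "S2": [95, 7],
}).set_index("ID")

cpm = counts / counts.sum(axis=0) * 1e6  # scale each sample to counts per million
log_cpm = np.log2(cpm + 1)               # log2 with an assumed pseudocount of 1
print(log_cpm)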

Working on kallisto quantification:

  • EPCY can work directly on kallisto quantification using h5 files, to have access to bootstrapped samples. To do so, a kallisto column needs to be added to the design file (to specify, for each sample, the directory path where its abundance.h5 file can be found; see the illustrative sketch after the command) and epcy pred_rna needs to be run as follows:

# To run on kallisto quantification, add --kal (+ --cpm --log)
epcy pred_rna --kal --cpm --log -d ./data/small_leucegene/5_inv16_vs_5/design.tsv -o ./data/small_leucegene/5_inv16_vs_5/
# !!! Be aware that kallisto quantification is at the transcript level, not the gene level
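As an illustration, the kallisto column could be added to the design sketched earlier as shown below; the directory paths are placeholders and, as before, the other column names are assumptions.

# Hypothetical design.tsv with a "kallisto" column giving, for each sample, the
# directory that contains the abundance.h5 file produced by kallisto.
import pandas as pd

design = pd.DataFrame({
    "sample":   ["S1", "S2"],
    "subgroup": ["Query", "Ref"],
    "kallisto": ["./kallisto/S1", "./kallisto/S2"],
})
design.to_csv("design.tsv", sep="\t", index=False)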
  • To run at the gene level, a gff3 file of the genome annotation is needed, to provide the correspondence between transcripts and genes. This file can be downloaded from Ensembl.

# To run on kallisto quantification at the gene level, add --gene --anno [file.gff] (+ --kal --cpm --log)
epcy pred_rna --kal --cpm --log --gene --anno ./data/small_genome/Homo_sapiens.GRCh38.84.reduce.gff3 -d ./data/small_leucegene/5_inv16_vs_5/design.tsv -o ./data/small_leucegene/5_inv16_vs_5/
  • kallisto quantification also allows working with TPM:

# To work with TPM, replace --cpm with --tpm
epcy pred_rna --kal --tpm --log --gene --anno ./data/small_genome/Homo_sapiens.GRCh38.84.reduce.gff3 -d ./data/small_leucegene/5_inv16_vs_5/design.tsv -o ./data/small_leucegene/5_inv16_vs_5/

Output:

predictive_capability.xls

This file is the main output; it contains the evaluation of each feature (genes, proteins, …). It is a tab-separated file with 9 default columns, plus optional columns depending on the flags used (a small ranking sketch follows the column list):

  • Default columns:

    • id: the id of each feature.

    • l2fc: log2 fold change.

    • kernel_mcc: Matthews Correlation Coefficient (MCC) computed by a predictor using kernel density estimation (KDE).

    • kernel_mcc_low, kernel_mcc_high: boundaries of the 90% confidence interval.

    • mean_query: mean of the values of samples specified as Query in design.tsv.

    • mean_ref: mean of the values of samples specified as Ref in design.tsv.

    • bw_query: estimated bandwidth used by the KDE to compute the density of Query samples.

    • bw_ref: estimated bandwidth used by the KDE to compute the density of Ref samples.

  • Using --normal:

    • normal_mcc: MCC computed by a predictor using normal distributions.

  • Using --auc --utest:

    • auc: Area Under the Curve (AUC).

    • u_pv: p-value computed by a Mann-Whitney rank test.

  • Using --ttest:

    • t_pv: p-value computed by a t-test.
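Despite the .xls extension, predictive_capability.xls is a plain tab-separated file, so it can be loaded with standard tools. The sketch below (an illustration, not part of EPCY) ranks features by kernel_mcc to shortlist biomarker candidates.

# Rank features by predictive capability (kernel_mcc), highest first.
import pandas as pd

pred = pd.read_csv("predictive_capability.xls", sep="\t")
top = pred.sort_values("kernel_mcc", ascending=False).head(20)
print(top[["id", "l2fc", "kernel_mcc", "kernel_mcc_low", "kernel_mcc_high"]])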

subgroup_predicted.xls

Using --full, a secondary output file (subgroup_predicted.xls) specifies, for each feature, whether each sample has been correctly predicted. Building a heatmap from this output can help you explore your data (see the sketch below). More details coming soon.
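As a sketch only, assuming the file has one row per feature, one column per sample and an id column (the exact layout is not documented here), a heatmap could be drawn with seaborn:

# Hypothetical heatmap of per-sample prediction results for the first features.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

pred = pd.read_csv("subgroup_predicted.xls", sep="\t").set_index("id")
sns.heatmap(pred.head(50).astype(float), cmap="viridis")  # assumes numeric 0/1 values
plt.xlabel("samples")
plt.ylabel("features")
plt.savefig("subgroup_predicted_heatmap.png")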

Bagging:

To improve the stability and accuracy of the computed MCC, you can add n bagging iterations (using -b n); a toy illustration of the bagging idea follows the command below.

# Be aware that this takes n times longer; using multiprocessing (-t) is a good idea :).
epcy pred_rna -b 4 -t 4 --cpm --log -d ./data/small_for_test/design.tsv -m ./data/small_for_test/exp_matrix.tsv -o ./data/small_for_test/default_subgroup
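For intuition only, bagging repeats an evaluation on resampled data and aggregates the resulting scores; the toy sketch below illustrates that general idea with scikit-learn and is not EPCY's implementation.

# Generic illustration of bagging an MCC estimate (not EPCY's actual code).
import numpy as np
from sklearn.metrics import matthews_corrcoef

rng = np.random.default_rng(42)
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1, 1, 0])

mccs = []
for _ in range(4):  # -b 4: four bagging iterations
    idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
    mccs.append(matthews_corrcoef(y_true[idx], y_pred[idx]))
print(np.mean(mccs))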

