A tool for prediction of high-risk cancer patients using mutation profiles

Benchmarking of mutation calling techniques by developing classification and regression models to predict high-risk cancer patients.

Introduction

With this method, a user can predict high-risk cancer patients from their mutation profiles, supplied in Variant Call Format (VCF) or Mutation Annotation Format (MAF) and derived using one of four widely used mutation calling techniques: MuTect2, MuSE, VarScan2, and SomaticSniper. Comparisons can be made between the two formats, or between the techniques for the same format. For classification, the dataset is split in an 80:20 ratio: 80% is used for training, with five-fold cross-validation applied while training the model, and the remaining 20% is held out as a testing dataset to evaluate the trained model. For regression, five-fold cross-validation is applied to the whole dataset. The method provides the following seven files as output:

Classification results: This file contains the performance measures for seven classifiers: Decision Tree (DT), Support Vector Classifier (SVC), Random Forest (RF), XGBoost (XGB), Gaussian Naive Bayes (GNB), Logistic Regression (LR), and k-nearest neighbors (KN). The performance of each classifier is measured in terms of sensitivity (Sens), specificity (Spec), accuracy (Acc), Area Under the Receiver Operating Characteristic curve (AUC), F1-score (F1), Kappa, and Matthews Correlation Coefficient (MCC), for the training (tr) and testing (te) datasets.
Regression results: This file contains the performance measures for seven regressors: Random Forest (RFR), Ridge (RID), Lasso (LAS), Decision Tree (DTR), Elastic Net (ENT), Linear Regression (LR), and Support Vector Regression (SVR). The performance of each regressor is calculated in terms of mean absolute error (MAE), root mean-square error (RMSE), R2, Hazard Ratio (HR), and p-value.
Top10 Correlation results: This file contains the correlation between the number of mutations/gene/sample and overall survival time. The gene name, correlation coefficient, and p-value are reported for the top 10 genes ranked by correlation coefficient. Classification and regression models are developed using the mutations/gene/sample values of these top-10 genes.
Correlation results: This file contains the correlation between the number of mutations/gene/sample and overall survival time for all genes, sorted by coefficient.
Mutations per sample per gene file: This file reports the number of mutations/gene/sample along with overall survival time (OS.time) and overall status (OS).
Best Classification Model: This is the model file that the user can use to predict the survival group (high-risk or low-risk) of unknown samples based on the top-10 genes. The model with the highest AUC is saved.
Best Regression Model: This is the model file that the user can use to predict the survival time of unknown samples based on the top-10 genes. The model with the highest HR value is saved.
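
The evaluation protocol described in the Introduction (an 80:20 split, five-fold cross-validation on the training part, and threshold-based metrics such as Sens, Spec, Acc, F1, Kappa, and MCC) can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the package's own code: the function names holdout_split, k_folds, and binary_metrics are hypothetical, the shuffling and fold assignment are arbitrary choices, and AUC is left out because it needs predicted scores rather than labels.

```python
import math
import random


def holdout_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into (train, test) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_folds(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def binary_metrics(y_true, y_pred):
    """Compute metrics from binary labels (1 = high-risk, 0 = low-risk)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = tp + tn + fp + fn

    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    acc = (tp + tn) / n if n else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0

    # Matthews Correlation Coefficient
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2 if n else 0.0
    kappa = (acc - p_e) / (1 - p_e) if p_e != 1 else 0.0

    return {"Sens": sens, "Spec": spec, "Acc": acc,
            "F1": f1, "Kappa": kappa, "MCC": mcc}


# Toy example: 100 samples, 80 for training (5 folds), 20 held out for testing.
train_idx, test_idx = holdout_split(100)
for fold_train, fold_val in k_folds(train_idx, k=5):
    pass  # fit a model on fold_train and validate it on fold_val here

print(binary_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0]))
```

In practice the same steps are usually delegated to scikit-learn, which the package lists as a dependency; the hand-written version only makes the arithmetic behind each reported column explicit.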

Standalone

The standalone version of this method is written in Python 3 and requires the following libraries to run successfully:

scikit-learn
Pandas
Numpy
rpy2
tqdm
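
Before running the tool, it can help to confirm that the libraries listed above are importable. The snippet below is a small, hypothetical helper (missing_packages is not part of the package); it only assumes the usual import names, in particular that scikit-learn is imported as sklearn.

```python
from importlib.util import find_spec

# Package name as listed above -> module name used in an import statement.
REQUIRED = {
    "scikit-learn": "sklearn",
    "pandas": "pandas",
    "numpy": "numpy",
    "rpy2": "rpy2",
    "tqdm": "tqdm",
}


def missing_packages(required=REQUIRED):
    """Return the listed packages whose module cannot be found."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed dependencies are importable.")
```

Any package reported as missing can then be installed with pip before the first run.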

Important Note

In order to run the provided example, please download the following files:

Database file required by annovar to map the coordinates with gene names. Click Me
Example VCF files. Click Me
Download these files and unzip them.

Minimum USAGE

To see the available options for the standalone version, type the following command:

mutation_bench -h

To run the example, type the following command:

mutation_bench -i test/ -t MUTECT2 -f VCF -s gdc_sample_sheet.tsv -c clinical_data.tsv

This will produce the output files described above, using default values for all other parameters. The output is saved in .csv format, with the string "mutation_based_results.csv" as the filename suffix.

Full Usage

usage: muthrp.py [-h]
		-i INPUT
		-t {VARSCAN2,MUTECT2,MUSE,SOMATICSNIPER}
		-f {VCF,MAF}
		-s SAMPLE 
		[-o OUTPUT]
		[-d DATABASE]
		[-c CLINICAL]

Please provide the following arguments for a successful run

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, -I INPUT, --input INPUT
                        Input Directory: Please provide the path of directory containing either VCF or MAF files.
  -t {VARSCAN2,MUTECT2,MUSE,SOMATICSNIPER}, -T {VARSCAN2,MUTECT2,MUSE,SOMATICSNIPER}, --tech {VARSCAN2,MUTECT2,MUSE,SOMATICSNIPER}
                        Techniques: Please provide the technique to be considered from the available options. By default, it is MUTECT2
  -f {VCF,MAF}, -F {VCF,MAF}, --format {VCF,MAF}
                        File Type Formats: VCF: Variant Call Format; MAF: Mutation Annotation Format.
  -s SAMPLE, -S SAMPLE, --sample SAMPLE
                        Sample Data File: Please provide the sample file containing the file IDs of patients to map with the TCGA IDs.
  -o OUTPUT, -O OUTPUT, --output OUTPUT
                        Output: This would be the prefix that will be added in the output filename.
  -d DATABASE, -D DATABASE, --database DATABASE
                        Database: Please provide the path to the database required by annovar to map the coordinates to gene names.
  -c CLINICAL, -C CLINICAL, --clinical CLINICAL
                        Clinical Data File: Please provide the file containing the clinical information of patients with OS and OS.time to calculate HR.

Input File: This argument takes the path of the directory containing VCF or MAF files.

Output File: This string is incorporated into the names of the output files, which are stored in .csv format.

Technique: The user can choose one of the four techniques listed in the help.

Sample file: This file contains the sample information and the corresponding case IDs, which are used to look up the clinical data.

Clinical Data: This file contains the clinical information of the patients.

Database: This folder contains the files used to obtain gene names by mapping the VCF file coordinates onto them.

Format: The user can choose between two mutation file formats, VCF or MAF.
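
For readers who want to see how such an interface fits together, the following argparse sketch reconstructs the options from the help text above. It is a reconstruction for illustration, not the code in muthrp.py: build_parser is a hypothetical name, and because the usage line shows -t without brackets while the help text mentions a default of MUTECT2, the sketch assumes -t is optional with that default.

```python
import argparse

TECHNIQUES = ["VARSCAN2", "MUTECT2", "MUSE", "SOMATICSNIPER"]
FORMATS = ["VCF", "MAF"]


def build_parser():
    """Build a parser mirroring the options shown in the help text."""
    parser = argparse.ArgumentParser(
        prog="mutation_bench",
        description="Please provide the following arguments for a successful run",
    )
    parser.add_argument("-i", "-I", "--input", required=True,
                        help="Directory containing either VCF or MAF files")
    # Assumption: optional with a MUTECT2 default, as the help text suggests.
    parser.add_argument("-t", "-T", "--tech", choices=TECHNIQUES,
                        default="MUTECT2", help="Mutation calling technique")
    parser.add_argument("-f", "-F", "--format", choices=FORMATS, required=True,
                        help="File type format")
    parser.add_argument("-s", "-S", "--sample", required=True,
                        help="Sample sheet mapping file IDs to TCGA IDs")
    parser.add_argument("-o", "-O", "--output",
                        help="Prefix added to the output filenames")
    parser.add_argument("-d", "-D", "--database",
                        help="Path to the database required by annovar")
    parser.add_argument("-c", "-C", "--clinical",
                        help="Clinical data file with OS and OS.time")
    return parser


# Parse the example command from the Minimum USAGE section.
args = build_parser().parse_args(
    ["-i", "test/", "-t", "MUTECT2", "-f", "VCF",
     "-s", "gdc_sample_sheet.tsv", "-c", "clinical_data.tsv"]
)
print(args.input, args.tech, args.format, args.sample, args.clinical)
```

Restricting -t and -f with choices means an unsupported technique or format is rejected at parse time with a usage message, before any file is read.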

Package Files

The package contains the following files, briefly described below:

README.md : This file provides information about this package.

muthrp.py : Main Python program.

test : This folder contains the VCF files for the test run.

humandb : This folder contains the files for gene mapping required by the annovar script.

clinical_data.tsv : This file contains the clinical data i.e. OS and OS.time for each patient.

gdc_sample_sheet.tsv : This file contains the information of samples.

convert2annovar.pl : Perl script to convert "genotype calling" format into ANNOVAR format.

annotate_variation.pl : Perl script for annotating variations.

Mutations_gene_sample_MUTECT2_VCF_test_run.csv : This is the example output which reports mutations/gene/sample.

Correlation_MUTECT2_VCF_test_run.csv : This is the example output reporting the correlation between mutations/gene and OS.time.

Top10_Correlated_genes_MUTECT2_VCF_test_run.csv : This is the example output listing the top-10 genes used to train the classification and regression models.

Classification_MUTECT2_VCF_test_run.csv : This is the example output for classification results.

Regression_MUTECT2_VCF_test_run.csv : This is the example output for the performance of the regression models.

LR_Mutect2_VCF_Classification.pkl : This is the pickle file for the best classification model trained on the top-10 genes.

ENT_Mutect2_VCF_Regression.pkl : This is the pickle file for the best regression model trained on the top-10 genes.
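
The saved model files can in principle be reused from Python. The sketch below assumes they were written with the standard pickle module and contain a scikit-learn style estimator with a predict method; load_model and predict_samples are hypothetical helper names, and the feature order is assumed to match the top-10 genes used in training. Pickle files can execute arbitrary code when loaded, so only open files from a trusted source.

```python
import pickle
from pathlib import Path


def load_model(path):
    """Unpickle a saved model file (trusted sources only)."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


def predict_samples(model, rows):
    """Predict one value per sample.

    rows: list of per-sample feature lists, assumed to hold the mutation
    counts for the top-10 genes in the same order used during training.
    """
    return list(model.predict(rows))


# Example usage (requires the model file and scikit-learn to be available):
# model = load_model("LR_Mutect2_VCF_Classification.pkl")
# labels = predict_samples(model, [[0, 1, 0, 0, 2, 0, 0, 1, 0, 0]])
```

The same two helpers would apply to the regression model file, in which case the returned values are predicted survival times rather than risk groups.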

Reference

Patiyal S, Dhall A, Raghava GPS. Prediction of risk-associated genes and high-risk liver cancer patients from their mutation profile: benchmarking of mutation calling techniques. Biol Methods Protoc. 2022 May 27;7(1):bpac012. doi: 10.1093/biomethods/bpac012
