An RNA-Seq based tool for Supervised Cancer Origin Prediction using Expression
Cancerscope for SCOPE
SCOPE, Supervised Cancer Origin Prediction using Expression, is a method for predicting the tumor type (or matching normal) of an RNA-Seq sample.
SCOPE's Python package, cancerscope, lets users pass in RPKM values with matching gene IDs and receive a set of probabilities across 66 categories (40 tumor types and 26 healthy tissues) that sum to 1. Users can optionally evaluate the impact of various pathways on the classification outcome using the 'PIE' pathway impact evaluation extension.
Since SCOPE is an ensemble-based approach, it is possible to train additional models and include them in the ensemble that SCOPE uses (Instructions forthcoming).
Installation
Using theano and lasagne backend
All releases pre-Version 1.00 are theano and lasagne compatible (py2.7-py3.7 supported)
Before installing cancerscope, you will need to install the correct version of the packages lasagne and theano.
pip install --upgrade https://github.com/Theano/Theano/archive/master.zip
pip install --upgrade https://github.com/Lasagne/Lasagne/archive/master.zip
You may also need the following:
pip install mkl-service
Automated Install
Make sure you have all other required libraries installed (only needed if using Theano/lasagne backend).
You can set up cancerscope using the command pip install cancerscope.
At initial install, cancerscope will attempt to download the models needed for prediction. This may take a while depending on your internet connection (3-10 minutes). Please ensure you have a reliable internet connection and at least 5 GB of free space before proceeding with the install.
Setup and Usage
To get started with SCOPE, launch a python instance and run:
>>> import cancerscope
If the download was unsuccessful at the time of package install, the package will attempt to set up a local download of the models needed for prediction the first time you import cancerscope. Please be patient, as this will take a while (3-10 minutes).
Prediction - Example
Prediction can be performed from a pre-formatted input file, or by passing in the data matrix. Please refer to the tutorial and detailed documentation for more information.
The commands are as simple as follows:
>>> import cancerscope as cs
>>> scope_obj = cs.scope()
This will set up the references to the required SCOPE models.
Next, you can process the predictions straight from the input file:
>>> predictions_from_file = scope_obj.get_predictions_from_file(filename)
Here, the input file should be prepared as follows. Columns should be tab-separated, with unique sample IDs. The first column is always the gene identifier (official HUGO ID, Ensembl Gene ID, or Gencode). Each cell is the RPKM value of the corresponding gene-sample pair. An example is shown with the first 2 rows of input.
ENSEMBL | Sample 1 | Sample 2 | ... |
---|---|---|---|
ENSG000XXXXX | 0.2341 | 9451.2 | .... |
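As a rough illustration of the layout described above, the following sketch writes a small tab-separated file using only the Python standard library. The helper name write_scope_input, the file name, and the RPKM values are made up for the example and are not part of cancerscope; the header label ENSEMBL simply mirrors the table above.

```python
import csv
import os
import tempfile

# Illustrative values only: two Ensembl gene IDs with made-up RPKM values.
rows = [
    ("ENSG00000141510", 12.7, 3.4),
    ("ENSG00000012048", 0.2341, 9451.2),
]
sample_ids = ["Sample1", "Sample2"]  # sample IDs must be unique


def write_scope_input(path, sample_ids, rows, id_header="ENSEMBL"):
    """Write a tab-separated matrix: gene ID first, then one column per sample.

    Hypothetical helper for this example; it is not a cancerscope function.
    """
    if len(set(sample_ids)) != len(sample_ids):
        raise ValueError("sample IDs must be unique")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow([id_header, *sample_ids])
        for gene_id, *values in rows:
            if len(values) != len(sample_ids):
                raise ValueError(f"{gene_id}: expected {len(sample_ids)} values")
            writer.writerow([gene_id, *values])


# Write to a temporary directory so the example does not touch real data.
input_path = os.path.join(tempfile.mkdtemp(), "scope_input.txt")
write_scope_input(input_path, sample_ids, rows)
```

The resulting path could then be passed as the filename argument of get_predictions_from_file.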
...or you can pass in the data matrix, the list of feature names, the type of gene names (ENSG, HUGO etc.), and optionally, the list of sample names.
>>> predictions = scope_obj.predict(
X = numpy_array_X,
x_features = list_of_features,
x_features_genecode = string_genecode,
x_sample_names = list_of_sample_names)
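If your expression data already sits in a pandas DataFrame laid out like the input file (gene IDs in the first column, one column per sample), the arguments above could be derived along these lines. This is a minimal sketch: split_expression_table is a hypothetical helper, the values are made up, and the matrix is kept with genes on rows as in the file. This README does not state which orientation predict expects, so check the detailed documentation and transpose with X.T if needed.

```python
import numpy as np
import pandas as pd


def split_expression_table(table):
    """Split a genes-by-samples table into matrix, feature names, and sample names.

    Hypothetical helper for this example; it is not a cancerscope function.
    Assumes the first column holds gene IDs and the rest hold RPKM values.
    """
    gene_column = table.columns[0]
    features = table[gene_column].astype(str).tolist()
    sample_names = [str(name) for name in table.columns[1:]]
    # Rows are genes and columns are samples, mirroring the input file layout.
    X = table.iloc[:, 1:].to_numpy(dtype=float)
    return X, features, sample_names


# Made-up RPKM values for two genes and two samples.
table = pd.DataFrame(
    {
        "ENSEMBL": ["ENSG00000141510", "ENSG00000012048"],
        "Sample1": [12.7, 0.2341],
        "Sample2": [3.4, 9451.2],
    }
)
X, features, sample_names = split_expression_table(table)
```

Here X, features, and sample_names would play the roles of numpy_array_X, list_of_features, and list_of_sample_names in the call above.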
The output will look like this:
'ix' | sample_ix | label | pred | freq | models | rank_pred | sample_name |
---|---|---|---|---|---|---|---|
0 | 0 | BLCA_TS | 0.268193 | 2 | v1_none17kdropout,v1_none17k | 1 | test1 |
1 | 0 | LUSC_TS | 0.573807 | 1 | v1_smotenone17k | 2 | test1 |
2 | 0 | PAAD_TS | 0.203504 | 1 | v1_rm500 | 3 | test1 |
3 | 0 | TFRI_GBM_NCL_TS | 0.552021 | 1 | v1_rm500dropout | 4 | test1 |
4 | 1 | ESCA_EAC_TS | 0.562124 | 2 | v1_smotenone17k,v1_none17k | 1 | test2 |
5 | 1 | HSNC_TS | 0.223115 | 1 | v1_rm500 | 2 | test2 |
6 | 1 | MB-Adult_TS | 0.743373 | 1 | v1_none17kdropout | 3 | test2 |
7 | 1 | TFRI_GBM_NCL_TS | 0.777685 | 1 | v1_rm500dropout | 4 | test2 |
Here, 2 samples, called test1 and test2, were processed. The top prediction from each model in the ensemble was taken, and aggregated.
- For instance, 2 models predicted that 'BLCA_TS' was the most likely class for test1. The column freq gives you the count of contributing models for a prediction, and the column models lists these models. The other 3 models had a prediction of 'LUSC_TS', 'PAAD_TS', and 'TFRI_GBM_NCL_TS' respectively.
- You can use the rank of the predictions, shown in the column rank_pred, to filter out the prediction you want to use for interpretation.
- When SCOPE is highly confident in the prediction, you will see freq = 5, indicating all models have top-voted for the same class.
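The filtering described in these points can be sketched with pandas. The rows below are copied from the example output (the models column is omitted for brevity); the agreement column is an addition for illustration and is not something cancerscope returns.

```python
import pandas as pd

# Rows copied from the example output table; the 'models' column is omitted.
predictions = pd.DataFrame(
    [
        (0, "BLCA_TS", 0.268193, 2, 1, "test1"),
        (0, "LUSC_TS", 0.573807, 1, 2, "test1"),
        (0, "PAAD_TS", 0.203504, 1, 3, "test1"),
        (0, "TFRI_GBM_NCL_TS", 0.552021, 1, 4, "test1"),
        (1, "ESCA_EAC_TS", 0.562124, 2, 1, "test2"),
        (1, "HSNC_TS", 0.223115, 1, 2, "test2"),
        (1, "MB-Adult_TS", 0.743373, 1, 3, "test2"),
        (1, "TFRI_GBM_NCL_TS", 0.777685, 1, 4, "test2"),
    ],
    columns=["sample_ix", "label", "pred", "freq", "rank_pred", "sample_name"],
)

# Keep only the top-ranked prediction for each sample.
top = predictions[predictions["rank_pred"] == 1].set_index("sample_name")

# freq = 5 means all models agree, so freq / 5 gives the share of agreeing models.
N_MODELS = 5
top = top.assign(agreement=top["freq"] / N_MODELS)

print(top[["label", "freq", "agreement"]])
```

In this example both samples have a top class backed by only 2 of the 5 models, so neither would count as a high-confidence call under the freq = 5 criterion.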
Visualizing or exporting results - Example
cancerscope can also automatically generate plots for each sample, and save the prediction dataframe to file. This is done by passing the output directory to the prediction functions:
>>> predictions_from_file = scope_obj.get_predictions_from_file(filename, outdir = output_folder)
>>> predictions = scope_obj.predict(X = numpy_array_X, x_features = list_of_features, x_features_genecode = string_genecode, x_sample_names = list_of_sample_names, outdir = output_folder)
This will automatically save the dataframe returned from the prediction functions as output_folder + /SCOPE_topPredictions.txt, and the predictions from all models across all classes as output_folder + /SCOPE_allPredictions.txt.
Sample-specific plots could also be generated automatically in the same directory, labelled SCOPE_sample-SAMPLENAME_predictions.svg. As of version 0.30, this option has been deprecated, but plots can still be generated from the dataframes provided (SCOPE_allPredictions.txt).
Citing cancerscope
If you have used this package for any academic research, it would be great if you could cite the associated paper.
Full citation:
Grewal JK, Tessier-Cloutier B, Jones M, et al. Application of a Neural Network Whole Transcriptome–Based Pan-Cancer Method for Diagnosis of Primary and Metastatic Cancers. JAMA Netw Open. 2019;2(4):e192597. doi:10.1001/jamanetworkopen.2019.2597
A bibtex citation is provided for your ease of use:
@article{jgscope2019,
author = {Grewal, Jasleen K. and Tessier-Cloutier, Basile and Jones, Martin and Gakkhar, Sitanshu and Ma, Yussanne and Moore, Richard and Mungall, Andrew J. and Zhao, Yongjun and Taylor, Michael D. and Gelmon, Karen and Lim, Howard and Renouf, Daniel and Laskin, Janessa and Marra, Marco and Yip, Stephen and Jones, Steven J. M.},
title = "{Application of a Neural Network Whole Transcriptome–Based Pan-Cancer Method for Diagnosis of Primary and Metastatic Cancers}",
journal = {JAMA Network Open},
volume = {2},
number = {4},
pages = {e192597-e192597},
year = {2019},
month = {04},
issn = {2574-3805},
doi = {10.1001/jamanetworkopen.2019.2597},
url = {https://doi.org/10.1001/jamanetworkopen.2019.2597},
eprint = {https://jamanetwork.com/journals/jamanetworkopen/articlepdf/2731678/grewal\_2019\_oi\_190114.pdf},
}
License
cancerscope is distributed under the terms of the MIT license.
Feature requests
If you wish the outputs were slightly (or significantly) easier to use, or want to see additional options for customizing the output, please open up a GitHub issue here.
Issues
If you encounter any problems, please contact the developer and provide detailed error logs and description here.
Common Errors
Theano is a bit finicky when working with the cudnn backend, and may sometimes throw errors due to version conflicts. Here's a common one if you are setting up cancerscope in a GPU-friendly environment.
RuntimeError: Mixed dnn version. The header is version 5110 while the library is version 7401.
- Please ensure that only 1 cudnn version exists on your system.
- Cancerscope has been developed and tested with cudnn-7.0 (v3.0)
pkg_resources.VersionConflict: (pandas xxxx (/path/to/sitepckgs/), Requirement.parse('pandas>=0.23.4'))
- This error may arise because you have an older version of pandas installed, which conflicts with the plotting library we use (plotnine, which needs pandas >=0.23.4).
- You can either manually install plotnine ('pip install plotnine') or update your pandas library ('pip install --upgrade pandas').
The following required packages cannot be built: freetype, png
- You need to install these dependencies for matplotlib. If using conda, run the following:
conda install freetype; conda install libpng; conda install matplotlib
Otherwise, running pip install matplotlib should resolve the issue.