A method to improve the performance of TCGA pancancer classifiers
Project description
PanClassif: A machine learning classifier pipeline for TCGA pancancer classification
This is a complete machine learning pipeline package to work with TCGA cancer RNA-seq gene count data.
Data prerequisites
- TCGA cancer and normal samples downloaded using TCGA2STAT
- A smoothed version of the data collected above, produced with knn-smoothing (Wagner et al., 2017)
- Dataset: available on Mendeley
Functions
featSelect(homepath, cancerpath, normalpath, k)
Params
- homepath : (str) Path where all generated files and folders will be saved.
- cancerpath : (str) Path where the cancer gene expression matrices for all cancers are located.
- normalpath : (str) Path where the normal gene expression matrices for all cancers are located.
- k : (int) The number of top genes to select per cancer (default: k=5). k cannot be less than 5.
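The role of k can be illustrated with a small standalone sketch. The function name `top_k_genes` and the ranking score (absolute difference between mean cancer and mean normal expression) are assumptions made for illustration only; the criterion that featSelect actually uses is not described here. The only part taken from the description above is the constraint that k cannot be less than 5.

```python
from statistics import mean


def top_k_genes(cancer, normal, k=5):
    """Return the k gene names with the largest cancer/normal mean difference.

    `cancer` and `normal` map gene name -> list of expression values.
    The score is a hypothetical stand-in, not PanClassif's own criterion.
    """
    if k < 5:
        raise ValueError("k must be at least 5")
    # Only genes present in both matrices can be compared.
    shared = sorted(set(cancer) & set(normal))
    scores = {g: abs(mean(cancer[g]) - mean(normal[g])) for g in shared}
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(shared, key=lambda g: (-scores[g], g))
    return ranked[:k]


# Toy data: six genes, two samples each.
cancer = {"G1": [10, 12], "G2": [5, 5], "G3": [0, 2],
          "G4": [7, 9], "G5": [3, 3], "G6": [20, 22]}
normal = {"G1": [1, 3], "G2": [5, 5], "G3": [1, 1],
          "G4": [2, 2], "G5": [4, 4], "G6": [1, 1]}
selected = top_k_genes(cancer, normal, k=5)
```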
dataProcess(homepath,names,cancerpath,smoothed_cancer,smoothed_normal,scale_mode)
Params
- homepath : (str) Path where all generated files and folders will be saved.
- names : (list) List of the cancer names returned by the featSelect function.
- cancerpath : (str) Path where the cancer gene expression matrices for all cancers are located.
- smoothed_cancer : (str) Path where the smoothed cancer gene expression matrices for all cancers are located.
- smoothed_normal : (str) Path where the smoothed normal gene expression matrices for all cancers are located.
- scale_mode : (int) Data scaling mode: 0 for standardization, 1 for normalization.
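To make the two scale_mode options concrete, here is a minimal sketch of what standardization and normalization usually mean. It assumes that "normalization" refers to min-max scaling into [0, 1] and that standardization uses the population standard deviation; the package may compute either differently. The helper names `standardize`, `min_max_normalize`, and `scale` are hypothetical and are not part of PanClassif.

```python
from statistics import mean, pstdev


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - mu) / sigma for v in values]


def min_max_normalize(values):
    """Rescale linearly so the smallest value is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def scale(values, scale_mode=0):
    """Mirror the documented convention: 0 = standardization, 1 = normalization."""
    if scale_mode == 0:
        return standardize(values)
    if scale_mode == 1:
        return min_max_normalize(values)
    raise ValueError("scale_mode must be 0 or 1")


counts = [2.0, 4.0, 6.0]
standardized = scale(counts, scale_mode=0)
normalized = scale(counts, scale_mode=1)
```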
upsampled(names, homepath)
binary_merge(names, homepath)
multi_merge(names, homepath)
Params
- names : (list) List of the cancer names returned by the featSelect function.
- homepath : (str) Path where all generated files and folders will be saved.
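How upsampled balances the classes is not described here. One common approach is random oversampling, in which samples from the smaller classes are duplicated at random until every class matches the largest one. The sketch below shows that technique under this assumption; `random_oversample` is a hypothetical helper, not a PanClassif function.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes are equal in size."""
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    # Every class is grown to the size of the largest class.
    target = max(len(group) for group in by_label.values())
    out_samples, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy data: four cancer samples and one normal sample.
samples = [[1.0], [2.0], [3.0], [4.0], [9.0]]
labels = ["cancer", "cancer", "cancer", "cancer", "normal"]
X, y = random_oversample(samples, labels)
balanced = Counter(y)
```

After the call, both classes contain four samples, and every added normal sample is a copy of the single original one.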
classification(homepath, classifier, mode, save_model)
Params
- homepath : (str) Path where all generated files and folders will be saved.
- classifier : (sklearn classification model) An instance of the classification model you want to use. For example: RandomForestClassifier(n_estimators=100).
- Or, classifier : (str) To use the neural network, pass the string "NN". For example: classifier = "NN"
- mode : (str) There are two modes: "binary" for binary classification and "multi" for multiclass classification. (default: mode = "binary")
- save_model : (str) Optional. Use it only if you want to save the model. For example: save_model = "your_model_name"
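Because classifier accepts either an estimator instance or the string "NN", and mode accepts only two values, it can help to validate the arguments before starting a long run. The following is a sketch of such a check; `check_classification_args` and `DummyEstimator` are hypothetical names, and the check reflects only the parameter description above, not the package's internal logic.

```python
def check_classification_args(classifier, mode="binary"):
    """Return which kind of classifier was passed, or raise on invalid input."""
    if mode not in ("binary", "multi"):
        raise ValueError('mode must be "binary" or "multi"')
    if isinstance(classifier, str):
        if classifier != "NN":
            raise ValueError('the only accepted string is "NN"')
        return "neural-network"
    # scikit-learn style estimators expose fit() and predict() methods.
    has_fit = callable(getattr(classifier, "fit", None))
    has_predict = callable(getattr(classifier, "predict", None))
    if has_fit and has_predict:
        return "estimator"
    raise TypeError('classifier must be an estimator instance or "NN"')


class DummyEstimator:
    """Stand-in for an sklearn model such as RandomForestClassifier."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [0 for _ in X]


kind_estimator = check_classification_args(DummyEstimator(), mode="multi")
kind_nn = check_classification_args("NN")
```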
gsea(homepath)
Params
- homepath : (str) Path where all generated files and folders will be saved.
Example
homepath = '/home'
cancerpath = '/home/cancer/'
normalpath = '/home/normal/'
smoothed_cancer = '/home/smoothed_cancer'
smoothed_normal = '/home/smoothed_normal'
Data Load and Process Phase
import panclassif as pc
# The functions must be called in the order below for the pipeline to work
names = pc.featSelect(homepath,cancerpath,normalpath, k=5)
pc.dataProcess(homepath,names,cancerpath,smoothed_cancer,smoothed_normal)
pc.upsampled(names, homepath)
pc.binary_merge(names, homepath)
pc.multi_merge(names, homepath)
Classification Phase
from sklearn.ensemble import RandomForestClassifier
pc.classification(homepath, RandomForestClassifier(n_estimators=100), mode="multi", save_model="RF")
Gene enrichment check
pc.gsea(homepath)
Download files
Source Distributions
No source distribution files available for this release.
Built Distribution
File details
Details for the file panclassif-2.1.3-py3-none-any.whl.
File metadata
- Download URL: panclassif-2.1.3-py3-none-any.whl
- Upload date:
- Size: 16.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.26.0 setuptools/57.4.0 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.9.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | f9c417e851476895cfbb261644f24a09ad396bf914d28f9383721d653b55829c
MD5 | 85f1859ee942a464028d46d902658724
BLAKE2b-256 | c07fa271948b52b7093b82ad99cf5770eed545e6fa4cbe5dbc574313ce610772