GMM with chi-square protocol
GMMchi
GMMchi is a Python package for Gaussian mixture modeling using the chi-square protocol. It enables efficient subcategorization of gene expression data in large datasets. The method is based on identifying mixtures of normal distributions and non-normal tails. Although GMMchi was developed to identify unique patterns in gene expression data, the general task of pattern identification and subcategorization gives it a much wider application: any dataset that exhibits mixtures of normal or non-normal data.
When applying GMMchi, a priori knowledge of distinct subpopulations arising from underlying mechanisms (e.g. mutation, methylation, etc.) is ideal for interpreting the results of the analysis. The pre-print use case was gene expression data analysis in which bimodal distributions correspond to mutated vs. normal populations. In other words, the expectation is that the mutated and wild-type subpopulations show expression levels that cluster in distinct distributions.
GMMchi offers a systematic approach for identifying and characterizing different patterns of normal and non-normal mixtures. Moreover, transforming continuous data into categorized data lets users correlate genes with phenotypes and explore data using a pattern-based (vs. the traditional trend-based) analysis. This package assumes users have no Python knowledge and therefore starts with very simple instructions. We will explore several examples of GMMchi in gene expression analysis.
Getting Started
Download Anaconda at https://www.anaconda.com/distribution/
After downloading, open Terminal (Mac) or cmd (Windows). Open Jupyter Notebook either by entering 'jupyter notebook' in the terminal/cmd or by using the Anaconda application you downloaded. This opens an IDE (integrated development environment) in your default browser. Jupyter Notebook presents your computer's files in an IDE and lets you interact directly with blocks of Python code, which is a much more pleasant experience than using the CLI (command-line interface).
Create a new Python notebook by clicking New --> Python 3 notebook in the top right corner. You can create it in a new folder or in an existing folder where you keep your Python scripts.
Download Package
Download the GMMchi package by:
pip install git+https://github.com/jeffliu6068/GMMchi.git
or
pip install GMMchi
Import
Once installed, import the package by:
import GMMchi
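To confirm that the install step worked before moving on, a quick check like the one below can be run in a notebook cell. It uses only the Python standard library; is_installed is a hypothetical helper name chosen for this sketch and is not part of GMMchi.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


# 'GMMchi' is the import name used in this README.
if is_installed("GMMchi"):
    print("GMMchi is available")
else:
    print("GMMchi not found; re-run the pip install step")
```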
Intuition: How GMMchi Works in Gene Expression Analysis
The goal of GMMchi in gene expression analysis is to categorize continuous data into 1s, 2s and occasionally 3s. 1s represent low- or non-expressing samples and 2s/3s represent high-expressing samples of any given gene. GMMchi-categorized data can be studied in many ways in downstream analysis. The steps outlined below are an example of a standard method for analyzing a large dataset:
Postprocessing
- Determine the background threshold of your input sample
- Filter and remove genes that are not expressed by any of the samples
Analysis
- If doing analysis on a single gene, categorize its expression by applying GMMchi to your gene of interest in the postprocessed data
- If doing a full-scale screen or analysis, categorize all gene expressions by running a for loop that applies GMMchi to each gene, building a categorized matrix with each sample categorized as 1, 2, or 3
- Run 2x2 table analysis on the categorized matrix
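To make the idea of turning continuous values into 1s and 2s concrete, here is a minimal sketch that fits a two-component Gaussian mixture to one gene's values with plain expectation-maximization and labels each sample by the more likely component. It is not GMMchi's implementation: it omits the chi-square protocol, the non-normal tail handling and the occasional third category, and the function names (fit_two_gaussians, categorize) and the synthetic data are assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(values, n_iter=200):
    """Fit a two-component 1-D Gaussian mixture with plain EM (needs >= 2 values)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    # Initialise the two components from the lower and upper halves of the data.
    means = [sum(ordered[:half]) / half,
             sum(ordered[half:]) / (len(ordered) - half)]
    spread = (ordered[-1] - ordered[0]) / 4.0 or 1.0
    stds = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of the high component for each value.
        resp = []
        for x in values:
            low = weights[0] * normal_pdf(x, means[0], stds[0])
            high = weights[1] * normal_pdf(x, means[1], stds[1])
            total = low + high
            resp.append(high / total if total > 0 else 0.5)
        # M-step: update weight, mean and standard deviation of each component.
        for k in (0, 1):
            r = [p if k == 1 else 1.0 - p for p in resp]
            n_k = sum(r)
            if n_k < 1e-9:
                continue  # component is empty; keep its previous parameters
            weights[k] = n_k / len(values)
            means[k] = sum(ri * x for ri, x in zip(r, values)) / n_k
            var = sum(ri * (x - means[k]) ** 2 for ri, x in zip(r, values)) / n_k
            stds[k] = max(math.sqrt(var), 1e-3)
    return weights, means, stds


def categorize(values, weights, means, stds):
    """Label each value 1 (low component) or 2 (high component)."""
    labels = []
    for x in values:
        low = weights[0] * normal_pdf(x, means[0], stds[0])
        high = weights[1] * normal_pdf(x, means[1], stds[1])
        labels.append(2 if high > low else 1)
    return labels


# Synthetic log2 expression values: 60 low expressors followed by 40 high expressors.
rng = random.Random(0)
expression = ([rng.gauss(2.0, 0.5) for _ in range(60)]
              + [rng.gauss(8.0, 0.7) for _ in range(40)])
weights, means, stds = fit_two_gaussians(expression)
labels = categorize(expression, weights, means, stds)
print(labels.count(1), labels.count(2))
```

Repeating this per gene yields one row of 1s and 2s per gene, which is the shape of the categorized matrix used by the 2x2 table step.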
Available Tools in the GMMchisquare Package
Calculating Background Threshold
GMMchi.GMM_modelingt is the function that runs GMMchi on input data. Here, we set calc_back = True so that the function calculates the background threshold.
means, std, filt = GMMchi.GMM_modelingt('TCGA Colorectal Cancer', input_data_cancer, log2transform=True,
verbosity = True, Single_tail_validation=False, calc_back = True)
Input
'TCGA Colorectal Cancer': When calc_back = True, input string will be automatically used as the title for your output graphs
input_data_cancer: Dataframe with genes (row) x samples (columns)
log2transform: perform log2-transformation on the data
verbosity: print each stage of GMMchi
calc_back: Boolean to indicate whether to calculate background threshold
Single_tail_validation: Boolean to indicate whether to run single tail identification on non-normal tails. Because the dataset is usually very large, it is much more computationally efficient to set this to False
Output
means: Mean of identified distributions
std: Standard deviation of identified distributions
filt: Cutoff between the distributions; this is the background threshold that separates the background distribution from the normal distribution
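One common way to place a cutoff between two fitted normal components is the point between their means where the weighted densities are equal. The sketch below solves for that point in closed form. This README does not state how GMMchi derives filt, so treat this as an illustration of the general idea rather than the package's rule; density_crossing and the example parameters are hypothetical.

```python
import math


def weighted_normal_pdf(x, weight, mean, std):
    """Component weight times the normal density at x."""
    z = (x - mean) / std
    return weight * math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def density_crossing(w1, m1, s1, w2, m2, s2):
    """Return the point between the two means where the weighted densities are equal.

    Taking logs of w1*N(x; m1, s1) = w2*N(x; m2, s2) gives a*x^2 + b*x + c = 0.
    """
    a = 1.0 / (2.0 * s2 ** 2) - 1.0 / (2.0 * s1 ** 2)
    b = m1 / s1 ** 2 - m2 / s2 ** 2
    c = (m2 ** 2 / (2.0 * s2 ** 2) - m1 ** 2 / (2.0 * s1 ** 2)
         + math.log((w1 * s2) / (w2 * s1)))
    if abs(a) < 1e-12:
        if abs(b) < 1e-12:
            raise ValueError("the two components are identical")
        roots = [-c / b]  # equal spreads: the equation is linear
    else:
        disc = b * b - 4.0 * a * c
        if disc < 0:
            raise ValueError("the two weighted densities never cross")
        root = math.sqrt(disc)
        roots = [(-b + root) / (2.0 * a), (-b - root) / (2.0 * a)]
    low, high = min(m1, m2), max(m1, m2)
    between = [r for r in roots if low <= r <= high]
    if not between:
        raise ValueError("no crossing lies between the two means")
    return between[0]


# Hypothetical background (mean 0, std 1) and expressed (mean 4, std 1) components.
cutoff = density_crossing(0.5, 0.0, 1.0, 0.5, 4.0, 1.0)
print(cutoff)
```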
Filter and Remove Non-expressing Genes
GMMchi.probe_filter is used to filter and remove non-expressing probe sets or genes based on the background threshold determined above. The background threshold can be known a priori or determined via the statistical method included in this package (shown above).
input_dataf = GMMchi.probe_filter(input_data_cancer, log2transform=True, filt=-0.829)
Input
filt (float): the background threshold. Note that the threshold needs to match the parameter log2transform such that if log2transform = True, filt needs to be a log2transformed threshold and vice versa
Output
input_dataf: return dataframe with filtered probesets
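The note about keeping filt on the same scale as the data is easy to get wrong, so here is a small sketch of the filtering idea. It assumes a gene is kept when at least one sample rises above the threshold, and it uses a plain dict in place of a DataFrame; filter_probes and the toy values are hypothetical and do not reproduce GMMchi.probe_filter.

```python
import math


def filter_probes(expression, threshold, log2transform=True):
    """Keep genes whose expression exceeds the threshold in at least one sample.

    `expression` maps gene name -> list of raw (non-log) values.
    `threshold` must be on the log2 scale when log2transform is True,
    and on the raw scale otherwise.
    """
    kept = {}
    for gene, values in expression.items():
        if log2transform:
            # Non-positive values have no log2; treat them as far below background.
            scaled = [math.log2(v) if v > 0 else float("-inf") for v in values]
        else:
            scaled = list(values)
        if any(v > threshold for v in scaled):
            kept[gene] = scaled
    return kept


raw = {
    "GENE_A": [0.1, 0.2, 0.3],   # never rises above background
    "GENE_B": [0.2, 4.0, 16.0],  # expressed in two samples
}
log2_threshold = -0.829              # threshold on the log2 scale
raw_threshold = 2 ** log2_threshold  # the same threshold on the raw scale

kept_log2 = sorted(filter_probes(raw, log2_threshold, log2transform=True))
kept_raw = sorted(filter_probes(raw, raw_threshold, log2transform=False))
print(kept_log2, kept_raw)
```

Passing a raw-scale threshold together with log2transform=True (or the reverse) would compare values on different scales and silently keep or drop the wrong genes.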
Categorizing the Distribution of a Single Gene
gene = 'TGFB1' #Transforming growth factor beta 1
info, classif, categories, chi, bins, f = GMMchi.GMM_modelingt(gene, input_dataf, log2transform = True,
filt=-0.83, meanf= -3.3, stdf = 1.95)
Input
gene: gene of interest
input_dataf: Dataframe with genes (row) x samples (columns); this is usually the postprocessed data or the output of GMMchi.probe_filter
meanf: mean of the background distribution (retrieved from the result of Calculating Background Threshold)
stdf: standard deviation of background distribution (retrieved from the result of Calculating Background Threshold)
Output
info: mean(s), covariance(s), and threshold(s) of the identified distributions, returned as a list of lists
classif: name of the category the distribution is identified as
Classifications:
1) Bimodal
2) Unimodal
3) Categorical unimodal
4) Unimodal with a non-normal tail
5) Bimodal with a non-normal tail
6) Poorly fitted bimodal
categories: the returned categorized data as a list
chi: the chi-square goodness of fit of the fitted model returned as a float
bins: the bins of the histogram returned as a list
f: figure of the plot returned as a matplotlib fig object
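For readers unfamiliar with a chi-square goodness of fit on histogram bins, the sketch below computes a Pearson statistic by comparing observed bin counts with the counts a single normal model would predict. It is a general illustration only: how GMMchi chooses bins, combines mixture components and interprets the returned chi value is not described here, and chi_square_gof is a hypothetical name.

```python
import math
import random


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def chi_square_gof(values, bin_edges, mean, std):
    """Pearson chi-square statistic of binned values against a normal model."""
    n = len(values)
    statistic = 0.0
    for left, right in zip(bin_edges[:-1], bin_edges[1:]):
        observed = sum(1 for v in values if left <= v < right)
        expected = n * (normal_cdf(right, mean, std) - normal_cdf(left, mean, std))
        if expected > 0:
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Synthetic values drawn from a normal distribution with mean 5 and std 1.
rng = random.Random(1)
values = [rng.gauss(5.0, 1.0) for _ in range(500)]
edges = [float("-inf"), 3.0, 4.0, 4.5, 5.0, 5.5, 6.0, 7.0, float("inf")]

good_fit = chi_square_gof(values, edges, mean=5.0, std=1.0)  # matching model
poor_fit = chi_square_gof(values, edges, mean=6.0, std=1.0)  # shifted model
print(good_fit, poor_fit)
```

A smaller statistic means the binned data sit closer to the model, which is why the matching model scores lower than the shifted one.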
Large-scale Categorization of the Input Data (All genes)
Below is an example of how to apply the algorithm in a large-scale analysis of all genes or probe sets:
import pandas as pd
from tqdm import tqdm  # progress bar for the loop

genes = input_dataf.index  # the index of the dataframe or a list of all genes
categorize = []  # list of lists of categorized data, one inner list per gene
for gene in tqdm(genes):
    info, classif, categories, chi, bins, f = GMMchi.GMM_modelingt(gene, input_dataf, log2transform=True,
                                                                    filt=6.5924, meanf=5.14, stdf=1.01)
    categorize.append(categories)
    del classif, categories, chi  # free up memory
categorized_df = pd.DataFrame(categorize, index=input_dataf.index, columns=input_dataf.columns)
Run a 2x2 Table Analysis
GMMchi.find_hits is used to perform a 2x2 contingency table analysis on the categorized data returned from GMMchi.GMM_modelingt, using the gene of interest as the primary gene.
import pandas as pd

hits, significant_hits, table_sig_hits = GMMchi.find_hits(categorized_df, primary='TGFB1')
twobytwo_table = pd.DataFrame(table_sig_hits,
                              columns=['+/+', '+/-', '-/+', '-/-', 'p-value', 'R value', 'Inclusion Criterion'],
                              index=significant_hits.T.columns[1:])
# save your 2x2 table for further analysis
twobytwo_table.sort_values('R value', ascending=False).to_csv(r'2by2table_tgfb1.csv')
Input
categorized_df: Dataframe with categorized data composed of 1s and 2s (1 = low; 2 = high)
primary: Gene of interest that will be used as the primary gene compared to all other genes (index) to find correlation
Output
hits: 2x2 contingency table p value
significant_hits: 2x2 contingency table with p value filtered for <= 0.05
table_sig_hits: returned as a list of ['+/+','+/-','-/+','-/-','p-value','R value', 'Inclusion Criterion']
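To show what a 2x2 table analysis computes for one pair of genes, the sketch below counts the four cells from two categorized vectors and derives a Pearson chi-square statistic, its p value and the phi coefficient. Several things are assumptions here: that '+' means high (2 or 3) and '-' means low (1), that the first symbol refers to the primary gene, and that a chi-square test without continuity correction is a reasonable stand-in. This README does not define GMMchi's 'R value' or 'Inclusion Criterion', so phi is shown only as a generic association measure, and both function names are hypothetical.

```python
import math


def two_by_two(primary, other):
    """Count the four cells of a 2x2 table from two categorized vectors."""
    cells = {"+/+": 0, "+/-": 0, "-/+": 0, "-/-": 0}
    for p, o in zip(primary, other):
        # Categories 2 and 3 count as high ('+'); category 1 counts as low ('-').
        key = ("+" if p >= 2 else "-") + "/" + ("+" if o >= 2 else "-")
        cells[key] += 1
    return cells


def chi_square_2x2(cells):
    """Return (chi-square, p value, phi) for a 2x2 table, 1 degree of freedom."""
    a, b, c, d = cells["+/+"], cells["+/-"], cells["-/+"], cells["-/-"]
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0, 0.0  # a constant vector carries no association
    phi = (a * d - b * c) / math.sqrt(denom)
    chi2 = n * phi * phi
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value, phi


# Toy categorized data for a primary gene and one other gene across 12 samples.
primary_gene = [2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1]
other_gene = [2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2]

cells = two_by_two(primary_gene, other_gene)
chi2, p_value, phi = chi_square_2x2(cells)
print(cells, chi2, p_value, phi)
```

With small cell counts, an exact test is usually preferred over the chi-square approximation; the approximation is used here only to keep the sketch dependency-free.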
Working Example
Please find a working example in the example folder
Authors
- Ta-Chun (Jeff) Liu - jeffliu6068
- Peter Kalugin - Initial work
- Sir Walter Fred Bodmer FRS FRSE - Supervision
License
This project is licensed under the MIT License - see the LICENSE.md file for details
Acknowledgments
- Hat tip to anyone whose code was used
- Inspiration: Thank you to all who have contributed ideas and expertise to make this possible. Let's advance science together.