PROFILE methodology for the binarisation and normalisation of RNA-seq data
profile_binr
The PROFILE methodology for the binarisation and normalisation of RNA-seq data.
This is a Python interface to a set of normalisation and binarisation functions for RNA-seq data originally written in R.
This software package is based on the methodology developed by Jonas Béal, Arnau Montagud, Pauline Traynard, Emmanuel Barillot, and Laurence Calzone of the Computational Systems Biology of Cancer team at Institut Curie (contact-sysbio@curie.fr). It generalises the original implementation, available as R Markdown notebooks at https://github.com/sysbio-curie/PROFILE, and offers a Python interface to it.
Installation
Using conda
The tool can be installed using the Conda package profile_binr in the colomoto channel.
Note that some of its dependencies require the conda-forge channel.
conda install -c conda-forge colomoto::profile_binr
Using pip
Requirements
- R (≥4.0)
- R packages:
  - mclust
  - diptest
  - moments
  - magrittr
  - tidyr
  - dplyr
  - tibble
  - bigmemory
  - doSNOW
  - foreach
  - glue
pip install profile_binr
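The pip route presumably leaves R and the R packages listed above for you to install, so it can help to check for them before the first import. The following is a minimal sketch and not part of profile_binr: the helper names `build_r_expression` and `missing_r_packages` are hypothetical, and it assumes the `Rscript` executable is on the PATH.

```python
import shutil
import subprocess

# R packages copied from the requirements list above.
REQUIRED_R_PACKAGES = [
    "mclust", "diptest", "moments", "magrittr", "tidyr", "dplyr",
    "tibble", "bigmemory", "doSNOW", "foreach", "glue",
]


def build_r_expression(packages):
    """Build an R one-liner that prints the packages that are not installed."""
    quoted = ", ".join(f'"{name}"' for name in packages)
    return (
        f"pkgs <- c({quoted}); "
        "absent <- pkgs[!pkgs %in% rownames(installed.packages())]; "
        'cat(absent, sep = "\\n")'
    )


def missing_r_packages(packages=REQUIRED_R_PACKAGES):
    """Return the missing R packages, or None if Rscript is not on the PATH."""
    rscript = shutil.which("Rscript")
    if rscript is None:
        return None
    result = subprocess.run(
        [rscript, "-e", build_r_expression(packages)],
        capture_output=True,
        text=True,
        check=True,
    )
    # One package name per line; an empty output means nothing is missing.
    return result.stdout.split()
```

Calling `missing_r_packages()` returns `None` when R itself cannot be found, an empty list when every package is present, and otherwise the names still to be installed from within R.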
Usage
A minimal example:
from profile_binr import ProfileBin
import pandas as pd
# your data is assumed to contain observations as
# rows and genes as columns
# the first column (the cell identifiers) is used as the index
data = pd.read_csv("path/to/your/data.csv", index_col=0)
data.head()
cell_id | Clec1b | Kdm3a | Coro2b | 8430408G22Rik | Clec9a | Phf6 | Usp14 | Tmem167b |
---|---|---|---|---|---|---|---|---|
HSPC_025 | 0.0 | 4.891604 | 1.426148 | 0.0 | 0.0 | 2.599758 | 2.954035 | 6.357369 |
HSPC_031 | 0.0 | 6.877725 | 0.000000 | 0.0 | 0.0 | 2.423483 | 1.804914 | 0.000000 |
HSPC_037 | 0.0 | 0.000000 | 6.913384 | 0.0 | 0.0 | 2.051659 | 8.265465 | 0.000000 |
LT-HSC_001 | 0.0 | 0.000000 | 8.178374 | 0.0 | 0.0 | 6.419817 | 3.453502 | 2.579528 |
HSPC_001 | 0.0 | 0.000000 | 9.475577 | 0.0 | 0.0 | 7.733370 | 1.478900 | 0.000000 |
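Since the input is expected to have cells as rows and genes as columns, a quick sanity check before building the instance can catch a mis-oriented or mis-parsed table. The sketch below is an illustration only: `check_expression_frame` is a hypothetical helper, not a profile_binr function, and the toy values are copied from the table above.

```python
import pandas as pd


def check_expression_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Validate a cells-by-genes expression table (hypothetical helper)."""
    if not df.index.is_unique:
        raise ValueError("cell identifiers in the index must be unique")
    if not df.columns.is_unique:
        raise ValueError("gene names in the columns must be unique")
    non_numeric = [
        col for col in df.columns
        if not pd.api.types.is_numeric_dtype(df[col])
    ]
    if non_numeric:
        raise ValueError(f"non-numeric columns: {non_numeric}")
    return df


# Toy table reusing three cells and two genes from the example above.
toy = pd.DataFrame(
    {
        "Kdm3a": [4.891604, 6.877725, 0.0],
        "Phf6": [2.599758, 2.423483, 2.051659],
    },
    index=pd.Index(["HSPC_025", "HSPC_031", "HSPC_037"], name="cell_id"),
)
check_expression_frame(toy)

# A table stored as genes-by-cells has to be transposed first.
genes_by_cells = toy.T
restored = genes_by_cells.T
```

A text column surviving in the table, for example because `index_col` was not set when reading the CSV, makes the check raise a `ValueError`.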
# create the binarisation instance using the dataframe
# with the index containing the cell identifier
# and the columns being the gene names
probin = ProfileBin(data)
# compute the criteria used to binarise/normalise the data.
# This method runs in parallel; pass the number of workers
# as an integer
probin.fit(8) # fit using 8 workers
# Look at the computed criteria
probin.criteria.head(8)
Gene | Dip | BI | Kurtosis | DropOutRate | MeanNZ | DenPeak | Amplitude | Category |
---|---|---|---|---|---|---|---|---|
Clec1b | 0.358107 | 1.635698 | 54.017736 | 0.876208 | 1.520978 | -0.007249 | 8.852181 | ZeroInf |
Kdm3a | 0.000000 | 2.407548 | -0.784019 | 0.326087 | 3.847940 | 0.209239 | 10.126676 | Bimodal |
Coro2b | 0.000000 | 2.320060 | 7.061604 | 0.658213 | 2.383819 | 0.004597 | 9.475577 | ZeroInf |
8430408G22Rik | 0.684454 | 3.121069 | 21.729044 | 0.884058 | 2.983472 | 0.005663 | 9.067857 | ZeroInf |
Clec9a | 1.000000 | 2.081717 | 140.089285 | 0.965580 | 2.280293 | -0.009361 | 9.614233 | Discarded |
Phf6 | 0.000000 | 1.988667 | -1.389024 | 0.035628 | 5.025501 | 2.017547 | 10.135226 | Bimodal |
Usp14 | 0.000000 | 2.208080 | -1.224987 | 0.007850 | 6.109964 | 8.245570 | 11.088750 | Bimodal |
Tmem167b | 0.000000 | 2.430813 | 0.093023 | 0.393720 | 3.448331 | 0.072982 | 9.486826 | Bimodal |
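The criteria table is an ordinary pandas DataFrame indexed by gene, so it can be summarised with standard pandas calls. The sketch below uses a toy stand-in for `probin.criteria`, built from two columns of the table above, rather than calling the package itself.

```python
import pandas as pd

# Toy stand-in for `probin.criteria`; values copied from the table above.
criteria = pd.DataFrame(
    {
        "DropOutRate": [0.876208, 0.326087, 0.658213, 0.884058,
                        0.965580, 0.035628, 0.007850, 0.393720],
        "Category": ["ZeroInf", "Bimodal", "ZeroInf", "ZeroInf",
                     "Discarded", "Bimodal", "Bimodal", "Bimodal"],
    },
    index=["Clec1b", "Kdm3a", "Coro2b", "8430408G22Rik",
           "Clec9a", "Phf6", "Usp14", "Tmem167b"],
)

# Number of genes assigned to each category.
category_counts = criteria["Category"].value_counts()

# Names of the genes in a single category.
bimodal_genes = criteria.index[criteria["Category"] == "Bimodal"].tolist()

# Genes that were not discarded.
kept_genes = criteria.index[criteria["Category"] != "Discarded"].tolist()
```

With a fitted instance, the same expressions should apply to `probin.criteria` directly, assuming it keeps the column names shown in the table.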
# get binarised data (alternatively .binarise()):
my_bin = probin.binarize()
my_bin.head()
cell_id | Clec1b | Kdm3a | Coro2b | 8430408G22Rik | Clec9a | Phf6 | Usp14 | Tmem167b |
---|---|---|---|---|---|---|---|---|
HSPC_025 | NaN | 1.0 | NaN | NaN | NaN | 0.0 | 0.0 | 1.0 |
HSPC_031 | NaN | 1.0 | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 |
HSPC_037 | NaN | 0.0 | 1.0 | NaN | NaN | 0.0 | 1.0 | 0.0 |
LT-HSC_001 | NaN | 0.0 | 1.0 | NaN | NaN | 1.0 | 0.0 | 0.0 |
HSPC_001 | NaN | 0.0 | 1.0 | NaN | NaN | 1.0 | 0.0 | 0.0 |
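The binarised table contains NaN entries alongside 0.0 and 1.0, so downstream code usually has to handle missing values explicitly. The following sketch shows one possible way to do that with plain pandas; the toy frame copies four columns from the table above, and the choice to drop all-NaN genes is an assumption about what a downstream analysis might want, not something the package prescribes.

```python
import pandas as pd

nan = float("nan")

# Toy stand-in for `my_bin`; values copied from the table above.
my_bin = pd.DataFrame(
    {
        "Clec1b": [nan, nan, nan, nan, nan],
        "Kdm3a": [1.0, 1.0, 0.0, 0.0, 0.0],
        "Coro2b": [nan, nan, 1.0, 1.0, 1.0],
        "Phf6": [0.0, 0.0, 0.0, 1.0, 1.0],
    },
    index=["HSPC_025", "HSPC_031", "HSPC_037", "LT-HSC_001", "HSPC_001"],
)

# Number of cells without a binary value, per gene.
undetermined = my_bin.isna().sum()

# Drop genes that have no binary value in any cell.
informative = my_bin.dropna(axis="columns", how="all")

# Nullable integers keep the missing entries as <NA> instead of NaN floats.
as_int = informative.astype("Int64")
```

Converting to `Int64` keeps 0 and 1 as integers while preserving the missing entries, which avoids silently treating an undetermined value as inactive.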
# likewise for normalised data:
my_norm = probin.normalize()
my_norm.head()
cell_id | Clec1b | Kdm3a | Coro2b | 8430408G22Rik | Clec9a | Phf6 | Usp14 | Tmem167b |
---|---|---|---|---|---|---|---|---|
HSPC_025 | 0.0 | 9.786196e-01 | 0.184102 | 0.0 | NaN | 0.000801 | 8.318176e-05 | 9.999970e-01 |
HSPC_031 | 0.0 | 9.999981e-01 | 0.000000 | 0.0 | NaN | 0.000462 | 8.084114e-07 | 6.874397e-11 |
HSPC_037 | 0.0 | 4.408417e-09 | 0.892449 | 0.0 | NaN | 0.000145 | 9.999940e-01 | 6.874397e-11 |
LT-HSC_001 | 0.0 | 4.408417e-09 | 1.000000 | 0.0 | NaN | 0.991865 | 6.230178e-04 | 1.599753e-04 |
HSPC_001 | 0.0 | 4.408417e-09 | 1.000000 | 0.0 | NaN | 0.999865 | 2.171153e-07 | 6.874397e-11 |
References
- Béal J, Montagud A, Traynard P, Barillot E and Calzone L (2019) Personalization of Logical Models With Multi-Omics Data Allows Clinical Stratification of Patients. Front. Physiol. 9:1965. doi:10.3389/fphys.2018.01965