A novel method for unsupervised patient stratification.
Project description
UnPaSt
UnPaSt is a novel method for identification of differentially expressed biclusters.
Cite
UnPaSt preprint: https://arxiv.org/abs/2408.00200.
Code: https://github.com/ozolotareva/unpast_paper/
Quick Start
Using UnPaSt online
Local installation
UnPaSt is available on PyPI and can be installed using pip
pip install unpast
wget https://github.com/ozolotareva/unpast/raw/refs/heads/main/unpast/tests/test_input/synthetic_clear_biclusters.tsv
unpast --exprs synthetic_clear_biclusters.tsv
To use --clustering WGCNA method instead of default one, you would also need to install the necessary R packages (see Requirements below).
Running in Docker
UnPaSt is also available as a Docker image, preinstalled R packages included. To pull the Docker image:
# load image and example data
docker pull freddsle/unpast
wget https://github.com/ozolotareva/unpast/raw/refs/heads/main/unpast/tests/test_input/synthetic_clear_biclusters.tsv
# run UnPaSt in a Docker environment with current directory and user
docker run --rm -it -u $(id -u):$(id -g) -v "$(pwd)":/data \
freddsle/unpast \
--exprs /data/synthetic_clear_biclusters.tsv \
--out_dir /data/results/synthetic_clear_biclusters
To use some previous docker version, replace freddsle/unpast with freddsle/unpast:<version> with a specific version tag, see available tags here.
Development setup
Developer mode allows you to run modified UnPaSt code. This is useful for local updates or contributing to the project.
Docker development environment
To run UnPaSt in a Docker container with the latest code from the repository, you can use the following command:
# Clone the repository to get code
git clone https://github.com/ozolotareva/unpast.git
cd unpast
# Define the command to run UnPaSt
# using unpast.run_unpast to surpass pre-insalled version from the Docker image
command="python -m unpast.run_unpast --exprs unpast/tests/scenario_B500.exprs.tsv.gz --basename results/scenario_B500 --verbose"
# Run UnPaSt using Docker
docker run --rm -it -u $(id -u):$(id -g) -v "$(pwd)":/data --entrypoint bash freddsle/unpast -c "cd /data && $command"
Requirements
UnPaSt requires Python 3.9-3.11 and certain Python and R packages.
Python and R dependencies
Python Dependencies
The Python dependencies are installed automatically when installing via pip (see pyproject.toml).
They include (with recommended versions):
fisher = ">=0.1.9,<=0.1.14"
pandas = "1.3.5"
python-louvain = "0.15"
matplotlib = "3.7.1"
seaborn = "0.11.1"
numba = ">=0.51.2,<=0.55.2"
numpy = "1.22.3"
scikit-learn = "1.2.2"
scikit-network = ">=0.24.0,<0.26.0"
scipy = ">=1.7.1,<=1.7.3"
statsmodels = "0.13.2"
kneed = "0.8.1"
R Dependencies
For the WGCNA clustering method, UnPaSt requires R and specific R packages.
UnPaSt utilizes R packages for certain analyses. Ensure that you have R installed with the following packages:
WGCNA(version 1.70-3 or higher)limma(version 3.42.2 or higher)
Installing R
Ensure that R (version 4.3.1 or higher) is installed on your system. You can download R from CRAN.
It is recommended to use BiocManager for installing R packages:
install.packages("BiocManager")
BiocManager::install("WGCNA")
BiocManager::install("limma")
API Reference
Input
UnPaSt requires a tab-separated file with features (e.g. genes) in rows, and samples in columns.
- Feature and sample names must be unique.
- At least 2 features and 5 samples are required.
- Data must be between-sample normalized.
Recommendations:
- It is recommended that UnPaSt be applied to datasets with 20+ samples.
- If the cohort is not large (<20 samples), reducing the minimal number of samples in a bicluster (
min_n_samples) to 2 is recommended. - If the number of features is small, using the Louvain method for feature clustering instead of WGCNA and/or disabling feature selection by setting the binarization p-value (
p-val) to 1 might be helpful.
Examples
- Simulated data example: Biclustering of a matrix with 10 000 rows (features) and 200 columns (samples) with four implanted biclusters consisting of 500 features and 10-100 samples each. For more details, see Figure 3 and Methods here.
mkdir -p results;
# running UnPaSt with default parameters and example data
unpast --exprs unpast/tests/scenario_B500.exprs.tsv.gz --basename results/scenario_B500
# with different binarization and clustering methods
unpast --exprs unpast/tests/scenario_B500.exprs.tsv.gz --basename results/scenario_B500 --binarization ward --clustering Louvain
# help
unpast -h
- Real data example. Analysis of a subset of 200 samples randomly chosen from TCGA-BRCA dataset, including consensus biclustering and visualization: jupyter-notebook.
Outputs
The program creates a folder runs/run_<timestamp>/ with the results of UnPaSt run, where <timestamp> is the date and time of the run in the format YYYYMMDDTHHMMSS.
The folder contains the files
run_YYYYMMDDTHHMMSS
├── args.tsv
├── biclusters.tsv
└── unpast.log
The file biclusters.tsv contains the identified biclusters, with one bicluster per line. The format of this file is as follows:
-
- the first line starts with
#, storing the parameters of UnPaSt
- the first line starts with
-
- the second line contains the column headers.
-
- each subsequent line represents a bicluster with the following columns:
- SNR: Signal-to-noise ratio of the bicluster, calculated as the average SNR of its features.
- n_genes: Number of genes in the bicluster.
- n_samples: Number of samples in the bicluster.
- genes: Space-separated list of gene names.
- samples: Space-separated list of sample names.
- direction: Indicates whether the bicluster consists of up-regulated ("UP"), down-regulated ("DOWN"), or both types of genes ("BOTH").
- genes_up, genes_down: Space-separated lists of up- and down-resulated genes respectively.
- gene_indexes: 0-based index of the genes in the input matrix.
- sample_indexes: 0-based index of the samples in the input matrix.
The files args.tsv and unpast.log contain the parameters used for the run and the log of the run respectively.
Along with the biclustering result, if save mode is used, UnPaSt saves the intermediate results of feature binarization. The files are stored in the binarization/ subfolder and include:
binarization
├── bin_args.tsv
├── bin_background.tsv
├── bin_res.tsv
└── bin_stats.tsv
with the files:
bin_args.tsvcontains the subset of parameters used for binarization.bin_background.tsvcontains background distributions of SNR values for each evaluated bicluster size.bin_res.tsvcontains binarized input data.bin_stats.tsvprovides binarization statistics for each processed feature.
The binarization files can be used to restart UnPaSt with the same input and seed from the feature clustering step and skip time-consuming feature binarization.
Versions
UnPaSt version used in PathoPlex paper: UnPaSt_PathoPlex.zip
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file unpast-0.1.11.tar.gz.
File metadata
- Download URL: unpast-0.1.11.tar.gz
- Upload date:
- Size: 20.7 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
1329e57dc61b6f96a36a70a48e497094915c9e84cf35f3703064f8f4741d927d
|
|
| MD5 |
6d02848549522496813c89da9dbaa7b0
|
|
| BLAKE2b-256 |
6ce5d9f98a23a12b7936ddf9c575bcddcfebb981a3bd678da6606eca1bc06bf5
|
Provenance
The following attestation bundles were made for unpast-0.1.11.tar.gz:
Publisher:
publish.yml on ozolotareva/unpast
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
unpast-0.1.11.tar.gz -
Subject digest:
1329e57dc61b6f96a36a70a48e497094915c9e84cf35f3703064f8f4741d927d - Sigstore transparency entry: 276602684
- Sigstore integration time:
-
Permalink:
ozolotareva/unpast@876252eff52ebe364e78535d3988c157c65be561 -
Branch / Tag:
refs/heads/main - Owner: https://github.com/ozolotareva
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@876252eff52ebe364e78535d3988c157c65be561 -
Trigger Event:
workflow_dispatch
-
Statement type:
File details
Details for the file unpast-0.1.11-py3-none-any.whl.
File metadata
- Download URL: unpast-0.1.11-py3-none-any.whl
- Upload date:
- Size: 20.7 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
1b881e3589ce7b2ceca865a7868954cc56a00e9a768353b638d11d58089a4b00
|
|
| MD5 |
823f672e2de9a2ca5078e6b65a7b50ea
|
|
| BLAKE2b-256 |
a5774a369e8399fb0730aff5d84d281106b4b07babcaba2ae9064d32a8087d46
|
Provenance
The following attestation bundles were made for unpast-0.1.11-py3-none-any.whl:
Publisher:
publish.yml on ozolotareva/unpast
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
unpast-0.1.11-py3-none-any.whl -
Subject digest:
1b881e3589ce7b2ceca865a7868954cc56a00e9a768353b638d11d58089a4b00 - Sigstore transparency entry: 276602701
- Sigstore integration time:
-
Permalink:
ozolotareva/unpast@876252eff52ebe364e78535d3988c157c65be561 -
Branch / Tag:
refs/heads/main - Owner: https://github.com/ozolotareva
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@876252eff52ebe364e78535d3988c157c65be561 -
Trigger Event:
workflow_dispatch
-
Statement type: