**sumo** is a command-line tool to identify molecular subtypes in multi-omics datasets. It implements a novel nonnegative matrix factorization (NMF) algorithm to identify groups of samples that share molecular signatures, and provides tools to evaluate such assignments.
Installation
You can install sumo from PyPI by running the command below. Note that sumo requires Python 3.6+.
pip install python-sumo
Documentation
The official documentation is available at https://python-sumo.readthedocs.io
License
Usage
A typical workflow runs the prepare mode to build similarity matrices from feature matrices, then factorizes the resulting multiplex network with the run mode. The evaluate mode compares the resulting cluster labels against biologically significant labels, and the interpret mode finds features that support the cluster separation.
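The modes can be chained from a script. The sketch below only assembles the command lines, reusing the file names from the examples in this README (methylation.txt, expression.txt, labels.tsv, results_dir). The helpers `build_commands` and `run_pipeline` are hypothetical and not part of sumo, and the `k3` subdirectory is an assumption taken from the evaluate example.

```python
import subprocess


def build_commands(infiles, labels, k="2,5", selected_k=3,
                   network="prepared.data.npz", outdir="results_dir"):
    """Assemble prepare -> run -> evaluate command lines (hypothetical helper)."""
    prepare = ["sumo", "prepare", "-plot", "plot.png", ",".join(infiles), network]
    run = ["sumo", "run", network, k, outdir]
    # Assumption: results for a given k are stored in <outdir>/k<k>/clusters.tsv,
    # as in the evaluate example of this README.
    evaluate = ["sumo", "evaluate", f"{outdir}/k{selected_k}/clusters.tsv", labels]
    return [prepare, run, evaluate]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = build_commands(["methylation.txt", "expression.txt"], "labels.tsv")
for cmd in commands:
    print(" ".join(cmd))
```

Calling `run_pipeline(commands)` would execute the three steps, provided sumo is installed and the input files exist.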
prepare
Generates similarity matrices for samples based on biological data and saves them into multiplex network files.
usage: sumo prepare [-h] [-method METHOD] [-k K] [-alpha ALPHA]
[-missing MISSING] [-atol ATOL] [-sn SN] [-fn FN] [-df DF]
[-ds DS] [-logfile LOGFILE] [-log {DEBUG,INFO,WARNING}]
[-plot PLOT]
infile1,infile2,... outfile.npz
positional arguments:
infile1,infile2,... comma-delimited list of paths to input files,
containing standardized feature matrices, with samples
in columns and features in rows (supported types of
files: ['.txt', '.txt.gz', '.txt.bz2', '.tsv',
'.tsv.gz', '.tsv.bz2'])
outfile.npz path to output .npz file
optional arguments:
-h, --help show this help message and exit
-method METHOD either one method of sample-sample similarity
calculation, or comma-separated list of methods for
every layer (available methods: ['euclidean',
'cosine', 'pearson', 'spearman'], default of
euclidean)
-k K fraction of nearest neighbours to use for sample
similarity calculation using Euclidean distance
similarity (default of 0.1)
-alpha ALPHA hyperparameter of RBF similarity kernel, for
Euclidean distance similarity (default of 0.5)
-missing MISSING acceptable fraction of available values for assessment
of distance/similarity between pairs of samples -
either one value or comma-delimited list for every
layer (default of [0.1])
-atol ATOL if input files have continuous values, sumo checks if
data is standardized feature-wise, meaning all
features should have mean close to zero, with standard
deviation around one; use this parameter to set
tolerance of standardization checks (default of 0.01)
-sn SN index of row with sample names for input files
(default of 0)
-fn FN index of column with feature names for input files
(default of 0)
-df DF if percentage of missing values for feature exceeds
this value, remove feature (default of 0.1)
-ds DS if percentage of missing values for sample (that
remains after feature dropping) exceeds this value,
remove sample (default of 0.1)
-logfile LOGFILE path to save log file, by default stdout is used
-log {DEBUG,INFO,WARNING}
sets the logging level (default of INFO)
-plot PLOT path to save adjacency matrix heatmap(s), by default
plots are displayed on screen
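The -df and -ds options describe a two-step filter: features with too many missing values are removed first, then samples are checked against what remains. A minimal sketch of that order of operations follows, assuming missing values are represented as `None` and thresholds are fractions. The function `drop_missing` and the toy data are hypothetical, and this is not sumo's own code.

```python
def drop_missing(matrix, features, samples, df=0.1, ds=0.1):
    """Drop features, then samples, whose fraction of missing values
    exceeds df / ds. Rows are features, columns are samples."""
    kept_rows = [
        (name, row) for name, row in zip(features, matrix)
        if sum(value is None for value in row) / len(row) <= df
    ]
    # Sample missingness is measured only on the features that survived.
    kept_cols = [
        col for col in range(len(samples))
        if not kept_rows
        or sum(row[col] is None for _, row in kept_rows) / len(kept_rows) <= ds
    ]
    new_matrix = [[row[col] for col in kept_cols] for _, row in kept_rows]
    return (new_matrix,
            [name for name, _ in kept_rows],
            [samples[col] for col in kept_cols])


# Toy data; thresholds raised to 0.3 so the small example is meaningful.
matrix = [[1, None, 3, 4],
          [None, None, 2, 1],
          [1, None, 2, 2],
          [1, 2, 3, 4]]
filtered, kept_features, kept_samples = drop_missing(
    matrix, ["f1", "f2", "f3", "f4"], ["s1", "s2", "s3", "s4"], df=0.3, ds=0.3)
print(kept_features, kept_samples)
```

In this toy case feature f2 is dropped first (half of its values are missing), after which sample s2 is dropped because it is missing in two of the three remaining features.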
Example
sumo prepare -plot plot.png methylation.txt,expression.txt prepared.data.npz
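The prepare mode expects standardized feature matrices, with features in rows and samples in columns. The sketch below shows one way to standardize each feature and to mimic the tolerance check described for -atol. It uses the population standard deviation, which is an assumption. The helpers `standardize_features` and `is_standardized` are hypothetical and do not reproduce sumo's internal check.

```python
import math
import statistics


def standardize_features(matrix):
    """Scale every row (feature) to mean 0 and standard deviation 1."""
    standardized = []
    for row in matrix:
        mean = statistics.mean(row)
        std = statistics.pstdev(row)  # population std; an assumption
        if std == 0:
            raise ValueError("constant feature cannot be standardized")
        standardized.append([(value - mean) / std for value in row])
    return standardized


def is_standardized(matrix, atol=0.01):
    """Check that each feature has mean close to 0 and std close to 1."""
    return all(
        math.isclose(statistics.mean(row), 0.0, abs_tol=atol)
        and math.isclose(statistics.pstdev(row), 1.0, abs_tol=atol)
        for row in matrix
    )


# Rows are features, columns are samples (toy values).
raw = [[1.0, 2.0, 3.0, 4.0],
       [10.0, 30.0, 20.0, 60.0]]
scaled = standardize_features(raw)
print(is_standardized(raw), is_standardized(scaled))  # prints: False True
```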
run
Cluster multiplex network using non-negative matrix tri-factorization to identify molecular subtypes.
usage: sumo run [-h] [-sparsity SPARSITY] [-n N]
[-method {max_value,spectral}] [-max_iter MAX_ITER] [-tol TOL]
[-calc_cost CALC_COST] [-logfile LOGFILE]
[-log {DEBUG,INFO,WARNING}] [-h_init H_INIT] [-t T]
infile.npz k outdir
positional arguments:
infile.npz input .npz file containing adjacency matrices for
every network layer and sample names (file created by
running program with mode "prepare") - consecutive
adjacency arrays in file are indexed in following way:
"0", "1" ... and index of sample name vector is
"samples"
k either one value describing number of clusters or
comma-delimited range of values to check (sumo will
suggest cluster structure based on cophenetic
correlation coefficient)
outdir path to save output files
optional arguments:
-h, --help show this help message and exit
-sparsity SPARSITY either one value or comma-delimited list of sparsity
penalty values for H matrix (sumo will try different
values and select the best results; default of [0.1])
-n N number of repetitions (default of 50)
-method {max_value,spectral}
method of cluster extraction (default of "max_value")
-max_iter MAX_ITER maximum number of iterations for factorization
(default of 500)
-tol TOL if objective cost function value fluctuation (|Δℒ|) is
smaller than this value, stop iterations before
reaching max_iter (default of 1e-05)
-calc_cost CALC_COST number of steps between every calculation of objective
cost function (default of 20)
-logfile LOGFILE path to save log file (by default printed to stdout)
-log {DEBUG,INFO,WARNING}
set the logging level (default of INFO)
-h_init H_INIT index of adjacency matrix to use for H matrix
initialization (by default using average adjacency)
-t T number of threads (default of 1)
Example
sumo run -t 10 prepared.data.npz 2,5 results_dir
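The -max_iter, -tol and -calc_cost options together describe a stopping rule: the objective is evaluated every calc_cost steps, and iteration ends early once its change falls below tol. The sketch below illustrates that rule on a toy objective (gradient descent on a quadratic), not on sumo's factorization. The function `iterate` and its `step` and `cost` arguments are hypothetical names.

```python
def iterate(step, cost, state, max_iter=500, tol=1e-5, calc_cost=20):
    """Apply `step` up to max_iter times; every calc_cost steps compare the
    objective with its previous value and stop once the change is below tol."""
    previous = cost(state)
    for iteration in range(1, max_iter + 1):
        state = step(state)
        if iteration % calc_cost == 0:
            current = cost(state)
            if abs(previous - current) < tol:
                return state, iteration
            previous = current
    return state, max_iter


# Toy stand-in for the factorization: gradient descent on (x - 3)^2.
final, used = iterate(
    step=lambda x: x - 0.1 * 2 * (x - 3),
    cost=lambda x: (x - 3) ** 2,
    state=0.0,
)
print(f"stopped after {used} iterations at x={final:.4f}")
```

Evaluating the cost only every few steps trades a slightly later stop for fewer objective evaluations, which matters when the objective itself is expensive to compute.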
evaluate
Evaluate clustering results, given set of labels.
usage: sumo evaluate [-h] [-metric {NMI,purity,ARI}] [-logfile LOGFILE]
infile.tsv labels
positional arguments:
infile.tsv input .tsv file containing sample names in 'sample'
and clustering labels in 'label' column (clusters.tsv
file created by running sumo with mode 'run')
labels .tsv of the same structure as input file
optional arguments:
-h, --help show this help message and exit
-metric {NMI,purity,ARI}
metric for accuracy evaluation (by default all metrics
are calculated)
-logfile LOGFILE path to save log file (by default printed to stdout)
-log {DEBUG,INFO,WARNING}
sets the logging level (default of INFO)
Example
sumo evaluate results_dir/k3/clusters.tsv labels.tsv
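Both inputs of evaluate are .tsv files with a 'sample' and a 'label' column. As an illustration of what such a comparison involves, the sketch below reads two files of that shape and computes purity using its standard definition. It is not sumo's implementation; the helpers `read_labels` and `purity` are hypothetical and the rows are made-up toy data.

```python
import csv
import io
from collections import Counter, defaultdict


def read_labels(handle):
    """Map sample name -> label from a .tsv with 'sample' and 'label' columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["sample"]: row["label"] for row in reader}


def purity(clusters, truth):
    """Fraction of samples that belong to the majority true class of their cluster."""
    shared = sorted(set(clusters) & set(truth))
    members = defaultdict(list)
    for sample in shared:
        members[clusters[sample]].append(truth[sample])
    majority = sum(Counter(labels).most_common(1)[0][1]
                   for labels in members.values())
    return majority / len(shared)


# Toy stand-ins for clusters.tsv and labels.tsv.
clusters_tsv = "sample\tlabel\ns1\t0\ns2\t0\ns3\t1\ns4\t1\n"
labels_tsv = "sample\tlabel\ns1\tA\ns2\tA\ns3\tB\ns4\tA\n"
score = purity(read_labels(io.StringIO(clusters_tsv)),
               read_labels(io.StringIO(labels_tsv)))
print(score)  # prints: 0.75
```

Only samples present in both files are compared, which is a design choice of this sketch.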
interpret
Find features that support cluster separation.
usage: sumo interpret [-h] [-logfile LOGFILE] [-log {DEBUG,INFO,WARNING}]
[-hits HITS] [-max_iter MAX_ITER] [-n_folds N_FOLDS]
[-t T] [-seed SEED] [-sn SN] [-fn FN] [-df DF] [-ds DS]
sumo_results.npz infile1,infile2,... output_prefix
positional arguments:
sumo_results.npz path to sumo_results.npz (created by running program
with mode "run")
infile1,infile2,... comma-delimited list of paths to input files,
containing standardized feature matrices, with samples
in columns and features in rows (supported types of
files: ['.txt', '.txt.gz', '.txt.bz2', '.tsv',
'.tsv.gz', '.tsv.bz2'])
output_prefix prefix of output files - sumo will create two output
files (1) .tsv file containing matrix (features x
clusters), where the value in each cell is the
importance of the feature in that cluster; (2)
.hits.tsv file containing features of most importance
optional arguments:
-h, --help show this help message and exit
-logfile LOGFILE path to save log file (by default printed to stdout)
-log {DEBUG,INFO,WARNING}
sets the logging level (default of INFO)
-hits HITS sets number of most important features for every
cluster, that are logged in .hits.tsv file
-max_iter MAX_ITER maximum number of iterations, while searching through
hyperparameter space
-n_folds N_FOLDS number of folds for model cross validation (default of
5)
-t T number of threads (default of 1)
-seed SEED random state (default of 1)
-sn SN index of row with sample names for input files
(default of 0)
-fn FN index of column with feature names for input files
(default of 0)
-df DF if percentage of missing values for feature exceeds
this value, remove feature (default of 0.1)
-ds DS if percentage of missing values for sample (that
remains after feature dropping) exceeds this value,
remove sample (default of 0.1)
Example
sumo interpret results_dir/k3/sumo_results.npz methylation.txt,expression.txt interpret_results
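The interpret mode reports a features x clusters importance matrix and, separately, the most important features per cluster. The sketch below shows how a top-`hits` selection could be made from such a matrix. The in-memory layout (a dict from feature name to per-cluster importances), the feature names and the function `top_hits` are all hypothetical, and sumo's own selection logic may differ.

```python
def top_hits(importance, n_clusters, hits=2):
    """Return, for every cluster, the `hits` features with highest importance."""
    result = {}
    for cluster in range(n_clusters):
        ranked = sorted(importance,
                        key=lambda name: importance[name][cluster],
                        reverse=True)
        result[cluster] = ranked[:hits]
    return result


# Hypothetical importance matrix: feature -> importance in cluster 0 and 1.
importance = {
    "gene_a": [0.9, 0.1],
    "gene_b": [0.2, 0.7],
    "gene_c": [0.5, 0.6],
}
selected = top_hits(importance, n_clusters=2)
print(selected)
```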
Please refer to the documentation for example use cases and suggestions for data preprocessing.