
Project description

PARC

PARC, “phenotyping by accelerated refined community-partitioning”, is a fast, automated, combinatorial graph-based clustering approach that integrates hierarchical graph construction (HNSW) and data-driven graph pruning with the new Leiden community-detection algorithm. See the paper: PARC: ultrafast and accurate clustering of phenotypic data of millions of single cells.
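
For orientation, the sketch below is a purely illustrative reconstruction of that skeleton using hnswlib, igraph, and leidenalg directly: it builds an HNSW k-nearest-neighbour graph and runs Leiden on it, but omits PARC's local (distance-based) and global (Jaccard-based) pruning steps. In practice you would call parc.PARC directly, as in the examples further down.

import numpy as np
import hnswlib
import igraph as ig
import leidenalg
from sklearn import datasets

X = datasets.load_iris().data.astype('float32')
n_samples, dim = X.shape

# 1. build an HNSW index and query the k nearest neighbours of every point
index = hnswlib.Index(space='l2', dim=dim)
index.init_index(max_elements=n_samples, ef_construction=200, M=16)
index.add_items(X)
index.set_ef(50)
neighbors, distances = index.knn_query(X, k=10)

# 2. turn the kNN lists into an edge list (PARC prunes edges at this stage)
edges = [(i, int(j)) for i in range(n_samples) for j in neighbors[i] if int(j) != i]

# 3. Leiden community detection on the resulting graph
g = ig.Graph(n=n_samples, edges=edges).simplify()
partition = leidenalg.find_partition(g, leidenalg.ModularityVertexPartition)
print(len(set(partition.membership)), 'communities')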

Check out Readthedocs for an installation guide, examples on different data, and tutorials.

PARC forms the clustering basis for our new Trajectory Inference (TI) method VIA, available on GitHub or readthedocs. VIA is a single-cell TI method that offers topology construction and visualization, pseudotimes, automated prediction of terminal cell fates, and temporal gene dynamics along detected lineages. VIA can also be used to topologically visualize the graph-based connectivity of clusters found by PARC in a non-TI context.

Installation

Using pip

conda create --name ParcEnv pip
pip install parc  # tested on Linux

Cloning repository and running setup.py (ensure dependencies are installed)

git clone https://github.com/ShobiStassen/PARC.git
cd PARC  # cd into the directory of the cloned PARC folder containing setup.py
python3 setup.py install

Install dependencies separately if needed (Linux). If the pip install doesn't work, it usually suffices to first install all the requirements (using pip) and subsequently install parc (also using pip):

pip install igraph leidenalg hnswlib umap-learn
pip install parc

Windows installation: once you have Visual Studio installed it should be smooth sailing (a restart is sometimes required after installing VS). It might be easier to install the dependencies using either pip or conda install -c conda-forge. If this doesn't work, you might need to install igraph and leidenalg from binaries.

  • python-igraph: download the Python 3.6 Windows binaries by Gohlke
  • leidenalg: depends on python-igraph; download the Windows binary (available for Python 3.6 only)

conda create --name parcEnv python=3.6 pip
pip install igraph #(or install python_igraph-0.7.1.post6-cp36-cp36m-win_amd64.whl )
pip install leidenalg #(or install leidenalg-0.7.0-cp36-cp36m-win_amd64.whl)
pip install hnswlib
pip install parc

Example Usage on Covid-19 scRNA-seq data

Check out the Jupyter Notebook for how to pre-process and PARC-cluster the new Covid-19 BALF dataset by Liao et al. 2020. We also show how to integrate UMAP with HNSW such that the UMAP embedding is constructed using the HNSW graph built in PARC, enabling very fast and memory-efficient visualization (particularly noticeable when n_cells > 1 million).

PARC Cluster-level average gene expression

PARC visualizes cells by integrating UMAP embedding on the HNSW graph

Example Usage 1. (small test sets) - IRIS and Digits dataset from sklearn

import parc
import matplotlib.pyplot as plt
from sklearn import datasets

# load sample IRIS data
# data (n_obs x k_dim, 150 x 4)
iris = datasets.load_iris()
X = iris.data
y = iris.target

plt.scatter(X[:, 0], X[:, 1], c=y)  # colored by 'ground truth'
plt.show()

Parc1 = parc.PARC(X, true_label=y)  # instantiate PARC
# Parc1 = parc.PARC(X)  # when no 'true labels' are available
Parc1.run_PARC()  # run the clustering
parc_labels = Parc1.labels

# View scatterplot colored by PARC labels
plt.scatter(X[:, 0], X[:, 1], c=parc_labels, cmap='rainbow')
plt.show()

# Run UMAP on the HNSW knn-graph already built in PARC (more time- and memory-efficient for large datasets)
# Parc1.knn_struct = Parc1.make_knn_struct()  # only needed if you visualize before running the PARC clustering
graph = Parc1.knngraph_full()
X_umap = Parc1.run_umap_hnsw(X, graph)
plt.scatter(X_umap[:, 0], X_umap[:, 1], c=Parc1.labels)
plt.show()





# load sample digits data
digits = datasets.load_digits()
X = digits.data  # (n_obs x k_dim, 1797 x 64)
y = digits.target
Parc2 = parc.PARC(X, true_label=y, jac_std_global='median')  # 'median' is the default pruning level
Parc2.run_PARC()
parc_labels = Parc2.labels

Example Usage 2. (mid-scale scRNA-seq): 10X PBMC (Zheng et al., 2017)

pre-processed datafile

annotations

import parc
import csv
import numpy as np
import pandas as pd

## load data (50 PCs of filtered gene matrix pre-processed as per Zheng et al. 2017)

X = csv.reader(open("./pca50_pbmc68k.txt", 'rt'), delimiter=",")
X = np.array(list(X))  # (n_obs x k_dim, 68579 x 50)
X = X.astype("float")
# OR with pandas: X = pd.read_csv("./pca50_pbmc68k.txt", header=None).values.astype("float")

y = []  # annotations
with open('./zheng17_annotations.txt', 'rt') as f:
    for line in f:
        y.append(line.strip().replace('\"', ''))
# OR with pandas: y = list(pd.read_csv('./zheng17_annotations.txt', header=None)[0])

# setting small_pop to 50 cleans up some of the smaller clusters, but it can also be left at the default of 10
parc1 = parc.PARC(X, true_label=y, jac_std_global=0.15, random_seed=1, small_pop=50)  # instantiate PARC
parc1.run_PARC()  # run the clustering
parc_labels = parc1.labels

tsne plot of annotations and PARC clustering

Example Usage 3. 10X PBMC (Zheng et al., 2017) integrating Scanpy pipeline

raw datafile 68K pbmc from github page

10X compressed file "filtered genes"

pip install scanpy

import parc
import scanpy as sc
import pandas as pd

# load data
path = './data/filtered_matrices_mex/hg19/'
adata = sc.read(path + 'matrix.mtx', cache=True).T  # transpose the data
adata.var_names = pd.read_csv(path + 'genes.tsv', header=None, sep='\t')[1]
adata.obs_names = pd.read_csv(path + 'barcodes.tsv', header=None)[0]

# annotations as per correlation with pure samples
annotations = list(pd.read_csv('./data/zheng17_annotations.txt', header=None)[0])
adata.obs['annotations'] = pd.Categorical(annotations)

# pre-process as per Zheng et al. and take the first 50 PCs for analysis
sc.pp.recipe_zheng17(adata)
sc.tl.pca(adata, n_comps=50)

# setting small_pop to 50 cleans up some of the smaller clusters, but it can also be left at the default of 10
parc1 = parc.PARC(adata.obsm['X_pca'], true_label=annotations, jac_std_global=0.15, random_seed=1, small_pop=50)
parc1.run_PARC()  # run the clustering
parc_labels = parc1.labels
adata.obs["PARC"] = pd.Categorical(parc_labels)

# visualize
sc.settings.n_jobs = 4
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
sc.tl.umap(adata)
sc.pl.umap(adata, color='annotations')
sc.pl.umap(adata, color='PARC')

Example Usage 4. Large-scale (70K subset and 1.1M cells) Lung Cancer cells (multi-ATOM imaging cytometry based features)

normalized image-based feature matrix 70K cells

Lung Cancer cells annotation 70K cells

Lung Cancer Digital Spike Test of n=100 H1975 cells on N281604

1.1M cell features and annotations

import parc
import pandas as pd

# load data: digital mix of 7 cell lines from 7 sets of pure samples (1.1M cells)
X = pd.read_csv("./LungData.txt", header=None).values.astype("float")
y = list(pd.read_csv('./LungData_annotations.txt', header=None)[0])  # list of cell-type annotations

# run PARC on 1.1M cells
# jac_weighted_edges can be set to False, which provides an unweighted graph to Leiden and offers some speed-up
parc1 = parc.PARC(X, true_label=y, jac_weighted_edges=False)
parc1.run_PARC()  # run the clustering
parc_labels = parc1.labels

# run PARC on H1975 spiked cells
# 0.15 corresponds to pruning ~60% of edges and can be more effective for rarer populations than the default 'median'
parc2 = parc.PARC(X, true_label=y, jac_std_global=0.15, jac_weighted_edges=False)
parc2.run_PARC()  # run the clustering
parc_labels_rare = parc2.labels

tsne plot of annotations and PARC clustering, heatmap of features

Parameters and Attributes

Parameter tuning

For a more detailed explanation of the impact of tuning the key parameters, please see the Supplementary Analysis in our paper (PARC Supplementary Analysis).

Input parameters

  • data (numpy.ndarray): n_samples x n_features
  • true_label (numpy.ndarray, optional): ground-truth labels, used only for computing performance metrics
  • dist_std_local (optional, default = 2): local pruning threshold, the number of standard deviations above the mean Minkowski distance between neighbors of a given node; the higher the value, the more edges are retained (less pruning)
  • jac_std_global (optional, default = 'median'): global graph-pruning level; can also be set as the number of standard deviations below the network's mean Jaccard-weighted edge weight. Values of 0.1-1 provide reasonable pruning, with higher values meaning less pruning; e.g. a value of 0.15 means all edges above mean(edge_weights) - 0.15*std(edge_weights) are retained. We find that both 0.15 and 'median' yield good results, pruning away roughly 50-60% of edges
  • random_seed (optional, default = 42): the random seed passed to Leiden
  • resolution_parameter (optional, default = 1): Leiden resolution, using ModularityVertexPartition and RBConfigurationVertexPartition
  • jac_weighted_edges (optional, default = True): use the Jaccard-weighted pruned graph as input to community detection; for very large datasets set this to False to observe a speed-up with negligible impact on accuracy

Attributes

  • labels (list): length n_samples, the cluster label of each sample
  • f1_mean (list): F1 score (not weighted by population); for details see the supplementary section of the paper
  • stats_df (DataFrame): stores parameter values and performance metrics
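
As a rough illustration of how these pieces fit together, the snippet below instantiates PARC on the sklearn digits data with the main pruning and resolution knobs spelled out explicitly, then reads back the attributes listed above. The parameter names follow the table; the particular values are illustrative rather than recommended settings.

import parc
from sklearn import datasets

X, y = datasets.load_digits(return_X_y=True)

p = parc.PARC(
    X,
    true_label=y,             # optional; enables the F1 / stats reporting
    dist_std_local=2,         # local pruning threshold (default 2)
    jac_std_global=0.15,      # global pruning: keep edges above mean - 0.15*std of Jaccard weights
    jac_weighted_edges=True,  # pass the Jaccard-weighted pruned graph to Leiden
    resolution_parameter=1,   # Leiden resolution (default 1)
    small_pop=10,             # clusters smaller than this are cleaned up (default 10)
    random_seed=42,           # seed passed to Leiden (default 42)
)
p.run_PARC()

print(p.labels[:10])  # cluster label for each of the first 10 samples
print(p.f1_mean)      # F1 score, meaningful only when true_label is provided
print(p.stats_df)     # DataFrame of parameter values and performance metrics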

Developers

Installation

To install a development version:

git clone https://github.com/ShobiStassen/PARC.git
cd PARC
pip install -e ".[extra]"

Testing

To run the test suite for the package:

cd PARC
pytest tests/

References to dependencies

  • Leiden (pip install leidenalg): V.A. Traag, 2019, doi.org/10.1038/s41598-019-41695-z
  • hnswlib: Malkov, Yu A., and D. A. Yashunin. "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs." TPAMI, preprint: https://arxiv.org/abs/1603.09320
  • igraph (igraph.org/python/)

Citing

If you find this code useful in your work, please consider citing this paper: PARC: ultrafast and accurate clustering of phenotypic data of millions of single cells.



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

parc-0.40.tar.gz (19.9 kB)

Uploaded Source

Built Distribution

parc-0.40-py3-none-any.whl (14.9 kB)

Uploaded Python 3

File details

Details for the file parc-0.40.tar.gz.

File metadata

  • Download URL: parc-0.40.tar.gz
  • Upload date:
  • Size: 19.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.9

File hashes

Hashes for parc-0.40.tar.gz:

  • SHA256: e8eaca198fc172b1d260019f2e0860bc6268879a38c3f6296e562b1e7bddaad0
  • MD5: 2c35dba6498bd5e21afcd0c8e510741c
  • BLAKE2b-256: 48c44124788a6db233f36e776ecfc116e5dce0bdb747c29fa9f4b72bf1b62ce8

See more details on using hashes here.

File details

Details for the file parc-0.40-py3-none-any.whl.

File metadata

  • Download URL: parc-0.40-py3-none-any.whl
  • Upload date:
  • Size: 14.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.9

File hashes

Hashes for parc-0.40-py3-none-any.whl:

  • SHA256: 7889f8d01485fabf37f7425685a7ec5e61d7e5e00eb16c5411b65ccb153d3f8a
  • MD5: b89cb0d1f7e994b59febd4e45134d778
  • BLAKE2b-256: 6588cff7ad656bffd110c0547af2ace1019ad05e6031051a64a396ed5055f08e

See more details on using hashes here.
