Graph-based clustering for high-dimensional single-cell data
PhenoGraph for Python3
PhenoGraph is a clustering method designed for high-dimensional single-cell data. It works by creating a graph ("network") representing phenotypic similarities between cells and then identifying communities in this graph.
This implementation is written in Python 3 and depends only on scikit-learn (>= 0.17) and its dependencies.
This software package includes compiled binaries that run community detection based on C++ code written by E. Lefebvre and J.-L. Guillaume in 2008 ("Louvain method"). The code has been altered to interface more efficiently with the Python code here. It should work on reasonably current Linux, Mac and Windows machines.
To install PhenoGraph, use pip:
pip install PhenoGraph
Expected use is within a script or interactive kernel running Python 3.x. Data are expected to be passed as a numpy.ndarray.
When applicable, the code uses CPU multicore parallelism via multiprocessing.
To run basic clustering:
import phenograph
communities, graph, Q = phenograph.cluster(data)
For a dataset of N rows, communities will be a length-N vector of integers specifying a community assignment for each row in the data. Any rows assigned -1 were identified as outliers and should not be considered members of any community.
graph is an N x N scipy.sparse matrix representing the weighted graph used for community detection.
Q is the modularity score for communities as applied to graph.
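As a minimal sketch of working with these outputs, the snippet below drops outliers and counts community sizes with NumPy. It uses a small made-up assignment vector rather than real PhenoGraph output, and the helper name summarize_communities is hypothetical.

```python
import numpy as np


def summarize_communities(communities):
    """Return (mask of non-outlier rows, dict of community label -> size).

    Hypothetical helper: rows labeled -1 are treated as outliers.
    """
    communities = np.asarray(communities)
    keep = communities != -1
    labels, counts = np.unique(communities[keep], return_counts=True)
    sizes = {int(label): int(count) for label, count in zip(labels, counts)}
    return keep, sizes


# Made-up vector standing in for the first value returned by phenograph.cluster
example = np.array([0, 0, 1, -1, 2, 1, 0, -1])
keep, sizes = summarize_communities(example)
print(sizes)       # {0: 3, 1: 2, 2: 1}
print(keep.sum())  # 6
```

The boolean mask can be used to index the original data array so that downstream analysis only sees rows that belong to a community.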
If you use PhenoGraph in work you publish, please cite our publication:
@article{Levine_PhenoGraph_2015,
doi = {10.1016/j.cell.2015.05.047},
url = {http://dx.doi.org/10.1016/j.cell.2015.05.047},
year = {2015},
month = {jul},
publisher = {Elsevier {BV}},
volume = {162},
number = {1},
pages = {184--197},
author = {Jacob H. Levine and Erin F. Simonds and Sean C. Bendall and Kara L. Davis and El-ad D. Amir and Michelle D. Tadmor and Oren Litvin and Harris G. Fienberg and Astraea Jager and Eli R. Zunder and Rachel Finck and Amanda L. Gedman and Ina Radtke and James R. Downing and Dana Pe'er and Garry P. Nolan},
title = {Data-Driven Phenotypic Dissection of {AML} Reveals Progenitor-like Cells that Correlate with Prognosis},
journal = {Cell}
}
Release Notes
Version 1.5.4
- Faster and more efficient sorting of clusters by size for large nearest-neighbor graphs, using multiprocessing and faster sorting methods.
Version 1.5.3
- PhenoGraph now supports the Leiden algorithm for community detection. The new feature can be called from phenograph.cluster by choosing leiden as the clustering algorithm.
Version 1.5.2
- Include a simple parallel implementation of brute-force nearest neighbors search using scipy's cdist and multiprocessing. This may be more efficient than kdtree on very large high-dimensional data sets and avoids memory issues that arise in sklearn's implementation.
- Refactor parallel_jaccard_kernel to remove unnecessary use of ctypes and multiprocessing.Array.
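To illustrate the idea behind a brute-force search with cdist, here is a serial sketch that processes the data in chunks of rows. It is not PhenoGraph's own implementation: the function name brute_force_knn and the chunk_size parameter are hypothetical, and the loop runs in a single process. Each chunk is independent of the others, so in principle the chunks could be handed to multiprocessing workers.

```python
import numpy as np
from scipy.spatial.distance import cdist


def brute_force_knn(data, k, chunk_size=256):
    """Return (indices, distances) of the k nearest neighbors of every row,
    excluding the row itself. Requires k <= number of rows - 1.
    """
    n = data.shape[0]
    indices = np.empty((n, k), dtype=np.int64)
    distances = np.empty((n, k), dtype=np.float64)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Distances from this chunk of rows to every row: shape (stop - start, n)
        block = cdist(data[start:stop], data, metric="euclidean")
        # Exclude self-matches by making each row's distance to itself infinite
        block[np.arange(stop - start), np.arange(start, stop)] = np.inf
        # Pick the k smallest entries per row, then order those k by distance
        nearest = np.argpartition(block, k - 1, axis=1)[:, :k]
        nearest_dist = np.take_along_axis(block, nearest, axis=1)
        order = np.argsort(nearest_dist, axis=1)
        indices[start:stop] = np.take_along_axis(nearest, order, axis=1)
        distances[start:stop] = np.take_along_axis(nearest_dist, order, axis=1)
    return indices, distances


rng = np.random.default_rng(0)
points = rng.normal(size=(500, 10))
idx, dist = brute_force_knn(points, k=5, chunk_size=128)
print(idx.shape)  # (500, 5)
```

Chunking keeps peak memory at roughly chunk_size x N distances instead of the full N x N matrix.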
Version 1.5.1
- Make louvain_time_limit a parameter to phenograph.cluster.
Version 1.5
- phenograph.cluster can now take as input a square sparse matrix, which will be interpreted as a k-nearest neighbor graph. Note that this graph must have uniform degree (i.e. the same value of k at every point).
- The default time_limit for Louvain iterations has been increased to a more generous 2000 seconds (~half hour).
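Before passing a precomputed graph, it may help to confirm the uniform-degree requirement. The sketch below assumes that "degree" means the number of stored entries per row of the sparse matrix; the helper name uniform_degree is hypothetical and is not part of PhenoGraph.

```python
import numpy as np
from scipy import sparse


def uniform_degree(graph):
    """Return k if every row of a square sparse matrix stores exactly k entries, else None."""
    graph = sparse.csr_matrix(graph)
    if graph.shape[0] != graph.shape[1]:
        return None
    per_row = np.diff(graph.indptr)  # number of stored entries in each row
    if per_row.size == 0 or not np.all(per_row == per_row[0]):
        return None
    return int(per_row[0])


# Toy 4-node graph in which every node has exactly two neighbors
rows = np.repeat(np.arange(4), 2)
cols = np.array([1, 2, 2, 3, 3, 0, 0, 1])
weights = np.ones(8)
knn = sparse.csr_matrix((weights, (rows, cols)), shape=(4, 4))
print(uniform_degree(knn))  # 2
```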
Version 1.4.1
- After observing inconsistent behavior of sklearn.NearestNeighbors with respect to inclusion of self-neighbors, the code now checks that self-neighbors have been included before deleting those entries.
Version 1.4
- The dependence on IPython and/or ipyparallel has been removed. Instead the native multiprocessing package is used.
- Multiple CPUs are used by default for computation of nearest neighbors and Jaccard graph.
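For readers unfamiliar with the Jaccard graph, the following single-process sketch shows one way to weight kNN edges, assuming the usual definition of the Jaccard coefficient (size of the intersection of two neighbor sets divided by the size of their union). It is an illustration only, not the package's parallel kernel; jaccard_weights and the toy neighbor table are hypothetical.

```python
import numpy as np


def jaccard_weights(neighbors):
    """For each directed edge i -> j in a kNN index array, compute the Jaccard
    similarity between the neighbor sets of i and j.
    """
    neighbors = np.asarray(neighbors)
    neighbor_sets = [set(row.tolist()) for row in neighbors]
    weights = np.zeros(neighbors.shape, dtype=float)
    for i, row in enumerate(neighbors):
        for pos, j in enumerate(row):
            shared = len(neighbor_sets[i] & neighbor_sets[j])
            total = len(neighbor_sets[i] | neighbor_sets[j])
            weights[i, pos] = shared / total
    return weights


# Toy neighbor table: row i lists the 2 nearest neighbors of point i
toy_neighbors = [[1, 2], [0, 2], [0, 1], [0, 4], [3, 5], [3, 4]]
toy_weights = jaccard_weights(toy_neighbors)
print(toy_weights.shape)  # (6, 2)
```

Edges between points whose neighborhoods do not overlap receive weight 0, which is what allows weakly connected pairs to be down-weighted or pruned.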
Version 1.3
- Proper support for Linux.
Troubleshooting
Notebook freezes after several attempts of running PhenoGraph using Jupyter Notebook
- Running PhenoGraph from a Jupyter Notebook repeatedly on macOS Catalina (but not Mojave) with Python 3.7.6 causes a hang, and the notebook becomes unresponsive, even for a basic matrix of nearest neighbors. The issue was not reproducible from the command line using the Python interpreter, without Jupyter Notebook, on either Catalina or Mojave. It was found that attempting to plot principal components using matplotlib.pyplot.scatter in Jupyter Notebook causes a freeze, and PhenoGraph becomes unresponsive unless the kernel is restarted. When this line of code is removed, everything goes back to normal and the Jupyter Notebook stops crashing with repeated runs of PhenoGraph.
Architecture related error
- When attempting to process a very large nearest-neighbor graph, e.g. a 2000000 x 2000000 kNN graph matrix with 300 nearest neighbors, a struct.error is raised:
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
This issue was reported on Stack Overflow and is related to multiprocessing while building the Jaccard object. The struct.error has been fixed in Python >= 3.8.0.
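The limit in the error message is the range of a signed 32-bit integer. The standalone sketch below reproduces the same kind of error with the standard struct module and gives a rough size estimate for the example graph. The 8 bytes per stored weight is an assumption made for illustration, not a measurement of PhenoGraph's memory use.

```python
import struct

INT32_MAX = 2**31 - 1

# A signed 32-bit field accepts the largest allowed value...
packed = struct.pack("i", INT32_MAX)

# ...but rejects anything larger with a struct.error
try:
    struct.pack("i", INT32_MAX + 1)
    overflowed = False
except struct.error:
    overflowed = True

# Rough size of the example graph, assuming 8 bytes per stored weight (an assumption)
n_cells, k, bytes_per_entry = 2_000_000, 300, 8
payload_bytes = n_cells * k * bytes_per_entry

print(overflowed)                 # True
print(payload_bytes > INT32_MAX)  # True
```

Under that assumption the weights alone come to about 4.8 GB, more than twice what a signed 32-bit size field can describe.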
leidenalg inside conda environment
- When using PhenoGraph inside a conda environment, leiden takes longer to complete for larger samples compared to the system Python.