# pymclustR
A pure-Python re-implementation of CRAN mclust (Scrucca, Fop, Murphy, Raftery 2016) — Gaussian mixture model clustering with all 14 covariance parameterizations and BIC-driven model selection.
The PyPI distribution is `pymclustR`; the Python import name is `mclust_py` (so `from mclust_py import Mclust`).
- AnnData-native — drop-in for the scanpy ecosystem
- No `rpy2`, no R install, no Fortran toolchain
- All 14 Banfield-Raftery / Celeux-Govaert parameterizations (EII, VII, EEI, VEI, EVI, VVI, EEE, VEE, EVE, VVE, EEV, VEV, EVV, VVV) plus the 1-D specials E / V
- Same `Mclust()` driver: BIC scan over (G × model), pick the maximum, return classification + soft responsibilities
Same upstream-mirror pattern as `monocle2-py` and `milor-py`: the canonical implementation lives in `omicverse`; this repo is the standalone slice for users who want mclust without the full omicverse stack.
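For intuition only, the (G × model) BIC scan can be sketched with scikit-learn's `GaussianMixture`, which covers a 4-family subset of the 14 parameterizations (roughly spherical ≈ EII/VII, diag ≈ the *I family, tied ≈ EEE, full ≈ VVV). This is an illustrative stand-in, not pymclustR's API:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data

# Scan every (G, covariance family) pair and keep the best BIC.
# sklearn's .bic() is minimized, so negate it to match mclust's
# larger-is-better convention.
best = None
for G in range(1, 6):
    for cov in ("spherical", "diag", "tied", "full"):
        gm = GaussianMixture(n_components=G, covariance_type=cov,
                             random_state=0).fit(X)
        score = -gm.bic(X)
        if best is None or score > best[0]:
            best = (score, G, cov)

print(best)  # (BIC, G, covariance_type) of the winning model
```

`Mclust()` performs the same kind of scan, but over all 14 parameterizations and with mclust's hierarchical-clustering initialization.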
## Install

```bash
pip install pymclustR
```
## Quick-start

```python
import numpy as np
from sklearn import datasets
from mclust_py import Mclust, predict_mclust

X = datasets.load_iris().data      # 150 × 4
fit = Mclust(X, G=range(1, 6))     # tries G = 1..5 across all 14 models
print(fit)                         # Mclust(VEV, G=2) loglik=-215.83 BIC=-561.7
print(fit.BIC.iloc[:, :4].head())  # full BIC table

# Predict on new cells (any array with the same 4 features)
X_new = X[:5]
z_new, cls_new = predict_mclust(fit, X_new)
```
Per-cell results are also written to AnnData via the convenience adapter:

```python
from mclust_py import mclust_anndata

fit = mclust_anndata(adata, use_rep="X_pca", n_components=10, G=range(2, 12))
# adata.obs['mclust']             — 1-based hard label
# adata.obs['mclust_uncertainty']
# adata.obsm['mclust_z']          — soft responsibilities
# adata.uns['mclust']             — BIC table + parameters
```
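In mclust's convention, the uncertainty of an observation is one minus its maximum posterior responsibility. A minimal numpy sketch of how the per-cell columns relate to the responsibility matrix (the matrix `z` here is made up for illustration):

```python
import numpy as np

# Made-up 3-cluster responsibility matrix for four cells
z = np.array([[0.98, 0.01, 0.01],
              [0.50, 0.45, 0.05],
              [0.10, 0.80, 0.10],
              [0.34, 0.33, 0.33]])

hard = z.argmax(axis=1) + 1        # 1-based hard labels
uncertainty = 1.0 - z.max(axis=1)  # 1 - max posterior responsibility

print(hard)         # [1 1 2 1]
print(uncertainty)  # highest for the nearly-uniform fourth cell
```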
## Tutorial — end-to-end on Iris

```python
import numpy as np
from sklearn import datasets
from mclust_py import Mclust, predict_mclust, hc, hclass, partition_to_z, me

X = datasets.load_iris().data
y = datasets.load_iris().target

# 1) Default: BIC scan over G = 1..9 × all 14 covariance models
fit = Mclust(X, G=range(1, 10))
print("best:", fit.model_name, "G =", fit.G, "BIC =", fit.bic)
print("classification table:",
      dict(zip(*np.unique(fit.classification, return_counts=True))))

# 2) Pick a specific covariance family
fit_vvv = Mclust(X, G=[3], model_names=["VVV"])  # G=3, full per-cluster Σ
fit_eee = Mclust(X, G=[3], model_names=["EEE"])  # G=3, common full Σ

# 3) Use a custom initial partition (e.g. from your favourite labels)
z_init = partition_to_z(y, G=3)  # one-hot from ground truth
result = me(X, "VVV", z_init)    # single me() call
print("loglik:", result.loglik, "iters:", result.iterations)

# 4) Hierarchical-clustering init (mclust default — Ward on SVD-whitened data)
tree = hc(X, model_name="VVV", use="SVD")
labels = hclass(tree, G=4)       # cut the dendrogram into 4 clusters
fit4 = Mclust(X, G=[4], model_names=["VVV"],
              z_init=partition_to_z(labels, G=4))

# 5) Soft-classify new cells
z_new, cls_new = predict_mclust(fit, X[:5])
print("posterior responsibilities for first 5 cells:\n", z_new)

# 6) AnnData (optional — needs `pip install pymclustR[anndata]`)
import anndata as ad
from mclust_py import mclust_anndata

adata = ad.AnnData(X)
mclust_anndata(adata, use_rep="X", G=range(2, 7))
# adata.obs['mclust']             — 1-based hard label
# adata.obs['mclust_uncertainty']
# adata.obsm['mclust_z']          — soft responsibilities
# adata.uns['mclust']             — BIC table + parameters
```
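Step 3 above builds a one-hot responsibility matrix from hard labels. A minimal numpy sketch of what such a conversion does (assumed behaviour: arbitrary label values are mapped to consecutive columns; this is an illustration, not `partition_to_z`'s source):

```python
import numpy as np

def one_hot_partition(labels, G):
    """Map arbitrary hard labels to an (n, G) one-hot matrix."""
    labels = np.asarray(labels)
    codes = {lab: j for j, lab in enumerate(np.unique(labels))}
    z = np.zeros((labels.size, G))
    z[np.arange(labels.size), [codes[l] for l in labels]] = 1.0
    return z

z = one_hot_partition([0, 2, 2, 1], G=3)
print(z.sum(axis=1))  # every row sums to 1: [1. 1. 1. 1.]
```

Passing such a matrix as `z_init` lets the EM driver start its M-step from any existing clustering.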
See `examples/comparison_R_vs_python.ipynb` for a full visual walk-through with plots, BIC curves, and R-parity tables.
## Module map

| Module | What it covers |
|---|---|
| `mclust_py.mclust` | top-level `Mclust()` + `MclustResult` (BIC scan, model selection) |
| `mclust_py.em` | shared E-step + EM driver `me()` (Aitken stop, mirrors R's `meXXX`) |
| `mclust_py.models` | M-step for all 14 parameterizations (closed-form + Stiefel iterative) |
| `mclust_py.bic` | parameter counts and BIC formula |
| `mclust_py.hc` | Ward-on-SVD initial partition (`hc()` / `hclass()` / `partition_to_z()`) |
| `mclust_py.density` | per-component multivariate-normal log-density |
| `mclust_py.anndata_adapter` | `mclust_anndata()` for scanpy users |
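The BIC convention in `mclust_py.bic` follows CRAN mclust: larger is better, with BIC = 2·loglik − k·log(n), where k is the number of free parameters of the (model, G) pair. A minimal sketch, using the VVV count as an illustrative example (the function names here are hypothetical, not the module's API):

```python
import numpy as np

def bic(loglik, npar, n):
    """mclust-style BIC: larger is better."""
    return 2.0 * loglik - npar * np.log(n)

def npar_vvv(G, d):
    """Free parameters for VVV: (G-1) mixing weights,
    G*d means, and G full covariances of d*(d+1)/2 each."""
    return (G - 1) + G * d + G * d * (d + 1) // 2

print(npar_vvv(G=3, d=4))              # 44 free parameters for iris-sized data
print(round(bic(-180.2, 44, 150), 1))  # -580.9
```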
Notebooks
| Notebook | What it shows |
|---|---|
examples/comparison_R_vs_python.ipynb |
Side-by-side comparison of CRAN mclust 6.1.1 and mclust-py on 4 datasets × 14 covariance models × G ∈ {2,3,4,5}. Plots cluster scatters, BIC curves, classification confusion matrices, per-record loglik scatter, and per-model parity bars. Outputs are baked in so you can browse on GitHub without running anything. |
The notebook is built and executed by examples/_build_notebooks.py:
python examples/_build_notebooks.py
## R parity

`tests/r_parity_dump.R` runs CRAN `mclust::me()` on four canonical datasets (blobs2, blobs5, iris, faithful) for every (modelName, G) in {14 models} × {2,3,4,5} — 224 records total. `tests/r_parity_compare.py` then runs `mclust_py.me()` from the same `z_init` and reports correlations:
| Model | loglik rel-err | z corr | mean corr | sigma corr | classification match |
|---|---|---|---|---|---|
| EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV, EVV, VVV | ~1e-15 (machine epsilon) | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| VEE | 6e-5 | 1.0000 | 1.0000 | 1.0000 | 0.999 |
| EVE | 5e-3 | 0.9751 | 0.9998 | 0.9843 | 0.975 |
| VVE | 3e-2 | 0.9351 | 0.9958 | 0.9198 | 0.939 |
| overall (224 records) | 3e-3 | 0.9935 | 0.9997 | 0.9931 | 0.9937 |
12 of 14 models hit bitwise parity with R. The two outliers (EVE / VVE) share a single orientation matrix D across components — the underlying optimisation problem (Browne–McNicholas 2013) has multiple local maxima, so different solvers can land on different stationary points. We use a Stiefel-manifold gradient descent (provably monotone) where R uses Browne–McNicholas MM; both solve the same problem and agree on simple data, but on hard data (e.g. iris with G ≥ 4) they may pick different local maxima.
For exact agreement on the EM step:

- Use the same `z_init`. mclust's `hc()` is a Fortran routine that optimises the classification log-likelihood — Python's Ward-on-SVD approximates it well but is not bitwise identical. To force exact agreement, run R's `hc(data, modelName="VVV")` once and pass the resulting partition as `z_init` to `Mclust()`.
- Use the same control parameters. Defaults match `emControl()` (`tol=1e-5`, `eps=.Machine$double.eps`).
- Use the same `model_names`. The default in both is the full 14-model list for `d > 1`, and `c("E", "V")` for `d = 1`.
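The Aitken stop that `me()` applies with the tolerance above can be sketched as follows: from three consecutive EM log-likelihoods, estimate the asymptotic limit of the (eventually geometric) sequence and stop once the current value is within `tol` of that limit. A minimal sketch of the criterion, assuming the standard formulation:

```python
def aitken_converged(l0, l1, l2, tol=1e-5):
    """Aitken-accelerated stop from three consecutive log-likelihoods."""
    a = (l2 - l1) / (l1 - l0)           # estimated geometric rate
    l_inf = l1 + (l2 - l1) / (1.0 - a)  # projected limit of the EM sequence
    return abs(l_inf - l2) < tol

print(aitken_converged(-230.0, -229.0, -228.9))          # large steps -> False
print(aitken_converged(-229.00055, -229.00005, -229.0))  # tiny steps  -> True
```

Compared with stopping on the raw increment `l2 - l1`, this projects ahead, so slow geometric tails terminate earlier.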
To regenerate the parity dump:

```bash
# in CMAP env (R + r-mclust)
Rscript tests/r_parity_dump.R
# then in omicdev env
python tests/r_parity_compare.py
pytest tests/test_r_parity.py -q
```
## Relationship to omicverse

This package is developed upstream in omicverse. If you already use omicverse, install it instead — `ov.utils.mclust_R(adata, ...)` and the upcoming `ov.utils.mclust_py(adata, ...)` cover the same surface plus the omicverse registry glue.
## Citation
If you use this package, please cite the original mclust paper:
Scrucca L., Fop M., Murphy T.B., Raftery A.E. (2016). mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. The R Journal, 8(1), 289–317.
and acknowledge omicverse / this repo for the Python port.
## License
GNU GPLv3 — matches both upstream omicverse and CRAN mclust.