
pymclustR


A pure-Python re-implementation of CRAN mclust (Scrucca, Fop, Murphy, Raftery 2016) — Gaussian mixture model clustering with all 14 covariance parameterizations and BIC-driven model selection.

The PyPI distribution is pymclustR; the Python import name is mclust_py (so from mclust_py import Mclust).

  • AnnData-native — drop-in for the scanpy ecosystem
  • No rpy2, no R install, no Fortran toolchain
  • All 14 Banfield-Raftery / Celeux-Govaert parameterizations (EII, VII, EEI, VEI, EVI, VVI, EEE, VEE, EVE, VVE, EEV, VEV, EVV, VVV) plus the 1-D specials E / V
  • Same Mclust() driver: BIC scan over (G × model), pick the maximum, return classification + soft responsibilities

Same upstream-mirror pattern as monocle2-py and milor-py: the canonical implementation lives in omicverse; this repo is the standalone slice for users who want mclust without the full omicverse stack.

Install

pip install pymclustR
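
The AnnData adapter shown later is an optional extra:

pip install "pymclustR[anndata]"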

Quick-start

from sklearn import datasets
from mclust_py import Mclust, predict_mclust

X = datasets.load_iris().data        # 150 × 4
fit = Mclust(X, G=range(1, 6))       # tries G = 1..5 across all 14 models
print(fit)                            # Mclust(VEV, G=2)  loglik=-215.83 BIC=-561.7
print(fit.BIC.iloc[:, :4].head())    # a slice of the full BIC table

# Predict on new observations
X_new = X[:5]                        # stand-in for genuinely new data
z_new, cls_new = predict_mclust(fit, X_new)

Per-cell results are also written to AnnData via the convenience adapter:

from mclust_py import mclust_anndata
fit = mclust_anndata(adata, use_rep="X_pca", n_components=10, G=range(2, 12))
# adata.obs['mclust']           — 1-based hard label
# adata.obs['mclust_uncertainty']
# adata.obsm['mclust_z']        — soft responsibilities
# adata.uns['mclust']           — BIC table + parameters
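
Because the labels land in standard AnnData slots, they plug straight into scanpy plotting. A minimal sketch (assumes scanpy is installed and adata.obsm["X_pca"] exists; none of this is pymclustR API):

import scanpy as sc

sc.pp.neighbors(adata, use_rep="X_pca")   # kNN graph on the same embedding
sc.tl.umap(adata)                         # 2-D layout for visualisation
sc.pl.umap(adata, color=["mclust", "mclust_uncertainty"])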

Tutorial — end-to-end on Iris

import numpy as np
from sklearn import datasets
from mclust_py import Mclust, predict_mclust, hc, hclass, partition_to_z, me

iris = datasets.load_iris()
X, y = iris.data, iris.target

# 1) Default: BIC scan over G = 1..9 × all 14 covariance models
fit = Mclust(X, G=range(1, 10))
print("best:", fit.model_name, "G =", fit.G, "BIC =", fit.bic)
print("classification table:", dict(zip(*np.unique(fit.classification, return_counts=True))))

# 2) Pick a specific covariance family
fit_vvv = Mclust(X, G=[3], model_names=["VVV"])     # G=3, full per-cluster Σ
fit_eee = Mclust(X, G=[3], model_names=["EEE"])     # G=3, common full Σ

# 3) Use a custom initial partition (e.g. from your favourite labels)
z_init = partition_to_z(y, G=3)                       # one-hot from ground truth
result = me(X, "VVV", z_init)                         # single me() call
print("loglik:", result.loglik, "iters:", result.iterations)

# 4) Hierarchical-clustering init (mirrors mclust's default init; Ward on SVD-whitened data)
tree = hc(X, model_name="VVV", use="SVD")
labels = hclass(tree, G=4)                            # cut the dendrogram into 4 clusters
fit4 = Mclust(X, G=[4], model_names=["VVV"],
              z_init=partition_to_z(labels, G=4))

# 5) Soft-classify new cells
z_new, cls_new = predict_mclust(fit, X[:5])
print("posterior responsibilities for first 5 cells:\n", z_new)

# 6) AnnData (optional — needs `pip install pymclustR[anndata]`)
import anndata as ad
from mclust_py import mclust_anndata

adata = ad.AnnData(X)
mclust_anndata(adata, use_rep="X", G=range(2, 7))
# adata.obs['mclust']           — 1-based hard label
# adata.obs['mclust_uncertainty']
# adata.obsm['mclust_z']        — soft responsibilities
# adata.uns['mclust']           — BIC table + parameters

See examples/comparison_R_vs_python.ipynb for a full visual walk-through with plots, BIC curves, and R-parity tables.

Module map

Module                     What it covers
mclust_py.mclust           top-level Mclust() + MclustResult (BIC scan, model selection)
mclust_py.em               shared E-step + EM driver me() (Aitken stop, mirrors R's meXXX)
mclust_py.models           M-step for all 14 parameterizations (closed-form + Stiefel iterative)
mclust_py.bic              parameter counts and BIC formula
mclust_py.hc               Ward-on-SVD initial partition (hc() / hclass() / partition_to_z())
mclust_py.density          per-component multivariate-normal log-density
mclust_py.anndata_adapter  mclust_anndata() for scanpy users
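
mclust reports BIC as 2·loglik − npar·log(n), so larger is better and the scan picks the maximum. A self-contained sketch of the convention (bic_score is illustrative, not the mclust_py.bic API):

import numpy as np

def bic_score(loglik, n_params, n_obs):
    # mclust convention: BIC = 2*loglik - n_params*log(n); larger is better
    return 2.0 * loglik - n_params * np.log(n_obs)

# e.g. VEV with G=2 in d=4: 8 means + 1 mixing weight + 17 covariance
# parameters (2 volumes, 3 shape, 2 x 6 orientation) = 26 free parameters
print(bic_score(-215.83, 26, 150))   # ≈ -561.9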

Notebooks

Notebook                               What it shows
examples/comparison_R_vs_python.ipynb  Side-by-side comparison of CRAN mclust 6.1.1 and mclust_py on 4 datasets × 14 covariance models × G ∈ {2,3,4,5}. Plots cluster scatters, BIC curves, classification confusion matrices, per-record loglik scatter, and per-model parity bars. Outputs are baked in so you can browse on GitHub without running anything.

The notebook is built and executed by examples/_build_notebooks.py:

python examples/_build_notebooks.py

R parity

tests/r_parity_dump.R runs CRAN mclust::me() on four canonical datasets (blobs2, blobs5, iris, faithful) for every (modelName, G) in {14 models} × {2,3,4,5} — 224 records total. tests/r_parity_compare.py then runs mclust_py.me() from the same z_init and reports correlations:

Model                                                  loglik rel-err         z corr   mean corr   sigma corr   classification match
EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV, EVV, VVV  ~1e-15 (machine eps)   1.0000   1.0000      1.0000       1.0000
VEE                                                    6e-5                   1.0000   1.0000      1.0000       0.999
EVE                                                    5e-3                   0.9751   0.9998      0.9843       0.975
VVE                                                    3e-2                   0.9351   0.9958      0.9198       0.939
overall (224 records)                                  3e-3                   0.9935   0.9997      0.9931       0.9937
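
Each row is a plain summary over the dumped records. A minimal sketch of how such per-record metrics can be computed (the function and exact definitions are illustrative, not the tests/r_parity_compare.py internals):

import numpy as np

def parity_metrics(z_r, z_py, loglik_r, loglik_py):
    # z_r, z_py: (n, G) responsibility matrices from the R and Python runs
    rel_err = abs(loglik_py - loglik_r) / abs(loglik_r)                 # loglik rel-err
    z_corr = np.corrcoef(z_r.ravel(), z_py.ravel())[0, 1]               # z corr
    match = float((z_r.argmax(axis=1) == z_py.argmax(axis=1)).mean())   # classification match
    return rel_err, z_corr, match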

Eleven of the 14 models reproduce R to machine precision, and VEE agrees to about 1e-4. The two outliers (EVE / VVE) share a single orientation matrix D across components — the underlying optimisation problem (Browne–McNicholas 2013) has multiple local maxima, so different solvers can land on different stationary points. We use a Stiefel-manifold gradient descent (provably monotone) where R uses the Browne–McNicholas MM algorithm; both solve the same problem and agree on simple data, but on hard data (e.g. iris with G ≥ 4) they may pick different local maxima.
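
The coupling is easiest to see in the covariance decomposition: EVE and VVE write each component as Σ_k = λ_k · D A_k Dᵀ with one orthogonal D shared by every component. An illustrative construction (numpy only, not package API):

import numpy as np

rng = np.random.default_rng(0)
D, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # shared orthogonal orientation
A1 = np.diag([2.0, 1.0, 1.0, 0.5])             # per-component diagonal shape, det = 1
A2 = np.diag([2.0, 2.5, 0.4, 0.5])             # det = 1, matching mclust's shape normalisation
Sigma1 = 1.0 * D @ A1 @ D.T                    # EVE: one volume lambda shared by all k
Sigma2 = 1.0 * D @ A2 @ D.T
# The M-step for D has no closed form: it is an optimisation over the
# orthogonal (Stiefel) manifold, which is where the local maxima live.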

For exact agreement on the EM step:

  1. Use the same z_init. mclust's hc() is a Fortran routine that optimises the classification log-likelihood; Python's Ward-on-SVD approximates it well but is not bitwise identical. To force exact agreement, run R's hc(data, modelName="VVV") once and pass the resulting partition as z_init to Mclust() — see the sketch after this list.
  2. Use the same control parameters. Defaults match emControl() (tol=1e-5, eps=.Machine$double.eps).
  3. Use the same model_names. Default in both is the full 14-model list for d>1, c("E","V") for d=1.
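
A hedged sketch of step 1: dump R's hc() partition once and reuse it from Python (the CSV round-trip and file name are illustrative, not part of either package):

# R, run once:
#   library(mclust)
#   cl <- hclass(hc(iris[, 1:4], modelName = "VVV"), 3)
#   write.csv(data.frame(cl = cl[, 1]), "hc_init.csv", row.names = FALSE)
import numpy as np
from sklearn import datasets
from mclust_py import Mclust, partition_to_z

X = datasets.load_iris().data
labels = np.loadtxt("hc_init.csv", skiprows=1, dtype=int)   # the dump from R
fit = Mclust(X, G=[3], model_names=["VVV"], z_init=partition_to_z(labels, G=3))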

To regenerate the parity dump:

# in CMAP env (R + r-mclust)
Rscript tests/r_parity_dump.R

# then in omicdev env
python tests/r_parity_compare.py
pytest tests/test_r_parity.py -q

Relationship to omicverse

This package is developed upstream in omicverse. If you already use omicverse, install it instead — ov.utils.mclust_R(adata, ...) and the upcoming ov.utils.mclust_py(adata, ...) cover the same surface plus the omicverse registry glue.

Citation

If you use this package, please cite the original mclust paper:

Scrucca L., Fop M., Murphy T.B., Raftery A.E. (2016). mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. The R Journal, 8(1), 289–317.

and acknowledge omicverse / this repo for the Python port.

License

GNU GPLv3 — matches both upstream omicverse and CRAN mclust.
