
pymclustR


A pure-Python re-implementation of CRAN mclust (Scrucca, Fop, Murphy, Raftery 2016) — Gaussian mixture model clustering with all 14 covariance parameterizations and BIC-driven model selection.

The PyPI distribution is pymclustR; the Python import name is mclust_py (so from mclust_py import Mclust).

  • AnnData-native — drop-in for the scanpy ecosystem
  • No rpy2, no R install, no Fortran toolchain
  • All 14 Banfield-Raftery / Celeux-Govaert parameterizations (EII, VII, EEI, VEI, EVI, VVI, EEE, VEE, EVE, VVE, EEV, VEV, EVV, VVV) plus the 1-D specials E / V
  • Same Mclust() driver: BIC scan over (G × model), pick the maximum, return classification + soft responsibilities

Same upstream-mirror pattern as monocle2-py and milor-py: the canonical implementation lives in omicverse; this repo is the standalone slice for users who want mclust without the full omicverse stack.

Install

pip install pymclustR
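
The AnnData adapter shown later is an optional extra:

pip install "pymclustR[anndata]"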

Quick-start

from sklearn import datasets
from mclust_py import Mclust, predict_mclust

X = datasets.load_iris().data        # 150 × 4
fit = Mclust(X, G=range(1, 6))       # tries G = 1..5 across all 14 models
print(fit)                            # Mclust(VEV, G=2)  loglik=-215.83 BIC=-561.7
print(fit.BIC.iloc[:, :4].head())    # a slice of the full BIC table

# Predict on new observations
X_new = X[:5]                        # stand-in for genuinely new data
z_new, cls_new = predict_mclust(fit, X_new)

Per-cell results are also written to AnnData via the convenience adapter:

from mclust_py import mclust_anndata
fit = mclust_anndata(adata, use_rep="X_pca", n_components=10, G=range(2, 12))
# adata.obs['mclust']           — 1-based hard label
# adata.obs['mclust_uncertainty']
# adata.obsm['mclust_z']        — soft responsibilities
# adata.uns['mclust']           — BIC table + parameters
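
Because the labels land in standard AnnData slots, they plug straight into scanpy plotting. A minimal sketch (assumes scanpy is installed and adata.obsm["X_pca"] exists; none of this is pymclustR API):

import scanpy as sc

sc.pp.neighbors(adata, use_rep="X_pca")   # kNN graph on the same embedding
sc.tl.umap(adata)                         # 2-D layout for visualisation
sc.pl.umap(adata, color=["mclust", "mclust_uncertainty"])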

Tutorial — end-to-end on Iris

import numpy as np
from sklearn import datasets
from mclust_py import Mclust, predict_mclust, hc, hclass, partition_to_z, me

iris = datasets.load_iris()
X, y = iris.data, iris.target

# 1) Default: BIC scan over G = 1..9 × all 14 covariance models
fit = Mclust(X, G=range(1, 10))
print("best:", fit.model_name, "G =", fit.G, "BIC =", fit.bic)
print("classification table:", dict(zip(*np.unique(fit.classification, return_counts=True))))

# 2) Pick a specific covariance family
fit_vvv = Mclust(X, G=[3], model_names=["VVV"])     # G=3, full per-cluster Σ
fit_eee = Mclust(X, G=[3], model_names=["EEE"])     # G=3, common full Σ

# 3) Use a custom initial partition (e.g. from your favourite labels)
z_init = partition_to_z(y, G=3)                       # one-hot from ground truth
result = me(X, "VVV", z_init)                         # single me() call
print("loglik:", result.loglik, "iters:", result.iterations)

# 4) Hierarchical-clustering init (mirrors mclust's default init; Ward on SVD-whitened data)
tree = hc(X, model_name="VVV", use="SVD")
labels = hclass(tree, G=4)                            # cut the dendrogram into 4 clusters
fit4 = Mclust(X, G=[4], model_names=["VVV"],
              z_init=partition_to_z(labels, G=4))

# 5) Soft-classify new cells
z_new, cls_new = predict_mclust(fit, X[:5])
print("posterior responsibilities for first 5 cells:\n", z_new)

# 6) AnnData (optional — needs `pip install pymclustR[anndata]`)
import anndata as ad
from mclust_py import mclust_anndata

adata = ad.AnnData(X)
mclust_anndata(adata, use_rep="X", G=range(2, 7))
# adata.obs['mclust']           — 1-based hard label
# adata.obs['mclust_uncertainty']
# adata.obsm['mclust_z']        — soft responsibilities
# adata.uns['mclust']           — BIC table + parameters

See examples/comparison_R_vs_python.ipynb for a full visual walk-through with plots, BIC curves, and R-parity tables.

Module map

Module                     What it covers
mclust_py.mclust           top-level Mclust() + MclustResult (BIC scan, model selection)
mclust_py.em               shared E-step + EM driver me() (Aitken stop, mirrors R's meXXX)
mclust_py.models           M-step for all 14 parameterizations (closed-form + Stiefel iterative)
mclust_py.bic              parameter counts and BIC formula
mclust_py.hc               Ward-on-SVD initial partition (hc() / hclass() / partition_to_z())
mclust_py.density          per-component multivariate-normal log-density
mclust_py.anndata_adapter  mclust_anndata() for scanpy users
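
mclust reports BIC as 2·loglik − npar·log(n), so larger is better and the scan picks the maximum. A self-contained sketch of the convention (bic_score is illustrative, not the mclust_py.bic API):

import numpy as np

def bic_score(loglik, n_params, n_obs):
    # mclust convention: BIC = 2*loglik - n_params*log(n); larger is better
    return 2.0 * loglik - n_params * np.log(n_obs)

# e.g. VEV with G=2 in d=4: 8 means + 1 mixing weight + 17 covariance
# parameters (2 volumes, 3 shape, 2 x 6 orientation) = 26 free parameters
print(bic_score(-215.83, 26, 150))   # ≈ -561.9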

Notebooks

Notebook                               What it shows
examples/comparison_R_vs_python.ipynb  Side-by-side comparison of CRAN mclust 6.1.1 and mclust_py on 4 datasets × 14 covariance models × G ∈ {2,3,4,5}. Plots cluster scatters, BIC curves, classification confusion matrices, per-record loglik scatter, and per-model parity bars. Outputs are baked in so you can browse on GitHub without running anything.

The notebook is built and executed by examples/_build_notebooks.py:

python examples/_build_notebooks.py

R parity

tests/r_parity_dump.R runs CRAN mclust::me() on four canonical datasets (blobs2, blobs5, iris, faithful) for every (modelName, G) in {14 models} × {2,3,4,5} — 224 records total. tests/r_parity_compare.py then runs mclust_py.me() from the same z_init and reports correlations:

Model                                                  loglik rel-err         z corr   mean corr   sigma corr   classification match
EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV, EVV, VVV  ~1e-15 (machine eps)   1.0000   1.0000      1.0000       1.0000
VEE                                                    6e-5                   1.0000   1.0000      1.0000       0.999
EVE                                                    5e-3                   0.9751   0.9998      0.9843       0.975
VVE                                                    3e-2                   0.9351   0.9958      0.9198       0.939
overall (224 records)                                  3e-3                   0.9935   0.9997      0.9931       0.9937
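
Each row is a plain summary over the dumped records. A minimal sketch of how such per-record metrics can be computed (the function and exact definitions are illustrative, not the tests/r_parity_compare.py internals):

import numpy as np

def parity_metrics(z_r, z_py, loglik_r, loglik_py):
    # z_r, z_py: (n, G) responsibility matrices from the R and Python runs
    rel_err = abs(loglik_py - loglik_r) / abs(loglik_r)                 # loglik rel-err
    z_corr = np.corrcoef(z_r.ravel(), z_py.ravel())[0, 1]               # z corr
    match = float((z_r.argmax(axis=1) == z_py.argmax(axis=1)).mean())   # classification match
    return rel_err, z_corr, match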

Eleven of the 14 models reproduce R to machine precision, and VEE agrees to about 1e-4. The two outliers (EVE / VVE) share a single orientation matrix D across components — the underlying optimisation problem (Browne–McNicholas 2013) has multiple local maxima, so different solvers can land on different stationary points. We use a Stiefel-manifold gradient descent (provably monotone) where R uses the Browne–McNicholas MM algorithm; both solve the same problem and agree on simple data, but on hard data (e.g. iris with G ≥ 4) they may pick different local maxima.
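
The coupling is easiest to see in the covariance decomposition: EVE and VVE write each component as Σ_k = λ_k · D A_k Dᵀ with one orthogonal D shared by every component. An illustrative construction (numpy only, not package API):

import numpy as np

rng = np.random.default_rng(0)
D, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # shared orthogonal orientation
A1 = np.diag([2.0, 1.0, 1.0, 0.5])             # per-component diagonal shape, det = 1
A2 = np.diag([2.0, 2.5, 0.4, 0.5])             # det = 1, matching mclust's shape normalisation
Sigma1 = 1.0 * D @ A1 @ D.T                    # EVE: one volume lambda shared by all k
Sigma2 = 1.0 * D @ A2 @ D.T
# The M-step for D has no closed form: it is an optimisation over the
# orthogonal (Stiefel) manifold, which is where the local maxima live.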

For exact agreement on the EM step:

  1. Use the same z_init. mclust's hc() is a Fortran routine that optimises the classification log-likelihood; Python's Ward-on-SVD approximates it well but is not bitwise identical. To force exact agreement, run R's hc(data, modelName="VVV") once and pass the resulting partition as z_init to Mclust() — see the sketch after this list.
  2. Use the same control parameters. Defaults match emControl() (tol=1e-5, eps=.Machine$double.eps).
  3. Use the same model_names. Default in both is the full 14-model list for d>1, c("E","V") for d=1.
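
A hedged sketch of step 1: dump R's hc() partition once and reuse it from Python (the CSV round-trip and file name are illustrative, not part of either package):

# R, run once:
#   library(mclust)
#   cl <- hclass(hc(iris[, 1:4], modelName = "VVV"), 3)
#   write.csv(data.frame(cl = cl[, 1]), "hc_init.csv", row.names = FALSE)
import numpy as np
from sklearn import datasets
from mclust_py import Mclust, partition_to_z

X = datasets.load_iris().data
labels = np.loadtxt("hc_init.csv", skiprows=1, dtype=int)   # the dump from R
fit = Mclust(X, G=[3], model_names=["VVV"], z_init=partition_to_z(labels, G=3))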

To regenerate the parity dump:

# in CMAP env (R + r-mclust)
Rscript tests/r_parity_dump.R

# then in omicdev env
python tests/r_parity_compare.py
pytest tests/test_r_parity.py -q

Relationship to omicverse

This package is developed upstream in omicverse. If you already use omicverse, install it instead — ov.utils.mclust_R(adata, ...) and the upcoming ov.utils.mclust_py(adata, ...) cover the same surface plus the omicverse registry glue.

Citation

If you use this package, please cite the original mclust paper:

Scrucca L., Fop M., Murphy T.B., Raftery A.E. (2016). mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. The R Journal, 8(1), 289–317.

and acknowledge omicverse / this repo for the Python port.

License

GNU GPLv3 — matches both upstream omicverse and CRAN mclust.
