
Mixed-Copula Mixture Model for clustering mixed-type data


pymcmm

Mixed-Copula Mixture Model (MCMM) for clustering datasets with mixed continuous, categorical, and ordinal data types.

Requires Python 3.8+ · MIT License

Features

  • Mixed Data Types: Handle continuous, categorical, and ordinal variables simultaneously
  • Gaussian Copula: Capture complex dependencies between variables
  • Missing Values: Native support for missing data
  • Student-t Marginals: Robust to outliers with automatic degree of freedom estimation
  • Speedy Mode: Efficient computation for large datasets using sparse MST/KNN graphs
  • Cython Acceleration: Optional speedup (up to 35x) with Cython

Installation

From GitHub (Recommended)

pip install git+https://github.com/YuZhao20/pymcmm.git

From PyPI

pip install pymcmm

With Cython Acceleration

pip install git+https://github.com/YuZhao20/pymcmm.git
pip install cython
cd /path/to/pymcmm
python setup.py build_ext --inplace

Verify acceleration:

import mcmm
mcmm.check_acceleration()

Quick Start

import pandas as pd
from mcmm import MCMMGaussianCopulaSpeedy

df = pd.DataFrame({
    'income': [50000, 60000, 75000, 80000],
    'age': [25, 35, 45, 55],
    'gender': ['M', 'F', 'M', 'F'],
    'satisfaction': [1, 2, 3, 4],
})

model = MCMMGaussianCopulaSpeedy(
    n_components=2,
    cont_marginal='student_t',
    copula_likelihood='full',
    verbose=1
)

model.fit(
    df,
    cont_cols=['income', 'age'],
    cat_cols=['gender'],
    ord_cols=['satisfaction']
)

clusters = model.predict(df)
probabilities = model.predict_proba(df)

print(f"BIC: {model.bic_:.2f}")
print(f"Log-likelihood: {model.loglik_:.2f}")

Cython Acceleration

Overview

pymcmm includes optional Cython-accelerated implementations of its most computationally intensive operations. Cython is not required: if the compiled modules are unavailable, the package automatically falls back to pure-Python implementations.
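The fallback behavior described above can be sketched with a standard try/except import pattern. Note that the module name `_mcmm_cython` and the `norm_cdf` helper below are illustrative placeholders, not pymcmm's actual internal names:

```python
import math

# Sketch of the optional-acceleration pattern: prefer the compiled
# extension, fall back to pure Python if it is not built.
# `_mcmm_cython` is a hypothetical module name for illustration.
try:
    from _mcmm_cython import norm_cdf  # compiled extension, if built
    CYTHON_ENABLED = True
except ImportError:
    CYTHON_ENABLED = False

    def norm_cdf(x: float) -> float:
        """Pure-Python fallback: standard normal CDF via the error function."""
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(f"Cython acceleration: {CYTHON_ENABLED}")
print(f"Phi(0) = {norm_cdf(0.0):.3f}")  # Phi(0) = 0.500
```

Either way, callers see the same `norm_cdf` name, which is what lets the package "always work without Cython".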

What Gets Accelerated

| Component | Pure Python | Cython | Speedup |
|---|---|---|---|
| Normal CDF/PPF | scipy.stats | Custom C implementation | up to 10x |
| Student-t CDF | scipy.stats | Incomplete beta function | up to 15x |
| Bivariate copula density | numpy/scipy | Optimized C loops | up to 20x |
| E-step (batch) | Python loops | Parallel Cython | up to 35x |
| M-step (marginals) | Python loops | Vectorized Cython | up to 25x |
| Weighted correlation | numpy | Optimized pairwise | up to 10x |

Performance Benchmark

Typical speedup for a dataset with n=500, p=13, K=3:

| Mode | Pure Python | Cython | Speedup |
|---|---|---|---|
| MCMMGaussianCopula | ~65s | ~1.9s | up to 35x |
| MCMMGaussianCopulaSpeedy | ~45s | ~1.5s | up to 30x |

Note: Actual speedup varies depending on hardware and dataset characteristics.
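To check timings like these on your own hardware, a minimal wall-clock harness is enough. The `bench` and `norm_cdf_loop` names below are illustrative, not part of pymcmm's API (pymcmm's own `run_benchmark` is shown later):

```python
import math
import time

def bench(fn, *args, repeats=5):
    """Return the best wall-clock time over several runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

def norm_cdf_loop(xs):
    # Scalar Python loop, standing in for the "Pure Python" column.
    return [0.5 * (1.0 + math.erf(x / math.sqrt(2.0))) for x in xs]

xs = [i / 1000.0 for i in range(-5000, 5000)]
t = bench(norm_cdf_loop, xs)
print(f"norm_cdf over {len(xs)} points: {t * 1e3:.2f} ms")
```

Taking the best of several repeats reduces noise from other processes, which matters when comparing sub-second Cython timings against the pure-Python baseline.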

Building Cython Extensions

Prerequisites:

  • C compiler (gcc, clang, or MSVC)
  • Cython >= 0.29
  • NumPy development headers

macOS:

xcode-select --install
brew install libomp  # Optional: for parallel processing
pip install cython
python setup.py build_ext --inplace

Linux:

sudo apt-get install build-essential python3-dev
pip install cython
python setup.py build_ext --inplace

Windows:

pip install cython
python setup.py build_ext --inplace

Verification and Benchmarking

import mcmm

# Check if Cython is enabled
mcmm.check_acceleration()

# Run performance benchmark
mcmm.run_benchmark()

Troubleshooting

If Cython compilation fails:

  1. Missing compiler: Install build tools for your platform
  2. NumPy headers not found: Reinstall NumPy with pip install --force-reinstall numpy
  3. OpenMP errors on macOS: The library works without OpenMP; parallel loops will be sequential

The package always works without Cython; it just runs slower.

Model Classes

MCMMGaussianCopula

Full copula model with O(p^2) pairwise dependencies.

from mcmm import MCMMGaussianCopula

model = MCMMGaussianCopula(
    n_components=3,
    cont_marginal='student_t',
    copula_likelihood='full',
    max_iter=100,
    verbose=1
)

MCMMGaussianCopulaSpeedy

Optimized for large datasets using sparse graph approximation.

from mcmm import MCMMGaussianCopulaSpeedy

model = MCMMGaussianCopulaSpeedy(
    n_components=3,
    cont_marginal='student_t',
    speedy_graph='mst',
    corr_subsample=3000,
    n_jobs=-1,
    verbose=1
)
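The `speedy_graph='mst'` idea can be illustrated in isolation: instead of scoring all O(p^2) variable pairs, keep only a maximum spanning tree over |correlation| edge weights, leaving p-1 pairs. The pure-Python Prim's-algorithm sketch below is illustrative only, not pymcmm's implementation:

```python
def mst_edges(R):
    """Prim's algorithm on edge weights |R[i][j]|; returns p-1 (i, j) edges
    forming a maximum spanning tree over the strongest dependencies."""
    p = len(R)
    in_tree = {0}
    edges = []
    while len(in_tree) < p:
        best = None
        for i in in_tree:
            for j in range(p):
                if j not in in_tree:
                    w = abs(R[i][j])
                    if best is None or w > best[0]:
                        best = (w, i, j)
        edges.append((best[1], best[2]))
        in_tree.add(best[2])
    return edges

# Toy 3-variable correlation matrix: 0-1 and 1-2 are strongly related.
R = [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.4],
     [0.1, 0.4, 1.0]]
print(mst_edges(R))  # [(0, 1), (1, 2)] -- the weak 0-2 pair is dropped
```

Restricting the pairwise copula likelihood to such a tree is what makes the Speedy mode scale to larger p while still capturing the dominant dependencies.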

Parameters

| Parameter | Default | Description |
|---|---|---|
| n_components | 3 | Number of clusters |
| cont_marginal | 'student_t' | Marginal for continuous vars: 'gaussian' or 'student_t' |
| t_nu | 5.0 | Initial degrees of freedom for Student-t |
| estimate_nu | True | Estimate nu from data |
| ord_marginal | 'cumlogit' | Ordinal marginal: 'cumlogit' or 'freq' |
| copula_likelihood | 'full' | Copula type: 'full' or 'pairwise' |
| pairwise_weight | 'abs_rho' | Pairwise weight: 'abs_rho' or 'uniform' |
| dt_mode | 'mid' | Discretization mode: 'mid' or 'random' |
| shrink_lambda | 0.05 | Correlation matrix shrinkage |
| max_iter | 100 | Maximum EM iterations |
| tol | 1e-4 | Convergence tolerance |
| n_jobs | 1 | Number of parallel jobs (-1 for all cores) |
| random_state | None | Random seed for reproducibility |
| verbose | 0 | Verbosity level |

Speedy Mode Additional Parameters

| Parameter | Default | Description |
|---|---|---|
| speedy_graph | 'mst' | Graph type: 'mst' or 'knn' |
| speedy_k_per_node | 3 | K for KNN graph |
| corr_subsample | 3000 | Subsample size for correlation estimation |
| e_step_batch | 4096 | Batch size for E-step |
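A batch-size parameter like `e_step_batch` bounds peak memory by processing rows in fixed-size chunks rather than all at once. The chunking itself can be sketched as follows (illustrative helper, not pymcmm code):

```python
def iter_batches(n_rows, batch_size=4096):
    """Yield (start, stop) index pairs covering range(n_rows) in chunks."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

# The E-step would compute responsibilities for rows [start:stop] per chunk.
print(list(iter_batches(10000, 4096)))  # [(0, 4096), (4096, 8192), (8192, 10000)]
```

Smaller batches lower memory use at the cost of some per-batch overhead; larger batches do the opposite.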

Methods

Fitting

model.fit(df, cont_cols=None, cat_cols=None, ord_cols=None)

Prediction

clusters = model.predict(df)
proba = model.predict_proba(df)
log_lik = model.score_samples(df)

Outlier Detection

is_outlier, scores, threshold = model.detect_outliers(df, q=1.0)
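A plausible reading of this interface is a quantile rule: flag the q% of samples with the lowest per-sample log-likelihood. The semantics and the `flag_outliers` helper below are assumptions for illustration, not taken from pymcmm's implementation:

```python
def flag_outliers(scores, q=1.0):
    """Flag roughly the lowest q% of per-sample log-likelihood scores.
    Returns (flags, threshold); semantics assumed for illustration."""
    ordered = sorted(scores)
    k = max(1, int(round(len(scores) * q / 100.0)))
    threshold = ordered[k - 1]
    return [s <= threshold for s in scores], threshold

scores = [-2.1, -1.9, -2.0, -9.5, -2.2]  # one unusually low score
flags, thr = flag_outliers(scores, q=20.0)
print(flags)  # [False, False, False, True, False]
```

Points the fitted mixture assigns very low density to score poorly under every component, which is what makes per-sample log-likelihood a natural outlier score.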

Attributes (after fitting)

| Attribute | Description |
|---|---|
| pi_ | Cluster mixing proportions (K,) |
| mu_ | Cluster means for continuous vars (K, p_cont) |
| sig_ | Cluster stds for continuous vars (K, p_cont) |
| R_ | Correlation matrices (K, p, p) |
| fitted_nu_ | Estimated degrees of freedom |
| loglik_ | Final log-likelihood |
| bic_ | Bayesian Information Criterion |
| history_ | Log-likelihood history |
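For reference, `bic_` relates to `loglik_` through the usual definition, k * ln(n) - 2 * loglik, where k is the number of free parameters (model-dependent; the values below are illustrative, and lower BIC is better):

```python
import math

def bic(loglik, n_params, n_samples):
    """Bayesian Information Criterion: k * ln(n) - 2 * loglik (lower is better)."""
    return n_params * math.log(n_samples) - 2.0 * loglik

# Illustrative values: a fit with log-likelihood -1234.5,
# 40 free parameters, and 500 samples.
print(f"{bic(-1234.5, 40, 500):.2f}")
```

The ln(n) penalty grows with sample size, so BIC favors fewer parameters more aggressively than AIC as datasets get larger.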

Example: Customer Segmentation

import pandas as pd
from mcmm import MCMMGaussianCopulaSpeedy

df = pd.read_csv('customers.csv')

results = []
for k in range(2, 8):
    model = MCMMGaussianCopulaSpeedy(
        n_components=k,
        random_state=42,
        verbose=0
    )
    model.fit(df, 
              cont_cols=['income', 'age', 'spending'],
              cat_cols=['region', 'gender'],
              ord_cols=['satisfaction'])
    results.append({'k': k, 'bic': model.bic_, 'loglik': model.loglik_})

best = min(results, key=lambda x: x['bic'])
print(f"Best K: {best['k']} (BIC: {best['bic']:.2f})")

Scalability Guidelines

| Dataset Size | Recommended Mode | Cython |
|---|---|---|
| n < 1,000 | MCMMGaussianCopula | Optional |
| 1,000 ≤ n < 10,000 | MCMMGaussianCopulaSpeedy | Recommended |
| n ≥ 10,000 | MCMMGaussianCopulaSpeedy + n_jobs=-1 | Recommended |

Citation

If you use this package in your research, please cite:

@software{pymcmm,
  author = {Yu Zhao},
  title = {pymcmm: Mixed-Copula Mixture Model for Python},
  institution = {Tokyo University of Science},
  url = {https://github.com/YuZhao20/pymcmm},
  version = {0.3.0},
  year = {2025}
}

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
