Mixed-Copula Mixture Model for clustering mixed-type data

pymcmm

Mixed-Copula Mixture Model (MCMM) for clustering datasets with mixed continuous, categorical, and ordinal data types.

Python 3.8+ · License: MIT

Features

  • Mixed Data Types: Handle continuous, categorical, and ordinal variables simultaneously
  • Gaussian Copula: Capture complex dependencies between variables
  • Missing Values: Native support for missing data
  • Student-t Marginals: Robust to outliers, with automatic degrees-of-freedom estimation
  • Speedy Mode: Efficient computation for large datasets using sparse MST/KNN graphs
  • Cython Acceleration: Optional speedup (up to 35x) with Cython
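
The Gaussian-copula idea behind the dependency modeling can be sketched in a few lines: transform each marginal to normal scores via its empirical CDF, then measure correlation on that latent scale. This is a toy illustration of the concept, not pymcmm's internals:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.exponential(size=1000)                      # skewed continuous variable
y = np.sqrt(x) + rng.normal(scale=0.3, size=1000)   # monotone dependence on x

def normal_scores(v):
    ranks = stats.rankdata(v) / (len(v) + 1)  # empirical CDF values in (0, 1)
    return stats.norm.ppf(ranks)              # map to the latent normal scale

rho = np.corrcoef(normal_scores(x), normal_scores(y))[0, 1]
print(rho > 0.5)  # True: strong latent-Gaussian dependence despite non-normal marginals
```

Because the normal-scores transform is invariant to monotone changes of each marginal, the latent correlation captures the dependence structure separately from the marginal shapes, which is exactly what makes copulas suitable for mixed-type data.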

Installation

From GitHub (Recommended)

pip install git+https://github.com/YuZhao20/pymcmm.git

From PyPI (Coming Soon)

pip install pymcmm

With Cython Acceleration

pip install git+https://github.com/YuZhao20/pymcmm.git
pip install cython
cd /path/to/pymcmm
python setup.py build_ext --inplace

Verify acceleration:

import mcmm
mcmm.check_acceleration()

Quick Start

import pandas as pd
from mcmm import MCMMGaussianCopulaSpeedy

df = pd.DataFrame({
    'income': [50000, 60000, 75000, 80000],
    'age': [25, 35, 45, 55],
    'gender': ['M', 'F', 'M', 'F'],
    'satisfaction': [1, 2, 3, 4],
})

model = MCMMGaussianCopulaSpeedy(
    n_components=2,
    cont_marginal='student_t',
    copula_likelihood='full',
    verbose=1
)

model.fit(
    df,
    cont_cols=['income', 'age'],
    cat_cols=['gender'],
    ord_cols=['satisfaction']
)

clusters = model.predict(df)
probabilities = model.predict_proba(df)

print(f"BIC: {model.bic_:.2f}")
print(f"Log-likelihood: {model.loglik_:.2f}")

Cython Acceleration

Overview

pymcmm includes optional Cython-accelerated implementations that provide significant speedups for the computationally intensive operations. Cython is not required: if the compiled modules are unavailable, the package automatically falls back to pure-Python implementations.
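
A typical way to implement such an optional fallback is a guarded import. This is an illustrative pattern only; `mcmm_fast` is a hypothetical module name, not pymcmm's actual layout:

```python
try:
    from mcmm_fast import normal_cdf  # hypothetical compiled extension
    ACCELERATED = True
except ImportError:
    # Pure-Python fallback: standard normal CDF via the error function
    from math import erf, sqrt

    def normal_cdf(x):
        return 0.5 * (1.0 + erf(x / sqrt(2.0)))

    ACCELERATED = False

print(round(normal_cdf(0.0), 3))  # 0.5
```

Callers use the same `normal_cdf` name either way, so the rest of the code base never needs to know which implementation is active.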

What Gets Accelerated

| Component | Pure Python | Cython | Speedup |
|---|---|---|---|
| Normal CDF/PPF | scipy.stats | Custom C implementation | up to 10x |
| Student-t CDF | scipy.stats | Incomplete beta function | up to 15x |
| Bivariate copula density | numpy/scipy | Optimized C loops | up to 20x |
| E-step (batch) | Python loops | Parallel Cython | up to 35x |
| M-step (marginals) | Python loops | Vectorized Cython | up to 25x |
| Weighted correlation | numpy | Optimized pairwise | up to 10x |
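
The Student-t CDF row refers to the standard identity expressing the t CDF through the regularized incomplete beta function. A quick SciPy check of that identity (not pymcmm's compiled code):

```python
import numpy as np
from scipy import special, stats

def t_cdf(x, nu):
    """Student-t CDF via the regularized incomplete beta function I_z(nu/2, 1/2)."""
    x = np.asarray(x, dtype=float)
    z = nu / (nu + x ** 2)
    upper = 1.0 - 0.5 * special.betainc(nu / 2.0, 0.5, z)  # valid for x >= 0
    return np.where(x >= 0.0, upper, 1.0 - upper)          # symmetry for x < 0

vals = np.array([-2.0, 0.0, 1.5])
print(np.allclose(t_cdf(vals, 5.0), stats.t.cdf(vals, 5.0)))  # True
```

Because `betainc` reduces to a few C-level special-function calls, this route avoids the overhead of the general `scipy.stats` distribution machinery.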

Performance Benchmark

Typical runtimes for a dataset with n=500, p=13, K=3:

| Mode | Pure Python | Cython | Speedup |
|---|---|---|---|
| MCMMGaussianCopula | ~65 s | ~1.9 s | up to 35x |
| MCMMGaussianCopulaSpeedy | ~45 s | ~1.5 s | up to 30x |

Note: Actual speedup varies depending on hardware and dataset characteristics.

Building Cython Extensions

Prerequisites:

  • C compiler (gcc, clang, or MSVC)
  • Cython >= 0.29
  • NumPy development headers

macOS:

xcode-select --install
brew install libomp  # Optional: for parallel processing
pip install cython
python setup.py build_ext --inplace

Linux:

sudo apt-get install build-essential python3-dev
pip install cython
python setup.py build_ext --inplace

Windows:

pip install cython
python setup.py build_ext --inplace

Verification and Benchmarking

import mcmm

# Check if Cython is enabled
mcmm.check_acceleration()

# Run performance benchmark
mcmm.run_benchmark()

Troubleshooting

If Cython compilation fails:

  1. Missing compiler: Install build tools for your platform
  2. NumPy headers not found: Reinstall NumPy with pip install --force-reinstall numpy
  3. OpenMP errors on macOS: The library works without OpenMP; parallel loops will be sequential

The package always works without Cython; it is just slower.

Model Classes

MCMMGaussianCopula

Full copula model with O(p^2) pairwise dependencies.

from mcmm import MCMMGaussianCopula

model = MCMMGaussianCopula(
    n_components=3,
    cont_marginal='student_t',
    copula_likelihood='full',
    max_iter=100,
    verbose=1
)

MCMMGaussianCopulaSpeedy

Optimized for large datasets using sparse graph approximation.

from mcmm import MCMMGaussianCopulaSpeedy

model = MCMMGaussianCopulaSpeedy(
    n_components=3,
    cont_marginal='student_t',
    speedy_graph='mst',
    corr_subsample=3000,
    n_jobs=-1,
    verbose=1
)

Parameters

| Parameter | Default | Description |
|---|---|---|
| n_components | 3 | Number of clusters |
| cont_marginal | 'student_t' | Marginal for continuous variables: 'gaussian' or 'student_t' |
| t_nu | 5.0 | Initial degrees of freedom for the Student-t marginal |
| estimate_nu | True | Estimate nu from the data |
| ord_marginal | 'cumlogit' | Ordinal marginal: 'cumlogit' or 'freq' |
| copula_likelihood | 'full' | Copula likelihood: 'full' or 'pairwise' |
| pairwise_weight | 'abs_rho' | Pairwise weighting: 'abs_rho' or 'uniform' |
| dt_mode | 'mid' | Discretization mode: 'mid' or 'random' |
| shrink_lambda | 0.05 | Correlation-matrix shrinkage strength |
| max_iter | 100 | Maximum EM iterations |
| tol | 1e-4 | Convergence tolerance |
| n_jobs | 1 | Number of parallel jobs (-1 for all cores) |
| random_state | None | Random seed for reproducibility |
| verbose | 0 | Verbosity level |
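
shrink_lambda controls linear shrinkage of each cluster's correlation matrix toward the identity, which keeps the matrix well conditioned when clusters are small. A sketch of the standard scheme (pymcmm's exact formula may differ):

```python
import numpy as np

def shrink_correlation(R, lam=0.05):
    """Shrink a correlation matrix toward the identity:
    R_shrunk = (1 - lam) * R + lam * I. Diagonal stays exactly 1."""
    return (1.0 - lam) * R + lam * np.eye(R.shape[0])

R = np.array([[1.0, 0.9],
              [0.9, 1.0]])
print(round(shrink_correlation(R)[0, 1], 3))  # 0.855: off-diagonal pulled toward 0
```

Larger values of shrink_lambda pull the off-diagonal entries harder toward zero, trading some dependence signal for numerical stability.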

Speedy Mode Additional Parameters

| Parameter | Default | Description |
|---|---|---|
| speedy_graph | 'mst' | Graph type: 'mst' or 'knn' |
| speedy_k_per_node | 3 | k for the KNN graph |
| corr_subsample | 3000 | Subsample size for correlation estimation |
| e_step_batch | 4096 | Batch size for the E-step |
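
The 'mst' option keeps only a spanning tree of the strongest pairwise dependencies, reducing the number of pairwise copula terms from O(p^2) to p - 1. A sketch of the idea with SciPy (illustrative, not pymcmm's code):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] += X[:, 0]                      # induce dependence between columns 0 and 1
R = np.corrcoef(X, rowvar=False)

# Edge weight 1 - |rho|: strongly dependent pairs get short edges,
# so the minimum spanning tree retains the strongest dependencies.
mst = minimum_spanning_tree(1.0 - np.abs(R))
print(mst.nnz)  # 4 edges kept (p - 1) instead of all p*(p-1)/2 = 10 pairs
```

Evaluating the pairwise copula likelihood only on these edges is what makes the Speedy variant scale to wide datasets.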

Methods

Fitting

model.fit(df, cont_cols=None, cat_cols=None, ord_cols=None)

Prediction

clusters = model.predict(df)
proba = model.predict_proba(df)
log_lik = model.score_samples(df)
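
As in other mixture models, the hard labels from predict correspond to the row-wise argmax of predict_proba. A generic illustration of that relationship (not mcmm source):

```python
import numpy as np

proba = np.array([[0.7, 0.2, 0.1],   # each row: posterior over K clusters, sums to 1
                  [0.1, 0.3, 0.6]])
labels = proba.argmax(axis=1)
print(labels)  # [0 2]
```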

Outlier Detection

is_outlier, scores, threshold = model.detect_outliers(df, q=1.0)
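
Here q plays the role of a percentile cutoff on per-sample log-likelihood scores: points scoring below the q-th percentile are flagged. This is a generic sketch of that thresholding under assumed semantics; check the docstring for the exact definition of q:

```python
import numpy as np

def quantile_outliers(scores, q=1.0):
    """Flag points whose score falls below the q-th percentile.
    Returns (is_outlier, scores, threshold), mirroring detect_outliers' shape."""
    threshold = np.percentile(scores, q)
    return scores < threshold, scores, threshold

rng = np.random.default_rng(0)
scores = rng.normal(size=1000)          # stand-in for per-sample log-likelihoods
is_outlier, _, thr = quantile_outliers(scores, q=1.0)
print(is_outlier.sum())  # 10: the lowest 1% of 1000 points flagged
```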

Attributes (after fitting)

| Attribute | Description |
|---|---|
| pi_ | Cluster mixing proportions, shape (K,) |
| mu_ | Cluster means for continuous variables, shape (K, p_cont) |
| sig_ | Cluster standard deviations for continuous variables, shape (K, p_cont) |
| R_ | Correlation matrices, shape (K, p, p) |
| fitted_nu_ | Estimated degrees of freedom |
| loglik_ | Final log-likelihood |
| bic_ | Bayesian Information Criterion |
| history_ | Log-likelihood history |

Example: Customer Segmentation

import pandas as pd
from mcmm import MCMMGaussianCopulaSpeedy

df = pd.read_csv('customers.csv')

results = []
for k in range(2, 8):
    model = MCMMGaussianCopulaSpeedy(
        n_components=k,
        random_state=42,
        verbose=0
    )
    model.fit(df, 
              cont_cols=['income', 'age', 'spending'],
              cat_cols=['region', 'gender'],
              ord_cols=['satisfaction'])
    results.append({'k': k, 'bic': model.bic_, 'loglik': model.loglik_})

best = min(results, key=lambda x: x['bic'])
print(f"Best K: {best['k']} (BIC: {best['bic']:.2f})")

Scalability Guidelines

| Dataset size | Recommended mode | Cython |
|---|---|---|
| n < 1,000 | MCMMGaussianCopula | Optional |
| 1,000 ≤ n < 10,000 | MCMMGaussianCopulaSpeedy | Recommended |
| n ≥ 10,000 | MCMMGaussianCopulaSpeedy with n_jobs=-1 | Recommended |

Citation

If you use this package in your research, please cite:

@software{pymcmm,
  author = {Yu Zhao},
  title = {pymcmm: Mixed-Copula Mixture Model for Python},
  institution = {Tokyo University of Science},
  url = {https://github.com/YuZhao20/pymcmm},
  version = {0.3.0},
  year = {2025}
}

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
