
Mixed-Copula Mixture Model for clustering mixed-type data


pymcmm

Mixed-Copula Mixture Model (MCMM) for clustering datasets with mixed continuous, categorical, and ordinal data types.

Requires Python 3.8+ · MIT License

Features

  • Mixed Data Types: Handle continuous, categorical, and ordinal variables simultaneously
  • Gaussian Copula: Capture complex dependencies between variables
  • Missing Values: Native support for missing data
  • Student-t Marginals: Robust to outliers with automatic degree of freedom estimation
  • Speedy Mode: Efficient computation for large datasets using sparse MST/KNN graphs
  • Cython Acceleration: Optional speedup (up to 35x) with Cython

Installation

From GitHub (Recommended)

pip install git+https://github.com/YuZhao20/pymcmm.git

From PyPI

pip install pymcmm

With Cython Acceleration

pip install git+https://github.com/YuZhao20/pymcmm.git
pip install cython
cd /path/to/pymcmm
python setup.py build_ext --inplace

Verify acceleration:

import mcmm
mcmm.check_acceleration()

Quick Start

import pandas as pd
from mcmm import MCMMGaussianCopulaSpeedy

df = pd.DataFrame({
    'income': [50000, 60000, 75000, 80000],
    'age': [25, 35, 45, 55],
    'gender': ['M', 'F', 'M', 'F'],
    'satisfaction': [1, 2, 3, 4],
})

model = MCMMGaussianCopulaSpeedy(
    n_components=2,
    cont_marginal='student_t',
    copula_likelihood='full',
    verbose=1
)

model.fit(
    df,
    cont_cols=['income', 'age'],
    cat_cols=['gender'],
    ord_cols=['satisfaction']
)

clusters = model.predict(df)
probabilities = model.predict_proba(df)

print(f"BIC: {model.bic_:.2f}")
print(f"Log-likelihood: {model.loglik_:.2f}")

Cython Acceleration

Overview

pymcmm includes optional Cython-accelerated implementations of its most computationally intensive operations. Cython is not required: if the compiled modules are unavailable, the package automatically falls back to pure-Python implementations.
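The fallback behavior described above can be sketched with a standard try/except import pattern. Note that the module name `_mcmm_cython` and the `norm_cdf` helper below are illustrative placeholders, not pymcmm's actual internal names:

```python
import math

# Sketch of the optional-acceleration pattern: prefer the compiled
# extension, fall back to pure Python if it is not built.
# `_mcmm_cython` is a hypothetical module name for illustration.
try:
    from _mcmm_cython import norm_cdf  # compiled extension, if built
    CYTHON_ENABLED = True
except ImportError:
    CYTHON_ENABLED = False

    def norm_cdf(x: float) -> float:
        """Pure-Python fallback: standard normal CDF via the error function."""
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(f"Cython acceleration: {CYTHON_ENABLED}")
print(f"Phi(0) = {norm_cdf(0.0):.3f}")  # Phi(0) = 0.500
```

Either way, callers see the same `norm_cdf` name, which is what lets the package "always work without Cython".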

What Gets Accelerated

| Component | Pure Python | Cython | Speedup |
|---|---|---|---|
| Normal CDF/PPF | scipy.stats | Custom C implementation | up to 10x |
| Student-t CDF | scipy.stats | Incomplete beta function | up to 15x |
| Bivariate copula density | numpy/scipy | Optimized C loops | up to 20x |
| E-step (batch) | Python loops | Parallel Cython | up to 35x |
| M-step (marginals) | Python loops | Vectorized Cython | up to 25x |
| Weighted correlation | numpy | Optimized pairwise | up to 10x |

Performance Benchmark

Typical speedup for a dataset with n=500, p=13, K=3:

| Mode | Pure Python | Cython | Speedup |
|---|---|---|---|
| MCMMGaussianCopula | ~65s | ~1.9s | up to 35x |
| MCMMGaussianCopulaSpeedy | ~45s | ~1.5s | up to 30x |

Note: Actual speedup varies depending on hardware and dataset characteristics.
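To check timings like these on your own hardware, a minimal wall-clock harness is enough. The `bench` and `norm_cdf_loop` names below are illustrative, not part of pymcmm's API (pymcmm's own `run_benchmark` is shown later):

```python
import math
import time

def bench(fn, *args, repeats=5):
    """Return the best wall-clock time over several runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

def norm_cdf_loop(xs):
    # Scalar Python loop, standing in for the "Pure Python" column.
    return [0.5 * (1.0 + math.erf(x / math.sqrt(2.0))) for x in xs]

xs = [i / 1000.0 for i in range(-5000, 5000)]
t = bench(norm_cdf_loop, xs)
print(f"norm_cdf over {len(xs)} points: {t * 1e3:.2f} ms")
```

Taking the best of several repeats reduces noise from other processes, which matters when comparing sub-second Cython timings against the pure-Python baseline.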

Building Cython Extensions

Prerequisites:

  • C compiler (gcc, clang, or MSVC)
  • Cython >= 0.29
  • NumPy development headers

macOS:

xcode-select --install
brew install libomp  # Optional: for parallel processing
pip install cython
python setup.py build_ext --inplace

Linux:

sudo apt-get install build-essential python3-dev
pip install cython
python setup.py build_ext --inplace

Windows:

pip install cython
python setup.py build_ext --inplace

Verification and Benchmarking

import mcmm

# Check if Cython is enabled
mcmm.check_acceleration()

# Run performance benchmark
mcmm.run_benchmark()

Troubleshooting

If Cython compilation fails:

  1. Missing compiler: Install build tools for your platform
  2. NumPy headers not found: Reinstall NumPy with pip install --force-reinstall numpy
  3. OpenMP errors on macOS: The library works without OpenMP; parallel loops will be sequential

The package always works without Cython; it just runs slower.

Model Classes

MCMMGaussianCopula

Full copula model with O(p^2) pairwise dependencies.

from mcmm import MCMMGaussianCopula

model = MCMMGaussianCopula(
    n_components=3,
    cont_marginal='student_t',
    copula_likelihood='full',
    max_iter=100,
    verbose=1
)

MCMMGaussianCopulaSpeedy

Optimized for large datasets using sparse graph approximation.

from mcmm import MCMMGaussianCopulaSpeedy

model = MCMMGaussianCopulaSpeedy(
    n_components=3,
    cont_marginal='student_t',
    speedy_graph='mst',
    corr_subsample=3000,
    n_jobs=-1,
    verbose=1
)
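The `speedy_graph='mst'` idea can be illustrated in isolation: instead of scoring all O(p^2) variable pairs, keep only a maximum spanning tree over |correlation| edge weights, leaving p-1 pairs. The pure-Python Prim's-algorithm sketch below is illustrative only, not pymcmm's implementation:

```python
def mst_edges(R):
    """Prim's algorithm on edge weights |R[i][j]|; returns p-1 (i, j) edges
    forming a maximum spanning tree over the strongest dependencies."""
    p = len(R)
    in_tree = {0}
    edges = []
    while len(in_tree) < p:
        best = None
        for i in in_tree:
            for j in range(p):
                if j not in in_tree:
                    w = abs(R[i][j])
                    if best is None or w > best[0]:
                        best = (w, i, j)
        edges.append((best[1], best[2]))
        in_tree.add(best[2])
    return edges

# Toy 3-variable correlation matrix: 0-1 and 1-2 are strongly related.
R = [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.4],
     [0.1, 0.4, 1.0]]
print(mst_edges(R))  # [(0, 1), (1, 2)] -- the weak 0-2 pair is dropped
```

Restricting the pairwise copula likelihood to such a tree is what makes the Speedy mode scale to larger p while still capturing the dominant dependencies.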

Parameters

| Parameter | Default | Description |
|---|---|---|
| n_components | 3 | Number of clusters |
| cont_marginal | 'student_t' | Marginal for continuous vars: 'gaussian' or 'student_t' |
| t_nu | 5.0 | Initial degrees of freedom for Student-t |
| estimate_nu | True | Estimate nu from data |
| ord_marginal | 'cumlogit' | Ordinal marginal: 'cumlogit' or 'freq' |
| copula_likelihood | 'full' | Copula type: 'full' or 'pairwise' |
| pairwise_weight | 'abs_rho' | Pairwise weight: 'abs_rho' or 'uniform' |
| dt_mode | 'mid' | Discretization mode: 'mid' or 'random' |
| shrink_lambda | 0.05 | Correlation matrix shrinkage |
| max_iter | 100 | Maximum EM iterations |
| tol | 1e-4 | Convergence tolerance |
| n_jobs | 1 | Number of parallel jobs (-1 for all cores) |
| random_state | None | Random seed for reproducibility |
| verbose | 0 | Verbosity level |

Speedy Mode Additional Parameters

| Parameter | Default | Description |
|---|---|---|
| speedy_graph | 'mst' | Graph type: 'mst' or 'knn' |
| speedy_k_per_node | 3 | K for KNN graph |
| corr_subsample | 3000 | Subsample size for correlation estimation |
| e_step_batch | 4096 | Batch size for E-step |
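A batch-size parameter like `e_step_batch` bounds peak memory by processing rows in fixed-size chunks rather than all at once. The chunking itself can be sketched as follows (illustrative helper, not pymcmm code):

```python
def iter_batches(n_rows, batch_size=4096):
    """Yield (start, stop) index pairs covering range(n_rows) in chunks."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

# The E-step would compute responsibilities for rows [start:stop] per chunk.
print(list(iter_batches(10000, 4096)))  # [(0, 4096), (4096, 8192), (8192, 10000)]
```

Smaller batches lower memory use at the cost of some per-batch overhead; larger batches do the opposite.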

Methods

Fitting

model.fit(df, cont_cols=None, cat_cols=None, ord_cols=None)

Prediction

clusters = model.predict(df)
proba = model.predict_proba(df)
log_lik = model.score_samples(df)

Outlier Detection

is_outlier, scores, threshold = model.detect_outliers(df, q=1.0)
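A plausible reading of this interface is a quantile rule: flag the q% of samples with the lowest per-sample log-likelihood. The semantics and the `flag_outliers` helper below are assumptions for illustration, not taken from pymcmm's implementation:

```python
def flag_outliers(scores, q=1.0):
    """Flag roughly the lowest q% of per-sample log-likelihood scores.
    Returns (flags, threshold); semantics assumed for illustration."""
    ordered = sorted(scores)
    k = max(1, int(round(len(scores) * q / 100.0)))
    threshold = ordered[k - 1]
    return [s <= threshold for s in scores], threshold

scores = [-2.1, -1.9, -2.0, -9.5, -2.2]  # one unusually low score
flags, thr = flag_outliers(scores, q=20.0)
print(flags)  # [False, False, False, True, False]
```

Points the fitted mixture assigns very low density to score poorly under every component, which is what makes per-sample log-likelihood a natural outlier score.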

Attributes (after fitting)

| Attribute | Description |
|---|---|
| pi_ | Cluster mixing proportions (K,) |
| mu_ | Cluster means for continuous vars (K, p_cont) |
| sig_ | Cluster stds for continuous vars (K, p_cont) |
| R_ | Correlation matrices (K, p, p) |
| fitted_nu_ | Estimated degrees of freedom |
| loglik_ | Final log-likelihood |
| bic_ | Bayesian Information Criterion |
| history_ | Log-likelihood history |
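For reference, `bic_` relates to `loglik_` through the usual definition, k * ln(n) - 2 * loglik, where k is the number of free parameters (model-dependent; the values below are illustrative, and lower BIC is better):

```python
import math

def bic(loglik, n_params, n_samples):
    """Bayesian Information Criterion: k * ln(n) - 2 * loglik (lower is better)."""
    return n_params * math.log(n_samples) - 2.0 * loglik

# Illustrative values: a fit with log-likelihood -1234.5,
# 40 free parameters, and 500 samples.
print(f"{bic(-1234.5, 40, 500):.2f}")
```

The ln(n) penalty grows with sample size, so BIC favors fewer parameters more aggressively than AIC as datasets get larger.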

Example: Customer Segmentation

import pandas as pd
from mcmm import MCMMGaussianCopulaSpeedy

df = pd.read_csv('customers.csv')

results = []
for k in range(2, 8):
    model = MCMMGaussianCopulaSpeedy(
        n_components=k,
        random_state=42,
        verbose=0
    )
    model.fit(df, 
              cont_cols=['income', 'age', 'spending'],
              cat_cols=['region', 'gender'],
              ord_cols=['satisfaction'])
    results.append({'k': k, 'bic': model.bic_, 'loglik': model.loglik_})

best = min(results, key=lambda x: x['bic'])
print(f"Best K: {best['k']} (BIC: {best['bic']:.2f})")

Scalability Guidelines

| Dataset Size | Recommended Mode | Cython |
|---|---|---|
| n < 1,000 | MCMMGaussianCopula | Optional |
| 1,000 ≤ n < 10,000 | MCMMGaussianCopulaSpeedy | Recommended |
| n ≥ 10,000 | MCMMGaussianCopulaSpeedy + n_jobs=-1 | Recommended |

Citation

If you use this package in your research, please cite:

@software{pymcmm,
  author = {Yu Zhao},
  title = {pymcmm: Mixed-Copula Mixture Model for Python},
  institution = {Tokyo University of Science},
  url = {https://github.com/YuZhao20/pymcmm},
  version = {0.3.0},
  year = {2025}
}

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
