A JIT optimized K-Prototype algorithm
Project description
KPrototype plus (kpplus)
Description
K-prototype is a clustering method designed to handle both categorical and numerical variables [1].
KPrototype plus (kpplus) is a Python 3 package that uses Numba to speed up nicodv's KPrototypes function.
This code is part of Stockholms diabetespreventiva program.
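To illustrate how k-prototypes handles both variable types, here is a minimal sketch of the mixed dissimilarity described by Huang [1]: squared Euclidean distance on the numerical variables plus a weight gamma times the number of mismatching categorical variables. The function name and the sample values are made up for illustration; this is not the kpplus implementation, which is compiled with Numba.

```python
def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma):
    """Dissimilarity between one record and one prototype for mixed data.

    Illustrative helper only (hypothetical name, not part of kpplus).
    """
    # Numerical part: squared Euclidean distance.
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    # Categorical part: number of variables whose categories differ.
    categorical_part = sum(1 for a, b in zip(x_cat, proto_cat) if a != b)
    # gamma weights the categorical part against the numerical part.
    return numeric_part + gamma * categorical_part


record_num, record_cat = [1.0, 2.0], ["male", "smoker"]
proto_num, proto_cat = [0.0, 2.0], ["male", "non-smoker"]
print(mixed_dissimilarity(record_num, record_cat, proto_num, proto_cat, gamma=0.5))  # 1.5
```

A larger gamma makes categorical mismatches count more, so clusters are driven more by the categorical variables.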
Performance improvement
As an example, I tested the performance on one of the Heart Disease Data Sets from UCI. The data set contains 4455 rows, 7 categorical variables, and 5 numerical variables. The timings below compare nicodv's kprototype function with k_prototype_plus.
< nicodv's kprototype >
CPU times: user 2.14 s, sys: 18.2 ms, total: 2.16 s
Wall time: 1min 41s
< k_prototype_plus >
CPU times: user 298 ms, sys: 9.24 ms, total: 308 ms
Wall time: 13.4 s
Notice: Only Cao initialization is supported as the initialization method [2].
System requirement
Installation
pip install kpplus
Usage
from kpplus import KPrototypes_plus
model = KPrototypes_plus(n_clusters=3, n_init=4, gamma=None, n_jobs=-1)  # initialize the model
model.fit_predict(X=df, categorical=[0, 1])  # fit the model to the data and its categorical variables
model.labels_  # cluster labels
model.cluster_centroids_  # cluster centroids (prototypes)
model.n_iter_  # number of iterations
model.cost_  # clustering cost
n_clusters: the number of clusters
n_init: the number of runs with different initializations, executed in parallel
gamma (optional): a weight that controls how strongly the algorithm favours categorical variables. (By default, it is the mean standard deviation of all numerical variables.)
n_jobs (optional, default=-1): the number of parallel processors. ('-1' means using all available processors.)
X: 2-D numpy array (dataset)
types: a NumPy array that indicates whether each variable is categorical (1) or numerical (0).
For example, types = [1,1,0,0,0,0] means that the first two variables are categorical and the last four variables are numerical.
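As a small illustration of the indicator format used in the types example, the sketch below builds such a 0/1 list from the positions of the categorical columns. The helper make_types is hypothetical and not part of kpplus; it assumes you know the column positions of your categorical variables.

```python
def make_types(n_columns, categorical_indices):
    """Return a 0/1 list: 1 marks a categorical column, 0 a numerical one.

    Hypothetical helper for illustration; not part of kpplus.
    """
    categorical = set(categorical_indices)
    out_of_range = sorted(i for i in categorical if not 0 <= i < n_columns)
    if out_of_range:
        raise ValueError(f"column indices out of range: {out_of_range}")
    return [1 if i in categorical else 0 for i in range(n_columns)]


types = make_types(6, [0, 1])
print(types)  # [1, 1, 0, 0, 0, 0]
```

If the package expects a NumPy array, the resulting list can be wrapped with numpy.array(types).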
Acknowledgement
I'm extremely grateful to Dr. Diego Yacaman Mendez and Dr. David Ebbevi for their support. They are two brilliant researchers who started this project with excellent knowledge of medical science, epidemiology, statistics and programming.
Reference
[1] Huang Z. Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Knowledge Discovery. 1998;2(3):283-304.
[2] Cao F, Liang J, Bai L. A new initialization method for categorical data clustering. Expert Systems with Applications. 2009;36(7):10223-8.