A JIT optimized K-Prototype algorithm
KPrototype plus
Description
K-prototype is a clustering method designed to handle both categorical and numerical variables [1].
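As a rough illustration of how a single distance can cover both variable types, the sketch below follows the mixed dissimilarity described in [1]: squared Euclidean distance on the numerical part plus gamma times the number of mismatches on the categorical part. The function name mixed_dissimilarity and its arguments are hypothetical and are not part of this package's API.

```python
import numpy as np

def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma):
    """Distance between one record and one prototype (illustrative only)."""
    x_num = np.asarray(x_num, dtype=float)
    proto_num = np.asarray(proto_num, dtype=float)
    # Numerical part: squared Euclidean distance.
    numeric_part = float(np.sum((x_num - proto_num) ** 2))
    # Categorical part: count of attributes whose values differ.
    categorical_part = int(np.sum(np.asarray(x_cat) != np.asarray(proto_cat)))
    # gamma weights the categorical part against the numerical part.
    return numeric_part + gamma * categorical_part

d = mixed_dissimilarity([1.0, 2.0], ["a", "x"], [0.0, 2.0], ["a", "y"], gamma=0.5)
same = mixed_dissimilarity([1.0, 2.0], ["a", "x"], [1.0, 2.0], ["a", "x"], gamma=0.5)
```

A larger gamma makes categorical mismatches count for more, which is why the gamma parameter controls how strongly the clustering favours the categorical variables.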
KPrototype plus is a Python 3 package that uses Numba to speed up nicodv's KPrototypes function.
This code is part of Stockholms diabetespreventiva program.
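To give a feel for the kind of loop that benefits from Numba's JIT compilation, here is a minimal sketch that assigns each row to its nearest center using only the numerical distance. It is not taken from this package's source; assign_numeric is a hypothetical helper, and the fallback decorator is only there so the example still runs when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: fall back to plain Python if Numba is missing.
    def njit(func):
        return func

@njit
def assign_numeric(points, centers):
    """Return the index of the nearest center for every row of points."""
    n = points.shape[0]
    k = centers.shape[0]
    labels = np.empty(n, dtype=np.int64)
    for i in range(n):
        best = 0
        best_dist = np.inf
        for c in range(k):
            dist = 0.0
            for j in range(points.shape[1]):
                diff = points[i, j] - centers[c, j]
                dist += diff * diff
            if dist < best_dist:
                best_dist = dist
                best = c
        labels[i] = best
    return labels

points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = assign_numeric(points, centers)
```

Explicit nested loops like these are slow in plain Python but compile to fast machine code under Numba, at the price of a one-time compilation delay on the first call.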
Performance improvement
As an example, I used one of the Heart Disease Data Sets from UCI to test the performance. This data set contains 4455 rows, 7 categorical variables, and 5 numerical variables. The timings below compare nicodv's kprototype function with k_prototype_plus: wall time dropped from 1 min 41 s to 13.4 s, roughly a 7.5x speedup.
< nicodv's kprototype >
CPU times: user 2.14 s, sys: 18.2 ms, total: 2.16 s
Wall time: 1min 41s
< k_prototype_plus >
CPU times: user 298 ms, sys: 9.24 ms, total: 308 ms
Wall time: 13.4 s
Notice: Only Cao initialization is supported as the initialization method [2].
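For readers unfamiliar with Cao initialization, the following is a simplified sketch of the idea in [2], applied to categorical columns only: the first center is the row with the highest density (the average relative frequency of its attribute values), and each later center is the row that maximizes the minimum of distance times density over the centers chosen so far. The function cao_init is hypothetical, and how this package treats the numerical columns during initialization is not covered here.

```python
import numpy as np

def cao_init(X_cat, n_clusters):
    """Pick initial categorical centers with a density-based rule (sketch)."""
    X_cat = np.asarray(X_cat)
    n, m = X_cat.shape

    # Density of a row: mean relative frequency of its values, per attribute.
    density = np.zeros(n)
    for j in range(m):
        _, inverse, counts = np.unique(
            X_cat[:, j], return_inverse=True, return_counts=True
        )
        density += counts[inverse] / n
    density /= m

    # First center: the densest row.
    chosen = [int(np.argmax(density))]
    for _ in range(1, n_clusters):
        # Score each row by min over chosen centers of (mismatches * density).
        scores = np.full(n, np.inf)
        for c in chosen:
            dist = np.sum(X_cat != X_cat[c], axis=1)
            scores = np.minimum(scores, dist * density)
        chosen.append(int(np.argmax(scores)))
    return X_cat[chosen]

X_cat = np.array([["a", "x"], ["a", "x"], ["a", "y"], ["b", "z"]])
init_centers = cao_init(X_cat, n_clusters=2)
```

Because the rule is deterministic, repeated runs on the same data start from the same centers, unlike random initialization.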
System requirement
Installation
pip install kpplus
Usage
from kpplus import KPrototypes_plus
model = KPrototypes_plus(n_clusters=3, n_init=4, gamma=None, n_jobs=-1)  # initialize the model
model.fit_predict(X=df, categorical=[0, 1])  # fit the model to the data and the categorical specification
model.labels_  # the cluster labels
model.cluster_centroids_  # the cluster centroid points (prototypes)
model.n_iter_  # the number of iterations
model.cost_  # the clustering cost
n_clusters: the number of clusters
n_init: the number of runs, each with a different initialization, executed in parallel
gamma (optional): A value that controls how strongly the algorithm favours categorical variables.
By default, it is the mean standard deviation of all numeric variables
n_jobs (optional, default=-1): The number of processors used in parallel:
'-1' means using all available processors
X: 2-D numpy array (dataset)
types: A numpy array that indicates whether each variable is categorical (1) or numerical (0).
For example: types = [1,1,0,0,0,0]
means the first two variables are categorical and the last four variables are numerical.
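The parameter notes above can be made concrete with a small sketch. It builds a types array for a toy data set, derives the indices of the categorical columns from it, and computes a gamma value as the mean of the per-column standard deviations of the numeric variables. The toy data and the use of the population standard deviation (NumPy's default) are assumptions; the package may compute its default differently.

```python
import numpy as np

# Toy data: the first two columns are categorical codes, the last four numeric.
X = np.array([
    [0, 1, 1.5, 10.0, 0.2, 3.0],
    [1, 0, 2.5, 12.0, 0.4, 1.0],
    [0, 1, 3.5, 14.0, 0.6, 5.0],
])
types = np.array([1, 1, 0, 0, 0, 0])  # 1 = categorical, 0 = numerical

cat_idx = np.flatnonzero(types == 1)  # positions of categorical columns
num_idx = np.flatnonzero(types == 0)  # positions of numerical columns

# Assumed reading of "mean std": average the standard deviation of each
# numeric column (population std, ddof=0).
X_num = X[:, num_idx].astype(float)
gamma = X_num.std(axis=0).mean()
```

Here cat_idx comes out as the index list [0, 1], the same form as the categorical argument passed in the usage example.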
Acknowledgement
I'm extremely grateful to Dr. Diego Yacaman Mendez and Dr. David Ebbevi for their support. They are two brilliant researchers who started this project.
References
[1] Huang Z. Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Knowledge Discovery. 1998;2(3):283-304.
[2] Cao F, Liang J, Bai L. A new initialization method for categorical data clustering. Expert Systems with Applications. 2009;36(7):10223-8.