A JIT optimized K-Prototype algorithm

KPrototype plus (kpplus)

Description

K-prototype is a clustering method designed to handle both categorical and numerical variables [1].
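To make the idea concrete, here is a minimal sketch of the mixed dissimilarity described in [1]: squared Euclidean distance on the numeric attributes plus gamma times the number of mismatching categorical attributes. The function name `mixed_dissimilarity` and the toy values are assumptions for illustration and are not part of kpplus.

```python
import numpy as np

def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma):
    """Distance between one record and one prototype (illustrative only)."""
    x_num = np.asarray(x_num, dtype=float)
    proto_num = np.asarray(proto_num, dtype=float)
    # Numeric part: squared Euclidean distance.
    numeric_cost = float(np.sum((x_num - proto_num) ** 2))
    # Categorical part: number of attributes whose values differ.
    mismatches = int(np.sum(np.asarray(x_cat) != np.asarray(proto_cat)))
    # gamma sets how much one categorical mismatch weighs against numeric distance.
    return numeric_cost + gamma * mismatches

d = mixed_dissimilarity([1.0, 2.0], ["a", "x"], [0.0, 2.0], ["a", "y"], gamma=0.5)
print(d)  # 1.5
```

A larger gamma makes categorical mismatches dominate the assignment; a gamma of 0 reduces the measure to plain k-means distance on the numeric columns.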

KPrototype plus (kpplus) is a Python 3 package designed to improve the performance of nicodv's KPrototypes function by using Numba.
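As a rough illustration of why Numba helps, the sketch below JIT-compiles an explicit nearest-center loop over the numeric columns. It is not the package's actual code: `assign_numeric` is a hypothetical helper, and the fallback decorator exists only so the example still runs when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # compiles the decorated function on first call
except ImportError:
    def njit(func):  # fallback: run as plain Python if Numba is missing
        return func

@njit
def assign_numeric(X, centers):
    """Return, for each row of X, the index of the nearest center."""
    n, m = X.shape
    k = centers.shape[0]
    labels = np.empty(n, dtype=np.int64)
    for i in range(n):
        best = 0
        best_dist = np.inf
        for c in range(k):
            dist = 0.0
            for j in range(m):
                diff = X[i, j] - centers[c, j]
                dist += diff * diff
            if dist < best_dist:
                best_dist = dist
                best = c
        labels[i] = best
    return labels

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = assign_numeric(X, centers)
print(labels)
```

Nested loops like these are slow in plain Python but compile to machine code under Numba, which is the kind of hot path a JIT-optimized clustering routine typically targets.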

This code is part of Stockholms diabetespreventiva program.

Performance improvement

As an example, I tested performance on one of the Heart Disease data sets from UCI. The data set contains 4455 rows, 7 categorical variables, and 5 numerical variables. The timings below compare nicodv's kprototype function with k_prototype_plus.

< nicodv's kprototype >
CPU times: user 2.14 s, sys: 18.2 ms, total: 2.16 s
Wall time: 1min 41s
< k_prototype_plus >
CPU times: user 298 ms, sys: 9.24 ms, total: 308 ms
Wall time: 13.4 s
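For a quick reading of these numbers, the speedup can be computed directly from the timings above. This is plain arithmetic on the reported values, not a new benchmark; the variable names are chosen for illustration.

```python
# Wall-clock and total CPU times copied from the output above, in seconds.
baseline_wall = 1 * 60 + 41   # 1min 41s
kpplus_wall = 13.4
baseline_cpu = 2.16
kpplus_cpu = 0.308

wall_speedup = baseline_wall / kpplus_wall
cpu_speedup = baseline_cpu / kpplus_cpu
print(f"wall-clock speedup: {wall_speedup:.1f}x")  # 7.5x
print(f"CPU-time speedup:   {cpu_speedup:.1f}x")   # 7.0x
```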

Notice: Cao initialization is the only supported initialization method [2].
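For readers unfamiliar with Cao initialization, the following is a simplified sketch of the idea in [2], restricted to categorical columns: the first mode is the densest record, and each further mode maximizes density times distance to its closest already-chosen mode. The function `cao_init` is a hypothetical illustration; how kpplus handles the numeric columns during initialization is not described here.

```python
import numpy as np

def cao_init(X_cat, n_clusters):
    """Pick initial modes from categorical data by density and distance."""
    X_cat = np.asarray(X_cat)
    n_points, n_attrs = X_cat.shape
    # Density of a row: average relative frequency of its attribute values.
    density = np.zeros(n_points)
    for j in range(n_attrs):
        _, inverse, counts = np.unique(
            X_cat[:, j], return_inverse=True, return_counts=True
        )
        density += counts[inverse] / n_points / n_attrs
    # First mode: the densest row.
    chosen = [int(np.argmax(density))]
    # Further modes: maximize density times distance to the closest chosen mode.
    while len(chosen) < n_clusters:
        scores = np.full(n_points, np.inf)
        for idx in chosen:
            mismatches = np.sum(X_cat != X_cat[idx], axis=1)
            scores = np.minimum(scores, mismatches * density)
        chosen.append(int(np.argmax(scores)))
    return X_cat[chosen]

X_cat = [["a", "x"], ["a", "x"], ["a", "y"], ["b", "y"], ["b", "z"]]
modes = cao_init(X_cat, n_clusters=2)
print(modes)
```

Because the procedure is deterministic, repeated runs on the same data start from the same modes, unlike random initialization.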

Installation

pip install kpplus

Usage

from kpplus import KPrototypes_plus

# Initialize the model
model = KPrototypes_plus(n_clusters=3, n_init=4, gamma=None, n_jobs=-1)
# Fit the model to the data, given the categorical columns
model.fit_predict(X=df, categorical=[0, 1])

model.labels_             # cluster label of each row
model.cluster_centroids_  # cluster centroids (prototypes)
model.n_iter_             # number of iterations
model.cost_               # clustering cost

n_clusters: the number of clusters

n_init: the number of parallel runs using different initializations

gamma (optional): a value that controls how strongly the algorithm weights categorical variables. (By default, it is the mean standard deviation of all numeric variables.)

n_jobs (optional, default=-1): the number of parallel processors. ('-1' means using all available processors.)

X: 2-D numpy array (dataset)

types: a numpy array that indicates whether each variable is categorical (1) or numerical (0).

For example: types = [1,1,0,0,0,0] means the first two variables are categorical and the last four variables are numerical.
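The parameter descriptions above can be tied together with a small NumPy sketch: it builds a 0/1 `types` mask from a list of categorical column indices and computes the mean standard deviation of the numeric columns, which is how the default gamma is described. The toy array is made up, and the use of NumPy's default population standard deviation is an assumption; the package's exact convention is not stated.

```python
import numpy as np

# Toy data: columns 0 and 1 hold category codes, columns 2 and 3 are numeric.
X = np.array([[0, 1, 1.0, 10.0],
              [1, 1, 2.0, 20.0],
              [0, 2, 3.0, 30.0]])

categorical = [0, 1]                      # index-list form
types = np.zeros(X.shape[1], dtype=int)   # mask form: 1 = categorical
types[categorical] = 1

numeric_part = X[:, types == 0]
# Mean of the per-column standard deviations (population std, an assumption).
gamma_default = numeric_part.std(axis=0).mean()

print(types)          # [1 1 0 0]
print(gamma_default)
```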

Acknowledgement

I'm extremely grateful to Dr. Diego Yacaman Mendez and Dr. David Ebbevi for their support. They are two brilliant researchers who started this project with excellent knowledge of medical science, epidemiology, statistics and programming.

Reference

[1] Huang Z. Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Knowledge Discovery. 1998;2(3):283-304.

[2] Cao F, Liang J, Bai L. A new initialization method for categorical data clustering. Expert Systems with Applications. 2009;36(7):10223-8.
