A JIT optimized K-Prototype algorithm

KPrototype plus (kpplus)

Description

K-prototype is a clustering method invented to support both categorical and numerical variables [1].
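
As a minimal sketch of the idea, the dissimilarity described by Huang [1] adds the squared Euclidean distance on the numerical variables to gamma times the number of mismatched categorical variables. The function name `mixed_dissimilarity` and its arguments are illustrative assumptions and are not part of kpplus.

```python
import numpy as np

def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma):
    """Illustrative Huang-style distance between one record and one prototype."""
    # Numerical part: squared Euclidean distance.
    diff = np.asarray(x_num, dtype=float) - np.asarray(proto_num, dtype=float)
    numeric_part = np.sum(diff ** 2)
    # Categorical part: number of variables whose categories differ.
    categorical_part = np.sum(np.asarray(x_cat) != np.asarray(proto_cat))
    return float(numeric_part + gamma * categorical_part)

# One numerical difference of 1.0 and one categorical mismatch.
d = mixed_dissimilarity([1.0, 2.0], ["a", "b"], [0.0, 2.0], ["a", "c"], gamma=0.5)
print(d)
```

A larger gamma makes categorical mismatches weigh more heavily relative to numerical differences.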

KPrototype plus (kpplus) is a Python 3 package designed to improve the performance of nicodv's KPrototypes function by using Numba.
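
To illustrate the kind of code that benefits from Numba, here is a hedged sketch of a loop-based nearest-prototype search compiled with `numba.njit`. It is not taken from kpplus; the function `nearest_prototype`, its arguments, and the integer coding of categories are assumptions made for this example. The fallback decorator only keeps the sketch runnable where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs (uncompiled) without Numba.
    def njit(func):
        return func

@njit
def nearest_prototype(x_num, x_cat, protos_num, protos_cat, gamma):
    """Return (index, distance) of the closest prototype for one record."""
    best_k = 0
    best_d = np.inf
    for k in range(protos_num.shape[0]):
        d = 0.0
        # Squared differences on the numerical columns.
        for j in range(x_num.shape[0]):
            diff = x_num[j] - protos_num[k, j]
            d += diff * diff
        # gamma is added once per categorical mismatch.
        for j in range(x_cat.shape[0]):
            if x_cat[j] != protos_cat[k, j]:
                d += gamma
        if d < best_d:
            best_d = d
            best_k = k
    return best_k, best_d

protos_num = np.array([[0.0, 0.0], [5.0, 5.0]])
protos_cat = np.array([[0, 1], [1, 0]])   # categories coded as integers
k, d = nearest_prototype(np.array([4.0, 5.0]), np.array([1, 0]),
                         protos_num, protos_cat, 0.5)
print(k, d)
```

Explicit loops over plain NumPy arrays like these are slow in the interpreter but are the style Numba compiles to machine code.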

This code is part of Stockholms diabetespreventiva program.

Performance improvement

As an example, I tested performance on one of the Heart Disease data sets from UCI. The data set contains 4455 rows, 7 categorical variables, and 5 numerical variables. The timings below compare nicodv's kprototype function with k_prototype_plus.

< nicodv's kprototype >
CPU times: user 2.14 s, sys: 18.2 ms, total: 2.16 s
Wall time: 1min 41s
< k_prototype_plus >
CPU times: user 298 ms, sys: 9.24 ms, total: 308 ms
Wall time: 13.4 s
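
The speedup implied by these timings can be worked out directly. The short calculation below only uses the numbers reported above; the variable names are illustrative.

```python
# Times copied from the timing output above, in seconds.
baseline_wall = 1 * 60 + 41   # nicodv's kprototype: 1min 41s
kpplus_wall = 13.4            # k_prototype_plus
baseline_cpu = 2.16           # total CPU time, nicodv's kprototype
kpplus_cpu = 0.308            # total CPU time, k_prototype_plus

wall_speedup = baseline_wall / kpplus_wall
cpu_speedup = baseline_cpu / kpplus_cpu
print(f"wall-clock speedup: {wall_speedup:.1f}x")
print(f"CPU-time speedup: {cpu_speedup:.1f}x")
```

On this single run the wall-clock time drops by a factor of roughly 7.5.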

Notice: Only Cao initialization is supported as the initialization method [2].
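
The density-based idea behind Cao initialization [2] can be sketched for the categorical columns as follows. This is a simplified illustration under stated assumptions, not kpplus's implementation: `cao_init` is a hypothetical helper, categories are assumed to be integer-coded, and the numerical columns are left out.

```python
import numpy as np

def cao_init(X_cat, n_clusters):
    """Pick initial categorical centroids by density and distance (sketch)."""
    X_cat = np.asarray(X_cat)
    n_points, n_attrs = X_cat.shape

    # Density of a record: average relative frequency of its category values.
    density = np.zeros(n_points)
    for j in range(n_attrs):
        _, inverse, counts = np.unique(
            X_cat[:, j], return_inverse=True, return_counts=True
        )
        density += counts[inverse.reshape(-1)] / n_points / n_attrs

    # First centroid: the densest record.
    chosen = [int(np.argmax(density))]

    # Later centroids: maximize (mismatch count * density) with respect to
    # the closest centroid chosen so far.
    for _ in range(1, n_clusters):
        dd = np.array([(X_cat != X_cat[c]).sum(axis=1) * density for c in chosen])
        chosen.append(int(np.argmax(dd.min(axis=0))))
    return X_cat[chosen]

X = np.array([[0, 0], [0, 0], [0, 1], [1, 1], [1, 1], [1, 1]])
centroids = cao_init(X, n_clusters=2)
print(centroids)
```

Because the procedure is deterministic, repeated runs on the same data start from the same centroids.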

Installation

pip install kpplus

Usage

from kpplus import KPrototypes_plus
model = KPrototypes_plus(n_clusters=3, n_init=4, gamma=None, n_jobs=-1)  # initialize the model
model.fit_predict(X=df, categorical=[0, 1])  # fit the model; columns 0 and 1 are categorical

model.labels_                          # cluster label of each row
model.cluster_centroids_               # cluster centroids (prototypes)
model.n_iter_                          # number of iterations
model.cost_                            # clustering cost

n_clusters: the number of clusters

n_init: the number of runs with different initializations, executed in parallel

gamma (optional): A weight that controls how strongly the algorithm favours categorical variables. By default, it is the mean standard deviation of all numerical variables.

n_jobs (optional, default=-1): The number of processors used in parallel. '-1' means using all available processors.

X: 2-D numpy array (dataset)

types: A numpy array that indicates whether each variable is categorical (1) or numerical (0).

For example: types = [1,1,0,0,0,0] means the first two variables are categorical and the last four variables are numerical.
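
As a small sketch of how these inputs relate, the snippet below turns a `types` mask into column indices and computes a default gamma as the mean standard deviation of the numerical columns. The toy array `X`, the variable names, and the use of the population standard deviation (NumPy's default, `ddof=0`) are assumptions; kpplus may compute the value differently.

```python
import numpy as np

types = np.array([1, 1, 0, 0, 0, 0])           # 1 = categorical, 0 = numerical
categorical_idx = np.flatnonzero(types == 1)   # column positions of categorical variables
numerical_idx = np.flatnonzero(types == 0)     # column positions of numerical variables

# Toy data: two integer-coded categorical columns, then four numerical ones.
X = np.array([
    [0, 1, 1.0, 10.0, 0.0, 2.0],
    [1, 0, 3.0, 10.0, 4.0, 2.0],
    [0, 1, 5.0, 10.0, 8.0, 2.0],
])

# Mean of the per-column standard deviations of the numerical variables.
default_gamma = X[:, numerical_idx].astype(float).std(axis=0).mean()
print(categorical_idx.tolist(), numerical_idx.tolist(), default_gamma)
```

Here the mask `[1,1,0,0,0,0]` corresponds to the categorical column indices `[0, 1]`.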

Acknowledgement

I'm extremely grateful to Dr. Diego Yacaman Mendez and Dr. David Ebbevi for their support. They are two brilliant researchers who started this project with excellent knowledge of medical science, epidemiology, statistics and programming.

Reference

[1] Huang Z. Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Knowledge Discovery. 1998;2(3):283-304.

[2] Cao F, Liang J, Bai L. A new initialization method for categorical data clustering. Expert Systems with Applications. 2009;36(7):10223-8.

Download files

Source Distribution

kpplus-0.0.3.tar.gz (6.4 kB view details)

Uploaded Source

Built Distribution

kpplus-0.0.3-py3-none-any.whl (6.9 kB view details)

Uploaded Python 3

File details

Details for the file kpplus-0.0.3.tar.gz.

File metadata

  • Download URL: kpplus-0.0.3.tar.gz
  • Size: 6.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.5.0 importlib_metadata/4.8.2 pkginfo/1.5.0.1 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.8.3

File hashes

Hashes for kpplus-0.0.3.tar.gz
Algorithm Hash digest
SHA256 e61da60bdb39d544afe16df8d4cb5ef076b84d2f42861a09c08ed7ae56e82c2a
MD5 f7d5db2bd686ce7c6282a5f661eef8b5
BLAKE2b-256 a25cdf60622dab8168d875947c28cee33c63e72f47c6559af6baccdabac5c97f

File details

Details for the file kpplus-0.0.3-py3-none-any.whl.

File metadata

  • Download URL: kpplus-0.0.3-py3-none-any.whl
  • Size: 6.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.5.0 importlib_metadata/4.8.2 pkginfo/1.5.0.1 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.8.3

File hashes

Hashes for kpplus-0.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 f8fe8467fa321a29aafa453b7e21bfc8caf56418fa7f7596df181857c249c670
MD5 71c9dc580027224e4fa2043a54136d06
BLAKE2b-256 b26f02e180bbe501277c31c877009d73853f6c7022efe036e4847f4cdb522699
