Python implementations of the k-modes and k-prototypes clustering algorithms for clustering categorical data.
kmodes
Description
Python implementations of the k-modes and k-prototypes clustering algorithms. Relies on numpy for a lot of the heavy lifting.
k-modes is used for clustering categorical variables. It defines clusters based on the number of matching categories between data points. (This is in contrast to the more well-known k-means algorithm, which clusters numerical data based on Euclidean distance.) The k-prototypes algorithm combines k-modes and k-means and is able to cluster mixed numerical / categorical data.
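As a minimal sketch of the idea above (not the package's actual code), the dissimilarity between two categorical records can be taken as the number of features on which they differ, and a cluster can be summarized by the most frequent category of each feature. The helper names `matching_dissimilarity` and `column_modes` are hypothetical and only serve this illustration.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Count the positions at which two categorical records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of features")
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Return the most frequent category of each feature (the cluster 'mode')."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*records)]


# Three toy records with three categorical features each.
cluster = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "small", "round"),
]

mode = column_modes(cluster)
distances = [matching_dissimilarity(record, mode) for record in cluster]

print(mode)
print(distances)
```

A record that agrees with the mode on every feature has dissimilarity 0; each mismatching feature adds 1.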
Implemented are:
 k-modes [HUANG97] [HUANG98]
 k-modes with initialization based on density [CAO09]
 k-prototypes [HUANG97]
The code is modeled after the clustering algorithms in scikit-learn and has the same familiar interface.
I would love to have more people play around with this and give me feedback on my implementation. If you come across any issues in running or installing kmodes, please submit a bug report.
Enjoy!
Installation
kmodes can be installed using pip:
pip install kmodes
To upgrade to the latest version (recommended), run it like this:
pip install --upgrade kmodes
Alternatively, you can build the latest development version from source:
git clone https://github.com/nicodv/kmodes.git
cd kmodes
python setup.py install
Usage
import numpy as np
from kmodes.kmodes import KModes

# random categorical data
data = np.random.choice(20, (100, 10))

km = KModes(n_clusters=4, init='Huang', n_init=5, verbose=1)

clusters = km.fit_predict(data)

# Print the cluster centroids
print(km.cluster_centroids_)
The examples directory showcases simple use cases of both k-modes (‘soybean.py’) and k-prototypes (‘stocks.py’).
Missing / unseen data
The k-modes algorithm accepts np.NaN values as missing values in the X matrix. However, users are strongly advised to consider filling in the missing data themselves in a way that makes sense for the problem at hand. This is especially important when many values are missing.
The k-modes algorithm currently handles missing data as follows. When fitting the model, np.NaN values are encoded into their own category (let’s call it “unknown values”). When predicting, the model treats any values in X that (1) it has not seen before during training, or (2) are missing, as being a member of the “unknown values” category. Simply put, the algorithm treats any missing / unseen data as matching with each other but mismatching with non-missing / seen data when determining similarity between points.
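The behaviour described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the package's internal encoder: the `UNKNOWN` code and the helpers `is_missing`, `fit_codes` and `encode` are hypothetical names made up for this example.

```python
import math

UNKNOWN = -1  # hypothetical code for the shared "unknown values" category


def is_missing(value):
    """True for None and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fit_codes(column):
    """Map each non-missing category seen during fitting to an integer code."""
    codes = {}
    for value in column:
        if not is_missing(value) and value not in codes:
            codes[value] = len(codes)
    return codes


def encode(column, codes):
    """Encode a column; missing and unseen values share the UNKNOWN code."""
    return [UNKNOWN if is_missing(v) else codes.get(v, UNKNOWN) for v in column]


train = ["a", "b", float("nan"), "a"]
codes = fit_codes(train)

train_encoded = encode(train, codes)
# "z" was never seen during fitting and NaN is missing: both become UNKNOWN.
new_encoded = encode(["b", "z", float("nan")], codes)

print(train_encoded)
print(new_encoded)
```

Because unseen and missing values end up with the same code, they match each other and mismatch every category that was seen during fitting.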
The k-prototypes algorithm also accepts np.NaN values as missing values for the categorical variables, but does not accept missing values for the numerical variables. It is up to the user to come up with a way of handling these missing data that is appropriate for the problem at hand.
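One possible way to prepare numerical columns before fitting is median imputation, sketched below. Whether this is appropriate depends entirely on the data; it is shown only as an example of handling the missing values yourself. The function `impute_numeric_median` is a hypothetical helper, not part of the package.

```python
import math
from statistics import median


def impute_numeric_median(rows, numeric_columns):
    """Return a copy of rows with NaN in the given columns replaced by the column median."""
    filled = [list(row) for row in rows]
    for col in numeric_columns:
        observed = [row[col] for row in rows if not math.isnan(row[col])]
        if not observed:
            raise ValueError(f"column {col} has no observed values")
        fill_value = median(observed)
        for row in filled:
            if math.isnan(row[col]):
                row[col] = fill_value
    return filled


# Columns 0 and 2 are numerical, column 1 is categorical.
rows = [
    [1.0, "x", 10.0],
    [float("nan"), "y", 30.0],
    [3.0, "x", float("nan")],
]

clean = impute_numeric_median(rows, numeric_columns=[0, 2])
print(clean)
```

The categorical column is left untouched, since missing categorical values are accepted as described above.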
Parallel execution
The k-modes and k-prototypes implementations both offer support for multiprocessing via the joblib library, similar to e.g. scikit-learn’s implementation of k-means, using the n_jobs parameter. It generally does not make sense to set more jobs than there are processor cores available on your system.
This potentially speeds up any execution with more than one initialization try (n_init > 1), which may be helpful to reduce the execution time for larger problems. Note that it depends on your problem whether multiprocessing actually helps, so be sure to try that out first. You can check out the examples for some benchmarks.
FAQ
Q: I’m seeing errors such as TypeError: '<' not supported between instances of 'str' and 'float' when using the k-prototypes algorithm.
A: One or more of your numerical feature columns have string values in them. Make sure that all columns have consistent data types.
Q: How does k-prototypes know which of my features are numerical and which are categorical?
A: You tell it which column indices are categorical using the categorical argument. All others are assumed numerical. E.g., clusters = KPrototypes().fit_predict(X, categorical=[1, 2])
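To illustrate how the categorical indices split a record, here is a minimal sketch of a mixed dissimilarity in the spirit of [HUANG97]: squared Euclidean distance on the numerical columns plus a weight times the number of mismatches on the categorical columns. The function name `mixed_dissimilarity` and the `gamma` value used here are assumptions for this example and do not reflect the package's internals or defaults.

```python
def mixed_dissimilarity(a, b, categorical, gamma=1.0):
    """Squared Euclidean distance on numerical columns plus gamma times
    the number of mismatches on categorical columns."""
    categorical = set(categorical)
    numeric_part = sum(
        (float(x) - float(y)) ** 2
        for i, (x, y) in enumerate(zip(a, b))
        if i not in categorical
    )
    categorical_part = sum(
        x != y for i, (x, y) in enumerate(zip(a, b)) if i in categorical
    )
    return numeric_part + gamma * categorical_part


# Columns 1 and 2 are categorical; columns 0 and 3 are treated as numerical.
a = [1.0, "red", "small", 4.0]
b = [2.0, "red", "large", 6.0]

d = mixed_dissimilarity(a, b, categorical=[1, 2], gamma=0.5)
print(d)
```

A string left in a column that is not listed as categorical would fail in the numerical part, which is the same kind of data type inconsistency the first question above describes.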
References
[HUANG97]  Huang, Z.: Clustering large data sets with mixed numeric and categorical values, Proceedings of the First Pacific Asia Knowledge Discovery and Data Mining Conference, Singapore, pp. 21-34, 1997.
[HUANG98]  Huang, Z.: Extensions to the k-modes algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery 2(3), pp. 283-304, 1998.
[CAO09]  Cao, F., Liang, J., Bai, L.: A new initialization method for categorical data clustering, Expert Systems with Applications 36(7), pp. 10223-10228, 2009.