Python implementations of the k-modes and k-prototypes clustering algorithms for clustering categorical data.
kmodes
Description
Python implementations of the k-modes and k-prototypes clustering algorithms. Relies on numpy for a lot of the heavy lifting.
k-modes is used for clustering categorical variables. It defines clusters based on the number of matching categories between data points. (This is in contrast to the more well-known k-means algorithm, which clusters numerical data based on Euclidean distance.) The k-prototypes algorithm combines k-modes and k-means and is able to cluster mixed numerical / categorical data.
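As a rough illustration of the "number of matching categories" idea, the sketch below counts mismatching features between two records and takes the per-column mode of a small group. The function names `matching_dissimilarity` and `column_modes` are made up for this example and are not claimed to be part of the kmodes API; the library's own implementation may differ.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Count the positions at which two categorical records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of features")
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Return the most frequent category of each column (a cluster 'mode')."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*records)]


# Three toy records with three categorical features each.
records = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "small", "round"),
]

mode = column_modes(records)
distances = [matching_dissimilarity(record, mode) for record in records]
print(mode, distances)
```

A record that agrees with the mode on every feature has dissimilarity 0, and each disagreeing feature adds 1.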
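Implemented are:
k-modes (Huang, 1997; Huang, 1998)
k-modes with initialization based on density (Cao et al., 2009)
k-prototypes (Huang, 1997)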
The code is modeled after the clustering algorithms in scikit-learn and has the same familiar interface.
I would love to have more people play around with this and give me feedback on my implementation. If you come across any issues in running or installing kmodes, please submit a bug report.
Enjoy!
Installation
kmodes can be installed using pip:
pip install kmodes
To upgrade to the latest version (recommended), run it like this:
pip install --upgrade kmodes
kmodes can also conveniently be installed with conda from the conda-forge channel:
conda install -c conda-forge kmodes
Alternatively, you can build the latest development version from source:
git clone https://github.com/nicodv/kmodes.git
cd kmodes
python setup.py install
Usage
import numpy as np
from kmodes.kmodes import KModes
# random categorical data
data = np.random.choice(20, (100, 10))
km = KModes(n_clusters=4, init='Huang', n_init=5, verbose=1)
clusters = km.fit_predict(data)
# Print the cluster centroids
print(km.cluster_centroids_)
The examples directory showcases simple use cases of both k-modes (‘soybean.py’) and k-prototypes (‘stocks.py’).
Parallel execution
The k-modes and k-prototypes implementations both offer support for multiprocessing via the joblib library, similar to e.g. scikit-learn’s implementation of k-means, using the n_jobs parameter. It generally does not make sense to set more jobs than there are processor cores available on your system.
This potentially speeds up any execution with more than one initialization try, n_init > 1, which may be helpful to reduce the execution time for larger problems. Note that it depends on your problem whether multiprocessing actually helps, so be sure to try that out first. You can check out the examples for some benchmarks.
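One way to pick a value for n_jobs is to cap it at both the core count and the number of initialization tries. The helper below is a hypothetical sketch (`suggested_n_jobs` is not part of kmodes), and it assumes the parallelism is spread over the n_init tries, as the text above suggests.

```python
import os


def suggested_n_jobs(n_init, available_cores=None):
    """Cap the number of parallel jobs at both n_init and the core count."""
    if n_init < 1:
        raise ValueError("n_init must be at least 1")
    # os.cpu_count() can return None, so fall back to a single core.
    cores = available_cores or os.cpu_count() or 1
    return max(1, min(n_init, cores))


n_jobs = suggested_n_jobs(n_init=5)
print(n_jobs)

# The result could then be passed along, for example:
# KModes(n_clusters=4, init='Huang', n_init=5, n_jobs=n_jobs)
```

Whether this actually shortens the run still depends on the problem size, so timing it with n_jobs=1 for comparison is worthwhile.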
FAQ
Q: I’m seeing errors such as “TypeError: ‘<’ not supported between instances of ‘str’ and ‘float’” when using the k-prototypes algorithm.
A: One or more of your numerical feature columns have string values in them. Make sure that all columns have consistent data types.
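A quick way to track down the offending values is to scan the columns you intend to treat as numerical for anything that is not a number. The helper `find_non_numeric` below is a hypothetical sketch in plain Python and is not part of kmodes.

```python
from numbers import Number


def find_non_numeric(rows, numerical_columns):
    """Map each numerical column index to the row indices holding non-numbers."""
    problems = {}
    for col in numerical_columns:
        bad_rows = [
            i for i, row in enumerate(rows)
            if not isinstance(row[col], Number)
        ]
        if bad_rows:
            problems[col] = bad_rows
    return problems


# Columns 0 and 2 are meant to be numerical; column 1 is categorical.
rows = [
    [1.5, "a", 10],
    ["n/a", "b", 12],
    [2.0, "a", "?"],
]

problems = find_non_numeric(rows, numerical_columns=[0, 2])
print(problems)
```

Each reported cell can then be converted, imputed, or dropped before clustering.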
Q: How does k-prototypes know which of my features are numerical and which are categorical?
A: You tell it which column indices are categorical using the categorical argument. All others are assumed numerical. E.g., clusters = KPrototypes().fit_predict(X, categorical=[1, 2])
Q: I’m getting the following error, what gives? “ModuleNotFoundError: No module named ‘kmodes.kmodes’; ‘kmodes’ is not a package”.
A: Make sure your working file is not called ‘kmodes.py’, because it might shadow the kmodes package.
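To see which file Python would actually pick up for a given module name, you can ask the import system directly. This is a small sketch using the standard library; `module_origin` is a name chosen for the example.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# If this prints the path of your own script instead of a path inside
# site-packages, a local file is shadowing the installed package.
print(module_origin("kmodes"))
```

Running it from the same directory as your script matters, since that directory is searched first.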
Q: I’m getting the following error: “ValueError: Clustering algorithm could not initialize. Consider assigning the initial clusters manually.”
A: This is a feature, not a bug. kmodes is telling you that it can’t make sense of the data you are presenting it, at least not with the parameters you have set. It is up to you, the data scientist, to figure out why. Some hints to possible solutions:
Run with fewer clusters as the data might not support a large number of clusters
Explore and visualize your data, checking for weird distributions, outliers, etc.
Clean and normalize the data
Increase the ratio of rows to columns
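A few of these hints can be checked with a quick summary of the data before clustering. The sketch below is only a diagnostic under simple assumptions, not the check kmodes itself performs; `summarize_for_clustering` is a hypothetical helper.

```python
def summarize_for_clustering(rows, n_clusters):
    """Report basic shape statistics that hint at whether n_clusters is plausible."""
    n_rows = len(rows)
    n_columns = len(rows[0]) if rows else 0
    n_distinct = len({tuple(row) for row in rows})
    return {
        "n_rows": n_rows,
        "n_columns": n_columns,
        "n_distinct_rows": n_distinct,
        "rows_per_column": n_rows / n_columns if n_columns else 0.0,
        # Fewer distinct rows than clusters means the data cannot
        # support that many distinct clusters.
        "enough_distinct_rows": n_distinct >= n_clusters,
    }


rows = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "x")]
summary = summarize_for_clustering(rows, n_clusters=4)
print(summary)
```

A low number of distinct rows or a low rows-per-column ratio suggests trying fewer clusters or collecting more data.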
Q: I’m getting the following error: “ValueError: Input contains NaN, infinity, or a value too large for dtype(‘float64’).”
A: Following scikit-learn, the k-modes algorithm does not accept NaN (np.nan) values in the X matrix. Fill in the missing data in a way that makes sense for the problem at hand.
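For categorical columns, two common choices are to replace missing entries with the column's most frequent category or with an explicit placeholder category. The sketch below shows both in plain Python; the helper names are made up for this example, and which strategy is appropriate depends on your data.

```python
import math
from collections import Counter


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fill_missing_categorical(column, placeholder=None):
    """Replace missing entries with a placeholder, or the column mode if none is given."""
    observed = [v for v in column if not is_missing(v)]
    if placeholder is None:
        if not observed:
            raise ValueError("column has no observed values to take a mode from")
        placeholder = Counter(observed).most_common(1)[0][0]
    return [placeholder if is_missing(v) else v for v in column]


colours = ["red", None, "blue", "red", float("nan")]
filled_with_mode = fill_missing_categorical(colours)
filled_with_label = fill_missing_categorical(colours, placeholder="missing")
print(filled_with_mode)
print(filled_with_label)
```

Using an explicit placeholder keeps the fact that a value was missing visible to the clustering, whereas the mode blends it into the majority category.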
Q: How would you like your library to be cited?
A: Something along these lines would do nicely:
@Misc{devos2015,
author = {Nelis J. de Vos},
title = {kmodes categorical clustering library},
howpublished = {\url{https://github.com/nicodv/kmodes}},
year = {2015--2021}
}
References
Huang, Z.: Clustering large data sets with mixed numeric and categorical values, Proceedings of the First Pacific Asia Knowledge Discovery and Data Mining Conference, Singapore, pp. 21-34, 1997.
Huang, Z.: Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery 2(3), pp. 283-304, 1998.
Cao, F., Liang, J., Bai, L.: A new initialization method for categorical data clustering, Expert Systems with Applications 36(7), pp. 10223-10228, 2009.