Python implementations of the k-modes and k-prototypes clustering algorithms for clustering categorical data. Relies on NumPy for a lot of the heavy lifting.
k-modes is used for clustering categorical variables. It defines clusters based on the number of matching categories between data points. (This is in contrast to the more well-known k-means algorithm, which clusters numerical data based on Euclidean distance.) The k-prototypes algorithm combines k-modes and k-means and is able to cluster mixed numerical / categorical data.
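As a rough illustration of the ideas above (a plain-Python sketch, not the library's actual implementation), the snippet below counts mismatching categories between two records, computes the per-attribute mode of a small cluster, and combines a squared Euclidean term with a weighted mismatch count in the spirit of k-prototypes. The function names and the toy data are made up for this example; `gamma` is the weight on the categorical part, following Huang's formulation.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records disagree."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Return the most frequent category of each attribute (the cluster 'mode')."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*records)]


def mixed_dissimilarity(num_a, num_b, cat_a, cat_b, gamma=1.0):
    """Squared Euclidean distance on the numerical part plus gamma times mismatches."""
    numeric = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    return numeric + gamma * matching_dissimilarity(cat_a, cat_b)


# Toy cluster of three categorical records (illustrative data only)
cluster = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "small", "round"),
]
mode = column_modes(cluster)
print(mode)
print(matching_dissimilarity(("blue", "large", "round"), mode))
print(mixed_dissimilarity((1.0, 2.0), (2.0, 4.0), ("red",), ("blue",), gamma=0.5))
```

A larger `gamma` makes the categorical attributes count more heavily relative to the numerical ones.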
- k-modes [HUANG97] [HUANG98]
- k-modes with initialization based on density [CAO09]
- k-prototypes [HUANG97]
The code is modeled after the clustering algorithms in scikit-learn and has the same familiar interface.
I would love to have more people play around with this and give me feedback on my implementation. If you come across any issues in running or installing kmodes, please submit a bug report.
kmodes can be installed using pip:
pip install kmodes
To upgrade to the latest version (recommended), run it like this:
pip install --upgrade kmodes
Alternatively, you can build the latest development version from source:
git clone https://github.com/nicodv/kmodes.git
cd kmodes
python setup.py install
    import numpy as np
    from kmodes.kmodes import KModes

    # random categorical data
    data = np.random.choice(20, (100, 10))

    km = KModes(n_clusters=4, init='Huang', n_init=5, verbose=1)

    clusters = km.fit_predict(data)

    # Print the cluster centroids
    print(km.cluster_centroids_)
The examples directory showcases simple use cases of both k-modes (‘soybean.py’) and k-prototypes (‘stocks.py’).
Missing / unseen data
The k-modes algorithm accepts np.NaN values as missing values in the X matrix. However, users are strongly advised to consider filling in the missing data themselves in a way that makes sense for the problem at hand. This is especially important in case of many missing values.
The k-modes algorithm currently handles missing data as follows. When fitting the model, np.NaN values are encoded into their own category (let’s call it “unknown values”). When predicting, the model treats any values in X that (1) it has not seen before during training, or (2) are missing, as being a member of the “unknown values” category. Simply put, the algorithm treats any missing / unseen data as matching with each other but mismatching with non-missing / seen data when determining similarity between points.
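The behaviour described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the encoding kmodes uses internally: the `UNKNOWN` code of -1 and the helper names `fit_encoder` and `encode` are hypothetical.

```python
import math

UNKNOWN = -1  # hypothetical code standing in for the "unknown values" category


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fit_encoder(column):
    """Map each non-missing category seen during fitting to an integer code."""
    mapping = {}
    for value in column:
        if not is_missing(value) and value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def encode(column, mapping):
    """Encode a column; missing and unseen values both become UNKNOWN."""
    return [UNKNOWN if is_missing(v) else mapping.get(v, UNKNOWN) for v in column]


train = ["a", "b", float("nan"), "a"]
mapping = fit_encoder(train)
print(encode(train, mapping))
print(encode(["b", "c", float("nan")], mapping))
```

Because the unseen value "c" and the missing value share one code, they compare as equal to each other and as different from every category seen during fitting.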
The k-prototypes algorithm also accepts np.NaN values as missing values for the categorical variables, but does not accept missing values for the numerical variables. It is up to the user to come up with a way of handling these missing data that is appropriate for the problem at hand.
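One possible way to deal with missing numerical values before clustering is median imputation, sketched below in plain Python. It is only one option among many and may not suit every data set; `impute_median` is a hypothetical helper written for this example and is not part of kmodes.

```python
import math
from statistics import median


def impute_median(rows, numerical_columns):
    """Return a copy of rows with NaN in the given columns replaced by the column median."""
    filled = [list(row) for row in rows]
    for col in numerical_columns:
        # Median of the observed (non-NaN) values; raises if a column is entirely missing
        observed = [row[col] for row in rows if not math.isnan(row[col])]
        col_median = median(observed)
        for row in filled:
            if math.isnan(row[col]):
                row[col] = col_median
    return filled


# Toy mixed data: column 0 is numerical, column 1 is categorical
rows = [
    [1.0, "x"],
    [float("nan"), "y"],
    [3.0, "x"],
]
print(impute_median(rows, numerical_columns=[0]))
```

The original rows are left unchanged, so the imputed copy can be compared against the raw data before it is passed on for clustering.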
[HUANG97] Huang, Z.: Clustering large data sets with mixed numeric and categorical values, Proceedings of the First Pacific Asia Knowledge Discovery and Data Mining Conference, Singapore, pp. 21-34, 1997.
[HUANG98] Huang, Z.: Extensions to the k-modes algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery 2(3), pp. 283-304, 1998.
[CAO09] Cao, F., Liang, J., Bai, L.: A new initialization method for categorical data clustering, Expert Systems with Applications 36(7), pp. 10223-10228, 2009.