Python implementations of the k-modes and k-prototypes clustering algorithms for clustering categorical data.

## kmodes

### Description

Python implementations of the k-modes and k-prototypes clustering algorithms. They rely on NumPy for much of the heavy lifting.

k-modes is used for clustering categorical variables. It defines clusters based on the number of matching categories between data points. (This is in contrast to the better-known k-means algorithm, which clusters numerical data based on Euclidean distance.) The k-prototypes algorithm combines k-modes and k-means and can cluster mixed numerical / categorical data.
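As a rough illustration of the measures described above, here is a minimal pure-Python sketch. It is not the library's implementation: the function names are hypothetical, and the weight `gamma` is an assumed parameter that balances the numerical and categorical parts of the mixed measure.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records disagree."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Return the most frequent category of each attribute (a cluster 'mode')."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*records)]


def mixed_dissimilarity(num_a, cat_a, num_b, cat_b, gamma=1.0):
    """Squared Euclidean distance on the numerical parts plus
    gamma-weighted mismatches on the categorical parts (gamma is assumed)."""
    numeric = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    return numeric + gamma * matching_dissimilarity(cat_a, cat_b)


cluster = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "small", "round"),
]
mode = column_modes(cluster)
mismatches = matching_dissimilarity(cluster[1], mode)
mixed = mixed_dissimilarity(
    [1.0, 2.0], ("red", "small"), [2.0, 4.0], ("red", "large"), gamma=0.5
)
print(mode, mismatches, mixed)
```

In this toy cluster, the mode plays the role that the mean plays in k-means: it is the representative that each record is compared against.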

Implemented are:

- k-modes [HUANG97] [HUANG98]
- k-modes with initialization based on density [CAO09]
- k-prototypes [HUANG97]

The code is modeled after the clustering algorithms in `scikit-learn` and has the same familiar interface.

I would love to have more people play around with this and give me feedback on my implementation. If you come across any issues in running or installing kmodes, please submit a bug report.

Enjoy!

### Installation

kmodes can be installed using pip:

```
pip install kmodes
```

To upgrade to the latest version (recommended), run:

```
pip install --upgrade kmodes
```

Alternatively, you can build the latest development version from source:

```
git clone https://github.com/nicodv/kmodes.git
cd kmodes
python setup.py install
```

### Usage

```python
import numpy as np
from kmodes.kmodes import KModes

# random categorical data
data = np.random.choice(20, (100, 10))

km = KModes(n_clusters=4, init='Huang', n_init=5, verbose=1)
clusters = km.fit_predict(data)

# Print the cluster centroids
print(km.cluster_centroids_)
```

The examples directory showcases simple use cases of both k-modes (`soybean.py`) and k-prototypes (`stocks.py`).

#### Missing / unseen data

The k-modes algorithm accepts `np.nan` values as missing values in the `X` matrix. However, users are strongly advised to consider filling in the missing data themselves in a way that makes sense for the problem at hand. This is especially important when many values are missing.

The k-modes algorithm currently handles missing data as follows. When fitting the model, `np.nan` values are encoded into their own category (let's call it "unknown values"). When predicting, the model treats any values in `X` that (1) it has not seen before during training, or (2) are missing, as members of the "unknown values" category. Simply put, the algorithm treats all missing / unseen data as matching each other but mismatching non-missing / seen data when determining similarity between points.
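The behavior described above can be sketched in plain Python. This is only an illustration of the stated semantics, not the library's actual encoding code; the `UNKNOWN` code and the helper names are hypothetical.

```python
import math

UNKNOWN = -1  # hypothetical code for the shared "unknown values" category


def is_missing(value):
    """True for a float NaN, which is what np.nan is to plain Python."""
    return isinstance(value, float) and math.isnan(value)


def fit_encoding(column):
    """Map each non-missing category seen during fitting to an integer code."""
    mapping = {}
    for value in column:
        if not is_missing(value) and value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def encode(column, mapping):
    """Encode a column; missing and unseen values all fall into UNKNOWN."""
    return [UNKNOWN if is_missing(v) else mapping.get(v, UNKNOWN) for v in column]


train = ["a", "b", float("nan"), "a"]
mapping = fit_encoding(train)
encoded_train = encode(train, mapping)
# "c" was never seen during fitting, so it shares a code with the NaN entry
encoded_new = encode(["b", "c", float("nan")], mapping)
print(encoded_train, encoded_new)
```

Because unseen and missing values share a single code, they match each other and mismatch every known category, which is the similarity behavior the text describes.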

The k-prototypes algorithm also accepts `np.nan` values as missing values for the categorical variables, but does *not* accept missing values for the numerical variables. It is up to the user to handle missing numerical data in a way that is appropriate for the problem at hand.
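One possible way to prepare numerical columns before clustering is to fill the gaps with a summary statistic. The sketch below uses median imputation purely as an assumed example strategy; the helper name `impute_numeric` is hypothetical, and whether the median is a sensible fill value depends on the data.

```python
import math
from statistics import median


def impute_numeric(column):
    """Replace NaN entries in a numerical column with the median of the observed values."""
    observed = [v for v in column if not math.isnan(v)]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    fill = median(observed)
    return [fill if math.isnan(v) else v for v in column]


prices = [10.0, float("nan"), 14.0, 12.0]
filled = impute_numeric(prices)
print(filled)
```

Other strategies, such as dropping incomplete rows or using a model-based imputation, may suit a given problem better.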

### References

[HUANG97] Huang, Z.: Clustering large data sets with mixed numeric and categorical values, Proceedings of the First Pacific Asia Knowledge Discovery and Data Mining Conference, Singapore, pp. 21-34, 1997.

[HUANG98] Huang, Z.: Extensions to the k-modes algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery 2(3), pp. 283-304, 1998.

[CAO09] Cao, F., Liang, J., Bai, L.: A new initialization method for categorical data clustering, Expert Systems with Applications 36(7), pp. 10223-10228, 2009.

## Download files

| Filename | Size | File type | Python version | Upload date |
|---|---|---|---|---|
| kmodes-0.9-py2.py3-none-any.whl | 15.7 kB | Wheel | py2.py3 | May 2, 2018 |
| kmodes-0.9.tar.gz | 12.4 kB | Source | None | May 2, 2018 |