
Python implementations of the k-modes and k-prototypes clustering algorithms for clustering categorical data.


kmodes

Description

Python implementations of the k-modes and k-prototypes clustering algorithms. Relies on numpy for a lot of the heavy lifting.

k-modes is used for clustering categorical variables. It defines clusters based on the number of matching categories between data points. (This is in contrast to the more well-known k-means algorithm, which clusters numerical data based on Euclidean distance.) The k-prototypes algorithm combines k-modes and k-means and is able to cluster mixed numerical / categorical data.
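To make the idea of "number of matching categories" concrete, here is a minimal sketch of a matching dissimilarity and of a per-column mode. The function names `matching_dissimilarity` and `column_modes` are made up for this illustration and are not the library's internal API.

```python
import numpy as np


def matching_dissimilarity(a, b):
    """Count the positions at which two categorical records differ."""
    a = np.asarray(a)
    b = np.asarray(b)
    return int(np.sum(a != b))


def column_modes(rows):
    """Return the most frequent category of each column."""
    rows = np.asarray(rows)
    modes = []
    for col in rows.T:
        values, counts = np.unique(col, return_counts=True)
        modes.append(values[np.argmax(counts)])
    return np.array(modes)


x = ["red", "small", "round"]
y = ["red", "large", "round"]
print(matching_dissimilarity(x, y))  # 1 (only the second attribute differs)
```

A cluster's mode plays the role that the mean plays in k-means: it is the record that summarizes the categories most common among the cluster's members.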

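Implemented are:

  • k-modes [HUANG97] [HUANG98]
  • k-modes with initialization based on density [CAO09]
  • k-prototypes [HUANG97]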

The code is modeled after the clustering algorithms in scikit-learn and has the same familiar interface.

I would love to have more people play around with this and give me feedback on my implementation. If you come across any issues in running or installing kmodes, please submit a bug report.

Enjoy!

Installation

kmodes can be installed using pip:

pip install kmodes

To upgrade to the latest version (recommended), run it like this:

pip install --upgrade kmodes

Alternatively, you can build the latest development version from source:

git clone https://github.com/nicodv/kmodes.git
cd kmodes
python setup.py install

Usage

import numpy as np
from kmodes.kmodes import KModes

# random categorical data
data = np.random.choice(20, (100, 10))

km = KModes(n_clusters=4, init='Huang', n_init=5, verbose=1)

clusters = km.fit_predict(data)

# Print the cluster centroids
print(km.cluster_centroids_)

The examples directory showcases simple use cases of both k-modes (‘soybean.py’) and k-prototypes (‘stocks.py’).

Missing / unseen data

The k-modes algorithm accepts np.nan values as missing values in the X matrix. However, users are strongly encouraged to fill in missing data themselves in a way that makes sense for the problem at hand. This is especially important when many values are missing.

The k-modes algorithm currently handles missing data as follows. When fitting the model, np.nan values are encoded into their own category (let’s call it “unknown values”). When predicting, the model treats any values in X that (1) it has not seen during training, or (2) are missing, as members of the “unknown values” category. Simply put, when determining similarity between points, the algorithm treats all missing / unseen values as matching each other but mismatching any non-missing / seen value.
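The encoding behaviour described above can be sketched roughly as follows. This is an illustration only, not the library's code: the helpers `is_missing`, `fit_encoding` and `encode`, and the choice of -1 as the code for the “unknown values” category, are assumptions made for the example.

```python
import numpy as np

UNKNOWN = -1  # assumed code for the "unknown values" category


def is_missing(value):
    """True for float NaN values, False for everything else."""
    return isinstance(value, float) and np.isnan(value)


def fit_encoding(column):
    """Map each non-missing training value to an integer code."""
    seen = sorted({v for v in column if not is_missing(v)})
    return {value: code for code, value in enumerate(seen)}


def encode(column, mapping):
    """Encode a column; missing and unseen values both become UNKNOWN."""
    return np.array([
        UNKNOWN if is_missing(v) else mapping.get(v, UNKNOWN)
        for v in column
    ])


mapping = fit_encoding(["a", "b", np.nan, "a"])
print(encode(["b", "c", np.nan], mapping))  # [ 1 -1 -1]
```

In this sketch the unseen value "c" and the missing value receive the same code, so they match each other but not the seen value "b".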

The k-prototypes algorithm also accepts np.nan values as missing values for the categorical variables, but does not accept missing values for the numerical variables. It is up to the user to handle these missing data in a way that is appropriate for the problem at hand.
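As one possible way of filling in numerical values before fitting, the sketch below replaces each missing entry with the median of its column. Median imputation is only an example choice, and the helper name `impute_numerical_median` is hypothetical; another strategy may suit your data better.

```python
import numpy as np


def impute_numerical_median(X_num):
    """Return a copy of X_num with NaNs replaced by the column median."""
    X_num = np.array(X_num, dtype=float)  # np.array copies the input
    medians = np.nanmedian(X_num, axis=0)  # per-column median, ignoring NaNs
    rows, cols = np.where(np.isnan(X_num))
    X_num[rows, cols] = medians[cols]
    return X_num


X_num = [[1.0, 10.0],
         [np.nan, 20.0],
         [3.0, np.nan]]
print(impute_numerical_median(X_num))
```

A column that consists entirely of NaN has no median, so it would remain NaN and needs separate handling.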

Parallel execution

The k-modes and k-prototypes implementations both offer support for multiprocessing via the joblib library, similar to e.g. scikit-learn’s implementation of k-means, using the n_jobs parameter. It generally does not make sense to set more jobs than there are processor cores available on your system.

This potentially speeds up any execution with more than one initialization try, n_init > 1, which may be helpful to reduce the execution time for larger problems. Note that it depends on your problem whether multiprocessing actually helps, so be sure to try that out first. You can check out the examples for some benchmarks.
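A small sketch of how a sensible job count could be chosen is shown below. The helper `capped_n_jobs` is hypothetical; it assumes, based on the description above, that parallelism is spread over initialization tries, so more workers than n_init would sit idle.

```python
import os


def capped_n_jobs(requested, n_init):
    """Limit the job count to the available cores and to n_init."""
    available = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, min(requested, available, n_init))


n_jobs = capped_n_jobs(requested=8, n_init=5)
print(n_jobs)
```

The result can then be passed to the estimator through its n_jobs parameter. Timing a run with n_jobs=1 against a run with the capped value is a simple way to check whether multiprocessing helps for your problem.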

FAQ

Q: I’m seeing errors such as TypeError: '<' not supported between instances of 'str' and 'float' when using the k-prototypes algorithm.

A: One or more of your numerical feature columns have string values in them. Make sure that all columns have consistent data types.
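One way to locate the offending entries is sketched below. The helper `string_entries` is hypothetical and simply reports every string found in the columns you intend to treat as numerical.

```python
import numpy as np


def string_entries(X, numerical):
    """Return (row, column, value) for each string in the numerical columns."""
    X = np.asarray(X, dtype=object)  # object dtype keeps the Python types
    return [
        (row, col, X[row, col])
        for col in numerical
        for row in range(X.shape[0])
        if isinstance(X[row, col], str)
    ]


X = [[1.5, "a"],
     ["n/a", "b"],
     [2.0, "a"]]
print(string_entries(X, numerical=[0]))  # [(1, 0, 'n/a')]
```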

Q: How does k-prototypes know which of my features are numerical and which are categorical?

A: You tell it which column indices are categorical using the categorical argument. All others are assumed numerical. E.g., clusters = KPrototypes().fit_predict(X, categorical=[1, 2])
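If your columns have names, the indices can be derived rather than hard-coded. The sketch below uses made-up column names and a hypothetical helper `categorical_indices`; only the resulting list of indices is what the categorical argument expects.

```python
def categorical_indices(columns, categorical_names):
    """Return the positions of the categorical columns."""
    return [columns.index(name) for name in categorical_names]


# Hypothetical column layout of a mixed data set.
columns = ["price", "sector", "exchange", "volume"]
categorical = categorical_indices(columns, ["sector", "exchange"])

# Every remaining column would be treated as numerical.
numerical = [i for i in range(len(columns)) if i not in categorical]

print(categorical)  # [1, 2]
print(numerical)    # [0, 3]
```

The resulting list can be passed as categorical=categorical in the fit_predict call shown in the answer above.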

References

[HUANG97]

Huang, Z.: Clustering large data sets with mixed numeric and categorical values, Proceedings of the First Pacific Asia Knowledge Discovery and Data Mining Conference, Singapore, pp. 21-34, 1997.

[HUANG98]

Huang, Z.: Extensions to the k-modes algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery 2(3), pp. 283-304, 1998.

[CAO09]

Cao, F., Liang, J., Bai, L.: A new initialization method for categorical data clustering, Expert Systems with Applications 36(7), pp. 10223-10228, 2009.
