kmodes

Description

Python implementations of the k-modes and k-prototypes clustering algorithms. Relies on numpy for a lot of the heavy lifting.

k-modes is used for clustering categorical variables. It defines clusters based on the number of matching categories between data points. (This is in contrast to the more well-known k-means algorithm, which clusters numerical data based on Euclidean distance.) The k-prototypes algorithm combines k-modes and k-means and is able to cluster mixed numerical / categorical data.
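To make the distinction concrete, here is a minimal sketch of the two dissimilarity measures in plain Python. The function names and the `gamma` weight are illustrative assumptions, not the kmodes API, and the library's own implementation may differ in detail.

```python
def matching_dissimilarity(a, b):
    """Count the positions where two categorical records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b))


def mixed_dissimilarity(num_a, num_b, cat_a, cat_b, gamma=1.0):
    """Squared Euclidean distance on the numerical part plus a weighted
    mismatch count on the categorical part.

    gamma is a hypothetical weight balancing the two parts.
    """
    numeric = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    return numeric + gamma * matching_dissimilarity(cat_a, cat_b)


p = ("red", "small", "round")
q = ("red", "large", "round")

# Only the second attribute differs
k_modes_distance = matching_dissimilarity(p, q)

# (1 - 2)^2 + (2 - 4)^2 = 5 on the numbers, plus 0.5 * 1 mismatch
k_prototypes_distance = mixed_dissimilarity((1.0, 2.0), (2.0, 4.0), p, q, gamma=0.5)

print(k_modes_distance, k_prototypes_distance)
```

Fewer mismatches mean more similar points, so the mismatch count plays the role that Euclidean distance plays in k-means.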

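Implemented are k-modes [HUANG97] [HUANG98], k-modes with initialization based on density [CAO09], and k-prototypes [HUANG97].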

The code is modeled after the clustering algorithms in scikit-learn and has the same familiar interface.

I would love to have more people play around with this and give me feedback on my implementation. If you come across any issues in running or installing kmodes, please submit a bug report.

Enjoy!

Installation

kmodes can be installed using pip:

pip install kmodes

Alternatively, you can build the latest development version from source:

git clone https://github.com/nicodv/kmodes.git
cd kmodes
python setup.py install

Usage

import numpy as np
from kmodes import kmodes

# random categorical data
data = np.random.choice(20, (100, 10))

km = kmodes.KModes(n_clusters=4, init='Huang', n_init=5, verbose=1)

# Fit the model and get the cluster label of each data point
clusters = km.fit_predict(data)

# Print the cluster centroids (only available after fitting)
print(km.cluster_centroids_)

Further simple usage examples of both k-modes (‘soybean.py’) and k-prototypes (‘stocks.py’) are included in the examples directory.

Missing / unseen data

The k-modes algorithm accepts np.nan values as missing values in the X matrix. When fitting the model, these values are encoded into their own category (let’s call it “unknown values”). When predicting, the model treats any value in X that (1) it has not seen during training, or (2) is missing, as a member of the “unknown values” category. Simply put, when determining similarity between points, the algorithm treats missing / unseen values as matching each other but mismatching with non-missing / seen values.
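The behaviour described above can be sketched in plain Python as follows. This is an illustration only: the `UNKNOWN` code and the helper names are assumptions made for this example, not the encoding kmodes uses internally.

```python
import math

UNKNOWN = -1  # hypothetical code for the shared "unknown values" category


def is_missing(value):
    """True for NaN, which marks a missing value."""
    return isinstance(value, float) and math.isnan(value)


def fit_encoder(column):
    """Map each non-missing category seen during fitting to an integer code."""
    mapping = {}
    for value in column:
        if not is_missing(value) and value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def encode(column, mapping):
    """Encode a column; missing and unseen values share the UNKNOWN code."""
    return [UNKNOWN if is_missing(v) else mapping.get(v, UNKNOWN) for v in column]


train = ["a", "b", float("nan"), "a"]
mapping = fit_encoder(train)
encoded_train = encode(train, mapping)

# "z" was never seen during fitting, so it lands in the same category as NaN
encoded_new = encode(["b", "z", float("nan")], mapping)

print(encoded_train, encoded_new)
```

Because missing and unseen values share one code, they count as a match with each other and as a mismatch with every known category.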

The k-prototypes algorithm also accepts np.nan values as missing values for the categorical variables, but does not accept missing values for the numerical variables. It is up to the user to handle missing numerical data in a way that is appropriate for the problem at hand.
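As one possible way of dealing with missing numerical values before clustering, the sketch below replaces them with the column median. Median imputation is an assumption chosen for illustration, not something kmodes does or recommends, and `impute_median` is a hypothetical helper. Other strategies (dropping rows, model-based imputation) may suit a given problem better.

```python
import math
from statistics import median


def impute_median(rows, numeric_columns):
    """Return a copy of rows with NaN in the given columns replaced by
    the median of the observed values in that column."""
    filled = [list(row) for row in rows]
    for col in numeric_columns:
        observed = [row[col] for row in rows if not math.isnan(row[col])]
        if not observed:
            raise ValueError(f"column {col} has no observed values")
        fill = median(observed)
        for row in filled:
            if math.isnan(row[col]):
                row[col] = fill
    return filled


# Column 0 is numerical, column 1 is categorical
rows = [[1.0, "a"], [float("nan"), "b"], [3.0, "a"]]
clean = impute_median(rows, numeric_columns=[0])
print(clean)
```

The categorical column is left untouched, since missing categorical values are already handled by the algorithm.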

References

[HUANG97]

Huang, Z.: Clustering large data sets with mixed numeric and categorical values, Proceedings of the First Pacific Asia Knowledge Discovery and Data Mining Conference, Singapore, pp. 21-34, 1997.

[HUANG98]

Huang, Z.: Extensions to the k-modes algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery 2(3), pp. 283-304, 1998.

[CAO09]

Cao, F., Liang, J., Bai, L.: A new initialization method for categorical data clustering, Expert Systems with Applications 36(7), pp. 10223-10228, 2009.
