This is a pre-production deployment of Warehouse, however changes made here WILL affect the production instance of PyPI.
Latest Version Dependencies status unknown Test status unknown Test coverage unknown
Project Description

kmodes

Description

Python implementations of the k-modes and k-prototypes clustering algorithms. Relies on numpy for a lot of the heavy lifting.

k-modes is used for clustering categorical variables. It defines clusters based on the number of matching categories between data points. (This is in contrast to the more well-known k-means algorithm, which clusters numerical data based on Euclidean distance.) The k-prototypes algorithm combines k-modes and k-means and is able to cluster mixed numerical / categorical data.

Implemented are:

The code is modeled after the clustering algorithms in scikit-learn and has the same familiar interface.

I would love to have more people play around with this and give me feedback on my implementation. If you come across any issues in running or installing kmodes, please submit a bug report.

Enjoy!

Installation

kmodes can be installed using pip:

pip install kmodes

To upgrade to the latest version (recommended), run it like this:

pip install --upgrade kmodes

Alternatively, you can build the latest development version from source:

git clone https://github.com/nicodv/kmodes.git
cd kmodes
python setup.py install

Usage

import numpy as np
from kmodes import kmodes

# random categorical data
data = np.random.choice(20, (100, 10))

km = kmodes.KModes(n_clusters=4, init='Huang', n_init=5, verbose=1)

clusters = km.fit_predict(data)

# Print the cluster centroids
print(km.cluster_centroids_)

More simple usage examples of both k-modes (‘soybean.py’) and k-prototypes (‘stocks.py’) are included in the examples directory.

Missing / unseen data

The k-modes algorithm accepts np.NaN values as missing values in the X matrix. However, users are strongly suggested to consider filling in the missing data themselves in a way that makes sense for the problem at hand. This is especially important in case of many missing values.

The k-modes algorithm currently handles missing data as follows. When fitting the model, np.NaN values are encoded into their own category (let’s call it “unknown values”). When predicting, the model treats any values in X that (1) it has not seen before during training, or (2) are missing, as being a member of the “unknown values” category. Simply put, the algorithm treats any missing / unseen data as matching with each other but mismatching with non-missing / seen data when determining similarity between points.

The k-prototypes also accepts np.NaN values as missing values for the categorical variables, but does not accept missing values for the numerical values. It is up to the user to come up with a way of handling these missing data that is appropriate for the problem at hand.

References

[HUANG97](1, 2) Huang, Z.: Clustering large data sets with mixed numeric and categorical values, Proceedings of the First Pacific Asia Knowledge Discovery and Data Mining Conference, Singapore, pp. 21-34, 1997.
[HUANG98]Huang, Z.: Extensions to the k-modes algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery 2(3), pp. 283-304, 1998.
[CAO09]Cao, F., Liang, J, Bai, L.: A new initialization method for categorical data clustering, Expert Systems with Applications 36(7), pp. 10223-10228., 2009.
Release History

Release History

0.6

This version

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.5

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.4

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.2

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.1

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

Download Files

Download Files

TODO: Brief introduction on what you do with files - including link to relevant help section.

File Name & Checksum SHA256 Checksum Help Version File Type Upload Date
kmodes-0.6-py2.py3-none-any.whl (16.0 kB) Copy SHA256 Checksum SHA256 py2.py3 Wheel Jun 8, 2016
kmodes-0.6.tar.gz (11.4 kB) Copy SHA256 Checksum SHA256 Source Jun 8, 2016

Supported By

WebFaction WebFaction Technical Writing Elastic Elastic Search Pingdom Pingdom Monitoring Dyn Dyn DNS HPE HPE Development Sentry Sentry Error Logging CloudAMQP CloudAMQP RabbitMQ Heroku Heroku PaaS Kabu Creative Kabu Creative UX & Design Fastly Fastly CDN DigiCert DigiCert EV Certificate Rackspace Rackspace Cloud Servers DreamHost DreamHost Log Hosting