K-Means++ Clustering for Pandas DataFrames
Project description
K-means++ in Pandas
===================
An implementation of the [k-means++ clustering algorithm](http://en.wikipedia.org/wiki/K-means%2B%2B) using [Pandas](http://pandas.pydata.org/).
IMPORTANT NOTE
--------------
**This package should not be used in production.** The implementation of k-means++ contained therein is much slower than [that of scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html). Use that instead.
The only reason why I wrote any of this is to teach myself Pandas.
Prerequisites
-------------
* Python 2.7 or lower; this is not Python 3 compatible (yet).
* [Pandas](http://pandas.pydata.org/) (obviously).
* [NumPy](http://numpy.org)
Installation
------------
If you have [pip](http://www.pip-installer.org/en/latest/installing.html), then just do
pip install k-means-plus-plus
Otherwise,
* Clone the repository:
git clone https://github.com/jackmaney/k-means-plus-plus-pandas.git
* Enter the newly-created folder containing the repo
cd k-means-plus-plus-pandas
* And run the installation manually:
python setup.py install
Usage
-----
Here are the constructor arguments:
* `data_frame`: A Pandas data frame representing the data that you wish to cluster. Rows represent observations, and columns represent variables.
* `k`: The number of clusters that you want.
* `columns=None`: A list of column names upon which you wish to cluster your data. If this argument isn't provided, then all of the columns are selected. **Note:** Columns upon which you want to cluster must be numeric and have no `numpy.nan` values.
* `max_iterations=None`: The maximum number of times that you wish to iterate k-means. If no value is provided, then the iterations continue until stability is reached (ie the cluster assignments don't change between one iteration and the next).
* `appended_column_name=None`: If this value is set with a string, then a column will be appended to your data with the given name that contains the cluster assignments (which are integers from 0 to `k-1`). If this argument is not set, then you still have access to the clusters via the `clusters` attribute.
Once you've constructed a `KMeansPlusPlus` object, then just call the `cluster` method, and everything else should happen automagically. Take a look at the `examples` folder.
TODO:
----
* Add on features that take iterations of k-means++ clusters and compares them via, eg, concordance matrices, Jaccard indices, etc.
* Given a data frame, implement the so-called [Elbow Method](http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set#The_Elbow_Method) to take a stab at an optimal value for `k`.
* ~~Make this into a proper Python module that can be installed via pip.~~
* Python 3 compatibility (probably via six).
===================
An implementation of the [k-means++ clustering algorithm](http://en.wikipedia.org/wiki/K-means%2B%2B) using [Pandas](http://pandas.pydata.org/).
IMPORTANT NOTE
--------------
**This package should not be used in production.** The implementation of k-means++ contained therein is much slower than [that of scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html). Use that instead.
The only reason why I wrote any of this is to teach myself Pandas.
Prerequisites
-------------
* Python 2.7 or lower; this is not Python 3 compatible (yet).
* [Pandas](http://pandas.pydata.org/) (obviously).
* [NumPy](http://numpy.org)
Installation
------------
If you have [pip](http://www.pip-installer.org/en/latest/installing.html), then just do
pip install k-means-plus-plus
Otherwise,
* Clone the repository:
git clone https://github.com/jackmaney/k-means-plus-plus-pandas.git
* Enter the newly-created folder containing the repo
cd k-means-plus-plus-pandas
* And run the installation manually:
python setup.py install
Usage
-----
Here are the constructor arguments:
* `data_frame`: A Pandas data frame representing the data that you wish to cluster. Rows represent observations, and columns represent variables.
* `k`: The number of clusters that you want.
* `columns=None`: A list of column names upon which you wish to cluster your data. If this argument isn't provided, then all of the columns are selected. **Note:** Columns upon which you want to cluster must be numeric and have no `numpy.nan` values.
* `max_iterations=None`: The maximum number of times that you wish to iterate k-means. If no value is provided, then the iterations continue until stability is reached (ie the cluster assignments don't change between one iteration and the next).
* `appended_column_name=None`: If this value is set with a string, then a column will be appended to your data with the given name that contains the cluster assignments (which are integers from 0 to `k-1`). If this argument is not set, then you still have access to the clusters via the `clusters` attribute.
Once you've constructed a `KMeansPlusPlus` object, then just call the `cluster` method, and everything else should happen automagically. Take a look at the `examples` folder.
TODO:
----
* Add on features that take iterations of k-means++ clusters and compares them via, eg, concordance matrices, Jaccard indices, etc.
* Given a data frame, implement the so-called [Elbow Method](http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set#The_Elbow_Method) to take a stab at an optimal value for `k`.
* ~~Make this into a proper Python module that can be installed via pip.~~
* Python 3 compatibility (probably via six).
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file k-means-plus-plus-0.1.0.tar.gz
.
File metadata
- Download URL: k-means-plus-plus-0.1.0.tar.gz
- Upload date:
- Size: 5.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | fe73020b4bc3701d96387584ec8aa7204acdc0e221a3793cbac0d8abed8fcde7 |
|
MD5 | b78f055e3d134a1935ac32e3eca2f631 |
|
BLAKE2b-256 | b663c9029b1a8b8915e76e49c65f363bcc812f55b5b9e282e38b3832c9c46fd8 |