Skip to main content

Computes an optimal k-means solution for small instances and small k.

Project description

Build Status License: MIT Supported Python version Stable Version

exact-kmeans

This package computes exact solutions to the $k$-means problem using integer linear programming (ILP). Since our plain ILP solution to the problem was too slow, we implemented a branch-and-bound algorithm that computes multiple ILP solutions for different cluster sizes to find the optimal cluster sizes $[c_1, \ldots, c_k]$. The ILPs are implemented using Gurobi. You will need a Gurobi license to be able to run the code.

The algorithm works as follows:

  • We run $k$-means++ on the data to get an initial upper bound for the cost of an optimal solution.
  • We try to find the largest cluster size $c_1$ that a cluster in the optimal solution may have. We use the fact that the cost of a cluster in the optimal solution is less than the initial upper bound. We do this by solving an ILP (ILP#1).
  • We iterate through all possible cluster sizes $c_1 \geq c_2 \ldots \geq c_k$ and compute for each of them a solution using an ILP (ILP#2). However, we don't actually iterate through all the sizes, but use a branch-and-bound algorithm to skip some of them if the cost exceeds the initial upper bound. You can see an example of this in the tree in the "Plot the Branch and Bound Tree" section.

Both ILPs get cluster sizes $c_1,\ldots, c_i$ with $i\leq k$ as input and compute an optimal solution to the following modification of the $k$-means problem:

  • the number of clusters is $i$
  • the clusters have sizes $c_1,\ldots, c_i$
  • if $n$ is the number of data points $n-c_1-\ldots -c_i$ points are not part of any cluster

ILP#1 gets one single cluster size $c_1$ as input and computes an optimal solution with one single cluster of size $c_1$. We use ILP#1 to compute a lower bound to the cost of a cluster of size $c_1$. If this cost is already larger than the initial upper bound, then we know that the cluster sizes of an optimal solution must be all less than $c_1$. Therefore we can discard all cluster sizes greater equal $c_1$ when searching for the cluster sizes of an optimal solution.

ILP#2 gets cluster sizes $c_1,\ldots, c_i$ with $2\leq i\leq k$ as input. If $c_1,\ldots, c_i$ are also cluster sizes which are represented in the optimal solution the cost that ILP#2 computes should be less than the initial upper bound. We add a constraint to ILP#2 bounding the cost by initial upper bound. Therefore either ILP#2 is infeasible or it outputs an optimal solution of $i$ clusters with sizes $c_1,\ldots, c_i$ and cost smaller than initial upper bound.

To search for the cluster sizes of an optimal solution we use a branch and bound approach. In a branch node of level $i$ the $i$ largest cluster sizes $c_1, \ldots, c_i$ are already fixed. The variables branching_levels and fill_cluster_sizes define the behavior on these branching nodes. If branching_levels is greater equal to the level $i$ of the node then we use ILP#2 to bound the current cost and decide if we branch, otherwise we always branch on this node. If the variable fill_cluster_sizes is set to true we compute the smallest possible remaining cluster sizes $c_{i+1},\ldots, c_{k}$ and run ILP#2 with cluster sizes $c_1,\ldots, c_k$. If the variable fill_cluster_sizes is set to false we run ILP#2 only with the fixed cluster sizes $c_1,\ldots, c_i$. Setting fill_cluster_sizes to true may lead to less branching but can increase the solving time of ILP#2.

To customize the runs, you can create a config file. The default config file is exact_kmeans/config/default.yaml. You can also pass a different config file as an argument.

  • num_processes (integer or float) sets the number of processes used. The algorithm was parallelized using the multiprocessing package, so you can set the number of processes that you want to use. If you use an integer, at most that number of processes will be taken, otherwise if you use a float, it will be a fraction of the available CPUs. If the parameter is not passed, the algorithm will use all available CPUs.
  • bound_model_params are the arguments that are passed to the ILP#1 model. Please have a look at the Gurobi documentation for more information.
  • model_params are the arguments that are passed to the ILP#2 model. Please have a look at the Gurobi documentation for more information.
  • branching_priorities (true/false) enables the use of branching priorities. If true, the priority of the $x$ variable will be higher than the other variables. According to our tests, this speeds up the solving.
  • replace_min (true/false) replaces a minimum constraint with a linear version. According to our tests, this speeds up the solving.
  • branching_levels (integer) Sets the maximum level of a branching node where we run ILP#2 to bound the cost. Since this can increase the number of ILPs that are computed, it may only make sense for small levels.
  • fill_cluster_sizes (true/false) If set to false we run ILP#2 only with cluster sizes fixed at the branching node. If set to true we fill the cluster sizes up to $k$. Setting the variable to true can result in a larger computation time for ILP#2 but may result in less branching, which can save time. For small $k$ we recommend setting it to false and for large $k$ we recommend setting it to true.

Installation

For this package to work, you need to have a working installation of Gurobi. You can get a free academic license here. Here you can find a guide on how to install it.

pip install exact-kmeans

Use within Python

from exact_kmeans import ExactKMeans
from exact_kmeans.util import JsonEncoder
from ucimlrepo import fetch_ucirepo

# For example, you can import data from the UCI repository
# but you can also use your own dataset
iris = fetch_ucirepo(id=53)

X = iris.data.features

ilp = ExactKMeans(n_clusters=3)
res = ilp.fit(X)

with open("output.json", "w") as f:
    json.dump(res, f, indent=4, cls=JsonEncoder)

# You can also print the branch and bound tree to visualize the decisions made by the algorithm
# Below you find an example of the tree and meaning of the colors
from exact_kmeans.plot_tree import plot

plot(
  nodes=res.processed_cluster_sizes,
  filename="test",
  plot_folder="plots",
  optimal_objective=res.model.ObjVal,
)

Command Line

Install poetry

curl -sSL https://install.python-poetry.org | python3 -

Install the package

poetry install

Run the program

poetry run python -m exact_kmeans --data-path iris.csv --verbose --results-path test/iris.json --k 3 --config-file exact_kmeans/config/default.yaml

Your data-path should be a file with a header containing only the comma-separated data points. Example:

0,1,2,3
5.1,3.5,1.4,0.2
4.9,3.0,1.4,0.2
4.7,3.2,1.3,0.2
4.6,3.1,1.5,0.2
5.0,3.6,1.4,0.2
...

Plot the Branch and Bound Tree

Assuming that you have run the program and stored the results in a JSON file, you can plot the tree produced by the algorithm.

poetry run python -m exact_kmeans.plot_tree --output-json output.json --plot-folder plots

This will create a tree that looks like this:

Tree Example

The nodes can have the following colors:

  • Red: The cluster sizes were infeasible.
  • Green: The cluster sizes were feasible.
  • Gray: The node was a branching node.
  • Orange/Yellow: The node was skipped because it had a too high initial cost.
  • The optimal node has a gold border.

The nodes have the following information:

  • The cluster sizes that were tested.
  • The amount of seconds it took.
  • The cost of the solution. This is only shown if the node has $k$ clusters and is feasible.

Development

Install poetry

curl -sSL https://install.python-poetry.org | python3 -

Install the package

poetry install

Run the tests

poetry run python -m unittest discover tests -v

Disclaimer

This package requires the Gurobi Optimizer, which is a commercial optimization solver. Users must obtain their own licenses for Gurobi. You are responsible for complying with all Gurobi licensing terms. For more information, visit the Gurobi Licensing Page.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

exact_kmeans-0.1.3.tar.gz (407.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

exact_kmeans-0.1.3-py3-none-any.whl (20.6 kB view details)

Uploaded Python 3

File details

Details for the file exact_kmeans-0.1.3.tar.gz.

File metadata

  • Download URL: exact_kmeans-0.1.3.tar.gz
  • Upload date:
  • Size: 407.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.3.2 CPython/3.10.19 Linux/6.11.0-1018-azure

File hashes

Hashes for exact_kmeans-0.1.3.tar.gz
Algorithm Hash digest
SHA256 6f807ed300ddcf022f36c8aae381369d6f5a63568632c2dd9df746416afb0990
MD5 756d7476bdfbe8c20b4990ede402c5b2
BLAKE2b-256 0b4f89d46de958c83efa2245f6d78be834bf1d74df33fbd2c2978b74e2d89069

See more details on using hashes here.

File details

Details for the file exact_kmeans-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: exact_kmeans-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 20.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.3.2 CPython/3.10.19 Linux/6.11.0-1018-azure

File hashes

Hashes for exact_kmeans-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 3a782183be6d6a8ffa2db7dfdd9e78f78298cfe8f8a18c0a2653a77e2c29992e
MD5 e7bc82b60dce897d181706fefd101970
BLAKE2b-256 8ccb395dcd1a3be26799fa5212d2ffdc78a2e00d6a12e800beae3644f126f480

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page