Skip to main content

HG-means clustering for minimum sum-of-squares formulation

Project description

# HG-means

Source code of HG-means clustering, an efficient hybrid genetic algorithm proposed for the minimum sum-of-squares clustering (MSSC). This population-based metaheuristic uses K-means as a local search in combination with crossover, mutation, and diversification operators.

As HG-means algorithm uses K-means, we included the fundamental source files of the fast K-means implementation of Greg Hamerly (to whom we are grateful for making the source code available) in this repository, under the folder `/hamerly`. Original files and complete source code of Greg Hamerly K-means can be found at: https://github.com/ghamerly/fast-kmeans.

For the exact crossover, HG-means uses the implementation of Dlib (https://github.com/davisking/dlib) for solving an assignment problem. Dlib files are included in `/dlib-master` folder.

HG-means clustering is available as a C++ code, as well as a Python package.

## Related Article

*HG-means: A scalable hybrid genetic algorithm for minimum sum-of-squares clustering*. D. Gribel and T. Vidal, 2019. Pattern Recognition, https://doi.org/10.1016/j.patcog.2018.12.022

## Installation and Run

### C++

To run the algorithm in C++, go to `/hgmeans` folder and try the following sequence of commands:

`> make`

`> ./hgmeans 'dataset_path' pi_min n2 [nb_clusters]`

### Example

`> make`

`> ./hgmeans 'data/iris.txt' 10 5000 2 5 10`

This script executes HG-means clustering for "iris" dataset, with 10 solutions in population, a maximum of 5000 iterations, and 2, 5 and 10 clusters.

**Important:** You can provide a ground-truth file with the labels of clusters. In this case, make sure that a file with the same name of the dataset and '.label' extension is placed in the same folder of the dataset. If this file is provided, HG-means clustering will compute clustering performance metrics. See section **Data format** to check the expected data format for datasets and labels files.

### Parameters of the algorithm

`dataset_path`: The path of dataset.

`pi_min` (default = 10): Population size. Determines the size of the population in the genetic algorithm.

`n2` (default = 5000): Maximum number of iterations. Determines the total number of iterations the algorithm will take.

`[nb_clusters]`: The list with number of clusters. You can pass multiple values, separated by a single space.

### Python

HG-means is also available as a Python package.

If you use Windows, please install C++ Build tools, which can be downloaded here: https://go.microsoft.com/fwlink/?LinkId=691126

<!-- Firstly, you should have Cython installed. To install Cython, please refer to the official installation page:

https://cython.readthedocs.io/en/latest/src/quickstart/install.html -->

To install HG-means, run the following installation command:

`> python -m pip install hgmeans`

That is it! Now, open your Python interface, import the package and create an instance of HG-means. To execute it, just call function `Go()` with the corresponding parameters. See an example below:

`>>> import hgmeans`

`>>> my_demo = hgmeans.PyHGMeans()`

`>>> my_demo.Go('data/iris.txt', 10, 5000, [2,5,10])`

This script executes HG-means algorithm for "iris" dataset, with 10 solutions in population, a maximum of 5000 iterations, and 2, 5 and 10 clusters. Here the number of clusters is passed in an array, so values are separated by commas.

## Data format

**Dataset files.** In the first line of a dataset file, the number of data points (n) and the dimensionality of the data (d) is set, separated by a single space. The remaining lines correspond to the coordinates of data points. Each line contains the values of the d features of a sample, where x_ij correspond to the j-th feature of the i-th sample of the data. Each feature value is separated by a single space, as depicted in the scheme below:

| n | d | | | |
|------|------|------|-----|------|
| x_11 | x_12 | x_13 | ... | x_1d |
| x_21 | x_22 | x_23 | ... | x_2d |
| .... | .... | .... | ... | .... |
| x_n1 | x_n2 | x_n3 | ... | x_nd |

Some datasets are provided in `/data` folder in HG-means repository.

**Labels files.** The content of a labels file exhibits the cluster of each sample of the dataset according to ground-truth, where y_i correspond to the label of the i-th sample:

y_1

y_2

...

y_n

**Important**: Labels files must have '.label' extension. Some labels are provided in `/data` folder in HG-means repository.

<!-- ## Output file

After the execution of the algorithm, output files are saved in `/out` folder, with the following header:

| Pi_min | N2 | Dataset | m | SolutionCost | Time(s) |

If external clustering evaluation is done, the following header is produced:

| Pi_min | N2 | Dataset | m | SolutionCost | Time(s) | C-Rand | NMI | CentroidIndex -->

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hgmeans-2.0.tar.gz (1.7 MB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page