HG-means clustering for minimum sum-of-squares formulation
Project description
# HG-means
Source code of HG-means clustering, an efficient hybrid genetic algorithm proposed for the minimum sum-of-squares clustering (MSSC). This population-based metaheuristic uses K-means as a local search in combination with crossover, mutation, and diversification operators.
As HG-means algorithm uses K-means, we included the fundamental source files of the fast K-means implementation of Greg Hamerly (to whom we are grateful for making the source code available) in this repository, under the folder `/hamerly`. Original files and complete source code of Greg Hamerly K-means can be found at: https://github.com/ghamerly/fast-kmeans.
For the exact crossover, HG-means uses the implementation of Dlib (https://github.com/davisking/dlib) for solving an assignment problem. Dlib files are included in `/dlib-master` folder.
HG-means clustering is available as a C++ code, as well as a Python package.
## Related Article
*HG-means: A scalable hybrid genetic algorithm for minimum sum-of-squares clustering*. D. Gribel and T. Vidal, 2019. Pattern Recognition, https://doi.org/10.1016/j.patcog.2018.12.022
## Installation and Run
### C++
To run the algorithm in C++, go to `/hgmeans` folder and try the following sequence of commands:
`> make`
`> ./hgmeans 'dataset_path' pi_min n2 [nb_clusters]`
### Example
`> make`
`> ./hgmeans 'data/iris.txt' 10 5000 2 5 10`
This script executes HG-means clustering for "iris" dataset, with 10 solutions in population, a maximum of 5000 iterations, and 2, 5 and 10 clusters.
**Important:** You can provide a ground-truth file with the labels of clusters. In this case, make sure that a file with the same name of the dataset and '.label' extension is placed in the same folder of the dataset. If this file is provided, HG-means clustering will compute clustering performance metrics. See section **Data format** to check the expected data format for datasets and labels files.
### Parameters of the algorithm
`dataset_path`: The path of dataset.
`pi_min` (default = 10): Population size. Determines the size of the population in the genetic algorithm.
`n2` (default = 5000): Maximum number of iterations. Determines the total number of iterations the algorithm will take.
`[nb_clusters]`: The list with number of clusters. You can pass multiple values, separated by a single space.
### Python
HG-means is also available as a Python package.
If you use Windows, please install C++ Build tools, which can be downloaded here: https://go.microsoft.com/fwlink/?LinkId=691126
<!-- Firstly, you should have Cython installed. To install Cython, please refer to the official installation page:
https://cython.readthedocs.io/en/latest/src/quickstart/install.html -->
To install HG-means, run the following installation command:
`> python -m pip install hgmeans`
That is it! Now, open your Python interface, import the package and create an instance of HG-means. To execute it, just call function `Go()` with the corresponding parameters. See an example below:
`>>> import hgmeans`
`>>> my_demo = hgmeans.PyHGMeans()`
`>>> my_demo.Go('data/iris.txt', 10, 5000, [2,5,10])`
This script executes HG-means algorithm for "iris" dataset, with 10 solutions in population, a maximum of 5000 iterations, and 2, 5 and 10 clusters. Here the number of clusters is passed in an array, so values are separated by commas.
## Data format
**Dataset files.** In the first line of a dataset file, the number of data points (n) and the dimensionality of the data (d) is set, separated by a single space. The remaining lines correspond to the coordinates of data points. Each line contains the values of the d features of a sample, where x_ij correspond to the j-th feature of the i-th sample of the data. Each feature value is separated by a single space, as depicted in the scheme below:
| n | d | | | |
|------|------|------|-----|------|
| x_11 | x_12 | x_13 | ... | x_1d |
| x_21 | x_22 | x_23 | ... | x_2d |
| .... | .... | .... | ... | .... |
| x_n1 | x_n2 | x_n3 | ... | x_nd |
Some datasets are provided in `/data` folder in HG-means repository.
**Labels files.** The content of a labels file exhibits the cluster of each sample of the dataset according to ground-truth, where y_i correspond to the label of the i-th sample:
y_1
y_2
...
y_n
**Important**: Labels files must have '.label' extension. Some labels are provided in `/data` folder in HG-means repository.
<!-- ## Output file
After the execution of the algorithm, output files are saved in `/out` folder, with the following header:
| Pi_min | N2 | Dataset | m | SolutionCost | Time(s) |
If external clustering evaluation is done, the following header is produced:
| Pi_min | N2 | Dataset | m | SolutionCost | Time(s) | C-Rand | NMI | CentroidIndex -->
Source code of HG-means clustering, an efficient hybrid genetic algorithm proposed for the minimum sum-of-squares clustering (MSSC). This population-based metaheuristic uses K-means as a local search in combination with crossover, mutation, and diversification operators.
As HG-means algorithm uses K-means, we included the fundamental source files of the fast K-means implementation of Greg Hamerly (to whom we are grateful for making the source code available) in this repository, under the folder `/hamerly`. Original files and complete source code of Greg Hamerly K-means can be found at: https://github.com/ghamerly/fast-kmeans.
For the exact crossover, HG-means uses the implementation of Dlib (https://github.com/davisking/dlib) for solving an assignment problem. Dlib files are included in `/dlib-master` folder.
HG-means clustering is available as a C++ code, as well as a Python package.
## Related Article
*HG-means: A scalable hybrid genetic algorithm for minimum sum-of-squares clustering*. D. Gribel and T. Vidal, 2019. Pattern Recognition, https://doi.org/10.1016/j.patcog.2018.12.022
## Installation and Run
### C++
To run the algorithm in C++, go to `/hgmeans` folder and try the following sequence of commands:
`> make`
`> ./hgmeans 'dataset_path' pi_min n2 [nb_clusters]`
### Example
`> make`
`> ./hgmeans 'data/iris.txt' 10 5000 2 5 10`
This script executes HG-means clustering for "iris" dataset, with 10 solutions in population, a maximum of 5000 iterations, and 2, 5 and 10 clusters.
**Important:** You can provide a ground-truth file with the labels of clusters. In this case, make sure that a file with the same name of the dataset and '.label' extension is placed in the same folder of the dataset. If this file is provided, HG-means clustering will compute clustering performance metrics. See section **Data format** to check the expected data format for datasets and labels files.
### Parameters of the algorithm
`dataset_path`: The path of dataset.
`pi_min` (default = 10): Population size. Determines the size of the population in the genetic algorithm.
`n2` (default = 5000): Maximum number of iterations. Determines the total number of iterations the algorithm will take.
`[nb_clusters]`: The list with number of clusters. You can pass multiple values, separated by a single space.
### Python
HG-means is also available as a Python package.
If you use Windows, please install C++ Build tools, which can be downloaded here: https://go.microsoft.com/fwlink/?LinkId=691126
<!-- Firstly, you should have Cython installed. To install Cython, please refer to the official installation page:
https://cython.readthedocs.io/en/latest/src/quickstart/install.html -->
To install HG-means, run the following installation command:
`> python -m pip install hgmeans`
That is it! Now, open your Python interface, import the package and create an instance of HG-means. To execute it, just call function `Go()` with the corresponding parameters. See an example below:
`>>> import hgmeans`
`>>> my_demo = hgmeans.PyHGMeans()`
`>>> my_demo.Go('data/iris.txt', 10, 5000, [2,5,10])`
This script executes HG-means algorithm for "iris" dataset, with 10 solutions in population, a maximum of 5000 iterations, and 2, 5 and 10 clusters. Here the number of clusters is passed in an array, so values are separated by commas.
## Data format
**Dataset files.** In the first line of a dataset file, the number of data points (n) and the dimensionality of the data (d) is set, separated by a single space. The remaining lines correspond to the coordinates of data points. Each line contains the values of the d features of a sample, where x_ij correspond to the j-th feature of the i-th sample of the data. Each feature value is separated by a single space, as depicted in the scheme below:
| n | d | | | |
|------|------|------|-----|------|
| x_11 | x_12 | x_13 | ... | x_1d |
| x_21 | x_22 | x_23 | ... | x_2d |
| .... | .... | .... | ... | .... |
| x_n1 | x_n2 | x_n3 | ... | x_nd |
Some datasets are provided in `/data` folder in HG-means repository.
**Labels files.** The content of a labels file exhibits the cluster of each sample of the dataset according to ground-truth, where y_i correspond to the label of the i-th sample:
y_1
y_2
...
y_n
**Important**: Labels files must have '.label' extension. Some labels are provided in `/data` folder in HG-means repository.
<!-- ## Output file
After the execution of the algorithm, output files are saved in `/out` folder, with the following header:
| Pi_min | N2 | Dataset | m | SolutionCost | Time(s) |
If external clustering evaluation is done, the following header is produced:
| Pi_min | N2 | Dataset | m | SolutionCost | Time(s) | C-Rand | NMI | CentroidIndex -->
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
hgmeans-2.0.tar.gz
(1.7 MB
view hashes)