External Clustering Validation Chi index

About Chi Index

Chi Index is an external clustering validity index that measures the distance between the instances of a clustering result and the labels. Although clustering is an unsupervised machine learning technique, Chi Index favours solutions whose clusters contain as few different labels as possible.

For example, the following image shows three clustering solutions. Each circle represents an instance of the dataset, and its color indicates the class it belongs to. In solution A, one cluster contains 5 red instances and 2 green instances, while the other contains 2 red, 8 green, and 6 blue instances. In solution B, with k=3, the cluster at the top of the figure contains mostly red instances, the one on the left mostly blue, and the one at the bottom mostly green.

[Figure: Clustering Solutions]

Chi Index takes the distribution of instances across the clusters formed, together with the number of instances of each label in each cluster, and computes a metric based on the chi-square statistic. The following table shows the Chi Index results for each of the clustering solutions.

k Chi Index(k)
2 0.890
3 0.925
4 0.760

The clustering solution with the highest Chi Index value is k=3, which indicates that 3 is the optimal number of clusters for grouping instances of the same label together.
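The selection step can be written as a short sketch. The dictionary simply restates the values from the table above; the name chi_by_k is illustrative and not part of the library.

```python
# Chi Index values copied from the table above, keyed by the number of clusters k.
chi_by_k = {2: 0.890, 3: 0.925, 4: 0.760}

# The preferred k is the key whose Chi Index value is largest.
best_k = max(chi_by_k, key=chi_by_k.get)
print(best_k, chi_by_k[best_k])
```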

The higher the Chi Index value, the greater the dependency between clusters and labels. In other words, the clustering solution with the highest Chi Index is the one in which instances of the same class are grouped together as well as possible.
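To illustrate what "dependency between clusters and labels" means, the sketch below computes the plain Pearson chi-square statistic on a clusters-by-labels count table, using only the standard library. This is not the Chi Index itself and makes no claim about how the chi-index library works internally; it does not reproduce the values in the table above, and the exact definition is in the paper cited at the end. The function name chi_square_statistic and the pure_clusters table are assumptions made for illustration; solution_a uses the counts given for solution A.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a clusters x labels count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if clusters and labels were independent.
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected
    return statistic


# Rows are clusters, columns are labels (red, green, blue).
# Solution A as described in the text: two clusters with mixed labels.
solution_a = [[5, 2, 0], [2, 8, 6]]
# Hypothetical table with the same label totals, but one label per cluster.
pure_clusters = [[7, 0, 0], [0, 10, 0], [0, 0, 6]]

chi_a = chi_square_statistic(solution_a)
chi_pure = chi_square_statistic(pure_clusters)
print(chi_a, chi_pure)
```

The mixed table yields a smaller statistic than the table in which every cluster holds a single label, which matches the intuition that purer clusters mean stronger dependency. The raw statistic also grows with the number of instances and the size of the table, so it should not be compared across different k without some normalization.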

Getting Started

Using Chi Index takes only a few steps: install the Chi Index library with pip, then import it into your Python application.

Installing Chi Index

The Chi Index version in this repository is implemented in Python. You can use any version of Python from 3.7 onwards, although 3.10 is recommended. To install the library, run the following command:

pip install chi-index

Examples

There are two examples of how to run the library: the first is similar to other metrics such as silhouette_score from sklearn, and the second works as a class that includes the whole k-means execution.

Note: To run these examples you must first install the chi-index library with the command in the previous section. Then download the file iris.data from the UCI repository and place it in a folder called "data". The examples below read the file from ./test/data/iris.data, so adjust that path to match where you saved it. To make it easier for you, here is the link: iris.data

Example 1

This is the simplest one, and it is similar to other common metrics such as silhouette_score:

import pandas as pd
from chi_index import metrics
from sklearn import cluster
import numpy as np

def main():
    df = pd.read_csv('./test/data/iris.data', delimiter=",", header=None)
    print(df.columns)
    print(df.head())
    df.rename(columns={4: 'Class'}, inplace=True)

    X = np.array(df.drop(['Class'], axis=1))

    for clusters_num in range(2, 11):
        # Clustering stage
        kmeans_model = cluster.KMeans(n_clusters=clusters_num, n_init=100, max_iter=500, init='random').fit(X)
        labels = kmeans_model.predict(X)
        df.loc[:, 'cluster'] = labels   # saves the clustering labels into 'cluster' new column

        # chi_index_score receives the clustering result array and the class array
        valor = metrics.chi_index_score(df['cluster'], df['Class'], k=clusters_num)
        print(clusters_num, '\t', valor)


if __name__ == "__main__":
    main()
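To make the two inputs of chi_index_score more concrete, the following sketch cross-tabulates a cluster array against a class array using only the standard library. It does not call chi-index and is not a description of its internals; the helper contingency_table and the toy data are hypothetical and serve only to show how cluster assignments and class labels line up instance by instance.

```python
from collections import Counter


def contingency_table(cluster_ids, class_labels):
    """Count instances for every (cluster, class) pair."""
    pair_counts = Counter(zip(cluster_ids, class_labels))
    clusters = sorted(set(cluster_ids))
    classes = sorted(set(class_labels))
    # A Counter returns 0 for pairs that never occur, so empty cells are filled in.
    table = [[pair_counts[(c, label)] for label in classes] for c in clusters]
    return clusters, classes, table


# Hypothetical toy data: six instances, two clusters, two classes.
cluster_ids = [0, 0, 0, 1, 1, 1]
class_labels = ["setosa", "setosa", "virginica", "virginica", "virginica", "setosa"]

clusters, classes, table = contingency_table(cluster_ids, class_labels)
for cluster, row in zip(clusters, table):
    print(cluster, dict(zip(classes, row)))
```

Each row of the resulting table describes one cluster and each column one class, so a row dominated by a single column corresponds to a cluster made up mostly of one label.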

Example 2

In this case, the class includes all the code needed to run k-means. You can copy and paste the following code, which uses the Iris dataset:

import pandas as pd
from chi_index.model import ChiIndex


def main():
    df = pd.read_csv('./test/data/iris.data', delimiter=",", header=None)
    print(df.columns)
    print(df.head())
    df.rename(columns={4: 'Class'}, inplace=True)

    chi = ChiIndex(df, results_path='result')
    print(chi.list_chi)
    print(chi.optimum_chi)
    print(chi.optimum_k)
    chi.save_centroids()


if __name__ == "__main__":
    main()

If you have any problems or cannot get the code to run, please contact me through DISCUSSION so I can help you.

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated. Read CONTRIBUTING.md. We appreciate all kinds of help.

License

This project is licensed under the MIT License; see the LICENSE.md file for details.

Cite this

Please cite as: Luna-Romera JM, Martínez-Ballesteros M, García-Gutiérrez J, Riquelme JC. External clustering validity index based on chi-squared statistical test. Information Sciences (2019) 487: 1-17. https://doi.org/10.1016/j.ins.2019.02.046. (http://www.sciencedirect.com/science/article/pii/S0020025519301550)

@article{LUNAROMERA20191,
title = {External clustering validity index based on chi-squared statistical test},
journal = {Information Sciences},
volume = {487},
pages = {1-17},
year = {2019},
issn = {0020-0255},
doi = {https://doi.org/10.1016/j.ins.2019.02.046},
url = {https://www.sciencedirect.com/science/article/pii/S0020025519301550},
author = {José María Luna-Romera and María Martínez-Ballesteros and Jorge García-Gutiérrez and José C. Riquelme},
keywords = {Clustering analysis, External validity indices, Comparing clusters, Big data}
}
