LSH-k-Representatives: A categorial clustering algorithm using LSH
Project description
Python implementations of the LSH-k-Representatives algorithms for clustering categorical data:
Different from k-Modes algorithm, LSH-k-Representatives define the "representatives" that keep the frequencies of all categorical values of the clusters.
Notebook Samples:
Applying LSHkRepresentatives:
https://github.com/nmtoan91/lshkrepresentatives/blob/main/notebook_sample.ipynb
Applying LSHkRepresentatives_Full (for huge dataset):
https://github.com/nmtoan91/lshkrepresentatives/blob/main/notebook_sample_LSHkRepresentatives_Full.ipynb
Normalizing unstructed normal dataset:
https://github.com/nmtoan91/lshkrepresentatives/blob/main/notebook_dataset_normalization.ipynb
Note: The dataset is auto normalized if it detect string, or disjointed data, or nan
Installation:
Using pip:
pip install lshkrepresentatives numpy scikit-learn pandas kmodes networkx termcolor
Import the packages:
import numpy as np
from LSHkRepresentatives.LSHkRepresentatives import LSHkRepresentatives
Generate a simple categorical dataset:
X = np.array([['red',0,np.nan],['green',1,1],['blue',0,0],[1,5111,1],[2,2,2],[2,6513,'rectangle'],[2,3,6565]])
LSHk-Representatives (Init):
#Init instance of LSHkRepresentatives
kreps = LSHkRepresentatives(n_clusters=2,n_init=5)
#Do clustering for dataset X
labels = kreps.fit(X)
#Print the label for dataset X
print('Labels:',labels)
#Predict label for the random instance x
x = np.array(['red',5111,0])
label = kreps.predict(x)
print(f'Cluster of object {x} is: {label}')
Outcome:
SKIP LOADING distMatrix because: False bd=None
Generating disMatrix for DILCA
Saving DILCA to: saved_dist_matrices/json/DILCA_None.json
Generating LSH hash table: hbits: 2(4) k 1 d 3 n= 7
LSH time: 0.006518099999993865 Score: 6.333333333333334 Time: 0.0003226400000130525
Labels: [1 1 1 1 0 0 0]
Cluster of object [1 2 0] is: 1
Built-in evaluattion metrics:
y = np.array([0,0,0,0,1,1,1])
kreps.CalcScore(y)
Outcome:
Purity: 1.00 NMI: 1.00 ARI: 1.00 Sil: 0.59 Acc: 1.00 Recall: 1.00 Precision: 1.00
LSHk-Representatives (Full):
This version of LSHk-Representatives target for huge dataset, the accuracy will be reduced but the speed is increase from 2 to 32 times depend on the data
X = np.array([[0,0,0],[0,1,1],[0,0,0],[1,0,1],[2,2,2],[2,3,2],[2,3,2]])
kreps = LSHkRepresentatives_Full(n_clusters=2,n_init=5)
labels = kreps.fit(X)
print('Labels:',labels)
x = np.array([1,2,0])
label = kreps.predict(x)
print(f'Cluster of object {x} is: {label}')
Built-in evaluattion metrics:
y = np.array([0,0,0,0,1,1,1])
kreps.CalcScore(y)
Out come:
SKIP LOADING distMatrix because: True bd=None
Generating disMatrix for DILCA
Saving DILCA to: saved_dist_matrices/json/DILCA_None.json
Generating LSH hash table: hbits: 2(4) k 2 d 3 n= 7
n_group=2 Average neighbors:1.0
LSH time: 0.00661619999999985 Score: 6.333333333333334 Time: 0.000932080000000024
Purity: 1.00 NMI: 1.00 ARI: 1.00 Sil: 0.59 Acc: 1.00 Recall: 1.00 Precision: 1.00
Parameters:
X: Categorical dataset
y: Labels of object (for evaluation only)
n_init: Number of initializations
n_clusters: Number of target clusters
max_iter: Maximum iterations
verbose:
random_state:
If the variable MeasureManager.IS_LOAD_AUTO is set to "True": The DILCA will get the pre-caculated matrix
Outputs:
cluster_representatives: List of final representatives
labels_: Prediction labels
cost_: Final sum of squared distance from objects to their centroids
n_iter_: Number of iterations
epoch_costs_: Average time for an initialization
References:
T. N. Mau and V.-N. Huynh, ``An LSH-based k-Representatives Clustering Method for Large Categorical Data." Neurocomputing, Volume 463, 2021, Pages 29-44, ISSN 0925-2312, https://doi.org/10.1016/j.neucom.2021.08.050.
Bibtex:
@article{mau2021lsh,
title={An LSH-based k-representatives clustering method for large categorical data},
author={Mau, Toan Nguyen and Huynh, Van-Nam},
journal={Neurocomputing},
volume={463},
pages={29--44},
year={2021},
publisher={Elsevier}
}
pypi/github repository
https://pypi.org/project/lshkrepresentatives/
https://github.com/nmtoan91/lshkrepresentatives
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for lshkrepresentatives-1.2.0.tar.gz
Algorithm | Hash digest | |
---|---|---|
SHA256 | 0edfd1ee00170a6487c942ebd7b4387a53f993f0ec0b63d235be18df2ec952c9 |
|
MD5 | d7c99391c8bc1da9eb2ba56e6b215151 |
|
BLAKE2b-256 | fd5a30511d60539dc7525c1355be3b68d7900a32383d6ab8c71543963844cf91 |
Hashes for lshkrepresentatives-1.2.0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 8cc6a5ae9c09500c59c06376a39e770fd44975a39a3ca8823533b65b47e855b3 |
|
MD5 | 3b0e364373ed309bf971e9fbc7370338 |
|
BLAKE2b-256 | 9764a37a862072e1a7c25416060e95b51848352a47e72901f1bda25a9fa98804 |