LSH-k-Representatives: Mixed categorical and numerical (ordinal and non-ordinal) data clustering algorithm
Project description
A clustering algorithm for mixed categorical and numerical (ordinal and non-ordinal) data, based on locality-sensitive hashing (LSH).
Notebook samples:
1. LSH-k-Representatives : Clustering of categorical attributes only:
https://github.com/nmtoan91/lshkrepresentatives/blob/main/notebook_sample_clustering_categorical_data.ipynb
2. LSH-k-Prototypes : Clustering of mixed data (categorical and numerical attributes):
https://github.com/nmtoan91/lshkrepresentatives/blob/main/notebook_sample_clustering_mixed_data_type.ipynb
3. LSH-k-Representatives-Full : Clustering of HUGE categorical attributes only:
https://github.com/nmtoan91/lshkrepresentatives/blob/main/notebook_sample_LSHkRepresentatives_Full.ipynb
4. Normalizing an unstructured dataset:
https://github.com/nmtoan91/lshkrepresentatives/blob/main/notebook_dataset_normalization.ipynb
Note 1: Unlike the k-Modes algorithm, LSH-k-Representatives defines "representatives" that keep the frequencies of all categorical values in each cluster. There are three algorithms: LSH-k-Representatives, LSH-k-Prototypes, and LSH-k-Representatives-Full.
Note 2: The dataset is normalized automatically if it contains strings, disjoint (non-contiguous) values, or NaN.
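To make Note 1 concrete, here is a minimal sketch of the difference between a k-Modes style center and a frequency-based representative. The function names (`mode_center`, `representative`, `dissimilarity`) are hypothetical and not part of this package. The simple-matching dissimilarity is also an assumption: the log output further down suggests the package itself uses a DILCA distance matrix.

```python
from collections import Counter

def mode_center(cluster):
    """k-Modes style center: keep only the most frequent value per attribute."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*cluster)]

def representative(cluster):
    """Representative: keep the relative frequency of every value per attribute."""
    n = len(cluster)
    return [{value: count / n for value, count in Counter(col).items()}
            for col in zip(*cluster)]

def dissimilarity(obj, rep):
    """Assumed measure: sum over attributes of (1 - frequency of the object's value)."""
    return sum(1.0 - freqs.get(value, 0.0) for value, freqs in zip(obj, rep))

# A toy cluster with two categorical attributes (illustrative data only).
cluster = [['red', 'small'], ['red', 'large'], ['blue', 'small'], ['red', 'small']]
mode = mode_center(cluster)      # the minority values 'blue' and 'large' are lost
rep = representative(cluster)    # every value is kept together with its frequency
print(mode)
print(rep)
print(dissimilarity(['blue', 'small'], rep))
```

With a mode, an object with the value 'blue' is simply a mismatch. With a representative, it is still partially matched, because 'blue' appears in a quarter of the cluster.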
Installation:
Using pip:
pip install lshkrepresentatives numpy scikit-learn pandas networkx termcolor
Import the packages:
import numpy as np
from LSHkRepresentatives.LSHkRepresentatives import LSHkRepresentatives
Generate a simple categorical dataset:
X = np.array([['red',0,np.nan],['green',1,1],['blue',0,0],[1,5111,1],[2,2,2],[2,6513,'rectangle'],[2,3,6565]])
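The dataset above mixes strings, non-contiguous integers, and NaN, which is the situation where Note 2 says the package normalizes the data automatically. The sketch below shows one plausible way to do such a normalization, by mapping each column's values to consecutive integer codes. It is an assumption for illustration, not the package's actual code, and `normalize_column` and `normalize_dataset` are hypothetical names.

```python
import math

def normalize_column(values):
    """Map arbitrary values of one attribute to consecutive codes 0, 1, 2, ..."""
    codes = {}
    encoded = []
    for v in values:
        # NaN != NaN in Python, so give all missing values one shared key.
        is_nan = isinstance(v, float) and math.isnan(v)
        key = "__nan__" if is_nan else str(v)
        if key not in codes:
            codes[key] = len(codes)  # next unused code, in order of first appearance
        encoded.append(codes[key])
    return encoded

def normalize_dataset(rows):
    """Normalize every column independently and return the rows again."""
    encoded_columns = [normalize_column(col) for col in zip(*rows)]
    return [list(row) for row in zip(*encoded_columns)]

rows = [['red', 0, float('nan')], ['green', 1, 1], ['blue', 0, 0],
        [1, 5111, 1], [2, 2, 2], [2, 6513, 'rectangle'], [2, 3, 6565]]
normalized = normalize_dataset(rows)
for row in normalized:
    print(row)
```

Values are compared by their string form, so `1` and `'1'` get the same code. This mirrors what happens when mixed values are stored in a single NumPy string array.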
Using LSH-k-Representatives (categorical clustering):
#Init instance of LSHkRepresentatives
kreps = LSHkRepresentatives(n_clusters=2,n_init=5)
#Do clustering for dataset X
labels = kreps.fit(X)
#Print the label for dataset X
print('Labels:',labels)
#Predict label for the random instance x
x = np.array(['red',5111,0])
label = kreps.predict(x)
print(f'Cluster of object {x} is: {label}')
Outcome:
SKIP LOADING distMatrix because: False bd=None
Generating disMatrix for DILCA
Saving DILCA to: saved_dist_matrices/json/DILCA_None.json
Generating LSH hash table: hbits: 2(4) k 1 d 3 n= 7
LSH time: 0.006518099999993865 Score: 6.333333333333334 Time: 0.0003226400000130525
Labels: [1 1 1 1 0 0 0]
Cluster of object [1 2 0] is: 1
Call the built-in evaluation metrics:
y = np.array([0,0,0,0,1,1,1])
kreps.CalcScore(y)
Outcome:
Purity: 1.00 NMI: 1.00 ARI: 1.00 Sil: 0.59 Acc: 1.00 Recall: 1.00 Precision: 1.00
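For readers who want to check two of these numbers by hand, the sketch below computes purity and the adjusted Rand index (ARI) with the standard textbook definitions, using only the standard library. It is an independent illustration and makes no claim about how `CalcScore` is implemented internally. The function names are hypothetical.

```python
from collections import Counter
from math import comb

def purity(y_true, y_pred):
    """Fraction of objects that belong to the majority true class of their cluster."""
    total = 0
    for cluster in set(y_pred):
        members = [t for t, p in zip(y_true, y_pred) if p == cluster]
        total += Counter(members).most_common(1)[0][1]
    return total / len(y_true)

def adjusted_rand_index(y_true, y_pred):
    """Pair-counting ARI; undefined (division by zero) for degenerate partitions."""
    n = len(y_true)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(y_true, y_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(y_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(y_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    maximum = (sum_true + sum_pred) / 2
    return (sum_cells - expected) / (maximum - expected)

# Ground truth and the labels printed in the clustering outcome above.
y_true = [0, 0, 0, 0, 1, 1, 1]
y_pred = [1, 1, 1, 1, 0, 0, 0]
p = purity(y_true, y_pred)
ari = adjusted_rand_index(y_true, y_pred)
print(f'Purity: {p:.2f} ARI: {ari:.2f}')
```

Both measures ignore the label names themselves, which is why swapping 0 and 1 between the ground truth and the prediction still gives a perfect score. scikit-learn offers `adjusted_rand_score` and `normalized_mutual_info_score` for the same purpose.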
Using LSH-k-Prototypes (clustering of mixed categorical and numerical attributes):
For example, suppose we have a dataset with 5 attributes (3 categorical and 2 numerical).
from LSHkRepresentatives.LSHkPrototypes import LSHkPrototypes
kprototypes = LSHkPrototypes(n_clusters=2,n_init=5)
X = np.array([['red',0,np.nan,1,1],
['green',1,1,0,0],
['blue',0,0,3,4],
[1,5111,1,1.1,1.2],
[2,2,2,29.0,38.9],
[2,6513,'rectangle',40,41.1],
['red',0,np.nan,30.4,30.1]])
attributeMasks = [0,0,0,1,1]
# attributeMasks = [0,0,0,1,1] means the attributes are
# [categorical,categorical,categorical,numerical,numerical]
a = kprototypes.fit(X,attributeMasks,numerical_weight=2, categorical_weight=1)
print(a)
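The `numerical_weight` and `categorical_weight` arguments balance the two kinds of attributes. The source does not state the exact formula, so the sketch below only shows the usual k-prototypes idea of a weighted sum: a mismatch count for categorical attributes plus a squared Euclidean distance for numerical ones. `mixed_distance` is a hypothetical helper, and the combination rule is an assumption rather than the package's documented behavior.

```python
def mixed_distance(obj, center, attribute_masks,
                   numerical_weight=1.0, categorical_weight=1.0):
    """Weighted sum of a categorical mismatch count and a squared numerical distance.

    attribute_masks uses the same convention as above: 0 = categorical, 1 = numerical.
    """
    categorical_part = 0.0
    numerical_part = 0.0
    for value, center_value, mask in zip(obj, center, attribute_masks):
        if mask == 1:
            # Numerical attribute: squared difference.
            numerical_part += (float(value) - float(center_value)) ** 2
        else:
            # Categorical attribute: 1 for a mismatch, 0 for a match.
            categorical_part += 0.0 if value == center_value else 1.0
    return categorical_weight * categorical_part + numerical_weight * numerical_part

attribute_masks = [0, 0, 0, 1, 1]
obj = ['red', 0, 1, 1.0, 1.0]
center = ['red', 1, 1, 0.0, 3.0]   # illustrative cluster center, not real output
d = mixed_distance(obj, center, attribute_masks,
                   numerical_weight=2, categorical_weight=1)
print(d)
```

Under this assumed rule, raising `numerical_weight` makes the numerical columns dominate the assignment of objects to clusters, while raising `categorical_weight` does the opposite.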
References:
T. N. Mau and V.-N. Huynh, "An LSH-based k-Representatives Clustering Method for Large Categorical Data." Neurocomputing, Volume 463, 2021, Pages 29-44, ISSN 0925-2312, https://doi.org/10.1016/j.neucom.2021.08.050.
Bibtex:
@article{mau2021lsh,
title={An LSH-based k-representatives clustering method for large categorical data},
author={Mau, Toan Nguyen and Huynh, Van-Nam},
journal={Neurocomputing},
volume={463},
pages={29--44},
year={2021},
publisher={Elsevier}
}
PyPI/GitHub repository:
https://pypi.org/project/lshkrepresentatives/
https://github.com/nmtoan91/lshkrepresentatives