RBHC

Recursive Binary Hierarchical Clustering

This package performs recursive binary hierarchical clustering of data.
The K-Means algorithm is applied to the initial dataset to create a binary partition, after which the chi-square score statistic is used to find the feature (event) responsible for the partition. The resulting clusters are divided further, recursively, using the same approach until the cluster size reaches 1 or the silhouette score reaches the threshold value.
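
The procedure described above can be illustrated with a minimal sketch. This is not the package's own code: it assumes scikit-learn and NumPy are installed, that feature values are non-negative (as sklearn's chi2 requires), and that a split is kept only when its silhouette score is at least the threshold. The function name split_recursively and the node naming scheme are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_selection import chi2
from sklearn.metrics import silhouette_score


def split_recursively(X, feature_names, threshold=0.65, name="root"):
    """Return a nested dict describing recursive 2-means splits of X."""
    node = {"name": name, "size": int(X.shape[0]), "children": []}
    # silhouette_score needs at least 3 samples; 2-means needs 2 distinct rows.
    if X.shape[0] < 3 or len(np.unique(X, axis=0)) < 2:
        return node

    # Binary partition with K-Means (k = 2).
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    if len(np.unique(labels)) < 2:
        return node

    # Assumption: stop when the split's silhouette score is below the threshold.
    if silhouette_score(X, labels) < threshold:
        return node

    # The feature with the highest chi-square score is treated as the one
    # primarily responsible for the partition.
    scores, _ = chi2(X, labels)
    node["primary_feature"] = feature_names[int(np.nanargmax(scores))]

    for side in (0, 1):
        child = split_recursively(
            X[labels == side], feature_names, threshold, f"{name}-{side}"
        )
        node["children"].append(child)
    return node


# Hypothetical toy data: two groups separated almost entirely by feature f0.
X = np.array(
    [[10, 1], [11, 1], [12, 1], [0, 1], [1, 1], [0, 2]], dtype=float
)
tree = split_recursively(X, ["f0", "f1"])
print(tree["primary_feature"], [c["size"] for c in tree["children"]])
```

In this toy run the root is split once on f0, and both halves stop because any further split scores below the threshold.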

Installation

Prerequisites: Python 3

pip install RBHC

Usage

from RBHC import clustering
clustering(dataFilePath,thresholdValue)
  • dataFilePath = Path to the data file (see Data File Structure below)
  • thresholdValue = Silhouette value threshold (optional; the default is 0.65)

The function returns a JSON tree structure with the following important fields:

  • name = Name of cluster node (string)
  • parent = Name of its parent node (string)
  • size = Size of cluster (integer)
  • children = Tree structure of subtree (List)
  • clusterCreated = If clustering has been successful (Boolean)

To see a sample of this return value, run clustering on the sample dataset provided and print the output, or check visualisation/sampleData.json.
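
A tree with these fields can be traversed with a short recursive generator. The sketch below is illustrative only: the sample_tree values are hypothetical, and it assumes the return value is already a Python dict (if it is a JSON string, pass it through json.loads first).

```python
def walk(node, depth=0):
    """Yield (depth, name, size) for every node in the tree, parents first."""
    yield depth, node["name"], node["size"]
    for child in node.get("children", []):
        yield from walk(child, depth + 1)


# Hypothetical sample shaped like the fields listed above;
# real node names and sizes will differ.
sample_tree = {
    "name": "root", "parent": "null", "size": 5, "clusterCreated": True,
    "children": [
        {"name": "A", "parent": "root", "size": 3,
         "clusterCreated": False, "children": []},
        {"name": "B", "parent": "root", "size": 2,
         "clusterCreated": False, "children": []},
    ],
}

rows = list(walk(sample_tree))
for depth, name, size in rows:
    # Indent each node by its depth to show the hierarchy.
    print("  " * depth + f"{name} (size={size})")
```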

To run this program interactively in a Jupyter notebook, run the command jupyter notebook in the root directory; the notebook then opens on localhost.

Statistics

Once the program has run, clustering statistics are stored in statistics/hierarchical/nameOfDataFile/. For each sub-cluster created, statistics are stored in a .json file with the following attributes:

  • ClusterId = Identifier of a sub-cluster (L = level, G = number of the cluster within that level, counted left to right)
  • Size = Size of the cluster
  • Primary feature cluster created by = Name of the feature primarily responsible for forming this cluster
  • Features chi score = Chi score of every feature in that cluster
  • Stats on cluster by each feature = Statistics of each feature in this cluster
  • Ids = All instances that are part of the cluster; names are taken from column[0] of the data file
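
The statistics files can be collected with the standard library. The following is a rough sketch, assuming each .json file holds one object whose keys match the attribute names listed above exactly; the helper names load_cluster_stats and summarize are hypothetical.

```python
import json
from pathlib import Path


def load_cluster_stats(stats_dir):
    """Parse every .json file in stats_dir, in file-name order."""
    paths = sorted(Path(stats_dir).glob("*.json"))
    return [json.loads(p.read_text(encoding="utf-8")) for p in paths]


def summarize(records):
    """Return (ClusterId, Size, primary feature) per record; None if a key is absent."""
    return [
        (
            r.get("ClusterId"),
            r.get("Size"),
            r.get("Primary feature cluster created by"),
        )
        for r in records
    ]
```

For example, summarize(load_cluster_stats("statistics/hierarchical/nameOfDataFile")) would list one tuple per sub-cluster file, with nameOfDataFile replaced by the actual data file name.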

Visualisation

Copy the visualisation folder to the directory where clustering is being used.
nameOfDataFile.json will be created in the visualisation folder for clustering visualisation.
In the visualisation folder, run python -m http.server 8888 and then open http://localhost:8888/ in a web browser.

Data File Structure

IDS         | feature1    |                     | featureN
------------|-------------|---------------------|-----------------
ID1         |  value1     |                     |  valueN
            |             |                     |  
            |             |                     |
            |             |                     |  

All data files should be stored in the data folder; see the data folder for a sample .csv data file.
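
A file in this layout can be checked before clustering with a small reader. This sketch assumes a comma-delimited file with numeric feature values; the function name read_data_file and the sample content are hypothetical.

```python
import csv
import io


def read_data_file(handle):
    """Parse a CSV whose first column holds IDs and whose other columns are features."""
    reader = csv.reader(handle)
    header = next(reader)
    feature_names = header[1:]
    ids, rows = [], []
    for line_no, record in enumerate(reader, start=2):
        if not record:
            continue  # skip blank lines
        if len(record) != len(header):
            raise ValueError(
                f"line {line_no}: expected {len(header)} fields, got {len(record)}"
            )
        ids.append(record[0])
        # Assumption: every feature value is numeric.
        rows.append([float(value) for value in record[1:]])
    return feature_names, ids, rows


# Hypothetical file content following the table above.
sample = io.StringIO("IDS,feature1,feature2\nID1,1,0\nID2,0,3\n")
feature_names, ids, rows = read_data_file(sample)
print(feature_names, ids, rows)
```

With a real file, open it with open(path, newline="") and pass the handle to read_data_file.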

Contribution and license

Download files

Source Distribution

RBHC-1.0.1.tar.gz (5.5 kB view details)

Uploaded Source

Built Distribution

RBHC-1.0.1-py3-none-any.whl (6.6 kB view details)

Uploaded Python 3

File details

Details for the file RBHC-1.0.1.tar.gz.

File metadata

  • Download URL: RBHC-1.0.1.tar.gz
  • Upload date:
  • Size: 5.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3 requests-toolbelt/0.8.0 tqdm/4.45.0 CPython/3.7.6

File hashes

Hashes for RBHC-1.0.1.tar.gz
Algorithm Hash digest
SHA256 38c85f11d53c5a3f13789b93d039c703a7c259c3285c5d6e4d9be63b0b722d7e
MD5 422fa485e427441aed8b0fcf5ac8d0ec
BLAKE2b-256 b35ae88406b04026ada1cfb0df42da89c52581f9cd137243da9fcdc380f40db9

File details

Details for the file RBHC-1.0.1-py3-none-any.whl.

File metadata

  • Download URL: RBHC-1.0.1-py3-none-any.whl
  • Upload date:
  • Size: 6.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3 requests-toolbelt/0.8.0 tqdm/4.45.0 CPython/3.7.6

File hashes

Hashes for RBHC-1.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 20516de7a1a47fd8ddaed0beb7f9cb3bf1e5371835093b4441861abdae4c645f
MD5 1a1d7a9b8037d17b41e030ce5717545f
BLAKE2b-256 633e8aba4472c98ee77fceed8e417e38d876bbc20d3a0bb599ee851b8fd49d04
