Skip to main content

easy interface for ensemble clustering

Project description

flexible-clustering-tree


What’s this?

In the context of clustering task, flexible-clustering-tree provides you easy and controllable clustering framework.

|image0|

Background

Let’s suppose, you have huge data. You’d like to observe data as easy as possible.

Hierarchical clustering is a way to see clustering tree. However, hierarchical clustering tends to fall into local optimization.

So, you need other clustering method. But at the same time, you wanna observe your data with tree structure style.

here, flexible-clustering-tree could give you simple way from data into tree viewer(d3 based)

You could set any kinds of clustering algorithm such as Kmeans, DBSCAN, Spectral-Clustering.

Multi feature and Multi clustering

During making a tree, you might want use various kind of clustering algorithm. For example, you use Kmeans for the 1st later of a tree, and DBSCAN for the 2nd layer of a tree.

And you might also use various kind of feature type depending on a layer of a tree. For example, in the context of document clustering, “title” of news for the 1st layer, and “whole text” for the 2nd layer.

The example below, this is a clustering tree of 20-news data set.

  • 1st layer(red highlight) is done with HDBSCAN clustering, and feature is dense vector of Subject text, which is converted by word2vec model.
  • 2nd layer(blue highlight) is done with Kmeans, and feature is sparse vector of whole text(BOW).

You could design your clustering tree as you want!

|image1|

Both are possible flexible-clustering-tree!

Contribution

  • Easy interface(scikit-learn way) from data(feature matrix) into a tree viewer
  • Possible to make various clustering algorithms ensemble
  • Possible to set various feature types

How to use?

.. code:: python

from flexible_clustering_tree import FeatureMatrixObject, MultiFeatureMatrixObject from flexible_clustering_tree import ClusteringOperator, MultiClusteringOperator from flexible_clustering_tree import FlexibleClustering import numpy import codecs

set feature matrix

an input of 1st layer is list of string

input_string = ['d'] * 10 + ['e'] * 10 + ['c'] * 10 + ['a'] * 10 + ['b'] * 10 + ['f'] * 50 f_obj_1st = FeatureMatrixObject(0, feature_strings=input_string)

an input of 2nd layer is the dense matrix (100, 300)

f_obj_2nd = FeatureMatrixObject(1, matrix_object=numpy.random.rand(100, 300))

an input of 3rd layer is the dense matrix (100, 50)

f_obj_3rd = FeatureMatrixObject(2, matrix_object=numpy.random.rand(100, 50)) dict_index2label = {i: "label-{}".format(i) for i in range(0, 100)} multi_feature_matrix = MultiFeatureMatrixObject( [f_obj_1st, f_obj_2nd, f_obj_3rd], dict_index2label=dict_index2label )

set clustering operation

from sklearn.cluster.k_means_ import KMeans from hdbscan.hdbscan_ import HDBSCAN from flexible_clustering_tree import StringAggregation c_operation_1st = ClusteringOperator(0, -1, StringAggregation()) c_operation_2nd = ClusteringOperator(1, 10, KMeans(10)) c_operation_3rd = ClusteringOperator(2, -1, HDBSCAN()) multi_clustering = MultiClusteringOperator([c_operation_1st, c_operation_2nd])

run flexible clustering

clustering_runner = FlexibleClustering(max_depth=5) index2cluster_no = clustering_runner.fit_transform(multi_feature_matrix, multi_clustering) html = clustering_runner.clustering_tree.to_html()

output to html

with codecs.open("out.html", "w", "utf-8") as f: f.write(html)

you're also able to generate tables via Pandas.

import pandas table_objects = clustering_runner.clustering_tree.to_objects() print(pandas.DataFrame(table_objects['cluster_information'])) print(pandas.DataFrame(table_objects['leaf_information']))

The output of pandas table is below.

The relation-table of clusters is in cluster_information.

::

   cluster_id  parent_id  depth_level  clustering_method

0 0 -1 1 StringAggregation 1 1 -1 1 StringAggregation 2 2 -1 1 StringAggregation 3 3 -1 1 StringAggregation 4 4 -1 1 StringAggregation 5 5 -1 1 StringAggregation 6 6 5 2 KMeans 7 7 5 2 KMeans 8 8 5 2 KMeans 9 9 5 2 KMeans 10 10 5 2 KMeans 11 11 5 2 KMeans 12 12 5 2 KMeans 13 13 5 2 KMeans 14 14 5 2 KMeans 15 15 5 2 KMeans

The relation-table of leaf nodes is in leaf_information.

::

   leaf_id  cluster_id     label  args

0 0 0 label-0 None 1 1 0 label-1 None 2 2 0 label-2 None 3 3 0 label-3 None 4 4 0 label-4 None .. ... ... ... ... 95 95 14 label-95 None 96 96 8 label-96 None 97 97 13 label-97 None 98 98 10 label-98 None 99 99 12 label-99 None [100 rows x 4 columns]

You could see examples at /examples.

setup

.. code:: bash

pip install flexible_clustering_tree

or close this repository

.. code:: bash

python setup.py install

For Developers

Environment

  • Python >= 3.x

Dev/Test environment by Docker

You build dev/test environment with Docker container. Here is simple procedure,

  1. build docker image
  2. start docker container
  3. run test in the container

.. code:: bash

$ cd tests $ docker-compose build $ docker-compose up $ docker run --name test-container -v pwd:/codes/flexible-clustering-tree/ -dt tests_dev_env $ docker exec -it test-container python /codes/flexible-clustering-tree/setup.py test

If you’re using pycharm professional edition, you could call a docker-compose file as Python interpreter.

.. |image0| image:: https://user-images.githubusercontent.com/1772712/47308081-9980cd00-d66b-11e8-98c0-a275db021cd7.gif .. |image1| image:: https://user-images.githubusercontent.com/1772712/47308468-abaf3b00-d66c-11e8-9a08-26facc39e80e.png

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

flexible_clustering_tree-0.21.tar.gz (51.1 kB view details)

Uploaded Source

Built Distribution

flexible_clustering_tree-0.21-py3-none-any.whl (49.3 kB view details)

Uploaded Python 3

File details

Details for the file flexible_clustering_tree-0.21.tar.gz.

File metadata

  • Download URL: flexible_clustering_tree-0.21.tar.gz
  • Upload date:
  • Size: 51.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.4.0 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.6.8

File hashes

Hashes for flexible_clustering_tree-0.21.tar.gz
Algorithm Hash digest
SHA256 bf15002fe0edd6d3b1a580e536c4cf356a63e0167eba8b133e55d449d3cc9550
MD5 15a1e39856bc4f08603b0eff532ee815
BLAKE2b-256 1a82bdf9309d0cd4f8a0c6d356feb76889faba81ad36429552451c64e6824c74

See more details on using hashes here.

File details

Details for the file flexible_clustering_tree-0.21-py3-none-any.whl.

File metadata

  • Download URL: flexible_clustering_tree-0.21-py3-none-any.whl
  • Upload date:
  • Size: 49.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.4.0 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.6.8

File hashes

Hashes for flexible_clustering_tree-0.21-py3-none-any.whl
Algorithm Hash digest
SHA256 b9d320b32dc9c90f9d35fc4c8d60f21a027b77f59abc257891511eda7a73cd14
MD5 721c3e023f0915229465cc89e42832f8
BLAKE2b-256 a7433672c1b458afaec869351c5a1e32a0d095ee5585bb4c2c9d2cd50ca9ae95

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page