a hierachical clustering algorithm based on information theory
Project description
Introduction
This repo contains code to compute the principal sequence of partition for Dilworth truncation function. Currently, two methods are available:
- Dilworth truncation based on graph maximal flow
- using paramatric maximal flow
Both method relies on Lemon Library to compute maximal flow for graph.
Agglomerative clustering method is implemented but cannot be used in general case because of FWRobust method is cursed with float accuracy.
How to build
main.cpp
is the user-implemented part. Run cmake
will generate build recipe for building this cpp file.
CMake with GTest
Testing is disabled by default, which requires gtest library. To enable it, run cmake
with -DENABLE_TESTING=ON
Lemon library
Use brute force search to solve submodualr function minimization(SFM) problem. For set with more than 10 elements, it is impractical. We use graph maximal flow(MF) to solve the special SFM problem. MF requires lemon library, which is disabled by default. To enable it, run cmake
with -DUSE_LEMON=ON
.
boost library
This project uses boost library in two places. Firstly, the main.cpp
uses boost-option to parse command-line arguments. But this feature is optional. To use it, run cmake
with -DUSE_BOOST_OPTION=ON
. Secondly, boost-python is used to make the procedure callable from Python.
Python binding
Disabled by default. The binding requires boost-python library. To enable it, run cmake
with -DUSE_PYTHON=ON
To make it independent of boost dynamic library, static linking should be enabled in CMAKE configuration.
To package the library, use python setup.py bdist_wheel
.
Demo code
import psp # classify the three data points shown in the above figure
g = psp.PyGraph(3, [(0,1,1),(1,2,1),(0,2,5)]) # index started from zero, similarity is 5 for vertex 0 and 2
g.run(False) # use maximal flow algorithm to classify them
print(g.get_critical_values()) # [2,5]
print(g.get_category(2)) # get the result which has at least 2 categories, which is [1,0,1]
Further experiment
In the directory utility, we make two simple experiments. The first is plot_art.py
,
which plots the clustering results for two artificial datasets.
The second one is empirical_compare.py
, which tests the info-clustering algorithm on 5 datasets
and compare the results with kmeans, affinity propagation, spectral clustering and agglomerative clustering.
For more detail, see experiment.
Parametric Dilworth Truncation(pdt) implementation
We provide another alternative implementation, which can be used similar to PyGraph.
import psp
g = psp.PyGraphPDT(3, [(0,1,1),(1,2,1),(0,2,5)]) # index started from zero, similarity is 5 for vertex 0 and 2
g.run() # use maximal flow algorithm to classify them
print(g.get_critical_values()) # [2,5]
print(g.get_category(2)) # get the result which has at least 2 categories, which is [0,1,0]
Reference
- [2016] Info-Clustering: A Mathematical Theory for Data Clustering
- https://github.com/ktrmnm/SFM
- [2010] A Faster Algorithm for Computing the Principal Sequence of Partitions of a Graph
- [2017] Info-Clustering: An Efficient Algorithm by Network Information Flow
ChangeLog
- Version 1.1: expose
Gaussian2DGraph
(C++) class, which can be used directly in python. - Version 1.2: expose
PyGraph
(C++) class, which is high customizable in python.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distributions
Built Distributions
Hashes for info_cluster-0.3-cp37-cp37m-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | a39fd1e2cd36103189eb9a9619993314389ef1ed9e26d57a3b8301419a935789 |
|
MD5 | 92844b7a92f9e820ee14a29d12da6fd4 |
|
BLAKE2b-256 | e5c9782429df121d1c21e7a18ffb09a3786c5265de145d122af60cc3daf71043 |
Hashes for info_cluster-0.3-cp37-cp37m-macosx_10_14_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 0483a8d9b8d2eee9013f964d0cc309e47262d17db621c8a279f27ba75a726c05 |
|
MD5 | 54d85365139ad7cf36a869831e67325c |
|
BLAKE2b-256 | 80bc38536d948e10e04f6395563b445ff8b9b37ef93691a22503b63ac267947d |
Hashes for info_cluster-0.3-cp36-cp36m-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 35dfcf7eb0cd38d8800be515650586096c078c3e50cd476eb2c3b7eacb3c89af |
|
MD5 | a75e60c993b05f12f61c3e2ed6665f29 |
|
BLAKE2b-256 | 76b9d76ffd40e99355ac321a3d283588330100d03db91371c5b68ca812bcb2ef |