A parameter-free fast clustering algorithm.
Project description
First Integer Neighbor Clustering Hierarchy (FINCH) Algorithm
The repository contains our Python and Matlab code for the proposed FINCH clustering algorithm described in our Efficient Parameter-free Clustering Using First Neighbor Relations CVPR 2019 oral paper.
@inproceedings{finch,
author = {M. Saquib Sarfraz and Vivek Sharma and Rainer Stiefelhagen},
title = {Efficient Parameter-free Clustering Using First Neighbor Relations},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {8934--8943}
year = {2019}
}
Requirements
Python : numpy, scipy, sklearn
Matlab : 2017 or above: may run on earlier versions as well
Optional. Install PyNNDescent to get first neighbours for large data
Usage:
Python : typically you would run:
from finch import FINCH
c, num_clust, req_c = FINCH(data)
You can set options e.g., required number of cluster or distance etc,
c, num_clust, req_c = FINCH(data, initial_rank=None, req_clust=None, distance='cosine', verbose=True)
See also readme in python repo.
Matlab : Please go in the path where you copied this folder or add its path to your Matlab path.
[c, num_clust]= FINCH(data, initial_rank, verbose);
Input:
- data: data Matrix (feature vecotrs in rows)
- initial_rank [Optional]: Nx1 (1-neighbour) indices vector. Pass empty [] to compute the 1st neighbor via pdist or flann
- verbos : printing some output
Output:
- c: N x P matrix Each column vector contains cluster labels for each partition P
- num_clust: shows total number of cluster in each partition P
In Matlab typically you would run:
[c, num_clust]= FINCH(data,[], 1);
Example: Cluster the STL-10 data (13000 images of 10 object classes. We provide the used 2048 CNN resnet features. Please load the data in Matlab from /data/STL_10/data.mat. This has 13000 vectors stored as a matrix of size (13000,2048), each vector is 2048 dimensional.
Now cluster it using FINCH, run the above command with tic toc to see the runtime. The run time includes computing first neighbours via exact distance and every thing.
On our machine it was < 3 seconds, it should be about the same depending upon if your machine has similar specs.
it returns the cluster labels for each partition in the variable c which will be of size (N x numPartitions) e.g., (13000, 5) in this case. Each column in array c provides cluster labels for that partition.
num_clust provides how many cluster it has produced in each partition or step of the run.
As you see: num_clust = [2061, 177, 37, 10, 2]
inidicating it found 2061 clusters in step 1, 177 in step 2 and so on to 10 clusters in step 4. You can pick the respective cluster labels for the data in the returned array c. For example, c(:,4) will provide labels for 10 clustering result.
[Evaluation]: The true labels for this data are also provided in the same repo in label.mat file. You can run any performance metric e.g., to compute NMI metric use nmi.mat function (provided in utils) and run nmi(labels, c(:,4)), or compute BCubed Fscore with b3(labels, c(:,4)).
Similarly you can run FINCH on other datasets e.g., we provide the used mnist10k and mice protein data.
Required number of Clusters
We provide a very simple approximation to refine one of the FINCH partition to come down to required number of clusters. However, please note this is not recommended and provided here just for completeness. One could use better ways to refine a partition, if needed.
Python: Set it in the req_clust
input option.
Matlab: req_c = req_numclust(c, data,req_clust)
Example:
req_c= req_numclust(c(:,3), data, 15)
Provides the required 15 clusters by refining the respective FINCH partition (37 clusters), from FINCH returned c mat in the above example. See function help for more details.
Note:
To run it on large data (for which computing pairwise distance matrix won't fit in your memory), install pynndescent (pip install pynndescent) or flann library for matlab (In matlab, see flann_nn.m). Note that you can also change when to switch to using Approximate NN in python [finch.py] line 17 or in the matlab function [clustRank.m] line 10.
Finally, if you use FINCH on 2D toy data (for visualization) then use euclidean distance. In python [finch.py] as input option or in matlab, the file [clustRank.m] line 11. as here you have xy coordinates of each point to cluster.
The code and FINCH algorithm is not meant for commercial use. Please contact the author below for licensing information.
M. Saquib Sarfraz (saquib.sarfraz@kit.edu)
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file finch-clust-0.1.0.tar.gz
.
File metadata
- Download URL: finch-clust-0.1.0.tar.gz
- Upload date:
- Size: 12.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.9.7
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 0042350b04627b159698785334d6501f845b93066b661bedc167f13e53ff2007 |
|
MD5 | 486ef294f76fe67588269c5e35896abc |
|
BLAKE2b-256 | b7dbb52a60377b51d0de6735ee0b13b0e062780b3f14cc6c8b0440e636d06fdf |