Skip to main content

Single-linkage connectivity clustering by cosine similarity

Project description

A fast algorithm computing connectivity components (clusters) of elements based on their feature cosine similarity. Two elements are connected by an edge if their cosine similarity is above a given threshold. Accepts either a numpy 2D array or a scipy column-sparse (CSR) matrix.

Install from PyPI with pip install silicon-clustering

Tomáš Gavenčiak gavento@ucw.cz

Usage example

import silicon, numpy
# 1000 rows, 10 features
data = numpy.random.rand(1000, 10)
# The ensemble normalizes rows of the data by default
# Choose high verbosity (via logging), cosine similarity >=0.97
ens = silicon.CosineClustering(data, sim_threshold=0.97, verbosity=2)
ens.run()
# (... progress reports)
ens.clusters[0]
# <Cluster no 0, size 13>
ens.clusters_by_size()[0]
# <Cluster no 36, size 22>
# With pyplot you can see the projected points
import matplotlib.pyplot as plt
ens.plot(); plt.show()
# or the individual clusters
ens.clusters_by_size()[0].plot(ens); plt.show()

Details

The algorithm uses several tricks to speed up the computation compared to traditional all-pair scalar products via matrix multiplication:

  • The data is projected into cell_dims principal components (PCA) by feature. The projection divides the data into cells of size self.distance so only rows from adjacent cells have a chance to be similar enough. Then the vectors of the rows of the adjacent cells are multiplied.

  • This would still mean that some cells are too big for matrix multiplication so a second trick is used before the cell multiplication: nibbling at cca 1000 random points of the dataset. For a random center row, the similarities to all other rows are computed and all similar points are clustered together (possibly merging existing clusters). This has the effect of pre-clustering most dense points of the dataset (esp. repeated values) - dense clusters have a good chance to be hit with a center, eliminating most of the mass of the cluster (as well as the respective cells). The number of nibble iterations should be tuned according to data (to reasonably decrease the cell size).

  • To combine the two tricks, a portion of the clustered points (e.g. 10%) together with all the unclustered points are considered for adjacent cell multiplication. The 10% returned points should ensure that the nibbled clusters are linked with any points not hit by the nibble but close to and in effect belonging to the clusters.

Since not all nibbled rows are used in the adjacent cell scalar product, the algorithm may miss few individual cluster connections at nibble ball boundaries, but we found it unlikely in practice.

The algorithm clusters 10000 gaussian points in 30 dimensions in under a second.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

silicon-clustering-0.3.1.tar.gz (11.9 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

silicon_clustering-0.3.1-py3-none-any.whl (16.6 kB view details)

Uploaded Python 3

silicon_clustering-0.3.1-py2-none-any.whl (16.6 kB view details)

Uploaded Python 2

File details

Details for the file silicon-clustering-0.3.1.tar.gz.

File metadata

File hashes

Hashes for silicon-clustering-0.3.1.tar.gz
Algorithm Hash digest
SHA256 28e94dfbd8d0806436b1222ec8658622d21e08354e957e1b52b8f0cc957dc20a
MD5 52652a872477815fac0dbcd74c3b934d
BLAKE2b-256 330725632217badc10da1bcbff428722b5c6b36dfcfccd4720e570bca0d96fba

See more details on using hashes here.

File details

Details for the file silicon_clustering-0.3.1-py3-none-any.whl.

File metadata

File hashes

Hashes for silicon_clustering-0.3.1-py3-none-any.whl
Algorithm Hash digest
SHA256 0e221d3e68c0a9b62ed42039dfadd54c610f962b313fc15952b76b6a68e4b446
MD5 eb491ac399a008fad835cd9dbf198c93
BLAKE2b-256 327a916fe5913e8d9dd02b5d2061e980bfc0ee7305949deaf9d4c94f570d3f17

See more details on using hashes here.

File details

Details for the file silicon_clustering-0.3.1-py2-none-any.whl.

File metadata

File hashes

Hashes for silicon_clustering-0.3.1-py2-none-any.whl
Algorithm Hash digest
SHA256 264ceb7e3df1f39813e25a65276a624ec5540df1136f030e4fe7715ccca47f64
MD5 7b93e3c87abbdc6c09d7f6e9c200d150
BLAKE2b-256 195cf323c6338361548ba082f01369b95b00ac46346eef1103aeb1bcf2094e4c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page