Python library for document clustering
Project description
soyclustering: Python clustering algorithm library for document clustering
This package is implementation of Improving spherical kmeans for document clustering: Fast initialization, sparse centroid projection, and efficient cluster labeling (Kim et al., 2020).
Cosine is more effective than Euclidean when measuring the distance between two highdimensional (or sometimes, sparse) documents vectors. However, scikitlearn kmeans package provides only Euclidean based kmeans. Additionally, labeling clusters can be very helpful for interpreting the clustering results.
Spherical kmeans works well both with sparse vector representation such as BagofWords model or distributed representation such as Doc2Vec or other document embedding methods.
In lower dimensional vector space, Silhouette score method is useful to define the number of clusters (k
).
However Silhouette score method does not work well in a highdimensional vector space such as BagofWords and Doc2Vec model space.
One of the heuristic methods to define the number of clusters is to train kmeans with large k
first and subsequently merge similar ones.
This method is also useful for preventing the Uniform Effect, which causes the size of all clusters to be similar.
soyclustering
provides Spherical kmeans (kmeans which uses Cosine distance as a distance metric) and keyword extractionbased clustering labeling function.
Furthermore, the function for visualizing pairwise distances between centroids will help you to check whether redundant clusters exist, allowing you to remove redundant clusters by merging them.
soyclustering
also provides a fast initialization method that is effective in a highdimensional vector space.
When the size of the input data is large, the initialization method sets k to be 1000.
The below table decsribes relative speedup of proposed initialization method with alpha=3 compared to kmeans++ (x times faster)
Dataset name  num rows  num features  num of nonzero (percent)  k=10  k=20  k=50  k=100 

A6 blogs  63,153  91,302  18,051,341 (0.313 %)  x 265  x 257  x 213  x 150 
Tucson blogs  105,755  81,497  29,192,999 (0.339 %)  x 388  x 440  x 306  x 244 
Sonata blogs  229,253  85,129  60,861,803 (0.312 $)  x 785  x 841  x 614  x 495 
IMDb reviews  1,228,348  68,049  181,411,713 (0.217 %)  x 803  x 714  x 1988  x 1787 
Reuter RCV1  804,414  47,236  60,915,113 (0.160 %)  x 439  x 713  x 850  x 772 
MovieLens 20M  138,493  131,262  20,000,263 (0.110 %)  x 202  x 213  x 214  x 184 
Yelp reviews  5,261,669  27,247  365,341,887 (0.255 %)  x 368  x 908  x 1508  x 2917 
Usage
You can read a matrix market file as follows. Note that the file must include tokenized outputs. Although the spherical kmeans function can be used for inputs in distributed representation such as Doc2Vec, our cluster labeling algorithm works only for BagofWords model.
from scipy.io import mmread
x = mmread(mm_file).tocsr()
Sperical kmeans can be used as follows. init='similar_cut' indicates our initializer that is effective in a highdimensional vector space. If you want to preserve the sparsity of the centroid vector, you can set minimum_df_factor. Other interfaces are similar to those of scikitlearn kmeans function. With fit_predict, you can retrieve the labels from the clustering result.
from soyclustering import SphericalKMeans
spherical_kmeans = SphericalKMeans(
n_clusters=1000,
max_iter=10,
verbose=1,
init='similar_cut',
sparsity='minimum_df',
minimum_df_factor=0.05
)
labels = spherical_kmeans.fit_predict(x)
When the verbose mode is set, computation speed and the level of sparsity during the initizalition and each iteration is printed.
initialization_time=1.218108 sec, sparsity=0.00796
n_iter=1, changed=29969, inertia=15323.440, iter_time=4.435 sec, sparsity=0.116
n_iter=2, changed=5062, inertia=11127.620, iter_time=4.466 sec, sparsity=0.108
n_iter=3, changed=2179, inertia=10675.314, iter_time=4.463 sec, sparsity=0.105
n_iter=4, changed=1040, inertia=10491.637, iter_time=4.449 sec, sparsity=0.103
n_iter=5, changed=487, inertia=10423.503, iter_time=4.437 sec, sparsity=0.103
n_iter=6, changed=297, inertia=10392.490, iter_time=4.483 sec, sparsity=0.102
n_iter=7, changed=178, inertia=10373.646, iter_time=4.442 sec, sparsity=0.102
n_iter=8, changed=119, inertia=10362.625, iter_time=4.449 sec, sparsity=0.102
n_iter=9, changed=78, inertia=10355.905, iter_time=4.438 sec, sparsity=0.102
n_iter=10, changed=80, inertia=10350.703, iter_time=4.452 sec, sparsity=0.102
Cluster labeling can be used to intrepret the clustering results. The proportion_keywords
function of soyclustering
uses a keyword extractionbased method to return keywords describing each cluster. For its input arguments, you need to provide cluster centroid vectors, a list of vocabularies (as str) and labels.
from soyclustering import proportion_keywords
centers = spherical_kmeans.cluster_centers_
idx2vocab = ['list', 'of', 'str', 'vocab']
keywords = proportion_keywords(centers, labels, index2word=idx2vocab)
The table in below is the results of cluster labels from a trained kmeans with k=1,000 and 1.2M documents.
The meaning of cluster (Human label) 
Algorithm based extracted clustering labels 

The movie “Titanic"  iceberg, zane, sinking, titanic, rose, winslet, camerons, 1997, leonardo, leo, ship, cameron, dicaprio, kate, tragedy, jack, di saster, james, romance, love, effects, special, story, people, best, ever, made 
Heros in Marvle comics (Avengers)  zemo, chadwick, boseman, bucky, panther, holland, cap, infinity, mcu, russo, civil, bvs, antman, winter, ultron, airport, ave ngers, marvel, captain, superheroes, soldier, stark, evans, america, iron, spiderman, downey, tony, superhero, heroes 
Alien movies such as Coverfield or District 9  skyline, jarrod, balfour, strause, invasion, independence, cloverfield, angeles, district, los, worlds, aliens, alien, la, budget, scifi, battle, cgi, day, effects, war, special, ending, bad, better, why, they, characters, their, people 
Horror movies  gayheart, loretta, candyman, legends, urban, witt, campus, tara, reid, legend, alicia, englund, leto, rebecca, jared, scream, murders, slasher, helen, killer, student, college, students, teen, summer, cut, horror, final, sequel, scary 
The movie “The Matrix"  neo, morpheus, neos, oracle, trinity, zion, architect, hacker, reloaded, revolutions, wachowski, fishburne, machines, agents, matrix, keanu, smith, reeves, agent, jesus, machine, computer, humans, fighting, fight, world, cool, real, special, effects 
Setting a large k
leads to redundant clusters. You can identify these redundant clusters by carefully examining the pairwise distance between the clusters.
from soyclustering import visualize_pairwise_distance
# visualize pairwise distance matrix
fig = visualize_pairwise_distance(centers, max_dist=.7, sort=True)
If you find redundant clusters, you can easily merge them into a single cluster.
from soyclustering import merge_close_clusters
group_centers, groups = merge_close_clusters(centers, labels, max_dist=.5)
fig = visualize_pairwise_distance(group_centers, max_dist=.7, sort=True)
After merging, you can check the size of dark squares in the diagonal entries of the pairwise distance matrix. If the redundant clusters are indeed successfully merged, the number of dark sqaures in the diagonal entries should have been reduced.
The function merge_close_clusters
groups similar clusters, in which the distance between them is smaller than max_dist
.
The centroid of the new cluster is a weighted average of original centroid vectors.
From the variable groups
, you can return the cluster indices prior and after merging.
for group in groups:
print(group)
[0, 19, 57, 68, 88, 115, 202, 223, 229, 237]
[1]
[2]
[3, 4, 5, 8, 12, 14, 16, 18, 20, 22, 26, 28, ...]
[6, 25, 29, 32, 37, 43, 45, 48, 53, 56, 65, ...]
[7, 17, 34, 41, 52, 59, 76, 79, 84, 87, 93, ...]
[9, 15, 24, 47, 51, 97]
[10, 100, 139]
[11, 23, 251]
...
See more
In addition, the repository kmeans_to_pyLDAvis
provides kmeans visualization using PyLDAVis
References
 Kim, H., Kim, H. K., & Cho, S. (2020). Improving spherical kmeans for document clustering: Fast initialization, sparse centroid projection, and efficient cluster labeling. Expert Systems with Applications, 150, 113288.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for soyclustering0.2.0py3noneany.whl
Algorithm  Hash digest  

SHA256  e07a9025b85b7d0b6886b64854b57ef3934834756c2bf8a1b6aeff36348a63f1 

MD5  0e7d3354ab614e3c186343e8cda90e88 

BLAKE2b256  56da383104eb15776319add42a216f377d76b4a6d0fe4b6b21fce507a8c27607 