Skip to main content

Python library for document clustering

Project description

soyclustering: Python clustering algorithm library for document clustering

This package is implementation of Improving spherical k-means for document clustering: Fast initialization, sparse centroid projection, and efficient cluster labeling (Kim et al., 2020).

Cosine is more effective than Euclidean when measuring the distance between two high-dimensional (or sometimes, sparse) documents vectors. However, scikit-learn k-means package provides only Euclidean based k-means. Additionally, labeling clusters can be very helpful for interpreting the clustering results.

Spherical k-means works well both with sparse vector representation such as Bag-of-Words model or distributed representation such as Doc2Vec or other document embedding methods. In lower dimensional vector space, Silhouette score method is useful to define the number of clusters (k). However Silhouette score method does not work well in a high-dimensional vector space such as Bag-of-Words and Doc2Vec model space. One of the heuristic methods to define the number of clusters is to train k-means with large k first and subsequently merge similar ones. This method is also useful for preventing the Uniform Effect, which causes the size of all clusters to be similar.

soyclustering provides Spherical k-means (k-means which uses Cosine distance as a distance metric) and keyword extraction-based clustering labeling function. Furthermore, the function for visualizing pairwise distances between centroids will help you to check whether redundant clusters exist, allowing you to remove redundant clusters by merging them. soyclustering also provides a fast initialization method that is effective in a high-dimensional vector space. When the size of the input data is large, the initialization method sets k to be 1000. The below table decsribes relative speed-up of proposed initialization method with alpha=3 compared to kmeans++ (x times faster)

Dataset name num rows num features num of nonzero (percent) k=10 k=20 k=50 k=100
A6 blogs 63,153 91,302 18,051,341 (0.313 %) x 265 x 257 x 213 x 150
Tucson blogs 105,755 81,497 29,192,999 (0.339 %) x 388 x 440 x 306 x 244
Sonata blogs 229,253 85,129 60,861,803 (0.312 $) x 785 x 841 x 614 x 495
IMDb reviews 1,228,348 68,049 181,411,713 (0.217 %) x 803 x 714 x 1988 x 1787
Reuter RCV1 804,414 47,236 60,915,113 (0.160 %) x 439 x 713 x 850 x 772
MovieLens 20M 138,493 131,262 20,000,263 (0.110 %) x 202 x 213 x 214 x 184
Yelp reviews 5,261,669 27,247 365,341,887 (0.255 %) x 368 x 908 x 1508 x 2917

Usage

You can read a matrix market file as follows. Note that the file must include tokenized outputs. Although the spherical k-means function can be used for inputs in distributed representation such as Doc2Vec, our cluster labeling algorithm works only for Bag-of-Words model.

from scipy.io import mmread
x = mmread(mm_file).tocsr()

Sperical k-means can be used as follows. init='similar_cut' indicates our initializer that is effective in a high-dimensional vector space. If you want to preserve the sparsity of the centroid vector, you can set minimum_df_factor. Other interfaces are similar to those of scikit-learn k-means function. With fit_predict, you can retrieve the labels from the clustering result.

from soyclustering import SphericalKMeans
spherical_kmeans = SphericalKMeans(
    n_clusters=1000,
    max_iter=10,
    verbose=1,
    init='similar_cut',
    sparsity='minimum_df',
    minimum_df_factor=0.05
)

labels = spherical_kmeans.fit_predict(x)

When the verbose mode is set, computation speed and the level of sparsity during the initizalition and each iteration is printed.

initialization_time=1.218108 sec, sparsity=0.00796
n_iter=1, changed=29969, inertia=15323.440, iter_time=4.435 sec, sparsity=0.116
n_iter=2, changed=5062, inertia=11127.620, iter_time=4.466 sec, sparsity=0.108
n_iter=3, changed=2179, inertia=10675.314, iter_time=4.463 sec, sparsity=0.105
n_iter=4, changed=1040, inertia=10491.637, iter_time=4.449 sec, sparsity=0.103
n_iter=5, changed=487, inertia=10423.503, iter_time=4.437 sec, sparsity=0.103
n_iter=6, changed=297, inertia=10392.490, iter_time=4.483 sec, sparsity=0.102
n_iter=7, changed=178, inertia=10373.646, iter_time=4.442 sec, sparsity=0.102
n_iter=8, changed=119, inertia=10362.625, iter_time=4.449 sec, sparsity=0.102
n_iter=9, changed=78, inertia=10355.905, iter_time=4.438 sec, sparsity=0.102
n_iter=10, changed=80, inertia=10350.703, iter_time=4.452 sec, sparsity=0.102

Cluster labeling can be used to intrepret the clustering results. The proportion_keywords function of soyclustering uses a keyword extraction-based method to return keywords describing each cluster. For its input arguments, you need to provide cluster centroid vectors, a list of vocabularies (as str) and labels.

from soyclustering import proportion_keywords

centers = spherical_kmeans.cluster_centers_
idx2vocab = ['list', 'of', 'str', 'vocab']
keywords = proportion_keywords(centers, labels, index2word=idx2vocab)

The table in below is the results of cluster labels from a trained k-means with k=1,000 and 1.2M documents.

The meaning of cluster
(Human label)
Algorithm based extracted clustering labels
The movie “Titanic" iceberg, zane, sinking, titanic, rose, winslet, camerons, 1997, leonardo, leo, ship, cameron, dicaprio, kate, tragedy, jack, di saster, james, romance, love, effects, special, story, people, best, ever, made
Heros in Marvle comics (Avengers) zemo, chadwick, boseman, bucky, panther, holland, cap, infinity, mcu, russo, civil, bvs, antman, winter, ultron, airport, ave ngers, marvel, captain, superheroes, soldier, stark, evans, america, iron, spiderman, downey, tony, superhero, heroes
Alien movies such as Cover-field or District 9 skyline, jarrod, balfour, strause, invasion, independence, cloverfield, angeles, district, los, worlds, aliens, alien, la, budget, scifi, battle, cgi, day, effects, war, special, ending, bad, better, why, they, characters, their, people
Horror movies gayheart, loretta, candyman, legends, urban, witt, campus, tara, reid, legend, alicia, englund, leto, rebecca, jared, scream, murders, slasher, helen, killer, student, college, students, teen, summer, cut, horror, final, sequel, scary
The movie “The Matrix" neo, morpheus, neos, oracle, trinity, zion, architect, hacker, reloaded, revolutions, wachowski, fishburne, machines, agents, matrix, keanu, smith, reeves, agent, jesus, machine, computer, humans, fighting, fight, world, cool, real, special, effects

Setting a large k leads to redundant clusters. You can identify these redundant clusters by carefully examining the pairwise distance between the clusters.

from soyclustering import visualize_pairwise_distance

# visualize pairwise distance matrix
fig = visualize_pairwise_distance(centers, max_dist=.7, sort=True)

If you find redundant clusters, you can easily merge them into a single cluster.

from soyclustering import merge_close_clusters

group_centers, groups = merge_close_clusters(centers, labels, max_dist=.5)
fig = visualize_pairwise_distance(group_centers, max_dist=.7, sort=True)

After merging, you can check the size of dark squares in the diagonal entries of the pairwise distance matrix. If the redundant clusters are indeed successfully merged, the number of dark sqaures in the diagonal entries should have been reduced.

The function merge_close_clusters groups similar clusters, in which the distance between them is smaller than max_dist. The centroid of the new cluster is a weighted average of original centroid vectors. From the variable groups, you can return the cluster indices prior and after merging.

for group in groups:
    print(group)
[0, 19, 57, 68, 88, 115, 202, 223, 229, 237]
[1]
[2]
[3, 4, 5, 8, 12, 14, 16, 18, 20, 22, 26, 28, ...]
[6, 25, 29, 32, 37, 43, 45, 48, 53, 56, 65, ...]
[7, 17, 34, 41, 52, 59, 76, 79, 84, 87, 93, ...]
[9, 15, 24, 47, 51, 97]
[10, 100, 139]
[11, 23, 251]
...

See more

In addition, the repository kmeans_to_pyLDAvis provides k-means visualization using PyLDAVis

References

  • Kim, H., Kim, H. K., & Cho, S. (2020). Improving spherical k-means for document clustering: Fast initialization, sparse centroid projection, and efficient cluster labeling. Expert Systems with Applications, 150, 113288.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

soyclustering-0.2.0.tar.gz (18.1 kB view hashes)

Uploaded Source

Built Distribution

soyclustering-0.2.0-py3-none-any.whl (28.8 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page