Python library for document clustering

Project description

soyclustering: Python clustering algorithm library for document clustering

This package is implementation of Improving spherical k-means for document clustering: Fast initialization, sparse centroid projection, and efficient cluster labeling (Kim et al., 2020).

Cosine is more effective than Euclidean when measuring the distance between two high-dimensional (or sometimes, sparse) documents vectors. However, scikit-learn k-means package provides only Euclidean based k-means. Additionally, labeling clusters can be very helpful for interpreting the clustering results.

Spherical k-means works well both with sparse vector representation such as Bag-of-Words model or distributed representation such as Doc2Vec or other document embedding methods. In lower dimensional vector space, Silhouette score method is useful to define the number of clusters (k). However Silhouette score method does not work well in a high-dimensional vector space such as Bag-of-Words and Doc2Vec model space. One of the heuristic methods to define the number of clusters is to train k-means with large k first and subsequently merge similar ones. This method is also useful for preventing the Uniform Effect, which causes the size of all clusters to be similar.

soyclustering provides Spherical k-means (k-means which uses Cosine distance as a distance metric) and keyword extraction-based clustering labeling function. Furthermore, the function for visualizing pairwise distances between centroids will help you to check whether redundant clusters exist, allowing you to remove redundant clusters by merging them. soyclustering also provides a fast initialization method that is effective in a high-dimensional vector space. When the size of the input data is large, the initialization method sets k to be 1000. The below table decsribes relative speed-up of proposed initialization method with alpha=3 compared to kmeans++ (x times faster)

Dataset name	num rows	num features	num of nonzero (percent)	k=10	k=20	k=50	k=100
A6 blogs	63,153	91,302	18,051,341 (0.313 %)	x 265	x 257	x 213	x 150
Tucson blogs	105,755	81,497	29,192,999 (0.339 %)	x 388	x 440	x 306	x 244
Sonata blogs	229,253	85,129	60,861,803 (0.312 $)	x 785	x 841	x 614	x 495
IMDb reviews	1,228,348	68,049	181,411,713 (0.217 %)	x 803	x 714	x 1988	x 1787
Reuter RCV1	804,414	47,236	60,915,113 (0.160 %)	x 439	x 713	x 850	x 772
MovieLens 20M	138,493	131,262	20,000,263 (0.110 %)	x 202	x 213	x 214	x 184
Yelp reviews	5,261,669	27,247	365,341,887 (0.255 %)	x 368	x 908	x 1508	x 2917

Usage

You can read a matrix market file as follows. Note that the file must include tokenized outputs. Although the spherical k-means function can be used for inputs in distributed representation such as Doc2Vec, our cluster labeling algorithm works only for Bag-of-Words model.

from scipy.io import mmread
x = mmread(mm_file).tocsr()

Sperical k-means can be used as follows. init='similar_cut' indicates our initializer that is effective in a high-dimensional vector space. If you want to preserve the sparsity of the centroid vector, you can set minimum_df_factor. Other interfaces are similar to those of scikit-learn k-means function. With fit_predict, you can retrieve the labels from the clustering result.

from soyclustering import SphericalKMeans
spherical_kmeans = SphericalKMeans(
    n_clusters=1000,
    max_iter=10,
    verbose=1,
    init='similar_cut',
    sparsity='minimum_df',
    minimum_df_factor=0.05
)

labels = spherical_kmeans.fit_predict(x)

When the verbose mode is set, computation speed and the level of sparsity during the initizalition and each iteration is printed.

initialization_time=1.218108 sec, sparsity=0.00796
n_iter=1, changed=29969, inertia=15323.440, iter_time=4.435 sec, sparsity=0.116
n_iter=2, changed=5062, inertia=11127.620, iter_time=4.466 sec, sparsity=0.108
n_iter=3, changed=2179, inertia=10675.314, iter_time=4.463 sec, sparsity=0.105
n_iter=4, changed=1040, inertia=10491.637, iter_time=4.449 sec, sparsity=0.103
n_iter=5, changed=487, inertia=10423.503, iter_time=4.437 sec, sparsity=0.103
n_iter=6, changed=297, inertia=10392.490, iter_time=4.483 sec, sparsity=0.102
n_iter=7, changed=178, inertia=10373.646, iter_time=4.442 sec, sparsity=0.102
n_iter=8, changed=119, inertia=10362.625, iter_time=4.449 sec, sparsity=0.102
n_iter=9, changed=78, inertia=10355.905, iter_time=4.438 sec, sparsity=0.102
n_iter=10, changed=80, inertia=10350.703, iter_time=4.452 sec, sparsity=0.102

Cluster labeling can be used to intrepret the clustering results. The proportion_keywords function of soyclustering uses a keyword extraction-based method to return keywords describing each cluster. For its input arguments, you need to provide cluster centroid vectors, a list of vocabularies (as str) and labels.

from soyclustering import proportion_keywords

centers = spherical_kmeans.cluster_centers_
idx2vocab = ['list', 'of', 'str', 'vocab']
keywords = proportion_keywords(centers, labels, index2word=idx2vocab)

The table in below is the results of cluster labels from a trained k-means with k=1,000 and 1.2M documents.

The meaning of cluster (Human label)	Algorithm based extracted clustering labels
The movie “Titanic"	iceberg, zane, sinking, titanic, rose, winslet, camerons, 1997, leonardo, leo, ship, cameron, dicaprio, kate, tragedy, jack, di saster, james, romance, love, effects, special, story, people, best, ever, made
Heros in Marvle comics (Avengers)	zemo, chadwick, boseman, bucky, panther, holland, cap, infinity, mcu, russo, civil, bvs, antman, winter, ultron, airport, ave ngers, marvel, captain, superheroes, soldier, stark, evans, america, iron, spiderman, downey, tony, superhero, heroes
Alien movies such as Cover-field or District 9	skyline, jarrod, balfour, strause, invasion, independence, cloverfield, angeles, district, los, worlds, aliens, alien, la, budget, scifi, battle, cgi, day, effects, war, special, ending, bad, better, why, they, characters, their, people
Horror movies	gayheart, loretta, candyman, legends, urban, witt, campus, tara, reid, legend, alicia, englund, leto, rebecca, jared, scream, murders, slasher, helen, killer, student, college, students, teen, summer, cut, horror, final, sequel, scary
The movie “The Matrix"	neo, morpheus, neos, oracle, trinity, zion, architect, hacker, reloaded, revolutions, wachowski, fishburne, machines, agents, matrix, keanu, smith, reeves, agent, jesus, machine, computer, humans, fighting, fight, world, cool, real, special, effects

Setting a large k leads to redundant clusters. You can identify these redundant clusters by carefully examining the pairwise distance between the clusters.

from soyclustering import visualize_pairwise_distance

# visualize pairwise distance matrix
fig = visualize_pairwise_distance(centers, max_dist=.7, sort=True)

If you find redundant clusters, you can easily merge them into a single cluster.

from soyclustering import merge_close_clusters

group_centers, groups = merge_close_clusters(centers, labels, max_dist=.5)
fig = visualize_pairwise_distance(group_centers, max_dist=.7, sort=True)

After merging, you can check the size of dark squares in the diagonal entries of the pairwise distance matrix. If the redundant clusters are indeed successfully merged, the number of dark sqaures in the diagonal entries should have been reduced.

The function merge_close_clusters groups similar clusters, in which the distance between them is smaller than max_dist. The centroid of the new cluster is a weighted average of original centroid vectors. From the variable groups, you can return the cluster indices prior and after merging.

for group in groups:
    print(group)

[0, 19, 57, 68, 88, 115, 202, 223, 229, 237]
[1]
[2]
[3, 4, 5, 8, 12, 14, 16, 18, 20, 22, 26, 28, ...]
[6, 25, 29, 32, 37, 43, 45, 48, 53, 56, 65, ...]
[7, 17, 34, 41, 52, 59, 76, 79, 84, 87, 93, ...]
[9, 15, 24, 47, 51, 97]
[10, 100, 139]
[11, 23, 251]
...

See more

In addition, the repository kmeans_to_pyLDAvis provides k-means visualization using PyLDAVis

References

Kim, H., Kim, H. K., & Cho, S. (2020). Improving spherical k-means for document clustering: Fast initialization, sparse centroid projection, and efficient cluster labeling. Expert Systems with Applications, 150, 113288.

Project details

Release history Release notifications | RSS feed

This version

0.2.0

May 3, 2020

0.1.0

Aug 20, 2019

0.0.4

May 25, 2018

0.0.3

May 25, 2018

0.0.2

May 25, 2018

0.0.1

May 25, 2018

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

soyclustering-0.2.0.tar.gz (18.1 kB view details)

Uploaded May 3, 2020 Source

Built Distribution

soyclustering-0.2.0-py3-none-any.whl (28.8 kB view details)

Uploaded May 3, 2020 Python 3

File details

Details for the file soyclustering-0.2.0.tar.gz.

File metadata

Download URL: soyclustering-0.2.0.tar.gz
Upload date: May 3, 2020
Size: 18.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.44.1 CPython/3.7.7

File hashes

Hashes for soyclustering-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`fce7ed92671b26846114bfd893107d8fdbc8297bd035a5b8ad2999d066f1fb43`
MD5	`3c567fad7cdc63a2ec653030925cb507`
BLAKE2b-256	`b256730d8fe97cb2775bde52768d0cbe08cee76ef8c163bd95754d2a631ffdc9`

See more details on using hashes here.

Provenance

File details

Details for the file soyclustering-0.2.0-py3-none-any.whl.

File metadata

Download URL: soyclustering-0.2.0-py3-none-any.whl
Upload date: May 3, 2020
Size: 28.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.44.1 CPython/3.7.7

File hashes

Hashes for soyclustering-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e07a9025b85b7d0b6886b64854b57ef3934834756c2bf8a1b6aeff36348a63f1`
MD5	`0e7d3354ab614e3c186343e8cda90e88`
BLAKE2b-256	`56da383104eb15776319add42a216f377d76b4a6d0fe4b6b21fce507a8c27607`