
Clustering algorithms for text collections


Document Clustering for Selective Search

This repository contains an open-source implementation of the SB² K-means clustering algorithm for document collections. It supports a standard KLD-based distance metric, as well as the query-biased distance metric QKLD and the query-biased centroid initialization QInit. Tokenization and vectorization are performed using scikit-learn, and the clustering algorithm itself is implemented in custom Cython code, parallelized wherever possible for efficiency.
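The KLD-based metric compares a document's term distribution to a cluster centroid's using the Kullback-Leibler divergence. A minimal NumPy sketch of that distance (illustrative only, not the package's Cython implementation; the smoothing constant is an assumption to keep the logarithm finite for unseen terms):

```python
import numpy as np

def kld_distance(doc, centroid, smoothing=1e-9):
    """KL divergence D(doc || centroid) between two term distributions.

    Both inputs are arrays of term probabilities over the same vocabulary.
    A small smoothing constant avoids log(0) for terms absent from one side.
    """
    p = np.asarray(doc, dtype=float) + smoothing
    q = np.asarray(centroid, dtype=float) + smoothing
    p /= p.sum()  # renormalize after smoothing
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

uniform = np.array([0.25, 0.25, 0.25, 0.25])
skewed = np.array([0.70, 0.10, 0.10, 0.10])
print(kld_distance(uniform, uniform))  # (near-)zero: identical distributions
print(kld_distance(skewed, uniform))   # positive: distributions differ
```

In K-means-style clustering, each document is assigned to the centroid with the smallest such divergence; QKLD additionally weights the terms by query statistics (see the references below).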

Installation

Simply install the module from PyPI using:

pip install document_clustering

To locally install the dependencies and document_clustering module using Poetry, run:

poetry install

Usage

The clustering API closely follows the scikit-learn API:

from document_clustering import Clustering

num_clusters = 5
clustering = Clustering(num_clusters).fit(['document A', 'document B'])

mapping = clustering.transform(['document C'])

To use the QKLD distance metric (instead of the default KLD), pass metric='qkld' and supply queries in the call to fit:

clustering = Clustering(num_clusters, metric='qkld')
clustering.fit(['document A', 'document B'], ['query 1', 'query 2'])

mapping = clustering.transform(['document C'])

To use the QInit centroid initialization, supply the centroid_init and glove_vectors parameters, and again supply queries in the fit call:

clustering = Clustering(num_clusters, centroid_init='qinit', glove_vectors='/path/to/glove.6B.100d.txt')
clustering.fit(['document A', 'document B'], ['query 1', 'query 2'])

mapping = clustering.transform(['document C'])
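QInit biases the initial centroids toward documents that resemble the training queries, using GloVe word embeddings to compare them. A toy sketch of that idea, assuming dense document and query vectors (e.g. averaged GloVe embeddings); qinit_seeds and its greedy selection strategy are hypothetical illustrations, not the package's actual algorithm:

```python
import numpy as np

def qinit_seeds(doc_vectors, query_vectors, k, rng=None):
    """Pick k seed documents biased toward the query distribution.

    doc_vectors:   (n_docs, dim) document embeddings.
    query_vectors: (n_queries, dim) query embeddings.
    For each seed, draw a random query and take the most similar
    document that has not been chosen yet.
    """
    rng = rng or np.random.default_rng(0)
    chosen = []
    for _ in range(k):
        q = query_vectors[rng.integers(len(query_vectors))]
        sims = doc_vectors @ q  # dot-product similarity to the drawn query
        for idx in np.argsort(sims)[::-1]:  # best match first
            if idx not in chosen:
                chosen.append(int(idx))
                break
    return chosen

docs = np.eye(4)         # four orthogonal toy "documents"
queries = np.eye(4)[:2]  # queries aligned with the first two documents
print(qinit_seeds(docs, queries, k=2))
```

The intuition is that centroids seeded near query-relevant documents tend to concentrate relevant documents in few shards, which is what selective search exploits.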

Finally, to perform size-bounded clustering, use the split_large_shards and merge_small_shards parameters:

clustering = Clustering(num_clusters, split_large_shards=True, merge_small_shards=True)
clustering.fit(['document A', 'document B'])

mapping = clustering.transform(['document C'])
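Size-bounded clustering keeps every shard within a size range so that no single shard dominates query processing. A toy sketch of the split/merge idea, assuming shards are lists of document ids; balance_shards and its chunking strategy are illustrative stand-ins for the real split_large_shards / merge_small_shards behavior:

```python
def balance_shards(shards, max_size, min_size):
    """Enforce size bounds on a list of shards (lists of document ids)."""
    # Split: chop oversized shards into chunks of at most max_size.
    split = []
    for shard in shards:
        for start in range(0, len(shard), max_size):
            split.append(shard[start:start + max_size])
    # Merge: fold each undersized shard into the smallest remaining shard.
    keep = [s for s in split if len(s) >= min_size]
    small = [s for s in split if len(s) < min_size]
    for shard in small:
        if keep:
            min(keep, key=len).extend(shard)
        else:
            keep.append(shard)
    return keep

shards = [[1, 2, 3, 4, 5, 6, 7], [8], [9, 10]]
print(balance_shards(shards, max_size=4, min_size=2))
```

Every document survives the rebalancing; only the shard boundaries move.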

For a full example, check out the script we used to cluster the TREC CAsT corpus.

References

  1. Kulkarni, A. 2013. Efficient and Effective Large-scale Search. Ph.D. thesis, Carnegie Mellon University.
  2. Dai, Z. et al. 2016. Query-Biased Partitioning for Selective Search. In Proceedings of CIKM 2016.
