Bayesian nonparametric toolkit for text clustering, analysis, and benchmarking with advanced embedding models and statistical validation.

Clusterium is a Bayesian nonparametric toolkit for text clustering, analysis, and benchmarking that leverages state-of-the-art embedding models and statistical validation techniques.

Features

  • Dirichlet Process Clustering: Implements the Dirichlet Process for text clustering

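  • Pitman-Yor Process Clustering: Implements the Pitman-Yor Process for text clustering; its additional discount parameter lets it model heavy-tailed (power-law) cluster size distributions better than the Dirichlet Process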

  • Evaluation: Evaluates clustering results using a variety of metrics, including Silhouette Score, Davies-Bouldin Index, and Power-law Analysis

  • Visualization: Generates plots of cluster size distributions
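To make the difference between the two processes concrete, here is a minimal sketch of their "Chinese restaurant" seating rule, written with the standard library only. It is an illustration under stated assumptions, not Clusterium's implementation: the function name sample_partition is hypothetical, and the sketch ignores text similarity entirely, so it shows only how alpha (concentration) and sigma (discount) shape cluster sizes. Setting sigma to 0 gives the Dirichlet Process prior; a value between 0 and 1 gives the Pitman-Yor prior.

```python
import random
from collections import Counter


def sample_partition(n_items, alpha, sigma=0.0, seed=0):
    """Draw cluster labels from the two-parameter Chinese restaurant process.

    Hypothetical helper for illustration only; it is not part of clusx.
    sigma == 0 corresponds to the Dirichlet Process,
    0 < sigma < 1 to the Pitman-Yor Process.
    """
    if alpha <= 0 or not (0.0 <= sigma < 1.0):
        raise ValueError("this sketch requires alpha > 0 and 0 <= sigma < 1")

    rng = random.Random(seed)
    sizes = []   # sizes[k] is the number of items already in cluster k
    labels = []  # labels[i] is the cluster index assigned to item i

    for _ in range(n_items):
        # An existing cluster k is chosen with weight (size_k - sigma);
        # a new cluster is opened with weight (alpha + sigma * K),
        # where K is the current number of clusters.
        weights = [size - sigma for size in sizes]
        weights.append(alpha + sigma * len(sizes))
        k = rng.choices(range(len(weights)), weights=weights)[0]

        if k == len(sizes):
            sizes.append(1)  # open a new cluster
        else:
            sizes[k] += 1    # join an existing cluster
        labels.append(k)

    return labels


if __name__ == "__main__":
    dp_labels = sample_partition(1000, alpha=0.5, sigma=0.0, seed=42)
    pyp_labels = sample_partition(1000, alpha=0.5, sigma=0.3, seed=42)
    print("DP clusters:", len(set(dp_labels)))
    print("PYP clusters:", len(set(pyp_labels)))
    print("Largest PYP clusters:", Counter(pyp_labels).most_common(3))
```

With a positive discount, every existing cluster loses a little weight and that weight is shifted to opening new clusters, which is why the Pitman-Yor prior tends to produce many small clusters alongside a few large ones.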

Quick Start

# Install the package
pip install clusx

# Basic clustering with default parameters
clusx cluster --input your_data.txt

# Evaluate clustering results
clusx evaluate \
  --input your_data.txt \
  --dp-clusters output/clusters_output_dp.csv \
  --pyp-clusters output/clusters_output_pyp.csv

That’s it! The tool uses optimized default parameters and saves all outputs to the output/ directory.

For interactive visualization during evaluation, add the --show-plot option:

clusx evaluate \
  --input your_data.txt \
  --dp-clusters output/clusters_output_dp.csv \
  --pyp-clusters output/clusters_output_pyp.csv \
  --show-plot

Python API Example

from clusx.clustering import DirichletProcess, PitmanYorProcess
from clusx.clustering.utils import load_data

# Load data
texts = load_data("your_data.txt")

# Perform clustering with explicitly chosen parameters
dp = DirichletProcess(alpha=0.5)  # Dirichlet Process
clusters_dp, _ = dp.fit(texts)

pyp = PitmanYorProcess(alpha=0.3, sigma=0.3)  # Pitman-Yor Process
clusters_pyp, _ = pyp.fit(texts)

# Print number of clusters found
print(f"DP found {len(set(clusters_dp))} clusters")
print(f"PYP found {len(set(clusters_pyp))} clusters")
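Beyond the number of clusters, the size of each cluster is often the more informative quantity. The sketch below summarizes cluster sizes using only the standard library. It assumes that the first value returned by fit() is a sequence holding one hashable cluster label per input text, which the len(set(...)) calls above suggest but which should be checked against the documentation. The helper name cluster_size_summary is hypothetical and not part of clusx.

```python
from collections import Counter


def cluster_size_summary(labels):
    """Return cluster sizes in descending order and the number of singletons.

    Hypothetical helper; `labels` is assumed to hold one cluster label per text.
    """
    sizes = sorted(Counter(labels).values(), reverse=True)
    singletons = sum(1 for size in sizes if size == 1)
    return sizes, singletons


# Toy labels standing in for clusters_dp or clusters_pyp
toy_labels = [0, 0, 1, 2, 2, 2, 3]
sizes, singletons = cluster_size_summary(toy_labels)
print(sizes)       # [3, 2, 1, 1]
print(singletons)  # 2
```

A long tail of singleton clusters next to a few large ones is the pattern that a power-law analysis of cluster sizes looks for.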

For more advanced usage, including saving results and evaluation, see the Usage Guide.

Project Information

Clusterium is released under the MIT License. Its documentation lives at Read the Docs, the code is on GitHub, and the latest release is on PyPI. It is tested on Python 3.11+.

If you’d like to contribute to Clusterium, you’re most welcome!

Support

If you have a question or a remark, find a bug, or run into something you can’t do with Clusterium, please open an issue.
