Bayesian nonparametric toolkit for text clustering, analysis, and benchmarking with advanced embedding models and statistical validation.

Project description

Clusterium is a Bayesian nonparametric toolkit for text clustering, analysis, and benchmarking that leverages state-of-the-art embedding models and statistical validation techniques.

Features

  • Dirichlet Process Clustering: Implements the Dirichlet Process for text clustering

  • Pitman-Yor Process Clustering: Implements the Pitman-Yor Process for text clustering, a generalization of the Dirichlet Process whose discount parameter better captures power-law cluster size distributions

  • Evaluation: Evaluates clustering results using a variety of metrics, including Silhouette Score, Davies-Bouldin Index, and Power-law Analysis

  • Visualization: Generates plots of cluster size distributions
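
To give a feel for the two processes named above, here is a minimal sketch of their sequential "seating" schemes: the Chinese restaurant process for the Dirichlet Process and its Pitman-Yor generalization. It is an illustration only, not Clusterium's implementation: it ignores text and embeddings entirely, the function name sample_partition is hypothetical, and it assumes alpha is the concentration parameter and sigma the discount.

```python
import random
from collections import Counter


def sample_partition(n_items, alpha, sigma=0.0, seed=0):
    """Assign n_items to clusters one at a time and return their labels.

    sigma == 0 gives the Chinese restaurant process (Dirichlet Process);
    0 < sigma < 1 gives the Pitman-Yor generalization.
    """
    if alpha <= 0 or not (0.0 <= sigma < 1.0):
        raise ValueError("this sketch requires alpha > 0 and 0 <= sigma < 1")
    rng = random.Random(seed)
    sizes = []   # sizes[k] is the number of items currently in cluster k
    labels = []
    for _ in range(n_items):
        # An existing cluster k is chosen with weight (size_k - sigma);
        # a new cluster is opened with weight (alpha + sigma * n_clusters).
        weights = [size - sigma for size in sizes]
        weights.append(alpha + sigma * len(sizes))
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)      # open a new cluster
        else:
            sizes[k] += 1        # join an existing cluster
        labels.append(k)
    return labels


# Compare how many clusters each scheme opens for the same number of items.
dp_sizes = Counter(sample_partition(1000, alpha=0.5))
pyp_sizes = Counter(sample_partition(1000, alpha=0.5, sigma=0.3))
print(len(dp_sizes), len(pyp_sizes))
```

With a positive discount, the weight of opening a new cluster grows with the number of existing clusters, which is why the Pitman-Yor scheme tends to produce a longer tail of small clusters.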

Quick Start

# Install the package
pip install clusx

# Basic clustering with default parameters
clusx cluster --input your_data.txt

# Evaluate clustering results
clusx evaluate \
  --input your_data.txt \
  --dp-clusters output/clusters_output_dp.csv \
  --pyp-clusters output/clusters_output_pyp.csv

That’s it! The tool uses optimized default parameters and saves all outputs to the output directory.

For interactive visualization during evaluation, add the --show-plot option:

clusx evaluate \
  --input your_data.txt \
  --dp-clusters output/clusters_output_dp.csv \
  --pyp-clusters output/clusters_output_pyp.csv \
  --show-plot
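
The feature list mentions power-law analysis among the evaluation metrics. As a rough idea of what such an analysis can involve, the sketch below estimates a power-law exponent from cluster sizes using the standard continuous maximum-likelihood estimator, 1 + n / sum(ln(x_i / x_min)). This is an assumption about the general technique, not a description of what clusx evaluate computes; the function name and the x_min default are hypothetical.

```python
import math


def powerlaw_exponent(sizes, x_min=1):
    """Estimate a power-law exponent from cluster sizes >= x_min.

    Uses the continuous maximum-likelihood estimator, which is only an
    approximation for integer-valued data such as cluster sizes.
    """
    tail = [s for s in sizes if s >= x_min]
    log_sum = sum(math.log(s / x_min) for s in tail)
    if log_sum == 0:
        raise ValueError("need at least one size strictly greater than x_min")
    return 1.0 + len(tail) / log_sum


# Hypothetical cluster sizes, for illustration only.
example_sizes = [1, 1, 1, 2, 2, 4, 8]
exponent = powerlaw_exponent(example_sizes)
```

A serious analysis would also choose x_min from the data and test goodness of fit, both of which this sketch leaves out.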

Python API Example

from clusx.clustering import DirichletProcess, PitmanYorProcess
from clusx.clustering.utils import load_data

# Load data
texts = load_data("your_data.txt")

# Perform clustering with default parameters
dp = DirichletProcess(alpha=0.5, kappa=0.3)  # Default parameters
clusters_dp = dp.fit_predict(texts)

pyp = PitmanYorProcess(alpha=0.3, kappa=0.3, sigma=0.3)  # Default parameters
clusters_pyp = pyp.fit_predict(texts)

# Print number of clusters found
print(f"DP found {len(set(clusters_dp))} clusters")
print(f"PYP found {len(set(clusters_pyp))} clusters")
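
Beyond the number of clusters, the size of each cluster is often useful. The sketch below assumes, as the len(set(...)) calls above suggest, that fit_predict returns one hashable label per input text. The labels list is a hypothetical stand-in so the snippet runs without the package installed.

```python
from collections import Counter


def cluster_sizes(labels):
    """Return cluster sizes sorted from largest to smallest."""
    return sorted(Counter(labels).values(), reverse=True)


# Hypothetical labels standing in for clusters_dp or clusters_pyp.
labels = [0, 0, 1, 0, 2, 1, 0]
sizes = cluster_sizes(labels)
print(f"Largest cluster holds {sizes[0]} of {len(labels)} texts")
```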

For more advanced usage, including saving results and evaluation, see the Usage Guide.

Project Information

Clusterium is released under the MIT License. Its documentation lives at Read the Docs, the code on GitHub, and the latest release on PyPI. It’s rigorously tested on Python 3.11+.

If you’d like to contribute to Clusterium, you’re most welcome!

Support

If you have a question or remark, find a bug, or run into something you can’t do with Clusterium, please open an issue.

Download files

Source Distribution

clusx-0.6.0.tar.gz (52.8 kB)

Built Distribution

clusx-0.6.0-py3-none-any.whl (38.7 kB)

File details

Details for the file clusx-0.6.0.tar.gz.

File metadata

  • Download URL: clusx-0.6.0.tar.gz
  • Size: 52.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for clusx-0.6.0.tar.gz
Algorithm    Hash digest
SHA256       cd9e54ede30485a2242a96c8161936f4bba9c83b60c1fcb8dee95b790c0e368d
MD5          d4c973394b6571df33a99405fbbfbced
BLAKE2b-256  f2f0b488c68ceb9f4fdab3803220b72884f0259217c5540c9e23b683adb7292d

Provenance

The following attestation bundles were made for clusx-0.6.0.tar.gz:

Publisher: cd.yml on sergeyklay/clusterium

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file clusx-0.6.0-py3-none-any.whl.

File metadata

  • Download URL: clusx-0.6.0-py3-none-any.whl
  • Size: 38.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for clusx-0.6.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       e148735c537c3f0852194038889e875170eb6c9998aa009e876af96d4da93d0e
MD5          7ee9e84e61872ee34a5821c5b57108bc
BLAKE2b-256  e40377b238352dd984ec58261c99a426a38a5f441fe21a47d1d8114ce1d1f56c

Provenance

The following attestation bundles were made for clusx-0.6.0-py3-none-any.whl:

Publisher: cd.yml on sergeyklay/clusterium
