Bayesian nonparametric toolkit for text clustering, analysis, and benchmarking with advanced embedding models and statistical validation.

Clusterium is a Bayesian nonparametric toolkit for text clustering, analysis, and benchmarking that leverages state-of-the-art embedding models and statistical validation techniques.

Features

  • Dirichlet Process Clustering: Implements the Dirichlet Process for text clustering

  • Pitman-Yor Process Clustering: Implements the Pitman-Yor Process for text clustering, a generalization of the Dirichlet Process whose extra discount parameter allows power-law cluster size distributions

  • Evaluation: Evaluates clustering results using a variety of metrics, including Silhouette Score, Davies-Bouldin Index, and Power-law Analysis

  • Visualization: Generates plots of cluster size distributions
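To illustrate how the two processes listed above differ, here is a minimal sketch of the Chinese restaurant construction that underlies both. It is not Clusterium's implementation: it ignores text embeddings entirely and only samples cluster sizes, and the function name `sample_partition` is hypothetical. With `sigma=0.0` it behaves as a Dirichlet Process; with `0 < sigma < 1` it behaves as a Pitman-Yor Process.

```python
import random


def sample_partition(n_items, alpha, sigma=0.0, seed=0):
    """Sample cluster sizes from a Chinese restaurant process.

    alpha is the concentration parameter (must be > 0 here) and sigma is
    the Pitman-Yor discount (0 <= sigma < 1). sigma=0.0 gives the
    Dirichlet Process case. Hypothetical helper, for illustration only.
    """
    rng = random.Random(seed)
    sizes = []  # sizes[k] = number of items currently in cluster k
    for _ in range(n_items):
        k = len(sizes)
        # Unnormalized weights: an existing cluster gets (size - sigma),
        # a new cluster gets (alpha + sigma * number_of_clusters).
        weights = [size - sigma for size in sizes] + [alpha + sigma * k]
        choice = rng.choices(range(k + 1), weights=weights)[0]
        if choice == k:
            sizes.append(1)  # open a new cluster
        else:
            sizes[choice] += 1  # join an existing cluster
    return sizes


dp_sizes = sample_partition(2000, alpha=0.5)
pyp_sizes = sample_partition(2000, alpha=0.3, sigma=0.3)
print(f"DP sketch: {len(dp_sizes)} clusters")
print(f"PYP sketch: {len(pyp_sizes)} clusters")
```

The discount `sigma` takes weight away from every existing cluster and gives it to the "new cluster" option, so the number of clusters tends to grow faster with the data than in the Dirichlet Process case.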

Quick Start

# Install the package
pip install clusx

# Basic clustering with default parameters
clusx cluster --input your_data.txt

# Evaluate clustering results
clusx evaluate \
  --input your_data.txt \
  --dp-clusters output/clusters_output_dp.csv \
  --pyp-clusters output/clusters_output_pyp.csv

That’s it! The tool uses optimized default parameters and saves all outputs to the output directory.

For interactive visualization during evaluation, add the --show-plot option:

clusx evaluate \
  --input your_data.txt \
  --dp-clusters output/clusters_output_dp.csv \
  --pyp-clusters output/clusters_output_pyp.csv \
  --show-plot

Python API Example

from clusx.clustering import DirichletProcess, PitmanYorProcess
from clusx.clustering.utils import load_data

# Load data
texts = load_data("your_data.txt")

# Perform clustering with explicitly chosen parameters
dp = DirichletProcess(alpha=0.5)  # Dirichlet Process
clusters_dp, _ = dp.fit(texts)

pyp = PitmanYorProcess(alpha=0.3, sigma=0.3)  # Pitman-Yor Process
clusters_pyp, _ = pyp.fit(texts)

# Print number of clusters found
print(f"DP found {len(set(clusters_dp))} clusters")
print(f"PYP found {len(set(clusters_pyp))} clusters")

For more advanced usage, including saving results and evaluation, see the Usage Guide.
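As a small follow-up to the API example, the sketch below summarizes a clustering by its cluster sizes, which is the quantity the size-distribution plots and power-law analysis are concerned with. It assumes that the first value returned by `fit` is a sequence with one cluster label per input text, as the `len(set(...))` calls in the example suggest. The helper name `cluster_size_summary` and the sample labels are hypothetical and not part of the package.

```python
from collections import Counter


def cluster_size_summary(labels):
    """Return cluster sizes sorted from largest to smallest."""
    counts = Counter(labels)
    return sorted(counts.values(), reverse=True)


# Hypothetical labels standing in for clusters_dp or clusters_pyp
labels = [0, 0, 0, 1, 1, 2, 0, 3, 1, 0]
sizes = cluster_size_summary(labels)

# Print a rank-size listing: rank 1 is the largest cluster
for rank, size in enumerate(sizes, start=1):
    print(f"rank {rank}: {size} texts")
```

A few large clusters followed by a long tail of small ones in this listing is the pattern a power-law size distribution would produce.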

Project Information

Clusterium is released under the MIT License. Its documentation lives at Read the Docs, its code on GitHub, and its latest release on PyPI. It is tested on Python 3.11+.

If you’d like to contribute to Clusterium, you’re most welcome!

Support

If you have a question or a remark, find a bug, or cannot do something with Clusterium, please open an issue.
