Tool for clustering, analyzing, and benchmarking text data with advanced embeddings and statistical validation.

Clusterium is a toolkit for clustering, analyzing, and benchmarking text data using state-of-the-art embedding models and clustering algorithms.

Features

  • Dirichlet Process Clustering: Clusters texts with a Dirichlet Process model, so the number of clusters does not have to be fixed in advance

  • Pitman-Yor Process Clustering: Clusters texts with the Pitman-Yor Process, a two-parameter generalization of the Dirichlet Process, for improved performance

  • Evaluation: Scores clustering results with several metrics, including Silhouette Score, Davies-Bouldin Index, and power-law analysis

  • Visualization: Generates plots of cluster size distributions
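
For intuition about the two clustering features above, here is a minimal sketch of the Chinese Restaurant Process view of both models. It is not taken from the clusx source: the function name `sample_partition` and its parameters are hypothetical, and it ignores text similarity entirely. It only shows how the concentration parameter `alpha` and the Pitman-Yor `discount` control how often a new cluster is opened.

```python
import random
from collections import Counter


def sample_partition(n_items, alpha, discount=0.0, seed=0):
    """Assign n_items to clusters with the Chinese Restaurant Process.

    discount=0.0 corresponds to the Dirichlet Process;
    0 < discount < 1 corresponds to the Pitman-Yor Process.
    (Hypothetical helper for illustration, not part of clusx.)
    """
    rng = random.Random(seed)
    sizes = []   # sizes[k] = number of items currently in cluster k
    labels = []  # labels[i] = cluster index assigned to item i
    for _ in range(n_items):
        # An existing cluster k is chosen with weight (size_k - discount);
        # a new cluster is opened with weight (alpha + discount * K).
        weights = [size - discount for size in sizes]
        weights.append(alpha + discount * len(sizes))
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)
        else:
            sizes[k] += 1
        labels.append(k)
    return labels, sizes


dp_labels, dp_sizes = sample_partition(2000, alpha=1.0)
pyp_labels, pyp_sizes = sample_partition(2000, alpha=1.0, discount=0.5)
print("DP clusters:", len(dp_sizes), "| PYP clusters:", len(pyp_sizes))
```

With a positive discount, the weight for opening a new cluster grows with the number of existing clusters, so the Pitman-Yor variant tends to produce many more small clusters than the Dirichlet Process for the same `alpha`.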

Quick Start

pip install clusx

# Run clustering
clusx --input your_data.csv --column your_column --output clusters.csv

# Evaluate clustering results and generate visualizations
clusx evaluate \
  --input your_data.csv \
  --column your_column \
  --dp-clusters output_dp.csv \
  --pyp-clusters output_pyp.csv \
  --plot
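
To make one of the evaluation metrics concrete, the Silhouette Score can be sketched in plain Python. This is an illustrative implementation under stated assumptions, not the code clusx runs: `points` stands in for text embeddings, distances are Euclidean, and singleton clusters score 0 by convention.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        # Average distance from point i to every other member index.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = mean_dist(i, own)  # cohesion: mean distance within own cluster
        b = min(mean_dist(i, members)  # separation: nearest other cluster
                for other, members in clusters.items() if other != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy 2-D "embeddings": two tight groups that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # groups match the geometry
bad = silhouette_score(points, [0, 1, 0, 1])   # groups cut across it
print(round(good, 3), round(bad, 3))
```

Scores close to 1 indicate compact, well-separated clusters, while negative scores suggest points sit closer to another cluster than to their own. For real workloads, scikit-learn's `sklearn.metrics.silhouette_score` is the usual choice.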

Python API Example

from clusx.clustering import DirichletProcess
from clusx.clustering.utils import load_data_from_csv, save_clusters_to_json

# Load data
texts, data = load_data_from_csv("your_data.csv", column="your_column")

# Perform clustering
dp = DirichletProcess(alpha=1.0)
clusters, params = dp.fit(texts)

# Save results
save_clusters_to_json("clusters.json", texts, clusters, "DP", data)
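
Once cluster assignments are available, a quick look at the cluster size distribution can be sketched as follows. The return type of `dp.fit` is not documented here, so this example assumes the assignments are a sequence with one hashable cluster label per text; `toy_clusters` is a stand-in for that sequence, and both helper functions are hypothetical rather than part of clusx.

```python
import math
from collections import Counter


def cluster_sizes(labels):
    """Cluster sizes sorted from largest to smallest."""
    return sorted(Counter(labels).values(), reverse=True)


def rank_size_slope(sizes):
    """Least-squares slope of log(size) against log(rank).

    A roughly straight line with negative slope on this log-log scale
    is a crude hint of power-law behavior; it is not a formal test.
    """
    if len(sizes) < 2:
        raise ValueError("need at least two clusters")
    xs = [math.log(rank) for rank in range(1, len(sizes) + 1)]
    ys = [math.log(size) for size in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Stand-in for real assignments: one label per text (an assumption).
toy_clusters = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
sizes = cluster_sizes(toy_clusters)
slope = rank_size_slope(sizes)
print(sizes, slope)
```

The toy sizes are chosen so that size is proportional to 1/rank, which gives a slope of about -1 on the log-log scale.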

Project Information

Clusterium is released under the MIT License. Its documentation lives at Read the Docs, the code on GitHub, and the latest release on PyPI. It’s rigorously tested on Python 3.11+.

If you’d like to contribute to Clusterium, you’re most welcome!

Support

If you have a question or a remark, find a bug, or run into something you can’t do with Clusterium, please open an issue.
