
Tool for clustering, analyzing, and benchmarking text data with advanced embeddings and statistical validation.

Clusterium is a toolkit for clustering, analyzing, and benchmarking text data using state-of-the-art embedding models and clustering algorithms.

Features

  • Dirichlet Process Clustering: Implements the Dirichlet Process for text clustering

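  • Pitman-Yor Process Clustering: Implements the Pitman-Yor Process, a two-parameter generalization of the Dirichlet Process that can better fit heavy-tailed (power-law) cluster size distributions, for text clustering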

  • Evaluation: Evaluates clustering results with several metrics, including Silhouette Score, Davies-Bouldin Index, and power-law analysis

  • Visualization: Generates plots of cluster size distributions
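
To make the difference between the two processes concrete, here is a minimal standard-library sketch of the Chinese restaurant process view of both priors. It is not Clusterium's implementation: the function name `sample_partition` and its parameters are hypothetical, and it ignores text embeddings entirely, sampling cluster assignments from the prior alone. Setting `discount=0.0` gives the Dirichlet Process; a discount between 0 and 1 gives the Pitman-Yor Process.

```python
import random


def sample_partition(n_items, alpha=1.0, discount=0.0, seed=0):
    """Draw cluster labels from a Pitman-Yor Chinese restaurant process.

    Hypothetical helper for illustration only. discount=0.0 reduces to
    the Dirichlet Process case.
    """
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    rng = random.Random(seed)
    sizes = []   # sizes[k] = number of items already assigned to cluster k
    labels = []  # labels[i] = cluster index chosen for item i
    for _ in range(n_items):
        # An existing cluster k gets weight (size_k - discount);
        # opening a new cluster gets weight (alpha + discount * n_clusters).
        weights = [size - discount for size in sizes]
        weights.append(alpha + discount * len(sizes))
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)      # open a new cluster
        else:
            sizes[k] += 1        # join an existing cluster
        labels.append(k)
    return labels, sizes


dp_labels, dp_sizes = sample_partition(500, alpha=1.0, discount=0.0, seed=42)
pyp_labels, pyp_sizes = sample_partition(500, alpha=1.0, discount=0.5, seed=42)
print(len(dp_sizes), "clusters under DP;", len(pyp_sizes), "clusters under PYP")
```

A positive discount takes weight away from every existing cluster and gives it to the "new cluster" option, so the Pitman-Yor prior tends to produce more clusters and a heavier tail of small ones.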

Quick Start

pip install clusx

# Run clustering
clusx --input your_data.csv --column your_column --output clusters.csv

# Evaluate clustering results and generate visualizations
clusx evaluate \
  --input input.csv \
  --column your_column \
  --dp-clusters output_dp.csv \
  --pyp-clusters output_pyp.csv \
  --plot
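
For readers wondering what a power-law analysis of cluster sizes can involve, the sketch below shows one common estimator: the continuous maximum-likelihood estimate of the exponent, alpha = 1 + n / sum(ln(x_i / x_min)). This is an illustration under stated assumptions and not necessarily what `clusx evaluate` computes; the function name `powerlaw_exponent` and the `x_min` default are hypothetical, and the continuous formula is only a rough approximation for small integer sizes.

```python
import math


def powerlaw_exponent(sizes, x_min=1.0):
    """Continuous maximum-likelihood estimate of a power-law exponent.

    Hypothetical helper: only sizes >= x_min are treated as the tail.
    """
    tail = [s for s in sizes if s >= x_min]
    log_sum = sum(math.log(s / x_min) for s in tail)
    if log_sum == 0:
        raise ValueError("need at least one size strictly above x_min")
    return 1.0 + len(tail) / log_sum


# Made-up cluster sizes, used only to show the call.
example_sizes = [120, 45, 20, 9, 5, 3, 2, 2, 1, 1, 1, 1]
print(round(powerlaw_exponent(example_sizes), 3))
```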

Python API Example

from clusx.clustering import DirichletProcess
from clusx.clustering.utils import load_data_from_csv, save_clusters_to_json

# Load data
texts, data = load_data_from_csv("your_data.csv", column="your_column")

# Perform clustering
dp = DirichletProcess(alpha=1.0)
clusters, params = dp.fit(texts)

# Save results
save_clusters_to_json("clusters.json", texts, clusters, "DP", data)
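
After clustering, a quick look at the cluster sizes is often the first sanity check. The sketch below assumes that the `clusters` value is a sequence of cluster labels aligned with `texts`; the example above does not confirm that shape, so treat it as an assumption and adapt as needed. The helper `cluster_size_table` is hypothetical and uses only the standard library.

```python
from collections import Counter


def cluster_size_table(labels):
    """Return (label, size) pairs sorted from largest to smallest cluster."""
    return Counter(labels).most_common()


# Stand-in labels; swap in the `clusters` value from dp.fit(texts)
# if it turns out to be a per-text label sequence (an assumption).
example_labels = [0, 0, 1, 0, 2, 1]
table = cluster_size_table(example_labels)
for label, size in table:
    print(f"cluster {label}: {size} texts")
```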

Project Information

Clusterium is released under the MIT License. Its documentation lives on Read the Docs, its code on GitHub, and the latest release on PyPI. It is tested on Python 3.11+.

If you’d like to contribute to Clusterium, you’re most welcome!

Support

If you have a question or a remark, find a bug, or run into something you can’t do with Clusterium, please open an issue.
