Bayesian nonparametric toolkit for text clustering, analysis, and benchmarking with advanced embedding models and statistical validation.
Clusterium is a Bayesian nonparametric toolkit for text clustering, analysis, and benchmarking that leverages state-of-the-art embedding models and statistical validation techniques.
Features
Dirichlet Process Clustering: Implements the Dirichlet Process for text clustering
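Pitman-Yor Process Clustering: Implements the Pitman-Yor Process for text clustering, a generalization of the Dirichlet Process with an additional discount parameter that can better capture power-law cluster sizes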
Evaluation: Evaluates clustering results using a variety of metrics, including Silhouette Score, Davies-Bouldin Index, and Power-law Analysis
Visualization: Generates plots of cluster size distributions
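To give a feel for how the two processes listed above differ, here is a minimal sketch of the textbook Chinese restaurant process prior. It is not Clusterium's implementation: the function name sample_partition is hypothetical, it ignores text embeddings entirely, and its alpha (concentration) and sigma (discount) follow the standard definitions, which may not map one-to-one onto the package's own parameters. With sigma set to 0 the sampler is a Dirichlet Process prior; with 0 < sigma < 1 it is a Pitman-Yor prior.

```python
import random


def sample_partition(n, alpha, sigma=0.0, seed=0):
    """Sample cluster labels for n items from a Chinese restaurant process.

    sigma == 0 gives the Dirichlet Process prior; 0 < sigma < 1 gives the
    Pitman-Yor prior. This is an illustrative sketch, not clusx code.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive in this sketch")
    if not 0.0 <= sigma < 1.0:
        raise ValueError("sigma must be in [0, 1)")

    rng = random.Random(seed)
    sizes = []   # sizes[k] is the number of items in cluster k
    labels = []  # labels[i] is the cluster index of item i
    for _ in range(n):
        # Existing cluster k is chosen with weight (size_k - sigma);
        # a new cluster is opened with weight (alpha + sigma * K).
        weights = [size - sigma for size in sizes]
        weights.append(alpha + sigma * len(sizes))
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)
        else:
            sizes[k] += 1
        labels.append(k)
    return labels


dp_labels = sample_partition(1000, alpha=1.0)
pyp_labels = sample_partition(1000, alpha=1.0, sigma=0.5)
print("DP clusters:", len(set(dp_labels)))
print("PYP clusters:", len(set(pyp_labels)))
```

In expectation, the number of clusters grows logarithmically with the number of items under the Dirichlet Process and polynomially under the Pitman-Yor Process, which is why the latter tends to produce many small clusters alongside a few large ones.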
Quick Start
# Install the package
pip install clusx
# Basic clustering with default parameters
clusx cluster --input your_data.txt
# Evaluate clustering results
clusx evaluate \
--input your_data.txt \
--dp-clusters output/clusters_output_dp.csv \
--pyp-clusters output/clusters_output_pyp.csv
That’s it! The tool uses optimized default parameters and saves all outputs to the output directory.
For interactive visualization during evaluation, add the --show-plot option:
clusx evaluate \
--input your_data.txt \
--dp-clusters output/clusters_output_dp.csv \
--pyp-clusters output/clusters_output_pyp.csv \
--show-plot
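The Silhouette Score listed among the evaluation metrics can be understood from a small, dependency-free sketch. The function below follows the textbook definition with Euclidean distance; it is an assumption that the package computes the score over embedding vectors in a comparable way, and the function name and toy data are purely illustrative.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i)
        a /= len(own) - 1
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(labels)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(silhouette_score(points, [0, 0, 1, 1]))  # well separated, close to 1
print(silhouette_score(points, [0, 1, 0, 1]))  # poorly grouped, negative
```

Values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster.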
Python API Example
from clusx.clustering import DirichletProcess, PitmanYorProcess
from clusx.clustering.utils import load_data
# Load data
texts = load_data("your_data.txt")
# Perform clustering with default parameters
dp = DirichletProcess(alpha=0.5, kappa=0.3) # Default parameters
clusters_dp = dp.fit_predict(texts)
pyp = PitmanYorProcess(alpha=0.3, kappa=0.3, sigma=0.3) # Default parameters
clusters_pyp = pyp.fit_predict(texts)
# Print number of clusters found
print(f"DP found {len(set(clusters_dp))} clusters")
print(f"PYP found {len(set(clusters_pyp))} clusters")
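Beyond the number of clusters, the size of each cluster is often what matters when comparing the two processes. The sketch below assumes, as the len(set(...)) calls above suggest, that fit_predict returns one hashable label per input text; cluster_size_distribution is a hypothetical helper, not part of the clusx API.

```python
from collections import Counter


def cluster_size_distribution(labels):
    """Return cluster sizes sorted from largest to smallest."""
    return sorted(Counter(labels).values(), reverse=True)


# Toy labels standing in for the output of fit_predict
sizes = cluster_size_distribution([0, 0, 1, 2, 0, 1])
print(sizes)  # [3, 2, 1]
```

Plotting these sizes against their rank on log-log axes is a common way to check by eye for the power-law behavior mentioned in the feature list.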
For more advanced usage, including saving results and evaluation, see the Usage Guide.
Project Information
Clusterium is released under the MIT License. Its documentation lives at Read the Docs, the code on GitHub, and the latest release on PyPI. It is rigorously tested on Python 3.11+.
If you’d like to contribute to Clusterium, you’re most welcome!
Support
If you have a question or a remark, find a bug, or run into something you can’t do with Clusterium, please open an issue.