Automatic text embedding clustering with eigen-gap k estimation and Cohesion Ratio metric
autokluster
Auto-k spectral clustering for text embeddings with automatic cluster count estimation and quality metrics.
Why autokluster?
Traditional clustering algorithms like K-Means require you to specify the number of clusters k beforehand. This is problematic when you don't know how many natural groups exist in your data.
autokluster solves this by:
- Automatically estimating the optimal k using eigen-gap analysis
- Providing a Cohesion Ratio metric to assess clustering quality
- Being optimized for text embeddings from models like sentence-transformers
Installation
pip install autokluster
With sentence-transformers support:
pip install autokluster[embeddings]
Quick Start
from autokluster import cluster
# Your text embeddings (n_samples, n_features)
embeddings = ...
# Automatic clustering - k is found automatically
result = cluster(embeddings)
print(f"Found {result.k} clusters") # e.g., 7
print(f"Cohesion Ratio: {result.cohesion_ratio:.2f}") # e.g., 1.84 (>1 = good)
print(f"Labels: {result.labels}") # [0, 2, 1, 0, 3, ...]
With fixed k
result = cluster(embeddings, k=5) # Force 5 clusters
CLI
# Automatic k estimation
autokluster --input embeddings.npy --output clusters.json
# Fixed k
autokluster --input embeddings.npy --k 5
# Detailed output with eigenvalues
autokluster --input embeddings.npy --format detailed
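The commands above read embeddings from a `.npy` file. A minimal sketch of producing one with NumPy follows. The helper name `save_embeddings` and the cast to `float32` are assumptions made for illustration and are not part of autokluster; the only thing taken from this README is that embeddings have shape (n_samples, n_features).

```python
import numpy as np


def save_embeddings(path, embeddings):
    """Validate a 2-D embedding matrix and write it to `path` in .npy format.

    Hypothetical helper, not part of autokluster.
    """
    # Assumption: float32 is precise enough for embeddings and keeps the file small.
    array = np.asarray(embeddings, dtype=np.float32)
    if array.ndim != 2:
        raise ValueError("expected shape (n_samples, n_features)")
    np.save(path, array)
    return array.shape
```

Calling `save_embeddings("embeddings.npy", vectors)` writes a file that can then be passed to `--input`.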
How It Works
1. Spectral Clustering Pipeline
Embeddings → Cosine Similarity → Normalized Laplacian → Eigendecomposition → K-Means
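The pipeline above can be sketched with NumPy alone. This is an illustration of the five stages under stated assumptions, not autokluster's actual implementation: negative cosine similarities are clipped to zero, the symmetric normalized Laplacian is used, and K-Means is a plain Lloyd's loop with farthest-point initialization. The function names `spectral_features`, `kmeans`, and `spectral_cluster` are hypothetical.

```python
import numpy as np


def spectral_features(embeddings, k):
    """Return (eigenvalues, features) from the normalized Laplacian of a cosine graph."""
    # Cosine similarity: L2-normalize rows, then take pairwise dot products.
    # Zero-norm rows are not handled in this sketch.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Assumption: negative similarities are clipped to 0 so edge weights are non-negative.
    similarity = np.clip(unit @ unit.T, 0.0, None)
    np.fill_diagonal(similarity, 0.0)

    # Symmetric normalized Laplacian: L = I - D^(-1/2) W D^(-1/2).
    degree = similarity.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    laplacian = np.eye(len(similarity)) - inv_sqrt[:, None] * similarity * inv_sqrt[None, :]

    # eigh returns eigenvalues of a symmetric matrix in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)
    features = eigenvectors[:, :k]
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return eigenvalues, features / np.maximum(norms, 1e-12)


def kmeans(points, k, n_iter=50):
    """Plain Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[dist.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels


def spectral_cluster(embeddings, k):
    """Embeddings -> cosine similarity -> Laplacian -> eigenvectors -> K-Means labels."""
    _, features = spectral_features(np.asarray(embeddings, dtype=float), k)
    return kmeans(features, k)
```

The dense eigendecomposition costs O(n³) time, so a sketch like this is only practical for a few thousand samples.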
2. Automatic k Estimation (Eigen-gap)
The algorithm analyzes gaps between consecutive eigenvalues of the Laplacian matrix. A significant gap indicates a natural cluster boundary:
δᵢ = |λᵢ - λᵢ₋₁| / moving_average(λ)
k = first i where δᵢ > threshold
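A minimal sketch of this rule, assuming 0-based eigenvalues sorted in ascending order. The README does not specify the moving-average window, the threshold value, or the fallback when no gap qualifies, so `window=3`, `threshold=2.0`, and the `None` return are illustrative assumptions, as is the function name `estimate_k`.

```python
import numpy as np


def estimate_k(eigenvalues, min_k=2, max_k=50, threshold=2.0, window=3):
    """Return the first i in [min_k, max_k] whose relative gap exceeds `threshold`.

    Returns None when no gap qualifies (assumed fallback).
    """
    lam = np.sort(np.asarray(eigenvalues, dtype=float))
    upper = min(max_k, len(lam) - 1)
    for i in range(max(min_k, 1), upper + 1):
        gap = abs(lam[i] - lam[i - 1])
        # Assumption: the moving average is the mean of up to `window`
        # eigenvalues immediately below the gap.
        baseline = lam[max(0, i - window):i].mean()
        delta = gap / max(baseline, 1e-12)
        if delta > threshold:
            return i
    return None


# Three near-zero eigenvalues followed by a jump suggest three clusters.
print(estimate_k([0.0, 0.02, 0.03, 0.9, 0.95, 1.0]))  # 3
```

Because the denominator is small when the leading eigenvalues are close to zero, the ratio is sensitive to the choice of window and threshold; both would need tuning on real data.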
3. Cohesion Ratio (ρ_C)
Measures clustering quality by comparing intra-cluster similarity to global similarity:
ρ_C = μ_intra / μ_global
- ρ_C = 1: Clusters are no more cohesive than random (bad)
- ρ_C > 1: Clusters are cohesive (good)
- ρ_C > 2: Highly cohesive clusters (excellent)
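The ratio can be sketched as follows, assuming μ_intra is the mean cosine similarity over pairs of distinct points that share a cluster and μ_global is the mean over all pairs of distinct points. That reading of the two means is an assumption; the exact definition is given in the paper cited under References. The function name `cohesion_ratio` mirrors the result attribute but is written here only for illustration.

```python
import numpy as np


def cohesion_ratio(embeddings, labels):
    """Mean intra-cluster cosine similarity divided by mean global cosine similarity."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarity = unit @ unit.T

    # Exclude self-similarity, which is always 1 and would inflate both means.
    off_diagonal = ~np.eye(len(labels), dtype=bool)
    same_cluster = (labels[:, None] == labels[None, :]) & off_diagonal

    # Not handled in this sketch: all-singleton clusterings (empty intra set)
    # and a global mean that is zero or negative.
    mu_intra = similarity[same_cluster].mean()
    mu_global = similarity[off_diagonal].mean()
    return mu_intra / mu_global


points = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(f"{cohesion_ratio(points, [0, 0, 1, 1]):.2f}")  # 3.00
print(f"{cohesion_ratio(points, [0, 0, 0, 0]):.2f}")  # 1.00
```

Putting every point in one cluster gives exactly 1 under this definition, which matches the "no more cohesive than random" baseline above.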
API Reference
cluster(embeddings, k=None, min_k=2, max_k=50, random_state=None)
| Parameter | Type | Default | Description |
|---|---|---|---|
| embeddings | ndarray | required | Input embeddings (n_samples, n_features) |
| k | int \| None | None | Number of clusters. If None, estimated automatically |
| min_k | int | 2 | Minimum k for auto-estimation |
| max_k | int | 50 | Maximum k for auto-estimation |
| random_state | int \| None | None | Random seed for reproducibility |
ClusterResult
| Attribute | Type | Description |
|---|---|---|
| k | int | Number of clusters |
| labels | ndarray | Cluster assignments (n_samples,) |
| cohesion_ratio | float | Quality metric (higher is better) |
| eigenvalues | ndarray | Laplacian eigenvalues |
| eigengap_index | int \| None | Index where eigen-gap was detected |
| cluster_sizes | list[int] | Number of samples per cluster |
Requirements
- Python 3.10+
- numpy >= 1.24
- scipy >= 1.10
- scikit-learn >= 1.3
References
- Cohesion Ratio metric: arXiv:2511.19350
- Spectral clustering tutorial: von Luxburg (2007)
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
# Clone the repo
git clone https://github.com/FabienCadoret/autokluster.git
cd autokluster
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# Check linting
ruff check src/
License
MIT License - see LICENSE for details.