Automatic text embedding clustering with eigen-gap k estimation and Cohesion Ratio metric

autokluster

Auto-k spectral clustering for text embeddings with automatic cluster count estimation and quality metrics.

Why autokluster?

Traditional clustering algorithms like K-Means require you to specify the number of clusters k beforehand. This is problematic when you don't know how many natural groups exist in your data.

autokluster solves this by:

  • Automatically estimating the optimal k using eigen-gap analysis
  • Providing a Cohesion Ratio metric to assess clustering quality
  • Being optimized for text embeddings from models like sentence-transformers

Installation

pip install autokluster

With sentence-transformers support:

pip install "autokluster[embeddings]"

Quick Start

from autokluster import cluster

# Your text embeddings (n_samples, n_features)
embeddings = ...

# Automatic clustering - k is found automatically
result = cluster(embeddings)

print(f"Found {result.k} clusters")           # e.g., 7
print(f"Cohesion Ratio: {result.cohesion_ratio:.2f}")  # e.g., 1.84 (>1 = good)
print(f"Labels: {result.labels}")             # [0, 2, 1, 0, 3, ...]

With fixed k

result = cluster(embeddings, k=5)  # Force 5 clusters

CLI

# Automatic k estimation
autokluster --input embeddings.npy --output clusters.json

# Fixed k
autokluster --input embeddings.npy --k 5

# Detailed output with eigenvalues
autokluster --input embeddings.npy --format detailed
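
The commands above read embeddings from a .npy file. A minimal sketch for producing such a file with NumPy follows. It assumes the CLI expects a single 2-D array of shape (n_samples, n_features), as the library API does; the helper name save_embeddings and the float32 cast are illustrative choices, not part of autokluster.

```python
import numpy as np


def save_embeddings(path, embeddings):
    """Validate and write embeddings to a .npy file (hypothetical helper).

    Assumption: the file should hold one 2-D array of shape
    (n_samples, n_features).
    """
    array = np.asarray(embeddings, dtype=np.float32)
    if array.ndim != 2:
        raise ValueError(f"expected a 2-D array, got {array.ndim} dimension(s)")
    np.save(path, array)
    return array.shape
```

Calling save_embeddings("embeddings.npy", vectors) would then write a file that can be passed to --input.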

How It Works

1. Spectral Clustering Pipeline

Embeddings → Cosine Similarity → Normalized Laplacian → Eigendecomposition → K-Means
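
The pipeline above can be sketched in a few lines of NumPy and scikit-learn. This is an illustration under stated assumptions, not autokluster's actual implementation: negative cosine similarities are clipped to zero, the diagonal is zeroed, and the spectral embedding is row-normalized before K-Means. The function name spectral_cluster is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans


def spectral_cluster(embeddings, k, random_state=0):
    """Sketch: cosine similarity -> normalized Laplacian -> eigenvectors -> K-Means."""
    X = np.asarray(embeddings, dtype=float)

    # 1. Cosine similarity: normalize rows, then take dot products.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    W = X @ X.T

    # Assumption: clip negative similarities and drop self-loops so that
    # W is a valid non-negative affinity matrix.
    W = np.clip(W, 0.0, None)
    np.fill_diagonal(W, 0.0)

    # 2. Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}.
    degree = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    # 3. Eigendecomposition; eigh returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(L)
    U = eigenvectors[:, :k]  # eigenvectors of the k smallest eigenvalues

    # Assumption: row-normalize the embedding before clustering.
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)

    # 4. K-Means on the rows of the spectral embedding.
    model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
    labels = model.fit_predict(U)
    return labels, eigenvalues
```

The dense eigendecomposition costs O(n³) time, so this sketch only suits small inputs.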

2. Automatic k Estimation (Eigen-gap)

The algorithm analyzes gaps between consecutive eigenvalues of the Laplacian matrix. A significant gap indicates a natural cluster boundary:

δᵢ = |λᵢ - λᵢ₋₁| / moving_average(λ)
k = first i where δᵢ > threshold
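
As a rough illustration of the rule above, the sketch below scans eigenvalues sorted in ascending order (index 0 first) and returns the first index whose relative gap exceeds a threshold. The source does not specify the threshold or the moving-average window, so threshold=2.0, window=3, the trailing-window definition, and the small eps guard are all assumptions. The function name estimate_k is hypothetical.

```python
def estimate_k(eigenvalues, threshold=2.0, window=3, min_k=2, max_k=50, eps=1e-12):
    """Return the first i with delta_i > threshold, or None if no gap qualifies.

    Assumptions (not taken from autokluster): the moving average is the mean
    of the `window` eigenvalues ending at index i, and eps guards against
    division by zero.
    """
    lam = sorted(float(v) for v in eigenvalues)  # ascending: lam[0] is smallest
    upper = min(max_k, len(lam) - 1)
    for i in range(max(min_k, 1), upper + 1):
        gap = abs(lam[i] - lam[i - 1])
        recent = lam[max(0, i - window + 1): i + 1]
        moving_average = sum(recent) / len(recent)
        delta = gap / (moving_average + eps)
        if delta > threshold:
            return i
    return None


# Three eigenvalues near zero followed by a jump suggest three clusters.
print(estimate_k([0.0, 0.01, 0.02, 0.9, 0.95, 1.0]))  # → 3
```

Relative gaps between very small eigenvalues can be large, so in practice the threshold and the min_k bound need tuning.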

3. Cohesion Ratio (ρ_C)

Measures clustering quality by comparing intra-cluster similarity to global similarity:

ρ_C = μ_intra / μ_global
  • ρ_C = 1: Clusters are no more cohesive than random (bad)
  • ρ_C > 1: Clusters are cohesive (good)
  • ρ_C > 2: Highly cohesive clusters (excellent)
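
One way to compute such a ratio is sketched below. It assumes that μ_intra is the mean cosine similarity over distinct pairs sharing a cluster label and that μ_global is the mean over all distinct pairs; the library's exact definition may differ. The function name cohesion_ratio is illustrative.

```python
import numpy as np


def cohesion_ratio(embeddings, labels):
    """Sketch of rho_C = mu_intra / mu_global using cosine similarity.

    Assumptions: no zero-norm rows, and at least one cluster has two or
    more members (otherwise there are no intra-cluster pairs to average).
    """
    X = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)

    # Pairwise cosine similarity.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = X @ X.T

    # Exclude self-similarity, which is always 1.
    off_diagonal = ~np.eye(len(X), dtype=bool)
    same_cluster = (labels[:, None] == labels[None, :]) & off_diagonal

    mu_intra = S[same_cluster].mean()
    mu_global = S[off_diagonal].mean()
    return mu_intra / mu_global
```

Grouping similar vectors yields a ratio above 1, while a labelling that mixes dissimilar vectors drops it below 1.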

API Reference

cluster(embeddings, k=None, min_k=2, max_k=50, random_state=None)

Parameter     Type        Default   Description
embeddings    ndarray     required  Input embeddings, shape (n_samples, n_features)
k             int | None  None      Number of clusters; if None, estimated automatically
min_k         int         2         Minimum k for auto-estimation
max_k         int         50        Maximum k for auto-estimation
random_state  int | None  None      Random seed for reproducibility

ClusterResult

Attribute       Type        Description
k               int         Number of clusters
labels          ndarray     Cluster assignments, shape (n_samples,)
cohesion_ratio  float       Quality metric (higher is better)
eigenvalues     ndarray     Laplacian eigenvalues
eigengap_index  int | None  Index where the eigen-gap was detected
cluster_sizes   list[int]   Number of samples per cluster

Requirements

  • Python 3.10+
  • numpy >= 1.24
  • scipy >= 1.10
  • scikit-learn >= 1.3

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

# Clone the repo
git clone https://github.com/FabienCadoret/autokluster.git
cd autokluster

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Check linting
ruff check src/

License

MIT License - see LICENSE for details.
