k-LLMmeans clustering algorithm

k-llmmeans

Scikit-learn compatible implementation of k-LLMmeans for text clustering with summary-based centroids.

This package adapts the original research code into a scikit-learn estimator API, so you can use the familiar fit, predict, and fit_predict workflows.

What This Package Provides

  • kLLMmeans estimator built on scikit-learn's BaseEstimator and ClusterMixin
  • scikit-learn style methods:
    • fit(X)
    • predict(X)
    • fit_predict(X)
  • configurable document embedding function (embedding_fn)
  • configurable cluster summarization function (summarizer_fn) or DSPy-backed LLM summarization
  • optional precomputed embedding support for faster iterative experimentation
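
To make the idea of summary-based centroids concrete, here is a minimal sketch of a single update step: assign each document to its nearest centroid, summarize each cluster, then use the embedding of that summary as the new centroid. It is an illustration only, not this package's implementation. The names `toy_embed`, `toy_summarize`, and `summary_centroid_step` are hypothetical, and the bag-of-words embedding and word-frequency summary are stand-ins for a real encoder and an LLM.

```python
import math
from collections import Counter


def toy_embed(text, vocab):
    """Stand-in encoder: L2-normalised bag-of-words counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def toy_summarize(texts):
    """Stand-in for an LLM summary: the three most frequent words in the cluster."""
    counts = Counter(w for t in texts for w in t.lower().split())
    return " ".join(w for w, _ in counts.most_common(3))


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def summary_centroid_step(docs, centroids, vocab):
    """One assignment + summary-centroid update (hypothetical helper)."""
    doc_vecs = [toy_embed(d, vocab) for d in docs]
    # Assignment: each document goes to its nearest current centroid.
    labels = [
        min(range(len(centroids)), key=lambda k: sq_dist(v, centroids[k]))
        for v in doc_vecs
    ]
    summaries, new_centroids = [], []
    for k in range(len(centroids)):
        members = [d for d, lab in zip(docs, labels) if lab == k]
        if not members:
            # Empty cluster: keep the old centroid and an empty summary.
            summaries.append("")
            new_centroids.append(centroids[k])
            continue
        summary = toy_summarize(members)
        summaries.append(summary)
        # Update: the new centroid is the embedding of the cluster summary.
        new_centroids.append(toy_embed(summary, vocab))
    return labels, summaries, new_centroids


docs = [
    "sql index tuning",
    "sql query index",
    "random forest tuning",
    "forest model validation",
]
vocab = sorted({w for d in docs for w in d.split()})
start = [toy_embed(docs[0], vocab), toy_embed(docs[2], vocab)]
labels, summaries, centroids = summary_centroid_step(docs, start, vocab)
print(labels)
print(summaries)
```

Because each centroid is derived from a piece of text, every cluster keeps a readable description alongside its vector, which is what the summaries_ attribute exposes in the estimator.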

Installation

pip install k-llmmeans

Or from source:

pip install -e .

Quick Start

import dspy
from k_llmmeans import kLLMmeans

# Pass a DSPy LM directly to the estimator
lm = dspy.LM("openai/gpt-5-mini")

docs = [
    "How to optimize SQL queries for large tables?",
    "What is the best way to tune a random forest model?",
    "PostgreSQL index strategy for analytics workloads",
    "Cross-validation tips for imbalanced classification",
]

model = kLLMmeans(
    n_clusters=2,
    llm=lm,
    max_llm_iter=5,
    random_state=0,
)

labels = model.fit_predict(docs)
print(labels)
print(model.summaries_)  # human-readable cluster summaries

Using Custom Embeddings and Summarization

You can fully control both the embedding and summarization steps:

from sentence_transformers import SentenceTransformer
from k_llmmeans import kLLMmeans

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def embedding_fn(texts: list[str]):
    # SentenceTransformer.encode returns one embedding row per input text
    return encoder.encode(texts)

def summarizer_fn(cluster_texts: list[str]) -> str:
    # Replace with your own deterministic or LLM summarizer
    return " | ".join(cluster_texts[:2])

model = kLLMmeans(
    n_clusters=3,
    embedding_fn=embedding_fn,
    summarizer_fn=summarizer_fn,
)

model.fit(["text a", "text b", "text c", "text d"])
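
For tests or offline experiments it can help to have an embedding function that needs no model download. The sketch below is one possible approach using only the standard library: it hashes each token into a fixed number of buckets and L2-normalises the result. The name `hashing_embedding_fn` and the `dim` parameter are hypothetical. It returns a list of lists, so if the estimator expects a NumPy array (an assumption about the expected return type), wrap the result with numpy.asarray.

```python
import hashlib
import math


def hashing_embedding_fn(texts, dim=64):
    """Deterministic bag-of-words embedding via the hashing trick (illustrative only)."""
    rows = []
    for text in texts:
        vec = [0.0] * dim
        for token in text.lower().split():
            # hashlib gives a hash that is stable across runs, unlike built-in hash().
            digest = hashlib.md5(token.encode("utf-8")).digest()
            vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
        norm = math.sqrt(sum(v * v for v in vec))
        # Leave all-zero rows (empty texts) unnormalised to avoid division by zero.
        rows.append([v / norm for v in vec] if norm else vec)
    return rows


embeddings = hashing_embedding_fn(["text a", "text b", ""])
print(len(embeddings), len(embeddings[0]))
```

Hashed embeddings carry no semantic information beyond shared tokens, so they are only suitable for checking that a pipeline runs end to end, not for judging clustering quality.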

API Notes

  • Input X should be a list of strings (list[str]).
  • The estimator stores standard fitted attributes such as:
    • labels_
    • cluster_centers_
    • n_iter_
  • Additional attributes for interpreting the clustering:
    • summaries_
    • summary_embeddings_
    • summaries_evolution_
    • centroids_evolution_
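
The fitted attributes above can be combined into a per-cluster report. The helper below is a minimal sketch that assumes summaries_ is a sequence indexed by cluster label and labels_ holds one integer label per document; neither shape is confirmed here, so check both against a fitted model. `describe_clusters` is a hypothetical name, and a SimpleNamespace stands in for a fitted estimator so the example runs without an LLM.

```python
from collections import defaultdict
from types import SimpleNamespace


def describe_clusters(model, docs):
    """Group documents by label and attach each cluster's summary (hypothetical helper)."""
    groups = defaultdict(list)
    for doc, label in zip(docs, model.labels_):
        groups[int(label)].append(doc)
    return {
        # Assumption: summaries_[label] is the summary for that cluster.
        label: {"summary": model.summaries_[label], "docs": members}
        for label, members in sorted(groups.items())
    }


# Stand-in for a fitted kLLMmeans instance, with made-up values.
fake_model = SimpleNamespace(
    labels_=[0, 1, 0],
    summaries_=["databases", "machine learning"],
)
report = describe_clusters(
    fake_model, ["sql tips", "forest tuning", "index strategy"]
)
for label, info in report.items():
    print(label, info["summary"], info["docs"])
```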

Citation

If you use this package in research or production work, please cite the original paper:

@article{diazrodriguez2025summaries,
  title={Summaries as Centroids for Interpretable and Scalable Text Clustering},
  author={Diaz-Rodriguez, Jairo},
  journal={arXiv preprint arXiv:2502.09667},
  year={2025}
}

Paper URL: https://arxiv.org/abs/2502.09667

Acknowledgment

This package is a scikit-learn compatible adaptation of the original project: https://github.com/jairoadiazr/k-LLMmeans
