k-LLMmeans clustering algorithm
k-llmmeans
Scikit-learn compatible implementation of k-LLMmeans for text clustering with summary-based centroids.
This package adapts the original research code into an estimator API you can use with familiar fit, predict, and fit_predict workflows.
- Original implementation: jairoadiazr/k-LLMmeans
- Paper: Summaries as Centroids for Interpretable and Scalable Text Clustering (arXiv:2502.09667)
What This Package Provides
- kLLMmeans estimator implementing BaseEstimator + ClusterMixin
- scikit-learn style methods: fit(X), predict(X), fit_predict(X)
- configurable document embedding function (embedding_fn)
- configurable cluster summarization function (summarizer_fn) or DSPy-backed LLM summarization
- optional precomputed embedding support for faster iterative experimentation
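To make "summary-based centroids" concrete, here is a rough, dependency-free sketch of the idea: assign documents to the nearest centroid, summarize each cluster, then use the embedding of that summary as the new centroid. It is not the package's code. `toy_embed`, `toy_summarize`, and `summary_centroid_clustering` are hypothetical stand-ins, and the loop leaves out the ordinary k-means mean updates and other details of the published algorithm.

```python
import math
import random
import re
from collections import Counter

DIM = 64  # size of the toy hashed bag-of-words vectors


def toy_embed(texts):
    """Hypothetical stand-in embedding: hashed bag-of-words, L2-normalised."""
    vectors = []
    for text in texts:
        vec = [0.0] * DIM
        for token in re.findall(r"[a-z]+", text.lower()):
            vec[sum(map(ord, token)) % DIM] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors


def toy_summarize(cluster_texts):
    """Hypothetical stand-in summarizer: the three most frequent tokens."""
    counts = Counter(
        token
        for text in cluster_texts
        for token in re.findall(r"[a-z]+", text.lower())
    )
    return " ".join(token for token, _ in counts.most_common(3))


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(vectors, centroids):
    """Label each vector with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: sq_dist(v, centroids[k]))
        for v in vectors
    ]


def summary_centroid_clustering(docs, n_clusters, n_rounds=3, seed=0):
    """Toy loop: summarize each cluster, then re-centre on the summary."""
    rng = random.Random(seed)
    vectors = toy_embed(docs)
    # Start from randomly chosen documents as initial centroids.
    centroids = [list(vectors[i]) for i in rng.sample(range(len(docs)), n_clusters)]
    summaries = [""] * n_clusters
    labels = assign(vectors, centroids)
    for _ in range(n_rounds):
        for k in range(n_clusters):
            members = [d for d, lab in zip(docs, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                summaries[k] = toy_summarize(members)
                centroids[k] = toy_embed([summaries[k]])[0]
        labels = assign(vectors, centroids)
    return labels, summaries
```

Because each centroid is the embedding of a piece of text, every cluster comes with a readable description, which is what makes the result interpretable.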
Installation
pip install k-llmmeans
Or from source:
pip install -e .
Quick Start
import dspy
from k_llmmeans import kLLMmeans

# Option 1: pass an LM directly to the estimator
lm = dspy.LM("openai/gpt-5-mini")

docs = [
    "How to optimize SQL queries for large tables?",
    "What is the best way to tune a random forest model?",
    "PostgreSQL index strategy for analytics workloads",
    "Cross-validation tips for imbalanced classification",
]

model = kLLMmeans(
    n_clusters=2,
    llm=lm,
    max_llm_iter=5,
    random_state=0,
)
labels = model.fit_predict(docs)
print(labels)
print(model.summaries_)  # human-readable cluster summaries
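Once you have labels, it is often handy to see which documents landed together. The helper below is a small sketch that works on any pair of equal-length sequences; `group_by_label` is a hypothetical name and not part of this package.

```python
from collections import defaultdict


def group_by_label(docs, labels):
    """Map each cluster label to the list of documents assigned to it."""
    groups = defaultdict(list)
    for doc, label in zip(docs, labels):
        # int() turns NumPy integer labels into plain Python ints.
        groups[int(label)].append(doc)
    return dict(groups)
```

With the Quick Start variables, `group_by_label(docs, labels)` would return a dictionary keyed by cluster label.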
Using Custom Embeddings and Summarization
You can fully control both the embedding and summarization steps:
from sentence_transformers import SentenceTransformer
from k_llmmeans import kLLMmeans

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def embedding_fn(texts: list[str]):
    return encoder.encode(texts)

def summarizer_fn(cluster_texts: list[str]) -> str:
    # Replace with your own deterministic or LLM summarizer
    return " | ".join(cluster_texts[:2])

model = kLLMmeans(
    n_clusters=3,
    embedding_fn=embedding_fn,
    summarizer_fn=summarizer_fn,
)
model.fit(["text a", "text b", "text c", "text d"])
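If you want a deterministic summarizer that needs no LLM calls (for tests or offline runs), one possible approach is a keyword summary. The sketch below follows the summarizer_fn signature shown above (list[str] in, str out); `keyword_summarizer`, `STOPWORDS`, and the `top_k` parameter are illustrative names, and the stopword list is deliberately tiny.

```python
import re
from collections import Counter

# Minimal illustrative stopword list; extend it for real data.
STOPWORDS = frozenset(
    {"a", "an", "and", "for", "how", "in", "is", "of", "the", "to", "what", "with"}
)


def keyword_summarizer(cluster_texts: list[str], top_k: int = 5) -> str:
    """Summarize a cluster as its most frequent non-stopword tokens."""
    counts = Counter(
        token
        for text in cluster_texts
        for token in re.findall(r"[a-z0-9]+", text.lower())
        if token not in STOPWORDS
    )
    return ", ".join(token for token, _ in counts.most_common(top_k))
```

It can be passed as summarizer_fn=keyword_summarizer. A keyword list is a much cruder centroid than an LLM-written summary, so expect clustering quality to differ.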
API Notes
- Input X should be list[str].
- The estimator stores standard fitted attributes such as:
  - labels_
  - cluster_centers_
  - n_iter_
- Additional clustering interpretability attributes:
  - summaries_
  - summary_embeddings_
  - summaries_evolution_
  - centroids_evolution_
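A quick sanity check after fitting is to confirm that the attributes listed above are present. The sketch below only relies on the attribute names from this README; `missing_fitted_attributes` is a hypothetical helper, and the `SimpleNamespace` object is a stand-in used so the snippet runs without the package or an LLM.

```python
from types import SimpleNamespace

# Attribute names as listed in the API notes of this README.
DOCUMENTED_ATTRIBUTES = (
    "labels_",
    "cluster_centers_",
    "n_iter_",
    "summaries_",
    "summary_embeddings_",
    "summaries_evolution_",
    "centroids_evolution_",
)


def missing_fitted_attributes(model):
    """Return the documented fitted attributes that `model` does not expose."""
    return [name for name in DOCUMENTED_ATTRIBUTES if not hasattr(model, name)]


# Stand-in object for illustration; a fitted kLLMmeans instance would go here.
fake_model = SimpleNamespace(labels_=[0, 1], summaries_=["databases", "ml"])
print(missing_fitted_attributes(fake_model))
```

An empty list means every documented attribute is available on the object.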
Citation
If you use this package in research or production work, please cite the original paper:
@article{diazrodriguez2025summaries,
  title={Summaries as Centroids for Interpretable and Scalable Text Clustering},
  author={Diaz-Rodriguez, Jairo},
  journal={arXiv preprint arXiv:2502.09667},
  year={2025}
}
Paper URL: https://arxiv.org/abs/2502.09667
Acknowledgment
This package is a scikit-learn compatible adaptation of the original project: https://github.com/jairoadiazr/k-LLMmeans