k-LLMmeans clustering algorithm

k-llmmeans

A scikit-learn-compatible implementation of k-LLMmeans, a text clustering method that uses summaries as cluster centroids.

This package adapts the original research code into an estimator API you can use with familiar fit, predict, and fit_predict workflows.

What This Package Provides

  • kLLMmeans estimator that subclasses scikit-learn's BaseEstimator and ClusterMixin
  • scikit-learn style methods:
    • fit(X)
    • predict(X)
    • fit_predict(X)
  • configurable document embedding function (embedding_fn)
  • configurable cluster summarization function (summarizer_fn) or DSPy-backed LLM summarization
  • optional precomputed embedding support for faster iterative experimentation
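
The idea of summary-based centroids can be illustrated with a small, dependency-free sketch. This is not the package's implementation: the function names (toy_embed, toy_summarize, cluster_with_summary_centroids), the farthest-point seeding, and the choice to re-summarize on every round are all assumptions made for illustration. It only shows how an embedding_fn and a summarizer_fn could cooperate: documents are assigned to the nearest center, each cluster is summarized, and the embedding of that summary becomes the new center.

```python
from collections import Counter


def toy_embed(texts):
    """Hypothetical embedding_fn: counts of a few fixed vocabulary words."""
    vocab = ("sql", "index", "model", "forest")
    return [[float(t.lower().split().count(w)) for w in vocab] for t in texts]


def toy_summarize(cluster_texts):
    """Hypothetical summarizer_fn: the two most frequent words in the cluster."""
    counts = Counter(w for t in cluster_texts for w in t.lower().split())
    return " ".join(word for word, _ in counts.most_common(2))


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(vectors, centers):
    """Index of the nearest center for every vector."""
    return [
        min(range(len(centers)), key=lambda k: sq_dist(v, centers[k]))
        for v in vectors
    ]


def cluster_with_summary_centroids(docs, n_clusters, embedding_fn,
                                   summarizer_fn, n_rounds=3):
    vectors = [list(v) for v in embedding_fn(docs)]

    # Deterministic farthest-point seeding (a stand-in chosen for this sketch,
    # not necessarily how the package initializes its centers).
    centers = [vectors[0]]
    while len(centers) < n_clusters:
        centers.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centers))
        )

    summaries = [""] * n_clusters
    for _ in range(n_rounds):
        labels = assign(vectors, centers)
        for k in range(n_clusters):
            members = [d for d, lab in zip(docs, labels) if lab == k]
            if members:
                # The centroid is the embedding of the cluster's summary.
                summaries[k] = summarizer_fn(members)
                centers[k] = list(embedding_fn([summaries[k]])[0])

    # Final assignment so labels match the returned summaries.
    return assign(vectors, centers), summaries


toy_docs = [
    "sql index tuning",
    "sql index sql",
    "model forest tuning",
    "forest model model",
]
toy_labels, toy_summaries = cluster_with_summary_centroids(
    toy_docs, 2, toy_embed, toy_summarize
)
print(toy_labels)     # [0, 0, 1, 1]
print(toy_summaries)  # ['sql index', 'model forest']
```

Because each center is derived from text, every cluster carries a readable description, which is the property the estimator exposes through its summary attributes.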

Installation

pip install k-llmmeans

Or from source:

pip install -e .

Quick Start

import dspy
from k_llmmeans import kLLMmeans

# Pass a DSPy LM directly to the estimator
lm = dspy.LM("openai/gpt-5-mini")

docs = [
    "How to optimize SQL queries for large tables?",
    "What is the best way to tune a random forest model?",
    "PostgreSQL index strategy for analytics workloads",
    "Cross-validation tips for imbalanced classification",
]

model = kLLMmeans(
    n_clusters=2,
    llm=lm,
    max_llm_iter=5,
    random_state=0,
)

labels = model.fit_predict(docs)
print(labels)
print(model.summaries_)  # human-readable cluster summaries
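
To read the result, it can help to group the documents by their assigned label. The helper below is a small sketch and is not part of the package; group_by_label is a hypothetical name, and example_labels is made-up data standing in for the output of fit_predict, since the real assignment depends on the embedding model and the LLM.

```python
from collections import defaultdict


def group_by_label(docs, labels):
    """Map each cluster label to the list of documents assigned to it."""
    groups = defaultdict(list)
    for doc, label in zip(docs, labels):
        # int() also accepts NumPy integer labels.
        groups[int(label)].append(doc)
    return dict(sorted(groups.items()))


docs = [
    "How to optimize SQL queries for large tables?",
    "What is the best way to tune a random forest model?",
    "PostgreSQL index strategy for analytics workloads",
    "Cross-validation tips for imbalanced classification",
]

# Hypothetical labels for illustration only, not real model output.
example_labels = [0, 1, 0, 1]

for label, members in group_by_label(docs, example_labels).items():
    print(label, members)
```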

Using Custom Embeddings and Summarization

You can fully control both the embedding and summarization steps:

from sentence_transformers import SentenceTransformer
from k_llmmeans import kLLMmeans

encoder = SentenceTransformer("all-MiniLM-L6-v2")

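def embedding_fn(texts: list[str]):
    # Returns one embedding vector per input text (a 2-D NumPy array)
    return encoder.encode(texts)

def summarizer_fn(cluster_texts: list[str]) -> str:
    # Placeholder that joins the first two texts of the cluster.
    # Replace with your own deterministic or LLM summarizer.
    return " | ".join(cluster_texts[:2])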

model = kLLMmeans(
    n_clusters=3,
    embedding_fn=embedding_fn,
    summarizer_fn=summarizer_fn,
)

model.fit(["text a", "text b", "text c", "text d"])
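
When you re-run experiments on the same corpus, repeated embedding calls tend to dominate the runtime. The package's own option for precomputed embeddings is not detailed here, so the sketch below takes a different route and caches at the callable level. CachedEmbedder and fake_encoder are hypothetical names introduced for illustration. The wrapper returns a plain list of vectors; whether the estimator accepts that or needs a NumPy array is an assumption to verify, and wrapping the result in numpy.asarray may be necessary.

```python
class CachedEmbedder:
    """Wrap an embedding function and memoize one vector per distinct text."""

    def __init__(self, embedding_fn):
        self._embedding_fn = embedding_fn
        self._cache = {}
        self.n_backend_calls = 0  # how often the wrapped function was invoked

    def __call__(self, texts):
        # Embed only texts not seen before, deduplicated in input order.
        missing = [t for t in dict.fromkeys(texts) if t not in self._cache]
        if missing:
            self.n_backend_calls += 1
            for text, vector in zip(missing, self._embedding_fn(missing)):
                self._cache[text] = vector
        return [self._cache[t] for t in texts]


def fake_encoder(texts):
    """Stand-in for a real encoder: [character count, space count] per text."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


cached_embedding_fn = CachedEmbedder(fake_encoder)
first = cached_embedding_fn(["a b", "c", "a b"])   # one backend call
second = cached_embedding_fn(["c", "d e f"])       # only "d e f" is new
third = cached_embedding_fn(["a b"])               # served from the cache
print(cached_embedding_fn.n_backend_calls)
```

A wrapper like this could be passed wherever an embedding_fn is expected, so the expensive encoder runs once per distinct text across repeated fits.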

API Notes

  • Input X should be list[str].
  • The estimator stores standard fitted attributes such as:
    • labels_
    • cluster_centers_
    • n_iter_
  • Additional attributes for interpreting the clusters:
    • summaries_
    • summary_embeddings_
    • summaries_evolution_
    • centroids_evolution_
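
As a mental model for how predict could relate to cluster_centers_, the sketch below assigns each vector to its nearest center, as a k-means style estimator typically does. This is an assumption, not a description of the package internals: the distance metric used by kLLMmeans is not stated here, Euclidean distance is used only as an example, and nearest_center_labels is a hypothetical helper.

```python
import math


def nearest_center_labels(vectors, centers):
    """Return, for each vector, the index of the closest center (Euclidean)."""
    labels = []
    for vector in vectors:
        distances = [math.dist(vector, center) for center in centers]
        labels.append(distances.index(min(distances)))
    return labels


# Toy 2-D "embeddings" and centers, purely illustrative.
example_centers = [[0.0, 0.0], [10.0, 10.0]]
example_vectors = [[1.0, 1.0], [9.0, 8.0], [4.0, 4.0]]

print(nearest_center_labels(example_vectors, example_centers))  # [0, 1, 0]
```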

Citation

If you use this package in research or production work, please cite the original paper:

@article{diazrodriguez2025summaries,
  title={Summaries as Centroids for Interpretable and Scalable Text Clustering},
  author={Diaz-Rodriguez, Jairo},
  journal={arXiv preprint arXiv:2502.09667},
  year={2025}
}

Paper URL: https://arxiv.org/abs/2502.09667

Acknowledgment

This package is a scikit-learn compatible adaptation of the original project: https://github.com/jairoadiazr/k-LLMmeans

