k-LLMmeans clustering algorithm

k-llmmeans

A scikit-learn-compatible implementation of k-LLMmeans, a text clustering algorithm that uses summaries as centroids.

This package adapts the original research code into an estimator API you can use with familiar fit, predict, and fit_predict workflows.
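The idea of summary-based centroids can be sketched without external dependencies. The block below is a minimal illustration under stated assumptions, not this package's implementation: `toy_embed`, `toy_summarize`, `nearest`, and `summary_centroid_clustering` are hypothetical names, the hashed bag-of-words embedding stands in for a real encoder, and string concatenation stands in for an LLM summary. For brevity every update step uses a summary; the package and the paper may interleave summary steps with ordinary mean updates.

```python
import hashlib
import math
import random


def toy_embed(text: str, dim: int = 32) -> list[float]:
    """Hypothetical stand-in embedding: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def toy_summarize(texts: list[str]) -> str:
    """Hypothetical stand-in for an LLM summary: concatenate the texts."""
    return " ".join(texts)


def nearest(vec: list[float], centroids: list[list[float]]) -> int:
    """Index of the centroid with the smallest squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(vec, c)) for c in centroids]
    return dists.index(min(dists))


def summary_centroid_clustering(docs, n_clusters, n_iter=5, seed=0):
    """k-means-style loop whose centroids are embeddings of cluster summaries."""
    rng = random.Random(seed)
    vectors = [toy_embed(d) for d in docs]
    # Initialise centroids from randomly chosen documents.
    centroids = [vectors[i] for i in rng.sample(range(len(docs)), n_clusters)]
    summaries = [""] * n_clusters
    labels = [0] * len(docs)
    for _ in range(n_iter):
        # Assignment step: attach each document to its nearest centroid.
        labels = [nearest(v, centroids) for v in vectors]
        # Update step: summarise each non-empty cluster, then embed the
        # summary and use that embedding as the new centroid.
        for k in range(n_clusters):
            members = [d for d, lab in zip(docs, labels) if lab == k]
            if members:
                summaries[k] = toy_summarize(members)
                centroids[k] = toy_embed(summaries[k])
    return labels, summaries
```

Calling `summary_centroid_clustering(docs, n_clusters=2)` returns one label per document and one summary string per cluster. Because each centroid is derived from text, every cluster carries a readable description in addition to a vector.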

What This Package Provides

  • kLLMmeans estimator that inherits from scikit-learn's BaseEstimator and ClusterMixin
  • scikit-learn style methods:
    • fit(X)
    • predict(X)
    • fit_predict(X)
  • configurable document embedding function (embedding_fn)
  • configurable cluster summarization function (summarizer_fn) or DSPy-backed LLM summarization
  • optional support for precomputed embeddings, for faster iterative experimentation

Installation

pip install k-llmmeans

Or from source:

pip install -e .

Quick Start

import dspy
from k_llmmeans import kLLMmeans

# Pass a DSPy LM directly to the estimator
lm = dspy.LM("openai/gpt-5-mini")

docs = [
    "How to optimize SQL queries for large tables?",
    "What is the best way to tune a random forest model?",
    "PostgreSQL index strategy for analytics workloads",
    "Cross-validation tips for imbalanced classification",
]

model = kLLMmeans(
    n_clusters=2,
    llm=lm,
    max_llm_iter=5,
    random_state=0,
)

labels = model.fit_predict(docs)
print(labels)
print(model.summaries_)  # human-readable cluster summaries

Using Custom Embeddings and Summarization

You can fully control both the embedding and summarization steps:

from sentence_transformers import SentenceTransformer
from k_llmmeans import kLLMmeans

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def embedding_fn(texts: list[str]):
    return encoder.encode(texts)

def summarizer_fn(cluster_texts: list[str]) -> str:
    # Replace with your own deterministic or LLM summarizer
    return " | ".join(cluster_texts[:2])

model = kLLMmeans(
    n_clusters=3,
    embedding_fn=embedding_fn,
    summarizer_fn=summarizer_fn,
)

model.fit(["text a", "text b", "text c", "text d"])
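As a slightly richer alternative to the placeholder summarizer above, a deterministic keyword-based summarizer can be written with the standard library alone. This is a sketch, not part of the package: `keyword_summarizer`, `STOP_WORDS`, and the `top_k` parameter are hypothetical names, and the only thing taken from the example above is the `list[str] -> str` signature.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption); extend it for real data.
STOP_WORDS = {"a", "an", "the", "to", "for", "of", "is", "in", "and", "what", "how"}


def keyword_summarizer(cluster_texts: list[str], top_k: int = 5) -> str:
    """Summarise a cluster as its most frequent non-stop-words."""
    counts = Counter(
        token
        for text in cluster_texts
        for token in re.findall(r"[a-z0-9]+", text.lower())
        if token not in STOP_WORDS
    )
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ", ".join(word for word, _ in ranked[:top_k])
```

Since `top_k` has a default, the function can be called with a single argument and could be supplied as `summarizer_fn=keyword_summarizer`. A summarizer like this makes no LLM calls, which can be convenient for tests and offline runs.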

API Notes

  • Input X should be a list of strings (list[str]), one document per element.
  • The estimator stores standard fitted attributes such as:
    • labels_
    • cluster_centers_
    • n_iter_
  • Additional clustering interpretability attributes:
    • summaries_
    • summary_embeddings_
    • summaries_evolution_
    • centroids_evolution_
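The fitted attributes listed above can be combined to inspect a clustering. The helper below is a hypothetical sketch (`group_by_cluster` is not part of the package) and works on plain sequences. It assumes that `summaries[k]` describes the cluster with label `k`, which the notes above do not confirm.

```python
from collections import defaultdict


def group_by_cluster(docs, labels, summaries):
    """Return (label, summary, member documents) tuples, sorted by label."""
    grouped = defaultdict(list)
    for doc, label in zip(docs, labels):
        # int() also accepts NumPy integer labels.
        grouped[int(label)].append(doc)
    # Assumption: summaries[k] is the summary of the cluster labelled k.
    return [(k, summaries[k], members) for k, members in sorted(grouped.items())]
```

With a fitted estimator, this could be called as `group_by_cluster(docs, model.labels_, model.summaries_)` to print each summary next to the documents assigned to it.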

Citation

If you use this package in research or production work, please cite the original paper:

@article{diazrodriguez2025summaries,
  title={Summaries as Centroids for Interpretable and Scalable Text Clustering},
  author={Diaz-Rodriguez, Jairo},
  journal={arXiv preprint arXiv:2502.09667},
  year={2025}
}

Paper URL: https://arxiv.org/abs/2502.09667

Acknowledgment

This package is a scikit-learn-compatible adaptation of the original project: https://github.com/jairoadiazr/k-LLMmeans
