k-LLMmeans clustering algorithm

k-llmmeans

A scikit-learn-compatible implementation of k-LLMmeans, a text clustering algorithm that uses text summaries as cluster centroids.

This package adapts the original research code into an estimator API you can use with familiar fit, predict, and fit_predict workflows.

What This Package Provides

  • kLLMmeans estimator that implements scikit-learn's BaseEstimator and ClusterMixin
  • scikit-learn style methods:
    • fit(X)
    • predict(X)
    • fit_predict(X)
  • configurable document embedding function (embedding_fn)
  • configurable cluster summarization function (summarizer_fn) or DSPy-backed LLM summarization
  • optional precomputed embedding support for faster iterative experimentation
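
To make "summary-based centroids" concrete, here is a minimal sketch of one summary step, assuming a k-means-style loop over document embeddings. It is an illustration of the idea only, not this package's internals: the helper names (assign, summary_centroid_step, toy_embed, toy_summarize) are hypothetical, and the toy embedding and summarizer stand in for a real encoder and LLM.

```python
import numpy as np

def assign(embeddings, centroids):
    # Squared Euclidean distance from every document to every centroid.
    dists = ((embeddings[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def summary_centroid_step(texts, labels, centroids, embedding_fn, summarizer_fn):
    # Replace each centroid with the embedding of a summary of its cluster.
    new_centroids = centroids.copy()
    summaries = [""] * len(centroids)
    for k in range(len(centroids)):
        members = [t for t, lab in zip(texts, labels) if lab == k]
        if not members:
            continue  # empty cluster: keep the previous centroid
        summaries[k] = summarizer_fn(members)
        new_centroids[k] = np.asarray(embedding_fn([summaries[k]]))[0]
    return new_centroids, summaries

# Toy stand-ins (hypothetical): keyword counts as embeddings,
# first member as the "summary".
def toy_embed(texts):
    return np.array(
        [[t.count("sql"), t.count("model")] for t in texts], dtype=float
    )

def toy_summarize(cluster_texts):
    return cluster_texts[0]

texts = ["sql index", "sql query", "model tuning", "model validation"]
emb = toy_embed(texts)
centroids = emb[[0, 2]].copy()      # seed centroids from two documents
labels = assign(emb, centroids)     # ordinary assignment step
centroids, summaries = summary_centroid_step(
    texts, labels, centroids, toy_embed, toy_summarize
)
labels = assign(emb, centroids)     # reassign against summary centroids
```

Because each centroid is the embedding of a piece of text, every cluster comes with a readable description, which is what summaries_ exposes on the fitted estimator.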

Installation

pip install k-llmmeans

Or, from a local clone of the source repository:

pip install -e .

Quick Start

import dspy
from k_llmmeans import kLLMmeans

# Pass a DSPy LM directly to the estimator; it is used for cluster summarization
lm = dspy.LM("openai/gpt-5-mini")

docs = [
    "How to optimize SQL queries for large tables?",
    "What is the best way to tune a random forest model?",
    "PostgreSQL index strategy for analytics workloads",
    "Cross-validation tips for imbalanced classification",
]

model = kLLMmeans(
    n_clusters=2,
    llm=lm,
    max_llm_iter=5,
    random_state=0,
)

labels = model.fit_predict(docs)
print(labels)
print(model.summaries_)  # human-readable cluster summaries

Using Custom Embeddings and Summarization

You can fully control both the embedding and summarization steps:

from sentence_transformers import SentenceTransformer
from k_llmmeans import kLLMmeans

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def embedding_fn(texts: list[str]):
    return encoder.encode(texts)

def summarizer_fn(cluster_texts: list[str]) -> str:
    # Replace with your own deterministic or LLM summarizer
    return " | ".join(cluster_texts[:2])

model = kLLMmeans(
    n_clusters=3,
    embedding_fn=embedding_fn,
    summarizer_fn=summarizer_fn,
)

model.fit(["text a", "text b", "text c", "text d"])
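
For quick offline experiments or unit tests, a deterministic embedding function can stand in for the sentence-transformers encoder. The sketch below is an assumption about what is convenient, not part of the package: hashed_bow_embedding_fn is a hypothetical helper that maps list[str] to a 2-D NumPy array, matching the shape of the embedding_fn shown above.

```python
import hashlib

import numpy as np

def hashed_bow_embedding_fn(texts: list[str], dim: int = 64) -> np.ndarray:
    """Hashed bag-of-words embeddings: deterministic, no model download."""
    vectors = np.zeros((len(texts), dim), dtype=np.float32)
    for row, text in enumerate(texts):
        for token in text.lower().split():
            # A stable hash keeps results identical across runs and machines
            # (Python's built-in hash() is salted per process).
            digest = hashlib.md5(token.encode("utf-8")).digest()
            col = int.from_bytes(digest[:4], "little") % dim
            vectors[row, col] += 1.0
    # L2-normalize each row; empty texts stay as zero vectors.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, 1e-12)
```

It could then be passed as embedding_fn=hashed_bow_embedding_fn in place of the encoder-based function. Expect much weaker clusters than with a real sentence encoder; the point is reproducibility, not quality.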

API Notes

  • Input X should be a list of strings (list[str]).
  • The estimator stores standard fitted attributes such as:
    • labels_
    • cluster_centers_
    • n_iter_
  • Additional clustering interpretability attributes:
    • summaries_
    • summary_embeddings_
    • summaries_evolution_
    • centroids_evolution_
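
A small helper can pair each summary with the documents assigned to it. This is a sketch under one assumption that the notes above do not confirm: that summaries_ is a sequence indexed by cluster label. The function name group_by_cluster is hypothetical.

```python
from collections import defaultdict

def group_by_cluster(docs, labels, summaries):
    """Return (label, summary, member_docs) tuples, sorted by cluster label."""
    groups = defaultdict(list)
    for doc, label in zip(docs, labels):
        groups[int(label)].append(doc)
    # Assumes summaries[k] describes the cluster with label k.
    return [(k, summaries[k], members) for k, members in sorted(groups.items())]
```

With a fitted estimator, this would be called as group_by_cluster(docs, model.labels_, model.summaries_) and the result printed or logged for inspection.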

Citation

If you use this package in research or production work, please cite the original paper:

@article{diazrodriguez2025summaries,
  title={Summaries as Centroids for Interpretable and Scalable Text Clustering},
  author={Diaz-Rodriguez, Jairo},
  journal={arXiv preprint arXiv:2502.09667},
  year={2025}
}

Paper URL: https://arxiv.org/abs/2502.09667

Acknowledgment

This package is a scikit-learn compatible adaptation of the original project: https://github.com/jairoadiazr/k-LLMmeans
