LLM plugin for clustering embeddings

llm-cluster

LLM plugin for clustering embeddings.

Installation

Install this plugin in the same environment as LLM.

llm install llm-cluster

Usage

The plugin adds a new command, llm cluster. This command takes the name of an embedding collection and the number of clusters to return.

First, use paginate-json and jq to populate a collection. In this case we are embedding the title of every issue in the llm repository, and storing the result in an issues.db database:

paginate-json 'https://api.github.com/repos/simonw/llm/issues?state=all&filter=all' \
  | jq '[.[] | {id: .id, title: .title}]' \
  | llm embed-multi llm-issues - \
    --database issues.db --store

The --store flag causes the content to be stored in the database along with the embedding vectors.
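
For readers less familiar with jq, the filter in the pipeline above can be approximated in Python. This is only a sketch: the function name `select_id_and_title` and the sample records are made up for illustration, and the sample is merely shaped like a GitHub issues API response.

```python
import json


def select_id_and_title(issues):
    """Keep only the two fields that the jq filter keeps: id and title."""
    # .get() mirrors jq, which yields null for a missing key instead of failing.
    return [{"id": issue.get("id"), "title": issue.get("title")} for issue in issues]


# Hypothetical sample records; real API responses carry many more fields.
sample = [
    {"id": 1650662628, "title": "Initial design", "state": "closed"},
    {"id": 1650682379, "title": "Log prompts and responses to SQLite", "state": "closed"},
]

# A JSON array of objects like this is what gets piped to `llm embed-multi ... -`.
print(json.dumps(select_id_and_title(sample), indent=2))
```

Each object's id identifies the stored embedding, and the remaining field (here the title) is the text that gets embedded.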

Now we can cluster those embeddings into 10 groups:

llm cluster llm-issues 10 \
  -d issues.db

If you omit the -d option the default embeddings database will be used.
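
To give some intuition for what grouping embeddings into a fixed number of clusters means, here is a minimal k-means sketch in pure Python. It is an illustration only: the choice of k-means, the function name `kmeans`, the naive initialisation, and the toy two-dimensional vectors are all assumptions, and the plugin may use a different clustering implementation. Real embedding vectors have hundreds or thousands of dimensions.

```python
import math


def kmeans(vectors, k, iterations=20):
    """Assign each vector to one of k clusters; returns a list of cluster indexes."""
    # Naive deterministic initialisation (an assumption): first k vectors as centroids.
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid (Euclidean distance).
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Toy data: two points near the origin and two points near (5, 5).
toy_vectors = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9)]
labels = kmeans(toy_vectors, k=2)
print(labels)  # [0, 1, 0, 1]
```

Vectors that end up with the same label form one cluster, which corresponds to one object in the JSON output of the command.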

The output should look something like this (truncated):

[
  {
    "id": "2",
    "items": [
      {
        "id": "1650662628",
        "content": "Initial design"
      },
      {
        "id": "1650682379",
        "content": "Log prompts and responses to SQLite"
      }
    ]
  },
  {
    "id": "4",
    "items": [
      {
        "id": "1650760699",
        "content": "llm web command - launches a web server"
      },
      {
        "id": "1759659476",
        "content": "`llm models` command"
      },
      {
        "id": "1784156919",
        "content": "`llm.get_model(alias)` helper"
      }
    ]
  },
  {
    "id": "7",
    "items": [
      {
        "id": "1650765575",
        "content": "--code mode for outputting code"
      },
      {
        "id": "1659086298",
        "content": "Accept PROMPT from --stdin"
      },
      {
        "id": "1714651657",
        "content": "Accept input from standard in"
      }
    ]
  }
]
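
Because the output is JSON, it is easy to post-process. The following sketch counts the items in each cluster. The helper name `cluster_sizes` is hypothetical, and the embedded sample is an abbreviated copy of the output above; in practice the text would come from the command itself, for example redirected to a file.

```python
import json

# Abbreviated sample in the shape shown above.
output = """
[
  {"id": "2", "items": [
    {"id": "1650662628", "content": "Initial design"},
    {"id": "1650682379", "content": "Log prompts and responses to SQLite"}
  ]},
  {"id": "4", "items": [
    {"id": "1650760699", "content": "llm web command - launches a web server"}
  ]}
]
"""


def cluster_sizes(clusters):
    """Map each cluster id to the number of items it contains."""
    return {cluster["id"]: len(cluster["items"]) for cluster in clusters}


clusters = json.loads(output)
for cluster_id, size in cluster_sizes(clusters).items():
    print(f"cluster {cluster_id}: {size} items")
```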

The content displayed is truncated to 100 characters. Pass --truncate 0 to disable truncation, or --truncate X to truncate to X characters.
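
The truncation rule described above can be sketched as a small Python function. This only models the documented behaviour (a default of 100 characters, with 0 meaning no truncation); the function name `truncate` is hypothetical, and details such as how the plugin treats items with no stored content are not covered here.

```python
def truncate(content, limit=100):
    """Return content cut to `limit` characters; a limit of 0 disables truncation."""
    if limit == 0:
        return content
    return content[:limit]


title = "Log prompts and responses to SQLite"
print(truncate(title, 10))  # only the first 10 characters
print(truncate(title, 0))   # the full, unchanged string
```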

Generating summaries for each cluster

The --summary flag will cause the plugin to generate a summary for each cluster, by passing the content of the items (truncated according to the --truncate option) through a prompt to a Large Language Model.

This feature is still experimental. You should experiment with custom prompts to improve the quality of your summaries.

Because this can run a large amount of text through an LLM, it can be expensive, depending on which model you are using.

This feature only works for embeddings that have had their associated content stored in the database using the --store flag.
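
The overall flow can be pictured with the sketch below. It is not the plugin's code: `add_summaries`, the `complete` callable, and the way item contents are joined with newlines are assumptions made for illustration, and a fake model stands in for a real LLM so the example runs offline.

```python
DEFAULT_PROMPT = "Short, concise title for this cluster of related documents."


def add_summaries(clusters, complete, prompt=DEFAULT_PROMPT):
    """Attach a "summary" key to each cluster using the `complete` callable."""
    for cluster in clusters:
        # Items with no stored content have nothing to contribute to a summary.
        texts = [item["content"] for item in cluster["items"] if item.get("content")]
        # Assumption: contents are joined one per line before being sent to the model.
        cluster["summary"] = complete(prompt, "\n".join(texts))
    return clusters


def fake_complete(instructions, text):
    """Stand-in for a real model call; just reports how many lines it received."""
    return f"{len(text.splitlines())} documents"


clusters = [{"id": "2", "items": [
    {"id": "1650662628", "content": "Initial design"},
    {"id": "1650682379", "content": "Log prompts and responses to SQLite"},
]}]
print(add_summaries(clusters, fake_complete)[0]["summary"])  # 2 documents
```

Passing the model call in as a parameter keeps the sketch independent of any particular model or API, and it makes clear why items without stored content cannot be summarised.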

You can use it like this:

llm cluster llm-issues 10 \
  -d issues.db \
  --summary

This uses the default prompt and the default model.

To use a different model, e.g. GPT-4, pass the --model option:

llm cluster llm-issues 10 \
  -d issues.db \
  --summary \
  --model gpt-4

The default prompt used is:

Short, concise title for this cluster of related documents.

To use a custom prompt, pass --prompt:

llm cluster llm-issues 10 \
  -d issues.db \
  --summary \
  --model gpt-4 \
  --prompt 'Summarize this in a short line in the style of a bored, angry panda'

A "summary" key will be added to each cluster, containing the generated summary.

Development

To set up this plugin locally, first check out the code. Then create a new virtual environment:

cd llm-cluster
python3 -m venv venv
source venv/bin/activate

Now install the dependencies and test dependencies:

pip install -e '.[test]'

To run the tests:

pytest
