
Project description

llm-cluster


LLM plugin for clustering embeddings.

Installation

Install this plugin in the same environment as LLM.

llm install llm-cluster
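
To confirm the plugin is installed, list the plugins LLM knows about; llm-cluster should appear in the output:

llm plugins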

Usage

The plugin adds a new command, llm cluster. This command takes the name of an embedding collection and the number of clusters to return.
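
The general shape of an invocation is sketched below; the angle-bracketed values are placeholders, and llm cluster --help shows the full list of options:

llm cluster <collection-name> <number-of-clusters>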

First, use paginate-json and jq to populate a collection. In this case we are embedding the title of every issue in the llm repository, and storing the results in an issues.db database:

paginate-json 'https://api.github.com/repos/simonw/llm/issues?state=all&filter=all' \
  | jq '[.[] | {id: .id, title: .title}]' \
  | llm embed-multi llm-issues - \
    --database issues.db --store
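
The jq filter reduces each issue to just its id and title, so the JSON piped into llm embed-multi looks roughly like this sketch (the two example records are taken from the cluster output shown further down):

[
  {"id": 1650662628, "title": "Initial design"},
  {"id": 1650682379, "title": "Log prompts and responses to SQLite"}
]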

The --store flag causes the content to be stored in the database along with the embedding vectors.
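
Once the embeddings have been written, you should be able to see the new collection by listing the collections in that database:

llm collections list -d issues.db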

Now we can cluster those embeddings into 10 groups:

llm cluster llm-issues 10 \
  -d issues.db

If you omit the -d option the default embeddings database will be used.
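
If you are not sure where that default database lives, LLM can print its location:

llm collections path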

The output of the llm cluster command should look something like this (truncated):

[
  {
    "id": "2",
    "items": [
      {
        "id": "1650662628",
        "content": "Initial design"
      },
      {
        "id": "1650682379",
        "content": "Log prompts and responses to SQLite"
      }
    ]
  },
  {
    "id": "4",
    "items": [
      {
        "id": "1650760699",
        "content": "llm web command - launches a web server"
      },
      {
        "id": "1759659476",
        "content": "`llm models` command"
      },
      {
        "id": "1784156919",
        "content": "`llm.get_model(alias)` helper"
      }
    ]
  },
  {
    "id": "7",
    "items": [
      {
        "id": "1650765575",
        "content": "--code mode for outputting code"
      },
      {
        "id": "1659086298",
        "content": "Accept PROMPT from --stdin"
      },
      {
        "id": "1714651657",
        "content": "Accept input from standard in"
      }
    ]
  }
]
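
Because the output is plain JSON, it can be post-processed with jq. For example, this pipeline (a sketch, reusing the command above) prints each cluster ID alongside the number of items it contains:

llm cluster llm-issues 10 -d issues.db \
  | jq '.[] | {cluster: .id, size: (.items | length)}'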

Generating summaries for each cluster

The --summary flag will cause the plugin to generate a summary for each cluster, by passing the first 100 characters of stored content through a prompt to a Large Language Model.

This feature is still experimental. You should experiment with custom prompts to improve the quality of your summaries.

Since it can result in running a large amount of text through an LLM, it can also be expensive, depending on which model you are using.

This only works against embeddings that have had their associated content stored in the database using the --store flag.

You can use it like this:

llm cluster llm-issues 10 \
  -d issues.db \
  --summary

This uses the default prompt and the default model.
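
To check which model that is, ask LLM for its current default (the same command accepts a model name if you want to change the default):

llm models default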

To use a different model, e.g. GPT-4, pass the --model option:

llm cluster llm-issues 10 \
  -d issues.db \
  --summary \
  --model gpt-4

To use a custom prompt, pass --prompt:

llm cluster llm-issues 10 \
  -d issues.db \
  --summary \
  --model gpt-4 \
  --prompt 'Summarize this in a short line in the style of a bored, angry panda'

A "summary" key will be added to each cluster, containing the generated summary.

Development

To set up this plugin locally, first check out the code. Then create a new virtual environment:

cd llm-cluster
python3 -m venv venv
source venv/bin/activate

Now install the dependencies and test dependencies:

pip install -e '.[test]'

To run the tests:

pytest

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llm-cluster-0.1.tar.gz (9.0 kB, Source)

Built Distribution

llm_cluster-0.1-py3-none-any.whl (8.8 kB, Python 3)

File details

Details for the file llm-cluster-0.1.tar.gz.

File metadata

  • Download URL: llm-cluster-0.1.tar.gz
  • Upload date:
  • Size: 9.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for llm-cluster-0.1.tar.gz

  • SHA256: 0aa34f3c74257c499f379625f9e61fc546125ce6b1ed2b6fcbfa0c827fbf13c3
  • MD5: 90d28b00a45dfd0af647652918dbaf7d
  • BLAKE2b-256: 496a6e4d14f4bf77dc3845af67bb351e4d651a85228481cc31683791ab2c3fee


File details

Details for the file llm_cluster-0.1-py3-none-any.whl.

File metadata

  • Download URL: llm_cluster-0.1-py3-none-any.whl
  • Upload date:
  • Size: 8.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for llm_cluster-0.1-py3-none-any.whl

  • SHA256: 4f5fd7ee8bfc74ea10dcc8f573200746679b6fa8bc521007218e80cd667bfde7
  • MD5: 66e8041f9b2cc30404e12d91d1e770f1
  • BLAKE2b-256: 811a66de1a8ad79faa9f484ed7cd89712aa839d7b164dadb8fb5925937a822d9

