A library for efficient semi-supervised clustering with large language models.
Few Shot Clustering
Setup
You can either install the wheel via pip or install from source.
Install via pip
pip install few-shot-clustering
Install from source
git submodule update --init
pip install -e .
Other dependencies
This repository also requires torch if you use the Keyphrase Clustering method. It is not currently included in the pip installation, so that users can install a Torch build suited to their own machine/GPU, but this code was tested with:
pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 -f https://download.pytorch.org/whl/cu111/torch_stable.html
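Because torch is installed separately, it can help to confirm that it is importable before running the Keyphrase Clustering example. The snippet below is a minimal sketch using only the standard library; `has_module` is a hypothetical helper written for this illustration, not part of few-shot-clustering.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module with this name can be found (hypothetical helper)."""
    return importlib.util.find_spec(name) is not None


if has_module("torch"):
    import torch

    # Report the installed build and whether a CUDA device is visible.
    print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
else:
    print("torch is not installed; install it before using Keyphrase Clustering.")
```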
Dependencies
"scikit-learn",
"matplotlib",
"torch",
"numpy",
"openai",
"sentence_transformers",
"jsonlines",
"ortools",
"tqdm"
Run LLM-based clustering algorithms
LLM Pairwise Constraint Clustering
Here's an example of how to run pairwise constraint clustering using an LLM to generate the constraints, from scratch, on the CLINC dataset. First, write a prompt for generating pairwise constraints:
prompt = """You are tasked with clustering queries for a task-oriented dialog system based on whether they express the same general user intent. To do this, you will be given pairs of user queries and asked if they express the same general user need or intent.
Your task will be considered successful if the queries are clustered into groups that consistently express the same general intent.
Utterance #1: what's the spanish word for pasta
Utterance #2: how would they say butter in zambia
Given this context, do utterance #1 and utterance #2 likely express the same general intent? Yes
Utterance #1: roll those dice once
Utterance #2: can you roll an eight sided die and tell me what it comes up as
Given this context, do utterance #1 and utterance #2 likely express the same general intent? Yes
Utterance #1: how soon milk expires
Utterance #2: can you roll an eight sided die and tell me what it comes up as
Given this context, do utterance #1 and utterance #2 likely express the same general intent? No
Utterance #1: nice seeing you bye
Utterance #2: what was the date of my last car appointment
Given this context, do utterance #1 and utterance #2 likely express the same general intent? No"""
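Each few-shot example in the prompt follows the same three-line pattern, so a new, unanswered pair can be appended in the same shape. The sketch below shows one way such a query could be assembled from the `text_type` and `prompt_suffix` values used in the call further down. `build_pair_query` is a hypothetical helper, and the assumption that the library composes its queries this way is not confirmed by this README.

```python
def build_pair_query(prompt, text_type, prompt_suffix, first, second):
    """Append one unanswered pair to a few-shot prompt (hypothetical helper)."""
    label = text_type.lower()
    question = f"Given this context, do {label} #1 and {label} #2 likely {prompt_suffix}"
    return "\n".join(
        [
            prompt.rstrip(),
            "",
            f"{text_type} #1: {first}",
            f"{text_type} #2: {second}",
            question,
        ]
    )


# Placeholder few-shot text stands in for the full prompt above.
example = build_pair_query(
    "Few-shot examples go here.",
    "Utterance",
    "express the same general intent?",
    "roll those dice once",
    "flip a coin for me",
)
print(example)
```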
Now, use this prompt to call the OpenAI API and create clusters (note that you'll need to set the OPENAI_API_KEY environment variable before doing this step).
from few_shot_clustering.wrappers import LLMPairwiseClustering
from few_shot_clustering.dataloaders import load_clinc
# You can provide an optional file to cache the extracted features,
# since these are a bit expensive to compute. Example:
# cache_path = "/tmp/clinc_feature_cache.pkl"
#
# This is not necessary, as shown below.
cache_path = None
features, labels, documents = load_clinc(cache_path)
prompt_suffix = "express the same general intent?"
text_type = "Utterance"
cluster_assignments, constraints = LLMPairwiseClustering(
    features,
    documents,
    150,  # number of clusters to produce (CLINC has 150 intent classes)
    prompt,
    text_type,
    prompt_suffix,
    max_feedback_given=100,
    pckmeans_w=0.4,
    cache_file="/tmp/clinc_cache_file.json",
    constraint_selection_algorithm="SimilarityFinder",
    kmeans_init="k-means++",
)
from few_shot_clustering.eval_utils import cluster_acc
import numpy as np
print(f"Accuracy: {cluster_acc(np.array(cluster_assignments), np.array(labels))}")
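Cluster ids are arbitrary, so clustering accuracy is commonly defined as the accuracy under the best one-to-one mapping from cluster ids to gold labels. The sketch below illustrates that definition by brute force on a toy input. It assumes `cluster_acc` follows this convention, which this README does not confirm, and `toy_cluster_accuracy` is a hypothetical name. Trying every mapping is only feasible for a handful of clusters; practical implementations typically solve the matching with the Hungarian algorithm instead.

```python
from itertools import permutations


def toy_cluster_accuracy(predicted, gold):
    """Accuracy under the best one-to-one mapping of cluster ids to gold labels."""
    if len(predicted) != len(gold) or len(predicted) == 0:
        raise ValueError("predicted and gold must be non-empty and equally long")
    cluster_ids = list(dict.fromkeys(predicted))  # unique ids, order preserved
    gold_ids = list(dict.fromkeys(gold))
    # Pad with unique placeholders so every cluster can be mapped
    # even when there are more clusters than gold labels.
    gold_ids += [object() for _ in range(len(cluster_ids) - len(gold_ids))]
    best = 0
    for assignment in permutations(gold_ids, len(cluster_ids)):
        mapping = dict(zip(cluster_ids, assignment))
        correct = sum(mapping[p] == g for p, g in zip(predicted, gold))
        best = max(best, correct)
    return best / len(predicted)


# The clusters match the gold labels exactly, only the ids are swapped.
print(toy_cluster_accuracy([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```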
LLM Keyphrase Expansion Clustering
We'll again show a from-scratch example on CLINC. First, write a prompt for generating keyphrases for each datapoint:
prompt = """I am trying to cluster task-oriented dialog system queries based on whether they express the same general user intent. To help me with this, for a given user query, provide a comprehensive set of keyphrases that could describe this query's intent. These keyphrases should be distinct from those that might describe queries with different intents. Generate the set of keyphrases as a JSON-formatted list.
Query: "how would you say fly in italian"
Keyphrases: ["translation", "translate"]
Query: "what does assiduous mean"
Keyphrases: ["definition", "define"]
Query: "find my cellphone for me!"
Keyphrases: ["location", "find", "locate", "tracking", "track"]"""
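Since the prompt asks for a JSON-formatted list, the model's reply has to be turned back into Python strings before it can be embedded. The sketch below shows one defensive way to do that. `parse_keyphrases` is a hypothetical helper for illustration only; how the library itself parses responses is not described here.

```python
import json


def parse_keyphrases(response_text):
    """Extract a list of keyphrase strings from a model reply (hypothetical helper)."""
    # Tolerate extra text around the list, such as a leading "Keyphrases:".
    start = response_text.find("[")
    end = response_text.rfind("]")
    if start == -1 or end < start:
        return []
    try:
        parsed = json.loads(response_text[start : end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    # Keep only non-empty strings and trim surrounding whitespace.
    return [item.strip() for item in parsed if isinstance(item, str) and item.strip()]


print(parse_keyphrases('Keyphrases: ["translation", "translate"]'))
```

Returning an empty list on malformed output lets the caller decide whether to retry the request or fall back to the original text.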
Now we can call the OpenAI API (after setting the OPENAI_API_KEY environment variable) to generate keyphrases and create clusters:
from few_shot_clustering.wrappers import LLMKeyphraseClustering
from InstructorEmbedding import INSTRUCTOR
from few_shot_clustering.dataloaders import load_clinc
# You can provide an optional file to cache the extracted features,
# since these are a bit expensive to compute. Example:
# cache_path = "/tmp/clinc_feature_cache.pkl"
#
# This is not necessary, as shown below.
cache_path = None
features, labels, documents = load_clinc(cache_path)
prompt_suffix = "express the same general intent?"
text_type = "Query"
encoder_model = INSTRUCTOR('hkunlp/instructor-large')
cluster_assignments = LLMKeyphraseClustering(
    features,
    documents,
    150,
    prompt,
    text_type,
    encoder_model=encoder_model,
    prompt_for_encoder="Represent keyphrases for topic classification:",
    cache_file="/tmp/clinc_expansion_cache_file.json",
)
from few_shot_clustering.eval_utils import cluster_acc
import numpy as np
print(f"Accuracy: {cluster_acc(np.array(cluster_assignments), np.array(labels))}")
Citation
Found this useful? Please cite
@misc{prompt2model,
title={Prompt2Model: Generating Deployable Models from Natural Language Instructions},
author={Vijay Viswanathan and Chenyang Zhao and Amanda Bertsch and Tongshuang Wu and Graham Neubig},
booktitle = {Conference on Empirical Methods in Natural Language Processing (EMNLP) Demo Track},
address = {Singapore},
month = {November},
url = {https://arxiv.org/abs/2308.12261},
year = {2023}
}