ChatIntents automatically clusters and labels short text intent messages.
chat-intents
ChatIntents provides a method for automatically clustering and applying descriptive group labels to short text documents containing dialogue intents. It uses UMAP to perform dimensionality reduction on user-supplied document embeddings and HDBSCAN to perform the clustering. Hyperparameters are automatically tuned by running a Bayesian search (using hyperopt) on a constrained optimization of an objective function, within user-supplied bounds.
See associated Medium post for additional description and motivation.
Installation
Installation can be done using PyPI:
pip install chatintents
Note: Depending on your system setup and environment, you may encounter an error during the pip install of HDBSCAN (failure to build the hdbscan wheel). This is a known issue with HDBSCAN and has several possible solutions. If you are already using a conda virtual environment, an easy solution is to conda install HDBSCAN before installing the chatintents package using:
conda install -c conda-forge hdbscan
Sentence embeddings
The chatintents package doesn't include or specify how to create the sentence embeddings of the documents. Two popular pre-trained embedding models, as shown in the tutorial notebook, are the Universal Sentence Encoder (USE) and Sentence Transformers.
Sentence Transformers can be installed by:
pip install -U sentence-transformers
Universal Sentence Encoder requires installing:
pip install tensorflow
pip install --upgrade tensorflow-hub
Quick Start
The example below uses a Sentence Transformer model to embed the messages and create a model instance:

import chatintents
from chatintents import ChatIntents

from sentence_transformers import SentenceTransformer

# docs is a DataFrame with one intent message per row in its 'text' column
all_intents = list(docs['text'])

st_model = SentenceTransformer('all-mpnet-base-v2')
embeddings = st_model.encode(all_intents)

model = ChatIntents(embeddings, 'st1')
Creating a ChatIntents instance requires inputs of an embedding representation of all documents and a short-text string description of the model (no spaces).
Generating clusters
Methods are provided for generating clusters from user-supplied hyperparameters, from a random search, and from a Bayesian search.
User-supplied hyperparameters and manual scoring
clusters = model.generate_clusters(n_neighbors=15,
                                   n_components=5,
                                   min_cluster_size=5,
                                   min_samples=None,
                                   random_state=42)
labels, cost = model.score_clusters(clusters)
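The README doesn't spell out how the cost is computed. One plausible scheme for scoring HDBSCAN output (in line with the approach described in the associated Medium post) is to penalize weakly assigned points: the fraction of documents whose cluster-membership probability falls below a threshold. The function below is a hedged sketch of that idea, not the package's actual implementation; the `probabilities` array would come from an HDBSCAN clusterer's `probabilities_` attribute, where noise points get probability 0:

```python
import numpy as np

def score_clusters_sketch(probabilities, prob_threshold=0.05):
    """Fraction of documents with low cluster-membership probability.

    A sketch of one plausible clustering cost: lower is better,
    since fewer points are noise or weakly assigned.
    """
    return float(np.mean(probabilities < prob_threshold))

probs = np.array([0.9, 0.8, 0.0, 0.6, 0.02])  # toy membership probabilities
cost = score_clusters_sketch(probs)
print(cost)  # 0.4 -> two of the five points are weakly assigned
```

A cost like this lets the search trade off cluster granularity against how confidently points are assigned.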
Random search
To run 100 evaluations of randomly-selected hyperparameter values within user-supplied ranges:
space = {
    "n_neighbors": range(12, 16),
    "n_components": range(3, 7),
    "min_cluster_size": range(2, 15),
    "min_samples": range(2, 15)
}
df_random = model.random_search(space, 100)
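What a random search over this space does can be illustrated in plain Python. This is only a sketch of the sampling step; the real `random_search` additionally fits UMAP and HDBSCAN for each trial and scores the result:

```python
import random

# Same search space as above
space = {
    "n_neighbors": range(12, 16),
    "n_components": range(3, 7),
    "min_cluster_size": range(2, 15),
    "min_samples": range(2, 15),
}

random.seed(0)  # for a reproducible illustration

# Draw 100 independent hyperparameter combinations uniformly at random
trials = [
    {name: random.choice(list(values)) for name, values in space.items()}
    for _ in range(100)
]

print(len(trials))  # 100 candidate configurations to evaluate
```

Each sampled combination would then be evaluated and collected into the `df_random` results dataframe.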
Bayesian search
Perform a Bayesian search of the hyperparameter space using hyperopt and user-supplied upper and lower bounds for the number of expected clusters:
from hyperopt import hp

hspace = {
    "n_neighbors": hp.choice('n_neighbors', range(3, 16)),
    "n_components": hp.choice('n_components', range(3, 16)),
    "min_cluster_size": hp.choice('min_cluster_size', range(2, 16)),
    "min_samples": None,
    "random_state": 42
}
label_lower = 30
label_upper = 100
max_evals = 100
model.bayesian_search(space=hspace,
                      label_lower=label_lower,
                      label_upper=label_upper,
                      max_evals=max_evals)
Running the bayesian_search method on a model instance saves the best parameters and best clusters to that instance as variables. For example:
>>> model.best_params
{'min_cluster_size': 5,
 'min_samples': None,
 'n_components': 11,
 'n_neighbors': 3,
 'random_state': 42}
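Since the best clusters are saved on the instance, their composition can be inspected directly. The numpy sketch below shows the idea with a toy label array standing in for the winning clusterer's labels (with chatintents that would be something like `model.best_clusters.labels_`; the exact attribute name is an assumption based on the text above):

```python
import numpy as np

# Toy stand-in for the cluster labels of the best clusterer found
labels = np.array([0, 0, 2, 1, 1, 1, -1, 2, 0, -1])

# HDBSCAN marks noise as -1, so exclude it before counting
clusters, counts = np.unique(labels[labels >= 0], return_counts=True)

print(dict(zip(clusters.tolist(), counts.tolist())))  # {0: 3, 1: 3, 2: 2}
```

A quick count like this is a useful sanity check that the number of clusters actually landed between label_lower and label_upper.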
Applying labels to best clusters from Bayesian search
After running the bayesian_search method to identify the best clusters for a given embedding model, descriptive labels can then be applied with:
df_summary, labeled_docs = model.apply_and_summarize_labels(docs[['text']])
This yields two results: the df_summary dataframe, summarizing the count and descriptive label of each group, and the labeled_docs dataframe, listing each document in the dataset with its associated cluster number and descriptive label.
Evaluating performance if ground truth is known
Two methods are also supplied for evaluating and comparing the performance of different models if the ground truth labels happen to be known:
models = [model_use, model_st1, model_st2, model_st3]
df_comparison, labeled_docs_all_models = chatintents.evaluate_models(
    docs[['text', 'category']],
    models)
The resulting df_comparison dataframe compares performance across the supplied models.
Tutorial
See this tutorial notebook for an example of using the chatintents package to compare four different models on a dataset.