ChatIntents automatically clusters and labels short text intent messages. The spaCy model has been changed from English to French.
Project description
chat-intents
ChatIntents provides a method for automatically clustering and applying descriptive group labels to short text documents containing dialogue intents. It uses UMAP to reduce the dimensionality of user-supplied document embeddings and HDBSCAN to cluster the reduced embeddings. Hyperparameters are tuned automatically with a Bayesian search (using hyperopt) that performs a constrained optimization of an objective function within user-supplied bounds.
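For intuition, the pipeline is conceptually similar to chaining the umap-learn and hdbscan packages directly, as in the rough sketch below (this is not the package's internal code; the hyperparameter values and random embeddings are illustrative only):

import numpy as np
import umap
import hdbscan

# Stand-in for real sentence embeddings (n_documents x embedding_dim)
embeddings = np.random.rand(500, 384)

# Reduce dimensionality with UMAP (illustrative hyperparameter values)
reduced = umap.UMAP(n_neighbors=15,
                    n_components=5,
                    metric='cosine',
                    random_state=42).fit_transform(embeddings)

# Cluster the reduced embeddings with HDBSCAN
clusterer = hdbscan.HDBSCAN(min_cluster_size=5,
                            metric='euclidean',
                            cluster_selection_method='eom').fit(reduced)

print(clusterer.labels_)  # cluster id per document; -1 marks noise points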
See the associated Medium post for additional description and motivation.
Installation
Installation can be done using PyPI:
pip install chatintents
Note: Depending on your system setup and environment, you may encounter an error during the pip install of HDBSCAN (failure to build the hdbscan wheel). This is a known issue with HDBSCAN and has several possible solutions. If you are already using a conda virtual environment, an easy fix is to install HDBSCAN with conda before installing the chatintents package:
conda install -c conda-forge hdbscan
Sentence embeddings
The chatintents package doesn't create the document sentence embeddings itself, nor does it require a particular embedding model; the embeddings are supplied by the user. Two popular pre-trained embedding models, shown in the tutorial notebook, are the Universal Sentence Encoder (USE) and Sentence Transformers.
Sentence Transformers can be installed with:
pip install -U sentence-transformers
Universal Sentence Encoder requires installing both tensorflow and tensorflow-hub:
pip install tensorflow
pip install --upgrade tensorflow-hub
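As a rough sketch (following the standard TensorFlow Hub usage for USE; a docs dataframe with a 'text' column is assumed, matching the Quick Start below), USE embeddings could then be generated with:

import tensorflow_hub as hub

# Load the Universal Sentence Encoder from TensorFlow Hub
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

all_intents = list(docs['text'])          # docs: dataframe with a 'text' column
embeddings = embed(all_intents).numpy()   # array of shape (n_documents, 512)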
Quick Start
The example below uses a Sentence Transformer model to embed the messages and create a ChatIntents model instance:

import chatintents
from chatintents import ChatIntents

from sentence_transformers import SentenceTransformer

all_intents = list(docs['text'])  # docs: dataframe with a 'text' column of intent messages

st_model = SentenceTransformer('all-mpnet-base-v2')
embeddings = st_model.encode(all_intents)

model = ChatIntents(embeddings, 'st1')
Creating a ChatIntents instance requires inputs of an embedding representation of all documents and a short-text string description of the model (no spaces).
Generating clusters
Methods are provided for generating clusters from user-supplied hyperparameters, from a random search, and from a Bayesian search.
User-supplied hyperparameters and manual scoring
clusters = model.generate_clusters(n_neighbors=15,
                                   n_components=5,
                                   min_cluster_size=5,
                                   min_samples=None,
                                   random_state=42)

labels, cost = model.score_clusters(clusters)
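The returned clusters object wraps an HDBSCAN clusterer (an assumption based on the library the package builds on), so the per-document assignments can be inspected through its labels_ attribute, for example:

import numpy as np
import pandas as pd

# Assumes clusters exposes HDBSCAN's labels_ attribute: one integer label per
# document, with -1 for documents treated as noise.
cluster_labels = clusters.labels_
n_clusters = len(np.unique(cluster_labels[cluster_labels != -1]))
print(f"{n_clusters} clusters found")
print(pd.Series(cluster_labels).value_counts().head())  # largest clusters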
Random search
To run 100 evaluations of randomly-selected hyperparameter values within user-supplied ranges:
space = {
    "n_neighbors": range(12, 16),
    "n_components": range(3, 7),
    "min_cluster_size": range(2, 15),
    "min_samples": range(2, 15)
}
df_random = model.random_search(space, 100)
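Assuming the returned dataframe holds one row per evaluation with the sampled hyperparameters and the resulting cost (the column name 'cost' here is an assumption; check df_random.columns for the actual names), the best runs can be pulled out with:

# 'cost' is an assumed column name; adjust to match the actual output.
best_runs = df_random.sort_values(by='cost').head(10)
print(best_runs)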
Bayesian search
Perform a Bayesian search of the hyperparameter space using hyperopt and user-supplied upper and lower bounds for the number of expected clusters:
from hyperopt import hp

hspace = {
    "n_neighbors": hp.choice('n_neighbors', range(3, 16)),
    "n_components": hp.choice('n_components', range(3, 16)),
    "min_cluster_size": hp.choice('min_cluster_size', range(2, 16)),
    "min_samples": None,
    "random_state": 42
}

label_lower = 30
label_upper = 100
max_evals = 100

model.bayesian_search(space=hspace,
                      label_lower=label_lower,
                      label_upper=label_upper,
                      max_evals=max_evals)
Running the bayesian_search
method on a model instance saves the best parameters and best clusters to that instance as variables. For example:
>>> model.best_params
{'min_cluster_size': 5,
 'min_samples': None,
 'n_components': 11,
 'n_neighbors': 3,
 'random_state': 42}
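The best clustering itself should also be available on the instance (the attribute name best_clusters is inferred from the description above), so the resulting groups can be inspected in the same way as a manually generated clustering:

import pandas as pd

# best_clusters is inferred from the description above; it is assumed to expose
# HDBSCAN's labels_ attribute like the result of generate_clusters.
print(pd.Series(model.best_clusters.labels_).value_counts().head())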
Applying labels to best clusters from Bayesian search
After running the bayesian_search
method to identify the best clusters for a given embedding model, descriptive labels can then be applied with:
df_summary, labeled_docs = model.apply_and_summarize_labels(docs[['text']])
This yields two results: the df_summary dataframe, which summarizes the count and descriptive label of each group, and the labeled_docs dataframe, which lists each document in the dataset along with its associated cluster number and descriptive label.
Evaluating performance if ground truth is known
Two methods are also supplied for evaluating and comparing the performance of different models if the ground truth labels happen to be known:
models = [model_use, model_st1, model_st2, model_st3]
df_comparison, labeled_docs_all_models = chatintents.evaluate_models(docs[['text', 'category']],
                                                                     models)
The returned df_comparison dataframe summarizes and compares the performance of each model.
Tutorial
See this tutorial notebook for an example of using the chatintents
package for comparing four different models on a dataset.