ChatIntents automatically clusters and labels short text intent messages. The spaCy model has been changed from English to French.
Project description
chat-intents
ChatIntents provides a method for automatically clustering and applying descriptive group labels to short text documents containing dialogue intents. It uses UMAP to reduce the dimensionality of user-supplied document embeddings and HDBSCAN to cluster the reduced embeddings. Hyperparameters are tuned automatically with a Bayesian search (using hyperopt) that performs a constrained optimization of an objective function within user-supplied bounds.
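For intuition, the pipeline is conceptually similar to chaining the umap-learn and hdbscan packages directly, as in the rough sketch below (this is not the package's internal code; the hyperparameter values and random embeddings are illustrative only):

import numpy as np
import umap
import hdbscan

# Stand-in for real sentence embeddings (n_documents x embedding_dim)
embeddings = np.random.rand(500, 384)

# Reduce dimensionality with UMAP (illustrative hyperparameter values)
reduced = umap.UMAP(n_neighbors=15,
                    n_components=5,
                    metric='cosine',
                    random_state=42).fit_transform(embeddings)

# Cluster the reduced embeddings with HDBSCAN
clusterer = hdbscan.HDBSCAN(min_cluster_size=5,
                            metric='euclidean',
                            cluster_selection_method='eom').fit(reduced)

print(clusterer.labels_)  # cluster id per document; -1 marks noise points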
See the associated Medium post for additional description and motivation.
Installation
Installation can be done using PyPI:
pip install chatintents
Note: Depending on your system setup and environment, you may encounter an error during the pip install of HDBSCAN (failure to build the hdbscan wheel). This is a known issue with HDBSCAN and has several possible solutions. If you are already using a conda virtual environment, an easy fix is to install HDBSCAN with conda before installing the chatintents package:
conda install -c conda-forge hdbscan
Sentence embeddings
The chatintents package doesn't create the document sentence embeddings itself, nor does it require a particular embedding model; the embeddings are supplied by the user. Two popular pre-trained embedding models, shown in the tutorial notebook, are the Universal Sentence Encoder (USE) and Sentence Transformers.
Sentence Transformers can be installed with:
pip install -U sentence-transformers
Universal Sentence Encoder requires installing both tensorflow and tensorflow-hub:
pip install tensorflow
pip install --upgrade tensorflow-hub
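As a rough sketch (following the standard TensorFlow Hub usage for USE; a docs dataframe with a 'text' column is assumed, matching the Quick Start below), USE embeddings could then be generated with:

import tensorflow_hub as hub

# Load the Universal Sentence Encoder from TensorFlow Hub
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

all_intents = list(docs['text'])          # docs: dataframe with a 'text' column
embeddings = embed(all_intents).numpy()   # array of shape (n_documents, 512)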
Quick Start
The example below uses a Sentence Transformer model to embed the messages and create a ChatIntents model instance:

import chatintents
from chatintents import ChatIntents

from sentence_transformers import SentenceTransformer

all_intents = list(docs['text'])  # docs: dataframe with a 'text' column of intent messages

st_model = SentenceTransformer('all-mpnet-base-v2')
embeddings = st_model.encode(all_intents)

model = ChatIntents(embeddings, 'st1')
Creating a ChatIntents instance requires inputs of an embedding representation of all documents and a short-text string description of the model (no spaces).
Generating clusters
Methods are provided for generating clusters from user-supplied hyperparameters, from a random search, and from a Bayesian search.
User-supplied hyperparameters and manual scoring
clusters = model.generate_clusters(n_neighbors=15,
                                   n_components=5,
                                   min_cluster_size=5,
                                   min_samples=None,
                                   random_state=42)

labels, cost = model.score_clusters(clusters)
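The returned clusters object wraps an HDBSCAN clusterer (an assumption based on the library the package builds on), so the per-document assignments can be inspected through its labels_ attribute, for example:

import numpy as np
import pandas as pd

# Assumes clusters exposes HDBSCAN's labels_ attribute: one integer label per
# document, with -1 for documents treated as noise.
cluster_labels = clusters.labels_
n_clusters = len(np.unique(cluster_labels[cluster_labels != -1]))
print(f"{n_clusters} clusters found")
print(pd.Series(cluster_labels).value_counts().head())  # largest clusters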
Random search
To run 100 evaluations of randomly-selected hyperparameter values within user-supplied ranges:
space = {
    "n_neighbors": range(12, 16),
    "n_components": range(3, 7),
    "min_cluster_size": range(2, 15),
    "min_samples": range(2, 15)
}
df_random = model.random_search(space, 100)
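Assuming the returned dataframe holds one row per evaluation with the sampled hyperparameters and the resulting cost (the column name 'cost' here is an assumption; check df_random.columns for the actual names), the best runs can be pulled out with:

# 'cost' is an assumed column name; adjust to match the actual output.
best_runs = df_random.sort_values(by='cost').head(10)
print(best_runs)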
Bayesian search
Perform a Bayesian search of the hyperparameter space using hyperopt and user-supplied upper and lower bounds for the number of expected clusters:
from hyperopt import hp

hspace = {
    "n_neighbors": hp.choice('n_neighbors', range(3, 16)),
    "n_components": hp.choice('n_components', range(3, 16)),
    "min_cluster_size": hp.choice('min_cluster_size', range(2, 16)),
    "min_samples": None,
    "random_state": 42
}

label_lower = 30
label_upper = 100
max_evals = 100

model.bayesian_search(space=hspace,
                      label_lower=label_lower,
                      label_upper=label_upper,
                      max_evals=max_evals)
Running the bayesian_search
method on a model instance saves the best parameters and best clusters to that instance as variables. For example:
>>> model.best_params
{'min_cluster_size': 5,
 'min_samples': None,
 'n_components': 11,
 'n_neighbors': 3,
 'random_state': 42}
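The best clustering itself should also be available on the instance (the attribute name best_clusters is inferred from the description above), so the resulting groups can be inspected in the same way as a manually generated clustering:

import pandas as pd

# best_clusters is inferred from the description above; it is assumed to expose
# HDBSCAN's labels_ attribute like the result of generate_clusters.
print(pd.Series(model.best_clusters.labels_).value_counts().head())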
Applying labels to best clusters from Bayesian search
After running the bayesian_search
method to identify the best clusters for a given embedding model, descriptive labels can then be applied with:
df_summary, labeled_docs = model.apply_and_summarize_labels(docs[['text']])
This yields two results: the df_summary dataframe, which summarizes the count and descriptive label of each group, and the labeled_docs dataframe, which lists each document in the dataset along with its associated cluster number and descriptive label.
Evaluating performance if ground truth is known
Two methods are also supplied for evaluating and comparing the performance of different models if the ground truth labels happen to be known:
models = [model_use, model_st1, model_st2, model_st3]
df_comparison, labeled_docs_all_models = chatintents.evaluate_models(docs[['text', 'category']],
                                                                     models)
The returned df_comparison dataframe summarizes and compares the performance of each model.
Tutorial
See this tutorial notebook for an example of using the chatintents
package for comparing four different models on a dataset.