
ChatIntents automatically clusters and labels short text intent messages.


chat-intents

ChatIntents provides a method for automatically clustering short text documents containing dialogue intents and applying descriptive group labels to the clusters. It uses UMAP to perform dimensionality reduction on user-supplied document embeddings and HDBSCAN to perform the clustering. Hyperparameters are tuned automatically through a Bayesian search (using hyperopt) over a constrained objective function with user-supplied bounds.

See the associated Medium post for additional description and motivation.

Installation

Install the package from PyPI:

pip install chatintents

Note: Depending on your system setup and environment, you may encounter an error during the pip install of HDBSCAN (failure to build the hdbscan wheel). This is a known issue with HDBSCAN and has several possible solutions. If you are already using a conda virtual environment, an easy workaround is to conda install HDBSCAN before installing the chatintents package:

conda install -c conda-forge hdbscan

Sentence embeddings

The chatintents package doesn't include an embedding model or prescribe how to create sentence embeddings for your documents. Two popular pre-trained embedding models, both shown in the tutorial notebook, are the Universal Sentence Encoder (USE) and Sentence Transformers.

Sentence Transformers can be installed with:

pip install -U sentence-transformers

Universal Sentence Encoder requires installing TensorFlow and TensorFlow Hub:

pip install tensorflow
pip install --upgrade tensorflow-hub

Quick Start

The example below uses a Sentence Transformer model to embed the messages and then creates a ChatIntents instance:

import chatintents
from chatintents import ChatIntents

from sentence_transformers import SentenceTransformer

# docs is assumed to be a DataFrame with a 'text' column of intent messages
all_intents = list(docs['text'])

embedding_model = SentenceTransformer('all-mpnet-base-v2')
embeddings = embedding_model.encode(all_intents)

model = ChatIntents(embeddings, 'st1')

Creating a ChatIntents instance requires an embedding representation of all documents and a short string (no spaces) describing the model.

Generating clusters

Methods are provided for generating clusters from user-supplied hyperparameters, from a random search, or from a Bayesian search.

User-supplied hyperparameters and manual scoring

clusters = model.generate_clusters(n_neighbors = 15, 
                                   n_components = 5, 
                                   min_cluster_size = 5, 
                                   min_samples = None,
                                   random_state=42)

labels, cost = model.score_clusters(clusters)
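The idea behind scoring can be sketched in plain Python. The scheme below is an illustration of one plausible approach, not the package's exact implementation: treat points whose HDBSCAN membership probability falls below a threshold as poorly clustered, and report the fraction of such points as the cost.

```python
def score_clusters_sketch(labels, probabilities, prob_threshold=0.05):
    """Illustrative cost: fraction of points with weak cluster membership.

    labels:        cluster label per document (-1 = noise in HDBSCAN)
    probabilities: HDBSCAN membership probability per document
    """
    label_count = len(set(labels))
    total = len(labels)
    # Points with low membership probability count as poorly clustered
    cost = sum(1 for p in probabilities if p < prob_threshold) / total
    return label_count, cost

# Toy example: 6 documents, two clusters plus noise, one weakly assigned point
labels = [0, 0, 1, 1, 1, -1]
probs = [0.9, 0.8, 0.95, 0.7, 0.85, 0.01]
n_labels, cost = score_clusters_sketch(labels, probs)
```

A lower cost means more documents sit confidently inside a cluster; the label count is returned alongside it so the caller can also check that the number of groups is sensible.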

Random search

To run 100 evaluations of randomly-selected hyperparameter values within user-supplied ranges:

space = {
        "n_neighbors": range(12,16),
        "n_components": range(3,7),
        "min_cluster_size": range(2,15),
        "min_samples": range(2,15)
    }

df_random = model.random_search(space, 100)
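The mechanics of a random search are simple enough to sketch directly: draw hyperparameter combinations uniformly from the supplied ranges and score each one. The helper and toy objective below are illustrative, not the package's internals (the real objective would run UMAP + HDBSCAN on the embeddings).

```python
import random

def random_search_sketch(space, num_evals, evaluate, seed=42):
    """Sample hyperparameter combinations uniformly from `space`, score each.

    space:    dict mapping parameter name -> sequence of candidate values
    evaluate: callable taking a params dict and returning a numeric cost
    """
    rng = random.Random(seed)
    results = []
    for _ in range(num_evals):
        params = {name: rng.choice(list(values))
                  for name, values in space.items()}
        results.append((params, evaluate(params)))
    # Keep the combination with the lowest cost
    return min(results, key=lambda r: r[1])

space = {
    "n_neighbors": range(12, 16),
    "n_components": range(3, 7),
    "min_cluster_size": range(2, 15),
    "min_samples": range(2, 15),
}

# Stand-in objective for illustration only
def toy_cost(params):
    return abs(params["n_neighbors"] - 14) + abs(params["n_components"] - 5)

best_params, best_cost = random_search_sketch(space, 100, toy_cost)
```

Random search is a cheap baseline: with 100 draws over these small ranges it almost always lands near the optimum, but it wastes evaluations compared with the Bayesian search described next.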

Bayesian search

Perform a Bayesian search of the hyperparameter space using hyperopt and user-supplied upper and lower bounds for the number of expected clusters:

from hyperopt import hp

hspace = {
    "n_neighbors": hp.choice('n_neighbors', range(3,16)),
    "n_components": hp.choice('n_components', range(3,16)),
    "min_cluster_size": hp.choice('min_cluster_size', range(2,16)),
    "min_samples": None,
    "random_state": 42
}

label_lower = 30
label_upper = 100
max_evals = 100

model.bayesian_search(space=hspace,
                      label_lower=label_lower, 
                      label_upper=label_upper, 
                      max_evals=max_evals)
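The role of label_lower and label_upper can be illustrated in plain Python: the search minimizes the raw clustering cost plus a fixed penalty whenever the number of clusters falls outside the user-supplied bounds, steering it toward solutions with a plausible cluster count. The penalty value and structure here are assumptions for illustration, not the package's exact code.

```python
def constrained_objective(label_count, cost, label_lower, label_upper,
                          penalty=0.15):
    """Add a fixed penalty when the cluster count is out of bounds."""
    if label_lower <= label_count <= label_upper:
        return cost
    return cost + penalty

# Within bounds: the objective equals the raw clustering cost
in_bounds = constrained_objective(50, 0.10, label_lower=30, label_upper=100)
# Too few clusters: the penalty is added on top of the cost
out_of_bounds = constrained_objective(10, 0.10, label_lower=30, label_upper=100)
```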

Running the bayesian_search method on a model instance saves the best parameters and best clusters to that instance as attributes. For example:

>>> model.best_params

{'min_cluster_size': 5,
 'min_samples': None,
 'n_components': 11,
 'n_neighbors': 3,
 'random_state': 42}

Applying labels to best clusters from Bayesian search

After running the bayesian_search method to identify the best clusters for a given embedding model, descriptive labels can then be applied with:

df_summary, labeled_docs = model.apply_and_summarize_labels(docs[['text']])
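One simple way to derive a descriptive label for a cluster, shown here purely as an illustration rather than the package's actual labeling method, is to join the most frequent non-stopword tokens across the cluster's documents:

```python
from collections import Counter

def describe_cluster(docs, top_n=2,
                     stopwords=frozenset({"the", "a", "my", "i", "to"})):
    """Label a cluster with its most common non-stopword tokens (illustrative)."""
    words = [
        word
        for doc in docs
        for word in doc.lower().split()
        if word not in stopwords
    ]
    return "_".join(word for word, _ in Counter(words).most_common(top_n))

# Toy cluster of short intent messages
cluster_docs = [
    "check my card balance",
    "what is my card balance",
    "card balance please",
]
label = describe_cluster(cluster_docs)
```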

This yields two results: the df_summary dataframe, which summarizes the count and descriptive label of each group, and the labeled_docs dataframe, which lists each document in the dataset along with its associated cluster number and descriptive label.

Evaluating performance if ground truth is known

Two methods are also supplied for evaluating and comparing the performance of different models if the ground truth labels happen to be known:

models = [model_use, model_st1, model_st2, model_st3]

df_comparison, labeled_docs_all_models = chatintents.evaluate_models(docs[['text', 
                                                                           'category']],
                                                                           models)
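When ground-truth categories are available, one intuitive score (used here only as an illustration; the package's comparison may use different metrics) is cluster purity: assign each cluster its majority ground-truth category and measure the fraction of documents that match it.

```python
from collections import Counter, defaultdict

def cluster_purity(cluster_labels, true_labels):
    """Fraction of documents matching their cluster's majority true category."""
    clusters = defaultdict(list)
    for cluster, truth in zip(cluster_labels, true_labels):
        clusters[cluster].append(truth)
    # Each cluster contributes the count of its most common true category
    majority_total = sum(Counter(members).most_common(1)[0][1]
                        for members in clusters.values())
    return majority_total / len(true_labels)

# Toy example: two predicted clusters, one mislabeled document
pred = [0, 0, 0, 1, 1, 1]
truth = ["card", "card", "loan", "loan", "loan", "loan"]
purity = cluster_purity(pred, truth)
```

Purity alone rewards over-splitting (many tiny pure clusters), so it is best read alongside the cluster count or a chance-adjusted metric.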

The resulting df_comparison dataframe compares the performance of the models side by side.

Tutorial

See this tutorial notebook for an example of using the chatintents package for comparing four different models on a dataset.

