HDBSCAN Tuning for BERTopic Models

TopicTuner — Tune BERTopic HDBSCAN Models

To install from PyPI:

pip install topicmodeltuner

The Problem

Out of the box, BERTopic relies on HDBSCAN to cluster topics. Two of the most important HDBSCAN parameters, min_cluster_size and min_samples, almost always have a dramatic effect on cluster formation. They dictate the number of clusters created, including the -1 (uncategorized) cluster. While for some datasets a large number of uncategorized documents may be the right clustering, in practice BERTopic will essentially discard a large percentage of "good" documents and not use them for cluster or topic formation.
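To make the effect measurable, the cluster count and the share of uncategorized documents can be read directly from the per-document cluster labels. The following is a minimal sketch, assuming the labels are available as a plain list of integers in which -1 marks an uncategorized document; the function name summarize_labels is hypothetical and not part of TopicTuner or BERTopic.

```python
from collections import Counter


def summarize_labels(labels):
    """Summarize one clustering run from its per-document cluster labels.

    Assumption: -1 marks an uncategorized document, as in HDBSCAN output.
    """
    counts = Counter(labels)
    n_docs = len(labels)
    n_uncategorized = counts.get(-1, 0)
    # Every distinct label other than -1 is a real cluster.
    n_clusters = sum(1 for label in counts if label != -1)
    return {
        "n_clusters": n_clusters,
        "n_uncategorized": n_uncategorized,
        "pct_uncategorized": n_uncategorized / n_docs if n_docs else 0.0,
    }


# Illustrative labels only, not real model output.
example_labels = [0, 0, 1, -1, 1, -1, 2, 0]
print(summarize_labels(example_labels))
```

Comparing these numbers across parameter settings shows how many documents each setting pushes into the -1 cluster.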

HDBSCAN is quite sensitive to the values of these two parameters relative to the text being clustered. With the BERTopic default of min_topic_size=10 (which is passed to HDBSCAN as min_cluster_size), the default parameters will more often than not produce an unmanageable number of topics as well as a sub-optimal number of uncategorized documents. Additionally, documents assigned to the -1 category are not used to determine topic vocabulary.
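For reference, both parameters can be set explicitly by handing BERTopic a pre-configured HDBSCAN instance through its hdbscan_model argument, rather than relying on min_topic_size alone. The sketch below is illustrative only: the helper names hdbscan_kwargs and build_topic_model are hypothetical, min_samples is HDBSCAN's name for the second parameter discussed above, and the remaining keyword values are an assumption meant to mirror the settings BERTopic uses for its built-in HDBSCAN model.

```python
def hdbscan_kwargs(min_cluster_size, min_samples):
    """Keyword arguments for an HDBSCAN instance with both parameters explicit."""
    if min_cluster_size < 2 or min_samples < 1:
        raise ValueError("min_cluster_size must be >= 2 and min_samples >= 1")
    return {
        "min_cluster_size": min_cluster_size,
        "min_samples": min_samples,
        # Assumed to match BERTopic's own HDBSCAN settings.
        "metric": "euclidean",
        "cluster_selection_method": "eom",
        "prediction_data": True,
    }


def build_topic_model(min_cluster_size, min_samples):
    """Create a BERTopic model that clusters with the given HDBSCAN parameters."""
    # Imported lazily so hdbscan_kwargs works without the heavy dependencies.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN

    hdbscan_model = HDBSCAN(**hdbscan_kwargs(min_cluster_size, min_samples))
    return BERTopic(hdbscan_model=hdbscan_model)
```

Calling build_topic_model(15, 5), for example, would return an unfitted BERTopic model; the values 15 and 5 are placeholders, not recommendations.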

The Solution

TopicTuner provides a TopicModelTuner class, a convenience wrapper for BERTopic models that efficiently manages the process of discovering optimized min_cluster_size and min_samples parameters, providing:

  • Random and grid search functionality to quickly discover optimized parameters for a given BERTopic model.
  • An internal datastore that records all searches for a given model, making parameter selection fast and easy.
  • Visualizations to assist in parameter tuning and selection.
  • Two-way import/export functionality, so that you can start from scratch or from an existing BERTopic model, and export a BERTopic model with optimized parameters at the end of your session.
  • Save and load for persistence.

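The idea behind the search and datastore features can be sketched in plain Python. This is not the TopicModelTuner API (see the API documentation for that); it is a generic illustration in which every name is hypothetical. It assumes a caller-supplied cluster_fn(min_cluster_size, min_samples) that returns one cluster label per document, with -1 for uncategorized documents, and it uses a list of dictionaries as a stand-in for the datastore of past searches.

```python
import itertools
import random


def _run(cluster_fn, min_cluster_size, min_samples):
    """Cluster once and record the parameters together with the outcome."""
    labels = cluster_fn(min_cluster_size, min_samples)
    return {
        "min_cluster_size": min_cluster_size,
        "min_samples": min_samples,
        "n_clusters": len({label for label in labels if label != -1}),
        "n_uncategorized": sum(1 for label in labels if label == -1),
    }


def grid_search(cluster_fn, min_cluster_sizes, min_samples_values):
    """Try every (min_cluster_size, min_samples) combination."""
    pairs = itertools.product(min_cluster_sizes, min_samples_values)
    return [_run(cluster_fn, mcs, ms) for mcs, ms in pairs]


def random_search(cluster_fn, min_cluster_sizes, min_samples_values,
                  n_trials, seed=0):
    """Try n_trials randomly drawn combinations instead of the full grid."""
    rng = random.Random(seed)
    return [
        _run(cluster_fn, rng.choice(min_cluster_sizes),
             rng.choice(min_samples_values))
        for _ in range(n_trials)
    ]


def toy_cluster_fn(min_cluster_size, min_samples):
    """Toy stand-in, NOT HDBSCAN: groups of 12, 6 and 3 documents.

    A group becomes a cluster only if it has at least min_cluster_size
    members; min_samples is accepted but ignored.
    """
    labels, next_label = [], 0
    for size in (12, 6, 3):
        if size >= min_cluster_size:
            labels.extend([next_label] * size)
            next_label += 1
        else:
            labels.extend([-1] * size)
    return labels


records = grid_search(toy_cluster_fn, [3, 5, 10], [1, 2])
for record in records:
    print(record)
```

In real use, cluster_fn would run HDBSCAN on the model's reduced embeddings, and the accumulated records would be compared on cluster count and uncategorized count to choose a parameter pair.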
To get you started, this release includes both a demo notebook and API documentation.


