TopicGPT package

TopicGPT is a package for generating topics from a text corpus.

How to install this package?

pip install wm_topicgpt==0.0.9

How to use this package?

Step 1: Set up global parameters

from topicgpt import config

# For NameFilter
config.azure_key = ""
config.azure_endpoint = ""

# For GPT-3.5, GPT-4, or Ada-002
config.consumer_id = ""
config.private_key_path = ""
config.mso_llm_env = ""
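
To avoid committing credentials to source control, you can read these values from environment variables instead of hardcoding them. A minimal sketch; the environment variable names below are illustrative, not required by the package:

import os
from topicgpt import config

# Read credentials from the environment instead of hardcoding them.
# The variable names here are illustrative; use whatever your deployment defines.
config.azure_key = os.environ.get("AZURE_KEY", "")
config.azure_endpoint = os.environ.get("AZURE_ENDPOINT", "")
config.consumer_id = os.environ.get("CONSUMER_ID", "")
config.private_key_path = os.environ.get("PRIVATE_KEY_PATH", "")
config.mso_llm_env = os.environ.get("MSO_LLM_ENV", "")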

Step 2: Load your dataset

Load your data; it must be a pandas.DataFrame.

import pandas as pd

data_df = pd.read_csv("dataset.csv")
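
If the text column contains missing or blank rows, it is usually worth dropping them before modeling. A minimal pandas sketch; the 'userInput' column name matches the examples below:

# Drop rows whose text is missing or blank before topic modeling.
data_df = data_df.dropna(subset=["userInput"])
data_df = data_df[data_df["userInput"].str.strip().ne("")]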

Step 3: Run the code

We provide two approaches to topic modeling: an HDBSCAN approach and a KMeans approach.

HDBSCAN approach:

# If you are running in a Jupyter notebook, include these two lines so the
# pipeline's async calls can run inside the notebook's event loop.
import nest_asyncio
nest_asyncio.apply()


# Set the parameters for this approach. If you don't need a part, just drop it.
hdbscan_params = {
    # preprocessing part
    'preprocessing': {'words_range': (1, 500)},
    # name filter part
    'name_filter': {},
    # extracting keywords part
    'extract_keywords': {'llm_model': 'gpt-35-turbo', 'temperature': 0., 'batch_size': 300},
    # embedding part (required)
    'embedding': {'model': 'bge', 'batch_size': 500, 'device': 'mps'},
    # hdbscan clustering part (required)
    'hdbscan': {'reduced_dim': 5, 'n_neighbors': 10, 'min_cluster_percent': 0.02, 'topk': 5,
                'llm_model': 'gpt-35-turbo', 'temperature': 0.5, 'verbose': True},
}

from topicgpt.pipeline import topic_modeling_by_hdbscan

# data_df: your pandas.DataFrame dataset
# text_col_name: the name of the text column in data_df
# params: parameters for this approach; if omitted, default parameters are used
root = topic_modeling_by_hdbscan(data=data_df, text_col_name='userInput', params=hdbscan_params)
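
The call returns the root of a topic hierarchy. The exact attributes of the returned object are not documented on this page, so the traversal below is a hypothetical sketch; adjust the attribute names ('name', 'children') to whatever your installed version exposes:

# Hypothetical traversal of the returned topic tree.
# The 'name' and 'children' attribute names are assumptions, not documented API.
def print_tree(node, depth=0):
    print("  " * depth + str(getattr(node, "name", node)))
    for child in getattr(node, "children", []):
        print_tree(child, depth + 1)

print_tree(root)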

KMeans approach:

# If you are running in a Jupyter notebook, include these two lines so the
# pipeline's async calls can run inside the notebook's event loop.
import nest_asyncio
nest_asyncio.apply()


# Set the parameters for this approach. If you don't need a part, just drop it.
kmeans_params = {
    # preprocessing part
    'preprocessing': {'words_range': (1, 500)},
    # name filter part
    'name_filter': {},
    # embedding part
    'embedding': {'model': 'bge', 'batch_size': 500, 'device': 'mps'},
    # kmeans clustering part
    'kmeans': {'n_clusters_list': [100, 30, 5], 'topk': 5, 'llm_model': 'gpt-35-turbo', 'temperature': 0.5,
               'batch_size': 300, 'embed_model': "bge", "device": "mps", "embed_batch_size": 500,
               'ngram_range': (1, 2), 'topk_keywords': 10},
}

from topicgpt.pipeline import topic_modeling_by_kmeans

# data_df: your pandas.DataFrame dataset
# text_col_name: the name of the text column in data_df
# params: parameters for this approach; if omitted, default parameters are used
clusters = topic_modeling_by_kmeans(data=data_df, text_col_name='userInput', params=kmeans_params)
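
For intuition, 'n_clusters_list': [100, 30, 5] builds a three-level topic hierarchy from fine to coarse. The standalone sketch below illustrates the general idea with scikit-learn by clustering each level's centroids at the next, coarser level; it is not the package's actual implementation:

# Standalone illustration of hierarchical KMeans over embeddings
# (NOT the package's implementation; for intuition only).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768))  # stand-in for text embeddings

level_input = embeddings
for n_clusters in [100, 30, 5]:  # fine -> coarse, as in kmeans_params
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(level_input)
    # Cluster the previous level's centroids to get a coarser grouping.
    level_input = km.cluster_centers_
    print(n_clusters, "clusters ->", level_input.shape)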
