This is a package that generates topics for a text corpus.
TopicGPT package
How to install this package?
pip install wm_topicgpt==0.0.8
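To confirm the install succeeded, you can check that the package's `topicgpt` module is importable. This is a generic Python check, not part of the package itself:

```python
import importlib.util

# find_spec returns a module spec if `topicgpt` is importable,
# or None if the package is not installed.
spec = importlib.util.find_spec("topicgpt")
print("installed" if spec is not None else "not installed")
```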
How to use this package?
Step 1: Set up global parameters
from topicgpt import config
# For NameFilter
config.azure_key = ""
config.azure_endpoint = ""
# For GPT-3.5, GPT-4, or Ada-002
config.consumer_id = ""
config.private_key_path = ""
config.mso_llm_env = ""
Step 2: Load your dataset
Load your data; it must be in pandas.DataFrame format.
import pandas as pd
data_df = pd.read_csv("dataset.csv")
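Before running a pipeline, it can help to sanity-check the loaded DataFrame. A minimal sketch, using an inline sample in place of dataset.csv and assuming the text lives in a 'userInput' column (the column name used in the examples below):

```python
import io
import pandas as pd

# Inline stand-in for dataset.csv
csv_text = "userInput\nHow do I reset my password?\nWhere is my order?\n"
data_df = pd.read_csv(io.StringIO(csv_text))

# The pipelines expect a pandas.DataFrame with a non-empty text column
assert isinstance(data_df, pd.DataFrame)
assert "userInput" in data_df.columns
assert data_df["userInput"].notna().all()
print(len(data_df))  # 2
```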
Step 3: Run the code
We provide two approaches to topic modeling: an HDBSCAN approach and a K-means approach.
HDBSCAN approach:
# If you are using a Jupyter notebook, include these two lines.
import nest_asyncio
nest_asyncio.apply()
# Set up the parameters for this approach. If you don't need a part, simply drop it.
hdbscan_params = {
    # preprocessing part
    'preprocessing': {'words_range': (1, 500)},
    # name filter part
    'name_filter': {},
    # extracting keywords part
    'extract_keywords': {'llm_model': 'gpt-35-turbo', 'temperature': 0., 'batch_size': 300},
    # embedding part (must have)
    'embedding': {'model': 'bge', 'batch_size': 500, 'device': 'mps'},
    # hdbscan clustering part (must have)
    'hdbscan': {'reduced_dim': 5, 'n_neighbors': 10, 'min_cluster_percent': 0.02, 'topk': 5,
                'llm_model': 'gpt-35-turbo', 'temperature': 0.5, 'verbose': True},
}
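Since only the 'embedding' and 'hdbscan' parts are marked as required, a pared-down configuration that skips preprocessing, the name filter, and keyword extraction might look like this (same values as above, just with the optional parts dropped):

```python
minimal_hdbscan_params = {
    # embedding part (must have)
    'embedding': {'model': 'bge', 'batch_size': 500, 'device': 'mps'},
    # hdbscan clustering part (must have)
    'hdbscan': {'reduced_dim': 5, 'n_neighbors': 10, 'min_cluster_percent': 0.02,
                'topk': 5, 'llm_model': 'gpt-35-turbo', 'temperature': 0.5,
                'verbose': True},
}
print(sorted(minimal_hdbscan_params))  # ['embedding', 'hdbscan']
```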
from topicgpt.pipeline import topic_modeling_by_hdbscan
# data_df: your pd.DataFrame dataset
# text_col_name: the column name of texts in the data_df
# params: parameters for this approach. If omitted, default parameters are used.
root = topic_modeling_by_hdbscan(data=data_df, text_col_name='userInput', params=hdbscan_params)
K-means approach:
# If you are using a Jupyter notebook, include these two lines.
import nest_asyncio
nest_asyncio.apply()
# Set up the parameters for this approach. If you don't need a part, simply drop it.
kmeans_params = {
    # preprocessing part
    'preprocessing': {'words_range': (1, 500)},
    # name filter part
    'name_filter': {},
    # embedding part
    'embedding': {'model': 'bge', 'batch_size': 500, 'device': 'mps'},
    # kmeans clustering part
    'kmeans': {'n_clusters_list': [100, 30, 5], 'topk': 5, 'llm_model': 'gpt-35-turbo', 'temperature': 0.5,
               'batch_size': 300, 'embed_model': "bge", "device": "mps", "embed_batch_size": 500,
               'ngram_range': (1, 2), 'topk_keywords': 10},
}
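'n_clusters_list': [100, 30, 5] presumably defines a three-level topic hierarchy, from 100 fine-grained clusters at the bottom up to 5 top-level topics (an assumption based on the parameter name, not documented behavior). If you customize it, the counts should shrink from fine to coarse; a quick sanity check:

```python
# Hypothetical helper (not part of topicgpt): validate that cluster counts
# strictly decrease from leaf level to root level.
def check_n_clusters_list(n_clusters_list):
    assert all(a > b for a, b in zip(n_clusters_list, n_clusters_list[1:])), \
        "cluster counts should strictly decrease from leaf to root"
    return len(n_clusters_list)  # number of hierarchy levels

print(check_n_clusters_list([100, 30, 5]))  # 3
```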
from topicgpt.pipeline import topic_modeling_by_kmeans
# data_df: your pd.DataFrame dataset
# text_col_name: the column name of texts in the data_df
# params: parameters for this approach. If omitted, default parameters are used.
clusters = topic_modeling_by_kmeans(data=data_df, text_col_name='userInput', params=kmeans_params)
Source distribution: wm_topicgpt-0.0.9.tar.gz (20.9 kB)

Hashes for wm_topicgpt-0.0.9-py3-none-any.whl:

Algorithm | Hash digest
---|---
SHA256 | e6d9239f1a8cca08267d5784d3084ffd00d8845fcaa85aaeea303c8761115ff3
MD5 | 58b0b9981cd43039e9a447db0d58b3cc
BLAKE2b-256 | 842e7976febcd799064b49d6a01e459384fc2c974a13b125eb25f3d6aada2a5b