# TopicGPT

Official implementation of TopicGPT: A Prompt-based Topic Modeling Framework (NAACL'24).

This repository contains scripts and prompts for our paper "TopicGPT: Topic Modeling by Prompting Large Language Models" (NAACL'24). Our `topicgpt_python` package consists of five main functions:
- `generate_topic_lvl1` generates high-level and generalizable topics.
- `generate_topic_lvl2` generates low-level topics specific to each high-level topic.
- `refine_topics` refines the generated topics by merging similar topics and removing irrelevant topics.
- `assign_topics` assigns the generated topics to the input text, along with a quote that supports each assignment.
- `correct_topics` corrects the generated topic assignments by reprompting the model so that the final assignment is grounded in the topic list.
## 📣 Updates

- [11/09/24] Python package `topicgpt_python` is released! You can install it via `pip install topicgpt_python`. We support the OpenAI API, Vertex AI, and vLLM (requires GPUs for inference). See PyPI.
- [11/18/23] Second-level topic generation code and refinement code are uploaded.
- [11/11/23] Basic pipeline is uploaded. Refinement and second-level topic generation code are coming soon.
## 📦 Using TopicGPT

### Getting Started

- Make a new Python 3.9+ environment using virtualenv or conda.
- Install the required package:
  ```bash
  pip install topicgpt_python
  ```
- Set your API keys:
  ```bash
  export OPENAI_API_KEY={your_openai_api_key}
  export VERTEX_PROJECT={your_vertex_project}
  export VERTEX_LOCATION={your_vertex_location}
  ```
- Refer to https://openai.com/pricing/ for OpenAI API pricing or to https://cloud.google.com/vertex-ai/pricing for Vertex AI pricing.
### Data

- Prepare your `.jsonl` data file in the following format:
  ```json
  {"id": "IDs (optional)", "text": "Documents", "label": "Ground-truth labels (optional)"}
  ```
- Put your data file in `data/input`. There is also a sample data file `data/input/sample.jsonl` to debug the code.
- Raw dataset used in the paper (Bills and Wiki): [link].
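As an illustration, a file in the format above can be produced with a few lines of standard-library Python (the documents and the output filename here are made-up examples):

```python
import json
import os

# Hypothetical example documents; "id" and "label" are optional fields.
docs = [
    {"id": "0", "text": "A bill to amend the Internal Revenue Code.", "label": "taxation"},
    {"id": "1", "text": "A bill to expand rural broadband access.", "label": "technology"},
]

# Write one JSON object per line (.jsonl) into data/input.
os.makedirs("data/input", exist_ok=True)
with open("data/input/my_data.jsonl", "w") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")
```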
### Pipeline

Check out `demo.ipynb` for a complete pipeline and more detailed instructions. We advise you to try running on a subset with cheaper (or open-source) models first before scaling up to the entire dataset.
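One simple way to carve out such a debugging subset from a full `.jsonl` file (the filenames are assumptions; here a tiny dataset is fabricated just for illustration):

```python
import itertools
import json
import os

os.makedirs("data/input", exist_ok=True)

# Fabricate a stand-in for a full dataset, purely for this example.
with open("data/input/full.jsonl", "w") as f:
    for i in range(100):
        f.write(json.dumps({"id": str(i), "text": f"Document {i}"}) + "\n")

# Keep only the first 20 documents for a cheap trial run.
with open("data/input/full.jsonl") as src, open("data/input/subset.jsonl", "w") as dst:
    for line in itertools.islice(src, 20):
        dst.write(line)
```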
- Define I/O paths in `config.yml`.
- Load the package and config file:
  ```python
  import yaml

  from topicgpt_python import *

  with open("config.yml", "r") as f:
      config = yaml.safe_load(f)
  ```
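The exact schema of `config.yml` ships with the repository; based only on the keys referenced in the steps below, a minimal sketch might look like this (all paths and values are illustrative assumptions, not the packaged defaults):

```yaml
verbose: true
data_sample: data/input/sample.jsonl
generate_subtopics: false
refining_topics: false
generation:
  prompt: prompt/generation_1.txt
  seed: prompt/seed_1.md
  output: data/output/generation_1.jsonl
  topic_output: data/output/generation_1.md
assignment:
  prompt: prompt/assignment.txt
  output: data/output/assignment.jsonl
correction:
  prompt: prompt/correction.txt
  output: data/output/assignment_corrected.jsonl
```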
- Generate high-level topics:
  ```python
  generate_topic_lvl1(
      api,
      model,
      config['data_sample'],
      config['generation']['prompt'],
      config['generation']['seed'],
      config['generation']['output'],
      config['generation']['topic_output'],
      verbose=config['verbose'],
  )
  ```
- Generate low-level topics (optional):
  ```python
  if config['generate_subtopics']:
      generate_topic_lvl2(
          api,
          model,
          config['generation']['topic_output'],
          config['generation']['output'],
          config['generation_2']['prompt'],
          config['generation_2']['output'],
          config['generation_2']['topic_output'],
          verbose=config['verbose'],
      )
  ```
- Refine the generated topics by merging near duplicates and removing topics with low frequency (optional):
  ```python
  # TODO: change refining_topics to True if you want to refine the topics again
  if config['refining_topics']:
      refine_topics(
          api,
          model,
          config['refinement']['prompt'],
          config['generation']['output'],
          config['refinement']['topic_output'],
          config['refinement']['prompt'],
          config['refinement']['output'],
          verbose=config['verbose'],
          remove=config['refinement']['remove'],
          mapping_file=config['refinement']['mapping_file'],
      )
  ```
- Assign and correct the topics, usually with a weaker model if using paid APIs to save cost:
  ```python
  assign_topics(
      api,
      model,
      config['data_sample'],
      config['assignment']['prompt'],
      config['assignment']['output'],
      config['generation']['topic_output'],  # TODO: change to generation_2 if you have subtopics, or config['refinement']['topic_output'] if you refined topics
      verbose=config['verbose'],
  )
  correct_topics(
      api,
      model,
      config['assignment']['output'],
      config['correction']['prompt'],
      config['generation']['topic_output'],  # TODO: change to generation_2 if you have subtopics, or config['refinement']['topic_output'] if you refined topics
      config['correction']['output'],
      verbose=config['verbose'],
  )
  ```
- Check out the `data/output` folder for sample outputs.
- We also offer metric calculation functions in `topicgpt_python.metrics` to evaluate the alignment between the generated topics and the ground-truth labels (Adjusted Rand Index, Harmonic Purity, and Normalized Mutual Information).
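The package's metrics module provides these calculations; purely to illustrate what harmonic purity measures (the harmonic mean of purity and inverse purity between the topic assignment and the ground-truth labels), here is a minimal pure-Python sketch, not the package's own code:

```python
from collections import Counter


def purity(clusters, labels):
    """Fraction of documents covered by each cluster's majority label."""
    groups = {}
    for c, l in zip(clusters, labels):
        groups.setdefault(c, []).append(l)
    return sum(max(Counter(g).values()) for g in groups.values()) / len(labels)


def harmonic_purity(clusters, labels):
    """Harmonic mean of purity and inverse purity (roles swapped)."""
    p = purity(clusters, labels)
    ip = purity(labels, clusters)
    return 2 * p * ip / (p + ip)


print(harmonic_purity(["A", "A", "B", "B"], ["x", "x", "y", "x"]))  # 0.75
```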
## 📜 Citation

```bibtex
@misc{pham2023topicgpt,
    title={TopicGPT: A Prompt-based Topic Modeling Framework},
    author={Chau Minh Pham and Alexander Hoyle and Simeng Sun and Mohit Iyyer},
    year={2023},
    eprint={2311.01449},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```