Official implementation of TopicGPT: A Prompt-based Topic Modeling Framework (NAACL'24)
TopicGPT
This repository contains scripts and prompts for our paper "TopicGPT: Topic Modeling by Prompting Large Language Models" (NAACL'24).
📣 Updates
- [11/09/24] Python package `topicgpt_python` is released. You can install it via `pip install topicgpt_python`.
- [11/18/23] Second-level topic generation code and refinement code are uploaded.
- [11/11/23] Basic pipeline is uploaded. Refinement and second-level topic generation code are coming soon.
📦 Using TopicGPT
Getting Started
- Install the requirements:
    pip install topicgpt_python
- Set your API keys:
    export OPENAI_API_KEY={your_openai_api_key}
    export VERTEX_PROJECT={your_vertex_project}
    export VERTEX_LOCATION={your_vertex_location}
    export HF_TOKEN={your_huggingface_token}
- Refer to https://openai.com/pricing/ for OpenAI API pricing or to https://cloud.google.com/vertex-ai/pricing for Vertex API pricing.
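Before running anything, it can help to confirm which of the variables above are actually visible to Python. The following is a minimal sketch, not part of `topicgpt_python`: the helper name `missing_keys` and the grouping of variables by provider are assumptions made for illustration, on the guess that you only need the variables for the provider you use.

```python
import os

# Environment variables named in the setup step above.
# The grouping by provider is an assumption for this sketch.
PROVIDER_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "vertex": ["VERTEX_PROJECT", "VERTEX_LOCATION"],
    "huggingface": ["HF_TOKEN"],
}


def missing_keys(provider, env=None):
    """Return the variables for `provider` that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in PROVIDER_VARS[provider] if not env.get(name)]


# Report the status of each provider in the current shell environment.
for provider in PROVIDER_VARS:
    missing = missing_keys(provider)
    status = "ok" if not missing else "missing " + ", ".join(missing)
    print(f"{provider}: {status}")
```

Passing a plain dictionary as `env` lets you check a configuration without touching the real environment.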
Data
- Prepare your `.jsonl` data file in the following format, with one JSON object per line:
    { "id": "Optional IDs", "text": "Documents", "label": "Ground-truth labels (optional)" }
- Put the data file in `data/input`. There is also a sample data file, `data/input/sample.jsonl`, that you can use to debug the code.
- If you want to sample a subset of the data for topic generation, run `python script/data.py --data <data_file> --num_samples 1000 --output <output_file>`. This samples 1000 documents from the data file and saves them to `<output_file>`. Change the value of `--num_samples` to sample a different number of documents; see the paper for more detail.
- Raw dataset used in the paper (Bills and Wiki): [link].
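The data format and the sampling step above can be sketched in a few lines of Python. This is an illustration only and does not reproduce `script/data.py`: the function names are hypothetical, the fixed random seed is an assumption, and the check that every record has a `text` field is inferred from the fact that only `id` and `label` are marked optional.

```python
import json
import random


def write_jsonl(records, path):
    """Write one JSON object per line, as the .jsonl format expects."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # Assumption: "text" is the only required field.
            if "text" not in record:
                raise ValueError("every record needs a 'text' field")
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a .jsonl file into a list of dictionaries, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def sample_records(records, num_samples, seed=0):
    """Sample without replacement; return all records if there are too few."""
    if num_samples >= len(records):
        return list(records)
    return random.Random(seed).sample(records, num_samples)
```

For example, `write_jsonl(sample_records(read_jsonl(data_file), 1000), output_file)` would mirror the command above, with `data_file` and `output_file` standing in for your own paths.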
Pipeline
- You can either run `script/run.sh` to run the entire pipeline or run each step individually. See the notebook in `script/example.ipynb` for a step-by-step guide.
- Topic generation: Modify the prompts according to the templates in `templates/generation_1.txt` and `templates/seed_1.md`. Then, to run topic generation, do:
    python3 script/generation_1.py --deployment_name gpt-4 \
        --max_tokens 300 --temperature 0.0 --top_p 0.0 \
        --data data/input/sample.jsonl \
        --prompt_file prompt/generation_1.txt \
        --seed_file prompt/seed_1.md \
        --out_file data/output/generation_1.jsonl \
        --topic_file data/output/generation_1.md \
        --verbose True
- Topic refinement: If you want to refine the topics, modify the prompts according to the templates in `templates/refinement.txt`. Then, to run topic refinement, do:
    python3 script/refinement.py --deployment_name gpt-4 \
        --max_tokens 500 --temperature 0.0 --top_p 0.0 \
        --prompt_file prompt/refinement.txt \
        --generation_file data/output/generation_1.jsonl \
        --topic_file data/output/generation_1.md \
        --out_file data/output/refinement.md \
        --verbose True \
        --updated_file data/output/refinement.jsonl \
        --mapping_file data/output/refinement_mapping.txt \
        --refined_again False \
        --remove False
- Topic assignment: Modify the prompts according to the templates in `templates/assignment.txt`. Then, to run topic assignment, do:
    python3 script/assignment.py --deployment_name gpt-3.5-turbo \
        --max_tokens 300 --temperature 0.0 --top_p 0.0 \
        --data data/input/sample.jsonl \
        --prompt_file prompt/assignment.txt \
        --topic_file data/output/generation_1.md \
        --out_file data/output/assignment.jsonl \
        --verbose True
- Topic correction: If the assignment contains errors or hallucinated topics, modify the prompts according to the templates in `templates/correction.txt` (note that this prompt is very similar to the assignment prompt, only adding a `{Message}` field towards the end of the prompt). Then, to run topic correction, do:
    python3 script/correction.py --deployment_name gpt-3.5-turbo \
        --max_tokens 300 --temperature 0.0 --top_p 0.0 \
        --data data/output/assignment.jsonl \
        --prompt_file prompt/correction.txt \
        --topic_file data/output/generation_1.md \
        --out_file data/output/assignment_corrected.jsonl \
        --verbose True
- Second-level topic generation: If you want to generate second-level topics, modify the prompts according to the templates in `templates/generation_2.txt`. Then, to run second-level topic generation, do:
    python3 script/generation_2.py --deployment_name gpt-4 \
        --max_tokens 300 --temperature 0.0 --top_p 0.0 \
        --data data/output/generation_1.jsonl \
        --seed_file data/output/generation_1.md \
        --prompt_file prompt/generation_2.txt \
        --out_file data/output/generation_2.jsonl \
        --topic_file data/output/generation_2.md \
        --verbose True
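The steps above chain through their files: the topic file written by generation is read by assignment, and the assignment output becomes the input to correction. The sketch below makes that chaining explicit by building the argument lists for those three steps, using only the flags and paths shown in the commands above. It is a dry run that prints the commands rather than executing them; `build_command` and `PIPELINE` are hypothetical names, not part of this repository.

```python
import shlex

# Sampling flags shared by every command shown above.
COMMON = ["--temperature", "0.0", "--top_p", "0.0", "--verbose", "True"]


def build_command(script, deployment_name, max_tokens, **files):
    """Build the argument list for one pipeline step."""
    cmd = ["python3", script,
           "--deployment_name", deployment_name,
           "--max_tokens", str(max_tokens)]
    cmd += COMMON
    for flag, path in files.items():
        cmd += [f"--{flag}", path]
    return cmd


TOPICS = "data/output/generation_1.md"       # written by generation
ASSIGNED = "data/output/assignment.jsonl"    # written by assignment

PIPELINE = [
    build_command("script/generation_1.py", "gpt-4", 300,
                  data="data/input/sample.jsonl",
                  prompt_file="prompt/generation_1.txt",
                  seed_file="prompt/seed_1.md",
                  out_file="data/output/generation_1.jsonl",
                  topic_file=TOPICS),
    build_command("script/assignment.py", "gpt-3.5-turbo", 300,
                  data="data/input/sample.jsonl",
                  prompt_file="prompt/assignment.txt",
                  topic_file=TOPICS,
                  out_file=ASSIGNED),
    build_command("script/correction.py", "gpt-3.5-turbo", 300,
                  data=ASSIGNED,
                  prompt_file="prompt/correction.txt",
                  topic_file=TOPICS,
                  out_file="data/output/assignment_corrected.jsonl"),
]

# Dry run: print each command in a form that can be pasted into a shell.
for cmd in PIPELINE:
    print(shlex.join(cmd))
```

To execute a step instead of printing it, each list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Building the commands as lists rather than strings avoids shell-quoting problems if a path contains spaces.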
📜 Citation
@misc{pham2023topicgpt,
title={TopicGPT: A Prompt-based Topic Modeling Framework},
author={Chau Minh Pham and Alexander Hoyle and Simeng Sun and Mohit Iyyer},
year={2023},
eprint={2311.01449},
archivePrefix={arXiv},
primaryClass={cs.CL}
}