Official implementation of TopicGPT: A Prompt-based Topic Modeling Framework (NAACL'24)

TopicGPT

This repository contains scripts and prompts for our paper "TopicGPT: A Prompt-based Topic Modeling Framework" (NAACL'24).

TopicGPT Pipeline Overview

📣 Updates

  • [11/09/24] Python package topicgpt_python is released. You can install it via pip install topicgpt_python.
  • [11/18/23] Second-level topic generation code and refinement code are uploaded.
  • [11/11/23] Basic pipeline is uploaded. Refinement and second-level topic generation code are coming soon.

📦 Using TopicGPT

Getting Started

  • Install the requirements: pip install topicgpt_python
  • Set the API keys and credentials for the provider(s) you plan to use:
export OPENAI_API_KEY={your_openai_api_key}
export VERTEX_PROJECT={your_vertex_project}
export VERTEX_LOCATION={your_vertex_location}
export HF_TOKEN={your_huggingface_token}
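
Before launching a run, it can help to confirm that the variables above are actually visible to Python. The snippet below is a minimal sketch and is not part of topicgpt_python: the grouping of variables by provider and the helper name `missing_env_vars` are assumptions made for illustration.

```python
import os

# Assumed grouping of the variables listed above by provider (illustrative only).
REQUIRED_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "vertex": ["VERTEX_PROJECT", "VERTEX_LOCATION"],
    "huggingface": ["HF_TOKEN"],
}


def missing_env_vars(provider, environ=None):
    """Return the variables for `provider` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS[provider] if not env.get(name)]


if __name__ == "__main__":
    # Report the status of each provider's variables in the current shell.
    for provider in REQUIRED_VARS:
        missing = missing_env_vars(provider)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{provider}: {status}")
```

A provider whose list comes back empty is ready to use; the others can be ignored if you do not plan to call them.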

Data

  • Prepare your .jsonl data file in the following format:
    {
        "id": "Optional IDs",
        "text": "Documents",
        "label": "Ground-truth labels (optional)"
    }
    
  • Put the data file in data/input. A sample data file, data/input/sample.jsonl, is provided for debugging the code.
  • #TODO: fix - If you want to sample a subset of the data for topic generation, run python script/data.py --data <data_file> --num_samples 1000 --output <output_file>. This samples 1000 documents from the data file and saves them to <output_file>. Change the value of --num_samples to sample a different number of documents; see the paper for more detail.
  • Raw dataset used in the paper (Bills and Wiki): [link].
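
The data format and the subsampling step above can be sketched in plain Python. This is a stand-in under stated assumptions, not the implementation of script/data.py: the helper names `write_jsonl`, `read_jsonl`, and `sample_records`, the fixed random seed, and the file name `my_data.jsonl` are all hypothetical.

```python
import json
import random


def write_jsonl(records, path):
    """Write one JSON object per line, as the .jsonl format expects."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a .jsonl file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def sample_records(records, num_samples, seed=0):
    """Return up to `num_samples` records, drawn uniformly without replacement."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return rng.sample(records, min(num_samples, len(records)))


if __name__ == "__main__":
    # "id" and "label" are optional; "text" holds the document itself.
    docs = [{"id": str(i), "text": f"Document {i}", "label": "example"} for i in range(5)]
    write_jsonl(docs, "my_data.jsonl")  # hypothetical file name
    subset = sample_records(read_jsonl("my_data.jsonl"), num_samples=3)
    print(len(subset), "documents sampled")
```

Each line of the resulting file is a standalone JSON object, so the file can be streamed line by line rather than parsed as a single JSON array.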

Pipeline

  • You can either run script/run.sh to run the entire pipeline or run each step individually. See the notebook in script/example.ipynb for a step-by-step guide.

  • Topic generation: Modify the prompts according to the templates in templates/generation_1.txt and templates/seed_1.md. Then, to run topic generation, do:

    python3 script/generation_1.py --deployment_name gpt-4 \
                            --max_tokens 300 --temperature 0.0 --top_p 0.0 \
                            --data data/input/sample.jsonl \
                            --prompt_file prompt/generation_1.txt \
                            --seed_file prompt/seed_1.md \
                            --out_file data/output/generation_1.jsonl \
                            --topic_file data/output/generation_1.md \
                            --verbose True
    
  • Topic refinement: If you want to refine the topics, modify the prompts according to the templates in templates/refinement.txt. Then, to run topic refinement, do:

    python3 script/refinement.py --deployment_name gpt-4 \
                    --max_tokens 500 --temperature 0.0 --top_p 0.0 \
                    --prompt_file prompt/refinement.txt \
                    --generation_file data/output/generation_1.jsonl \
                    --topic_file data/output/generation_1.md \
                    --out_file data/output/refinement.md \
                    --verbose True \
                    --updated_file data/output/refinement.jsonl \
                    --mapping_file data/output/refinement_mapping.txt \
                    --refined_again False \
                    --remove False
    
  • Topic assignment: Modify the prompts according to the templates in templates/assignment.txt. Then, to run topic assignment, do:

    python3 script/assignment.py --deployment_name gpt-3.5-turbo \
                            --max_tokens 300 --temperature 0.0 --top_p 0.0 \
                            --data data/input/sample.jsonl \
                            --prompt_file prompt/assignment.txt \
                            --topic_file data/output/generation_1.md \
                            --out_file data/output/assignment.jsonl \
                            --verbose True
    
  • Topic correction: If the assignment contains errors or hallucinated topics, modify the prompts according to the templates in templates/correction.txt (note that this prompt is very similar to the assignment prompt, only adding a {Message} field towards the end of the prompt). Then, to run topic correction, do:

    python3 script/correction.py --deployment_name gpt-3.5-turbo \
                            --max_tokens 300 --temperature 0.0 --top_p 0.0 \
                            --data data/output/assignment.jsonl \
                            --prompt_file prompt/correction.txt \
                            --topic_file data/output/generation_1.md \
                            --out_file data/output/assignment_corrected.jsonl \
                            --verbose True
    
  • Second-level topic generation: If you want to generate second-level topics, modify the prompts according to the templates in templates/generation_2.txt. Then, to run second-level topic generation, do:

    python3 script/generation_2.py --deployment_name gpt-4 \
                    --max_tokens 300 --temperature 0.0 --top_p 0.0 \
                    --data data/output/generation_1.jsonl \
                    --seed_file data/output/generation_1.md \
                    --prompt_file prompt/generation_2.txt \
                    --out_file data/output/generation_2.jsonl \
                    --topic_file data/output/generation_2.md \
                    --verbose True
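
If you prefer to drive the individual steps from Python rather than from the shell, the commands above can be assembled as argument lists. The following is a rough sketch covering only topic generation followed by topic assignment. The flags and paths are copied from the commands above; the helper names `build_command` and `run_steps` are hypothetical, and the sketch assumes the scripts accept their flags in any order. It may differ from what script/run.sh does.

```python
import shlex
import subprocess


def build_command(script, **flags):
    """Turn keyword arguments into a `python3 <script> --flag value` argument list."""
    cmd = ["python3", script]
    for name, value in flags.items():
        cmd += [f"--{name}", str(value)]
    return cmd


# Flags shared by the two steps sketched here (values taken from the commands above).
COMMON = {"max_tokens": 300, "temperature": 0.0, "top_p": 0.0, "verbose": True}

STEPS = [
    build_command(
        "script/generation_1.py",
        deployment_name="gpt-4",
        data="data/input/sample.jsonl",
        prompt_file="prompt/generation_1.txt",
        seed_file="prompt/seed_1.md",
        out_file="data/output/generation_1.jsonl",
        topic_file="data/output/generation_1.md",
        **COMMON,
    ),
    build_command(
        "script/assignment.py",
        deployment_name="gpt-3.5-turbo",
        data="data/input/sample.jsonl",
        prompt_file="prompt/assignment.txt",
        topic_file="data/output/generation_1.md",
        out_file="data/output/assignment.jsonl",
        **COMMON,
    ),
]


def run_steps(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in steps:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises at the first failing step


if __name__ == "__main__":
    run_steps(STEPS)  # dry run: prints the commands without calling any API
```

The dry-run default prints the commands without executing them, which avoids spending API credits by accident; pass dry_run=False to run the steps in order.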
    

📜 Citation

@misc{pham2023topicgpt,
      title={TopicGPT: A Prompt-based Topic Modeling Framework}, 
      author={Chau Minh Pham and Alexander Hoyle and Simeng Sun and Mohit Iyyer},
      year={2023},
      eprint={2311.01449},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
