
Official implementation of TopicGPT: A Prompt-based Topic Modeling Framework (NAACL'24)


TopicGPT


This repository contains scripts and prompts for our paper "TopicGPT: A Prompt-based Topic Modeling Framework" (NAACL'24).

📣 Updates

  • [11/09/24] The Python package topicgpt_python has been released. Install it with pip install topicgpt_python.
  • [11/18/23] Second-level topic generation code and refinement code are uploaded.
  • [11/11/23] Basic pipeline is uploaded. Refinement and second-level topic generation code are coming soon.

📦 Using TopicGPT

Getting Started

  • Install the requirements: pip install topicgpt_python
  • Set your API keys:
export OPENAI_API_KEY={your_openai_api_key}
export VERTEX_PROJECT={your_vertex_project}
export VERTEX_LOCATION={your_vertex_location}
export HF_TOKEN={your_huggingface_token}
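
Before running any step, it can help to confirm that the variables exported above are visible to Python. The sketch below is not part of topicgpt_python; the helper name missing_keys is hypothetical, and the code only reports which of the four variables are unset or empty. It does not check that the values are valid.

```python
import os

# Environment variable names taken from the export lines above.
KEY_NAMES = ["OPENAI_API_KEY", "VERTEX_PROJECT", "VERTEX_LOCATION", "HF_TOKEN"]


def missing_keys(names, env=None):
    """Return the names that are unset or empty in the given environment.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    for name in missing_keys(KEY_NAMES):
        print(f"{name} is not set")
```

You may only need the variables for the provider you plan to call, so a reported name is a prompt to double-check rather than an error.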

Data

  • Prepare your .jsonl data file so that each line is a JSON object with the following fields:
    {
        "id": "Optional IDs",
        "text": "Documents",
        "label": "Ground-truth labels (optional)"
    }
    
  • Put the data file in data/input. A sample data file, data/input/sample.jsonl, is provided for debugging the code.
  • If you want to sample a subset of the data for topic generation, run python script/data.py --data <data_file> --num_samples 1000 --output <output_file>. This samples 1000 documents from the data file and saves them to <output_file>. Change the value of --num_samples to sample a different number of documents; see the paper for more detail.
  • Raw dataset used in the paper (Bills and Wiki): [link].
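
The data format and the sampling step above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in script/data.py: the function names write_jsonl, read_jsonl, and sample_records are hypothetical, and uniform sampling without replacement with a fixed seed is an assumption about what a reasonable sampler would do.

```python
import json
import os
import random
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line, as the .jsonl format requires."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            if "text" not in record:
                raise ValueError("every record needs a 'text' field")
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a .jsonl file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def sample_records(records, num_samples, seed=0):
    """Return up to num_samples records, chosen uniformly without replacement."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    k = min(num_samples, len(records))
    return rng.sample(records, k)


if __name__ == "__main__":
    docs = [{"id": str(i), "text": f"Document {i}", "label": "demo"} for i in range(10)]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.jsonl")
        write_jsonl(sample_records(docs, 3), path)
        print(read_jsonl(path))
```

Only the "text" field is checked here because the format above marks "id" and "label" as optional.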

Pipeline

  • You can either run script/run.sh to run the entire pipeline or run each step individually. See the notebook in script/example.ipynb for a step-by-step guide.

  • Topic generation: Modify the prompts according to the templates in templates/generation_1.txt and templates/seed_1.md. Then, to run topic generation, do:

    python3 script/generation_1.py --deployment_name gpt-4 \
                            --max_tokens 300 --temperature 0.0 --top_p 0.0 \
                            --data data/input/sample.jsonl \
                            --prompt_file prompt/generation_1.txt \
                            --seed_file prompt/seed_1.md \
                            --out_file data/output/generation_1.jsonl \
                            --topic_file data/output/generation_1.md \
                            --verbose True
    
  • Topic refinement: If you want to refine the topics, modify the prompts according to the templates in templates/refinement.txt. Then, to run topic refinement, do:

    python3 script/refinement.py --deployment_name gpt-4 \
                    --max_tokens 500 --temperature 0.0 --top_p 0.0 \
                    --prompt_file prompt/refinement.txt \
                    --generation_file data/output/generation_1.jsonl \
                    --topic_file data/output/generation_1.md \
                    --out_file data/output/refinement.md \
                    --verbose True \
                    --updated_file data/output/refinement.jsonl \
                    --mapping_file data/output/refinement_mapping.txt \
                    --refined_again False \
                    --remove False
    
  • Topic assignment: Modify the prompts according to the templates in templates/assignment.txt. Then, to run topic assignment, do:

    python3 script/assignment.py --deployment_name gpt-3.5-turbo \
                            --max_tokens 300 --temperature 0.0 --top_p 0.0 \
                            --data data/input/sample.jsonl \
                            --prompt_file prompt/assignment.txt \
                            --topic_file data/output/generation_1.md \
                            --out_file data/output/assignment.jsonl \
                            --verbose True
    
  • Topic correction: If the assignment contains errors or hallucinated topics, modify the prompts according to the templates in templates/correction.txt (note that this prompt is very similar to the assignment prompt, only adding a {Message} field towards the end of the prompt). Then, to run topic correction, do:

    python3 script/correction.py --deployment_name gpt-3.5-turbo \
                            --max_tokens 300 --temperature 0.0 --top_p 0.0 \
                            --data data/output/assignment.jsonl \
                            --prompt_file prompt/correction.txt \
                            --topic_file data/output/generation_1.md \
                            --out_file data/output/assignment_corrected.jsonl \
                            --verbose True
    
  • Second-level topic generation: If you want to generate second-level topics, modify the prompts according to the templates in templates/generation_2.txt. Then, to run second-level topic generation, do:

    python3 script/generation_2.py --deployment_name gpt-4 \
                    --max_tokens 300 --temperature 0.0 --top_p 0.0 \
                    --data data/output/generation_1.jsonl \
                    --seed_file data/output/generation_1.md \
                    --prompt_file prompt/generation_2.txt \
                    --out_file data/output/generation_2.jsonl \
                    --topic_file data/output/generation_2.md \
                    --verbose True
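
If you prefer to drive the individual steps from Python rather than from script/run.sh, the commands above can be chained with subprocess. The sketch below is an assumption-laden illustration, not the contents of script/run.sh: the names STEPS, build_command, and run_pipeline are hypothetical, the script paths and flags are copied from the commands above, and only generation, assignment, and correction are included (refinement and second-level generation can be added the same way).

```python
import subprocess

OUT = "data/output"

# (script, flags) pairs copied from the commands shown above.
STEPS = [
    ("script/generation_1.py", {
        "deployment_name": "gpt-4",
        "max_tokens": "300", "temperature": "0.0", "top_p": "0.0",
        "data": "data/input/sample.jsonl",
        "prompt_file": "prompt/generation_1.txt",
        "seed_file": "prompt/seed_1.md",
        "out_file": f"{OUT}/generation_1.jsonl",
        "topic_file": f"{OUT}/generation_1.md",
        "verbose": "True",
    }),
    ("script/assignment.py", {
        "deployment_name": "gpt-3.5-turbo",
        "max_tokens": "300", "temperature": "0.0", "top_p": "0.0",
        "data": "data/input/sample.jsonl",
        "prompt_file": "prompt/assignment.txt",
        "topic_file": f"{OUT}/generation_1.md",
        "out_file": f"{OUT}/assignment.jsonl",
        "verbose": "True",
    }),
    ("script/correction.py", {
        "deployment_name": "gpt-3.5-turbo",
        "max_tokens": "300", "temperature": "0.0", "top_p": "0.0",
        "data": f"{OUT}/assignment.jsonl",
        "prompt_file": "prompt/correction.txt",
        "topic_file": f"{OUT}/generation_1.md",
        "out_file": f"{OUT}/assignment_corrected.jsonl",
        "verbose": "True",
    }),
]


def build_command(script, flags):
    """Turn a script path and a flag dict into an argument list."""
    cmd = ["python3", script]
    for name, value in flags.items():
        cmd += [f"--{name}", value]
    return cmd


def run_pipeline(steps, dry_run=True):
    """Print each command (dry run) or execute the steps in order."""
    commands = [build_command(script, flags) for script, flags in steps]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises on the first failing step
    return commands
```

Calling run_pipeline(STEPS) prints the three commands without executing them; pass dry_run=False to run them from the repository root. Using check=True stops the chain at the first failing step, so a later step never reads a missing output file.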
    

📜 Citation

@misc{pham2023topicgpt,
      title={TopicGPT: A Prompt-based Topic Modeling Framework}, 
      author={Chau Minh Pham and Alexander Hoyle and Simeng Sun and Mohit Iyyer},
      year={2023},
      eprint={2311.01449},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

