
Official implementation of TopicGPT: A Prompt-based Topic Modeling Framework (NAACL'24)

Project description

TopicGPT

arXiv | Website

This repository contains scripts and prompts for our paper "TopicGPT: A Prompt-based Topic Modeling Framework" (NAACL'24). Our topicgpt_python package consists of five main functions:

  • generate_topic_lvl1 generates high-level and generalizable topics.
  • generate_topic_lvl2 generates low-level topics that are specific to each high-level topic.
  • refine_topics refines the generated topics by merging similar topics and removing irrelevant topics.
  • assign_topics assigns the generated topics to the input text, along with a quote that supports the assignment.
  • correct_topics corrects topic assignments by reprompting the model so that each final assignment is grounded in the topic list.

📣 Updates

  • [11/09/24] Python package topicgpt_python is released! You can install it via pip install topicgpt_python. We support the OpenAI API, Vertex AI, and vLLM (vLLM requires GPUs for local inference). See PyPI.
  • [11/18/23] Second-level topic generation code and refinement code are uploaded.
  • [11/11/23] Basic pipeline is uploaded. Refinement and second-level topic generation code are coming soon.

📦 Using TopicGPT

Getting Started

  1. Make a new Python 3.9+ environment using virtualenv or conda.
  2. Install the required packages:
    pip install topicgpt_python
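
If you plan to call a hosted backend, set your credentials before running the pipeline. A minimal sketch, assuming the standard environment variables read by the OpenAI and Google Cloud client libraries (vLLM runs locally and does not need an API key):

    import os

    os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder key, for the OpenAI backend
    # For Vertex AI, authenticate via gcloud or a service-account key file, e.g.:
    # os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/key.json"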
    

Data

  • Prepare your .jsonl data file in the following format (a short conversion sketch follows this list):
    {
        "id": "IDs (optional)",
        "text": "Documents",
        "label": "Ground-truth labels (optional)"
    }
    
  • Put your data file in data/input. A sample file, data/input/sample.jsonl, is included for debugging the pipeline.
  • Raw dataset used in the paper (Bills and Wiki): [link].
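
As referenced above, a minimal sketch (not part of the package) for writing your own documents into this format; the docs list and output path below are placeholders:

    import json

    docs = ["First document text.", "Second document text."]
    with open("data/input/sample.jsonl", "w") as f:
        for i, text in enumerate(docs):
            # "id" and "label" are optional fields; only "text" is required.
            f.write(json.dumps({"id": str(i), "text": text}) + "\n")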

Pipeline

Check out demo.ipynb for a complete pipeline and more detailed instructions. We recommend first running on a small subset with cheaper (or open-source) models before scaling up to the entire dataset; a sketch for carving out such a subset follows.
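
For example, a quick way (independent of the package) to keep only the first N documents for a cheap trial run; the paths and N are placeholders:

    # Copy the first N lines of the input .jsonl into a smaller debug file.
    N = 100
    with open("data/input/sample.jsonl") as src, open("data/input/subset.jsonl", "w") as dst:
        for i, line in enumerate(src):
            if i >= N:
                break
            dst.write(line)

Point config['data_sample'] at the subset file while experimenting, then switch back to the full dataset.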

  1. Define I/O paths in config.yml.
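
     A minimal sketch of the structure the calls below expect, written from Python for convenience; every path is a placeholder to replace with your own prompt, data, and output files (the config.yml and demo.ipynb shipped with the repository are the authoritative references):

     import yaml

     config = {
         "verbose": True,
         "data_sample": "data/input/sample.jsonl",
         "generate_subtopics": False,
         "refining_topics": True,
         "generation": {
             "prompt": "prompt/generation_1.txt",        # placeholder paths throughout
             "seed": "prompt/seed_1.md",
             "output": "data/output/generation_1.jsonl",
             "topic_output": "data/output/generation_1.md",
         },
         "generation_2": {
             "prompt": "prompt/generation_2.txt",
             "output": "data/output/generation_2.jsonl",
             "topic_output": "data/output/generation_2.md",
         },
         "refinement": {
             "prompt": "prompt/refinement.txt",
             "output": "data/output/refinement.jsonl",
             "topic_output": "data/output/refinement.md",
             "remove": True,
             "mapping_file": "data/output/refinement_mapping.json",
         },
         "assignment": {
             "prompt": "prompt/assignment.txt",
             "output": "data/output/assignment.jsonl",
         },
         "correction": {
             "prompt": "prompt/correction.txt",
             "output": "data/output/assignment_corrected.jsonl",
         },
     }

     with open("config.yml", "w") as f:
         yaml.safe_dump(config, f)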

  2. Load the package and config file:

    from topicgpt_python import *
    import yaml
    
    with open("config.yml", "r") as f:
        config = yaml.safe_load(f)
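
    # The calls below also expect an `api` string and a `model` name; the values
    # here are examples (the updates above list OpenAI, Vertex AI, and vLLM as
    # supported backends -- check the package docs for the exact strings).
    api = "openai"
    model = "gpt-4o"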
    
  3. Generate high-level topics:

    generate_topic_lvl1(api, model, 
                    config['data_sample'], 
                    config['generation']['prompt'], 
                    config['generation']['seed'], 
                    config['generation']['output'], 
                    config['generation']['topic_output'], 
                    verbose=config['verbose'])
    
  4. Generate low-level topics (optional):

    if config['generate_subtopics']: 
        generate_topic_lvl2(api, model, 
                            config['generation']['topic_output'],
                            config['generation']['output'],
                            config['generation_2']['prompt'],
                            config['generation_2']['output'],
                            config['generation_2']['topic_output'],
                            verbose=config['verbose'])
    
  5. Refine the generated topics by merging near duplicates and removing topics with low frequency (optional):

    if config['refining_topics']: 
        refine_topics(api, model, 
                      config['refinement']['prompt'],
                      config['generation']['output'], 
                      config['generation']['topic_output'],
                      config['refinement']['topic_output'],
                      config['refinement']['output'],
                      verbose=config['verbose'],
                      remove=config['refinement']['remove'], 
                      mapping_file=config['refinement']['mapping_file'],
                      refined_again=False)  # TODO: change to True if you want to refine the topics again
    
  6. Assign and correct the topics, typically with a weaker (cheaper) model if you are using a paid API, to save cost:

    assign_topics(api, model, 
                  config['data_sample'],
                  config['assignment']['prompt'],
                  config['assignment']['output'],
                  config['generation']['topic_output'],  # TODO: change to generation_2 if you have subtopics, or config['refinement']['topic_output'] if you refined topics
                  verbose=config['verbose'])

    correct_topics(api, model, 
                   config['assignment']['output'],
                   config['correction']['prompt'],
                   config['generation']['topic_output'],  # TODO: change to generation_2 if you have subtopics, or config['refinement']['topic_output'] if you refined topics
                   config['correction']['output'],
                   verbose=config['verbose'])
    
  7. Check out the data/output folder for sample outputs.

  8. We also offer metric calculation functions in topicgpt_python.metrics to evaluate the alignment between the generated topics and the ground-truth labels (Adjusted Rand Index, Harmonic Purity, and Normalized Mutual Information).
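
As an illustration only, using scikit-learn directly rather than topicgpt_python.metrics (whose exact function names may differ), two of these metrics can be computed from the corrected assignment file; the path and field names below are assumptions to adapt to your output format:

    import json
    from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

    true_labels, pred_topics = [], []
    with open("data/output/assignment_corrected.jsonl") as f:  # placeholder path
        for line in f:
            row = json.loads(line)
            true_labels.append(row["label"])       # ground-truth label field
            pred_topics.append(row["responses"])   # assumed field holding the assigned topic

    print("Adjusted Rand Index:", adjusted_rand_score(true_labels, pred_topics))
    print("Normalized Mutual Information:", normalized_mutual_info_score(true_labels, pred_topics))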

📜 Citation

@misc{pham2023topicgpt,
      title={TopicGPT: A Prompt-based Topic Modeling Framework}, 
      author={Chau Minh Pham and Alexander Hoyle and Simeng Sun and Mohit Iyyer},
      year={2023},
      eprint={2311.01449},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

topicgpt_python-0.2.1.tar.gz (23.2 kB)

Uploaded Source

Built Distribution

topicgpt_python-0.2.1-py3-none-any.whl (29.1 kB)

Uploaded Python 3

File details

Details for the file topicgpt_python-0.2.1.tar.gz.

File metadata

  • Download URL: topicgpt_python-0.2.1.tar.gz
  • Upload date:
  • Size: 23.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.20

File hashes

Hashes for topicgpt_python-0.2.1.tar.gz
  • SHA256: c5a8cf7a285ffc3d6a122bfb975ac44d0994791a8fa5db86e1b3ee23bdbfd8e9
  • MD5: c1afc0fe78f013716bf3102840ac759e
  • BLAKE2b-256: 38651e777de36d8ee4b69fe12d3b9a3e4e12fedebde75f901c5b8432e43b6cf5

See more details on using hashes here.

File details

Details for the file topicgpt_python-0.2.1-py3-none-any.whl.

File metadata

File hashes

Hashes for topicgpt_python-0.2.1-py3-none-any.whl
  • SHA256: 0b30a3aac26a48d6b57b0dde29dddc24215df71b261127941eeaa44a903f9f34
  • MD5: 401a16a76ff8a088e660b77feef2bd7c
  • BLAKE2b-256: 2a5e0ca12d0c4e0f4de27d91e0e4c5194d6f066dcd9ca671e2b9110beff70142

See more details on using hashes here.
