
Official implementation of TopicGPT: A Prompt-based Topic Modeling Framework (NAACL'24)


TopicGPT


This repository contains scripts and prompts for our paper "TopicGPT: Topic Modeling by Prompting Large Language Models" (NAACL'24). Our topicgpt_python package consists of five main functions:

  • generate_topic_lvl1 generates high-level and generalizable topics.
  • generate_topic_lvl2 generates low-level, specific topics under each high-level topic.
  • refine_topics refines the generated topics by merging similar topics and removing irrelevant topics.
  • assign_topics assigns the generated topics to the input text, along with a quote that supports the assignment.
  • correct_topics corrects the generated topics by reprompting the model so that the final topic assignment is grounded in the topic list.

📣 Updates

  • [11/09/24] Python package topicgpt_python is released! You can install it via pip install topicgpt_python. We support OpenAI API, VertexAI, Azure API, Gemini API, and vLLM (requires GPUs for inference). See PyPI.
  • [11/18/23] Second-level topic generation code and refinement code are uploaded.
  • [11/11/23] Basic pipeline is uploaded. Refinement and second-level topic generation code are coming soon.

📦 Using TopicGPT

Getting Started

  1. Make a new Python 3.9+ environment using virtualenv or conda.
  2. Install the required packages:
    pip install topicgpt_python
    
  3. Set your API key(s):
    # Run in shell
    # Needed only for the OpenAI API deployment
    export OPENAI_API_KEY={your_openai_api_key}
    
    # Needed only for the Vertex AI deployment
    export VERTEX_PROJECT={your_vertex_project}   # e.g. my-project
    export VERTEX_LOCATION={your_vertex_location} # e.g. us-central1
    
    # Needed only for Gemini deployment
    export GEMINI_API_KEY={your_gemini_api_key}
    
    # Needed only for the Azure API deployment
    export AZURE_OPENAI_API_KEY={your_azure_api_key}
    export AZURE_OPENAI_ENDPOINT={your_azure_endpoint}
    
  • Refer to https://openai.com/pricing/ for OpenAI API pricing or to https://cloud.google.com/vertex-ai/pricing for Vertex AI pricing.
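Before launching a run, it can help to confirm that the variables for your chosen deployment are actually set. The snippet below is a minimal sketch, not part of topicgpt_python: the helper name missing_env_vars and the backend labels used as dictionary keys are assumptions made for illustration. Only the environment variable names come from the setup step above.

```python
import os

# Environment variables listed in the setup step, grouped by deployment.
# The dictionary keys are illustrative labels chosen for this sketch.
REQUIRED_ENV = {
    "openai": ["OPENAI_API_KEY"],
    "vertex": ["VERTEX_PROJECT", "VERTEX_LOCATION"],
    "gemini": ["GEMINI_API_KEY"],
    "azure": ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"],
}


def missing_env_vars(backend, environ=None):
    """Return the variables for `backend` that are unset or empty.

    Hypothetical helper; `environ` defaults to the real process environment.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV[backend] if not env.get(name)]


# Report the status of every deployment in the current shell.
for backend in REQUIRED_ENV:
    missing = missing_env_vars(backend)
    status = "ready" if not missing else "missing " + ", ".join(missing)
    print(f"{backend}: {status}")
```

Only the deployment you plan to use needs to report "ready".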

Data

  • Prepare your .jsonl data file in the following format:
    {
        "id": "IDs (optional)",
        "text": "Documents",
        "label": "Ground-truth labels (optional)"
    }
    
  • Put your data file in data/input. A sample data file, data/input/sample.jsonl, is included for debugging.
  • Raw dataset used in the paper (Bills and Wiki): [link].
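A .jsonl file stores one JSON object per line, so the record shown above should occupy a single line in the actual file. The following is a minimal sketch of writing and reading such a file with the standard library. The helper names write_jsonl and read_jsonl and the two example documents are placeholders invented for illustration. The demo writes to a temporary directory; in practice you would point the path at a file under data/input.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write each record as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a .jsonl file back into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Placeholder documents: "text" is required, "id" and "label" are optional.
docs = [
    {"id": "doc-1", "text": "First example document.", "label": "example"},
    {"text": "Second example document without id or label."},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_data.jsonl")
    write_jsonl(docs, path)
    print(read_jsonl(path) == docs)
```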

Pipeline

Check out demo.ipynb for a complete pipeline and more detailed instructions. We advise you to try running on a subset with cheaper (or open-source) models first before scaling up to the entire dataset.
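One way to follow the advice above is to draw a small random subset of the input file and run the pipeline on that first. This is a sketch under stated assumptions: sample_jsonl is a hypothetical helper, not a function of topicgpt_python, and the subset size and seed are arbitrary.

```python
import json
import random


def sample_jsonl(src_path, dst_path, n, seed=0):
    """Copy up to `n` randomly chosen records from src_path to dst_path.

    Hypothetical helper for trial runs. Returns the number of records written.
    """
    with open(src_path, "r", encoding="utf-8") as f:
        lines = [line.strip() for line in f if line.strip()]

    # Parse each line once so malformed records fail here, not mid-pipeline.
    for line in lines:
        json.loads(line)

    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    subset = rng.sample(lines, min(n, len(lines)))

    with open(dst_path, "w", encoding="utf-8") as f:
        for line in subset:
            f.write(line + "\n")
    return len(subset)
```

For example, sample_jsonl("data/input/sample.jsonl", "data/input/subset.jsonl", 100) would write at most 100 documents to a new file (the output file name is an arbitrary choice), which can then be passed as the data argument in the steps below.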

  1. (Optional) Define I/O paths in config.yml and load using:

    import yaml
    
    with open("config.yml", "r") as f:
        config = yaml.safe_load(f)
    
  2. Load the package:

    from topicgpt_python import *
    
  3. Generate high-level topics:

    generate_topic_lvl1(api, model, data, prompt_file, seed_file, out_file, topic_file, verbose)
    
  4. Generate low-level topics (optional):

    generate_topic_lvl2(api, model, seed_file, data, prompt_file, out_file, topic_file, verbose)
    
  5. Refine the generated topics by merging near duplicates and removing topics with low frequency (optional):

    refine_topics(api, model, prompt_file, generation_file, topic_file, out_file, updated_file, verbose, remove, mapping_file)
    
  6. Assign and correct the topics. When using paid APIs, a cheaper (weaker) model is usually sufficient for this step and saves cost:

    assign_topics(
        api, model, data, prompt_file, out_file, topic_file, verbose
    )
    
    correct_topics(
        api, model, data_path, prompt_path, topic_path, output_path, verbose
    )
    
  7. Check out the data/output folder for sample outputs.

  8. We also offer metric calculation functions in topicgpt_python.metrics to evaluate the alignment between the generated topics and the ground-truth labels (Adjusted Rand Index, Harmonic Purity, and Normalized Mutual Information).
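To make one of these metrics concrete, the sketch below computes harmonic purity in plain Python. It assumes that harmonic purity means the harmonic mean of purity and inverse purity. It is an illustration only and is not the implementation or the API of topicgpt_python.metrics; the function names and the toy labels are made up for this example.

```python
from collections import Counter, defaultdict


def purity(pred, gold):
    """Share of documents that carry the majority gold label of their predicted topic."""
    if not pred or len(pred) != len(gold):
        raise ValueError("pred and gold must be non-empty and of equal length")
    by_topic = defaultdict(Counter)
    for p, g in zip(pred, gold):
        by_topic[p][g] += 1
    return sum(max(counts.values()) for counts in by_topic.values()) / len(pred)


def harmonic_purity(pred, gold):
    """Harmonic mean of purity and inverse purity (assumed definition)."""
    p = purity(pred, gold)
    inv = purity(gold, pred)  # inverse purity: swap the roles of topics and labels
    return 2 * p * inv / (p + inv)


# Toy example: four documents, two assigned topics, two ground-truth labels.
assigned = ["A", "A", "B", "B"]
labels = ["x", "x", "x", "y"]
print(harmonic_purity(assigned, labels))  # 0.75
```

In the toy example, topic A is pure (2 of 2) and topic B is split (1 of 2), so purity is 3/4. Inverse purity is also 3/4, which gives a harmonic mean of 0.75.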

📜 Citation

@misc{pham2023topicgpt,
      title={TopicGPT: A Prompt-based Topic Modeling Framework}, 
      author={Chau Minh Pham and Alexander Hoyle and Simeng Sun and Mohit Iyyer},
      year={2023},
      eprint={2311.01449},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

