Complementing topic models with few-shot in-context learning to generate interpretable topics

ConTextMining is a package that generates interpretable topic labels from the keywords of topic models (e.g., LDA, BERTopic) through few-shot in-context learning.

Requirements

Required packages

The following packages are required for ConTextMining.

  • torch (for installation instructions, please refer to pytorch.org)

  • transformers

  • tokenizers

  • huggingface-hub

  • flash_attn

  • accelerate

To install these packages, you can do the following:

pip install torch transformers tokenizers huggingface-hub flash_attn accelerate
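
After installing, it can help to confirm that every required package is actually visible to your Python environment. The following is a minimal sketch using only the standard library; the helper name find_missing is hypothetical and not part of ConTextMining, and the package names are copied from the list above.

```python
from importlib import metadata

# Package names copied from the requirements list above.
REQUIRED_PACKAGES = [
    "torch",
    "transformers",
    "tokenizers",
    "huggingface-hub",
    "flash_attn",
    "accelerate",
]


def find_missing(packages):
    """Return the names in `packages` that have no installed distribution."""
    missing = []
    for name in packages:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_PACKAGES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

If anything is reported missing, re-run the pip command above in the same environment.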

GPU requirements

ConTextMining requires at least one GPU.

VRAM requirements depend on factors such as the number of keywords and the number of topics you wish to label.

However, at least 8GB of VRAM is recommended.
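
To see whether your machine meets this recommendation, you can query the GPU through torch. The sketch below is illustrative only: the helper names are hypothetical, and it assumes "8GB" means 8 × 10^9 bytes, since drivers often report slightly less than the nominal card size.

```python
def meets_vram_recommendation(total_bytes, recommended_gb=8.0):
    """Return True if `total_bytes` is at least `recommended_gb` (1 GB = 10**9 bytes)."""
    return total_bytes / 10**9 >= recommended_gb


def report_gpus():
    """Return one human-readable line per visible CUDA GPU."""
    try:
        import torch
    except ImportError:
        return ["torch is not installed"]
    if not torch.cuda.is_available():
        return ["no CUDA GPU detected"]
    lines = []
    for index in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(index)
        status = "ok" if meets_vram_recommendation(props.total_memory) else "below recommendation"
        lines.append(f"GPU {index}: {props.name}, {props.total_memory / 10**9:.1f} GB ({status})")
    return lines


if __name__ == "__main__":
    for line in report_gpus():
        print(line)
```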

huggingface access token

You will need a huggingface access token. To obtain one:

  1. Create a huggingface account if you do not have one.

  2. Create and store a new access token. To learn more, please refer to huggingface.co/docs/hub/en/security-tokens.

  3. Note: Some pre-trained large language models (LLMs) may be gated and require you to request access before use. For more information, please refer to huggingface.co/docs/hub/en/models-gated.
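
Once you have a token, you may prefer to keep it out of your source code. One option, sketched below, is to read it from an environment variable. The helper load_hf_token and the choice of HF_TOKEN as the variable name are assumptions made for this example, not something ConTextMining prescribes.

```python
import os


def load_hf_token(env=os.environ, var_name="HF_TOKEN"):
    """Read the access token from `env`, failing loudly if it is missing or empty."""
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"No access token found; set the {var_name} environment variable first."
        )
    return token


if __name__ == "__main__":
    try:
        token = load_hf_token()
        print("Token loaded; length:", len(token))
    except RuntimeError as error:
        print(error)
```

The returned string can then be passed wherever an access token is expected.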

Installation

To install ConTextMining, run:

pip install ConTextMining

Quick Start

Here is a quick example of how to use ConTextMining to generate interpretable topic labels from the keywords of topic models.

from ConTextMining import get_topic_labels

# Specify your huggingface access token. To learn how to obtain one, refer to huggingface.co/docs/hub/en/security-tokens
hf_access_token = "<your huggingface access token>"

# Specify the huggingface model id. Choose between "microsoft/Phi-3-mini-4k-instruct", "meta-llama/Meta-Llama-3.1-8B-Instruct" or "google/gemma-2-2b-it"
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Specify the keywords for the few-shot learning examples
keywords_examples = [
    "olympic, year, said, games, team",
    "mr, bush, president, white, house",
    "said, report, evidence, findings, defense",
    "french, union, germany, workers, paris",
    "japanese, year, tokyo, matsui, said"
]

# Specify the labels. Each label CORRESPONDS BY INDEX to the entry in 'keywords_examples' above.
labels_examples = [
    "sports",
    "politics",
    "research",
    "france",
    "japan"
]

# Specify the topic modeling keywords for which you wish to generate coherent topic labels.
topic_modeling_keywords = '''Topic 1: [amazing, really, place, phenomenon, pleasant],
Topic 2: [loud, awful, sunday, like, slow],
Topic 3: [spinach, carrots, green, salad, dressing],
Topic 4: [mango, strawberry, vanilla, banana, peanut],
Topic 5: [fish, roll, salmon, fresh, good]'''

# Generate and print the topic labels
print(get_topic_labels(
    topic_modeling_keywords=topic_modeling_keywords,
    keywords_examples=keywords_examples,
    labels_examples=labels_examples,
    model_id=model_id,
    access_token=hf_access_token,
))

You will now get the interpretable topic model labels for all 5 topics!
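
Topic models usually return keywords as lists rather than as a single string. The sketch below shows one way to build a string in the same "Topic N: [...]" layout used in the example above. The helper format_topic_keywords is hypothetical and not part of ConTextMining, and it only mirrors the layout shown here; whether other layouts are accepted is not covered by this page.

```python
def format_topic_keywords(topics):
    """Join per-topic keyword lists into one 'Topic N: [kw, ...]' string."""
    lines = [
        f"Topic {number}: [{', '.join(words)}]"
        for number, words in enumerate(topics, start=1)
    ]
    return ",\n".join(lines)


if __name__ == "__main__":
    topics = [
        ["spinach", "carrots", "green", "salad", "dressing"],
        ["fish", "roll", "salmon", "fresh", "good"],
    ]
    print(format_topic_keywords(topics))
```

The resulting string can be passed as topic_modeling_keywords.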

Documentation

ConTextMining.get_topic_labels(*, topic_modeling_keywords, labels_examples, keywords_examples, model_id, access_token)
  • topic_modeling_keywords (str, required): keywords from the output of a topic model (the keywords representing each cluster) for ConTextMining to label.

  • keywords_examples (list, required): list of strings containing topic modeling keywords, which serve as training examples for few-shot learning.

  • labels_examples (list, required): list of strings containing the labels, each CORRESPONDING BY INDEX to the entry in keywords_examples above.

  • model_id (str, required): huggingface model_id of choice. For now, it is a choice between "microsoft/Phi-3-mini-4k-instruct", "meta-llama/Meta-Llama-3.1-8B-Instruct", or "google/gemma-2-2b-it". Defaults to "google/gemma-2-2b-it".

  • access_token (str, required): huggingface access token. To learn how to obtain one, refer to huggingface.co/docs/hub/en/security-tokens. Defaults to None.
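
Because keywords_examples and labels_examples are matched by index, a mismatch in length is an easy mistake to make. The sketch below checks the arguments before any model is loaded. The helper check_inputs is hypothetical and not part of ConTextMining; the list of model ids is copied from the description above.

```python
# Model ids copied from the model_id description above.
SUPPORTED_MODEL_IDS = (
    "microsoft/Phi-3-mini-4k-instruct",
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "google/gemma-2-2b-it",
)


def check_inputs(keywords_examples, labels_examples, model_id):
    """Raise ValueError if the few-shot examples or the model id look inconsistent."""
    if len(keywords_examples) != len(labels_examples):
        raise ValueError(
            f"Got {len(keywords_examples)} keyword examples "
            f"but {len(labels_examples)} labels; they must pair up by index."
        )
    for item in list(keywords_examples) + list(labels_examples):
        if not isinstance(item, str):
            raise ValueError(f"Expected a string, got {type(item).__name__}.")
    if model_id not in SUPPORTED_MODEL_IDS:
        raise ValueError(f"Unsupported model_id: {model_id!r}")


if __name__ == "__main__":
    check_inputs(
        ["olympic, year, said, games, team"],
        ["sports"],
        "google/gemma-2-2b-it",
    )
    print("Inputs look consistent.")
```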

Citation

C. Alba, "ConText Mining: Complementing topic models with few-shot in-context learning to generate interpretable topics," forthcoming at the IEEE Symposium Series on Computational Intelligence.

Questions?

Contact me at alba@wustl.edu
