Complementing topic models with few-shot in-context learning to generate interpretable topics
Project description
ConTextMining is a package that generates interpretable topic labels from the keywords of topic models (e.g., LDA, BERTopic) through few-shot in-context learning.
Requirements
Required packages
The following packages are required for ConTextMining:
- torch (to learn how to install, please refer to pytorch.org/)
- transformers
- tokenizers
- huggingface-hub
- flash_attn
- accelerate
To install these packages, you can do the following:
pip install torch transformers tokenizers huggingface-hub flash_attn accelerate
GPU requirements
You need at least one GPU to use ConTextMining. VRAM requirements depend on factors such as the number of keywords and the number of topics for which you wish to generate labels; at least 8GB of VRAM is recommended.
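Before running anything, you can check whether a suitable GPU is visible. Below is a minimal sketch; the `vram_ok` helper is our own illustration, not part of ConTextMining, while the `torch.cuda` calls are standard PyTorch APIs, guarded in case PyTorch is not yet installed.

```python
def vram_ok(total_bytes: int, min_gib: float = 8.0) -> bool:
    """Return True if a GPU's total memory meets the recommended 8 GiB."""
    return total_bytes / (1024 ** 3) >= min_gib

try:
    import torch
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        print(f"GPU: {props.name}, enough VRAM: {vram_ok(props.total_memory)}")
    else:
        print("No CUDA GPU visible; ConTextMining requires at least one GPU.")
except ImportError:
    print("PyTorch is not installed; see the Requirements section above.")
```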
Huggingface access token
You will need a Huggingface access token. To obtain one:
- Create a Huggingface account if you do not already have one.
- Create and store a new access token. To learn more, please refer to huggingface.co/docs/hub/en/security-tokens.
- Note: some pre-trained large language models (LLMs) may require permissions. For more information, please refer to huggingface.co/docs/hub/en/models-gated.
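Rather than hard-coding the token in scripts, you can read it from an environment variable. A small sketch, assuming the conventional variable name "HF_TOKEN" (this name is our assumption, not something ConTextMining requires):

```python
# Sketch: keep the Huggingface access token out of source code by reading it
# from an environment variable. "HF_TOKEN" is a common convention and an
# assumption here, not a name ConTextMining mandates.
import os

def load_hf_token(var: str = "HF_TOKEN") -> str:
    token = os.environ.get(var)
    if not token:
        raise RuntimeError(f"Set {var} to your Huggingface access token.")
    return token
```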
Installation
To install in Python, simply run:
pip install ConTextMining
Quick Start
Here we provide a quick example of how you can use ConTextMining to conveniently generate interpretable topic labels from the keywords of topic models.
from ConTextMining import get_topic_labels
# specify your huggingface access token. To learn how to obtain one, refer to huggingface.co/docs/hub/en/security-tokens
hf_access_token="<your huggingface access token>"
# specify the huggingface model id. Choose between "microsoft/Phi-3-mini-4k-instruct", "meta-llama/Meta-Llama-3.1-8B-Instruct" or "google/gemma-2-2b-it"
model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"
# specify the keywords for the few-shot learning examples
keywords_examples = [
"olympic, year, said, games, team",
"mr, bush, president, white, house",
"said, report, evidence, findings, defense",
"french, union, germany, workers, paris",
"japanese, year, tokyo, matsui, said"
]
# specify the labels CORRESPONDING TO THE INDEX of the keywords of 'keywords_examples' above.
labels_examples = [
"sports",
"politics",
"research",
"france",
"japan"
]
# specify the topic modeling keywords for which you wish to generate coherent topic labels.
topic_modeling_keywords ='''Topic 1: [amazing, really, place, phenomenon, pleasant],
Topic 2: [loud, awful, sunday, like, slow],
Topic 3: [spinach, carrots, green, salad, dressing],
Topic 4: [mango, strawberry, vanilla, banana, peanut],
Topic 5: [fish, roll, salmon, fresh, good]'''
print(get_topic_labels(topic_modeling_keywords=topic_modeling_keywords, keywords_examples=keywords_examples, labels_examples=labels_examples, model_id=model_id, access_token=hf_access_token))
You will now get the interpretable topic model labels for all 5 topics!
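Under the hood, few-shot in-context learning works by prepending labeled examples to the prompt so the LLM completes the final "Label:" line. The sketch below illustrates the general technique only; it is not ConTextMining's actual prompt or internal implementation.

```python
# Illustration of few-shot in-context labeling: example keyword/label pairs
# are concatenated into a prompt, and the LLM completes the final "Label:".
# This is a generic sketch, not ConTextMining's internal prompt.

def build_fewshot_prompt(keywords_examples, labels_examples, target_keywords):
    lines = ["Assign a short, interpretable label to each set of topic keywords."]
    for kw, label in zip(keywords_examples, labels_examples):
        lines.append(f"Keywords: {kw}\nLabel: {label}")
    lines.append(f"Keywords: {target_keywords}\nLabel:")
    return "\n\n".join(lines)

prompt = build_fewshot_prompt(
    ["olympic, year, said, games, team", "mr, bush, president, white, house"],
    ["sports", "politics"],
    "fish, roll, salmon, fresh, good",
)
print(prompt)
```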
Documentation
ConTextMining.get_topic_labels(*, topic_modeling_keywords, keywords_examples, labels_examples, model_id, access_token)
- topic_modeling_keywords (str, required): keywords stemming from the outputs of topic models (keywords representing each cluster) for ConTextMining to label.
- keywords_examples (list, required): list of strings containing topic modeling keywords which serve as training examples for few-shot learning.
- labels_examples (list, required): list of strings containing the labels CORRESPONDING TO THE INDEX of the keywords of keywords_examples above.
- model_id (str, required): Huggingface model_id of choice. For now, it is a choice between "microsoft/Phi-3-mini-4k-instruct", "meta-llama/Meta-Llama-3.1-8B-Instruct", or "google/gemma-2-2b-it". Defaults to "google/gemma-2-2b-it".
- access_token (str, required): Huggingface access token. To learn how to obtain one, refer to huggingface.co/docs/hub/en/security-tokens. Defaults to None.
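Since topic_modeling_keywords is a single string, output from a topic model has to be formatted first. A hypothetical helper (format_topic_keywords is our own, not part of the package) that produces the "Topic n: [...]" format used in the Quick Start example:

```python
# Hypothetical helper: format per-topic keyword lists (e.g. top words from
# LDA or BERTopic) into the "Topic n: [...]" string shown in the Quick Start.

def format_topic_keywords(topics):
    """topics: list of keyword lists, one inner list per topic."""
    return ",\n".join(
        f"Topic {i}: [{', '.join(words)}]" for i, words in enumerate(topics, start=1)
    )

formatted = format_topic_keywords([
    ["fish", "roll", "salmon"],
    ["mango", "banana"],
])
print(formatted)
```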
Citation
C. Alba, "ConText Mining: Complementing topic models with few-shot in-context learning to generate interpretable topics," working paper.
Questions?
Contact me at alba@wustl.edu