FASTopic
FASTopic is a fast, adaptive, stable, and transferable topic modeling package. It uses pretrained Transformers to produce document embeddings and discovers latent topics through optimal transport between document, topic, and word embeddings. The result is a neat and efficient topic modeling paradigm that differs from traditional probabilistic, VAE-based, and clustering-based models.
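To make the optimal transport idea more concrete, here is a minimal sketch of entropic-regularized transport (Sinkhorn iterations) between two small sets of embeddings. It is an illustration only and should not be read as FASTopic's implementation: the function name `sinkhorn_plan`, the squared Euclidean cost, the uniform marginals, the value of `epsilon`, and the random toy embeddings are all assumptions.

```python
import numpy as np

def sinkhorn_plan(x, y, epsilon=1.0, n_iters=200):
    """Return an entropic-OT transport plan between rows of x and rows of y."""
    # Pairwise squared Euclidean distances serve as the transport cost (assumption).
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    kernel = np.exp(-cost / epsilon)
    # Uniform mass on both sides (assumption).
    a = np.full(x.shape[0], 1.0 / x.shape[0])
    b = np.full(y.shape[0], 1.0 / y.shape[0])
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)  # rescale to match column marginals
        u = a / (kernel @ v)    # rescale to match row marginals
    return u[:, None] * kernel * v[None, :]

# Toy stand-ins for document and topic embeddings (hypothetical data).
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(6, 4))
topic_embeddings = rng.normal(size=(3, 4))
plan = sinkhorn_plan(doc_embeddings, topic_embeddings)
```

Each row of `plan` describes how one document's mass is spread over the topics. Because the last update in the loop is the row rescaling, every row sums to the document mass 1/6, while the column sums approach 1/3 as the iterations converge.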
Installation
Install FASTopic with pip:
pip install fastopic
Alternatively, install FASTopic from source:
git clone https://github.com/bobxwu/FASTopic.git
cd FASTopic && python setup.py install
Quick Start
Discover topics from 20newsgroups.
from fastopic import FASTopic
from sklearn.datasets import fetch_20newsgroups
from topmost.preprocessing import Preprocessing
# Load the raw 20newsgroups documents without headers, footers, and quotes.
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
# Tokenize, remove English stopwords, and cap the vocabulary at 10,000 words.
preprocessing = Preprocessing(vocab_size=10000, stopwords='English')
# Train a model with 50 topics.
model = FASTopic(50, preprocessing)
topic_top_words, doc_topic_dist = model.fit_transform(docs)
topic_top_words is a list of the top words in each discovered topic.
doc_topic_dist holds the topic distributions of the documents (doc-topic distributions) as a numpy array of shape $N \times K$, where $N$ is the number of documents and $K$ is the number of topics.
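As a small illustration of working with an $N \times K$ doc-topic array, the sketch below assigns each document its highest-weighted topic and counts documents per topic. The array here is a hand-written stand-in with $N = 4$ and $K = 3$, not real model output, and the variable names `dominant_topic` and `docs_per_topic` are only illustrative.

```python
import numpy as np

# Hypothetical stand-in for the array returned by fit_transform:
# 4 documents (rows) and 3 topics (columns).
doc_topic_dist = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.5, 0.2],
    [0.6, 0.3, 0.1],
])

# Index of the highest-weighted topic for each document.
dominant_topic = doc_topic_dist.argmax(axis=1)

# Number of documents whose dominant topic is k, for each of the K topics.
docs_per_topic = np.bincount(dominant_topic, minlength=doc_topic_dist.shape[1])
```

With this toy array, documents 0 and 3 fall under topic 0, document 1 under topic 2, and document 2 under topic 1.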
Usage
1. Try FASTopic on your dataset
from fastopic import FASTopic
from topmost.preprocessing import Preprocessing
# Prepare your dataset.
your_dataset = [
'doc 1',
'doc 2', # ...
]
# Preprocess the dataset: tokenize docs, remove stopwords, set the maximum vocabulary size, etc.
# Pass your tokenizer as:
# preprocessing = Preprocessing(vocab_size=your_vocab_size, tokenizer=your_tokenizer, stopwords=your_stopwords_set)
preprocessing = Preprocessing(stopwords='English')
model = FASTopic(50, preprocessing)
topic_top_words, doc_topic_dist = model.fit_transform(your_dataset)
2. Topic activity over time
After training, we can compute the activity of each topic at each time slice.
topic_activity = model.topic_activity_over_time(time_slices)
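One way to build intuition for topic activity is to aggregate doc-topic weights within each time slice. The sketch below computes the mean topic weight per slice with NumPy. It is not the package's implementation of topic_activity_over_time: the helper name `mean_topic_weight_per_slice`, the toy arrays, and the assumption that time_slices holds one label per training document are all hypothetical.

```python
import numpy as np

def mean_topic_weight_per_slice(doc_topic_dist, time_slices):
    """Average the doc-topic weights of the documents in each time slice."""
    time_slices = np.asarray(time_slices)
    unique_slices = np.unique(time_slices)  # sorted slice labels
    # One column per time slice, one row per topic: shape K x T.
    activity = np.stack(
        [doc_topic_dist[time_slices == t].mean(axis=0) for t in unique_slices],
        axis=1,
    )
    return unique_slices, activity

# Toy data (hypothetical): 3 documents, 2 topics, one year label per document.
doc_topic_dist = np.array([
    [0.8, 0.2],
    [0.6, 0.4],
    [0.1, 0.9],
])
time_slices = [2020, 2020, 2021]
slices, activity = mean_topic_weight_per_slice(doc_topic_dist, time_slices)
```

In this toy example topic 0 dominates the 2020 slice (mean weight 0.7) and topic 1 dominates the 2021 slice (mean weight 0.9).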
Citation
If you use our package, please cite:
@article{wu2024fastopic,
title={FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm},
author={Wu, Xiaobao and Nguyen, Thong and Zhang, Delvin Ce and Wang, William Yang and Luu, Anh Tuan},
journal={arXiv preprint arXiv:2405.17978},
year={2024}
}