FASTopic
FASTopic is a fast, adaptive, stable, and transferable topic modeling package. It leverages pretrained Transformers to produce document embeddings, and discovers latent topics through the optimal transport between document, topic, and word embeddings. This brings about a neat and efficient topic modeling paradigm, different from traditional probabilistic, VAE-based, and clustering-based models.
Installation
Install FASTopic with pip:
pip install fastopic
Alternatively, install FASTopic from source:
git clone https://github.com/bobxwu/FASTopic.git
cd FASTopic && python setup.py install
Quick Start
Discover topics from 20newsgroups.
from fastopic import FASTopic
from sklearn.datasets import fetch_20newsgroups
from topmost.preprocessing import Preprocessing
# Load the raw 20newsgroups documents without headers, footers, and quoted replies.
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
# Cap the vocabulary at 10,000 words.
preprocessing = Preprocessing(vocab_size=10000)
# Train a model with 50 topics.
model = FASTopic(50, preprocessing)
topic_top_words, doc_topic_dist = model.fit_transform(docs)
topic_top_words is a list of the top words of each discovered topic.
doc_topic_dist contains the topic distributions of the documents (doc-topic distributions), a numpy array with shape $N \times K$ (number of documents $N$ and number of topics $K$).
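Because each row of the doc-topic distributions holds one document's weights over the $K$ topics, a common follow-up is to pick the highest-weight topic per document. The sketch below is only an illustration: the matrix values are made up, and the helper name dominant_topics is hypothetical, not part of FASTopic.

```python
# Hypothetical doc-topic distributions for N=3 documents and K=4 topics.
# In practice this would be the array returned by model.fit_transform(docs).
doc_topic_dist = [
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.15, 0.60, 0.20],
    [0.25, 0.40, 0.25, 0.10],
]


def dominant_topics(dist):
    """Return, for each document (row), the index of its highest-weight topic."""
    return [max(range(len(row)), key=row.__getitem__) for row in dist]


top_topic_per_doc = dominant_topics(doc_topic_dist)
print(top_topic_per_doc)  # [0, 2, 1]
```

With a real numpy array, doc_topic_dist.argmax(axis=1) gives the same result in one call.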
Usage
1. Try FASTopic on your dataset
from fastopic import FASTopic
from topmost.preprocessing import Preprocessing
# Prepare your dataset.
your_dataset = [
    'doc 1',
    'doc 2',  # ...
]
# Preprocess the dataset.
# This step tokenizes docs, removes stopwords, sets the maximum vocabulary size, etc.
# To use your own tokenizer: preprocessing = Preprocessing(vocab_size=5000, tokenizer=your_tokenizer)
preprocessing = Preprocessing(vocab_size=10000)
model = FASTopic(50, preprocessing)
topic_top_words, doc_topic_dist = model.fit_transform(your_dataset)
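The tokenizer= option mentioned in the comment above is not specified further here. As a minimal sketch, assuming the tokenizer is a plain callable that maps one document string to a list of tokens, your_tokenizer could look like the following. The stopword list and the regular expression are illustrative choices, and the expected callable signature should be checked against the topmost documentation.

```python
import re

# Hypothetical stopword list, kept tiny for illustration.
STOPWORDS = {"a", "an", "and", "of", "the", "to"}


def your_tokenizer(text):
    """Lowercase the text, extract runs of ASCII letters, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]


print(your_tokenizer("The transport of topics and words"))
# ['transport', 'topics', 'words']
```

A regular expression keeps the sketch dependency-free; a language-specific tokenizer would usually be a better choice for non-English corpora.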
2. Topic activity over time
After training, we can compute the activity of each topic at each time slice.
topic_activity = model.topic_activity_over_time(time_slices)
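The format of time_slices and the exact definition of activity are not spelled out here. The sketch below makes two assumptions: that time_slices holds one time label per document, and that activity can be approximated as the mean doc-topic weight within each slice. The helper mean_topic_activity is hypothetical and may differ from what topic_activity_over_time computes internally.

```python
from collections import defaultdict


def mean_topic_activity(doc_topic_dist, time_slices):
    """Average the topic weights of the documents that fall in each time slice.

    Illustrative definition only; FASTopic may compute activity differently.
    """
    if len(doc_topic_dist) != len(time_slices):
        raise ValueError("need exactly one time slice per document")
    # Group document rows by their time label.
    grouped = defaultdict(list)
    for row, t in zip(doc_topic_dist, time_slices):
        grouped[t].append(row)
    # For each slice, take the per-topic mean over its documents.
    activity = {}
    for t in sorted(grouped):
        rows = grouped[t]
        num_topics = len(rows[0])
        activity[t] = [sum(r[k] for r in rows) / len(rows) for k in range(num_topics)]
    return activity


# Hypothetical data: 4 documents, 2 topics, 2 time slices (for example, years).
doc_topic_dist = [[0.8, 0.2], [0.6, 0.4], [0.1, 0.9], [0.3, 0.7]]
time_slices = [2020, 2020, 2021, 2021]
activity = mean_topic_activity(doc_topic_dist, time_slices)
for t, weights in activity.items():
    print(t, [round(w, 3) for w in weights])
```

In this toy data the first topic dominates the earlier slice and the second topic the later one, which is the kind of trend such a summary is meant to surface.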
Citation
If you use our package, please cite it as:
@article{wu2024fastopic,
title={FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm},
author={Wu, Xiaobao and Nguyen, Thong and Zhang, Delvin Ce and Wang, William Yang and Luu, Anh Tuan},
journal={arXiv preprint arXiv:2405.17978},
year={2024}
}
File details
Details for the file fastopic-0.0.2-1-py3-none-any.whl.
File metadata
- Download URL: fastopic-0.0.2-1-py3-none-any.whl
- Upload date:
- Size: 14.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.9.19
File hashes
Algorithm | Hash digest
---|---
SHA256 | a4d06835290fb1ca8ba0e6ae47b6a702a2b58a15b05aa529940b229378410fa8
MD5 | d2149600257626225711b4f4f129e654
BLAKE2b-256 | a8414157201be0e47b6ed4047541813cbccbec65160622afc7d43755f79f6d15