FASTopic

FASTopic is a fast, adaptive, stable, and transferable topic modeling package. It leverages pretrained Transformers to produce document embeddings, and discovers latent topics through the optimal transport between document, topic, and word embeddings. This brings about a neat and efficient topic modeling paradigm, different from traditional probabilistic, VAE-based, and clustering-based models.
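To give a feel for what "optimal transport between embeddings" means, here is a minimal sketch of generic entropic optimal transport solved with Sinkhorn iterations. It is not FASTopic's internal implementation, which may use a different formulation. The function name `sinkhorn_plan`, the random embeddings, the uniform masses, and the values of `epsilon` and `num_iters` are all assumptions chosen for illustration.

```python
import numpy as np

def sinkhorn_plan(cost, row_mass, col_mass, epsilon=0.5, num_iters=200):
    """Entropic-regularized transport plan via Sinkhorn iterations (illustrative only)."""
    kernel = np.exp(-cost / epsilon)          # Gibbs kernel of the cost matrix
    u = np.ones_like(row_mass)
    for _ in range(num_iters):
        v = col_mass / (kernel.T @ u)         # rescale to match column marginals
        u = row_mass / (kernel @ v)           # rescale to match row marginals
    return u[:, None] * kernel * v[None, :]

# Hypothetical embeddings: 6 documents and 3 topics in a 4-dimensional space.
rng = np.random.default_rng(0)
doc_emb = rng.normal(size=(6, 4))
topic_emb = rng.normal(size=(3, 4))

# Squared Euclidean distance between every document and every topic, scaled to [0, 1].
cost = ((doc_emb[:, None, :] - topic_emb[None, :, :]) ** 2).sum(axis=-1)
cost = cost / cost.max()

# Assumed uniform mass on documents and on topics.
row_mass = np.full(6, 1 / 6)
col_mass = np.full(3, 1 / 3)

plan = sinkhorn_plan(cost, row_mass, col_mass)
print(plan.shape)  # (6, 3)
```

Each entry `plan[i, k]` is the mass moved from document `i` to topic `k`. Normalizing a row gives a soft assignment of that document over the topics.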

Installation

Install FASTopic with pip:

pip install fastopic

Alternatively, install FASTopic from source:

git clone https://github.com/bobxwu/FASTopic.git
cd FASTopic && python setup.py install

Quick Start

Discover topics from the 20 Newsgroups dataset:

from fastopic import FASTopic
from sklearn.datasets import fetch_20newsgroups
from topmost.preprocessing import Preprocessing

# Load the raw documents, dropping headers, footers, and quoted replies.
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

# Tokenize, remove English stopwords, and cap the vocabulary at 10,000 words.
preprocessing = Preprocessing(vocab_size=10000, stopwords='English')

# Train a model with 50 topics.
model = FASTopic(50, preprocessing)
topic_top_words, doc_topic_dist = model.fit_transform(docs)

topic_top_words is a list of the top words in discovered topics. doc_topic_dist is the topic distributions of documents (doc-topic distributions), a numpy array with shape $N \times K$ (number of documents $N$ and number of topics $K$).
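As a minimal sketch of how the $N \times K$ array can be read, the example below uses a small hand-written array as a stand-in for `doc_topic_dist` (the values are made up, and each row is assumed to sum to 1). It finds the dominant topic of each document and counts documents per topic with plain NumPy.

```python
import numpy as np

# Hypothetical stand-in for the doc_topic_dist returned by fit_transform:
# N = 4 documents, K = 3 topics; each row is assumed to sum to 1.
doc_topic_dist = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.1, 0.3],
])

# Index of the highest-weight topic for each document.
dominant_topic = doc_topic_dist.argmax(axis=1)

# Number of documents whose dominant topic is k, for every topic k.
docs_per_topic = np.bincount(dominant_topic, minlength=doc_topic_dist.shape[1])

print(dominant_topic)   # [0 1 2 0]
print(docs_per_topic)   # [2 1 1]
```

The same two lines apply unchanged to the array returned by `fit_transform`.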

Usage

1. Try FASTopic on your dataset

from fastopic import FASTopic
from topmost.preprocessing import Preprocessing

# Prepare your dataset.
your_dataset = [
    'doc 1',
    'doc 2', # ...
]

# Preprocess the dataset. This step tokenizes the docs, removes stopwords, and caps the vocabulary size.
# To use your own tokenizer and stopwords, pass them as:
#   preprocessing = Preprocessing(vocab_size=your_vocab_size, tokenizer=your_tokenizer, stopwords=your_stopwords_set)
preprocessing = Preprocessing(stopwords='English')

# Train a model with 50 topics on your dataset.
model = FASTopic(50, preprocessing)
topic_top_words, doc_topic_dist = model.fit_transform(your_dataset)

2. Topic activity over time

After training, we can compute the activity of each topic at each time slice.

topic_activity = model.topic_activity_over_time(time_slices)
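The README does not spell out the format of `time_slices` or how the activity is computed. As a rough sketch only, assume `time_slices` holds one integer time-slice label per training document, and take "activity" to mean the average topic weight of the documents in each slice. The helper `mean_topic_weight_per_slice` below is hypothetical and is not part of FASTopic; the library's own computation may differ, for example by weighting topics.

```python
import numpy as np

# Hypothetical doc-topic distributions: 4 documents, 2 topics.
doc_topic_dist = np.array([
    [0.8, 0.2],
    [0.6, 0.4],
    [0.1, 0.9],
    [0.3, 0.7],
])

# Assumed format: one time-slice label per document, aligned with the rows above.
time_slices = np.array([0, 0, 1, 1])

def mean_topic_weight_per_slice(doc_topic_dist, time_slices):
    """Average topic weight of the documents in each time slice (illustrative only)."""
    slices = np.unique(time_slices)
    activity = np.stack(
        [doc_topic_dist[time_slices == t].mean(axis=0) for t in slices],
        axis=1,
    )
    return slices, activity  # activity has shape (num_topics, num_slices)

slices, activity = mean_topic_weight_per_slice(doc_topic_dist, time_slices)
print(activity.shape)  # (2, 2)
```

Here row `k` of `activity` traces topic `k` across the sorted time slices, so it can be plotted directly as a line per topic.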

Citation

If you use our package, please cite:

@article{wu2024fastopic,
    title={FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm},
    author={Wu, Xiaobao and Nguyen, Thong and Zhang, Delvin Ce and Wang, William Yang and Luu, Anh Tuan},
    journal={arXiv preprint arXiv:2405.17978},
    year={2024}
}
