FASTopic

FASTopic is a fast, adaptive, stable, and transferable topic modeling package. It uses pretrained Transformers to produce document embeddings and discovers latent topics through optimal transport between document, topic, and word embeddings. The result is a simple and efficient topic modeling paradigm that differs from traditional probabilistic, VAE-based, and clustering-based models.

Installation

Install FASTopic with pip:

pip install fastopic

Alternatively, install FASTopic from source:

git clone https://github.com/bobxwu/FASTopic.git
cd FASTopic && python setup.py install

Quick Start

Discover topics from the 20 Newsgroups dataset:

from fastopic import FASTopic
from sklearn.datasets import fetch_20newsgroups
from topmost.preprocessing import Preprocessing

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

preprocessing = Preprocessing(vocab_size=10000)

model = FASTopic(50, preprocessing)
topic_top_words, doc_topic_dist = model.fit_transform(docs)

topic_top_words is a list of the top words of each discovered topic. doc_topic_dist contains the topic distributions of the documents (doc-topic distributions) as a NumPy array of shape $N \times K$, where $N$ is the number of documents and $K$ is the number of topics.
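
As a small illustration of how an $N \times K$ doc-topic array might be inspected, the sketch below uses a hand-made array in place of a real doc_topic_dist. The values and the names dominant_topic and topic_prevalence are assumptions made for this example; they are not FASTopic output or part of its API.

```python
import numpy as np

# Stand-in for doc_topic_dist: 3 documents (N=3) and 4 topics (K=4).
# The numbers are made up for illustration; real rows come from fit_transform.
doc_topic_dist = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.15, 0.60, 0.20],
    [0.25, 0.25, 0.10, 0.40],
])

# Index of the highest-weight topic for each document.
dominant_topic = doc_topic_dist.argmax(axis=1)

# Average weight of each topic across all documents.
topic_prevalence = doc_topic_dist.mean(axis=0)

print(dominant_topic)
print(topic_prevalence)
```

With a trained model, the same two lines could be applied to the array returned by fit_transform, and each index in dominant_topic could then be looked up in topic_top_words.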

Usage

1. Try FASTopic on your dataset

from fastopic import FASTopic
from topmost.preprocessing import Preprocessing

# Prepare your dataset.
your_dataset = [
    'doc 1',
    'doc 2', # ...
]

# Preprocess the dataset.
# This step tokenizes the docs, removes stopwords, caps the vocabulary size, etc.
# To use your own tokenizer, pass it as:
# preprocessing = Preprocessing(vocab_size=5000, tokenizer=your_tokenizer)
preprocessing = Preprocessing(vocab_size=10000)

model = FASTopic(50, preprocessing)
# Fit on the dataset defined above.
topic_top_words, doc_topic_dist = model.fit_transform(your_dataset)
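
The tokenizer option mentioned in the comments above is not specified further here. The sketch below assumes it accepts a plain callable that maps one document string to a list of token strings; that signature and the name simple_tokenizer are assumptions, so check the topmost documentation before relying on them.

```python
import re

def simple_tokenizer(text):
    """Lowercase the text and keep alphabetic tokens of at least 3 letters."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if len(tok) >= 3]

print(simple_tokenizer("Topic models, in 2024, are FAST!"))
# ['topic', 'models', 'are', 'fast']

# Assumed usage (requires topmost; the callable signature is an assumption):
# preprocessing = Preprocessing(vocab_size=5000, tokenizer=simple_tokenizer)
```

Note that a custom tokenizer like this one does no stopword removal of its own, so words such as "are" pass through unless they are filtered elsewhere.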

2. Topic activity over time

After training, we can compute the activity of each topic at each time slice.

topic_activity = model.topic_activity_over_time(time_slices)
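
The format of time_slices is not described here. A minimal sketch, assuming it is a list with one integer slice label per training document, in the same order as the documents; the doc_years data and the year_to_slice mapping are hypothetical.

```python
# Hypothetical publication years, one per training document.
doc_years = [2019, 2021, 2019, 2020, 2021]

# Map each distinct year to a consecutive integer slice index (earliest = 0).
year_to_slice = {year: idx for idx, year in enumerate(sorted(set(doc_years)))}

# One slice label per document, aligned with the document order.
time_slices = [year_to_slice[year] for year in doc_years]

print(time_slices)  # [0, 2, 0, 1, 2]
```

Under that assumption, the resulting list would be the argument passed to model.topic_activity_over_time.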

Citation

If you use our package, please cite it as:

@article{wu2024fastopic,
    title={FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm},
    author={Wu, Xiaobao and Nguyen, Thong and Zhang, Delvin Ce and Wang, William Yang and Luu, Anh Tuan},
    journal={arXiv preprint arXiv:2405.17978},
    year={2024}
}
