FASTopic

FASTopic is a fast, adaptive, stable, and transferable topic modeling package. It uses pretrained Transformers to produce document embeddings and discovers latent topics through optimal transport between document, topic, and word embeddings. The result is a neat and efficient topic modeling paradigm that differs from traditional probabilistic, VAE-based, and clustering-based models.

Installation

Install FASTopic with pip:

pip install fastopic

Otherwise, install FASTopic from the source:

git clone https://github.com/bobxwu/FASTopic.git
cd FASTopic && python setup.py install

Quick Start

Discover topics from 20newsgroups.

from fastopic import FASTopic
from sklearn.datasets import fetch_20newsgroups
from topmost.preprocessing import Preprocessing

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

preprocessing = Preprocessing(vocab_size=10000)

model = FASTopic(50, preprocessing)
topic_top_words, doc_topic_dist = model.fit_transform(docs)

topic_top_words is a list of the top words in discovered topics. doc_topic_dist is the topic distributions of documents (doc-topic distributions), a numpy array with shape $N \times K$ (number of documents $N$ and number of topics $K$).
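
As a small illustration of how the $N \times K$ array might be consumed, the sketch below uses a made-up 3 x 2 array in place of a real doc_topic_dist and picks the highest-weight topic for each document with NumPy. The values and the names dominant_topic and topic_sizes are assumptions for illustration, not FASTopic output or API.

```python
import numpy as np

# Hypothetical stand-in for doc_topic_dist: 3 documents (N) x 2 topics (K).
# Real values would come from model.fit_transform(docs).
doc_topic_dist = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
])

# Index of the highest-weight topic for each document.
dominant_topic = doc_topic_dist.argmax(axis=1)

# Number of documents whose dominant topic is each topic.
topic_sizes = np.bincount(dominant_topic, minlength=doc_topic_dist.shape[1])
```

Each row is one document, so reductions over axis=1 give per-document summaries, while reductions over axis=0 give per-topic summaries.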

Usage

1. Try FASTopic on your dataset

from fastopic import FASTopic
from topmost.preprocessing import Preprocessing

# Prepare your dataset.
your_dataset = [
    'doc 1',
    'doc 2', # ...
]

# Preprocess the dataset.
# This step tokenizes docs, removes stopwords, sets the max vocabulary size, etc.
# To use your own tokenizer: preprocessing = Preprocessing(vocab_size=5000, tokenizer=your_tokenizer)
preprocessing = Preprocessing(vocab_size=10000)

# Train a model with 50 topics on your dataset.
model = FASTopic(50, preprocessing)
topic_top_words, doc_topic_dist = model.fit_transform(your_dataset)

2. Topic activity over time

After training, we can compute the activity of each topic at each time slice.

topic_activity = model.topic_activity_over_time(time_slices)
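
The expected format of time_slices is not spelled out above. One plausible reading is one integer time-slice index per training document, in the same order as the documents. Under that assumption, a rough way to picture "activity" is the average weight of each topic over the documents in each slice. The NumPy sketch below illustrates that idea on made-up data. The helper mean_topic_weight_per_slice is hypothetical and is not FASTopic's implementation, which may compute activity differently.

```python
import numpy as np

def mean_topic_weight_per_slice(doc_topic_dist, time_slices):
    """Average topic weights over the documents in each time slice.

    Hypothetical helper for illustration only; not part of FASTopic.
    Returns an array of shape (K, T): topics x distinct time slices.
    """
    doc_topic_dist = np.asarray(doc_topic_dist, dtype=float)
    time_slices = np.asarray(time_slices)
    unique_slices = np.unique(time_slices)  # sorted distinct slice ids
    activity = np.zeros((doc_topic_dist.shape[1], len(unique_slices)))
    for col, t in enumerate(unique_slices):
        # Mean topic distribution of the documents that fall in slice t.
        activity[:, col] = doc_topic_dist[time_slices == t].mean(axis=0)
    return activity

# Made-up data: 4 documents, 2 topics, 2 time slices.
doc_topic_dist = np.array([
    [0.9, 0.1],
    [0.7, 0.3],
    [0.2, 0.8],
    [0.4, 0.6],
])
time_slices = [0, 0, 1, 1]  # assumed: one slice index per document

activity = mean_topic_weight_per_slice(doc_topic_dist, time_slices)
```

In this toy data, the first topic dominates the earlier slice and the second topic dominates the later one, which is the kind of trend a topic-activity view is meant to surface.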

Citation

If you use our package, please cite it as:

@article{wu2024fastopic,
    title={FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm},
    author={Wu, Xiaobao and Nguyen, Thong and Zhang, Delvin Ce and Wang, William Yang and Luu, Anh Tuan},
    journal={arXiv preprint arXiv:2405.17978},
    year={2024}
}
