TopMost: A Topic Modeling System Toolkit

TopMost covers the complete topic modeling lifecycle, including datasets, preprocessing, models, training, and evaluation. It supports the most popular topic modeling scenarios, such as basic, dynamic, hierarchical, and cross-lingual topic modeling.

See our survey paper on neural topic models, accepted to Artificial Intelligence Review: A Survey on Neural Topic Models: Methods, Applications, and Challenges.

If you use TopMost, please cite:
@inproceedings{wu2023topmost,
    title = "Towards the {T}op{M}ost: A Topic Modeling System Toolkit",
    author = "Wu, Xiaobao  and Pan, Fengjun  and Luu, Anh Tuan",
    editor = "Cao, Yixin  and Feng, Yang  and Xiong, Deyi",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-demos.4",
    pages = "31--41"
}

@article{wu2023survey,
    title={A Survey on Neural Topic Models: Methods, Applications, and Challenges},
    author={Wu, Xiaobao and Nguyen, Thong and Luu, Anh Tuan},
    journal={Artificial Intelligence Review},
    url={https://doi.org/10.1007/s10462-023-10661-7},
    year={2024},
    publisher={Springer}
}

Overview

TopMost offers the following topic modeling scenarios with models, evaluation metrics, and datasets:

Figure: TopMost architecture (https://github.com/BobXWu/TopMost/raw/main/docs/source/_static/architecture.svg)

Basic Topic Modeling

  • Evaluation metrics: TC, TD, Clustering, Classification

  • Datasets: 20NG, IMDB, NeurIPS, ACL, NYT, Wikitext-103

Hierarchical Topic Modeling

  • Evaluation metrics: TC over levels, TD over levels, Clustering over levels, Classification over levels

  • Datasets: 20NG, IMDB, NeurIPS, ACL, NYT, Wikitext-103

Dynamic Topic Modeling

  • Evaluation metrics: TC over time slices, TD over time slices, Clustering, Classification

  • Datasets: NeurIPS, ACL, NYT

Cross-lingual Topic Modeling

  • Evaluation metrics: TC (CNPMI), TD over languages, Classification (Intra and Cross-lingual)

  • Datasets: ECNews, Amazon Review Rakuten

Quick Start

Install TopMost

Install TopMost with pip:

$ pip install topmost

The example below uses FASTopic to obtain the top words of the discovered topics (top_words) and the topic distributions of the documents (doc_topic_dist). The preprocessing steps are configurable; see the documentation.

from topmost import RawDataset, Preprocess, FASTopicTrainer
from sklearn.datasets import fetch_20newsgroups

# load the raw 20 Newsgroups documents
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
# limit the vocabulary to 10,000 words
preprocess = Preprocess(vocab_size=10000)

device = "cuda" # or "cpu"
dataset = RawDataset(docs, preprocess, device=device)

# train FASTopic on the preprocessed documents
trainer = FASTopicTrainer(dataset, verbose=True)
top_words, doc_topic_dist = trainer.train()

new_docs = [
    "This is a document about space, including words like space, satellite, launch, orbit.",
    "This is a document about Microsoft Windows, including words like windows, files, dos."
]

new_theta = trainer.test(new_docs)
print(new_theta.argmax(1))
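
The final argmax call picks, for each document, the index of its most probable topic. The sketch below illustrates that step on toy, hand-written values instead of real training output. The helper name dominant_topics is hypothetical, and the format of top_words (one space-separated string per topic) is an assumption made for this illustration.

```python
def dominant_topics(doc_topic_dist, top_words):
    """Return (topic index, probability, top words of that topic) per document."""
    results = []
    for row in doc_topic_dist:
        # argmax over the topic axis: index of the largest probability
        topic = max(range(len(row)), key=lambda k: row[k])
        results.append((topic, row[topic], top_words[topic]))
    return results


# Toy values standing in for real output; not produced by TopMost.
toy_theta = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
]
toy_top_words = ["space orbit launch", "windows files dos", "game team season"]

for doc_id, (topic, prob, words) in enumerate(dominant_topics(toy_theta, toy_top_words)):
    print(f"doc {doc_id}: topic {topic} (p={prob:.2f}) -> {words}")
```

Pairing the topic index with that topic's top words makes the prediction easier to read than the bare index.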

Usage

Download a preprocessed dataset

import topmost

topmost.download_dataset('20NG', cache_path='./datasets')

Train a model

import topmost

device = "cuda" # or "cpu"

# load a preprocessed dataset
dataset = topmost.BasicDataset("./datasets/20NG", device=device, read_labels=True)
# create a model
model = topmost.ProdLDA(dataset.vocab_size)
model = model.to(device)

# create a trainer
trainer = topmost.BasicTrainer(model, dataset)

# train the model
top_words, train_theta = trainer.train()

Evaluate

from topmost import eva

# topic diversity and coherence
TD = eva._diversity(top_words)
TC = eva._coherence(dataset.train_texts, dataset.vocab, top_words)

# get doc-topic distributions of testing samples
test_theta = trainer.test(dataset.test_data)
# clustering
clustering_results = eva._clustering(test_theta, dataset.test_labels)
# classification
cls_results = eva._cls(train_theta, test_theta, dataset.train_labels, dataset.test_labels)
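
To give an intuition for what a topic diversity (TD) score measures, here is a minimal sketch of one common definition: the fraction of distinct words among all topics' top words. This is an assumption for illustration only. The function topic_diversity is hypothetical and is not TopMost's eva._diversity, whose exact formula may differ. It also assumes top_words is a list with one space-separated string per topic.

```python
def topic_diversity(top_words):
    """Fraction of distinct words among all topics' top words (1.0 = no overlap)."""
    all_words = [word for topic in top_words for word in topic.split()]
    if not all_words:
        raise ValueError("top_words must contain at least one word")
    return len(set(all_words)) / len(all_words)


# Toy topics: "space" appears in both, so 7 of the 8 words are distinct.
toy_top_words = [
    "space orbit launch satellite",
    "windows files dos space",
]
print(topic_diversity(toy_top_words))  # 0.875
```

A score near 1.0 means the topics share few top words. A low score suggests redundant topics.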

Test new documents

import torch
from topmost import Preprocess

new_docs = [
    "This is a new document about space, including words like space, satellite, launch, orbit.",
    "This is a new document about Microsoft Windows, including words like windows, files, dos."
]

preprocess = Preprocess()
new_parsed_docs, new_bow = preprocess.parse(new_docs, vocab=dataset.vocab)
new_theta = trainer.test(torch.as_tensor(new_bow.toarray(), device=device).float())
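
Passing vocab=dataset.vocab matters because the trained model expects each bag-of-words column to correspond to the same word it saw during training. The sketch below shows the idea with a simplified, hypothetical helper (bow_with_fixed_vocab). It is not how Preprocess.parse is implemented, and the tokenization here (lowercased alphabetic runs) is an assumption.

```python
import re
from collections import Counter


def bow_with_fixed_vocab(docs, vocab):
    """Count each vocabulary word per document; words outside the vocabulary are ignored."""
    index = {word: i for i, word in enumerate(vocab)}
    rows = []
    for doc in docs:
        counts = Counter(re.findall(r"[a-z]+", doc.lower()))
        row = [0] * len(vocab)
        for word, count in counts.items():
            if word in index:
                row[index[word]] = count
        rows.append(row)
    return rows


# Toy vocabulary and documents, for illustration only.
toy_vocab = ["space", "satellite", "windows", "files"]
toy_docs = [
    "Space, satellite, launch, orbit: all about space.",
    "Windows files and more Windows.",
]
print(bow_with_fixed_vocab(toy_docs, toy_vocab))  # [[2, 1, 0, 0], [0, 0, 2, 1]]
```

Words such as "launch" and "orbit" are dropped because they are not in the toy vocabulary, so every row has exactly one column per vocabulary word.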

Installation

Stable release

To install TopMost, run this command in the terminal:

$ pip install topmost

This is the preferred method to install TopMost, as it will always install the most recent stable release.

From sources

The source code for TopMost is available in the GitHub repository. To install directly from it, run:

$ pip install git+https://github.com/bobxwu/TopMost.git
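
After either installation method, one way to confirm which version is installed is to query the package metadata from Python. This is a small sketch using only the standard library (Python 3.8+). The helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the installed TopMost version, or None if TopMost is not installed.
print(installed_version("topmost"))
```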

Tutorials

We provide tutorials for different usages:

  • Quickstart

  • How to preprocess datasets

  • How to train and evaluate a basic topic model

  • How to train and evaluate a hierarchical topic model

  • How to train and evaluate a dynamic topic model

  • How to train and evaluate a cross-lingual topic model

Disclaimer

This library includes some datasets for demonstration. If you are a dataset owner who wants to exclude your dataset from this library, please contact Xiaobao Wu.

Authors

Xiaobao Wu

Fengjun Pan

Acknowledgments

  • Icon by Flat-icons-com.

  • If you want to add any models to this package, we welcome your pull requests.

  • If you encounter any problems, please contact Xiaobao Wu directly or open an issue in the GitHub repo.
