TopMost: A Topic Modeling System Toolkit
TopMost covers the complete topic modeling lifecycle: datasets, preprocessing, models, training, and evaluation. It supports the most popular topic modeling scenarios: basic, dynamic, hierarchical, and cross-lingual topic modeling.
If you use TopMost, please cite:
@inproceedings{wu2024topmost,
title = "Towards the {T}op{M}ost: A Topic Modeling System Toolkit",
author = "Wu, Xiaobao and Pan, Fengjun and Luu, Anh Tuan",
editor = "Cao, Yixin and Feng, Yang and Xiong, Deyi",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-demos.4",
pages = "31--41"
}
@article{wu2024survey,
title={A Survey on Neural Topic Models: Methods, Applications, and Challenges},
author={Wu, Xiaobao and Nguyen, Thong and Luu, Anh Tuan},
journal={Artificial Intelligence Review},
url={https://doi.org/10.1007/s10462-023-10661-7},
year={2024},
publisher={Springer}
}
Overview
TopMost offers the following topic modeling scenarios with models, evaluation metrics, and datasets:
| Scenario | Model | Evaluation Metric | Datasets |
|---|---|---|---|
| Basic Topic Modeling | | TC, TD, Clustering, Classification | 20NG, IMDB, NeurIPS, ACL, NYT, Wikitext-103 |
| Hierarchical Topic Modeling | | TC over levels, TD over levels, Clustering over levels, Classification over levels | 20NG, IMDB, NeurIPS, ACL, NYT, Wikitext-103 |
| Dynamic Topic Modeling | | TC over time slices, TD over time slices, Clustering, Classification | NeurIPS, ACL, NYT |
| Cross-lingual Topic Modeling | | TC (CNPMI), TD over languages, Classification (Intra and Cross-lingual) | ECNews, Amazon Review, Rakuten |
Quick Start
Install TopMost
Install TopMost with pip:
$ pip install topmost
We use FASTopic to get the top words of the discovered topics (top_words) and the topic distributions of the documents (doc_topic_dist). The preprocessing steps are configurable; see our documentation.
from topmost import RawDataset, Preprocess, FASTopicTrainer
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
preprocess = Preprocess(vocab_size=10000)
dataset = RawDataset(docs, preprocess, device="cuda")
trainer = FASTopicTrainer(dataset, verbose=True)
top_words, doc_topic_dist = trainer.train()
new_docs = [
"This is a document about space, including words like space, satellite, launch, orbit.",
"This is a document about Microsoft Windows, including words like windows, files, dos."
]
new_theta = trainer.test(new_docs)
print(new_theta.argmax(1))
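The final line prints, for each new document, the index of its highest-weight topic. Here is a minimal NumPy sketch of that step, using a made-up matrix in place of the real new_theta (the values and the three-topic size are assumptions for illustration only):

```python
import numpy as np

# Hypothetical doc-topic matrix: 2 documents x 3 topics, each row sums to 1.
theta = np.array([
    [0.10, 0.70, 0.20],
    [0.55, 0.15, 0.30],
])

# Index of the dominant topic for each document (argmax along the topic axis).
dominant_topic = theta.argmax(axis=1)

# Weight of that dominant topic, usable as a rough confidence signal.
dominant_weight = theta.max(axis=1)

print(dominant_topic)
print(dominant_weight)
```

A low dominant weight suggests the document is spread over several topics, so its single-topic label should be read with care.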
Usage
Download a preprocessed dataset
import topmost
topmost.download_dataset('20NG', cache_path='./datasets')
Train a model
device = "cuda" # or "cpu"
# load a preprocessed dataset
dataset = topmost.BasicDataset("./datasets/20NG", device=device, read_labels=True)
# create a model
model = topmost.ProdLDA(dataset.vocab_size)
model = model.to(device)
# create a trainer
trainer = topmost.BasicTrainer(model, dataset)
# train the model
top_words, train_theta = trainer.train()
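The example above hard-codes device = "cuda", which fails on machines without a GPU. A small sketch of a fallback follows; the helper name pick_device is hypothetical and not part of TopMost, and it only relies on the standard torch.cuda.is_available() check.

```python
import importlib.util


def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    # If PyTorch is not importable at all, fall back to the CPU label.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

The resulting string can be passed wherever the examples use the device variable.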
Evaluate
from topmost import eva
# topic diversity and coherence
TD = eva._diversity(top_words)
TC = eva._coherence(dataset.train_texts, dataset.vocab, top_words)
# get doc-topic distributions of testing samples
test_theta = trainer.test(dataset.test_data)
# clustering
clustering_results = eva._clustering(test_theta, dataset.test_labels)
# classification
cls_results = eva._cls(train_theta, test_theta, dataset.train_labels, dataset.test_labels)
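For intuition about the diversity score, here is a minimal sketch of one common definition of topic diversity: the fraction of unique words among the top words of all topics. It assumes each topic is a space-separated string of top words. Both that format and the formula are assumptions for illustration and may differ from what eva._diversity actually computes.

```python
def topic_diversity(top_words):
    """Fraction of unique words among the top words of all topics."""
    # Flatten the per-topic strings into one list of words.
    all_words = [word for topic in top_words for word in topic.split()]
    if not all_words:
        return 0.0
    return len(set(all_words)) / len(all_words)


# Hypothetical topics; "space" appears in both, so diversity is below 1.
example_topics = [
    "space orbit launch satellite",
    "windows files dos space",
]
td = topic_diversity(example_topics)
print(td)
```

Under this definition, a score of 1.0 means no word is shared between topics, and lower scores indicate redundant topics.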
Test new documents
import torch
from topmost import Preprocess
new_docs = [
"This is a new document about space, including words like space, satellite, launch, orbit.",
"This is a new document about Microsoft Windows, including words like windows, files, dos."
]
preprocess = Preprocess()
new_parsed_docs, new_bow = preprocess.parse(new_docs, vocab=dataset.vocab)
new_theta = trainer.test(torch.as_tensor(new_bow.toarray(), device=device).float())
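The key point in this step is that new documents are encoded against the training vocabulary, so words the model has never seen are dropped. A minimal pure-Python sketch of that idea follows. The whitespace tokenizer and the to_bow helper are hypothetical simplifications, not the behavior of Preprocess.parse.

```python
from collections import Counter


def to_bow(doc, vocab):
    """Count occurrences of each vocabulary word in a document."""
    # Lowercase and split on whitespace; real preprocessing is more involved.
    counts = Counter(doc.lower().split())
    # Words outside the vocabulary are ignored; missing words count as 0.
    return [counts[word] for word in vocab]


vocab = ["space", "orbit", "windows", "files"]
bow = to_bow("Space probes orbit in space", vocab)
print(bow)
```

Because the vector has one entry per vocabulary word, its length always matches the input size the trained model expects.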
Installation
Stable release
To install TopMost, run this command in the terminal:
$ pip install topmost
This is the preferred method to install TopMost, as it will always install the most recent stable release.
From sources
To install the latest sources directly from the GitHub repository, run:
$ pip install git+https://github.com/bobxwu/TopMost.git
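After either installation method, you may want to confirm which version ended up in your environment. The sketch below uses only the standard library; the helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


print(installed_version("topmost"))
```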
Tutorials
We provide tutorials for different usages:
| Name | Link |
|---|---|
| Quickstart | |
| How to preprocess datasets | |
| How to train and evaluate a basic topic model | |
| How to train and evaluate a hierarchical topic model | |
| How to train and evaluate a dynamic topic model | |
| How to train and evaluate a cross-lingual topic model | |
Disclaimer
This library includes some datasets for demonstration. If you are a dataset owner who wants to exclude your dataset from this library, please contact Xiaobao Wu.
Contact
We welcome your contributions to this project. Please feel free to submit pull requests.
If you encounter any problem, please either directly contact Xiaobao Wu or leave an issue in the GitHub repo.
Download files
File details
Details for the file topmost-1.0.2.tar.gz.
File metadata
- Download URL: topmost-1.0.2.tar.gz
- Upload date:
- Size: 56.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.21
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e2772821cbc82cc39e8b5009c8b24ef638c53c20edb388f64db51599943e7cb7 |
| MD5 | a495c6918f7628f5eef4b68d8d3b2f42 |
| BLAKE2b-256 | 37961a90e092b7c3c6c5cdf4b74c911c86934e8717946aebfbc9f5c520eab605 |
File details
Details for the file topmost-1.0.2-py3-none-any.whl.
File metadata
- Download URL: topmost-1.0.2-py3-none-any.whl
- Upload date:
- Size: 79.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.21
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4e11ae9254446fdd988064a5bda38cb80fcc5671f2ed87ce25ba2e77eb9ea81b |
| MD5 | 2218b6d1f49ae4e082a34027e4dffc47 |
| BLAKE2b-256 | ec14ca0847269c3f851f3fc8dc24cd830e1a51545931515a04de4ff5d2807d57 |
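To check a manually downloaded file against a SHA256 digest listed above, a sketch like the following can help. The helper names are hypothetical, and the file path in the usage comment is an assumption about where you saved the download.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns empty bytes at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Example usage (the path is an assumption about your download location):
# matches_published_hash("topmost-1.0.2.tar.gz", "<SHA256 from the table>")
```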