topmost

TopMost: A Topic Modeling System Toolkit

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 3 - Alpha
Intended Audience
- Education
Operating System
- MacOS :: MacOS X
- Microsoft :: Windows
Programming Language

Project description

TopMost covers the complete lifecycle of topic modeling: datasets, preprocessing, models, training, and evaluation. It supports the most popular topic modeling scenarios, namely basic, dynamic, hierarchical, and cross-lingual topic modeling.

Check our ACL 2024 demo paper: Towards the TopMost: A Topic Modeling System Toolkit.

Check our survey paper on neural topic models accepted to Artificial Intelligence Review: A Survey on Neural Topic Models: Methods, Applications, and Challenges.

If you use TopMost, please cite:

@inproceedings{wu2023topmost,
    title = "Towards the {T}op{M}ost: A Topic Modeling System Toolkit",
    author = "Wu, Xiaobao  and Pan, Fengjun  and Luu, Anh Tuan",
    editor = "Cao, Yixin  and Feng, Yang  and Xiong, Deyi",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-demos.4",
    pages = "31--41"
}

@article{wu2023survey,
    title={A Survey on Neural Topic Models: Methods, Applications, and Challenges},
    author={Wu, Xiaobao and Nguyen, Thong and Luu, Anh Tuan},
    journal={Artificial Intelligence Review},
    url={https://doi.org/10.1007/s10462-023-10661-7},
    year={2024},
    publisher={Springer}
}

Overview

TopMost offers the following topic modeling scenarios with models, evaluation metrics, and datasets:

https://github.com/BobXWu/TopMost/raw/main/docs/source/_static/architecture.svg

Scenario	Models	Evaluation Metrics	Datasets
Basic Topic Modeling	LDA, NMF, ProdLDA, DecTM, ETM, NSTM, TSCTM, BERTopic, ECRTM, FASTopic	TC, TD, Clustering, Classification	20NG, IMDB, NeurIPS, ACL, NYT, Wikitext-103
Hierarchical Topic Modeling	HDP, SawETM, HyperMiner, ProGBN, TraCo	TC over levels, TD over levels, Clustering over levels, Classification over levels	20NG, IMDB, NeurIPS, ACL, NYT, Wikitext-103
Dynamic Topic Modeling	DTM, DETM, CFDTM	TC over time slices, TD over time slices, Clustering, Classification	NeurIPS, ACL, NYT
Cross-lingual Topic Modeling	NMTM, InfoCTM	TC (CNPMI), TD over languages, Classification (Intra and Cross-lingual)	ECNews, Amazon Review, Rakuten

Quick Start

Install TopMost

Install topmost with pip:

$ pip install topmost

The example below uses FASTopic to obtain the top words of the discovered topics (top_words) and the topic distributions of the documents (doc_topic_dist). The preprocessing steps are configurable; see the documentation.

from topmost import RawDataset, Preprocess, FASTopicTrainer
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']
preprocess = Preprocess(vocab_size=10000)

dataset = RawDataset(docs, preprocess, device="cuda")

trainer = FASTopicTrainer(dataset, verbose=True)
top_words, doc_topic_dist = trainer.train()

new_docs = [
    "This is a document about space, including words like space, satellite, launch, orbit.",
    "This is a document about Microsoft Windows, including words like windows, files, dos."
]

new_theta = trainer.test(new_docs)
print(new_theta.argmax(1))
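
To make the last line concrete: new_theta holds one topic distribution per document, and argmax(1) picks the index of the most probable topic in each row. The sketch below reproduces that step in plain Python on a made-up 2x3 matrix. The numbers and the helper name `dominant_topics` are illustrative assumptions, not TopMost output or API.

```python
def dominant_topics(theta):
    """Return the index of the largest entry in each row of theta."""
    return [max(range(len(row)), key=row.__getitem__) for row in theta]

# Hypothetical doc-topic distributions: 2 documents, 3 topics; each row sums to 1.
toy_theta = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]

print(dominant_topics(toy_theta))  # [0, 2]
```

Each returned index can then be matched against the corresponding entry of top_words to see which topic a document was assigned to.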

Usage

Download a preprocessed dataset

import topmost

topmost.download_dataset('20NG', cache_path='./datasets')

Train a model

device = "cuda" # or "cpu"

# load a preprocessed dataset
dataset = topmost.BasicDataset("./datasets/20NG", device=device, read_labels=True)
# create a model
model = topmost.ProdLDA(dataset.vocab_size)
model = model.to(device)

# create a trainer
trainer = topmost.BasicTrainer(model, dataset)

# train the model
top_words, train_theta = trainer.train()
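
The snippet above sets device = "cuda", which raises an error on machines without a CUDA-capable GPU. A minimal sketch of a fallback, assuming PyTorch is the backend (the model.to(device) call suggests so); `pick_device` is a hypothetical helper, not part of TopMost.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed; fall back so the helper still works.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(device)
```

The resulting string can be passed wherever the examples use device.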

Evaluate

from topmost import eva

# topic diversity and coherence
TD = eva._diversity(top_words)
TC = eva._coherence(dataset.train_texts, dataset.vocab, top_words)

# get doc-topic distributions of testing samples
test_theta = trainer.test(dataset.test_data)
# clustering
clustering_results = eva._clustering(test_theta, dataset.test_labels)
# classification
cls_results = eva._cls(train_theta, test_theta, dataset.train_labels, dataset.test_labels)
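
As a rough guide to what the diversity score measures: topic diversity (TD) is commonly defined as the fraction of unique words among the top words of all topics, so 1.0 means no two topics share a top word. The sketch below implements that common definition in plain Python. It assumes each topic is a string of space-separated words, and it is not necessarily how eva._diversity is implemented.

```python
def topic_diversity(top_words):
    """Fraction of unique words across the top words of all topics."""
    words = [w for topic in top_words for w in topic.split()]
    if not words:
        raise ValueError("top_words must contain at least one word")
    return len(set(words)) / len(words)

# Hypothetical top words for three topics; "space" appears in two of them.
toy_top_words = [
    "space orbit launch",
    "windows files dos",
    "space game team",
]

print(topic_diversity(toy_top_words))
```

Here 8 of the 9 top words are unique, so the score falls slightly below 1.0.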

Test new documents

import torch
from topmost import Preprocess

new_docs = [
    "This is a new document about space, including words like space, satellite, launch, orbit.",
    "This is a new document about Microsoft Windows, including words like windows, files, dos."
]

preprocess = Preprocess()
new_parsed_docs, new_bow = preprocess.parse(new_docs, vocab=dataset.vocab)
new_theta = trainer.test(torch.as_tensor(new_bow.toarray(), device=device).float())
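
The key point in this step is that new documents are encoded against the training vocabulary (dataset.vocab), so their bag-of-words vectors have the same dimensions as the training data. A simplified sketch of that idea follows, with a naive lowercase tokenizer standing in for TopMost's configurable preprocessing. The helper `bow_vectors` and the toy vocabulary are assumptions for illustration, and words outside the vocabulary are simply ignored here.

```python
import re
from collections import Counter

def bow_vectors(docs, vocab):
    """Count occurrences of each vocabulary word in each document."""
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for doc in docs:
        # Naive tokenizer: lowercase alphabetic runs; out-of-vocabulary tokens are skipped.
        counts = Counter(t for t in re.findall(r"[a-z]+", doc.lower()) if t in index)
        row = [0] * len(vocab)
        for token, n in counts.items():
            row[index[token]] = n
        vectors.append(row)
    return vectors

toy_vocab = ["space", "orbit", "windows", "files"]
toy_docs = ["Space, satellite, launch, orbit: space!", "Windows files and DOS."]
print(bow_vectors(toy_docs, toy_vocab))  # [[2, 1, 0, 0], [0, 0, 1, 1]]
```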

Installation

Stable release

To install TopMost, run this command in the terminal:

$ pip install topmost

This is the preferred method to install TopMost, as it will always install the most recent stable release.

From sources

The source code for TopMost is hosted in the GitHub repository. To install directly from it, run:

$ pip install git+https://github.com/bobxwu/TopMost.git

Tutorials

We provide tutorials for the following use cases:

Quickstart
How to preprocess datasets
How to train and evaluate a basic topic model
How to train and evaluate a hierarchical topic model
How to train and evaluate a dynamic topic model
How to train and evaluate a cross-lingual topic model

Disclaimer

This library includes some datasets for demonstration. If you are a dataset owner who wants to exclude your dataset from this library, please contact Xiaobao Wu.

Authors

Xiaobao Wu

Fengjun Pan

Acknowledgments

Icon by Flat-icons-com.
If you want to add any models to this package, we welcome your pull requests.
If you encounter any problem, please either directly contact Xiaobao Wu or leave an issue in the GitHub repo.

Release history

- 1.0.2: Mar 7, 2025
- 1.0.1: Jan 26, 2025
- 1.0.0: Jan 14, 2025 (this version)
- 0.0.5: Jul 10, 2024
- 0.0.4: Jun 18, 2024
- 0.0.3: May 31, 2024
- 0.0.2: May 23, 2024
- 0.0.1: Sep 18, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

topmost-1.0.0.tar.gz (56.7 kB view details)

Uploaded Jan 14, 2025 Source

Built Distribution

topmost-1.0.0-1-py3-none-any.whl (93.7 kB view details)

Uploaded Jan 14, 2025 Python 3

File details

Details for the file topmost-1.0.0.tar.gz.

File metadata

Download URL: topmost-1.0.0.tar.gz
Upload date: Jan 14, 2025
Size: 56.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.0 CPython/3.9.19

File hashes

Hashes for topmost-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`5d8456e6837daeb69c1d57c6fb990d030f1e50a7945b7d79156ee672aa14907f`
MD5	`a60a475db8667851e9c7ad91642e98d1`
BLAKE2b-256	`8781d343525838d000ad5ffdd5a50b3163757a19c9f505c7690acb06087fcbc5`

See more details on using hashes here.
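
One way to use the digests above is to hash the downloaded file locally and compare the result with the published SHA256 value. A minimal sketch with Python's standard hashlib module; the helper names `sha256_of` and `matches_published_hash` are assumptions, and the expected digest is the one listed in the table above for topmost-1.0.0.tar.gz.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Published SHA256 for topmost-1.0.0.tar.gz, copied from the table above.
EXPECTED = "5d8456e6837daeb69c1d57c6fb990d030f1e50a7945b7d79156ee672aa14907f"

def matches_published_hash(path, expected=EXPECTED):
    """Return True if the file at path hashes to the expected digest."""
    return sha256_of(path) == expected.lower()
```

For example, matches_published_hash("topmost-1.0.0.tar.gz") would check an archive saved in the current directory (the local path is an assumption).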

File details

Details for the file topmost-1.0.0-1-py3-none-any.whl.

File metadata

Download URL: topmost-1.0.0-1-py3-none-any.whl
Upload date: Jan 14, 2025
Size: 93.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.0 CPython/3.9.19

File hashes

Hashes for topmost-1.0.0-1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1fa1714ddd827c10718aa47a5cb91cdec6b665571dd445c01249d9fbf5d54a6d`
MD5	`7a5f8a5a6f66a907528a4986f12d6058`
BLAKE2b-256	`0fe963e0ad24e14fe1a1517f80c988b8b9b642bba6343507d5b3c2da23baa600`

See more details on using hashes here.
