Topic modeling with contextual representations from sentence transformers.
Topic modeling is your turf too.
Contextual topic models with representations from transformers.
Intentions
- Provide simple, robust and fast implementations of existing approaches (BERTopic, Top2Vec, CTM) with minimal dependencies.
- Implement state-of-the-art approaches from my papers (the papers are still work in progress).
- Put all approaches in a broader conceptual framework.
- Provide clear and extensive documentation about the best use cases for each model.
- Make the models' API streamlined and compatible with topicwizard and scikit-learn.
- Develop smarter, transformer-based evaluation metrics.
!!! This package is still a prototype, and no papers about the models have been published yet. Until they are out and most features are implemented, I DO NOT recommend using this package for production or academic use !!!
Roadmap
- Model Implementation
- Pretty Printing
- Publish papers ⏳ (in progress...)
- Thorough documentation and good tutorials ⏳
- Implement visualization utilities for these models in topicwizard ⏳
- High-level topic descriptions with LLMs.
- Contextualized evaluation metrics.
Implemented Models
Mixture of Gaussians (GMM)
Topic models where topics are assumed to be Multivariate Normal components, and term importances are estimated with Soft-c-TF-IDF.
from turftopic import GMM
# texts is a list of raw document strings
model = GMM(10).fit(texts)
model.print_topics()
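The underlying idea can be sketched with scikit-learn on synthetic data. This is an illustrative approximation, not turftopic's actual implementation: the embeddings and term counts are random stand-ins, and the Soft-c-TF-IDF weighting shown is a simplified variant.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy document embeddings: two blobs standing in for sentence-transformer outputs
embeddings = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 5)),
    rng.normal(1.0, 0.1, size=(20, 5)),
])
# Toy bag-of-words counts for the same 40 documents over a 30-term vocabulary
counts = rng.integers(0, 3, size=(40, 30))

# Topics as multivariate normal components of the embedding space
gmm = GaussianMixture(n_components=2, random_state=0).fit(embeddings)
resp = gmm.predict_proba(embeddings)  # soft component responsibilities, (40, 2)

# Soft-c-TF-IDF-style weighting: aggregate term counts per component,
# weighted by responsibilities, then apply an IDF-like rescaling
tf = resp.T @ counts                                        # (2, 30)
idf = np.log(1 + counts.sum() / (1 + counts.sum(axis=0)))   # (30,)
importance = tf * idf                                       # per-component term importances
top_terms = importance.argsort(axis=1)[:, ::-1][:, :5]      # top 5 term indices per topic
```

Because documents get soft responsibilities rather than hard assignments, every document contributes to every topic's term importances in proportion to how well the component explains it.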
KeyNMF
Nonnegative Matrix Factorization over keyword importances based on transformer representations.
from turftopic import KeyNMF
model = KeyNMF(10).fit(texts)
model.print_topics()
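A rough sketch of the idea with scikit-learn, using random vectors in place of real transformer embeddings. The keyword-importance construction here (clipped cosine similarity between document and term embeddings) is an assumption for illustration; turftopic's exact extraction step may differ.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Stand-ins for transformer embeddings of 40 documents and 30 vocabulary terms
doc_emb = rng.normal(size=(40, 8))
term_emb = rng.normal(size=(30, 8))

# Keyword importances: cosine similarity between each document and each term,
# clipped at zero so the matrix is nonnegative (as NMF requires)
doc_n = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
term_n = term_emb / np.linalg.norm(term_emb, axis=1, keepdims=True)
keywords = np.clip(doc_n @ term_n.T, 0, None)    # (40, 30)

# Factorize the keyword matrix into document-topic and topic-term factors
nmf = NMF(n_components=5, init="nndsvda", random_state=0, max_iter=500)
doc_topic = nmf.fit_transform(keywords)          # (40, 5)
topic_term = nmf.components_                     # (5, 30)
```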
Semantic Signal Separation (S³)
Interprets topics as dimensions of semantics. Obtains these dimensions with ICA or PCA.
from turftopic import SemanticSignalSeparation
model = SemanticSignalSeparation(10).fit(texts)
model.print_topics()
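The ICA variant can be sketched as follows with scikit-learn and random stand-in embeddings; scoring terms by unmixing their embeddings with the same learned transform is an illustrative choice, not necessarily the library's exact procedure.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
doc_emb = rng.normal(size=(60, 10))   # stand-in document embeddings
term_emb = rng.normal(size=(25, 10))  # stand-in vocabulary term embeddings

# Recover independent semantic axes from the document embeddings
ica = FastICA(n_components=3, random_state=0, whiten="unit-variance")
doc_topic = ica.fit_transform(doc_emb)   # documents expressed on the axes, (60, 3)

# Score vocabulary terms on the same axes by applying the learned unmixing
term_topic = ica.transform(term_emb)     # (25, 3)
# Terms with the largest absolute loading characterize each axis's two poles
top = np.abs(term_topic).argmax(axis=0)
```

Unlike cluster-based models, each axis is a continuous dimension: documents and terms can load positively or negatively on it, so a "topic" has two interpretable ends.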
Clustering Topic Models
Topics are clusters in embedding space and term importances are either estimated with c-TF-IDF (BERTopic) or proximity to cluster centroid (Top2Vec).
from turftopic import ClusteringTopicModel
model = ClusteringTopicModel().fit(texts)
model.print_topics()
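The centroid-proximity variant (Top2Vec-style term importance) can be sketched with KMeans on synthetic embeddings; turftopic's clustering model is configurable beyond this, so treat it as a minimal illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
doc_emb = rng.normal(size=(50, 8))    # stand-in document embeddings
term_emb = rng.normal(size=(30, 8))   # stand-in vocabulary term embeddings

# Topics are clusters of documents in embedding space
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(doc_emb)

# Top2Vec-style term importances: cosine proximity of each term embedding
# to each cluster centroid
cent = km.cluster_centers_
cent_n = cent / np.linalg.norm(cent, axis=1, keepdims=True)
term_n = term_emb / np.linalg.norm(term_emb, axis=1, keepdims=True)
importance = cent_n @ term_n.T                           # (4, 30)
top_terms = importance.argsort(axis=1)[:, ::-1][:, :5]   # top 5 terms per cluster
```

The c-TF-IDF alternative would instead aggregate term counts per cluster and reweight them, as BERTopic does; both describe the same clusters, just from different evidence.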
Variational Autoencoders (CTM)
Contextual representations are used as inputs to a ProdLDA encoder network, either on their own (ZeroShotTM) or concatenated with bag-of-words representations (CombinedTM).
pip install turftopic[pyro-ppl]
from turftopic import AutoencodingTopicModel
model = AutoencodingTopicModel(10).fit(texts)
model.print_topics()
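The encoder-input construction and the variational step can be sketched in plain numpy. The real model trains a neural encoder with a proper ELBO objective (hence the Pyro dependency); the linear maps below are random, untrained stand-ins that only illustrate the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics = 10
contextual = rng.normal(size=(40, 384))   # stand-in sentence-transformer embeddings
bow = rng.random(size=(40, 500))          # stand-in bag-of-words features

# CombinedTM feeds the concatenation to the encoder;
# ZeroShotTM would use `contextual` alone
enc_in = np.concatenate([contextual, bow], axis=1)

# Untrained stand-in for the ProdLDA encoder: linear maps to the mean and
# log-variance of the latent topic distribution
W_mu = rng.normal(0, 0.01, size=(enc_in.shape[1], n_topics))
W_lv = rng.normal(0, 0.01, size=(enc_in.shape[1], n_topics))
mu, logvar = enc_in @ W_mu, enc_in @ W_lv

# Reparameterization trick, then softmax to get document-topic proportions
z = mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)
theta = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)   # rows sum to 1
```

Because ZeroShotTM's encoder sees only the contextual embeddings, it can infer topic proportions for documents whose vocabulary (or language) never appeared in training.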