Topic modeling with contextual representations from sentence transformers.
Topic modeling is your turf too.
Contextual topic models with representations from transformers.
Intentions
- Provide simple, robust and fast implementations of existing approaches (BERTopic, Top2Vec, CTM) with minimal dependencies.
- Implement state-of-the-art approaches from my papers (the papers themselves are still work in progress).
- Put all approaches in a broader conceptual framework.
- Provide clear and extensive documentation about the best use-cases for each model.
- Make the models' API streamlined and compatible with topicwizard and scikit-learn.
- Develop smarter, transformer-based evaluation metrics.
!!! This package is still a prototype, and no papers about the models have been published yet. Until they are out and most features are implemented, I DO NOT recommend using this package for production or academic use !!!
Roadmap
- Model Implementation
- Pretty Printing
- Publish papers ⏳ (in progress…)
- Thorough documentation and good tutorials ⏳
- Implement visualization utilities for these models in topicwizard ⏳
- High-level topic descriptions with LLMs.
- Contextualized evaluation metrics.
Implemented Models
Mixture of Gaussians (GMM)
Topic models where topics are assumed to be multivariate normal components in the embedding space, and term importances are estimated with Soft-c-TF-IDF.
```python
from turftopic import GMM

model = GMM(10).fit(texts)
model.print_topics()
```
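The underlying idea can be sketched with off-the-shelf scikit-learn components: fit a Gaussian mixture over document embeddings, then weight term counts by each document's soft cluster responsibilities. This is a simplified, illustrative stand-in for Soft-c-TF-IDF, not turftopic's actual implementation, and all data here is synthetic.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-ins for sentence-transformer embeddings and document-term counts
embeddings = rng.normal(size=(100, 16))
doc_term = rng.integers(0, 3, size=(100, 50)).astype(float)

# Fit a mixture of Gaussians over the embeddings
gmm = GaussianMixture(n_components=5, random_state=0).fit(embeddings)
responsibilities = gmm.predict_proba(embeddings)  # (100, 5), rows sum to 1

# Soft term importance: responsibility-weighted term counts per topic
topic_term = responsibilities.T @ doc_term        # (5, 50)
top_terms = np.argsort(-topic_term, axis=1)[:, :10]  # top-10 term ids per topic
```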
KeyNMF
Nonnegative Matrix Factorization over keyword importances based on transformer representations.
```python
from turftopic import KeyNMF

model = KeyNMF(10).fit(texts)
model.print_topics()
```
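A rough, hypothetical sketch of the idea with scikit-learn: build a nonnegative document-keyword importance matrix from embedding similarities, then factorize it with NMF. The similarity-clipping step and all names here are illustrative simplifications, not turftopic internals.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Toy embeddings for 80 documents and a 40-word vocabulary
doc_emb = rng.normal(size=(80, 16))
word_emb = rng.normal(size=(40, 16))

# Keyword importances: cosine similarity between documents and words,
# clipped at zero so the matrix stays nonnegative for NMF
doc_n = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
word_n = word_emb / np.linalg.norm(word_emb, axis=1, keepdims=True)
importances = np.clip(doc_n @ word_n.T, 0, None)  # (80, 40)

# Factorize into document-topic and topic-word matrices
nmf = NMF(n_components=5, random_state=0, max_iter=500)
doc_topic = nmf.fit_transform(importances)  # (80, 5)
topic_word = nmf.components_                # (5, 40)
```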
Semantic Signal Separation (S³)
Interprets topics as dimensions of semantics. Obtains these dimensions with ICA or PCA.
```python
from turftopic import SemanticSignalSeparation

model = SemanticSignalSeparation(10).fit(texts)
model.print_topics()
```
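Conceptually, this resembles running ICA over document embeddings and reading each independent component as a semantic axis. A minimal sketch with scikit-learn on synthetic data (illustrative only, not the model's actual pipeline):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 32))  # stand-in for sentence embeddings

# Decompose embeddings into 10 independent semantic axes
ica = FastICA(n_components=10, random_state=0, whiten="unit-variance")
doc_topic = ica.fit_transform(embeddings)  # (200, 10): documents on the axes

# Terms can be scored on the same axes by transforming word embeddings
word_emb = rng.normal(size=(50, 32))
word_topic = ica.transform(word_emb)       # (50, 10)
```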
Clustering Topic Models
Topics are clusters in embedding space, and term importances are estimated either with c-TF-IDF (BERTopic) or with proximity to the cluster centroid (Top2Vec).
```python
from turftopic import ClusteringTopicModel

model = ClusteringTopicModel().fit(texts)
model.print_topics()
```
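The c-TF-IDF weighting used in the BERTopic-style variant can be sketched in plain NumPy: cluster documents in embedding space, aggregate term counts per cluster ("class"), then multiply the class-normalized term frequency by an IDF-like factor computed over classes. This is a simplified illustration on synthetic data, not turftopic's exact formula.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(120, 16))               # document embeddings
doc_term = rng.integers(0, 3, size=(120, 60)).astype(float)  # term counts

# Cluster documents in embedding space
n_clusters = 4
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)

# Aggregate term counts per cluster ("class")
class_term = np.vstack([doc_term[labels == c].sum(axis=0) for c in range(n_clusters)])

# c-TF-IDF: class-normalized term frequency times an IDF-like factor
tf = class_term / class_term.sum(axis=1, keepdims=True)
avg_words = class_term.sum() / n_clusters
idf = np.log(1 + avg_words / class_term.sum(axis=0))
ctfidf = tf * idf  # (4, 60); the highest-weighted terms describe each cluster
```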
Variational Autoencoders (CTM)
Contextual representations are used as inputs to a ProdLDA encoder network, either on their own (ZeroShotTM) or concatenated with bag-of-words representations (CombinedTM).
```bash
pip install turftopic[pyro-ppl]
```

```python
from turftopic import AutoencodingTopicModel

model = AutoencodingTopicModel(10).fit(texts)
model.print_topics()
```
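The difference between the two variants is essentially the encoder input. A toy illustration of that input construction with made-up dimensions (not the actual network, which is a ProdLDA-style variational autoencoder):

```python
import numpy as np

rng = np.random.default_rng(0)
contextual = rng.normal(size=(10, 384))                   # sentence embeddings
bow = rng.integers(0, 3, size=(10, 2000)).astype(float)   # bag-of-words counts

zeroshot_input = contextual                    # ZeroShotTM: embeddings only
combined_input = np.hstack([contextual, bow])  # CombinedTM: embeddings + BoW
```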