Topic modeling with contextual representations from sentence transformers.
Topic modeling is your turf too.
Contextual topic models with representations from transformers.
Intentions
- Provide simple, robust, and fast implementations of existing approaches (BERTopic, Top2Vec, CTM) with minimal dependencies.
- Implement state-of-the-art approaches from my own papers (the papers are still work in progress).
- Put all approaches in a broader conceptual framework.
- Provide clear and extensive documentation about the best use-cases for each model.
- Make the models' API streamlined and compatible with topicwizard and scikit-learn.
- Develop smarter, transformer-based evaluation metrics.
!!! This package is still a prototype, and no papers about the models have been published yet. Until they are out and most features are implemented, I DO NOT recommend using this package for production or academic use !!!
Roadmap
- Model Implementation
- Pretty Printing
- Publish papers ⏳ (in progress)
- Thorough documentation and good tutorials ⏳
- Implement visualization utilities for these models in topicwizard ⏳
- High-level topic descriptions with LLMs.
- Contextualized evaluation metrics.
Implemented Models
Mixture of Gaussians (GMM)
A topic model where topics are assumed to be multivariate normal components in embedding space, and term importances are estimated with Soft-c-TF-IDF.
from turftopic import GMM
model = GMM(10).fit(texts)
model.print_topics()
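Roughly, the Soft-c-TF-IDF idea replaces the hard cluster assignments of c-TF-IDF with the mixture's soft responsibilities when pooling term counts. A minimal numpy sketch of this idea (purely illustrative; the exact formula used in turftopic may differ):

```python
import numpy as np

# Toy corpus: 4 documents x 5 vocabulary terms (term counts).
doc_term = np.array([
    [2, 1, 0, 0, 1],
    [1, 2, 0, 1, 0],
    [0, 0, 3, 1, 0],
    [0, 1, 2, 2, 0],
], dtype=float)

# Soft responsibilities p(component | document) from a fitted mixture.
resp = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
])

# Pool term counts per component, weighted by soft assignment.
term_per_topic = resp.T @ doc_term                        # (topics, terms)
tf = term_per_topic / term_per_topic.sum(axis=1, keepdims=True)

# c-TF-IDF-style weighting: down-weight terms common across the corpus.
freq = doc_term.sum(axis=0)
idf = np.log(1 + freq.mean() / freq)

importance = tf * idf                                     # (topics, terms)
```

With hard (0/1) responsibilities this reduces to ordinary c-TF-IDF over clusters.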
KeyNMF
Nonnegative Matrix Factorization over keyword importances based on transformer representations.
from turftopic import KeyNMF
model = KeyNMF(10).fit(texts)
model.print_topics()
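The underlying recipe can be sketched in two steps: build a nonnegative document-keyword matrix from embedding similarities, then factorize it with NMF. A self-contained numpy illustration (random vectors stand in for real transformer embeddings, and a plain multiplicative-update NMF stands in for whatever solver turftopic actually uses):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for transformer embeddings of 6 documents and 8 vocabulary words.
doc_emb = rng.normal(size=(6, 16))
word_emb = rng.normal(size=(8, 16))

def cos(a, b):
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Keyword importance matrix: similarities, clipped to stay nonnegative.
K = np.clip(cos(doc_emb, word_emb), 0, None)   # (docs, words)

# Plain NMF via multiplicative updates: K ~ W @ H.
n_topics, eps = 2, 1e-9
W = rng.random((6, n_topics))
H = rng.random((n_topics, 8))
for _ in range(200):
    H *= (W.T @ K) / (W.T @ W @ H + eps)
    W *= (K @ H.T) / (W @ H @ H.T + eps)

top_words = np.argsort(-H, axis=1)[:, :3]      # top 3 keywords per topic
```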
Semantic Signal Separation (S³)
Interprets topics as underlying dimensions (axes) of semantic space, and recovers these dimensions with ICA or PCA.
from turftopic import SemanticSignalSeparation
model = SemanticSignalSeparation(10).fit(texts)
model.print_topics()
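The "topics as axes" view can be illustrated with the PCA variant: decompose the document embeddings into principal axes, then score vocabulary words by how strongly their embeddings align with each axis. A hedged numpy sketch (random vectors stand in for real embeddings; S³'s actual pipeline may differ in details):

```python
import numpy as np

rng = np.random.default_rng(42)
doc_emb = rng.normal(size=(20, 10))   # stand-in contextual document embeddings
word_emb = rng.normal(size=(30, 10))  # stand-in word embeddings

# PCA via SVD: principal axes act as candidate "semantic dimensions".
centered = doc_emb - doc_emb.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
axes = vt[:3]                          # keep 3 dimensions (topics)

# Term importances: projection of each word onto each semantic axis.
term_scores = word_emb @ axes.T        # (n_words, n_topics)
top_words = np.argsort(-term_scores, axis=0)[:5]  # top 5 words per topic
```

Swapping the SVD step for ICA would give the other flavor of the model.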
Clustering Topic Models
Topics are clusters in embedding space, and term importances are estimated either with c-TF-IDF (as in BERTopic) or by proximity to the cluster centroid (as in Top2Vec).
from turftopic import ClusteringTopicModel
model = ClusteringTopicModel().fit(texts)
model.print_topics()
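The centroid-proximity idea (the Top2Vec-style option) is simple to sketch: given cluster labels for the document embeddings, each topic's term scores are the cosine similarities between word embeddings and the cluster centroid. An illustrative numpy sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(7)
doc_emb = rng.normal(size=(12, 8))   # stand-in document embeddings
word_emb = rng.normal(size=(20, 8))  # stand-in word embeddings
labels = np.array([0] * 6 + [1] * 6)  # cluster assignments (e.g. from HDBSCAN)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Centroid of each cluster in embedding space.
centroids = np.stack([doc_emb[labels == c].mean(axis=0) for c in (0, 1)])

# Term importances: cosine similarity of each word to each centroid.
scores = normalize(word_emb) @ normalize(centroids).T   # (n_words, n_clusters)
top_words = np.argsort(-scores, axis=0)[:5]             # top 5 words per topic
```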
Variational Autoencoders (CTM)
Contextual representations are used as inputs to the ProdLDA encoder network, either on their own (ZeroShotTM) or concatenated with bag-of-words representations (CombinedTM).
pip install turftopic[pyro-ppl]
from turftopic import AutoencodingTopicModel
model = AutoencodingTopicModel(10).fit(texts)
model.print_topics()
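The difference between the two variants is only what the encoder sees. A schematic sketch of the input construction (the dimensions here are made up, and the VAE itself is omitted):

```python
import numpy as np

rng = np.random.default_rng(1)
contextual = rng.normal(size=(4, 384))                 # sentence embeddings
bow = rng.integers(0, 3, size=(4, 2000)).astype(float)  # bag-of-words counts

zeroshot_input = contextual                    # ZeroShotTM: embeddings only
combined_input = np.hstack([contextual, bow])  # CombinedTM: embeddings + BoW
```

Because ZeroShotTM never conditions on the BoW at encoding time, it can infer topics for documents in languages unseen during training, as long as the sentence encoder is multilingual.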