Topic modeling with contextual representations from sentence transformers.

Project description


Topic modeling is your turf too.
Contextual topic models with representations from transformers.

Intentions

  • Provide simple, robust and fast implementations of existing approaches (BERTopic, Top2Vec, CTM) with minimal dependencies.
  • Implement state-of-the-art approaches from my papers (papers are work in progress).
  • Put all approaches in a broader conceptual framework.
  • Provide clear and extensive documentation about the best use-cases for each model.
  • Make the models' API streamlined and compatible with topicwizard and scikit-learn.
  • Develop smarter, transformer-based evaluation metrics.

!!!This package is still a prototype, and no papers have been published about the models yet. Until they are out and most features are implemented, I DO NOT recommend using this package in production or for academic work!!!

Roadmap

  • Model Implementation
  • Pretty Printing
  • Publish papers ⏳ (in progress)
  • Thorough documentation and good tutorials ⏳
  • Implement visualization utilities for these models in topicwizard ⏳
  • High-level topic descriptions with LLMs.
  • Contextualized evaluation metrics.

Implemented Models

Mixture of Gaussians (GMM)

Topic models where topics are assumed to be Multivariate Normal components, and term importances are estimated with Soft-c-TF-IDF.

from turftopic import GMM

model = GMM(10).fit(texts)
model.print_topics()
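
To give some intuition for how soft topic memberships can feed into term importances, the sketch below weights each document's term counts by its topic probabilities and then applies a TF-IDF-like scaling. The formula, the function name `soft_term_importance`, and the toy numbers are illustrative assumptions; this is not necessarily the exact Soft-c-TF-IDF computation used by the package.

```python
import math

def soft_term_importance(doc_topic, doc_term):
    """Toy soft class-based TF-IDF (illustrative formula, an assumption).

    doc_topic: per-document topic probabilities, shape (n_docs, n_topics)
    doc_term:  per-document term counts,         shape (n_docs, n_terms)
    """
    n_topics = len(doc_topic[0])
    n_terms = len(doc_term[0])
    # Soft term counts: every document contributes to every topic,
    # in proportion to its membership probability.
    soft_counts = [
        [sum(p[k] * c[t] for p, c in zip(doc_topic, doc_term)) for t in range(n_terms)]
        for k in range(n_topics)
    ]
    total_per_term = [sum(row[t] for row in soft_counts) for t in range(n_terms)]
    avg_topic_size = sum(total_per_term) / n_topics
    importance = []
    for row in soft_counts:
        topic_size = sum(row) or 1.0
        importance.append([
            # Relative frequency in the topic, scaled down for terms
            # that are frequent across all topics.
            (row[t] / topic_size) * math.log(1 + avg_topic_size / total_per_term[t])
            if total_per_term[t] > 0 else 0.0
            for t in range(n_terms)
        ])
    return importance

# Hypothetical toy data: 3 documents, 2 topics, 4 terms.
vocab = ["goal", "match", "vote", "party"]
doc_topic = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
doc_term = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 3, 1]]

importance = soft_term_importance(doc_topic, doc_term)
top_terms = [vocab[max(range(len(vocab)), key=row.__getitem__)] for row in importance]
print(top_terms)
```

In the package itself the topic probabilities would come from the fitted Gaussian mixture over document embeddings rather than being written by hand.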

KeyNMF

Nonnegative Matrix Factorization over keyword importances based on transformer representations.

from turftopic import KeyNMF

model = KeyNMF(10).fit(texts)
model.print_topics()
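
The general idea can be sketched with NumPy alone: build a nonnegative document-by-term matrix of keyword importances from embedding similarities, then factorize it. Random vectors stand in for transformer embeddings here, and the helper names `keyword_matrix` and `nmf`, the clipping of negative similarities, and the multiplicative-update solver are all assumptions made for illustration, not the package's internals.

```python
import numpy as np

def keyword_matrix(doc_emb, term_emb):
    """Cosine similarity of each document to each term, negatives clipped to 0."""
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    t = term_emb / np.linalg.norm(term_emb, axis=1, keepdims=True)
    return np.clip(d @ t.T, 0.0, None)

def nmf(X, n_components, n_iter=200, seed=0, eps=1e-9):
    """Plain multiplicative-update NMF: X is approximated by W @ H, with W, H >= 0."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], n_components))
    H = rng.random((n_components, X.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Stand-in embeddings (assumption): 20 documents, 30 vocabulary terms, 8 dimensions.
rng = np.random.default_rng(42)
doc_emb = rng.normal(size=(20, 8))
term_emb = rng.normal(size=(30, 8))

X = keyword_matrix(doc_emb, term_emb)   # document-by-term keyword importances
W, H = nmf(X, n_components=3)           # W: document-topic, H: topic-term
top_term_ids = np.argsort(-H, axis=1)[:, :5]
print(top_term_ids)
```

Each row of `H` then plays the role of a topic, and the largest entries in a row point at that topic's most important terms.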

Semantic Signal Separation (S³)

Interprets topics as dimensions of semantics. Obtains these dimensions with ICA or PCA.

from turftopic import SemanticSignalSeparation

model = SemanticSignalSeparation(10).fit(texts)
model.print_topics()
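
As a rough illustration of "topics as dimensions", the sketch below extracts axes from document embeddings with PCA (computed through an SVD) and scores terms by projecting their embeddings onto those axes. It uses the PCA variant only, random vectors stand in for real embeddings, and the function name `semantic_axes` is hypothetical; treat it as a sketch of the concept rather than the package's implementation.

```python
import numpy as np

def semantic_axes(doc_emb, n_components):
    """Return the embedding mean and the top principal axes (PCA via SVD)."""
    mean = doc_emb.mean(axis=0)
    _, _, vt = np.linalg.svd(doc_emb - mean, full_matrices=False)
    return mean, vt[:n_components]

# Stand-in embeddings (assumption): 50 documents, 100 terms, 16 dimensions.
rng = np.random.default_rng(0)
doc_emb = rng.normal(size=(50, 16))
term_emb = rng.normal(size=(100, 16))

mean, axes = semantic_axes(doc_emb, n_components=4)

# Position of every document and term along each semantic axis.
doc_topic = (doc_emb - mean) @ axes.T
term_topic = (term_emb - mean) @ axes.T

# Terms at the positive end of each axis; the negative end can be
# inspected the same way by sorting in the other direction.
top_terms = np.argsort(-term_topic, axis=0)[:5].T
print(top_terms)
```

Unlike cluster-based topics, an axis has two ends, so both the highest and the lowest scoring terms can be informative when interpreting it.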

Clustering Topic Models

Topics are clusters in embedding space and term importances are either estimated with c-TF-IDF (BERTopic) or proximity to cluster centroid (Top2Vec).

from turftopic import ClusteringTopicModel

model = ClusteringTopicModel().fit(texts)
model.print_topics()
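
The two ways of estimating term importance can be compared on toy data. The sketch below assumes cluster labels are already available, computes c-TF-IDF in the form described in the BERTopic documentation (term frequency within a cluster times log(1 + A / f), with A the average number of words per cluster and f the term's frequency over all clusters), and computes centroid proximity as cosine similarity between term embeddings and cluster centroids. The function names, labels, counts, and two-dimensional embeddings are made up for illustration.

```python
import numpy as np

def c_tf_idf(doc_term, labels):
    """Class-based TF-IDF over clusters (assumes every term occurs at least once)."""
    classes = np.unique(labels)
    class_term = np.stack([doc_term[labels == c].sum(axis=0) for c in classes])
    tf = class_term / class_term.sum(axis=1, keepdims=True)
    avg_words = class_term.sum() / len(classes)
    idf = np.log(1 + avg_words / class_term.sum(axis=0))
    return tf * idf

def centroid_proximity(doc_emb, term_emb, labels):
    """Cosine similarity between each cluster centroid and each term embedding."""
    classes = np.unique(labels)
    centroids = np.stack([doc_emb[labels == c].mean(axis=0) for c in classes])
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    t = term_emb / np.linalg.norm(term_emb, axis=1, keepdims=True)
    return c @ t.T

# Hypothetical toy data: 3 documents in 2 clusters, 4 terms, 2-d embeddings.
vocab = ["goal", "match", "vote", "party"]
labels = np.array([0, 0, 1])
doc_term = np.array([[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 3, 1]], dtype=float)
doc_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
term_emb = np.array([[1.0, 0.1], [0.8, 0.2], [0.1, 1.0], [0.3, 0.7]])

ctfidf = c_tf_idf(doc_term, labels)
proximity = centroid_proximity(doc_emb, term_emb, labels)
print([vocab[i] for i in ctfidf.argmax(axis=1)])
print([vocab[i] for i in proximity.argmax(axis=1)])
```

The first score depends only on word counts inside each cluster, while the second depends only on geometry in embedding space, which is why the two approaches can rank terms differently on real corpora.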

Variational Autoencoders (CTM)

Contextual representations are used as ProdLDA encoder network inputs, either alone (ZeroShotTM) or concatenated to BoW (CombinedTM).

pip install turftopic[pyro-ppl]

from turftopic import AutoencodingTopicModel

model = AutoencodingTopicModel(10).fit(texts)
model.print_topics()
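
The difference between the two variants comes down to what the encoder sees. A minimal sketch of how the encoder input could be assembled is shown below; the function name `encoder_input`, the concatenation order, and the array sizes are assumptions for illustration and say nothing about how the package wires its network.

```python
import numpy as np

def encoder_input(embeddings, bow=None):
    """Build the encoder input for the two variants.

    ZeroShotTM-style: contextual embeddings only.
    CombinedTM-style: bag-of-words counts concatenated with the embeddings.
    """
    if bow is None:
        return embeddings
    return np.concatenate([bow, embeddings], axis=1)

# Stand-in data (assumption): 5 documents, 12-term vocabulary, 8-d embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 8))
bow = rng.integers(0, 3, size=(5, 12)).astype(float)

zero_shot_x = encoder_input(embeddings)        # shape (5, 8)
combined_x = encoder_input(embeddings, bow)    # shape (5, 20)
print(zero_shot_x.shape, combined_x.shape)
```

Because the first variant never feeds word counts to the encoder, it can encode documents containing words outside the training vocabulary, as long as an embedding can be computed for them.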

Download files

Source Distribution

turftopic-0.2.4.tar.gz (16.5 kB)

Uploaded Source

Built Distribution

turftopic-0.2.4-py3-none-any.whl (23.7 kB)

Uploaded Python 3
