Topic modeling with contextual representations from sentence transformers.

Topic modeling is your turf too.
Contextual topic models with representations from transformers.

Intentions

  • Provide simple, robust and fast implementations of existing approaches (BERTopic, Top2Vec, CTM) with minimal dependencies.
  • Implement state-of-the-art approaches from my papers (the papers are work in progress).
  • Put all approaches in a broader conceptual framework.
  • Provide clear and extensive documentation about the best use-cases for each model.
  • Make the models' API streamlined and compatible with topicwizard and scikit-learn.
  • Develop smarter, transformer-based evaluation metrics.

Warning: this package is still a prototype, and no papers about the models have been published yet. Until those are out and most features are implemented, I DO NOT recommend using this package in production or for academic work.

Roadmap

  • Model Implementation
  • Pretty Printing
  • Publish papers ⏳ (in progress)
  • Thorough documentation and good tutorials ⏳
  • Implement visualization utilities for these models in topicwizard ⏳
  • High-level topic descriptions with LLMs.
  • Contextualized evaluation metrics.

Implemented Models

Mixture of Gaussians (GMM)

Topic models where topics are assumed to be Multivariate Normal components, and term importances are estimated with Soft-c-TF-IDF.

from turftopic import GMM

# `texts` is assumed to be a list of raw document strings.
model = GMM(10).fit(texts)
model.print_topics()
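
To make the idea of term importance under soft topic membership more concrete, here is a minimal sketch in plain Python. It is not turftopic's implementation, and the library's exact Soft-c-TF-IDF formula may differ. The sketch assumes a BERTopic-style weighting (term frequency times log(1 + A / f_t), without normalization) and replaces hard cluster counts with counts weighted by each document's topic probability. The function names `soft_ctf_idf` and `top_terms` and the toy data are hypothetical.

```python
import math


def soft_ctf_idf(doc_term_counts, doc_topic_probs):
    """Score terms per topic from soft document-topic memberships.

    doc_term_counts: one row per document, one count per vocabulary term.
    doc_topic_probs: one row per document, one probability per topic.
    Returns a topics x terms matrix of importance scores.
    """
    n_terms = len(doc_term_counts[0])
    n_topics = len(doc_topic_probs[0])

    # Soft term frequency: every document adds its counts to every topic,
    # scaled by its membership probability for that topic.
    tf = [[0.0] * n_terms for _ in range(n_topics)]
    for counts, probs in zip(doc_term_counts, doc_topic_probs):
        for k, p in enumerate(probs):
            for t, c in enumerate(counts):
                tf[k][t] += p * c

    # f_t: corpus-wide frequency of each term. A: average words per topic.
    term_freq = [sum(row[t] for row in doc_term_counts) for t in range(n_terms)]
    avg_words = sum(term_freq) / n_topics

    # Assumed weighting: tf * log(1 + A / f_t); unseen terms score zero.
    return [
        [
            tf[k][t] * math.log(1 + avg_words / term_freq[t]) if term_freq[t] else 0.0
            for t in range(n_terms)
        ]
        for k in range(n_topics)
    ]


def top_terms(scores, vocab, n=2):
    """Return the n highest-scoring terms for each topic."""
    return [
        [vocab[t] for t in sorted(range(len(vocab)), key=lambda t: -row[t])[:n]]
        for row in scores
    ]


# Toy data (hypothetical): four documents, two topics.
vocab = ["goal", "match", "vote", "party"]
counts = [[3, 2, 0, 0], [2, 1, 0, 1], [0, 0, 4, 2], [0, 1, 1, 3]]
probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]

scores = soft_ctf_idf(counts, probs)
print(top_terms(scores, vocab))
```

With one-hot membership rows the same function reduces to a hard, cluster-based weighting.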

KeyNMF

Nonnegative Matrix Factorization over keyword importances based on transformer representations.

from turftopic import KeyNMF

model = KeyNMF(10).fit(texts)
model.print_topics()
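
The factorization step can be illustrated without any dependencies. The sketch below is not KeyNMF itself: it only shows nonnegative matrix factorization on a small, hand-written document-by-keyword importance matrix, using the classic multiplicative updates for the squared-error loss. How turftopic derives the keyword importances from transformer representations is not shown here; the matrix `V`, the keyword list and the helper names are all hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, n_components, n_iter=500, seed=0, eps=1e-9):
    """Approximate nonnegative V as W @ H with multiplicative updates."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    # Strictly positive random initialization keeps all updates well defined.
    w = [[rng.random() + eps for _ in range(n_components)] for _ in range(n_rows)]
    h = [[rng.random() + eps for _ in range(n_cols)] for _ in range(n_components)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [
            [h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_cols)]
            for i in range(n_components)
        ]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [
            [w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_components)]
            for i in range(n_rows)
        ]
    return w, h


# Toy keyword-importance matrix (hypothetical): rows are documents.
keywords = ["goal", "match", "vote", "party"]
V = [
    [0.9, 0.7, 0.0, 0.0],
    [0.8, 0.6, 0.0, 0.1],
    [0.0, 0.0, 0.9, 0.8],
    [0.0, 0.1, 0.8, 0.7],
]

W, H = nmf(V, n_components=2)

# Each row of H is a topic; its largest entries are the topic's keywords.
for topic in H:
    ranked = sorted(range(len(keywords)), key=lambda j: -topic[j])
    print([keywords[j] for j in ranked[:2]])
```

Rows of `W` play the role of document-topic weights, and rows of `H` the role of topic-keyword weights. The order of the topics depends on the random initialization.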

Semantic Signal Separation (S³)

Interprets topics as dimensions of semantics. Obtains these dimensions with ICA or PCA.

from turftopic import SemanticSignalSeparation

model = SemanticSignalSeparation(10).fit(texts)
model.print_topics()
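
As a rough illustration of "topics as dimensions", the following sketch finds the first PCA axis of a few toy two-dimensional document vectors by power iteration and then ranks terms by their projection onto that axis. It covers only the PCA case (ICA needs an additional unmixing step that is not shown), and ranking terms by projection is an assumption of this sketch rather than a description of turftopic's internals. All names and vectors are hypothetical.

```python
import math


def first_principal_axis(vectors, n_iter=100):
    """Return (mean, axis): the data mean and the unit direction of greatest
    variance, found by power iteration on the covariance matrix."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    cov = [
        [sum(r[i] * r[j] for r in centered) / n for j in range(d)]
        for i in range(d)
    ]
    # Deterministic, non-uniform start vector.
    axis = [1.0 / (j + 1) for j in range(d)]
    for _ in range(n_iter):
        axis = [sum(cov[i][j] * axis[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(a * a for a in axis))
        axis = [a / norm for a in axis]
    return mean, axis


def project(vector, mean, axis):
    """Signed position of a vector along the axis, relative to the mean."""
    return sum((x - m) * a for x, m, a in zip(vector, mean, axis))


# Toy "embeddings" (hypothetical); real sentence embeddings have hundreds of dimensions.
doc_vectors = [[2.0, 1.9], [1.0, 1.2], [-1.0, -0.8], [-2.0, -2.1], [0.1, -0.1]]
term_vectors = {
    "sports": [1.5, 1.4],
    "politics": [-1.6, -1.5],
    "weather": [0.2, -0.2],
}

mean, axis = first_principal_axis(doc_vectors)
ranked = sorted(term_vectors, key=lambda t: -project(term_vectors[t], mean, axis))
print(ranked)
```

Terms at the two ends of the ranking describe the opposite poles of the axis, while terms near zero are unrelated to it.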

Clustering Topic Models

Topics are clusters in embedding space, and term importances are estimated either with c-TF-IDF (BERTopic) or by proximity to the cluster centroid (Top2Vec).

from turftopic import ClusteringTopicModel

model = ClusteringTopicModel().fit(texts)
model.print_topics()
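
The centroid-proximity option can be sketched in a few lines of plain Python: average the document vectors in each cluster, then rank terms by cosine similarity to that centroid. This is an illustrative sketch under simplifying assumptions, not the code of turftopic or Top2Vec. The cluster labels are given by hand instead of coming from a clustering algorithm, and all names and vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def rank_terms_by_centroid(doc_vectors, labels, term_vectors):
    """Map each cluster label to its terms, most similar to the centroid first."""
    ranking = {}
    for label in sorted(set(labels)):
        members = [v for v, l in zip(doc_vectors, labels) if l == label]
        center = centroid(members)
        ranking[label] = sorted(
            term_vectors, key=lambda t: -cosine(term_vectors[t], center)
        )
    return ranking


# Toy data (hypothetical): two clusters of 2-D document vectors.
doc_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
labels = [0, 0, 1, 1]
term_vectors = {"goal": [0.8, 0.1], "vote": [0.1, 0.9], "news": [0.5, 0.5]}

ranking = rank_terms_by_centroid(doc_vectors, labels, term_vectors)
print(ranking)
```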

Variational Autoencoders (CTM)

Contextual representations are used as ProdLDA encoder network inputs, either alone (ZeroShotTM) or concatenated to BoW (CombinedTM).

This model needs the optional pyro-ppl extra:

pip install turftopic[pyro-ppl]

from turftopic import AutoencodingTopicModel

model = AutoencodingTopicModel(10).fit(texts)
model.print_topics()
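
The difference between the two input schemes comes down to what vector is fed to the encoder for each document. The sketch below only builds that input vector and involves no neural network. The function name `encoder_input` and the order of concatenation (bag-of-words first) are assumptions of this sketch, not turftopic's API.

```python
def encoder_input(embedding, bow=None):
    """Build the encoder input for one document.

    bow=None: use the contextual embedding alone (ZeroShotTM-style).
    Otherwise: bag-of-words counts followed by the embedding (CombinedTM-style).
    """
    if bow is None:
        return list(embedding)
    return list(bow) + list(embedding)


# Toy values (hypothetical): a 3-dimensional embedding and a 4-term vocabulary.
embedding = [0.12, -0.40, 0.33]
bow = [2, 0, 1, 0]

zero_shot = encoder_input(embedding)
combined = encoder_input(embedding, bow)
print(len(zero_shot), len(combined))
```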

Source Distribution

turftopic-0.2.4.tar.gz (16.5 kB, source)

Built Distribution

turftopic-0.2.4-py3-none-any.whl (23.7 kB, Python 3)
