Topic modeling with contextual representations from sentence transformers.
Topic modeling is your turf too.
Contextual topic models with representations from transformers.
Intentions
- Provide simple, robust and fast implementations of existing approaches (BERTopic, Top2Vec, CTM) with minimal dependencies.
- Implement state-of-the-art approaches from my papers (the papers are still work in progress).
- Put all approaches in a broader conceptual framework.
- Provide clear and extensive documentation about the best use cases for each model.
- Make the models' API streamlined and compatible with topicwizard and scikit-learn.
- Develop smarter, transformer-based evaluation metrics.
!!! This package is still a prototype, and no papers about the models have been published yet. Until they are out and most features are implemented, I DO NOT recommend using this package for production or academic use !!!
Roadmap
- Model Implementation
- Pretty Printing
- Publish papers ⏳ (in progress...)
- Thorough documentation and good tutorials ⏳
- Implement visualization utilities for these models in topicwizard ⏳
- High-level topic descriptions with LLMs.
- Contextualized evaluation metrics.
Implemented Models
Mixture of Gaussians (GMM)
Topic models where topics are assumed to be Multivariate Normal components, and term importances are estimated with Soft-c-TF-IDF.
from turftopic import GMM
# texts is a list of raw document strings
model = GMM(10).fit(texts)
model.print_topics()
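The underlying idea can be sketched with scikit-learn on synthetic data. This is an illustrative approximation, not turftopic's actual implementation: the embeddings and term counts are random stand-ins, and the Soft-c-TF-IDF weighting shown is a simplified variant.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy document embeddings: two blobs standing in for sentence-transformer outputs
embeddings = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 5)),
    rng.normal(1.0, 0.1, size=(20, 5)),
])
# Toy bag-of-words counts for the same 40 documents over a 30-term vocabulary
counts = rng.integers(0, 3, size=(40, 30))

# Topics as multivariate normal components of the embedding space
gmm = GaussianMixture(n_components=2, random_state=0).fit(embeddings)
resp = gmm.predict_proba(embeddings)  # soft component responsibilities, (40, 2)

# Soft-c-TF-IDF-style weighting: aggregate term counts per component,
# weighted by responsibilities, then apply an IDF-like rescaling
tf = resp.T @ counts                                        # (2, 30)
idf = np.log(1 + counts.sum() / (1 + counts.sum(axis=0)))   # (30,)
importance = tf * idf                                       # per-component term importances
top_terms = importance.argsort(axis=1)[:, ::-1][:, :5]      # top 5 term indices per topic
```

Because documents get soft responsibilities rather than hard assignments, every document contributes to every topic's term importances in proportion to how well the component explains it.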
KeyNMF
Nonnegative Matrix Factorization over keyword importances based on transformer representations.
from turftopic import KeyNMF
model = KeyNMF(10).fit(texts)
model.print_topics()
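A rough sketch of the idea with scikit-learn, using random vectors in place of real transformer embeddings. The keyword-importance construction here (clipped cosine similarity between document and term embeddings) is an assumption for illustration; turftopic's exact extraction step may differ.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Stand-ins for transformer embeddings of 40 documents and 30 vocabulary terms
doc_emb = rng.normal(size=(40, 8))
term_emb = rng.normal(size=(30, 8))

# Keyword importances: cosine similarity between each document and each term,
# clipped at zero so the matrix is nonnegative (as NMF requires)
doc_n = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
term_n = term_emb / np.linalg.norm(term_emb, axis=1, keepdims=True)
keywords = np.clip(doc_n @ term_n.T, 0, None)    # (40, 30)

# Factorize the keyword matrix into document-topic and topic-term factors
nmf = NMF(n_components=5, init="nndsvda", random_state=0, max_iter=500)
doc_topic = nmf.fit_transform(keywords)          # (40, 5)
topic_term = nmf.components_                     # (5, 30)
```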
Semantic Signal Separation (S³)
Interprets topics as dimensions of semantics. Obtains these dimensions with ICA or PCA.
from turftopic import SemanticSignalSeparation
model = SemanticSignalSeparation(10).fit(texts)
model.print_topics()
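The ICA variant can be sketched as follows with scikit-learn and random stand-in embeddings; scoring terms by unmixing their embeddings with the same learned transform is an illustrative choice, not necessarily the library's exact procedure.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
doc_emb = rng.normal(size=(60, 10))   # stand-in document embeddings
term_emb = rng.normal(size=(25, 10))  # stand-in vocabulary term embeddings

# Recover independent semantic axes from the document embeddings
ica = FastICA(n_components=3, random_state=0, whiten="unit-variance")
doc_topic = ica.fit_transform(doc_emb)   # documents expressed on the axes, (60, 3)

# Score vocabulary terms on the same axes by applying the learned unmixing
term_topic = ica.transform(term_emb)     # (25, 3)
# Terms with the largest absolute loading characterize each axis's two poles
top = np.abs(term_topic).argmax(axis=0)
```

Unlike cluster-based models, each axis is a continuous dimension: documents and terms can load positively or negatively on it, so a "topic" has two interpretable ends.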
Clustering Topic Models
Topics are clusters in embedding space and term importances are either estimated with c-TF-IDF (BERTopic) or proximity to cluster centroid (Top2Vec).
from turftopic import ClusteringTopicModel
model = ClusteringTopicModel().fit(texts)
model.print_topics()
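The centroid-proximity variant (Top2Vec-style term importance) can be sketched with KMeans on synthetic embeddings; turftopic's clustering model is configurable beyond this, so treat it as a minimal illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
doc_emb = rng.normal(size=(50, 8))    # stand-in document embeddings
term_emb = rng.normal(size=(30, 8))   # stand-in vocabulary term embeddings

# Topics are clusters of documents in embedding space
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(doc_emb)

# Top2Vec-style term importances: cosine proximity of each term embedding
# to each cluster centroid
cent = km.cluster_centers_
cent_n = cent / np.linalg.norm(cent, axis=1, keepdims=True)
term_n = term_emb / np.linalg.norm(term_emb, axis=1, keepdims=True)
importance = cent_n @ term_n.T                           # (4, 30)
top_terms = importance.argsort(axis=1)[:, ::-1][:, :5]   # top 5 terms per cluster
```

The c-TF-IDF alternative would instead aggregate term counts per cluster and reweight them, as BERTopic does; both describe the same clusters, just from different evidence.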
Variational Autoencoders (CTM)
Contextual representations are used as inputs to a ProdLDA encoder network, either on their own (ZeroShotTM) or concatenated with bag-of-words representations (CombinedTM).
pip install turftopic[pyro-ppl]
from turftopic import AutoencodingTopicModel
model = AutoencodingTopicModel(10).fit(texts)
model.print_topics()
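The encoder-input construction and the variational step can be sketched in plain numpy. The real model trains a neural encoder with a proper ELBO objective (hence the Pyro dependency); the linear maps below are random, untrained stand-ins that only illustrate the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics = 10
contextual = rng.normal(size=(40, 384))   # stand-in sentence-transformer embeddings
bow = rng.random(size=(40, 500))          # stand-in bag-of-words features

# CombinedTM feeds the concatenation to the encoder;
# ZeroShotTM would use `contextual` alone
enc_in = np.concatenate([contextual, bow], axis=1)

# Untrained stand-in for the ProdLDA encoder: linear maps to the mean and
# log-variance of the latent topic distribution
W_mu = rng.normal(0, 0.01, size=(enc_in.shape[1], n_topics))
W_lv = rng.normal(0, 0.01, size=(enc_in.shape[1], n_topics))
mu, logvar = enc_in @ W_mu, enc_in @ W_lv

# Reparameterization trick, then softmax to get document-topic proportions
z = mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)
theta = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)   # rows sum to 1
```

Because ZeroShotTM's encoder sees only the contextual embeddings, it can infer topic proportions for documents whose vocabulary (or language) never appeared in training.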