Topic modeling using sentence_transformer
transformertopic
Topic Modeling using sentence embeddings. The procedure is:
- split paragraphs into sentences
- compute sentence embeddings
- reduce the dimensionality of these embeddings
- cluster them with HDBSCAN
- compute a human-readable representation of each cluster/topic
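The first step of the procedure can be sketched with the standard library alone. This is a minimal illustration, not the package's own splitter: the regular expression and the helper name `split_into_sentences` are assumptions, and a real pipeline would likely use a proper sentence tokenizer. Each sentence keeps the index of its paragraph, so that topics assigned per sentence can later be traced back to their source text.

```python
import re

# Naive rule (an assumption): a sentence ends at '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_into_sentences(paragraphs):
    """Return (paragraph_index, sentence) pairs for a list of paragraphs."""
    pairs = []
    for idx, paragraph in enumerate(paragraphs):
        for sentence in _SENTENCE_END.split(paragraph.strip()):
            if sentence:  # skip empty strings produced by blank paragraphs
                pairs.append((idx, sentence))
    return pairs


paragraphs = [
    "Topic models group text. They work on embeddings!",
    "Is clustering hard? Not always.",
]
for idx, sentence in split_into_sentences(paragraphs):
    print(idx, sentence)
```

Abbreviations such as "e.g." would be split incorrectly by this rule, which is one reason to prefer a dedicated tokenizer in practice.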
This is inspired by the topic modeling procedure described by Maarten Grootendorst, who also maintains his own implementation (BERTopic).
I wanted to code it myself, and I added the features marked with a ⭐ below, which as far as I know are not available in Grootendorst's implementation.
Features:
- Compute topic modeling
- Compute dynamic topic modeling ("trends" here)
- ⭐ Assign topics on sentence rather than document level
- ⭐ Experiment with different dimension reducers
- ⭐ Experiment with different ways to generate a wordcloud from a topic
- ⭐ Infer topics of new batches of docs without retraining
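One simple way to assign topics to new sentences without retraining is nearest-centroid assignment in the embedding space. The sketch below illustrates that idea only. It is not necessarily how transformertopic does it, and the function names are hypothetical. It assumes embeddings are plain lists of floats and that noise points carry the label -1, as HDBSCAN labels them.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def fit_centroids(embeddings, labels):
    """Map each topic label to the centroid of its embeddings, skipping noise (-1)."""
    groups = {}
    for vector, label in zip(embeddings, labels):
        if label != -1:
            groups.setdefault(label, []).append(vector)
    return {label: centroid(vectors) for label, vectors in groups.items()}


def assign_topic(vector, centroids):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))


# Toy 2-D "embeddings" with cluster labels from an earlier training run.
embeddings = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.0, 5.2], [9.0, -9.0]]
labels = [0, 0, 1, 1, -1]
centroids = fit_centroids(embeddings, labels)
print(assign_topic([4.5, 4.8], centroids))
```

This variant always returns some topic. A distance threshold would be needed to label a new sentence as noise.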
Usage
See also test.py.
Choose a reducer
from transformertopic.dimensionReducers import PacmapEmbeddings, UmapEmbeddings, TsneEmbeddings
#reducer = PacmapEmbeddings()
#reducer = TsneEmbeddings()
reducer = UmapEmbeddings(umapNNeighbors=13)
Init and run the model
from transformertopic import TransformerTopic
tt = TransformerTopic(dimensionReducer=reducer, hdbscanMinClusterSize=20)
# pandasDf is a pandas DataFrame with a date column and a text column
tt.train(documentsDataFrame=pandasDf, dateColumn='date', textColumn='coref_text', copyOtherColumns=True)
print(f"Found {tt.nTopics} topics")
print(tt.df.info())
If you want to use different embeddings, you can pass the SentenceTransformer model name via the stEmbeddings init argument of TransformerTopic.
Show sizes of largest topics
N = 10
topNtopics = tt.showTopicSizes(N)
Choose a cluster representator and show wordclouds for the biggest topics
from transformertopic.clusterRepresentators import TextRank, Tfidf, KMaxoids
representator = Tfidf()
# representator = TextRank()
tt.showWordclouds(topNtopics, clusterRepresentator=representator)
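To give an idea of what a TF-IDF based representation of a cluster can look like, here is a minimal standard-library sketch. It treats each cluster as one document and ranks words by term frequency times inverse cluster frequency. It is an illustration under those assumptions, not the package's `Tfidf` class. The function name `top_words_per_cluster` and the tokenization are hypothetical.

```python
import math
import re
from collections import Counter


def top_words_per_cluster(sentences, labels, k=3):
    """Rank words per cluster with TF-IDF, treating each cluster as one document."""
    counts = {}
    for sentence, label in zip(sentences, labels):
        words = re.findall(r"[a-z]+", sentence.lower())
        counts.setdefault(label, Counter()).update(words)
    n_clusters = len(counts)
    # Cluster frequency: in how many clusters does each word occur?
    cluster_freq = Counter(word for counter in counts.values() for word in counter)
    result = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        scores = {
            word: (count / total) * math.log(n_clusters / cluster_freq[word])
            for word, count in counter.items()
        }
        # Highest score first; ties are broken alphabetically for a stable order.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[label] = ranked[:k]
    return result


sentences = [
    "The cat chased the mouse",
    "A cat sleeps all day",
    "The market fell today",
    "Investors left the market",
]
labels = [0, 0, 1, 1]
print(top_words_per_cluster(sentences, labels, k=1))
```

Words that occur in every cluster, such as "the" here, get a score of zero and therefore never lead the ranking.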
Show the frequency of topics over time (dynamic topic modeling, or "trends"):
tt.showTopicTrends()
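Conceptually, a trend is the number (or share) of sentences per topic in each time period. The sketch below shows that aggregation with the standard library, assuming monthly periods and one date and one topic label per sentence. The helper `topic_trends` is hypothetical and is not part of the package's API.

```python
from collections import Counter
from datetime import date


def topic_trends(dates, labels):
    """Count sentences per (year, month) period and topic label."""
    trends = {}
    for day, label in zip(dates, labels):
        period = (day.year, day.month)
        trends.setdefault(period, Counter())[label] += 1
    return trends


dates = [date(2021, 1, 5), date(2021, 1, 20), date(2021, 2, 3), date(2021, 2, 14)]
labels = [0, 1, 1, 1]
for period, counts in sorted(topic_trends(dates, labels).items()):
    total = sum(counts.values())
    # Share of each topic within the period.
    shares = {label: count / total for label, count in sorted(counts.items())}
    print(period, shares)
```

Normalizing by the period total, as in the loop above, keeps periods with many documents from dominating the plot.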