transformertopic
Topic modeling using sentence-transformers embeddings.
Topic Modeling using sentence embeddings. The procedure is:
- compute sentence embeddings
- compute dimension reduction of these
- cluster them
- compute a human-readable representation of each cluster/topic
This is inspired by the topic modeling procedure described by Maarten Grootendorst, who also provides his own implementation (BERTopic).
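The four steps above can be sketched end to end. The sketch below is an illustrative stand-in only: it uses random vectors instead of sentence-transformers embeddings, truncated SVD instead of UMAP/PaCMAP/t-SNE, a single naive k-means assignment instead of HDBSCAN, and raw word counts as the human-readable representation. It shows the pipeline shape, not the library's actual code.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Step 1: sentence embeddings -- stand-in: random vectors, one per sentence.
# (The library computes these with a sentence-transformers model.)
sentences = ["the cat sat", "a cat slept", "stocks fell", "markets fell hard"]
embeddings = rng.normal(size=(len(sentences), 384))

# Step 2: dimension reduction -- stand-in: truncated SVD down to 2 components.
centered = embeddings - embeddings.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ Vt[:2].T

# Step 3: clustering -- stand-in: one nearest-centroid assignment
# with the first two points as naive centroids (HDBSCAN in the library).
centroids = reduced[:2]
dists = np.linalg.norm(reduced[:, None] - centroids[None], axis=2)
labels = dists.argmin(axis=1)

# Step 4: human-readable representation -- most frequent words per cluster.
for k in sorted(set(labels)):
    words = Counter(w for i, s in enumerate(sentences)
                    if labels[i] == k for w in s.split())
    print(k, [w for w, _ in words.most_common(3)])
```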
Usage
Choose a reducer
from transformertopic.dimensionReducers import PacmapEmbeddings, UmapEmbeddings, TsneEmbeddings
#reducer = PacmapEmbeddings()
#reducer = TsneEmbeddings()
reducer = UmapEmbeddings(umapNNeighbors=13)
Init and run the model
from transformertopic import TransformerTopic
tt = TransformerTopic(dimensionReducer=reducer, hdbscanMinClusterSize=20)
tt.train(documentsDataFrame=pandasDf, dateColumn='date', textColumn='coref_text', copyOtherColumns=True)
print(f"Found {tt.nTopics} topics")
print(tt.df.info())
Show sizes of largest topics
N = 10
topNtopics = tt.showTopicSizes(N)
Choose a cluster representator and show wordclouds for the biggest topics
from transformertopic.clusterRepresentators import TextRank, Tfidf, KMaxoids
representator = Tfidf()
# representator = TextRank()
tt.showWordclouds(topNtopics, clusterRepresentator=representator)
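A cluster representator turns each cluster's text into a few characteristic terms. The sketch below illustrates the TF-IDF idea behind the `Tfidf` representator, treating each cluster as one document; it is a self-contained approximation, not the library's implementation.

```python
import math
from collections import Counter

def top_tfidf_terms(cluster_texts, n=3):
    """Rank each cluster's terms by TF-IDF, treating each cluster as one document."""
    counts = [Counter(text.split()) for text in cluster_texts]
    n_clusters = len(cluster_texts)
    # Document frequency: in how many clusters does each term appear?
    df = Counter(w for c in counts for w in c)
    top_terms = []
    for c in counts:
        total = sum(c.values())
        scores = {w: (f / total) * math.log(n_clusters / df[w])
                  for w, f in c.items()}
        top_terms.append(sorted(scores, key=scores.get, reverse=True)[:n])
    return top_terms

terms = top_tfidf_terms(["cat cat sat mat", "stocks fell stocks rose"])
print(terms)
```

Terms shared by every cluster get a zero IDF and sink to the bottom, which is why distinctive words surface in the wordclouds.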