Topic modeling using sentence_transformer

transformertopic

Topic Modeling using sentence embeddings. The procedure is:

  1. compute sentence embeddings for the documents
  2. reduce the dimensionality of the embeddings
  3. cluster the reduced embeddings
  4. compute a human-readable representation of each cluster/topic
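
The four steps above can be sketched end to end with the standard library only. This is a toy illustration of the pipeline's shape, not how transformertopic is implemented: bag-of-words counts stand in for sentence embeddings, a random projection stands in for UMAP/PaCMAP/t-SNE, plain k-means stands in for HDBSCAN, and most-frequent words stand in for the cluster representation. All function names here are hypothetical.

```python
import math
import random
from collections import Counter

def embed(sentences):
    """Step 1 stand-in: bag-of-words count vectors (not real sentence embeddings)."""
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for s in sentences:
        v = [0.0] * len(vocab)
        for w in s.lower().split():
            v[index[w]] += 1.0
        vectors.append(v)
    return vectors

def reduce_dims(vectors, n_components=2, seed=0):
    """Step 2 stand-in: Gaussian random projection to n_components dimensions."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    proj = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_components)]
    return [[sum(p * x for p, x in zip(row, v)) for row in proj] for v in vectors]

def cluster(points, k=2, iters=10):
    """Step 3 stand-in: plain k-means, initialised from the first k points."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Move each center to the mean of its members (empty clusters keep their center).
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def represent(sentences, labels, top_n=3):
    """Step 4 stand-in: the most frequent words of each cluster."""
    topics = {}
    for label in sorted(set(labels)):
        words = Counter(
            w for s, l in zip(sentences, labels) if l == label for w in s.lower().split()
        )
        topics[label] = [w for w, _ in words.most_common(top_n)]
    return topics

docs = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are assets",
    "bonds hedge stocks",
]
labels = cluster(reduce_dims(embed(docs)), k=2)
topics = represent(docs, labels)
for label, words in topics.items():
    print(label, words)
```

On real data each stand-in would be replaced by the corresponding component (a sentence embedding model, one of the reducers below, a density-based clusterer), but the data flow between the steps stays the same.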

This is inspired by the Topic Modeling procedure described by Maarten Grootendorst, who also provides his own implementation.

Usage

Choose a reducer

from transformertopic.dimensionReducers import PacmapEmbeddings, UmapEmbeddings, TsneEmbeddings
# Alternatives:
# reducer = PacmapEmbeddings()
# reducer = TsneEmbeddings()
reducer = UmapEmbeddings(umapNNeighbors=13)

Init and run the model

from transformertopic import TransformerTopic
tt = TransformerTopic(dimensionReducer=reducer, hdbscanMinClusterSize=20)
# pandasDf is a pandas DataFrame with one document per row
tt.train(documentsDataFrame=pandasDf, dateColumn='date', textColumn='coref_text', copyOtherColumns=True)
print(f"Found {tt.nTopics} topics")
print(tt.df.info())
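
The `train` call above reads its input from a pandas DataFrame and is told which columns hold the date and the text. A minimal sketch of such an input follows. The rows and the extra `source` column are invented for illustration, and whether the library expects datetime values or plain strings in the date column is an assumption. A real corpus would need far more rows than this, given the `hdbscanMinClusterSize=20` setting used above.

```python
import pandas as pd

# Hypothetical sample data: one document per row.
pandasDf = pd.DataFrame(
    {
        # Assumption: dates are parsed to datetime before training.
        "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
        # Column names match the dateColumn/textColumn arguments used above.
        "coref_text": [
            "The central bank raised interest rates.",
            "The football season starts next week.",
            "Inflation figures surprised the markets.",
        ],
        # An extra column, of the kind copyOtherColumns=True would presumably carry over.
        "source": ["news", "sports", "news"],
    }
)
print(pandasDf.dtypes)
```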

Show sizes of largest topics

N = 10
topNtopics = tt.showTopicSizes(N)

Choose a cluster representator and show wordclouds for the biggest topics

from transformertopic.clusterRepresentators import TextRank, Tfidf, KMaxoids
representator = Tfidf()
# representator = TextRank()
tt.showWordclouds(topNtopics, clusterRepresentator=representator)
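
To give an idea of what a TF-IDF based cluster representation can look like, here is a generic sketch that treats each cluster as one large document and scores words by term frequency times inverse cluster frequency. It is not taken from the library: how the `Tfidf` representator works internally is not described here, and the function name `tfidf_top_words` is hypothetical.

```python
import math
from collections import Counter

def tfidf_top_words(clusters, top_n=3):
    """clusters maps topic id -> list of texts; returns topic id -> [(word, score), ...]."""
    # Word counts per cluster, treating each cluster as a single document.
    counts = {
        cid: Counter(w for text in texts for w in text.lower().split())
        for cid, texts in clusters.items()
    }
    n_clusters = len(counts)
    # In how many clusters does each word occur?
    cluster_freq = Counter(w for c in counts.values() for w in c)
    result = {}
    for cid, c in counts.items():
        total = sum(c.values())
        # Words that occur in every cluster get log(1) = 0 and drop to the bottom.
        scores = {w: (n / total) * math.log(n_clusters / cluster_freq[w]) for w, n in c.items()}
        # Sort by descending score, breaking ties alphabetically.
        result[cid] = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    return result

clusters = {
    0: ["the cat sat", "the cat slept"],
    1: ["the market fell", "the market rose"],
}
top = tfidf_top_words(clusters, top_n=1)
print(top[0][0][0], top[1][0][0])  # cat market
```

Scores of this kind can serve as word weights in a word cloud, so that words specific to one topic appear larger than words shared by all topics.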
