Topic modeling using sentence_transformer

transformertopic

Topic Modeling using sentence embeddings. The procedure is:

  1. split paragraphs into sentences
  2. compute sentence embeddings
  3. reduce the dimensionality of these embeddings
  4. cluster them with HDBSCAN
  5. compute a human-readable representation of each cluster/topic
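
Steps 1 to 4 can be sketched roughly as follows. This is a minimal illustration and not the package's actual code: the regex sentence splitter, the model name `all-MiniLM-L6-v2`, the choice of 5 output dimensions and the function names are all assumptions made for the example. It calls the `sentence-transformers`, `umap-learn` and `hdbscan` libraries directly.

```python
import re


def split_sentences(paragraph):
    """Step 1: naive splitter that breaks after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def cluster_sentences(sentences, model_name="all-MiniLM-L6-v2",
                      n_neighbors=13, min_cluster_size=20):
    """Steps 2 to 4: embed, reduce, cluster. Returns one label per sentence."""
    # Third-party imports are kept local so that step 1 can be tried without them.
    from sentence_transformers import SentenceTransformer
    import umap
    import hdbscan

    # Step 2: one embedding vector per sentence
    embeddings = SentenceTransformer(model_name).encode(sentences)
    # Step 3: reduce to a few dimensions (5 is an arbitrary choice here)
    reduced = umap.UMAP(n_neighbors=n_neighbors, n_components=5).fit_transform(embeddings)
    # Step 4: density-based clustering; HDBSCAN labels noise points as -1
    return hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(reduced)
```

With enough sentences to form clusters, `cluster_sentences(split_sentences(text))` would return an array of cluster labels, one per sentence.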

This is inspired by the topic modeling procedure described by Maarten Grootendorst, who has also published his own implementation.

I wanted to code it myself and to add the features marked with a ⭐ below, which as far as I know are not available in Grootendorst's implementation.

Features:

  • Compute topic modeling
  • Compute dynamic topic modeling ("trends" here)
  • ⭐ Assign topics on sentence rather than document level
  • ⭐ Experiment with different dimension reducers
  • ⭐ Experiment with different ways to generate a wordcloud from a topic
  • ⭐ Infer topics of new batches of docs without retraining
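
To illustrate what sentence-level assignment makes possible: once every sentence has a topic label, a document can be described by a mix of topics rather than a single one. The helper below is a hypothetical sketch, not part of the package. It assumes one document id and one topic label per sentence, with -1 marking noise as HDBSCAN does.

```python
from collections import Counter, defaultdict


def topics_per_document(doc_ids, sentence_topics):
    """Count how many sentences of each document fall into each topic."""
    counts = defaultdict(Counter)
    for doc_id, topic in zip(doc_ids, sentence_topics):
        if topic != -1:  # skip sentences that HDBSCAN left unclustered
            counts[doc_id][topic] += 1
    return {doc_id: dict(c) for doc_id, c in counts.items()}


# Document 0 has three sentences (two in topic 2, one in topic 5);
# document 1 has one noise sentence and one sentence in topic 3.
mix = topics_per_document([0, 0, 0, 1, 1], [2, 2, 5, -1, 3])
```

Documents whose sentences are all noise do not appear in the result.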

Usage

See also test.py.

Choose a reducer

from transformertopic.dimensionReducers import PacmapEmbeddings, UmapEmbeddings, TsneEmbeddings
#reducer = PacmapEmbeddings()
#reducer = TsneEmbeddings()
reducer = UmapEmbeddings(umapNNeighbors=13)

Init and run the model

from transformertopic import TransformerTopic
tt = TransformerTopic(dimensionReducer=reducer, hdbscanMinClusterSize=20)
# pandasDf is a pandas DataFrame holding the documents, with a date column and a text column
tt.train(documentsDataFrame=pandasDf, dateColumn='date', textColumn='coref_text', copyOtherColumns=True)
print(f"Found {tt.nTopics} topics")
print(tt.df.info())

If you want to use different embeddings, you can pass the SentenceTransformer model name via the stEmbeddings init argument to TransformerTopic.

Show sizes of largest topics

N = 10
topNtopics = tt.showTopicSizes(N)

Choose a cluster representator and show wordclouds for the biggest topics

from transformertopic.clusterRepresentators import TextRank, Tfidf, KMaxoids
representator = Tfidf()
# representator = TextRank()
tt.showWordclouds(topNtopics, clusterRepresentator=representator)
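
As a rough idea of what a TF-IDF style representator might compute, the sketch below treats each cluster as one large document and scores words by how frequent they are inside the cluster and how rare they are across clusters. It is an assumption-laden illustration: the function name, the tokenizer and the exact weighting are made up for the example and may differ from the package's `Tfidf` class.

```python
import math
import re
from collections import Counter


def top_words_per_cluster(sentences, labels, k=3):
    """Return the k highest-scoring words for each cluster label."""
    # Merge all sentences of a cluster into one bag of lowercase tokens
    cluster_tokens = {}
    for sentence, label in zip(sentences, labels):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        cluster_tokens.setdefault(label, []).extend(tokens)

    # Document frequency: in how many clusters does each word occur?
    n_clusters = len(cluster_tokens)
    df = Counter()
    for tokens in cluster_tokens.values():
        df.update(set(tokens))

    result = {}
    for label, tokens in cluster_tokens.items():
        tf = Counter(tokens)
        scores = {
            word: (count / len(tokens)) * (math.log(n_clusters / df[word]) + 1)
            for word, count in tf.items()
        }
        # Sort by descending score, then alphabetically so ties are deterministic
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[label] = ranked[:k]
    return result


words = top_words_per_cluster(
    ["cats purr", "cats meow", "stocks rise", "stocks fall fast"],
    [0, 0, 1, 1],
    k=1,
)
```

The scores could then serve as word sizes in a wordcloud.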

Show the frequency of topics over time (dynamic topic modeling, or trends):

tt.showTopicTrends()
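
Conceptually, a trend is a count of how often each topic occurs per time bucket. The following sketch shows that idea with monthly buckets. The function is hypothetical and only illustrates the counting; it is not how `showTopicTrends` is implemented.

```python
from collections import Counter
from datetime import date


def topic_trends(dates, topics):
    """Count sentences per (month, topic) pair, ignoring noise (-1)."""
    counts = Counter()
    for d, topic in zip(dates, topics):
        if topic != -1:
            counts[(d.strftime("%Y-%m"), topic)] += 1
    return counts


trends = topic_trends(
    [date(2021, 1, 5), date(2021, 1, 20), date(2021, 2, 1), date(2021, 2, 3)],
    [0, 0, 1, -1],
)
```

Plotting the counts of one topic across the sorted months gives its trend line.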
