
transformertopic

Topic Modeling using sentence embeddings. This procedure works very well: in practice it almost always produces sensible topics and (from a practical point of view) renders all LDA variants obsolete.

This is my own implementation of the procedure described here by Maarten Grootendorst, who also has his own implementation available here. Thanks for this brilliant idea!

I wanted to code it myself, and this implementation has some features, marked with a ⭐, which as far as I know are not available in Grootendorst's implementation.

Features:

  • Compute topic modeling
  • Compute dynamic topic modeling ("trends" here)
  • ⭐ Assign topics on sentence rather than document level
  • ⭐ Experiment with different dimension reducers
  • ⭐ Experiment with different ways to generate a wordcloud from a topic
  • ⭐ Infer topics of new batches of docs without retraining

How it works

In the following, the words "cluster" and "topic" are used interchangeably. Note that in classic Topic Modeling procedures (e.g. those based on LDA) each document is a probability distribution over topics. In this sense, the procedure presented here can be seen as a special case in which these distributions are always degenerate and concentrate all of the probability on a single index.
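
As a toy illustration (made-up numbers, not output of this package):

# In LDA, a document is a distribution over topics, e.g.:
ldaDocTopics = {"topic_0": 0.62, "topic_1": 0.30, "topic_2": 0.08}
# Here, each sentence is assigned exactly one topic, i.e. a degenerate
# distribution putting all probability mass on a single index:
hardAssignment = {"topic_0": 0.0, "topic_1": 1.0, "topic_2": 0.0}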

The procedure is:

  1. split paragraphs into sentences
  2. compute sentence embeddings (using sentence transformers)
  3. compute dimension reduction of these embeddings (with umap, pacmap, tsne or pca)
  4. cluster them with HDBSCAN
  5. for each topic compute a "cluster representator": a dictionary with words as keys and ranks as values (using tfidf, textrank or kmaxoids [^1])
  6. use the cluster representators to compute wordclouds for each topic

[^1]: my own implementation, see kmaxoids.py
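
For reference, here is a minimal, self-contained sketch of steps 1-4, independent of this package's internals. It assumes sentence-transformers, umap-learn and hdbscan are installed; the model name, the naive sentence split and the parameter values are illustrative choices, not the package defaults.

from sentence_transformers import SentenceTransformer
import umap
import hdbscan

def clusterSentences(paragraphs, nNeighbors=13, minClusterSize=20):
    # 1. split paragraphs into sentences (naive split, for illustration only)
    sentences = [s.strip() for p in paragraphs for s in p.split(".") if s.strip()]
    # 2. compute sentence embeddings
    model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
    embeddings = model.encode(sentences)
    # 3. reduce the dimension of the embeddings
    reduced = umap.UMAP(n_neighbors=nNeighbors, n_components=5).fit_transform(embeddings)
    # 4. cluster the reduced embeddings; label -1 means HDBSCAN treats a sentence as noise
    labels = hdbscan.HDBSCAN(min_cluster_size=minClusterSize).fit_predict(reduced)
    return list(zip(sentences, labels))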

Installation

pip install -U transformertopic

Usage

See also test.py.

Choose a reducer

from transformertopic.dimensionReducers import PacmapEmbeddings, UmapEmbeddings, TsneEmbeddings
#reducer = PacmapEmbeddings()
#reducer = TsneEmbeddings()
reducer = UmapEmbeddings(umapNNeighbors=13)

Init and run the model

from transformertopic import TransformerTopic
tt = TransformerTopic(dimensionReducer=reducer, hdbscanMinClusterSize=20)
tt.train(documentsDataFrame=pandasDf, dateColumn='date', textColumn='coref_text', copyOtherColumns = True)
print(f"Found {tt.nTopics} topics")
print(tt.df.info())

If you want to use different embeddings, you can pass the SentenceTransformer model name via the stEmbeddings init argument to TransformerTopic.
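
For example (the model name below is only an illustration, not the package default):

tt = TransformerTopic(
    dimensionReducer=reducer,
    hdbscanMinClusterSize=20,
    stEmbeddings="paraphrase-multilingual-MiniLM-L12-v2"
)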

Show sizes of largest topics

N = 10
topNtopics = tt.showTopicSizes(N)

Choose a cluster representator and show wordclouds for the biggest topics

from transformertopic.clusterRepresentators import TextRank, Tfidf, KMaxoids
representator = Tfidf()
# representator = TextRank()
tt.showWordclouds(topNtopics, clusterRepresentator=representator)

Show the frequency of topics over time (dynamic topic modeling), i.e. trends:

tt.showTopicTrends()

Show topics in which "car" appears among the top 75 words of their cluster representation:

tt.searchForWordInTopics("car", topNWords=75)
