transformertopic
Topic Modeling using sentence embeddings. This procedure works very well: in practice it almost always produces sensible topics and, from a practical point of view, renders all LDA variants obsolete.
This is my own implementation of the procedure described by Maarten Grootendorst, who also has his own implementation (BERTopic). Thanks for this brilliant idea!
I wanted to code it myself, and this version has some features, marked with a ⭐ below, which as far as I know are not available in Grootendorst's implementation.
Features:
- Compute topic modeling
- Compute dynamic topic modeling ("trends" here)
- ⭐ Assign topics on sentence rather than document level
- ⭐ Experiment with different dimension reducers
- ⭐ Experiment with different ways to generate a wordcloud from a topic
- ⭐ Infer topics of new batches of docs without retraining
How it works
In the following, the words "cluster" and "topic" are used interchangeably. Note that in classic topic modeling procedures (e.g. those based on LDA), each document is a probability distribution over topics. In this sense the procedure presented here can be seen as a special case in which these distributions are always degenerate, i.e. they concentrate all probability on a single topic.
The procedure is (a minimal code sketch follows the list):
- split paragraphs into sentences
- compute sentence embeddings (using sentence transformers)
- compute dimension reduction of these embeddings (with umap, pacmap, tsne or pca)
- cluster them with HDBSCAN
- for each topic compute a "cluster representator": a dictionary with words as keys and ranks as values (using tfidf, textrank or kmaxoids [^1])
- use the cluster representators to compute wordclouds for each topic
[^1]: my own implementation, see kmaxoids.py
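To make this concrete, here is a minimal, self-contained sketch of the core pipeline, written directly against sentence-transformers, umap-learn, hdbscan and scikit-learn. The model name, parameters and toy corpus are illustrative assumptions, not transformertopic's internals; the library wraps all of these steps for you:

```python
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
import umap
import hdbscan

sentences = [
    "The car would not start this morning.",
    "My car needs new brakes.",
    "Electric cars are getting cheaper.",
    "The soup needs more salt.",
    "I baked fresh bread yesterday.",
    "This recipe calls for two eggs.",
]

# 1. sentence embeddings
model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
embeddings = model.encode(sentences)

# 2. dimension reduction
reduced = umap.UMAP(n_neighbors=2, n_components=2).fit_transform(embeddings)

# 3. clustering: each HDBSCAN cluster is a topic, label -1 means "noise"
labels = hdbscan.HDBSCAN(min_cluster_size=2).fit_predict(reduced)

# 4. a simple tfidf-style "cluster representator": ranked words per topic
texts_per_topic = {}
for sentence, label in zip(sentences, labels):
    texts_per_topic.setdefault(label, []).append(sentence)
for label, texts in texts_per_topic.items():
    vectorizer = TfidfVectorizer(stop_words="english")
    scores = vectorizer.fit_transform([" ".join(texts)]).toarray()[0]
    words = vectorizer.get_feature_names_out()
    top = sorted(zip(words, scores), key=lambda pair: -pair[1])[:5]
    print(label, dict(top))
```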
Installation
```
pip install -U transformertopic
```
Usage
See also test.py.
Choose a reducer:

```python
from transformertopic.dimensionReducers import PacmapEmbeddings, UmapEmbeddings, TsneEmbeddings
# reducer = PacmapEmbeddings()
# reducer = TsneEmbeddings()
reducer = UmapEmbeddings(umapNNeighbors=13)
```
Init and run the model:

```python
from transformertopic import TransformerTopic

tt = TransformerTopic(dimensionReducer=reducer, hdbscanMinClusterSize=20)
tt.train(documentsDataFrame=pandasDf, dateColumn='date', textColumn='coref_text', copyOtherColumns=True)
print(f"Found {tt.nTopics} topics")
print(tt.df.info())
```
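Here pandasDf is your own corpus; a minimal, purely illustrative DataFrame with the two columns referenced above could look like this:

```python
import pandas as pd

# Hypothetical toy corpus; any DataFrame with a date column and a text column works.
pandasDf = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-02-01"]),
    "coref_text": [
        "The car would not start this morning.",
        "I baked fresh bread yesterday.",
    ],
})
```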
If you want to use different embeddings, you can pass the SentenceTransformer model name via the stEmbeddings init argument to TransformerTopic.
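For example (the checkpoint name below is just one of the standard SentenceTransformer models, chosen for illustration):

```python
tt = TransformerTopic(
    dimensionReducer=reducer,
    hdbscanMinClusterSize=20,
    stEmbeddings="paraphrase-multilingual-MiniLM-L12-v2",
)
```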
Show the sizes of the largest topics:

```python
N = 10
topNtopics = tt.showTopicSizes(N)
```
Choose a cluster representator and show wordclouds for the biggest topics:

```python
from transformertopic.clusterRepresentators import TextRank, Tfidf, KMaxoids

representator = Tfidf()
# representator = TextRank()
tt.showWordclouds(topNtopics, clusterRepresentator=representator)
```
Show the frequency of topics over time (dynamic topic modeling), called "trends" here:

```python
tt.showTopicTrends()
```
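Conceptually, a trend is just how many sentences fall into each topic per time bin. A rough pandas sketch of that idea (hypothetical column names, not the library's internals):

```python
import pandas as pd

# Hypothetical frame: one row per sentence, with the topic label from clustering.
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-05", "2021-01-20", "2021-02-03", "2021-02-25"]),
    "topic": [0, 1, 0, 0],
})

# Count sentences per (month, topic); transformertopic plots curves like these.
trends = (
    df.groupby([pd.Grouper(key="date", freq="M"), "topic"])
    .size()
    .unstack(fill_value=0)
)
print(trends)
```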
Show topics in which "car" appears among the top 75 words of their cluster representation:

```python
tt.searchForWordInTopics("car", topNWords=75)
```