BERTopic performs topic modeling with state-of-the-art transformer models.

BERTopic

BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters, allowing for easily interpretable topics while keeping important words in the topic descriptions. It even supports visualizations similar to LDAvis!

The corresponding Medium posts can be found here and here.
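To make the c-TF-IDF idea more concrete, here is a minimal sketch in plain Python. It assumes one formulation of class-based TF-IDF: each cluster is treated as a single joined document, and a word's score is its frequency within the cluster (divided by the cluster's word count) multiplied by the log of the total number of documents over the word's frequency across all clusters. The function name `class_tf_idf` and the toy clusters are hypothetical, and BERTopic's own implementation may differ in details such as tokenization and smoothing.

```python
import math
from collections import Counter


def class_tf_idf(docs_per_class):
    """Score each word per class as (t / w) * log(m / total frequency of the word)."""
    # m: total number of original (unjoined) documents across all classes
    m = sum(len(docs) for docs in docs_per_class.values())

    # Word counts per class, treating each class as one joined document
    counts = {
        label: Counter(" ".join(docs).lower().split())
        for label, docs in docs_per_class.items()
    }

    # Frequency of each word summed over all classes
    totals = Counter()
    for class_counts in counts.values():
        totals.update(class_counts)

    scores = {}
    for label, class_counts in counts.items():
        w = sum(class_counts.values())  # total number of words in this class
        scores[label] = {
            word: (t / w) * math.log(m / totals[word])
            for word, t in class_counts.items()
        }
    return scores


# Hypothetical toy clusters, not taken from the 20 newsgroups data
clusters = {
    0: ["the windows disk", "the disk drive"],
    1: ["the car engine", "the engine oil"],
}
scores = class_tf_idf(clusters)
```

In this toy example, "the" occurs in every document and receives a score of zero in both clusters, while cluster-specific words such as "disk" receive positive scores. That is the property that keeps topic descriptions focused on distinctive words.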

Installation

Installation can be done using PyPI:

pip install bertopic

To use the visualization options, install BERTopic as follows:

pip install bertopic[visualization]

To use Flair embeddings, install BERTopic as follows:

pip install bertopic[flair]

Getting Started

For an in-depth overview of the features of BERTopic you can check the full documentation here or you can follow along with the Google Colab notebook here.

Quick Start

We start by extracting topics from the well-known 20 newsgroups dataset, which consists of English documents:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs)

After generating topics and their probabilities, we can access the most frequent topics that were generated:

>>> topic_model.get_topic_freq().head()
Topic	Count
-1	7288
49	3992
30	701
27	684
11	568
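A similar frequency overview can be sketched directly from the `topics` list returned by `fit_transform`, which holds one topic id per document. The helper below is a hypothetical illustration using only the standard library. It assumes that the id -1 marks outlier documents and leaves them out of the count.

```python
from collections import Counter


def topic_frequencies(topics, outlier_label=-1):
    """Return (topic, count) pairs sorted by count, skipping the outlier label."""
    counts = Counter(t for t in topics if t != outlier_label)
    return counts.most_common()


# Hypothetical assignments, one topic id per document
example_topics = [49, -1, 49, 30, -1, 49, 30, 27]
freq = topic_frequencies(example_topics)
print(freq)  # [(49, 3), (30, 2), (27, 1)]
```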

-1 refers to all outliers and should typically be ignored. Next, let's take a look at the most frequent topic that was generated, topic 49:

>>> topic_model.get_topic(49)
[('windows', 0.006152228076250982),
 ('drive', 0.004982897610645755),
 ('dos', 0.004845038866360651),
 ('file', 0.004140142872194834),
 ('disk', 0.004131678774810884),
 ('mac', 0.003624848635985097),
 ('memory', 0.0034840976976789903),
 ('software', 0.0034415334250699077),
 ('email', 0.0034239554442333257),
 ('pc', 0.003047105930670237)]
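Since `get_topic` returns a list of (word, score) pairs, a short human-readable label can be built from the highest-scoring words. The helper `topic_label` below is a hypothetical convenience function, not part of BERTopic. The sample data repeats the first four pairs of the output above.

```python
def topic_label(words_with_scores, top_n=3):
    """Join the top_n highest-scoring words into a short label."""
    ranked = sorted(words_with_scores, key=lambda pair: pair[1], reverse=True)
    return ", ".join(word for word, _ in ranked[:top_n])


# First four (word, score) pairs from the topic 49 output shown above
topic_49 = [
    ("windows", 0.006152228076250982),
    ("drive", 0.004982897610645755),
    ("dos", 0.004845038866360651),
    ("file", 0.004140142872194834),
]
label = topic_label(topic_49)
print(label)  # windows, drive, dos
```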

NOTE: Use BERTopic(language="multilingual") to select a model that supports 50+ languages.

Visualize Topics

After training our BERTopic model, we could go through perhaps a hundred topics one by one to get a good understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. Instead, we can visualize the generated topics in a way very similar to LDAvis:

topic_model.visualize_topics()

Embedding Models

The parameter embedding_model takes in a string pointing to a sentence-transformers model, a SentenceTransformer, or a Flair DocumentEmbedding model.

Sentence-Transformers
You can select any model from sentence-transformers here and pass it to BERTopic via embedding_model:

from bertopic import BERTopic
topic_model = BERTopic(embedding_model="xlm-r-bert-base-nli-stsb-mean-tokens")

Or select a SentenceTransformer model with your own parameters:

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
topic_model = BERTopic(embedding_model=sentence_model)

Flair
Flair allows you to choose almost any embedding model that is publicly available. Flair can be used as follows:

from bertopic import BERTopic
from flair.embeddings import TransformerDocumentEmbeddings

roberta = TransformerDocumentEmbeddings('roberta-base')
topic_model = BERTopic(embedding_model=roberta)

You can select any 🤗 transformers model here.

Custom Embeddings
You can also use previously generated embeddings by passing them to fit_transform():

topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs, embeddings)
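The snippet above does not show where `embeddings` comes from. One way to produce them, sketched below, is to encode the documents once with a sentence-transformers model and reuse the result. The helper name `fit_with_precomputed_embeddings` is hypothetical, and it assumes that `sentence_model` exposes `encode()` as a `SentenceTransformer` does and that `topic_model` is a `BERTopic` instance.

```python
def fit_with_precomputed_embeddings(docs, sentence_model, topic_model):
    """Encode docs once, then fit the topic model on those embeddings."""
    # Assumes an encode() method like SentenceTransformer.encode
    embeddings = sentence_model.encode(docs, show_progress_bar=False)

    # One embedding per document is required
    if len(embeddings) != len(docs):
        raise ValueError("Expected one embedding per document")

    topics, _ = topic_model.fit_transform(docs, embeddings)
    return topics, embeddings


# Intended usage (requires bertopic and sentence-transformers to be installed):
#   from bertopic import BERTopic
#   from sentence_transformers import SentenceTransformer
#   sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens")
#   topics, embeddings = fit_with_precomputed_embeddings(docs, sentence_model, BERTopic())
```

Returning the embeddings alongside the topics lets you cache them and skip re-encoding when you refit with different settings.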

Overview

Methods	Code
Fit the model	`topic_model.fit(docs)`
Fit the model and predict documents	`topic_model.fit_transform(docs)`
Predict new documents	`topic_model.transform([new_doc])`
Access single topic	`topic_model.get_topic(12)`
Access all topics	`topic_model.get_topics()`
Get topic freq	`topic_model.get_topic_freq()`
Visualize Topics	`topic_model.visualize_topics()`
Visualize Topic Probability Distribution	`topic_model.visualize_distribution(probabilities[0])`
Update topic representation	`topic_model.update_topics(docs, topics, n_gram_range=(1, 3))`
Reduce nr of topics	`topic_model.reduce_topics(docs, topics, nr_topics=30)`
Find topics	`topic_model.find_topics("vehicle")`
Save model	`topic_model.save("my_model")`
Load model	`BERTopic.load("my_model")`
Get parameters	`topic_model.get_params()`

Citation

To cite BERTopic in your work, please use the following bibtex reference:

@misc{grootendorst2020bertopic,
  author       = {Maarten Grootendorst},
  title        = {BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable topics.},
  year         = 2020,
  publisher    = {Zenodo},
  version      = {v0.5.0},
  doi          = {10.5281/zenodo.4430182},
  url          = {https://doi.org/10.5281/zenodo.4430182}
}

Project details

This version: 0.5.0 (Feb 8, 2021)
