
BERTopic performs topic modeling with state-of-the-art transformer models.

Project description


BERTopic

BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions. It even supports visualizations similar to LDAvis!

The corresponding Medium post can be found here.

Installation

Installation can be done using pip:

pip install bertopic

To use the visualization options, install BERTopic as follows:

pip install bertopic[visualization]
Installation Errors

PyTorch 1.4.0 or higher is recommended. If the installation gives an error, please install PyTorch first here.
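For example, a basic CPU-only install can often be done as follows, although the exact command depends on your platform and CUDA version (see the PyTorch website for the command matching your setup):

pip install torch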

Getting Started

For an in-depth overview of the features of BERTopic, you can check the full documentation here, or you can follow along with the Google Colab notebook here.

Quick Start

We start by extracting topics from the well-known 20 newsgroups dataset, which consists of English documents:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

model = BERTopic(language="english")
topics, probabilities = model.fit_transform(docs)

After generating topics and their probabilities, we can access the most frequent topics:

>>> model.get_topic_freq().head()
Topic	Count
-1	7288
49	3992
30	701
27	684
11	568

-1 refers to all outliers and should typically be ignored. Next, let's take a look at the most frequent topic that was generated, topic 49:

>>> model.get_topic(49)
[('windows', 0.006152228076250982),
 ('drive', 0.004982897610645755),
 ('dos', 0.004845038866360651),
 ('file', 0.004140142872194834),
 ('disk', 0.004131678774810884),
 ('mac', 0.003624848635985097),
 ('memory', 0.0034840976976789903),
 ('software', 0.0034415334250699077),
 ('email', 0.0034239554442333257),
 ('pc', 0.003047105930670237)]
Supported Languages
Use "multilingual" to select a model that supports 50+ languages.

Moreover, the following languages are supported:
Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanize, Bosnian, Breton, Bulgarian, Burmese, Burmese zawgyi font, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanize, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanize, Telugu, Telugu Romanize, Thai, Turkish, Ukrainian, Urdu, Urdu Romanize, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish
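For instance, a minimal sketch that mirrors the quick start above but selects the multilingual model:

from bertopic import BERTopic

# "multilingual" loads an embedding model covering the 50+ languages listed above
model = BERTopic(language="multilingual")
topics, probabilities = model.fit_transform(docs)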

Visualize Topics

After having trained our BERTopic model, we can iteratively go through perhaps a hundred topics to get a good understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. Instead, we can visualize the topics that were generated in a way very similar to LDAvis:

model.visualize_topics()

Visualize Topic Probabilities

The variable probabilities that is returned from transform() or fit_transform() can be used to understand how confident BERTopic is that certain topics can be found in a document.

To visualize the distributions, we simply call:

# Make sure to input the probabilities of a single document!
model.visualize_distribution(probabilities[0])

Embedding Models

You can select any model from sentence-transformers and pass it to BERTopic through the embedding_model parameter:

from bertopic import BERTopic
model = BERTopic(embedding_model="xlm-r-bert-base-nli-stsb-mean-tokens")

You can also use previously generated embeddings by passing them to fit_transform():

model = BERTopic()
topics, probabilities = model.fit_transform(docs, embeddings)
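For illustration, one way to precompute such embeddings is with sentence-transformers directly (a minimal sketch; the model name below is only an example):

from sentence_transformers import SentenceTransformer

# Encode all documents once; the resulting array can then be reused across BERTopic runs
sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens")
embeddings = sentence_model.encode(docs, show_progress_bar=True)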

Click here for a list of supported sentence transformers models.

Overview

Methods | Code
Fit the model | model.fit(docs)
Fit the model and predict documents | model.fit_transform(docs)
Predict new documents | model.transform([new_doc])
Access single topic | model.get_topic(12)
Access all topics | model.get_topics()
Get topic frequencies | model.get_topic_freq()
Visualize topics | model.visualize_topics()
Visualize topic probability distribution | model.visualize_distribution(probabilities[0])
Update topic representation | model.update_topics(docs, topics, n_gram_range=(1, 3))
Reduce number of topics | model.reduce_topics(docs, topics, probabilities, nr_topics=30)
Find topics | model.find_topics("vehicle")
Save model | model.save("my_model")
Load model | BERTopic.load("my_model")
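As a rough end-to-end sketch combining several of these methods (reusing docs, topics, and probabilities from the quick start; the unpacked return values are assumptions based on typical usage):

# Find topics related to a search term
similar_topics, similarity = model.find_topics("vehicle")

# Reduce the number of topics after training
new_topics, new_probabilities = model.reduce_topics(docs, topics, probabilities, nr_topics=30)

# Save the model to disk and load it back later
model.save("my_model")
loaded_model = BERTopic.load("my_model")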

Citation

To cite BERTopic in your work, please use the following BibTeX reference:

@misc{grootendorst2020bertopic,
  author       = {Maarten Grootendorst},
  title        = {BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable topics.},
  year         = 2020,
  publisher    = {Zenodo},
  version      = {v0.4.2},
  doi          = {10.5281/zenodo.4430182},
  url          = {https://doi.org/10.5281/zenodo.4430182}
}

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bertopic-0.4.3.tar.gz (21.8 kB)


Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

bertopic-0.4.3-py2.py3-none-any.whl (20.6 kB)


File details

Details for the file bertopic-0.4.3.tar.gz.

File metadata

  • Download URL: bertopic-0.4.3.tar.gz
  • Upload date:
  • Size: 21.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.23.0 setuptools/41.4.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.4

File hashes

Hashes for bertopic-0.4.3.tar.gz
Algorithm Hash digest
SHA256 da5489a451d0d16020dbf67a6e38f705b5839343259c2db237600d1c87ab4b37
MD5 69147dba4935c8596940da64496b918e
BLAKE2b-256 9d326e510465a64d3c3e583a258dedd6cd00883fa168791e9e882f677d197421

See more details on using hashes here.
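As an illustration, the SHA256 digest above can be verified locally with a short Python snippet (a minimal sketch; it assumes the archive sits in the current directory):

import hashlib

# Compare the digest of the downloaded archive against the published SHA256 value
expected = "da5489a451d0d16020dbf67a6e38f705b5839343259c2db237600d1c87ab4b37"
with open("bertopic-0.4.3.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
assert digest == expected, "hash mismatch: the file may be corrupted"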

File details

Details for the file bertopic-0.4.3-py2.py3-none-any.whl.

File metadata

  • Download URL: bertopic-0.4.3-py2.py3-none-any.whl
  • Upload date:
  • Size: 20.6 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.23.0 setuptools/41.4.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.4

File hashes

Hashes for bertopic-0.4.3-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 c92f6783ffb28041d5ac64912b319db307447635d66b002f59b03e41c8dac000
MD5 802d38a5512bca4a54376d3d88c22013
BLAKE2b-256 7da5700851ac2bc1068462b8ee18b52b54a6716b4f90d758b232105b50c9f227

See more details on using hashes here.
