Topic modeling with contextual representations from sentence transformers.


Topic modeling is your turf too.
Contextual topic models with representations from transformers.

Features

SOTA Transformer-based Topic Models: :compass: S³, :key: KeyNMF, :gem: GMM, 🗻 Topeax, 🌀 SensTopic, Clustering Models (BERTopic and Top2Vec), Autoencoding models (ZeroShotTM and CombinedTM), FASTopic
Models for all Scenarios: :chart_with_upwards_trend: Dynamic, :ocean: Online, :herb: Seeded, :evergreen_tree: Hierarchical, and :camera: Multimodal topic modeling
Easy Interpretation: :bookmark_tabs: Pretty Printing, :bar_chart: Interactive Figures, :art: topicwizard compatible
Topic Analysis: :robot: LLM-generated names and descriptions, :wave: Manual Topic Naming
Informative Topic Descriptions: :key: Keyphrases, Noun-phrases, Lemmatization, Stemming

Basics

For more details on a particular topic, you can consult our documentation page:

  • :house: Build and Train Topic Models
  • :art: Explore, Interpret and Visualize your Models
  • :wrench: Modify and Fine-tune Topic Models
  • :pushpin: Choose the Right Model for your Use-Case
  • :chart_with_upwards_trend: Explore Topics Changing over Time
  • :newspaper: Use Phrases or Lemmas for Topic Models
  • :ocean: Extract Topics from a Stream of Documents
  • :evergreen_tree: Find Hierarchical Order in Topics
  • :whale: Name Topics with Large Language Models

Installation

Turftopic can be installed from PyPI.

pip install turftopic

If you intend to use CTMs, make sure to install the package with Pyro as an optional dependency.

pip install "turftopic[pyro-ppl]"

If you want to use clustering models like BERTopic or Top2Vec, install:

pip install "turftopic[umap-learn]"
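
To see which of the optional dependencies above are already present in the current environment, you can probe for their import names. This is a minimal standard-library sketch: it assumes that pyro-ppl is imported as pyro and umap-learn as umap, and the helper name missing_extras is hypothetical, not part of Turftopic.

```python
from importlib.util import find_spec

# Maps the extra named in the pip command to the module it is assumed to provide.
EXTRA_MODULES = {
    "pyro-ppl": "pyro",
    "umap-learn": "umap",
}


def missing_extras(extras=EXTRA_MODULES):
    """Return the names of extras whose module cannot be found."""
    return sorted(
        name for name, module in extras.items() if find_spec(module) is None
    )


if __name__ == "__main__":
    missing = missing_extras()
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All optional dependencies are available.")
```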

Fitting a Model

Turftopic's models follow the scikit-learn API conventions, and as such they are quite easy to use if you are familiar with scikit-learn workflows.

Here's an example of how to use KeyNMF, one of our models, on the 20Newsgroups dataset from scikit-learn.

If you are using a Mac, you might have to install the required SSL certificates on your system in order to be able to download the dataset.

from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(
    subset="all",
    remove=("headers", "footers", "quotes"),
)
corpus: list[str] = newsgroups.data
print(len(corpus)) # 18846

Turftopic also comes with interpretation tools that make it easy to display and understand your results.

from turftopic import KeyNMF

model = KeyNMF(20)
document_topic_matrix = model.fit_transform(corpus)
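
Following scikit-learn conventions, the matrix returned by fit_transform is expected to have one row per document and one column per topic. The sketch below shows one way to read off each document's dominant topic. It uses a small hand-written matrix as a stand-in for document_topic_matrix so that it runs without training a model, and dominant_topics is a hypothetical helper.

```python
def dominant_topics(doc_topic):
    """Return the index of the highest-scoring topic for each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in doc_topic]


# Stand-in for document_topic_matrix: 3 documents, 4 topics (made-up values).
toy_matrix = [
    [0.05, 0.40, 0.00, 0.10],
    [0.30, 0.00, 0.02, 0.01],
    [0.00, 0.00, 0.00, 0.25],
]

print(dominant_topics(toy_matrix))  # [1, 0, 3]
```

If the real matrix is a NumPy array, document_topic_matrix.argmax(axis=1) should give the same kind of result.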

Interpreting Models

Turftopic comes with a number of pretty printing utilities for interpreting the models.

To see the most important words for each topic, use the print_topics() method.

model.print_topics()
Topic ID Top 10 Words
0 armenians, armenian, armenia, turks, turkish, genocide, azerbaijan, soviet, turkey, azerbaijani
1 sale, price, shipping, offer, sell, prices, interested, 00, games, selling
2 christians, christian, bible, christianity, church, god, scripture, faith, jesus, sin
3 encryption, chip, clipper, nsa, security, secure, privacy, encrypted, crypto, cryptography
....
# Print highest ranking documents for topic 0
model.print_representative_documents(0, corpus, document_topic_matrix)
Document Score
Poor 'Poly'. I see you're preparing the groundwork for yet another retreat from your... 0.40
Then you must be living in an alternate universe. Where were they? An Appeal to Mankind During the... 0.40
It is 'Serdar', 'kocaoglan'. Just love it. Well, it could be your head wasn't screwed on just right... 0.39
model.print_topic_distribution(
    "I think guns should definitely banned from all public institutions, such as schools."
)
Topic name Score
7_gun_guns_firearms_weapons 0.05
17_mail_address_email_send 0.00
3_encryption_chip_clipper_nsa 0.00
19_baseball_pitching_pitcher_hitter 0.00
11_graphics_software_program_3d 0.00
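
The representative-document listing above can be thought of as sorting one column of the document-topic matrix. Below is a minimal sketch of that idea on made-up data. It is not Turftopic's implementation, and top_documents is a hypothetical helper.

```python
def top_documents(topic_id, corpus, doc_topic, top_k=3):
    """Return (document, score) pairs with the highest score for one topic."""
    scored = [(doc, row[topic_id]) for doc, row in zip(corpus, doc_topic)]
    # Sort by score, highest first; ties keep their original order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up corpus and scores: 4 documents, 2 topics.
toy_corpus = ["doc a", "doc b", "doc c", "doc d"]
toy_matrix = [
    [0.10, 0.00],
    [0.40, 0.05],
    [0.00, 0.30],
    [0.39, 0.01],
]

for doc, score in top_documents(0, toy_corpus, toy_matrix):
    print(f"{doc}\t{score:.2f}")
```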

Automated Topic Naming

Turftopic now allows you to automatically assign human readable names to topics using LLMs or n-gram retrieval!

You will need to pip install "turftopic[openai]" for this to work.

from turftopic import KeyNMF
from turftopic.analyzers import OpenAIAnalyzer

model = KeyNMF(10).fit(corpus)

namer = OpenAIAnalyzer("gpt-4o-mini")
model.rename_topics(namer)
model.print_topics()
Topic ID Topic Name Highest Ranking
0 Operating Systems and Software windows, dos, os, ms, microsoft, unix, nt, memory, program, apps
1 Atheism and Belief Systems atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith
2 Computer Architecture and Performance motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance
3 Storage Technologies disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot
...
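
The naming step calls a remote LLM, so it needs credentials. The sketch below assumes that OpenAIAnalyzer relies on the official openai client, which reads its key from the OPENAI_API_KEY environment variable. Neither that assumption nor the helper name has_openai_key is confirmed by the text above.

```python
import os


def has_openai_key(env=None):
    """Return True if a non-empty OPENAI_API_KEY is present in the environment."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not has_openai_key():
    print("Set OPENAI_API_KEY before calling model.rename_topics(namer).")
```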

Vectorizers Module

You can use a set of custom vectorizers for topic modeling over phrases, as well as lemmata and stems.

You will need to pip install "turftopic[spacy]" for this to work.

from turftopic import BERTopic
from turftopic.vectorizers.spacy import NounPhraseCountVectorizer

model = BERTopic(
    n_components=10,
    vectorizer=NounPhraseCountVectorizer("en_core_web_sm"),
)
model.fit(corpus)
model.print_topics()
Topic ID Highest Ranking
...
3 fanaticism, theism, fanatism, all fanatism, theists, strong theism, strong atheism, fanatics, precisely some theists, all theism
4 religion foundation darwin fish bumper stickers, darwin fish, atheism, 3d plastic fish, fish symbol, atheist books, atheist organizations, negative atheism, positive atheism, atheism index
...

Visualization

Turftopic comes with a number of visualization and pretty printing utilities for specific models and specific contexts, such as hierarchical or dynamic topic modelling. You will find an overview of these in the Interpreting and Visualizing Models section of our documentation.

pip install "turftopic[datamapplot,openai]"

from turftopic import ClusteringTopicModel
from turftopic.analyzers import OpenAIAnalyzer

model = ClusteringTopicModel(feature_importance="centroid").fit(corpus)

namer = OpenAIAnalyzer("gpt-5-nano")
model.rename_topics(namer)

fig = model.plot_clusters_datamapplot()
fig.show()

In addition, Turftopic is natively supported in topicwizard, an interactive topic model visualization library that is compatible with all models from Turftopic.

pip install "turftopic[topic-wizard]"

By far the easiest way to visualize your models for interpretation is to launch the topicwizard web app.

import topicwizard

topicwizard.visualize(corpus, model=model)

Alternatively you can use the Figures API in topicwizard for individual HTML figures.

Citation

Please cite us when using Turftopic:

@article{
  Kardos2025,
  title = {Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers},
  doi = {10.21105/joss.08183},
  url = {https://doi.org/10.21105/joss.08183},
  year = {2025},
  publisher = {The Open Journal},
  volume = {10},
  number = {111},
  pages = {8183},
  author = {Kardos, Márton and Enevoldsen, Kenneth C. and Kostkan, Jan and Kristensen-McLachlan, Ross Deans and Rocca, Roberta},
  journal = {Journal of Open Source Software} 
}

References

  • Kardos, M., Kostkan, J., Vermillet, A., Nielbo, K., Enevoldsen, K., & Rocca, R. (2024, June 13). $S^3$ - Semantic Signal separation. arXiv.org. https://arxiv.org/abs/2406.09556
  • Wu, X., Nguyen, T., Zhang, D. C., Wang, W. Y., & Luu, A. T. (2024). FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm. arXiv.org. https://arxiv.org/abs/2405.17978
  • Grootendorst, M. (2022, March 11). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv.org. https://arxiv.org/abs/2203.05794
  • Angelov, D. (2020, August 19). Top2Vec: Distributed representations of topics. arXiv.org. https://arxiv.org/abs/2008.09470
  • Bianchi, F., Terragni, S., & Hovy, D. (2020, April 8). Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence. arXiv.org. https://arxiv.org/abs/2004.03974
  • Bianchi, F., Terragni, S., Hovy, D., Nozza, D., & Fersini, E. (2021). Cross-lingual Contextualized Topic Models with Zero-shot Learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (pp. 1676–1683). Association for Computational Linguistics.
  • Kristensen-McLachlan, R. D., Hicke, R. M. M., Kardos, M., & Thunø, M. (2024, October 16). Context is Key(NMF): Modelling Topical Information Dynamics in Chinese Diaspora Media. arXiv.org. https://arxiv.org/abs/2410.12791

