Pretty and opinionated topic model visualization in Python.

topicwizard


https://user-images.githubusercontent.com/13087737/234209888-0d20ede9-2ea1-4d6e-b69b-71b863287cc9.mp4

New in version 0.5.0 🌟

  • Enhanced readability and legibility of graphs.
  • Added helper tooltips to help you understand and interpret the graphs.
  • Improved stability.
  • Negative topic distributions are now supported in documents.

Features

  • Investigate complex relations between topics, words, documents and groups/genres/labels
  • Easy to use pipelines that can be utilized for downstream tasks
  • Sklearn, Gensim and BERTopic compatible
  • Highly interactive web app
  • Interactive and composable Plotly figures
  • Automatically infer topic names, oooor...
  • Name topics manually
  • Easy deployment

Installation

Install from PyPI:

pip install topic-wizard

Pipelines

The main abstraction topicwizard builds around a topic model is the topic pipeline. It consists of a vectorizer, which turns texts into bag-of-tokens representations, and a topic model, which decomposes these representations into vectors of topic importance. topicwizard lets you use either scikit-learn pipelines or its own TopicPipeline.

Let's build a pipeline. We will use scikit-learn's CountVectorizer as our vectorizer component:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=5, max_df=0.8, stop_words="english")

The topic model I will use for this example is Non-negative Matrix Factorization as it is fast and usually finds good topics.

from sklearn.decomposition import NMF

model = NMF(n_components=10)

Then let's put this all together in a pipeline. You can either use sklearn Pipelines...

from sklearn.pipeline import make_pipeline

topic_pipeline = make_pipeline(vectorizer, model)

Or TopicPipeline from topicwizard:

from topicwizard.pipeline import make_topic_pipeline

topic_pipeline = make_topic_pipeline(vectorizer, model, norm_row=False)

Let's load a corpus that we would like to analyze. In this example I will use 20newsgroups from sklearn.

from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset="all")
corpus = newsgroups.data

# Sklearn gives the labels back as integers, we have to map them back to
# the actual textual label.
group_labels = [newsgroups.target_names[label] for label in newsgroups.target]

Then let's fit our pipeline to this data:

topic_pipeline.fit(corpus)

The advantages of using a TopicPipeline over a regular pipeline are numerous:

  • Output dimensions (topics) are named
  • You can set the output to be a pandas dataframe (topic_pipeline.set_output(transform="pandas")) with topics as columns.
  • You can treat topic importances as pseudo-probability distributions (topic_pipeline.norm_row = True)
  • You can freeze components so that the pipeline will stay frozen when fitting downstream components (topic_pipeline.freeze = True)
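To make the pseudo-probability idea concrete, here is a minimal sketch of row normalization with NumPy. It assumes that normalizing rows means dividing each document's topic importances by their sum (L1 normalization) and that the importances are non-negative, as NMF produces. The matrix values and the `normalize_rows` helper are made up for illustration and are not taken from topicwizard's implementation.

```python
import numpy as np

# Made-up topic importances for three documents over four topics.
topic_importances = np.array([
    [2.0, 1.0, 1.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0],  # a document with no topic signal at all
])


def normalize_rows(matrix):
    """Scale each row to sum to 1, leaving all-zero rows untouched."""
    row_sums = matrix.sum(axis=1, keepdims=True)
    # Avoid dividing by zero for documents without any topic weight.
    safe_sums = np.where(row_sums == 0, 1.0, row_sums)
    return matrix / safe_sums


pseudo_probabilities = normalize_rows(topic_importances)
print(pseudo_probabilities)
```

After normalization, each non-empty row can be read as the share of the document attributed to each topic.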

Here's an example of how you can easily display a heatmap over topics in a document using TopicPipelines.

import plotly.express as px

pipeline = make_topic_pipeline(vectorizer, model).set_output(transform="pandas")
texts = [
   "Coronavirus killed 50000 people today.",
   "Donald Trump's presidential campaign is going very well",
   "Protests against police brutality have been going on all around the US.",
]
topic_df = pipeline.transform(texts)
topic_df.index = texts
px.imshow(topic_df).show()

You didn't even have to use topicwizard's own visualizations for this!

You can also use TopicPipelines for downstream tasks, such as unsupervised text labeling with the help of human-learn.

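pip install human-learn

import topicwizard
from hulearn.classification import FunctionClassifier
from sklearn.pipeline import make_pipeline
from topicwizard.pipeline import make_topic_pipeline

# Fit on the full corpus and output topics as named dataframe columns,
# which the rule below relies on.
topic_pipeline = make_topic_pipeline(vectorizer, model).set_output(transform="pandas")
topic_pipeline.fit(corpus)

# Investigate topics
topicwizard.visualize(corpus, pipeline=topic_pipeline)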

# Rule for classifying a document as being about corona.
# The column name is an example; use a topic name from your own fitted pipeline.
def corona_rule(df, threshold=0.5):
    is_about_corona = df["11_vaccine_pandemic_virus_coronavirus"] > threshold
    return is_about_corona.astype(int)

# Freezing topic pipeline
topic_pipeline.freeze = True
classifier = FunctionClassifier(corona_rule)
cls_pipeline = make_pipeline(topic_pipeline, classifier)
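To see what such a rule does in isolation, the sketch below applies it to a hand-made dataframe of topic importances. The topic names and values are invented for illustration; in practice the dataframe would come from a topic pipeline with pandas output.

```python
import pandas as pd

# Hypothetical topic name, mirroring the example above.
TOPIC = "11_vaccine_pandemic_virus_coronavirus"


def corona_rule(df, threshold=0.5):
    """Label a document 1 if its corona topic importance exceeds the threshold."""
    is_about_corona = df[TOPIC] > threshold
    return is_about_corona.astype(int)


# Made-up topic importances for three documents.
topic_df = pd.DataFrame(
    {
        TOPIC: [0.9, 0.1, 0.5],
        "3_court_law_judge": [0.1, 0.9, 0.5],
    }
)

labels = corona_rule(topic_df)
print(labels.tolist())  # [1, 0, 0]
```

Note that the comparison is strict, so a document sitting exactly at the threshold is not labeled as a corona document.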

Web Application

You can launch the topicwizard web application to investigate your topic models interactively. The app is also quite easy to deploy in case you want to create a client-facing interface.

import topicwizard

topicwizard.visualize(corpus, pipeline=topic_pipeline)

From version 0.3.0 you can also disable pages you do not wish to display, which can save you a lot of time:

# Computing 2D projections takes a long time for a large corpus,
# so you can speed up preprocessing by excluding the documents page altogether.
topicwizard.visualize(corpus, pipeline=topic_pipeline, exclude_pages=["documents"])

The app consists of four pages: Topics, Words, Documents and Groups.

Figures

If you want customizable, faster, html-saveable interactive plots, you can use the figures API. Here are a couple of examples:

from topicwizard.figures import word_map, document_topic_timeline, topic_wordclouds, word_association_barchart

Word Map

word_map(corpus, pipeline=topic_pipeline)

Timeline of Topics in a Document

document_topic_timeline(
    "Joe Biden takes over presidential office from Donald Trump.",
    pipeline=topic_pipeline,
)

Wordclouds of Topics

topic_wordclouds(corpus, pipeline=topic_pipeline)

Topic for Word Importance

word_association_barchart(["supreme", "court"], corpus=corpus, pipeline=topic_pipeline)

For more information, consult our documentation.
