Pretty and opinionated topic model visualization in Python.

topicwizard

https://user-images.githubusercontent.com/13087737/234209888-0d20ede9-2ea1-4d6e-b69b-71b863287cc9.mp4

New in version 0.5.0 🌟

Enhanced readability and legibility of graphs.
Added helper tooltips to help you understand and interpret the graphs.
Improved stability.
Negative topic distributions are now supported in documents.

Features

Investigate complex relations between topics, words, documents and groups/genres/labels
Easy to use pipelines that can be utilized for downstream tasks
Sklearn, Gensim and BERTopic compatible 🔩
Highly interactive web app
Interactive and composable Plotly figures
Automatically infer topic names, oooor...
Name topics manually
Easy deployment 🌍

Installation

Install from PyPI:

pip install topic-wizard

Pipelines

The main abstraction in topicwizard is the topic pipeline. It consists of a vectorizer, which turns texts into bag-of-tokens representations, and a topic model, which decomposes these representations into vectors of topic importance. topicwizard lets you use either scikit-learn pipelines or its own TopicPipeline.
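
To make "bag-of-tokens" concrete, here is a toy sketch in plain Python. It is only an illustration: the `bag_of_tokens` helper is hypothetical and uses naive whitespace tokenization, which is not how scikit-learn's CountVectorizer works internally.

```python
from collections import Counter


def bag_of_tokens(texts):
    """Return (vocabulary, counts): a sorted vocabulary and one count row per text."""
    # Naive tokenization (an assumption for this sketch): lowercase, split on whitespace.
    tokenized = [text.lower().split() for text in texts]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    counts = []
    for tokens in tokenized:
        frequencies = Counter(tokens)
        # One column per vocabulary entry; word order in the text is discarded.
        counts.append([frequencies[token] for token in vocabulary])
    return vocabulary, counts


vocabulary, counts = bag_of_tokens(["the cat sat", "the cat saw the dog"])
print(vocabulary)
print(counts)
```

Each row of `counts` is the kind of representation a topic model then decomposes into topic importances.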

Let's build a pipeline. We will use scikit-learn's CountVectorizer as our vectorizer component:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=5, max_df=0.8, stop_words="english")

The topic model I will use for this example is Non-negative Matrix Factorization as it is fast and usually finds good topics.

from sklearn.decomposition import NMF

model = NMF(n_components=10)

Then let's put this all together in a pipeline. You can either use sklearn Pipelines...

from sklearn.pipeline import make_pipeline

topic_pipeline = make_pipeline(vectorizer, model)

Or TopicPipeline from topicwizard:

from topicwizard.pipeline import make_topic_pipeline

topic_pipeline = make_topic_pipeline(vectorizer, model, norm_row=False)

Let's load a corpus that we would like to analyze. In this example I will use 20newsgroups from sklearn.

from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset="all")
corpus = newsgroups.data

# Sklearn gives the labels back as integers, we have to map them back to
# the actual textual label.
group_labels = [newsgroups.target_names[label] for label in newsgroups.target]

Then let's fit our pipeline to this data:

topic_pipeline.fit(corpus)

The advantages of using a TopicPipeline over a regular pipeline are numerous:

Output dimensions (topics) are named
You can set the output to be a pandas dataframe (topic_pipeline.set_output(transform="pandas")) with topics as columns.
You can treat topic importances as pseudoprobability-distributions (topic_pipeline.norm_row = True)
You can freeze components so that the pipeline will stay frozen when fitting downstream components (topic_pipeline.freeze = True)
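
As a rough illustration of what treating topic importances as pseudoprobability distributions means, the sketch below rescales each row so that it sums to 1. The `normalize_rows` helper and the numbers are made up for this example, it assumes non-negative importances, and it is not taken from topicwizard's implementation.

```python
def normalize_rows(rows):
    """Scale each row so its entries sum to 1; all-zero rows are left unchanged."""
    normalized = []
    for row in rows:
        total = sum(row)
        if total == 0:
            # Avoid division by zero for documents with no topic signal.
            normalized.append(list(row))
        else:
            normalized.append([value / total for value in row])
    return normalized


# Hypothetical topic importances: one row per document, one column per topic.
topic_importances = [[2.0, 1.0, 1.0], [0.0, 0.0, 0.0], [0.5, 1.5, 0.0]]
topic_shares = normalize_rows(topic_importances)
print(topic_shares)
```

After normalization, each non-empty row can be read as the share of the document attributed to each topic.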

Here's an example of how you can easily display a heatmap over topics in a document using TopicPipelines.

import plotly.express as px

pipeline = make_topic_pipeline(vectorizer, model).set_output(transform="pandas")
texts = [
   "Coronavirus killed 50000 people today.",
   "Donald Trump's presidential campaign is going very well",
   "Protests against police brutality have been going on all around the US.",
]
topic_df = pipeline.transform(texts)
topic_df.index = texts
px.imshow(topic_df).show()

[Figure: topic heatmap]

You didn't even have to use topicwizard's own visualizations for this!

You can also use TopicPipelines for downstream tasks, such as unsupervised text labeling with the help of human-learn.

pip install human-learn

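import topicwizard
from hulearn.classification import FunctionClassifier
from sklearn.pipeline import make_pipeline
from topicwizard.pipeline import make_topic_pipeline

# The rule below selects a topic by column name, so the pipeline outputs a DataFrame.
topic_pipeline = make_topic_pipeline(vectorizer, model).set_output(transform="pandas")
topic_pipeline.fit(texts)

# Investigate topics
topicwizard.visualize(texts, pipeline=topic_pipeline)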

# Creating rule for classifying something as a corona document
def corona_rule(df, threshold=0.5):
    is_about_corona = df["11_vaccine_pandemic_virus_coronavirus"] > threshold
    return is_about_corona.astype(int)

# Freezing topic pipeline
topic_pipeline.freeze = True
classifier = FunctionClassifier(corona_rule)
cls_pipeline = make_pipeline(topic_pipeline, classifier)
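
To see what such a rule returns, here is a dependency-free sketch of the same thresholding logic applied to hand-made topic importances. The `threshold_rule` helper, the numbers and the second topic name are invented for illustration; the real rule above receives a pandas DataFrame produced by the topic pipeline.

```python
CORONA_TOPIC = "11_vaccine_pandemic_virus_coronavirus"


def threshold_rule(rows, topic=CORONA_TOPIC, threshold=0.5):
    """Label a document 1 if its importance for `topic` exceeds `threshold`, else 0."""
    return [int(row[topic] > threshold) for row in rows]


# Hypothetical pipeline output: one dict of topic importances per document.
rows = [
    {CORONA_TOPIC: 0.82, "3_election_campaign_vote": 0.18},
    {CORONA_TOPIC: 0.10, "3_election_campaign_vote": 0.90},
    {CORONA_TOPIC: 0.50, "3_election_campaign_vote": 0.50},
]
labels = threshold_rule(rows)
print(labels)
```

Note that the comparison is strict, so a document sitting exactly at the threshold is not labeled as a corona document.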

Web Application

You can launch the topicwizard web application to investigate your topic models interactively. The app is also quite easy to deploy in case you want to create a client-facing interface.

import topicwizard

topicwizard.visualize(corpus, pipeline=topic_pipeline)

From version 0.3.0 you can also disable pages you do not wish to display, which can save you a lot of time:

# Computing 2D projections takes a long time for a large corpus,
# so you can speed up preprocessing by excluding the documents page altogether.
topicwizard.visualize(corpus, pipeline=topic_pipeline, exclude_pages=["documents"])

The app has four pages: Topics, Words, Documents and Groups.

Figures

If you want customizable, faster, html-saveable interactive plots, you can use the figures API. Here are a couple of examples:

from topicwizard.figures import word_map, document_topic_timeline, topic_wordclouds, word_association_barchart

| Figure | Call |
|---|---|
| Word Map | `word_map(corpus, pipeline=topic_pipeline)` |
| Timeline of Topics in a Document | `document_topic_timeline("Joe Biden takes over presidential office from Donald Trump.", pipeline=topic_pipeline)` |
| Wordclouds of Topics | `topic_wordclouds(corpus, pipeline=topic_pipeline)` |
| Topic for Word Importance | `word_association_barchart(["supreme", "court"], corpus=corpus, pipeline=topic_pipeline)` |

For more information, consult our documentation.
