Pretty and opinionated topic model visualization in Python.
topicwizard
New in version 0.5.0 🌟
- Enhanced readability and legibility of graphs.
- Added helper tooltips to help you understand and interpret the graphs.
- Improved stability.
- Negative topic distributions are now supported in documents.
Features
- Investigate complex relations between topics, words, documents and groups/genres/labels
- Easy-to-use pipelines that can be utilized for downstream tasks
- Sklearn, Gensim and BERTopic compatible 🔩
- Highly interactive web app
- Interactive and composable Plotly figures
- Automatically infer topic names, oooor...
- Name topics manually
- Easy deployment 🌍
Installation
Install from PyPI:
pip install topic-wizard
Pipelines
The main abstraction of topicwizard around a topic model is a topic pipeline, which consists of a vectorizer that turns texts into bag-of-tokens representations, and a topic model that decomposes these representations into vectors of topic importance.
topicwizard allows you to use either scikit-learn pipelines or its own TopicPipeline.
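To make the two stages concrete, here is a minimal sketch that uses only scikit-learn on a toy corpus. The four sentences, the `toy_` names and the choice of two topics are invented for illustration. It prints the shape of the output at each stage: a document-by-vocabulary count matrix, then a document-by-topic importance matrix.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only.
toy_corpus = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell as markets closed",
    "markets rallied and stocks rose",
]

# Stage 1: texts -> bag-of-tokens counts (documents x vocabulary terms).
toy_vectorizer = CountVectorizer()
doc_term = toy_vectorizer.fit_transform(toy_corpus)

# Stage 2: counts -> topic importances (documents x topics).
toy_model = NMF(n_components=2, random_state=0, max_iter=500)
doc_topic = toy_model.fit_transform(doc_term)

print(doc_term.shape)   # (number of documents, number of vocabulary terms)
print(doc_topic.shape)  # (number of documents, number of topics)
```

Each row of `doc_topic` is the vector of topic importances for one document, which is what the rest of this guide works with.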
Let's build a pipeline. We will use scikit-learn's CountVectorizer as our vectorizer component:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=5, max_df=0.8, stop_words="english")
The topic model I will use for this example is Non-negative Matrix Factorization as it is fast and usually finds good topics.
from sklearn.decomposition import NMF
model = NMF(n_components=10)
Then let's put this all together in a pipeline. You can either use sklearn Pipelines...
from sklearn.pipeline import make_pipeline
topic_pipeline = make_pipeline(vectorizer, model)
Or TopicPipeline from topicwizard:
from topicwizard.pipeline import make_topic_pipeline
topic_pipeline = make_topic_pipeline(vectorizer, model, norm_rows=False)
Let's load a corpus that we would like to analyze. In this example I will use 20newsgroups from sklearn.
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups(subset="all")
corpus = newsgroups.data
# Sklearn gives the labels back as integers, we have to map them back to
# the actual textual label.
group_labels = [newsgroups.target_names[label] for label in newsgroups.target]
Then let's fit our pipeline to this data:
topic_pipeline.fit(corpus)
The advantages of using a TopicPipeline over a regular pipeline are numerous:
- Output dimensions (topics) are named
- You can set the output to be a pandas dataframe (topic_pipeline.set_output(transform="pandas")) with topics as columns.
- You can treat topic importances as pseudoprobability-distributions (topic_pipeline.norm_rows = True)
- You can freeze components so that the pipeline will stay frozen when fitting downstream components (topic_pipeline.freeze = True)
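The pseudoprobability option in the list above can be pictured as follows. This is a sketch of the idea only: it assumes normalization means dividing each document's row by its sum, and the numbers are invented. topicwizard's own implementation may treat all-zero rows or negative importances differently.

```python
import numpy as np

# Hypothetical raw topic importances: three documents over three topics.
raw_importances = np.array(
    [
        [2.0, 1.0, 1.0],
        [0.0, 3.0, 1.0],
        [0.5, 0.5, 0.0],
    ]
)

# Divide every row by its own sum so that each row adds up to 1.
row_sums = raw_importances.sum(axis=1, keepdims=True)
pseudo_probs = raw_importances / row_sums

print(pseudo_probs)
```

After this step the rows are comparable across documents of different lengths, which is what makes thresholds on topic importance meaningful.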
Here's an example of how you can easily display a heatmap over topics in a document using TopicPipelines.
import plotly.express as px
pipeline = make_topic_pipeline(vectorizer, model).set_output(transform="pandas")
texts = [
"Coronavirus killed 50000 people today.",
"Donald Trump's presidential campaign is going very well",
"Protests against police brutality have been going on all around the US.",
]
topic_df = pipeline.transform(texts)
topic_df.index = texts
px.imshow(topic_df).show()
You didn't even have to use topicwizard's own visualizations for this!!
You can also use TopicPipelines for downstream tasks, such as unsupervised text labeling with the help of human-learn.
pip install human-learn
import topicwizard
from hulearn.classification import FunctionClassifier
from sklearn.pipeline import make_pipeline
# Fit on the full corpus and return topics as a pandas dataframe,
# so that the rule below can select a topic by its name.
topic_pipeline = make_topic_pipeline(vectorizer, model).fit(corpus)
topic_pipeline.set_output(transform="pandas")
# Investigate topics
topicwizard.visualize(corpus, pipeline=topic_pipeline)
# Creating rule for classifying something as a corona document.
# Replace the column name with a topic name from your own fitted pipeline.
def corona_rule(df, threshold=0.5):
    is_about_corona = df["11_vaccine_pandemic_virus_coronavirus"] > threshold
    return is_about_corona.astype(int)
# Freezing topic pipeline
topic_pipeline.freeze = True
classifier = FunctionClassifier(corona_rule)
cls_pipeline = make_pipeline(topic_pipeline, classifier)
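To see what the rule above does on its own, the sketch below applies it to a small hand-made dataframe. The importances, the `demo_topic_df` name and the first column name are invented for illustration. Neither human-learn nor topicwizard is needed to run it.

```python
import pandas as pd

# Hypothetical topic importances for three documents.
demo_topic_df = pd.DataFrame(
    {
        "0_game_team_season": [0.1, 0.2, 0.7],
        "11_vaccine_pandemic_virus_coronavirus": [0.8, 0.3, 0.1],
    }
)


def corona_rule(df, threshold=0.5):
    """Label a document 1 if the corona topic's importance exceeds the threshold."""
    is_about_corona = df["11_vaccine_pandemic_virus_coronavirus"] > threshold
    return is_about_corona.astype(int)


labels = corona_rule(demo_topic_df)
print(labels.tolist())  # [1, 0, 0]
```

Only the first document crosses the 0.5 threshold, so it is the only one labeled as a corona document.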
Web Application
You can launch the topic wizard web application for interactively investigating your topic models. The app is also quite easy to deploy in case you want to create a client-facing interface.
import topicwizard
topicwizard.visualize(corpus, pipeline=topic_pipeline)
From version 0.3.0 you can also disable pages you do not wish to display, which can save you a lot of time:
# Computing 2D projections for a large corpus takes a looong time,
# so you can speed up preprocessing by disabling it altogether.
topicwizard.visualize(corpus, pipeline=topic_pipeline, exclude_pages=["documents"])
[Screenshots of the app pages: Topics, Words, Documents, Groups]
Figures
If you want customizable, faster, html-saveable interactive plots, you can use the figures API. Here are a couple of examples:
from topicwizard.figures import word_map, document_topic_timeline, topic_wordclouds, word_association_barchart
Word Map | Timeline of Topics in a Document |
---|---|
word_map(corpus, pipeline=topic_pipeline) | document_topic_timeline("Joe Biden takes over presidential office from Donald Trump.", pipeline=topic_pipeline) |

Wordclouds of Topics | Topic for Word Importance |
---|---|
topic_wordclouds(corpus, pipeline=topic_pipeline) | word_association_barchart(["supreme", "court"], corpus=corpus, pipeline=topic_pipeline) |
For more information, consult our documentation.