Pretty and opinionated topic model visualization in Python.
topicwizard
New in version 0.4.0 🌟 🌟
- Introduced topic pipelines that make it easier and safer to use topic models in downstream tasks and interpretation.
New in version 0.3.1 🌟 🌟
- You can now investigate relations of pre-existing labels to your topics and words :mag:
New in version 0.3.0 🌟
- Exclude pages that are not needed :bird:
- Self-contained interactive figures :gift:
- Topic name inference is now default behavior and is done implicitly.
Features
- Investigate complex relations between topics, words, documents and groups/genres/labels
- Sklearn, Gensim and BERTopic compatible :nut_and_bolt:
- Highly interactive web app
- Interactive and composable Plotly figures
- Automatically infer topic names, oooor...
- Name topics manually
- Easy deployment :earth_africa:
Installation
Install from PyPI:
pip install topic-wizard
Usage (documentation)
Step 0:
Have a corpus ready for analysis. In this example I am going to use the 20 Newsgroups dataset from scikit-learn.
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups(subset="all")
corpus = newsgroups.data
# Sklearn gives the labels back as integers, we have to map them back to
# the actual textual label.
group_labels = [newsgroups.target_names[label] for label in newsgroups.target]
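Before moving on, it can help to check that the labels line up with the corpus. The snippet below is a minimal sketch on toy data: `target_names`, `target` and `corpus` are hypothetical stand-ins for `newsgroups.target_names`, `newsgroups.target` and `newsgroups.data`, and the one-label-per-document check is my assumption about how labels are meant to be used.

```python
from collections import Counter

# Toy stand-ins for newsgroups.target_names / newsgroups.target (hypothetical values).
target_names = ["rec.autos", "sci.space", "talk.politics.misc"]
target = [1, 0, 1, 2, 1]
corpus = ["doc a", "doc b", "doc c", "doc d", "doc e"]

# Same mapping as above: integer label -> textual label.
group_labels = [target_names[label] for label in target]

# Assumption: labels are aligned one-to-one with the documents in the corpus.
assert len(group_labels) == len(corpus)

# Count documents per group to spot empty or very small groups early.
label_counts = Counter(group_labels)
print(label_counts.most_common())
```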
Step 1:
Train a scikit-learn compatible topic model. (If you want to use non-scikit-learn topic models, check compatibility)
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
# Create topic pipeline
pipeline = make_pipeline(
    CountVectorizer(stop_words="english", min_df=10),
    NMF(n_components=30),
)
# Then fit it on the given texts
pipeline.fit(corpus)
From version 0.4.0 you can also use TopicPipelines, which are almost functionally identical but come with a set of built-in conveniences and safeties.
from topicwizard.pipeline import make_topic_pipeline
pipeline = make_topic_pipeline(
    CountVectorizer(stop_words="english", min_df=10),
    NMF(n_components=30),
)
Step 2a:
Visualize with the topicwizard webapp :bulb:
import topicwizard
topicwizard.visualize(corpus, pipeline=pipeline)
From version 0.3.0 you can also disable pages you do not wish to display, thereby saving yourself a lot of time:
# Computing 2D projections for a large corpus takes a long time,
# so you can speed up preprocessing by disabling the documents page altogether.
topicwizard.visualize(corpus, pipeline=pipeline, exclude_pages=["documents"])
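If you switch between small and large corpora, you could decide which pages to skip based on corpus size. This is only a sketch: `pages_to_exclude` and its threshold are hypothetical and not part of topicwizard; the only thing taken from the example above is the `"documents"` page name.

```python
def pages_to_exclude(n_documents, max_documents_for_map=10_000):
    """Return the pages to skip for a corpus of the given size.

    Hypothetical helper: the threshold is arbitrary, tune it to your hardware.
    """
    # "documents" is the page name used in the exclude_pages example above.
    return ["documents"] if n_documents > max_documents_for_map else []


print(pages_to_exclude(500))     # → []
print(pages_to_exclude(18_846))  # → ['documents']
```

You would then pass the result as `exclude_pages=pages_to_exclude(len(corpus))`, assuming an empty list is accepted there.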
From version 0.3.1 you can investigate groups/labels by passing them along to the webapp.
topicwizard.visualize(corpus, pipeline=pipeline, group_labels=group_labels)
Ooooor...
Step 2b:
Produce high quality self-contained HTML plots and create your own dashboards/reports :strawberry:
Map of words
from topicwizard.figures import word_map
word_map(corpus, pipeline=pipeline)
Timelines of topic distributions
from topicwizard.figures import document_topic_timeline
document_topic_timeline(
    "Joe Biden takes over presidential office from Donald Trump.",
    pipeline=pipeline,
)
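To get an intuition for what a timeline of topic distributions can look like, here is a conceptual sketch with plain scikit-learn: the document is cut into overlapping word windows and each window is passed through the pipeline. This is not necessarily how topicwizard computes its timeline; `sliding_windows`, the window size and the step are all my own assumptions.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Tiny hypothetical training corpus.
toy_corpus = [
    "the rocket launched into orbit around the moon",
    "nasa plans a new orbit mission to the moon",
    "the car engine needs new oil and tires",
    "my car has new tires and brakes",
]
toy_pipeline = make_pipeline(
    CountVectorizer(),
    NMF(n_components=2, random_state=0),
).fit(toy_corpus)


def sliding_windows(text, window_size=4, step=2):
    """Split text into overlapping windows of whitespace-separated tokens."""
    tokens = text.split()
    last_start = max(len(tokens) - window_size, 0)
    return [
        " ".join(tokens[start:start + window_size])
        for start in range(0, last_start + 1, step)
    ]


document = "the rocket reached orbit and then the car drove home on new tires"
windows = sliding_windows(document)

# One row per window, one column per topic: topic weights along the document.
timeline = toy_pipeline.transform(windows)
for window, weights in zip(windows, timeline):
    print(f"{window!r}: {weights.round(2)}")
```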
Wordclouds of your topics :cloud:
from topicwizard.figures import topic_wordclouds
topic_wordclouds(corpus, pipeline=pipeline)
And much more... (documentation)