
topicwizard


Pretty and opinionated topic model visualization in Python.


https://user-images.githubusercontent.com/13087737/234209888-0d20ede9-2ea1-4d6e-b69b-71b863287cc9.mp4

New in version 0.4.0 🌟 🌟

  • Introduced topic pipelines that make it easier and safer to use topic models in downstream tasks and interpretation.

New in version 0.3.1 🌟 🌟

  • You can now investigate how pre-existing labels relate to your topics and words :mag:

New in version 0.3.0 🌟

  • Exclude pages that are not needed :bird:
  • Self-contained interactive figures :gift:
  • Topic name inference is now the default behavior and is done implicitly.

Features

  • Investigate complex relations between topics, words, documents and groups/genres/labels
  • Sklearn, Gensim and BERTopic compatible :nut_and_bolt:
  • Highly interactive web app
  • Interactive and composable Plotly figures
  • Automatically infer topic names, oooor...
  • Name topics manually
  • Easy deployment :earth_africa:

Installation

Install from PyPI:

pip install topic-wizard

Usage (documentation)

Step 0:

Have a corpus ready for analysis. In this example I am going to use the 20 Newsgroups dataset from scikit-learn.

from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset="all")
corpus = newsgroups.data

# Sklearn gives the labels back as integers; we have to map them back to
# the actual textual labels.
group_labels = [newsgroups.target_names[label] for label in newsgroups.target]

Step 1:

Train a scikit-learn compatible topic model. (If you want to use non-scikit-learn topic models, check compatibility)

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Create topic pipeline
pipeline = make_pipeline(
    CountVectorizer(stop_words="english", min_df=10),
    NMF(n_components=30),
)

# Then fit it on the given texts
pipeline.fit(corpus)

From version 0.4.0 you can also use TopicPipelines, which are almost functionally identical but come with a set of built-in conveniences and safeties.

from topicwizard.pipeline import make_topic_pipeline

pipeline = make_topic_pipeline(
    CountVectorizer(stop_words="english", min_df=10),
    NMF(n_components=30),
)
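
A TopicPipeline can be used in downstream code much like a regular scikit-learn pipeline. The sketch below only assumes the standard fit/transform interface; the example document and the exact format of the output are illustrative, not part of the documented API.

# Fit the topic pipeline on the corpus, just like a plain sklearn pipeline
pipeline.fit(corpus)

# Get topic importances for an unseen document (illustrative example;
# the exact output format depends on the TopicPipeline implementation)
doc_topics = pipeline.transform(["NASA launches a new space telescope."])
print(doc_topics)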

Step 2a:

Visualize with the topicwizard webapp :bulb:

import topicwizard

topicwizard.visualize(corpus, pipeline=pipeline)

From version 0.3.0 you can also disable pages you do not wish to display, thereby saving yourself a lot of time:

# A large corpus takes a looong time to compute 2D projections for,
# so you can speed up preprocessing by disabling that page altogether.
topicwizard.visualize(corpus, pipeline=pipeline, exclude_pages=["documents"])

(Screenshots: topics, words, and documents pages)

From version 0.3.1 you can investigate groups/labels by passing them along to the webapp.

topicwizard.visualize(corpus, pipeline=pipeline, group_labels=group_labels)

(Screenshot: groups page)

Ooooor...

Step 2b:

Produce high-quality, self-contained HTML plots and create your own dashboards/reports :strawberry:

Map of words

from topicwizard.figures import word_map

word_map(corpus, pipeline=pipeline)

(Screenshot: word map)
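
Because these figures are interactive Plotly figures (see Features above), they can also be written out as self-contained HTML files for reports and dashboards. Here is a minimal sketch using the standard Plotly API; the file name is just an example.

from topicwizard.figures import word_map

# word_map returns a Plotly figure, which can be saved as a
# self-contained interactive HTML file
fig = word_map(corpus, pipeline=pipeline)
fig.write_html("word_map.html")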

Timelines of topic distributions

from topicwizard.figures import document_topic_timeline

document_topic_timeline(
    "Joe Biden takes over presidential office from Donald Trump.",
    pipeline=pipeline,
)

(Screenshot: document topic timeline)

Wordclouds of your topics :cloud:

from topicwizard.figures import topic_wordclouds

topic_wordclouds(corpus, pipeline=pipeline)

(Screenshot: topic wordclouds)

And much more... (documentation)
