topicwizard


Pretty and opinionated topic model visualization in Python.

https://user-images.githubusercontent.com/13087737/234209888-0d20ede9-2ea1-4d6e-b69b-71b863287cc9.mp4

New in version 0.4.0 🌟 🌟

  • Introduced topic pipelines that make it easier and safer to use topic models in downstream tasks and interpretation.

New in version 0.3.1 🌟 🌟

  • You can now investigate relations of pre-existing labels to your topics and words :mag:

New in version 0.3.0 🌟

  • Exclude pages that are not needed :bird:
  • Self-contained interactive figures :gift:
  • Topic name inference is now default behavior and is done implicitly.

Features

  • Investigate complex relations between topics, words, documents and groups/genres/labels
  • Sklearn, Gensim and BERTopic compatible :nut_and_bolt:
  • Highly interactive web app
  • Interactive and composable Plotly figures
  • Automatically infer topic names, oooor...
  • Name topics manually
  • Easy deployment :earth_africa:

Installation

Install from PyPI:

pip install topic-wizard

Usage (documentation)

Step 0:

Have a corpus ready for analysis. In this example I am going to use the 20 Newsgroups dataset from scikit-learn.

from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset="all")
corpus = newsgroups.data

# Sklearn gives the labels back as integers, so we have to map them back to
# the actual textual labels.
group_labels = [newsgroups.target_names[label] for label in newsgroups.target]
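
As a quick sanity check on the label mapping, it can help to count how many documents fall under each textual label. The sketch below is only an illustration: `toy_newsgroups` is a hypothetical stand-in that mimics the `target_names` and `target` attributes used above, so that it runs without downloading the dataset.

```python
from collections import Counter
from types import SimpleNamespace

# Hypothetical stand-in for the object returned by fetch_20newsgroups;
# it only mimics the two attributes used in the mapping above.
toy_newsgroups = SimpleNamespace(
    target_names=["rec.autos", "sci.space", "talk.politics.misc"],
    target=[1, 0, 1, 2, 1],
)

# Same list comprehension as above: integer label -> textual label.
toy_group_labels = [
    toy_newsgroups.target_names[label] for label in toy_newsgroups.target
]

# Count how many documents belong to each group.
label_counts = Counter(toy_group_labels)
print(label_counts.most_common())
```

With the real dataset you would apply the same `Counter` call to `group_labels`.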

Step 1:

Train a scikit-learn compatible topic model. (If you want to use non-scikit-learn topic models, check compatibility)

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Create topic pipeline
pipeline = make_pipeline(
    CountVectorizer(stop_words="english", min_df=10),
    NMF(n_components=30),
)

# Then fit it on the given texts
pipeline.fit(corpus)

From version 0.4.0 you can also use TopicPipelines, which are functionally almost identical to scikit-learn pipelines but come with a set of built-in conveniences and safeties.

from topicwizard.pipeline import make_topic_pipeline

pipeline = make_topic_pipeline(
    CountVectorizer(stop_words="english", min_df=10),
    NMF(n_components=30),
)

Step 2a:

Visualize with the topicwizard webapp :bulb:

import topicwizard

topicwizard.visualize(corpus, pipeline=pipeline)

From version 0.3.0 you can also disable pages you do not wish to display, which can save you a lot of preprocessing time:

# Computing 2D projections for a large corpus takes a long time,
# so you can speed up preprocessing by disabling that page altogether.
topicwizard.visualize(corpus, pipeline=pipeline, exclude_pages=["documents"])

From version 0.3.1 you can investigate groups/labels by passing them along to the webapp.

topicwizard.visualize(corpus, pipeline=pipeline, group_labels=group_labels)
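
The group labels are matched to documents by position, so any filtering or subsampling of the corpus has to be applied to both lists together. As one possible way to explore a large corpus faster, the sketch below draws a random sample while keeping documents and labels aligned. `sample_corpus` is a hypothetical helper written for this example, not a topicwizard function.

```python
import random


def sample_corpus(corpus, group_labels, n_samples, seed=0):
    """Randomly sample documents and keep their labels aligned."""
    if len(corpus) != len(group_labels):
        raise ValueError("corpus and group_labels must have the same length")
    rng = random.Random(seed)
    n = min(n_samples, len(corpus))
    # Sample positions once, then index both lists with the same positions.
    indices = sorted(rng.sample(range(len(corpus)), n))
    return [corpus[i] for i in indices], [group_labels[i] for i in indices]


# Made-up data for demonstration.
demo_corpus = [f"document {i}" for i in range(10)]
demo_labels = ["even" if i % 2 == 0 else "odd" for i in range(10)]

small_corpus, small_labels = sample_corpus(demo_corpus, demo_labels, n_samples=4)
for doc, label in zip(small_corpus, small_labels):
    print(label, doc)

# The sampled lists can then be passed along in place of the full ones:
# topicwizard.visualize(small_corpus, pipeline=pipeline, group_labels=small_labels)
```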

Ooooor...

Step 2b:

Produce high-quality self-contained HTML plots and create your own dashboards/reports :strawberry:

Map of words

from topicwizard.figures import word_map

word_map(corpus, pipeline=pipeline)

Timelines of topic distributions

from topicwizard.figures import document_topic_timeline

document_topic_timeline(
    "Joe Biden takes over presidential office from Donald Trump.",
    pipeline=pipeline,
)

Wordclouds of your topics :cloud:

from topicwizard.figures import topic_wordclouds

topic_wordclouds(corpus, pipeline=pipeline)

And much more... (documentation)
