An NLP Package for generating Topic Models

SimpleTopicModel

Easily identifying themes in text

What is this?

This is a package that wraps up common theme identification (Topic Modeling) techniques in Python. SimpleTopicModel is currently under development, and subject to change.

How do I get it?

Currently, you can git clone this repo and import it locally. Be sure to run pip install -r requirements.txt in the repo folder so that the dependencies are installed.

I'm working on setting up a PyPI release, slated for the near future.

How do I use it?

Using it couldn't be easier. Most topic modeling techniques follow the same paradigm:

  1. Convert your text to numbers (embeddings): The excellent Sentence-Transformers package does this for us, using Microsoft's MiniLM model.
  2. Reduce Dimensionality: This package uses UMAP, but you could substitute t-SNE or PCA if you wanted to.
  3. Cluster: We're using HDBSCAN to build hierarchical clusters (which we'd like to traverse in a later release), but you could also use k-means, a GMM, etc.
  4. Visualize (Optional): This displays the reduced-dimension embeddings in 3D (or 2D) space, so you can get a feel for how "tight" the clusters are.
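
To make the first three steps concrete, here is a minimal sketch of that paradigm written directly against the underlying libraries. It is not SimpleTopicModel's own API: the function names (embed_minilm, reduce_umap, cluster_hdbscan, topic_pipeline, group_by_label) are hypothetical, and the "all-MiniLM-L6-v2" checkpoint and the parameter values are assumptions chosen for illustration. Each stage is passed in as a plain function, so swapping UMAP for PCA, or HDBSCAN for another clusterer, only means passing a different callable.

```python
def embed_minilm(texts):
    """Step 1: turn each text into an embedding vector."""
    # Imported lazily so the other stages can be used without this dependency.
    from sentence_transformers import SentenceTransformer

    # Assumption: "all-MiniLM-L6-v2" is one commonly used MiniLM checkpoint.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode(list(texts))


def reduce_umap(embeddings, n_components=3):
    """Step 2: project the embeddings down to a few dimensions with UMAP."""
    import umap  # provided by the umap-learn package

    reducer = umap.UMAP(n_components=n_components, random_state=42)
    return reducer.fit_transform(embeddings)


def cluster_hdbscan(points, min_cluster_size=5):
    """Step 3: assign a cluster label to each point (-1 means noise)."""
    import hdbscan

    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    return clusterer.fit_predict(points)


def topic_pipeline(texts, embed=embed_minilm, reduce=reduce_umap,
                   cluster=cluster_hdbscan):
    """Run embed -> reduce -> cluster and return (reduced points, labels)."""
    embeddings = embed(texts)
    reduced = reduce(embeddings)
    labels = cluster(reduced)
    return reduced, labels


def group_by_label(texts, labels):
    """Collect the original texts under the cluster label each one received."""
    groups = {}
    for text, label in zip(texts, labels):
        groups.setdefault(int(label), []).append(text)
    return groups
```

With the three libraries installed, calling topic_pipeline(texts) on a list of strings returns the reduced coordinates (which could be fed to a 2D or 3D scatter plot for step 4) and one label per text; group_by_label(texts, labels) then shows which texts landed in each cluster.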

What's next?

  • Clean up/professionalize this repo & releases
  • Add automated cluster naming techniques (c-TF-IDF, LLM-assisted naming, etc.)
  • Make a sweet logo & eye-catching graphics
  • Fix the docs page

Acknowledgements:

This builds on previous work including Gensim (LDA), BERTopic, Top2Vec, and pyLDAvis. They're all excellent, more mature alternatives to SimpleTopicModel, and I'd encourage you to go check them out!
