
Topic modelling over short texts


tweetopic

⚡ Blazing fast topic modelling over short texts in Python


Features

  • Fast ⚡
  • Scalable 💥
  • High consistency and coherence 🎯
  • High quality topics 🔥
  • Easy visualization and inspection 👀
  • Full scikit-learn compatibility 🔩

🛠 Installation

Install from PyPI:

pip install tweetopic

👩‍💻 Usage (documentation)

Train a topic model on a corpus of short texts:

from tweetopic import DMM
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Creating a vectorizer for extracting document-term matrix from the
# text corpus.
vectorizer = CountVectorizer(min_df=15, max_df=0.1)

# Creating a Dirichlet Multinomial Mixture Model with 30 components
dmm = DMM(n_components=30, n_iterations=100, alpha=0.1, beta=0.1)

# Creating topic pipeline
pipeline = Pipeline([
    ("vectorizer", vectorizer),
    ("dmm", dmm),
])
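
For intuition about what `n_components`, `alpha` and `beta` control, here is a minimal pure-Python sketch of the collapsed Gibbs sampler for a Dirichlet Multinomial Mixture, following the sampling formula in Yin & Wang (2014). It is an illustration only and is not tweetopic's implementation; the function name `fit_dmm_sketch` and the toy documents are hypothetical.

```python
import math
import random
from collections import Counter


def fit_dmm_sketch(docs, n_components=3, n_iterations=30,
                   alpha=0.1, beta=0.1, seed=0):
    """Assign each tokenized document to exactly one topic (toy sampler)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_counts = [Counter(doc) for doc in docs]

    m = [0] * n_components                          # documents per topic
    n = [0] * n_components                          # word tokens per topic
    n_w = [Counter() for _ in range(n_components)]  # word counts per topic

    def move(d, k, sign):
        # Add (sign=+1) or remove (sign=-1) document d from topic k.
        m[k] += sign
        n[k] += sign * len(docs[d])
        for w, c in doc_counts[d].items():
            n_w[k][w] += sign * c

    # Random initial assignment.
    labels = [rng.randrange(n_components) for _ in docs]
    for d, k in enumerate(labels):
        move(d, k, +1)

    for _ in range(n_iterations):
        for d in range(len(docs)):
            move(d, labels[d], -1)
            log_p = []
            for k in range(n_components):
                # alpha smooths topic sizes: larger values make small or
                # empty topics more likely to be chosen. The denominator
                # shared by all topics is omitted.
                lp = math.log(m[k] + alpha)
                # beta smooths word counts: larger values reduce the
                # penalty for words the topic has not seen yet.
                for w, c in doc_counts[d].items():
                    for j in range(c):
                        lp += math.log(n_w[k][w] + beta + j)
                for i in range(len(docs[d])):
                    lp -= math.log(n[k] + vocab_size * beta + i)
                log_p.append(lp)
            # Sample the new topic from the normalized probabilities.
            top = max(log_p)
            weights = [math.exp(lp - top) for lp in log_p]
            labels[d] = rng.choices(range(n_components), weights=weights)[0]
            move(d, labels[d], +1)
    return labels


docs = [
    ["cat", "dog", "pet"], ["dog", "pet", "vet"], ["cat", "pet"],
    ["stock", "market", "bank"], ["bank", "loan", "market"], ["stock", "loan"],
]
labels = fit_dmm_sketch(docs, n_components=2)
print(labels)
```

Unlike LDA, each document receives a single topic here, which is the assumption that makes the mixture model a reasonable fit for short texts.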

You may fit the model with a stream of short texts:

pipeline.fit(texts)
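
The `min_df=15` and `max_df=0.1` settings of the vectorizer are applied when the pipeline is fitted, so the corpus has to be large enough for some terms to survive the pruning. The sketch below mimics that document-frequency filter in plain Python to show which terms would be kept. It assumes scikit-learn's default lowercasing and token pattern; `surviving_terms` is a hypothetical helper, not part of tweetopic or scikit-learn.

```python
import re
from collections import Counter

# Default token pattern of CountVectorizer: words of two or more characters.
TOKEN = re.compile(r"(?u)\b\w\w+\b")


def surviving_terms(texts, min_df=15, max_df=0.1):
    """Keep terms found in at least min_df documents (absolute count)
    and in at most a max_df fraction of all documents."""
    n_docs = len(texts)
    df = Counter()
    for text in texts:
        # Count each term once per document.
        df.update(set(TOKEN.findall(text.lower())))
    return sorted(t for t, c in df.items() if min_df <= c <= max_df * n_docs)


# 300 toy documents: "good" and "morning" appear in 20 of them,
# "filler", "text" and "number" in 280, and each numeral in at most one.
toy_texts = ["good morning"] * 20 + [f"filler text number {i}" for i in range(280)]
print(surviving_terms(toy_texts))           # ['good', 'morning']
print(surviving_terms(["good morning"] * 5))  # []
```

On a very small corpus no term reaches 15 documents, so the vocabulary would be empty; in that case consider lowering `min_df` before fitting.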

To investigate the internal structure of topics and their relations to words and individual documents, we recommend using topicwizard.

Install it from PyPI:

pip install topic-wizard

Then visualize your topic model:

import topicwizard

topicwizard.visualize(pipeline=pipeline, corpus=texts)

topicwizard visualization

🎓 References

  • Yin, J., & Wang, J. (2014). A Dirichlet Multinomial Mixture Model-Based Approach for Short Text Clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 233–242). Association for Computing Machinery.
