
The following package enables users to perform text modelling


GPyM_TM

GPyM_TM is a Python package for topic modelling using either a Dirichlet multinomial mixture model or a Poisson model. Each model is available in its own class: GSDMM implements the Dirichlet multinomial mixture model, while GPM uses the Poisson model to cluster text. The package is also available on PyPI.

Preamble

The aim of topic modelling is to extract latent topics from large corpora. GSDMM [1] assumes each document belongs to a single topic, which is a suitable assumption for some short texts. Given an initial number of topics, K, this algorithm clusters documents and extracts the topical structures present within the corpus. If K is set to a high value, then the model will also automatically learn the number of clusters.

[1] Yin, J. and Wang, J., 2014, August. A Dirichlet multinomial mixture model-based approach for short text clustering. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 233-242).
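
To make the clustering idea concrete, the following is a minimal sketch of a collapsed Gibbs sampler for a Dirichlet multinomial mixture, written from the conditional distribution described in [1]. It is not the package's implementation: the function name gsdmm_sketch, its arguments and its return value are hypothetical and chosen for illustration only. Each document is given exactly one topic, and topics that end up with no documents are how the number of clusters can shrink below K.

```python
import math
import random
from collections import Counter


def gsdmm_sketch(docs, n_topics, alpha=0.1, beta=0.1, iters=15, seed=0):
    """Toy collapsed Gibbs sampler (hypothetical helper, not the GPyM_TM API).

    docs is a list of token lists; returns one topic label per document.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_counts = [Counter(doc) for doc in docs]

    m = [0] * n_topics                         # number of documents in each topic
    n = [0] * n_topics                         # number of words in each topic
    nw = [Counter() for _ in range(n_topics)]  # word counts within each topic

    def move(d, k, sign):
        """Add (sign=+1) or remove (sign=-1) document d from topic k."""
        m[k] += sign
        n[k] += sign * len(docs[d])
        for w, c in doc_counts[d].items():
            nw[k][w] += sign * c

    # Random initial assignment of every document to a single topic.
    labels = [rng.randrange(n_topics) for _ in docs]
    for d, k in enumerate(labels):
        move(d, k, +1)

    for _ in range(iters):
        for d in range(len(docs)):
            move(d, labels[d], -1)  # exclude the document from the counts
            log_p = []
            for k in range(n_topics):
                # Popularity of topic k (the constant denominator is dropped).
                lp = math.log(m[k] + alpha)
                # Similarity between the document and the words in topic k.
                for w, c in doc_counts[d].items():
                    for j in range(c):
                        lp += math.log(nw[k][w] + beta + j)
                for i in range(len(docs[d])):
                    lp -= math.log(n[k] + vocab_size * beta + i)
                log_p.append(lp)
            # Sample a new topic from the normalised probabilities.
            peak = max(log_p)
            weights = [math.exp(lp - peak) for lp in log_p]
            labels[d] = rng.choices(range(n_topics), weights=weights)[0]
            move(d, labels[d], +1)
    return labels


docs = [["cat", "dog"], ["dog", "cat", "cat"],
        ["stock", "market"], ["market", "stock", "bank"]]
labels = gsdmm_sketch(docs, n_topics=4, iters=10, seed=1)
print(labels, "topics in use:", len(set(labels)))
```

The probabilities are computed in log space so that long documents do not underflow. The number of distinct labels at the end is the number of topics still in use.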

Getting Started:

The package is available online for use within Python 3 environments.

Install it with the standard pip command:

pip install GPyM-TM

Prerequisites:

The package relies on the following modules (random, math and re are part of the Python standard library; the others must be installed):

  • numpy
  • random
  • math
  • pandas
  • re
  • nltk
  • gensim
  • scipy

GSDMM

Function and class description:

The class is named GSDMM, while the function itself is named DMM.

The function takes up to six arguments: two are required and the remaining four are optional.

The required arguments are:

  • corpus - a text file that has been cleaned and loaded into Python. That is, all text should be lowercase, and all punctuation and numbers should have been removed.
  • nTopics - the number of topics.
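
The cleaning described for corpus can be sketched with the standard re module. This is only an illustration: the helper name clean_document is hypothetical, and the README does not state which container type (list of strings, list of token lists, and so on) the package expects, so check that against the tutorial before use.

```python
import re


def clean_document(text):
    """Lowercase the text and strip punctuation and numbers (hypothetical helper)."""
    text = text.lower()
    # Replace every character that is not a-z or whitespace with a space.
    # Note: this also drops accented and other non-ASCII letters.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of whitespace left behind by the removal.
    return re.sub(r"\s+", " ", text).strip()


raw_docs = ["Hello, World! 123", "Topic modelling in 2020: a short text."]
cleaned = [clean_document(doc) for doc in raw_docs]
print(cleaned)
```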

The optional arguments are:

  • alpha, beta - the distribution-specific parameters. (The default for both is 0.1.)
  • nTopWords - the number of top words per topic. (The default is 10.)
  • iters - the number of Gibbs sampler iterations. (The default is 15.)

Output:

The function provides several components of output, namely:

  • psi - topic x word matrix.
  • theta - document x topic matrix.
  • topics - the top words per topic.
  • assignments - the topic numbers of selected topics only, as well as the final topic assignments.
  • Final k - the final number of selected topics.
  • coherence - the coherence score, which is a performance measure.
  • selected_theta
  • selected_psi
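
As an illustration of how the top words per topic relate to a topic x word matrix such as psi, the sketch below ranks the words of each row by weight. The helper name top_words, the vocab list and the example values are assumptions made for this sketch; the package computes its own topics output and may do so differently.

```python
def top_words(psi, vocab, n_top_words=10):
    """Return the highest-weighted words for each topic row (hypothetical helper)."""
    topics = []
    for row in psi:
        # Word indices ordered from the largest weight to the smallest.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        topics.append([vocab[j] for j in ranked[:n_top_words]])
    return topics


# Made-up 2 x 3 topic x word matrix, for illustration only.
example_psi = [[0.1, 0.7, 0.2],
               [0.5, 0.2, 0.3]]
example_vocab = ["bank", "cat", "dog"]
print(top_words(example_psi, example_vocab, n_top_words=2))
```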

GPM

Function and class description:

The class is named GPM, while the function itself is named GPM.

The function takes up to eight arguments: two are required and the remaining six are optional.

The required arguments are:

  • corpus - a text file that has been cleaned and loaded into Python. That is, all text should be lowercase, and all punctuation and numbers should have been removed.
  • nTopics - the number of topics.

The optional arguments are:

  • alpha, beta and gam - the distribution-specific parameters. (The defaults are alpha = 0.001, beta = 0.001 and gam = 0.1.)
  • nTopWords - the number of top words per topic. (The default is 10.)
  • iters - the number of Gibbs sampler iterations. (The default is 15.)
  • N - this is a parameter used to normalize the document lengths, which is required for the Poisson model.
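
One plausible reading of normalizing document lengths is to rescale each document's word counts so that they sum to a common length N. The sketch below shows that reading only; the helper name normalise_counts is hypothetical, and the README does not specify how the package applies N internally.

```python
from collections import Counter


def normalise_counts(counts, target_length):
    """Rescale word counts so they sum to target_length (assumed meaning of N)."""
    total = sum(counts.values())
    if total == 0:
        return dict(counts)  # an empty document is left unchanged
    scale = target_length / total
    return {word: count * scale for word, count in counts.items()}


doc = ["cat", "cat", "dog", "dog"]
print(normalise_counts(Counter(doc), target_length=8))
```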

Output:

The function provides several components of output, namely:

  • psi - topic x word matrix.
  • theta - document x topic matrix.
  • topics - the top words per topic.
  • assignments - the topic numbers of selected topics only, as well as the final topic assignments.
  • Final k - the final number of selected topics.
  • coherence - the coherence score, which is a performance measure.
  • selected_theta
  • selected_psi

Example Usage:

A more comprehensive tutorial is also available.

Installation;

Run the following command in a terminal or command prompt:

pip install GPyM-TM

Implementation;

Import the package into the relevant Python script with the following:

from GSDMM import DMM
from GPM import GPM

Call the class:

Possible examples of calling the GSDMM function are as follows:

data_DMM = GSDMM.DMM(corpus, nTopics)

data_DMM = GSDMM.DMM(corpus, nTopics, alpha = 0.25, beta = 0.15, nTopWords = 12, iters = 5)

Possible examples of calling the GPM function are as follows:

data_GPM = GPM.GPM(corpus, nTopics)

data_GPM = GPM.GPM(corpus, nTopics, alpha = 0.002, beta = 0.03, gam = 0.06, nTopWords = 12, iters = 7, N = 8)

Results;

The output obtained for the Dirichlet multinomial mixture model appears as follows:

[Image: example output of the Dirichlet multinomial mixture model]

The output obtained for the Poisson model appears as follows:

[Image: example output of the Poisson model]

Built With:

Google Colab - Web framework

Python - Programming language of choice

PyPI - Distribution

Authors:

Jocelyn Mazarura

Co-Authors:

Alta de Waal

Ricardo Marques

License:

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments:

University of Pretoria
