
A Python package for Fuzzy Topic Models


Fuzzy Topic Modeling - methods derived from Fuzzy Latent Semantic Analysis

This package contains the Python code for training topic models based on Fuzzy Latent Semantic Analysis (FLSA). The details of the original FLSA model can be found here. With my group, we have formulated two alternative topic modeling algorithms, 'FLSA-W' and 'FLSA-V', which are derived from FLSA. Once the paper is published (it has been accepted), we will place a link here too.

Table of contents

  1. Introduction to Topic Modeling
  2. Explanation algorithms
  3. Getting started
    • FLSA & FLSA-W
    • FLSA-V
      • Instructions to get map_file from Vosviewer
  4. Class methods
  5. Dependencies

Introduction to Topic Modeling

Topic modeling is a popular task within the domain of Natural Language Processing (NLP). It is a type of statistical modeling for discovering the latent 'topics' occurring in a collection of documents. While humans typically describe the topic of something by a single word, topic modeling algorithms describe topics as a probability distribution over words.

Various topic modeling algorithms exist, and one thing they have in common is that they all output two matrices:

  1. Probability of a word given a topic. This is a M x C (vocabulary size x number of topics) matrix.
  2. Probability of a topic given a document. This is a C x N (number of topics x number of documents) matrix.

From the first matrix, the top n words per topic are taken to represent that topic.
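As a minimal sketch of this selection step, the snippet below picks the top n words per topic from a small hand-made M x C matrix. The function name top_n_words, the toy vocabulary, and the probabilities are illustrative assumptions and are not part of the package.

```python
def top_n_words(prob_word_given_topic, vocabulary, n):
    """Return, for each topic, the n words with the highest P(word | topic)."""
    num_topics = len(prob_word_given_topic[0])
    topics = []
    for t in range(num_topics):
        # Pair every word with its probability under topic t.
        scored = [(row[t], word) for row, word in zip(prob_word_given_topic, vocabulary)]
        # Sort by probability, highest first, and keep the n best words.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        topics.append([word for _, word in scored[:n]])
    return topics


# Toy M x C matrix (4 words x 2 topics); each column sums to 1.
vocabulary = ["cat", "dog", "stock", "market"]
prob_word_given_topic = [
    [0.5, 0.1],
    [0.4, 0.0],
    [0.1, 0.3],
    [0.0, 0.6],
]

top_words = top_n_words(prob_word_given_topic, vocabulary, n=2)
print(top_words)
```

Here the first topic is represented by its two most probable words, and likewise for the second topic.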

On top of finding the latent topics in a text, topic models can also be used for more explainable text classification. In that case, documents can be represented as a 'topic embedding': a C-length vector in which each cell represents a topic and contains a number that indicates the extent to which that topic is represented in the document. These topic embeddings can then be fed to machine learning classification models. Some machine learning classification models can show the weights they assigned to the input variables, based on which they made their decisions. The idea is that if the topics are interpretable, then the weights assigned to the topics reveal why a model made its decisions.
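To make the idea of a topic embedding concrete, the sketch below reads the embeddings off a C x N matrix of P(topic | document): document j's embedding is simply column j. The helper name topic_embeddings and the toy numbers are assumptions made for illustration.

```python
def topic_embeddings(prob_topic_given_document):
    """Transpose a C x N matrix into N document embeddings of length C."""
    return [list(column) for column in zip(*prob_topic_given_document)]


# Toy C x N matrix (2 topics x 3 documents); each column sums to 1.
prob_topic_given_document = [
    [0.7, 0.2, 0.1],  # topic 0
    [0.3, 0.8, 0.9],  # topic 1
]

embeddings = topic_embeddings(prob_topic_given_document)
print(embeddings[0])  # embedding of the first document
```

Each resulting row can serve as the feature vector of one document in a downstream classifier.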

Explanation algorithms

The general approach to the algorithm(s) can be explained as follows:

  1. Create a local term matrix. This is a N x M (number of documents x vocabulary size) matrix that gives the count of each word i in document j.
  2. Create a global term matrix in which the words from different documents are also related to each other (the four options for weighting in the class are: 'normal', 'entropy', 'idf', 'probidf').
  3. Project the data in a lower dimensional space (we use singular value decomposition).
  4. Use fuzzy clustering to get the partition matrix.
  5. Use Bayes' Theorem and matrix multiplication to get the needed matrices.
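Steps 1 and 5 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the package's implementation: the function names are hypothetical, the partition matrix is hand-written to stand in for the output of step 4, it is read as P(topic | word) (the word-clustering view), and P(word) is taken to be the relative corpus frequency.

```python
from collections import Counter


def local_term_matrix(documents):
    """Step 1: N x M matrix of raw counts, plus the sorted vocabulary."""
    vocabulary = sorted({token for doc in documents for token in doc})
    matrix = []
    for doc in documents:
        counts = Counter(doc)
        matrix.append([counts[word] for word in vocabulary])
    return matrix, vocabulary


def word_given_topic(prob_topic_given_word, prob_word):
    """Step 5 (sketch): P(w|t) = P(t|w) P(w) / sum over w' of P(t|w') P(w')."""
    num_topics = len(prob_topic_given_word[0])
    # Joint probability P(w, t) = P(t|w) * P(w) for every word and topic.
    joint = [[p_t * p_w for p_t in row]
             for row, p_w in zip(prob_topic_given_word, prob_word)]
    # Marginal P(t), obtained by summing the joint over all words.
    totals = [sum(row[t] for row in joint) for t in range(num_topics)]
    return [[row[t] / totals[t] for t in range(num_topics)] for row in joint]


documents = [["a", "b", "a"], ["b", "c"]]
counts, vocabulary = local_term_matrix(documents)

# P(w): relative frequency of each word in the corpus (an assumption here).
total_tokens = sum(sum(row) for row in counts)
prob_word = [sum(row[i] for row in counts) / total_tokens
             for i in range(len(vocabulary))]

# Hypothetical partition matrix P(t|w) (M x C), standing in for step 4.
partition = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
prob_word_given_topic = word_given_topic(partition, prob_word)
```

Each column of the resulting M x C matrix sums to 1, as expected for a distribution over words given a topic.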

FLSA

The original FLSA approach aims to find clusters in the projected space of documents.

FLSA-W

Documents might contain multiple topics, making them difficult to cluster. Therefore, it might make more sense to cluster on words instead of documents. That is what we do with FLSA-W(ords).

FLSA-E

FLSA-E trains a Word2Vec word embedding on the corpus and then clusters in this embedding space to find topics.

FLSA-V

FLSA-W clusters on a projected space of words and implicitly assumes that the projections place related words near each other. However, there is no optimization algorithm that ensures this is the case. With FLSA-V(os), we use the output from Vosviewer as input to our model. Vosviewer is an open-source software tool used for bibliographic mapping that optimizes its projections such that related words are located near each other. Using Vosviewer's output, FLSA-V's calculations start with step 4 (yet, step 1 is used for calculating some probabilities).

Getting started

Many parameters have default settings, so the algorithms can be called by setting only the following two variables:

  • input_file, The data on which you want to train the topic model.

    • Format: list of lists of tokens.
    • Example: [['this','is','the','first','document'],['why','am','i','stuck','in','the','middle'],['save','the','best','for','last']].
  • num_topics, The number of topics you want the topic model to find.

    • Format: int (greater than zero).
    • Example: 15.
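If your corpus is still a list of raw strings, it first has to be turned into the list-of-lists format described above. The sketch below uses a deliberately simple, hypothetical tokenizer (lowercasing and keeping alphabetic runs only); real preprocessing, such as stop-word removal, is up to you and is not prescribed by the package.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


raw_documents = ["This is the first document.", "Save the best for last!"]

# One list of tokens per document: the format expected for input_file.
data = [tokenize(text) for text in raw_documents]
print(data)
```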

Suppose your data (a list of lists of strings) is called data and you want to run a topic model with 10 topics. Run the following code to get the two matrices:

from FuzzyTM import FLSA

flsa_model = FLSA(input_file = data, num_topics = 10)
prob_word_given_topic, prob_topic_given_document = flsa_model.get_matrices()

To see the words and probabilities corresponding to each topic, run:

flsa_model.show_topics()

Below is a description of the other parameters per algorithm.

FLSA & FLSA-W

  • num_words, The number of words (top-n) per topic used to represent that topic.

    • Format: int (greater than zero).
    • Default value: 20
  • word_weighting, The method used for global term weighting (as described in step 2 of the algorithm).

    • Format: str (choose between: 'entropy', 'idf', 'normal', 'probidf').
    • Default value: 'normal'
  • cluster_method, The (fuzzy) cluster method to be used.

    • Format: str (choose between: 'fcm', 'gk', 'fst-pso').
    • Default value: 'fcm'
  • svd_factors, The number of dimensions to project the data into.

    • Format: int (greater than zero).
    • Default value: 2.

FLSA-V

  • map_file, The output file from Vosviewer.
    • Format: pd.DataFrame (the DataFrame needs to contain the following columns: 'id', 'x', 'y')
    • Example:
      id          x         y
      word_one    -0.4626   0.8213
      word_two    0.6318    -0.2331
      ...         ...       ...
      word_M      0.9826    0.184
  • num_words, The number of words (top-n) per topic used to represent that topic.

    • Format: int (greater than zero).
    • Default value: 20
  • cluster_method, The (fuzzy) cluster method to be used.

    • Format: str (choose between: 'fcm', 'gk', 'fst-pso').
    • Default value: 'fcm'

Instructions to get map_file from Vosviewer

  1. Create a tab-separated file from your dataset in which you show for each word how often it appears with each other word.
    Format: Word_1 <TAB> Word_2 <TAB> Frequency.
    (Since this quickly leads to an unprocessable number of combinations, we recommend using only the words that appear in at least x documents; we used 100).
  2. Download Vosviewer.
  3. Vosviewer > Create > Create a map based on text data > Read data from VOSviewer files
    Under 'VOSviewer corpus file (required)' submit your .txt file from step 1 and click 'finish'.
  4. The exported file is a tab-separated file, and can be loaded into Python as follows:
    Suppose the file is called map_file.txt:
    map_file = pd.read_csv('<DIRECTORY>/map_file.txt', delimiter = "\t")
  5. Please check the Vosviewer manual for more information.
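The file from step 1 can be produced with a few lines of standard-library Python. The sketch below is one possible way to do it and rests on assumptions: a pair of words is counted once per document in which both occur, the function name cooccurrence_lines is hypothetical, and the toy threshold of 2 replaces the value of 100 mentioned above.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_lines(documents, min_doc_freq):
    """Build 'Word_1<TAB>Word_2<TAB>Frequency' lines for frequent words."""
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    kept = {word for word, freq in doc_freq.items() if freq >= min_doc_freq}

    pair_counts = Counter()
    for doc in documents:
        # Sorting gives every pair a fixed order, so (a, b) and (b, a) merge.
        words = sorted(set(doc) & kept)
        pair_counts.update(combinations(words, 2))

    return [f"{first}\t{second}\t{count}"
            for (first, second), count in sorted(pair_counts.items())]


documents = [["save", "the", "best"], ["the", "best", "day"], ["the", "end"]]
lines = cooccurrence_lines(documents, min_doc_freq=2)
print(lines)
```

Joining the returned lines with newline characters and writing them to a .txt file gives a file in the format described in step 1.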

Class Methods

Dependencies

numpy == 1.19.2
pandas == 1.3.3
scipy == 1.5.2
pyfume == 0.2.0
