Probabilistic Latent Semantic Analysis
python implementation of Probabilistic Latent Semantic Analysis
What PLSA can do for you
Broadly speaking, PLSA is a tool of Natural Language Processing (NLP). It analyses a collection of text documents (a corpus) under the assumption that there are (by far) fewer topics to write about than there are documents in the corpus. It then tries to identify these topics (in terms of words and their relative importance to each topic) and to give you the relative importance of a pre-specified number of topics in each document.
In doing so, it does not actually try to "make sense" of each document (or "understand" it) by contextually analysing it. Rather, it simply counts how often which word occurs in each document, regardless of the context in which they occur. As such, it belongs to the family of so-called bag-of-words models.
To give an example, a bunch of documents might frequently contain words like "eating", "nutrition", "health", etc. Others might contain words like "state", "party", "ministry", etc. Yet others might contain words like "tournament", "ranking", "win", etc. It is easy to imagine there being documents that contain a mixture of these words. Not knowing in advance how many topics there are, one would have to run PLSA with several different numbers of topics and see the results to judge how many is a good choice. Picking three in our example would yield topics that could be described as "food", "politics", and "sports" and, while a number of documents will emerge as being purely about one of these topics, it is easy to imagine that there are others that have contributions from more than one topic (e.g., about a new initiative from the ministry of health, combining "food" and "politics"). PLSA will give you that mixture.
This code is available on the python package index PyPi. To install, I strongly recommend setting up a new virtual python environment, and then type
pip install plsa
on the console.
WARNING: On first use, some components of
nltk that don't come with it out-of-the-box
wil be downloaded. Should you install (against my express recommendation) install
plsa package system-wide (with
sudo), then you lack the access rights
to write the required
nltk data to where it is supposed to go (into a subfolder of
plsa package directory).
This package depends on the following python packages:
If you want to run the
you will also need to install the
The matrices to store and manipulate data can easily get quite large. That
means you will soon run out of memory when toying with a large corpus. This
could be mitigated to some extent by using sparse matrices. But since there
is no built-in support for sparse matrices of more than 2 dimensions
(we need 3) in
scipy, this is not implemented.
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
|Filename, size||File type||Python version||Upload date||Hashes|
|Filename, size plsa-0.6.0-py3-none-any.whl (33.1 kB)||File type Wheel||Python version py3||Upload date||Hashes View hashes|