
Probabilistic Latent Semantic Analysis


PLSA

A Python implementation of Probabilistic Latent Semantic Analysis

What PLSA can do for you

Broadly speaking, PLSA is a tool of Natural Language Processing (NLP). It analyses a collection of text documents (a corpus) under the assumption that there are (by far) fewer topics to write about than there are documents in the corpus. It then tries to identify these topics (in terms of words and their relative importance to each topic) and to tell you how much each of a pre-specified number of topics contributes to each document.

In doing so, it does not actually try to "make sense" of each document (or "understand" it) by contextually analysing it. Rather, it simply counts how often each word occurs in each document, regardless of the context in which it occurs. As such, it belongs to the family of so-called bag-of-words models.

In reducing a large number of documents to a much smaller number of topics, PLSA can be seen as an example of unsupervised dimensionality reduction, most closely related to non-negative matrix factorization.

To give an example, a bunch of documents might frequently contain words like "eating", "nutrition", "health", etc. Others might contain words like "state", "party", "ministry", etc. Yet others might contain words like "tournament", "ranking", "win", etc. It is easy to imagine documents that contain a mixture of these words. Not knowing in advance how many topics there are, one would have to run PLSA with several different numbers of topics and inspect the results to judge which number is a good choice. Picking three in our example would yield topics that could be described as "food", "politics", and "sports" and, while a number of documents will emerge as being purely about one of these topics, it is easy to imagine that others have contributions from more than one topic (e.g., a piece about a new initiative from the ministry of health, combining "food" and "politics"). PLSA will give you that mixture.
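This description doesn't spell out the algorithm, but at its core PLSA is a simple expectation-maximization (EM) iteration over the document-word count matrix. The following minimal numpy sketch illustrates that iteration; it is not the plsa package's actual code, and the function name and arguments are made up for illustration.

```python
import numpy as np

def plsa_em(doc_word, n_topics, n_iter=100, seed=0):
    """EM iteration for PLSA on a document-word count matrix.

    doc_word: array of shape (n_docs, n_words) with raw counts n(d, w).
    Returns p(t|d) of shape (n_docs, n_topics) and
            p(w|t) of shape (n_topics, n_words).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = doc_word.shape

    # Random, row-normalized start values for the two conditionals.
    topic_given_doc = rng.random((n_docs, n_topics))    # p(t|d)
    topic_given_doc /= topic_given_doc.sum(axis=1, keepdims=True)
    word_given_topic = rng.random((n_topics, n_words))  # p(w|t)
    word_given_topic /= word_given_topic.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: p(t|d,w) is proportional to p(t|d) * p(w|t). Note that
        # this is the 3-dimensional array mentioned under "Technical
        # considerations" below.
        posterior = topic_given_doc[:, :, None] * word_given_topic[None, :, :]
        posterior /= posterior.sum(axis=1, keepdims=True) + 1e-12

        # M-step: re-estimate both conditionals from the expected
        # counts n(d,w) * p(t|d,w).
        expected = doc_word[:, None, :] * posterior
        word_given_topic = expected.sum(axis=0)
        word_given_topic /= word_given_topic.sum(axis=1, keepdims=True) + 1e-12
        topic_given_doc = expected.sum(axis=2)
        topic_given_doc /= topic_given_doc.sum(axis=1, keepdims=True) + 1e-12

    return topic_given_doc, word_given_topic
```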

Installation

This code is available on the Python Package Index (PyPI). To install, I strongly recommend first setting up a new virtual Python environment, and then typing

pip install plsa

on the console.

WARNING: On first use, some components of nltk that don't come with it out-of-the-box will be downloaded. Should you (against my express recommendation) install the plsa package system-wide (with sudo), then you will lack the access rights to write the required nltk data to where it is supposed to go (into a subfolder of the plsa package directory).

Dependencies

This package depends on the following Python packages:

  • numpy
  • matplotlib
  • wordcloud
  • nltk

If you want to run the example notebook, you will also need to install the jupyter package.

Getting Started

Clone the GitHub repository and run the Jupyter notebook Examples.ipynb in the notebooks folder.
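If you'd first like a feel for the rough shape of a run, the session below is a sketch pieced together from the example notebook; the CSV file name is a placeholder, and you should consult the API documentation for the authoritative interface.

```python
from plsa import Corpus, Pipeline
from plsa.pipeline import DEFAULT_PIPELINE
from plsa.algorithms import PLSA

# Preprocess the raw text with the default chain of preprocessors ...
pipeline = Pipeline(*DEFAULT_PIPELINE)
# ... and build a corpus from a CSV file with one document per row
# (the file name is a placeholder).
corpus = Corpus.from_csv('my_documents.csv', pipeline)

# Fit PLSA with a pre-specified number of topics,
# using tf-idf weighting of the raw word counts.
plsa = PLSA(corpus, 5, True)
result = plsa.fit()
```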

Documentation

Read the API documentation on Read the Docs

Technical considerations

The matrices needed to store and manipulate the data can easily get quite large, which means you will soon run out of memory when toying with a larger corpus. This could be mitigated to some extent by using sparse matrices, but since scipy has no built-in support for sparse matrices with more than 2 dimensions (and we need 3), this is not implemented.
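For a rough sense of scale (with made-up but not outlandish corpus dimensions):

```python
# Back-of-envelope size of the dense 3-dimensional array p(t|d,w)
# for 20 topics, 10,000 documents, and a 50,000-word vocabulary
# (illustrative numbers, not a measurement of this package).
n_topics, n_docs, n_words = 20, 10_000, 50_000
size_gib = n_topics * n_docs * n_words * 8 / 2**30  # 8 bytes per float64
print(f"{size_gib:.0f} GiB")  # ~75 GiB -- far beyond typical RAM
```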
