
Numba implementation of PLSA

Project description

Numba PLSA

PLSA for sparse matrices implemented with Numba. Wicked fast.

Installation

  1. Clone the repo: git clone https://github.com/TnTo/numba-plsa.git
  2. Install: pip install .
  3. Run the example: python example.py

Usage

The numba-plsa package provides two implementations: a basic NumPy method and a numba method. The plsa method wraps the core algorithm, and the implementation is selected with the method argument (method='numba' or method='basic', the default). The basic method works on any NumPy document-term matrix, whereas the numba method is optimized for sparse matrices; plsa automatically converts the input document-term matrix to a COO sparse matrix. The plsa_direct method is also available: it assumes the input is already in COO form and skips some precomputation, for faster performance on large matrices.
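
The dense-to-COO conversion can be pictured with plain NumPy. The snippet below illustrates the representation itself, not the package's own code:

```python
import numpy as np

# A tiny dense document-term matrix (3 documents, 4 terms).
doc_term = np.array([
    [2, 0, 1, 0],
    [0, 3, 0, 0],
    [1, 0, 0, 4],
])

# COO form keeps only the non-zero entries as (row, col, value) triples --
# the representation a sparse-aware PLSA inner loop iterates over.
rows, cols = np.nonzero(doc_term)
vals = doc_term[rows, cols]

print(rows)  # [0 0 1 2 2] -- document index of each non-zero count
print(cols)  # [0 2 1 0 3] -- term index of each non-zero count
print(vals)  # [2 1 3 1 4] -- the counts themselves
```

A scipy.sparse.coo_matrix exposes the same triples as its .row, .col, and .data attributes, so either representation works for a loop over non-zeros only.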

Two very basic classes are included to assist with topic modeling tasks for text corpora. The Corpus class takes in text documents and can build a document-term matrix, and the PLSAModel class provides a train method as an interface to plsa.
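
Under the hood, both methods run the standard PLSA EM updates: an E-step that computes the topic posterior p(z|d,w), and an M-step that re-estimates p(w|z) and p(z|d) from expected counts. Here is a minimal dense sketch of those updates; the names are illustrative and this is not the package's actual API:

```python
import numpy as np

def plsa_basic(n_dw, n_topics, n_iter, rng):
    """Dense PLSA via EM. n_dw is a (docs, terms) count matrix."""
    D, W = n_dw.shape
    # Random, row-normalized initializations for p(z|d) and p(w|z).
    p_z_d = rng.random((D, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, W))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior p(z|d,w) for every (d, w) pair, shape (D, Z, W).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        post = joint / joint.sum(axis=1, keepdims=True)  # normalize over z
        # M-step: re-estimate both distributions from expected counts.
        weighted = n_dw[:, None, :] * post
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z

# Tiny demo: 3 documents, 3 terms, 2 topics.
rng = np.random.default_rng(0)
counts = np.array([[2., 0., 1.], [0., 3., 1.], [1., 1., 0.]])
p_z_d, p_w_z = plsa_basic(counts, n_topics=2, n_iter=20, rng=rng)
```

Note that the dense E-step materializes a (docs, topics, terms) array, which is why the basic method scales poorly; see the performance comparisons below.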

For an example, see the example (conveniently named example.py). It uses the 20 newsgroups data set (2,000 documents), and the numba method runs in under a second on a standard laptop with 4 GB of RAM available. Assuming NumPy's random seeding behaves consistently across operating systems, the results should be:

Top topic terms
================
Topic 1: boswell, yalcin, onur, wright, mbytes
Topic 2: premiums, yeast, vitamin, sinus, candida
Topic 3: ports, pci, stereo, ankara, istanbul
Topic 4: icons, atari, lsd, cyprus, apps
Topic 5: wires, neutral, circuit, wiring, wire
Topic 6: gifs, simtel, jfif, gif, jpeg
Topic 7: nhl, sleeve, gant, players, league
Topic 8: mormon, gaza, xxxx, israeli, arabs
Topic 9: chi, det, suck, cubs, pit
Topic 10: cramer, theism, odwyer, bds, clayton

We can assign coherent labels to most topics, such as "pharmaceuticals" for Topic 2, "middle east" for Topic 8, and "baseball" for Topic 9. Adjusting corpus construction parameters, running for more iterations, or changing the number of topics can yield even better results.

Performance comparisons

We compare the two implementations on artificial problems of different sizes, all with document-term matrix sparsity around 95% (which is fairly dense for a text-based corpus). These results were obtained on a standard laptop with 4 GB of RAM available. The script speed_test.py can be used to recreate the figures.

Average time per iteration (best of 3):

Corpus size   Vocab size   Number of topics   Basic method   Numba method
100           500          10                 0.0047 s       0.00058 s
250           1000         10                 0.024 s        0.0028 s
100           2500         10                 0.026 s        0.0028 s
1000          5000         10                 1.16 s         0.042 s
2000          6000         10                 2.59 s         0.12 s
3000          5000         10                 3.26 s         0.13 s

The file large_speed_test.py carries out a test for a large matrix: 10,000 documents and a 100,000-term vocabulary at 99% sparsity (10 million non-zero entries). Running a 5-topic model for 10 iterations takes around 30 seconds, or about 3 seconds per iteration.
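
The numba method's advantage on sparse input comes from touching only the non-zero (document, term) pairs in each EM sweep, rather than the full document-term array. Here is a rough pure-Python sketch of that idea (illustrative, not the package's compiled kernel):

```python
import numpy as np

def plsa_sparse_step(rows, cols, vals, p_z_d, p_w_z):
    """One EM sweep over the COO triples of a document-term matrix.

    rows, cols, vals: non-zero entries as (document, term, count) triples.
    p_z_d: (docs, topics) p(z|d); p_w_z: (topics, terms) p(w|z).
    """
    D, Z = p_z_d.shape
    W = p_w_z.shape[1]
    new_z_d = np.zeros((D, Z))
    new_w_z = np.zeros((Z, W))
    for d, w, n in zip(rows, cols, vals):
        # E-step for this single non-zero (d, w) pair.
        post = p_z_d[d] * p_w_z[:, w]
        post /= post.sum()
        # Accumulate expected counts for the M-step.
        new_z_d[d] += n * post
        new_w_z[:, w] += n * post
    new_z_d /= new_z_d.sum(axis=1, keepdims=True)
    new_w_z /= new_w_z.sum(axis=1, keepdims=True)
    return new_z_d, new_w_z

# Tiny demo on a 3x3 count matrix with two topics.
rng = np.random.default_rng(1)
n_dw = np.array([[2., 0., 1.], [0., 3., 1.], [1., 1., 0.]])
r, c = np.nonzero(n_dw)
v = n_dw[r, c]
p_z_d = rng.random((3, 2)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)
p_w_z = rng.random((2, 3)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
s_z_d, s_w_z = plsa_sparse_step(r, c, v, p_z_d, p_w_z)
```

Because zero counts contribute nothing to the M-step sums, this sweep yields the same updates as the dense computation while doing work proportional to the number of non-zeros. A tight scalar loop of this shape is also exactly what Numba's JIT compiles well.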

We can also compare numba-plsa to a popular Python package on GitHub, PLSA. We used the example data from the PLSA repo; the two implementations produced the same distributions when given the same initializations.

Implementation      Corpus size   Vocab size   Number of topics   Iterations   Time (best of 3)
PLSA package        13            2126         5                  30           44.89 s
numba-plsa, basic   13            2126         5                  30           0.082 s
numba-plsa, numba   13            2126         5                  30           0.006 s

Download files

Download the file for your platform.

Source Distribution

numba_plsa-0.0.2.tar.gz (6.1 kB)


File details

Details for the file numba_plsa-0.0.2.tar.gz.

File metadata

  • Download URL: numba_plsa-0.0.2.tar.gz
  • Size: 6.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.5.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.1 CPython/3.9.5

File hashes

Hashes for numba_plsa-0.0.2.tar.gz:

  • SHA256: 5b1ba8422e9c80c74523473a75e47dafd1a69a9a2fbaa5de13f72e072950a669
  • MD5: f80cb82d844c5b21ef11a728ebf3fbe5
  • BLAKE2b-256: 27fe492ee573f624663d4c908cd7d3da762d1edcf1c882228289b4e4d402413d

