Numba implementation of PLSA
Project description
Numba PLSA
PLSA for sparse matrices implemented with Numba. Wicked fast.
Installation
- Clone the repo:
git clone https://github.com/TnTo/numba-plsa.git
- Install:
pip install .
- Run the example:
python example.py
Usage
The numba-plsa package provides two implementations: a basic NumPy method and a Numba method. The plsa method wraps the core algorithm, and the implementation is chosen with the method argument (method='numba' or method='basic', the default). The basic method works for any NumPy document-term matrix, whereas the numba method is optimized for sparse matrices. The plsa method automatically converts the input document-term matrix to a COO sparse matrix. The plsa_direct method is also available; it assumes the input is already in COO form and skips some precomputation, which is faster on large matrices.
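The EM re-estimation that a basic NumPy implementation performs can be sketched in a few lines. This is an illustrative derivation of the standard PLSA updates, not the package's actual code; the function name plsa_basic is hypothetical.

```python
import numpy as np

def plsa_basic(dtm, n_topics, n_iter=50, seed=0):
    """Minimal PLSA via EM on a dense document-term matrix.

    Sketch of the standard PLSA updates; not the package's exact code.
    Returns P(topic | doc) and P(term | topic).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_terms = dtm.shape
    # Random initialization; rows are probability distributions
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_terms))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Current model's P(w | d), used in the E-step posterior
        p_w_d = p_z_d @ p_w_z                     # (n_docs, n_terms)
        ratio = dtm / np.maximum(p_w_d, 1e-12)    # n(d, w) / P(w | d)
        # M-step: expected counts n(d, z) and n(z, w), then renormalize
        p_z_d_new = p_z_d * (ratio @ p_w_z.T)
        p_w_z_new = p_w_z * (p_z_d.T @ ratio)
        p_z_d = p_z_d_new / p_z_d_new.sum(axis=1, keepdims=True)
        p_w_z = p_w_z_new / p_w_z_new.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z
```

The two M-step products fold the E-step posterior P(z | d, w) ∝ P(z | d) P(w | z) into the expected-count sums, so no explicit (doc, term, topic) array is materialized.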
Two lightweight classes are included to assist with topic modeling tasks on text corpora. The Corpus class takes in text documents and builds a document-term matrix. The PLSAModel class has a train method that provides an interface to plsa.
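Conceptually, turning raw documents into a document-term matrix works as sketched below. This is an illustrative stand-in for what a Corpus-style helper does; the package's actual Corpus class may tokenize, filter, and index terms differently.

```python
import numpy as np
from collections import Counter

def build_doc_term_matrix(docs):
    """Build a dense document-term matrix from whitespace-tokenized docs.

    Hypothetical sketch of a Corpus-style helper; the real class's
    tokenization and filtering may differ.
    """
    # Sorted vocabulary gives a stable column order
    vocab = sorted({w for doc in docs for w in doc.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    dtm = np.zeros((len(docs), len(vocab)), dtype=np.int64)
    for i, doc in enumerate(docs):
        for word, count in Counter(doc.lower().split()).items():
            dtm[i, index[word]] = count
    return dtm, vocab
```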
For an example, see example.py. It evaluates on a 2,000-document sample of the 20 Newsgroups data set, and the numba method runs in under a second on a standard laptop with 4 GB of RAM available. Assuming NumPy's random seeding behaves consistently across operating systems, the results should be
Top topic terms
================
Topic 1: boswell, yalcin, onur, wright, mbytes
Topic 2: premiums, yeast, vitamin, sinus, candida
Topic 3: ports, pci, stereo, ankara, istanbul
Topic 4: icons, atari, lsd, cyprus, apps
Topic 5: wires, neutral, circuit, wiring, wire
Topic 6: gifs, simtel, jfif, gif, jpeg
Topic 7: nhl, sleeve, gant, players, league
Topic 8: mormon, gaza, xxxx, israeli, arabs
Topic 9: chi, det, suck, cubs, pit
Topic 10: cramer, theism, odwyer, bds, clayton
We can assign coherent labels to most topics, such as "pharmaceuticals" for Topic 2, "middle east" for Topic 8, and "baseball" for Topic 9. Adjusting corpus construction parameters, running for more iterations, or changing the number of topics can yield even better results.
Performance comparisons
We compare the two implementations on artificial problems of different sizes, all with document-term matrix sparsity around 95% (which is fairly dense for a text-based corpus). These results were obtained on a standard laptop with 4 GB of RAM available. The script speed_test.py can be used to reproduce these results.
Corpus size | Vocab size | Number of topics | Basic method avg. time / iteration (best of 3) | Numba method avg. time / iteration (best of 3)
---|---|---|---|---
100 | 500 | 10 | 0.0047 s | 0.00058 s |
250 | 1000 | 10 | 0.024 s | 0.0028 s |
100 | 2500 | 10 | 0.026 s | 0.0028 s |
1000 | 5000 | 10 | 1.16 s | 0.042 s |
2000 | 6000 | 10 | 2.59 s | 0.12 s |
3000 | 5000 | 10 | 3.26 s | 0.13 s |
The file large_speed_test.py carries out a test on a large matrix: 10,000 documents and a 100,000-term vocabulary with 99% sparsity (10 million non-zero entries). Running a 5-topic model for 10 iterations takes around 30 seconds, or 3 seconds per iteration.
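A test matrix of this shape can be approximated by generating random sparse data directly in COO form. The sketch below uses scipy.sparse.random with small illustrative sizes; the actual large_speed_test.py script may construct its data differently.

```python
import numpy as np
from scipy.sparse import random as sparse_random

def make_sparse_dtm(n_docs, n_terms, density, seed=0):
    """Generate an artificial document-term matrix in COO format.

    Hypothetical setup for a speed test: random positions at the given
    density, with small positive integer counts as values.
    """
    return sparse_random(
        n_docs, n_terms,
        density=density,
        format="coo",
        random_state=seed,
        # Integer term counts between 1 and 9 for the non-zero entries
        data_rvs=lambda size: np.random.randint(1, 10, size),
    )
```

A 10,000 x 100,000 matrix at 1% density (matching the large test's 99% sparsity) would yield the 10 million non-zero entries mentioned above.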
We can also compare numba-plsa to a popular Python package on GitHub: PLSA. We used the example data from the PLSA repo. The two methods produced the same distributions when given the same initializations.
Implementation | Corpus size | Vocab size | Number of topics | Number of iterations | Time (best of 3)
---|---|---|---|---|---
PLSA package | 13 | 2126 | 5 | 30 | 44.89 s |
numba-plsa, basic | 13 | 2126 | 5 | 30 | 0.082 s |
numba-plsa, numba | 13 | 2126 | 5 | 30 | 0.006 s |
Project details
Release history
Download files
Download the file for your platform.
Source Distribution
File details
Details for the file numba_plsa-0.0.2.tar.gz.
File metadata
- Download URL: numba_plsa-0.0.2.tar.gz
- Upload date:
- Size: 6.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/4.5.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.1 CPython/3.9.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 5b1ba8422e9c80c74523473a75e47dafd1a69a9a2fbaa5de13f72e072950a669
MD5 | f80cb82d844c5b21ef11a728ebf3fbe5
BLAKE2b-256 | 27fe492ee573f624663d4c908cd7d3da762d1edcf1c882228289b4e4d402413d