Implementation of Gaussian LDA topic model, with efficiency tricks
Gaussian LDA
Another implementation of the paper Gaussian LDA for Topic Models with Word Embeddings.
This is a Python implementation based as closely as possible on the Java implementation released by the paper's authors.
Installation
You'll first need to install the choldate package, following its installation instructions. (It's not possible to include this as a dependency for the PyPI package.)
Then install gaussianlda using Pip:
pip install gaussianlda
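For context on why a Cholesky update package is required: when a word is added to a topic, the topic's covariance estimate changes by a rank-one term, so its Cholesky factor can be updated in O(D^2) time instead of being recomputed from scratch. The following is a minimal NumPy sketch of a rank-one update, for illustration only. It is not choldate's code, and the function name `chol_rank1_update` is made up for this example.

```python
import numpy as np


def chol_rank1_update(L, x):
    """Return the lower Cholesky factor of L @ L.T + outer(x, x).

    L is a lower-triangular Cholesky factor, x is a 1-D vector.
    (Hypothetical helper for illustration; not part of gaussianlda or choldate.)
    """
    L = np.array(L, dtype=np.float64)  # work on copies
    x = np.array(x, dtype=np.float64)
    n = x.shape[0]
    for k in range(n):
        r = np.hypot(L[k, k], x[k])
        c = r / L[k, k]
        s = x[k] / L[k, k]
        L[k, k] = r
        # Update the rest of column k, then the remaining part of x
        L[k + 1:, k] = (L[k + 1:, k] + s * x[k + 1:]) / c
        x[k + 1:] = c * x[k + 1:] - s * L[k + 1:, k]
    return L


# Check against a full recomputation on a small random example
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B @ B.T + 4 * np.eye(4)  # symmetric positive definite
x = rng.standard_normal(4)
L_updated = chol_rank1_update(np.linalg.cholesky(A), x)
```

Removing a word from a topic corresponds to the matching downdate (subtracting the rank-one term), which is not shown here.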
Usage
The package provides two classes for training Gaussian LDA:
- Cholesky only, gaussianlda.GaussianLDATrainer: Simple Gibbs sampler with optional Cholesky decomposition trick.
- Cholesky+aliasing, gaussianlda.GaussianLDAAliasTrainer: Cholesky decomposition (not optional) and the Vose aliasing trick.
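To give an idea of what the Vose aliasing trick involves, here is a minimal sketch of the general alias method: a table is built once in O(K) time for a K-way discrete distribution, after which each sample costs O(1). This illustrates the technique only. It is not the code used inside GaussianLDAAliasTrainer, and the function names are made up for this example.

```python
import numpy as np


def build_alias_table(probs):
    """Build Vose alias tables (prob, alias) for a discrete distribution."""
    probs = np.asarray(probs, dtype=np.float64)
    n = len(probs)
    scaled = probs * n / probs.sum()  # rescale so the average entry is 1
    prob = np.zeros(n)
    alias = np.zeros(n, dtype=np.int64)
    small = [i for i in range(n) if scaled[i] < 1.0]
    large = [i for i in range(n) if scaled[i] >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]  # keep outcome s with this probability
        alias[s] = l         # otherwise fall through to outcome l
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    # Leftovers have (numerically) exactly one unit of mass
    for i in large + small:
        prob[i] = 1.0
        alias[i] = i
    return prob, alias


def alias_sample(prob, alias, rng):
    """Draw one outcome in O(1): pick a column, then flip a biased coin."""
    i = rng.integers(len(prob))
    return int(i) if rng.random() < prob[i] else int(alias[i])


prob, alias = build_alias_table([0.5, 0.3, 0.1, 0.1])
rng = np.random.default_rng(0)
draws = [alias_sample(prob, alias, rng) for _ in range(1000)]
```

In a sampler, a table like this can go stale as counts change, which is why alias-based samplers typically pair it with Metropolis-Hastings correction steps.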
The trainer is prepared by instantiating the training class with the following arguments:
- corpus: List of documents, where each document is a list of int IDs of words. These are IDs into the vocabulary and the embeddings matrix.
- vocab_embeddings: (V, D) Numpy array, where V is the number of words in the vocabulary and D is the dimensionality of the embeddings.
- vocab: Vocabulary, given as a list of words, whose positions correspond to the indices used in the data. This is not strictly needed for training, but is used to output topics.
- num_tables: Number of topics to learn.
- alpha, kappa: Hyperparameters to the doc-topic Dirichlet and the inverse Wishart prior.
- save_path: Path to write the model out to after each iteration.
- mh_steps (aliasing only): Number of Metropolis-Hastings steps for each topic sample.
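The corpus, vocab and vocab_embeddings arguments need to agree with each other: word ID i in the corpus must refer to vocab[i] and to row i of the embeddings matrix. A minimal sketch of one way to build all three from tokenized documents is shown below. The `pretrained` dictionary is a stand-in filled with random vectors; in practice you would load real word embeddings. Dropping out-of-vocabulary words is an assumption made for this example.

```python
import numpy as np

# Tokenized documents (toy data)
docs = [
    ["money", "bank", "finance"],
    ["sheep", "cow", "unknownword", "goat"],
]

# Stand-in for pretrained embeddings: word -> vector (random, for illustration)
rng = np.random.default_rng(0)
known_words = ["money", "bank", "finance", "sheep", "cow", "goat"]
pretrained = {w: rng.standard_normal(50).astype(np.float32) for w in known_words}

# Vocabulary: only words that have an embedding, in a fixed order
vocab = sorted({w for doc in docs for w in doc if w in pretrained})
word_to_id = {w: i for i, w in enumerate(vocab)}

# (V, D) matrix whose row i is the embedding of vocab[i]
vocab_embeddings = np.stack([pretrained[w] for w in vocab])

# Documents as lists of int IDs, skipping words without an embedding
corpus = [[word_to_id[w] for w in doc if w in word_to_id] for doc in docs]
```

The resulting corpus, vocab_embeddings and vocab can then be passed to either trainer class.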
Then you set the sampler running for a specified number of iterations over the training data by calling trainer.sample(num_iters).
Example
import numpy as np
from gaussianlda import GaussianLDAAliasTrainer
# A small vocabulary as a list of words
vocab = "money business bank finance sheep cow goat pig".split()
# A random embedding for each word
# Really, you'd want to load something more useful!
# np.random.sample() takes no dtype argument, so convert afterwards
embeddings = np.random.sample((8, 100)).astype(np.float32)
corpus = [
    [0, 2, 1, 1, 3, 0, 6, 1],
    [3, 1, 1, 3, 7, 0, 1, 2],
    [7, 5, 4, 7, 7, 4, 6],
    [5, 6, 1, 7, 7, 5, 6, 4],
]
# Prepare a trainer
trainer = GaussianLDAAliasTrainer(
    corpus, embeddings, vocab, 2, 0.1, 0.1
)
# Set training running
trainer.sample(10)