
Implementation of Gaussian LDA topic model, with efficiency tricks


Gaussian LDA

Another implementation of the paper Gaussian LDA for Topic Models with Word Embeddings.

This is a Python implementation based as closely as possible on the Java implementation released by the paper's authors.

Installation

You'll first need to install the choldate package, following its installation instructions. (It's not possible to include this as a dependency of the PyPI package.)

Then install gaussianlda using pip:

pip install gaussianlda

Usage

The package provides two classes for training Gaussian LDA:

  • Cholesky only, gaussianlda.GaussianLDATrainer: Simple Gibbs sampler with optional Cholesky decomposition trick.
  • Cholesky+aliasing, gaussianlda.GaussianLDAAliasTrainer: Cholesky decomposition (not optional) and the Vose aliasing trick.
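To illustrate the idea behind the Cholesky trick, here is a minimal NumPy sketch of a rank-one Cholesky update. It is not the package's code (which relies on choldate); the function name `chol_update` and the test data are assumptions made for illustration. The point is that when a single vector is added to a covariance-like matrix, its Cholesky factor can be updated in O(D^2) rather than recomputed in O(D^3), and the log-determinant then comes straight from the diagonal.

```python
import numpy as np


def chol_update(L, x):
    """Return the lower Cholesky factor of L @ L.T + outer(x, x).

    Hypothetical helper for illustration; works on copies of its inputs.
    """
    L = L.copy()
    x = x.astype(L.dtype, copy=True)
    n = x.shape[0]
    for k in range(n):
        r = np.hypot(L[k, k], x[k])
        c = r / L[k, k]
        s = x[k] / L[k, k]
        L[k, k] = r
        if k + 1 < n:
            # Update the rest of column k, then fold the change back into x
            L[k + 1:, k] = (L[k + 1:, k] + s * x[k + 1:]) / c
            x[k + 1:] = c * x[k + 1:] - s * L[k + 1:, k]
    return L


# Compare against a full recomputation on a small random SPD matrix
rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
A = B @ B.T + 5 * np.eye(5)
x = rng.normal(size=5)

L_new = chol_update(np.linalg.cholesky(A), x)
L_ref = np.linalg.cholesky(A + np.outer(x, x))

# The log-determinant is read off the diagonal of the factor
logdet = 2.0 * np.log(np.diag(L_new)).sum()
```

Removing a vector (a downdate) follows the same pattern with the signs changed, and is what makes it cheap to take a word out of a topic and put it into another during Gibbs sampling.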

The trainer is prepared by instantiating the training class with the following arguments:

  • corpus: List of documents, where each document is a list of int IDs of words. These are IDs into the vocabulary and the embeddings matrix.
  • vocab_embeddings: (V, D) Numpy array, where V is the number of words in the vocabulary and D is the dimensionality of the embeddings.
  • vocab: Vocabulary, given as a list of words whose positions correspond to the IDs used in the corpus. This is not strictly needed for training, but is used to output topics.
  • num_tables: Number of topics to learn.
  • alpha, kappa: Hyperparameters for the doc-topic Dirichlet and the inverse Wishart prior.
  • save_path: Path to write the model out to after each iteration.
  • mh_steps (aliasing only): Number of Metropolis-Hastings steps for each topic sample.
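The corpus, vocab and vocab_embeddings arguments have to agree with one another: every ID in the corpus indexes both the vocabulary list and a row of the embeddings matrix. The following sketch shows one way to build them from tokenized text. The names `word_to_id` and `tokenized_docs` are illustrative, and dropping out-of-vocabulary words is an assumption of this sketch, not something the package requires.

```python
import numpy as np

vocab = "money business bank finance sheep cow goat pig".split()
# Map each word to its position in the vocabulary
word_to_id = {word: idx for idx, word in enumerate(vocab)}

tokenized_docs = [
    "bank money finance business".split(),
    "sheep goat cow tractor pig".split(),  # "tractor" is not in the vocabulary
]
# Convert words to IDs, silently skipping unknown words (an assumption)
corpus = [
    [word_to_id[w] for w in doc if w in word_to_id]
    for doc in tokenized_docs
]

# Placeholder random embeddings with one row per vocabulary entry
embeddings = np.random.default_rng(0).random((len(vocab), 100)).astype(np.float32)

# Sanity checks: one embedding row per word, and all IDs in range
assert embeddings.shape[0] == len(vocab)
assert all(0 <= i < len(vocab) for doc in corpus for i in doc)
```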

Then you set the sampler running for a specified number of iterations over the training data by calling trainer.sample(num_iters).
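For the aliasing trainer, sampling relies on alias tables, which allow a draw from a fixed discrete distribution in constant time after a linear-time setup. Below is a generic sketch of Vose's alias method, not the package's implementation; `build_alias_table` and `alias_draw` are hypothetical names. In alias-based samplers of this kind the tables are typically allowed to go slightly stale, and the Metropolis-Hastings steps correct for the mismatch.

```python
import numpy as np


def build_alias_table(weights):
    """Build (prob, alias) arrays for Vose's alias method."""
    weights = np.asarray(weights, dtype=np.float64)
    n = len(weights)
    scaled = weights * n / weights.sum()
    prob = np.zeros(n)
    alias = np.arange(n)
    small = [i for i in range(n) if scaled[i] < 1.0]
    large = [i for i in range(n) if scaled[i] >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]
        alias[s] = l
        # The large entry donates mass to fill up the small entry's slot
        scaled[l] = (scaled[l] + scaled[s]) - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    # Anything left over fills its own slot completely
    for i in large + small:
        prob[i] = 1.0
    return prob, alias


def alias_draw(prob, alias, rng):
    """Draw one index in O(1): pick a slot, then the slot's owner or its alias."""
    i = rng.integers(len(prob))
    return int(i if rng.random() < prob[i] else alias[i])


weights = [0.1, 0.2, 0.3, 0.4]
prob, alias = build_alias_table(weights)

# Recover the distribution the table encodes, to check it matches the input
implied = prob.copy()
np.add.at(implied, alias, 1.0 - prob)
implied /= len(weights)

rng = np.random.default_rng(0)
draws = [alias_draw(prob, alias, rng) for _ in range(1000)]
```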

Example

import numpy as np
from gaussianlda import GaussianLDAAliasTrainer

# A small vocabulary as a list of words
vocab = "money business bank finance sheep cow goat pig".split()
# A random embedding for each word
# Really, you'd want to load something more useful!
embeddings = np.random.sample((8, 100)).astype(np.float32)
corpus = [
    [0, 2, 1, 1, 3, 0, 6, 1],
    [3, 1, 1, 3, 7, 0, 1, 2],
    [7, 5, 4, 7, 7, 4, 6],
    [5, 6, 1, 7, 7, 5, 6, 4],
]
# Prepare a trainer
trainer = GaussianLDAAliasTrainer(
    corpus, embeddings, vocab, 2, 0.1, 0.1
)
# Set training running
trainer.sample(10)
