A unified latent variable modeling framework for analyzing large multimodal and multilingual datasets

DeepLatent

DeepLatent is a unified latent variable modeling framework for analyzing large multimodal and multilingual datasets. It relies on variational inference using deep neural networks for estimation.

The package currently supports:

  • Generic latent factor models
  • Topic models: The latent variables are a mixture of topics within documents.
  • Ideal point models: The latent variables are interpreted as ideological dimensions.

🌟 Key Features

  • Multilingual and multimodal support

    • Learn topics / ideal points across multiple modalities (e.g., texts and images, or texts and votes)
    • Learn the weight of each modality in determining the latent variables per observation
  • Flexible metadata handling:

    • prevalence: covariates that influence the latent variables
    • content: covariates that influence the response variables conditional on the latent variables (e.g., topic-word distributions)
    • labels: outcomes for classification or regression tasks
    • prediction: additional predictors for the labels
  • Flexible input/output representations:

    • Document embeddings (for texts, images, audio-visual data)
    • Word frequencies (BoW)
    • Raw images
    • Discrete choice data
    • Voting records

📦 Models

GTM (Generalized Topic Model)

  • Learns topics on the simplex
  • Supports dirichlet or logistic_normal priors (optionally conditioned on covariates)
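As a rough illustration of what "topics on the simplex" means under the two priors, the sketch below draws topic proportions from a logistic-normal and from a Dirichlet distribution using only the standard library. The function names and default parameters are assumptions chosen for the example; this is not the package's sampling code.

```python
import math
import random


def softmax(xs):
    """Map real-valued scores to non-negative weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_logistic_normal(n_topics, mean=0.0, std=1.0, rng=None):
    """Draw Gaussian scores, then push them onto the simplex with softmax."""
    rng = random.Random() if rng is None else rng
    scores = [rng.gauss(mean, std) for _ in range(n_topics)]
    return softmax(scores)


def sample_dirichlet(n_topics, alpha=0.5, rng=None):
    """Draw from a symmetric Dirichlet by normalizing Gamma variates."""
    rng = random.Random() if rng is None else rng
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(n_topics)]
    total = sum(gammas)
    return [g / total for g in gammas]


rng = random.Random(0)
theta_ln = sample_logistic_normal(5, rng=rng)
theta_dir = sample_dirichlet(5, rng=rng)
```

Both draws are vectors of five non-negative topic shares that sum to one; they differ only in how the shares are distributed across topics.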

IdealPointNN

  • Learns unconstrained latent variables (ℝⁿ) for ideal point modeling
  • Designed for political texts, images, audio and video recordings, surveys, and votes
  • Uses a gaussian prior (optionally conditioned on covariates)
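One way to picture a Gaussian prior conditioned on covariates is a prior whose mean depends on the covariates. The sketch below assumes a linear mean and a one-dimensional ideal point; the source does not say how the package parameterizes this relationship, and the covariates and coefficients are invented for the example.

```python
import random


def prior_mean(covariates, coefs, intercept=0.0):
    """Prior mean as a linear function of covariates (an assumption of this sketch)."""
    return intercept + sum(c * b for c, b in zip(covariates, coefs))


def sample_ideal_point(covariates, coefs, std=1.0, rng=None):
    """Draw a one-dimensional ideal point from N(prior_mean, std^2)."""
    rng = random.Random() if rng is None else rng
    return rng.gauss(prior_mean(covariates, coefs), std)


# Hypothetical covariates: [party indicator, seniority in decades].
covariates = [1.0, 0.5]
coefs = [0.8, -0.2]

mu = prior_mean(covariates, coefs)
draw = sample_ideal_point(covariates, coefs, std=0.5, rng=random.Random(1))
```

Without covariates the prior mean would be the same for every observation; conditioning shifts it per observation while the latent variable itself stays unconstrained on the real line.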

Installation

From PyPI (Recommended)

pip install deeplatent

From Source

git clone https://github.com/PinchOfData/DeepLatent.git
cd DeepLatent
pip install -e .

Development Installation

git clone https://github.com/PinchOfData/DeepLatent.git
cd DeepLatent
python setup_dev.py

🚀 Getting Started

1. Prepare Your Data with Corpus()

Supports text, embeddings, votes, and survey questions:

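import sys
sys.path.append('../src/')  # makes the modules under ../src/ importable

from sklearn.feature_extraction.text import CountVectorizer

from corpus import Corpus

# df (a table with "doc_clean" and "image_path" columns) and
# my_image_embedder (your image embedding function) must be defined beforehand.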

modalities = {
    "text": {
        "column": "doc_clean",
        "views": {
            "bow": {
                "type": "bow",
                "vectorizer": CountVectorizer()
            }
        }
    },
    "image": {
        "column": "image_path",
        "views": {
            "embedding": {
                "type": "embedding",
                "embed_fn": my_image_embedder
            }
        }
    }
}

my_dataset = Corpus(df, modalities=modalities)

Optionally include metadata:

  • prevalence, content, labels, prediction

2. Train a Model

For Topic Models:

from models import GTM

model = GTM(
    n_topics=20, 
    doc_topic_prior="logistic_normal",
    ae_type="wae"
)

For Ideal Point Models:

from models import IdealPointNN

model = IdealPointNN(
    n_ideal_points=1, # one-dimensional ideal point model
    ae_type="vae"
)

🔧 Common Options

Argument       | Description
ae_type        | "wae" (Wasserstein autoencoder), "vae" (variational autoencoder), or "ae" (plain autoencoder)
fusion         | "poe" (Product of Experts), "moe_gating" (Mixture of Experts), or "moe_average" (simple averaging across modalities)
update_prior   | Learn a structured prior conditioned on prevalence covariates
w_prior        | Strength of prior alignment for wae
w_pred_loss    | Weight of the supervised loss for predicting labels
kl_annealing_* | Annealing of the KL term, which sets the strength of prior alignment for vae; helps prevent posterior collapse
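To show why annealing the KL term can help against posterior collapse, here is a minimal sketch of a linear warm-up schedule. The function names and the parameters `start_epoch`, `end_epoch`, and `max_weight` are hypothetical; the actual `kl_annealing_*` arguments of the package are not spelled out in this README and may work differently.

```python
def kl_weight(epoch, start_epoch=0, end_epoch=10, max_weight=1.0):
    """Linearly increase the KL weight from 0 to max_weight between two epochs."""
    if epoch <= start_epoch:
        return 0.0
    if epoch >= end_epoch:
        return max_weight
    return max_weight * (epoch - start_epoch) / (end_epoch - start_epoch)


def vae_loss(reconstruction_loss, kl_divergence, epoch):
    """Total loss: reconstruction term plus the annealed KL penalty."""
    return reconstruction_loss + kl_weight(epoch) * kl_divergence


# Early in training the KL term is switched off, later it counts fully.
weights = [kl_weight(e) for e in (0, 5, 10, 20)]
```

With a small early weight the encoder can first learn informative latent variables from the reconstruction term before the prior starts pulling the posterior toward it.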

🔍 Analysis and Utilities

📚 Topic Models (GTM)

  • get_topic_words() – top words per topic
  • get_covariate_words() – word shifts by content covariates
  • get_top_docs() – representative documents
  • get_topic_word_distribution() – topic-word matrix
  • get_covariate_word_distribution() – word shift matrix
  • plot_topic_word_distribution() – word clouds / bar plots
  • visualize_docs() – document embeddings (UMAP, t-SNE, PCA)
  • visualize_words() – word embeddings
  • visualize_topics() – topic embeddings
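The idea behind reading top words off a topic-word matrix can be sketched in a few lines. The helper `top_words`, the vocabulary, and the matrix values below are invented for illustration; this is not how get_topic_words() is implemented, only the general ranking step it presumably builds on.

```python
def top_words(topic_word, vocab, k=2):
    """Return the k highest-weighted words for each topic (one row per topic)."""
    result = []
    for weights in topic_word:
        # Rank vocabulary indices by weight, largest first.
        ranked = sorted(range(len(vocab)), key=lambda j: weights[j], reverse=True)
        result.append([vocab[j] for j in ranked[:k]])
    return result


# Hypothetical topic-word matrix: 2 topics over a 5-word vocabulary.
vocab = ["budget", "tax", "school", "teacher", "border"]
topic_word = [
    [0.40, 0.35, 0.10, 0.05, 0.10],
    [0.05, 0.05, 0.45, 0.40, 0.05],
]

words_per_topic = top_words(topic_word, vocab, k=2)
```

In this toy matrix the first topic is dominated by fiscal vocabulary and the second by education vocabulary.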

👤 Ideal Point Models (IdealPointNN)

  • get_ideal_points() – ℝⁿ latent space
  • get_predictions() – supervised output
  • get_modality_weights() – fusion weights (PoE or gating)
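For intuition about how fusion weights arise under a Product of Experts, the sketch below combines two one-dimensional Gaussian "experts" (one per modality) with the textbook precision-weighted formula and compares the result with a plain average of the means. The numbers and function names are made up, no prior expert is included, and the package's own fusion code may differ.

```python
def product_of_experts(means, variances):
    """Combine Gaussian experts: precisions add, means are precision-weighted."""
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    mean = sum(p * m for p, m in zip(precisions, means)) / total_precision
    return mean, 1.0 / total_precision


def simple_average(means):
    """Give every modality the same weight regardless of its uncertainty."""
    return sum(means) / len(means)


# Hypothetical posteriors for one legislator: text is confident, votes are not.
text_mean, text_var = 1.0, 0.25
vote_mean, vote_var = -1.0, 1.0

poe_mean, poe_var = product_of_experts([text_mean, vote_mean], [text_var, vote_var])
avg_mean = simple_average([text_mean, vote_mean])
```

Here the text expert has four times the precision of the vote expert, so the fused mean lands at 0.6, closer to the text estimate, while the plain average sits at 0.0. The implied per-observation weights (0.8 and 0.2) are the kind of quantity a modality-weight report would surface.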

📁 Tutorials

Check out the example notebooks to get started.

Download sample data to run some notebooks: Congressional Speeches CSV


📖 References

  • Deep Latent Variable Models for Unstructured Data, Germain Gauthier, Philine Widmer, Elliott Ash (2025)

  • The Neural Ideal Point Model, Germain Gauthier, Hugo Subtil, Philine Widmer (2025)


⚠️ Disclaimer

This package is under active development 🚧. Feedback and contributions are welcome!
