
A unified latent variable modeling framework for analyzing large multimodal and multilingual datasets


DeepLatent

DeepLatent is a unified latent variable modeling framework for analyzing large multimodal and multilingual datasets. It relies on variational inference using deep neural networks for estimation.

The package currently supports:

  • Generic latent factor models
  • Topic models: The latent variables are a mixture of topics within documents.
  • Ideal point models: The latent variables are interpreted as ideological dimensions.

🌟 Key Features

  • Multilingual and multimodal support

    • Learn topics / ideal points across multiple modalities (e.g., texts and images, or texts and votes)
    • Learn the weight of each modality in determining the latent variables per observation
  • Flexible metadata handling:

    • prevalence: covariates that influence the latent variables
    • content: covariates that influence the response variables conditional on the latent variables (e.g., topic-word distributions)
    • labels: outcomes for classification or regression tasks
    • prediction: additional predictors for the labels
  • Flexible input/output representations:

    • Document embeddings (for texts, images, audio-visual data)
    • Word frequencies (BoW)
    • Raw images
    • Discrete choice data
    • Voting records

📦 Models

GTM (Generalized Topic Model)

  • Learns topics on the simplex
  • Supports dirichlet or logistic_normal priors (optionally conditioned on covariates)
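To make "topics on the simplex" concrete, the sketch below draws one document-topic vector from each prior family using only the Python standard library. It is an illustration of the two distributions, not DeepLatent's implementation: the helper names, the linear map from prevalence covariates to the prior mean, and all numeric values are assumptions made for the example.

```python
import math
import random

def softmax(scores):
    """Map unconstrained real scores to a point on the probability simplex."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_logistic_normal(mean, std, rng):
    """Draw Gaussian noise around `mean`, then push it through a softmax."""
    return softmax([rng.gauss(m, std) for m in mean])

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def covariate_mean(x, gamma):
    """Hypothetical linear map from prevalence covariates to the prior mean."""
    return [sum(xi * g for xi, g in zip(x, row)) for row in gamma]

rng = random.Random(0)
n_topics = 4
# One row of (illustrative) coefficients per topic, one column per covariate.
gamma = [[0.5, -1.0], [0.0, 0.3], [-0.2, 0.8], [1.0, 0.0]]
x = [1.0, 2.0]  # two prevalence covariates for a single document

theta_ln = sample_logistic_normal(covariate_mean(x, gamma), std=1.0, rng=rng)
theta_dir = sample_dirichlet([0.5] * n_topics, rng)
print(theta_ln, theta_dir)
```

Both vectors are non-negative and sum to one. The logistic-normal draw shows one way covariates can shift the prior: they move the Gaussian mean before the softmax is applied.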

IdealPointNN

  • Learns unconstrained latent variables (ℝⁿ) for ideal point modeling
  • Designed for political texts, images, audio and video recordings, surveys, and votes
  • Uses a gaussian prior (optionally conditioned on covariates)

Installation

From PyPI (Recommended)

pip install deeplatent
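To confirm that the install succeeded, one option is to query the installed distribution metadata with the standard library. This is a generic check, not something specific to DeepLatent; the helper name installed_version is made up for this example.

```python
from importlib import metadata

def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

# Prints a version string if the package is installed, otherwise None.
print(installed_version("deeplatent"))
```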

From Source

git clone https://github.com/PinchOfData/DeepLatent.git  
cd DeepLatent
pip install -e .

Development Installation

git clone https://github.com/PinchOfData/DeepLatent.git 
cd DeepLatent
python setup_dev.py

🚀 Getting Started

1. Prepare Your Data with Corpus()

Supports text, embeddings, votes, and survey questions:

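import sys
sys.path.append('../src/')  # make the modules under ../src/ importable

from sklearn.feature_extraction.text import CountVectorizer

from corpus import Corpus

# df is a pandas DataFrame with "doc_clean" and "image_path" columns.
# my_image_embedder is a user-defined embedding function (not shown here).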

modalities = {
    "text": {
        "column": "doc_clean",
        "views": {
            "bow": {
                "type": "bow",
                "vectorizer": CountVectorizer()
            }
        }
    },
    "image": {
        "column": "image_path",
        "views": {
            "embedding": {
                "type": "embedding",
                "embed_fn": my_image_embedder
            }
        }
    }
}

my_dataset = Corpus(df, modalities=modalities)

Optionally include metadata:

  • prevalence, content, labels, prediction
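For readers new to the "bow" view used in the modalities dictionary above, the following standard-library sketch shows what a bag-of-words representation contains: a vocabulary and one row of word counts per document. It stands in for CountVectorizer for illustration only; scikit-learn's default tokenizer behaves differently (for example, it drops one-character tokens), and the function name bag_of_words and the sample sentences are invented for this example.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count matrix) for whitespace-tokenized documents."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))  # all distinct tokens, alphabetical
    matrix = [[c[word] for word in vocab] for c in counts]
    return vocab, matrix

docs = ["tax cuts help growth", "tax hikes hurt growth", "growth growth growth"]
vocab, matrix = bag_of_words(docs)

print(vocab)
for row in matrix:
    print(row)
```

Each row has one entry per vocabulary word, so documents of different lengths become vectors of the same size.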

2. Train a Model

For Topic Models:

from models import GTM

model = GTM(
    n_topics=20, 
    doc_topic_prior="logistic_normal",
    ae_type="wae"
)

For Ideal Point Models:

from models import IdealPointNN

model = IdealPointNN(
    n_ideal_points=1, # one-dimensional ideal point model
    ae_type="vae"
)

🔧 Common Options

Argument          Description
ae_type           "wae" (Wasserstein autoencoder), "vae" (variational autoencoder), or "ae" (plain autoencoder)
fusion            "poe" (Product of Experts), "moe_gating" (Mixture of Experts), or "moe_average" (simple averaging across modalities)
update_prior      Learn a structured prior conditioned on prevalence covariates
w_prior           Strength of prior alignment for "wae"
w_pred_loss       Weight of the supervised loss for predicting the labels
kl_annealing_*    KL annealing settings that control the strength of prior alignment for "vae"; they help prevent posterior collapse
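The individual kl_annealing_* arguments are not listed here, so the sketch below only illustrates the general idea behind KL annealing with a linear warm-up: the weight on the KL term starts at zero and grows over training, so the encoder is not pushed onto the prior before it has learned to reconstruct. The function names, parameter names, and default values are hypothetical and are not DeepLatent's API.

```python
def kl_weight(epoch, start_epoch=0, end_epoch=10, max_weight=1.0):
    """Linearly ramp the KL weight from 0 to max_weight between two epochs."""
    if epoch <= start_epoch:
        return 0.0
    if epoch >= end_epoch:
        return max_weight
    return max_weight * (epoch - start_epoch) / (end_epoch - start_epoch)

def annealed_loss(reconstruction_loss, kl_divergence, epoch):
    """Combine the two loss terms using the annealed KL weight."""
    return reconstruction_loss + kl_weight(epoch) * kl_divergence

for epoch in (0, 5, 10, 20):
    print(epoch, kl_weight(epoch))
```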

🔍 Analysis and Utilities

📚 Topic Models (GTM)

  • get_topic_words() – top words per topic
  • get_covariate_words() – word shifts by content covariates
  • get_top_docs() – representative documents
  • get_topic_word_distribution() – topic-word matrix
  • get_covariate_word_distribution() – word shift matrix
  • plot_topic_word_distribution() – word clouds / bar plots
  • visualize_docs() – document embeddings (UMAP, t-SNE, PCA)
  • visualize_words() – word embeddings
  • visualize_topics() – topic embeddings
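As a rough picture of how a topic-word matrix relates to the top words per topic, the sketch below ranks the vocabulary by weight within each topic row. It is a standalone illustration with invented data; the helper top_words is hypothetical, and it makes no claim about the signatures or return types of get_topic_words() or get_topic_word_distribution().

```python
def top_words(topic_word, vocab, k=3):
    """For each topic row, return the k vocabulary words with the highest weight."""
    result = []
    for weights in topic_word:
        ranked = sorted(range(len(vocab)), key=lambda j: weights[j], reverse=True)
        result.append([vocab[j] for j in ranked[:k]])
    return result

vocab = ["tax", "health", "war", "school", "trade"]
# Rows are topics, columns follow the order of `vocab`; each row sums to 1.
topic_word = [
    [0.45, 0.05, 0.05, 0.10, 0.35],
    [0.05, 0.50, 0.05, 0.35, 0.05],
]

for topic_id, words in enumerate(top_words(topic_word, vocab, k=2)):
    print(topic_id, words)
```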

👤 Ideal Point Models (IdealPointNN)

  • get_ideal_points() – ℝⁿ latent space
  • get_predictions() – supervised output
  • get_modality_weights() – fusion weights (PoE or gating)
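To give some intuition for fusion weights, the sketch below fuses two one-dimensional Gaussian "experts" (one per modality) with a Product of Experts and compares the result to simple averaging. It assumes Gaussian experts and leaves out any prior expert; the function names and numbers are illustrative, and DeepLatent's own fusion code may differ.

```python
def product_of_experts(means, variances):
    """Fuse 1-D Gaussian experts: precisions add, means are precision-weighted."""
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    fused_mean = sum(m * p for m, p in zip(means, precisions)) / total_precision
    return fused_mean, 1.0 / total_precision

def poe_weights(variances):
    """Share of the fused mean attributable to each expert."""
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    return [p / total_precision for p in precisions]

def average_of_experts(means):
    """Simple averaging across modalities, ignoring each expert's uncertainty."""
    return sum(means) / len(means)

# Hypothetical per-modality estimates of one ideal point: text, then votes.
means = [1.0, -1.0]
variances = [0.5, 1.0]  # the text expert is more confident (lower variance)

fused_mean, fused_var = product_of_experts(means, variances)
print(fused_mean, fused_var, poe_weights(variances), average_of_experts(means))
```

Under the Product of Experts, the more confident modality pulls the fused estimate toward its own mean, whereas simple averaging treats both modalities equally.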

📁 Tutorials

Check out the example notebooks to get started.

Download sample data to run some notebooks: Congressional Speeches CSV


📖 References

  • Germain Gauthier, Philine Widmer, Elliott Ash (2025). Deep Latent Variable Models for Unstructured Data.

  • Germain Gauthier, Hugo Subtil, Philine Widmer (2025). The Neural Ideal Point Model.


⚠️ Disclaimer

This package is under active development 🚧 — feedback and contributions welcome!
