A unified latent variable modeling framework for analyzing large multimodal and multilingual datasets

DeepLatent

DeepLatent is a unified latent variable modeling framework for analyzing large multimodal and multilingual datasets. It relies on variational inference using deep neural networks for estimation.

The package currently supports:

  • Generic latent factor models
  • Topic models: The latent variables are a mixture of topics within documents.
  • Ideal point models: The latent variables are interpreted as ideological dimensions.

🌟 Key Features

  • Multilingual and multimodal support

    • Learn topics / ideal points across multiple modalities (e.g., texts and images, texts and votes, etc.)
    • Learn the weight of each modality in determining the latent variables per observation
  • Flexible metadata handling:

    • prevalence: covariates that influence the latent variables
    • content: covariates that influence the response variables conditional on the latent variables (e.g., topic-word distributions)
    • labels: outcomes for classification or regression tasks
    • prediction: additional predictors for the labels
  • Flexible input/output representations:

    • Document embeddings (for texts, images, audio-visual data)
    • Word frequencies (BoW)
    • Raw images
    • Discrete choice data
    • Voting records
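
To make the per-observation modality weights concrete, here is a minimal sketch of softmax gating over two modalities. It is an illustration only, not DeepLatent's implementation: the function names and the fixed gate scores are hypothetical, and in a trained model the scores would typically come from a network evaluated on each observation.

```python
import math

def softmax(scores):
    """Turn arbitrary real scores into non-negative weights that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gated_fusion(modality_latents, gate_scores):
    """Combine per-modality latent vectors using softmax gate weights."""
    weights = softmax(gate_scores)
    dim = len(modality_latents[0])
    fused = [
        sum(w * z[k] for w, z in zip(weights, modality_latents))
        for k in range(dim)
    ]
    return fused, weights

# Hypothetical latent vectors for one observation (values are made up).
text_latent = [0.2, -1.0]
image_latent = [0.6, 0.0]

# A higher score for text means the text modality dominates the fused latent.
fused, weights = gated_fusion([text_latent, image_latent], gate_scores=[2.0, 0.0])
```

Because the weights are computed per observation, one document can lean on its text while another leans on its image.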

📦 Models

GTM (Generalized Topic Model)

  • Learns topics on the simplex
  • Supports dirichlet or logistic_normal priors (optionally conditioned on covariates)
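
The two prior families both place topic proportions on the simplex but do so differently. The sketch below draws one sample from each using only the standard library; it assumes nothing about how GTM parameterizes its priors, and the helper names and parameter values are hypothetical.

```python
import math
import random

def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_logistic_normal(mean, std, rng):
    """Draw a Gaussian vector and map it to the simplex with a softmax."""
    z = [rng.gauss(m, s) for m, s in zip(mean, std)]
    shift = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
n_topics = 5

# Concentrations below 1 favor sparse topic mixtures.
theta_dir = sample_dirichlet([0.5] * n_topics, rng)

# The logistic-normal starts from an unconstrained Gaussian draw.
theta_ln = sample_logistic_normal([0.0] * n_topics, [1.0] * n_topics, rng)
```

Both vectors are non-negative and sum to one, so either can be read as a document's topic proportions.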

IdealPointNN

  • Learns unconstrained latent variables (ℝⁿ) for ideal point modeling
  • Designed for political texts, images, audio and video recordings, surveys, and votes
  • Uses a gaussian prior (optionally conditioned on covariates)
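
One simple way to picture a Gaussian prior conditioned on covariates is to let the prior mean be a linear function of those covariates. The sketch below does that with made-up numbers; the linear form, the function names, and the covariate layout are all assumptions for illustration and may differ from what IdealPointNN actually learns.

```python
import random

def conditional_prior_mean(covariates, coefficients):
    """Prior mean per latent dimension as a linear function of the covariates."""
    return [sum(c * x for c, x in zip(row, covariates)) for row in coefficients]

def sample_ideal_point(covariates, coefficients, prior_std, rng):
    """Draw an unconstrained ideal point around the covariate-dependent mean."""
    mean = conditional_prior_mean(covariates, coefficients)
    return [rng.gauss(m, prior_std) for m in mean]

# Hypothetical covariates: intercept, party dummy, incumbency dummy.
covariates = [1.0, 0.0, 1.0]

# One row of coefficients per latent dimension (here a single dimension).
coefficients = [[0.5, -1.0, 0.25]]

prior_mean = conditional_prior_mean(covariates, coefficients)
ideal_point = sample_ideal_point(covariates, coefficients, prior_std=1.0,
                                 rng=random.Random(0))
```

Without covariates, the same code reduces to a standard Gaussian prior by setting every coefficient to zero.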

Installation

From PyPI (Recommended)

pip install deeplatent

From Source

git clone https://github.com/PinchOfData/DeepLatent.git
cd DeepLatent
pip install -e .

Development Installation

git clone https://github.com/PinchOfData/DeepLatent.git
cd DeepLatent
python setup_dev.py

🚀 Getting Started

1. Prepare Your Data with Corpus()

Corpus() supports text, embeddings, votes, and survey questions:

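import sys
sys.path.append('../src/')  # make src/ importable when running from a source checkout

from sklearn.feature_extraction.text import CountVectorizer

from corpus import Corpus

# `df` is your own data table with the columns named below ("doc_clean", "image_path").
# `my_image_embedder` is a user-supplied function that produces image embeddings.
modalities = {
    "text": {
        "column": "doc_clean",
        "views": {
            "bow": {
                "type": "bow",
                "vectorizer": CountVectorizer()
            }
        }
    },
    "image": {
        "column": "image_path",
        "views": {
            "embedding": {
                "type": "embedding",
                "embed_fn": my_image_embedder
            }
        }
    }
}

my_dataset = Corpus(df, modalities=modalities)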
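
The snippet above leaves my_image_embedder undefined. As a placeholder for experimentation, the sketch below maps each image path to a fixed-length vector deterministically. It is purely hypothetical: it assumes embed_fn is called with a single path and returns one vector, which may not match the signature Corpus() expects, and a real embedder would run a pretrained vision model instead of hashing the path.

```python
import hashlib

EMBEDDING_DIM = 8  # arbitrary size chosen for this sketch

def my_image_embedder(image_path):
    """Hypothetical stand-in: map a path to EMBEDDING_DIM floats in [0, 1]."""
    digest = hashlib.sha256(str(image_path).encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:EMBEDDING_DIM]]

# Example rows with the two columns used by the modalities dictionary.
rows = [
    {"doc_clean": "tax reform debate", "image_path": "images/speech_001.png"},
    {"doc_clean": "health care bill", "image_path": "images/speech_002.png"},
]

embeddings = [my_image_embedder(row["image_path"]) for row in rows]
```

The same path always yields the same vector, which is enough to check that the data pipeline runs end to end before swapping in a real model.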

Optionally include metadata:

  • prevalence, content, labels, prediction

2. Train a Model

For Topic Models:

from models import GTM

model = GTM(
    n_topics=20, 
    doc_topic_prior="logistic_normal",
    ae_type="wae"
)

For Ideal Point Models:

from models import IdealPointNN

model = IdealPointNN(
    n_ideal_points=1, # one-dimensional ideal point model
    ae_type="vae"
)
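
With ae_type="vae", the training objective adds a KL term that pulls the approximate posterior toward the prior, and the kl_annealing_* options control how strongly. A common approach is to ramp the KL weight up over the first training steps. The sketch below shows a linear ramp against a standard normal prior; the schedule shape and all names are assumptions, not DeepLatent's actual settings.

```python
import math

def kl_weight(step, warmup_steps, start=0.0, end=1.0):
    """Linearly ramp the KL weight from `start` to `end` over `warmup_steps`."""
    if warmup_steps <= 0:
        return end
    progress = min(step / warmup_steps, 1.0)
    return start + (end - start) * progress

def gaussian_kl(mean, log_var):
    """KL(N(mean, exp(log_var)) || N(0, 1)), summed over latent dimensions."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mean, log_var)
    )

def vae_loss(reconstruction_loss, mean, log_var, step, warmup_steps):
    """Reconstruction loss plus the annealed KL penalty."""
    return reconstruction_loss + kl_weight(step, warmup_steps) * gaussian_kl(mean, log_var)

# Early in training the KL term is switched off; after warmup it counts fully.
early = vae_loss(1.0, mean=[1.0], log_var=[0.0], step=0, warmup_steps=100)
late = vae_loss(1.0, mean=[1.0], log_var=[0.0], step=100, warmup_steps=100)
```

Starting with a small weight lets the decoder learn to use the latent variables before the prior penalty applies in full, which is the usual motivation for annealing when posterior collapse is a concern.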

🔧 Common Options

| Argument | Description |
|---|---|
| ae_type | "wae" (Wasserstein autoencoder), "vae" (variational autoencoder), or "ae" (plain autoencoder) |
| fusion | "poe" (Product of Experts), "moe_gating" (Mixture of Experts), or "moe_average" (simple averaging across modalities) |
| update_prior | Learn a structured prior conditioned on prevalence covariates |
| w_prior | Strength of prior alignment for wae |
| w_pred_loss | Weight of the supervised loss for predicting labels |
| kl_annealing_* | Strength of prior alignment for vae; helps prevent posterior collapse |
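
To illustrate how the "poe" and "moe_average" fusion options differ, the sketch below fuses two one-dimensional Gaussian estimates of the same latent variable. It uses the textbook product-of-Gaussians formula and a plain mean; it is not taken from DeepLatent's code, the function names are hypothetical, and the library may, for example, include a prior expert in the product.

```python
def product_of_experts(means, variances):
    """Fuse Gaussian experts: precisions add, means are precision-weighted."""
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    fused_mean = sum(m * p for m, p in zip(means, precisions)) / total_precision
    fused_variance = 1.0 / total_precision
    return fused_mean, fused_variance

def simple_average(means):
    """Fuse by averaging the modality means with equal weights."""
    return sum(means) / len(means)

# Hypothetical estimates from a confident text encoder and a noisier image encoder.
means = [1.0, 3.0]
variances = [0.5, 2.0]

poe_mean, poe_var = product_of_experts(means, variances)
avg_mean = simple_average(means)
```

The product leans toward the more confident expert and yields a variance smaller than either input, while the average treats both modalities equally regardless of their uncertainty.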

🔍 Analysis and Utilities

📚 Topic Models (GTM)

  • get_topic_words() – top words per topic
  • get_covariate_words() – word shifts by content covariates
  • get_top_docs() – representative documents
  • get_topic_word_distribution() – topic-word matrix
  • get_covariate_word_distribution() – word shift matrix
  • plot_topic_word_distribution() – word clouds / bar plots
  • visualize_docs() – document embeddings (UMAP, t-SNE, PCA)
  • visualize_words() – word embeddings
  • visualize_topics() – topic embeddings

👤 Ideal Point Models (IdealPointNN)

  • get_ideal_points() – ℝⁿ latent space
  • get_predictions() – supervised output
  • get_modality_weights() – fusion weights (PoE or gating)

📁 Tutorials

Check out the example notebooks to get started.

Download sample data to run some notebooks: Congressional Speeches CSV


📖 References

  • Deep Latent Variable Models for Unstructured Data, Germain Gauthier, Philine Widmer, Elliott Ash (2025)

  • The Neural Ideal Point Model, Germain Gauthier, Hugo Subtil, Philine Widmer (2025)


⚠️ Disclaimer

This package is under active development 🚧 — feedback and contributions welcome!
