DeepLatent
DeepLatent is a unified latent variable modeling framework for analyzing large multimodal and multilingual datasets. It relies on variational inference using deep neural networks for estimation.
The package currently supports:
- Generic latent factor models
- Topic models: The latent variables are a mixture of topics within documents.
- Ideal point models: The latent variables are interpreted as ideological dimensions.
🌟 Key Features
- Multilingual and multimodal support:
  - Learn topics / ideal points across multiple modalities (e.g., texts and images, texts and votes)
  - Learn the weight of each modality in determining the latent variables per observation
- Flexible metadata handling:
  - `prevalence`: covariates that influence the latent variables
  - `content`: covariates that influence the response variables conditional on the latent variables (e.g., topic-word distributions)
  - `labels`: outcomes for classification or regression tasks
  - `prediction`: additional predictors for the labels
- Flexible input/output representations:
  - Document embeddings (for texts, images, audio-visual data)
  - Word frequencies (BoW)
  - Raw images
  - Discrete choice data
  - Voting records
📦 Models
GTM (Generalized Topic Model)
- Learns topics on the simplex
- Supports `dirichlet` or `logistic_normal` priors (optionally conditioned on covariates)
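To make "topics on the simplex" concrete, here is a minimal standard-library sketch of how a document's topic proportions can be drawn under each of the two prior families. It is not DeepLatent's implementation: the function names, the symmetric Dirichlet concentration `alpha`, and the zero-mean, unit-variance Gaussian are all assumptions chosen for illustration.

```python
import math
import random

def softmax(values):
    """Map unconstrained reals to a point on the probability simplex."""
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def sample_logistic_normal(n_topics, rng, mean=0.0, std=1.0):
    """Logistic-normal draw: Gaussian noise pushed through a softmax."""
    z = [rng.gauss(mean, std) for _ in range(n_topics)]
    return softmax(z)

def sample_dirichlet(n_topics, rng, alpha=0.5):
    """Symmetric Dirichlet draw: independent Gamma draws, normalized."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(n_topics)]
    total = sum(g)
    return [x / total for x in g]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
theta_ln = sample_logistic_normal(5, rng)
theta_dir = sample_dirichlet(5, rng)

# Both vectors are non-negative and sum to one (up to rounding).
print(sum(theta_ln), sum(theta_dir))
```

Either way the result is a vector of non-negative topic shares that sums to one. The two families differ in how the shares are generated, by normalizing Gamma draws or by applying a softmax to Gaussian draws.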
IdealPointNN
- Learns unconstrained latent variables (ℝⁿ) for ideal point modeling
- Designed for political texts, images, audio and video recordings, surveys, and votes
- Uses a `gaussian` prior (optionally conditioned on covariates)
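One way to read "a Gaussian prior conditioned on covariates" is that the prior mean of each latent dimension depends on the observation's covariates. The sketch below assumes the simplest case, a linear map with a fixed standard deviation. The package may parameterize this differently (for example with a neural network), and the covariate names and weights here are invented for illustration.

```python
import random

def conditional_prior_mean(covariates, weights):
    """Prior mean per latent dimension as a linear function of the covariates.

    `weights` holds one weight vector per latent dimension (an assumption).
    """
    return [sum(x * w for x, w in zip(covariates, column)) for column in weights]

def sample_ideal_point(covariates, weights, rng, std=1.0):
    """Draw a latent position from a Gaussian centered on the conditional mean."""
    mean = conditional_prior_mean(covariates, weights)
    return [rng.gauss(m, std) for m in mean]

# Hypothetical covariates: intercept, party indicator, seniority indicator.
x = [1.0, 0.0, 1.0]
# Hypothetical weights for a one-dimensional ideal point model.
w = [[0.5, -1.0, 0.25]]

print(conditional_prior_mean(x, w))  # [0.75]
print(sample_ideal_point(x, w, random.Random(0)))
```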
Installation
From PyPI (Recommended)
    pip install deeplatent
From Source
    git clone https://github.com/PinchOfData/DeepLatent.git
    cd DeepLatent
    pip install -e .
Development Installation
    git clone https://github.com/PinchOfData/DeepLatent.git
    cd DeepLatent
    python setup_dev.py
🚀 Getting Started
1. Prepare Your Data with Corpus()
Supports text, embeddings, votes, and survey questions:
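    import sys
    sys.path.append('../src/')  # make the repository's src/ directory importable

    from sklearn.feature_extraction.text import CountVectorizer
    from corpus import Corpus

    # `df` (your data table) and `my_image_embedder` (your embedding function)
    # are user-supplied and must be defined before this point.
    modalities = {
        "text": {
            "column": "doc_clean",  # column holding the cleaned text
            "views": {
                "bow": {
                    "type": "bow",
                    "vectorizer": CountVectorizer()
                }
            }
        },
        "image": {
            "column": "image_path",  # column holding the image paths
            "views": {
                "embedding": {
                    "type": "embedding",
                    "embed_fn": my_image_embedder
                }
            }
        }
    }

    my_dataset = Corpus(df, modalities=modalities)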
Optionally include metadata: `prevalence`, `content`, `labels`, `prediction`.
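Because the `modalities` dictionary is nested, a missing key is easy to overlook. The helper below is a hypothetical sanity check and not part of DeepLatent. The keys it expects (`column`, `views`, and a `type` inside each view) are inferred only from the example above, so treat it as a sketch under that assumption.

```python
def check_modalities(modalities):
    """Return a list of problems found in a modalities dictionary (empty if none)."""
    problems = []
    for name, spec in modalities.items():
        if "column" not in spec:
            problems.append(f"{name}: missing 'column'")
        views = spec.get("views", {})
        if not views:
            problems.append(f"{name}: no views defined")
        for view_name, view in views.items():
            if "type" not in view:
                problems.append(f"{name}.{view_name}: missing 'type'")
    return problems

# A well-formed entry passes; an incomplete one is reported.
good = {"text": {"column": "doc_clean", "views": {"bow": {"type": "bow"}}}}
bad = {"text": {"views": {"bow": {}}}}

print(check_modalities(good))  # []
print(check_modalities(bad))   # ["text: missing 'column'", "text.bow: missing 'type'"]
```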
2. Train a Model
For Topic Models:
    from models import GTM

    model = GTM(
        n_topics=20,
        doc_topic_prior="logistic_normal",
        ae_type="wae"
    )
For Ideal Point Models:
    from models import IdealPointNN

    model = IdealPointNN(
        n_ideal_points=1,  # one-dimensional ideal point model
        ae_type="vae"
    )
🔧 Common Options
| Argument | Description |
|---|---|
| `ae_type` | `"wae"` (Wasserstein autoencoder), `"vae"` (variational autoencoder), or `"ae"` (plain autoencoder) |
| `fusion` | `"poe"` (Product of Experts), `"moe_gating"` (Mixture of Experts), or `"moe_average"` (simple averaging across modalities) |
| `update_prior` | Learn a structured prior conditioned on `prevalence` covariates |
| `w_prior` | Strength of prior alignment for `wae` |
| `w_pred_loss` | Weight of the supervised loss for predicting `labels` |
| `kl_annealing_*` | Strength of prior alignment for `vae`; helps prevent posterior collapse |
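As background on the `fusion` options, the sketch below shows the textbook Product of Experts rule for one-dimensional Gaussian experts, next to a plain average of the means. It illustrates the general technique only. How DeepLatent combines modalities internally (for example, whether a prior expert is included, or what exactly `moe_average` averages) is not specified here, and the function names are hypothetical.

```python
def product_of_experts(means, variances):
    """Fuse 1-D Gaussian experts: precisions add, means are precision-weighted."""
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    fused_mean = sum(m * p for m, p in zip(means, precisions)) / total_precision
    return fused_mean, 1.0 / total_precision

def average_experts(means):
    """Unweighted average of the expert means, ignoring their uncertainty."""
    return sum(means) / len(means)

# Two modalities: a vague text expert and a more confident vote expert.
means = [0.0, 2.0]
variances = [1.0, 0.25]

print(product_of_experts(means, variances))  # (1.6, 0.2)
print(average_experts(means))                # 1.0
```

The product rule pulls the fused estimate toward the more confident expert and yields a variance smaller than that of either input, whereas the plain average treats all modalities equally.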
🔍 Analysis and Utilities
📚 Topic Models (GTM)
- `get_topic_words()` – top words per topic
- `get_covariate_words()` – word shifts by `content` covariates
- `get_top_docs()` – representative documents
- `get_topic_word_distribution()` – topic-word matrix
- `get_covariate_word_distribution()` – word shift matrix
- `plot_topic_word_distribution()` – word clouds / bar plots
- `visualize_docs()` – document embeddings (UMAP, t-SNE, PCA)
- `visualize_words()` – word embeddings
- `visualize_topics()` – topic embeddings
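The relationship between a topic-word matrix and "top words per topic" can be sketched in a few lines. This is a generic illustration with a made-up vocabulary and made-up probabilities. It does not call DeepLatent, and the actual return types of `get_topic_word_distribution()` and `get_topic_words()` may differ.

```python
def top_words(topic_word, vocab, k=3):
    """For each topic (row), return the k vocabulary entries with the highest weight."""
    result = []
    for row in topic_word:
        ranked = sorted(range(len(vocab)), key=lambda j: row[j], reverse=True)
        result.append([vocab[j] for j in ranked[:k]])
    return result

# Toy example: 2 topics over a 4-word vocabulary (values are invented).
vocab = ["tax", "health", "border", "school"]
topic_word = [
    [0.50, 0.10, 0.30, 0.10],
    [0.05, 0.60, 0.05, 0.30],
]

print(top_words(topic_word, vocab, k=2))
# [['tax', 'border'], ['health', 'school']]
```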
👤 Ideal Point Models (IdealPointNN)
- `get_ideal_points()` – ℝⁿ latent space
- `get_predictions()` – supervised output
- `get_modality_weights()` – fusion weights (PoE or gating)
📁 Tutorials
Check out the example notebooks to get started.
Download sample data to run some notebooks: Congressional Speeches CSV
📖 References
- Deep Latent Variable Models for Unstructured Data, Germain Gauthier, Philine Widmer, Elliott Ash (2025)
- The Neural Ideal Point Model, Germain Gauthier, Hugo Subtil, Philine Widmer (2025)
⚠️ Disclaimer
This package is under active development 🚧 — feedback and contributions welcome!