
A unified latent variable modeling framework for analyzing large multimodal and multilingual datasets


DeepLatent

DeepLatent is a unified latent variable modeling framework for analyzing large multimodal and multilingual datasets. It relies on variational inference using deep neural networks for estimation.

The package currently supports:

  • Generic latent factor models
  • Topic models: The latent variables are a mixture of topics within documents.
  • Ideal point models: The latent variables are interpreted as ideological dimensions.

🌟 Key Features

  • Multilingual and multimodal support

    • Learn topics / ideal points across multiple modalities (e.g., texts and images, or texts and votes)
    • Learn the weight of each modality in determining the latent variables per observation
  • Flexible metadata handling:

    • prevalence: covariates that influence the latent variables
    • content: covariates that influence the response variables conditional on the latent variables (e.g., topic-word distributions)
    • labels: outcomes for classification or regression tasks
    • prediction: additional predictors for the labels
  • Flexible input/output representations:

    • Document embeddings (for texts, images, audio-visual data)
    • Word frequencies (BoW)
    • Raw images
    • Discrete choice data
    • Voting records
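
To make the word-frequency (BoW) representation listed above concrete, here is a minimal sketch in plain Python. It is only an illustration of what a document-term count matrix contains; `bag_of_words` is a hypothetical helper and not part of DeepLatent, which builds this view from a vectorizer you pass in.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count matrix) for whitespace-tokenized documents.

    Hypothetical helper for illustration only; not a DeepLatent function.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # Sorted vocabulary gives each word a stable column index.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: j for j, tok in enumerate(vocab)}

    matrix = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok, n in Counter(toks).items():
            row[index[tok]] = n
        matrix.append(row)
    return vocab, matrix


docs = ["tax cuts help growth", "tax hikes hurt growth growth"]
vocab, counts = bag_of_words(docs)

for word, column in zip(vocab, zip(*counts)):
    print(word, list(column))  # one line per word: its count in each document
```

Each row of `counts` is one document and each column is one vocabulary word, which is the shape a topic model reconstructs from the latent topic mixture.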

📦 Models

GTM (Generalized Topic Model)

  • Learns topics on the simplex
  • Supports dirichlet or logistic_normal priors (optionally conditioned on covariates)
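
The two prior families both place topic proportions on the simplex, but they get there differently. The sketch below uses only the Python standard library and is a generic illustration of the two distributions, not DeepLatent's internal code; the function names and parameter values are assumptions chosen for the example.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_logistic_normal(mean, std, rng):
    """Draw a Gaussian vector and map it to the simplex with a softmax."""
    z = [rng.gauss(m, s) for m, s in zip(mean, std)]
    peak = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)  # fixed seed so the draws are reproducible
n_topics = 5

theta_dirichlet = sample_dirichlet([0.5] * n_topics, rng)
theta_logistic_normal = sample_logistic_normal(
    [0.0] * n_topics, [1.0] * n_topics, rng
)

print(theta_dirichlet)
print(theta_logistic_normal)
```

Both vectors are non-negative and sum to one. Conditioning the prior on covariates would amount to making `alpha` or `mean` a function of the prevalence covariates rather than a constant.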

IdealPointNN

  • Learns unconstrained latent variables (ℝⁿ) for ideal point modeling
  • Designed for political texts, images, audio and video recordings, surveys, and votes
  • Uses a gaussian prior (optionally conditioned on covariates)

Installation

From PyPI (Recommended)

pip install deeplatent

From Source

git clone https://github.com/PinchOfData/DeepLatent.git
cd DeepLatent
pip install -e .

Development Installation

git clone https://github.com/PinchOfData/DeepLatent.git
cd DeepLatent
python setup_dev.py

🚀 Getting Started

1. Prepare Your Data with Corpus()

Supports text, embeddings, votes, and survey questions:

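import sys
sys.path.append('../src/')  # makes the source checkout's src/ directory importable

from sklearn.feature_extraction.text import CountVectorizer

from corpus import Corpus

# df (your dataset, with "doc_clean" and "image_path" columns) and
# my_image_embedder (your embedding function) must be defined beforehand.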

modalities = {
    "text": {
        "column": "doc_clean",
        "views": {
            "bow": {
                "type": "bow",
                "vectorizer": CountVectorizer()
            }
        }
    },
    "image": {
        "column": "image_path",
        "views": {
            "embedding": {
                "type": "embedding",
                "embed_fn": my_image_embedder
            }
        }
    }
}

my_dataset = Corpus(df, modalities=modalities)

Optionally include metadata:

  • prevalence, content, labels, prediction

2. Train a Model

For Topic Models:

from models import GTM

model = GTM(
    n_topics=20, 
    doc_topic_prior="logistic_normal",
    ae_type="wae"
)

For Ideal Point Models:

from models import IdealPointNN

model = IdealPointNN(
    n_ideal_points=1, # one-dimensional ideal point model
    ae_type="vae"
)

🔧 Common Options

Argument       | Description
ae_type        | "wae" (Wasserstein autoencoder), "vae" (variational autoencoder), or "ae" (plain autoencoder)
fusion         | "poe" (Product of Experts), "moe_gating" (Mixture of Experts), or "moe_average" (simple averaging across modalities)
update_prior   | Learn a structured prior conditioned on prevalence covariates
w_prior        | Strength of prior alignment for wae
w_pred_loss    | Weight of the supervised loss for predicting labels
kl_annealing_* | Strength of prior alignment for vae; helps prevent posterior collapse
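
To give some intuition for the fusion options, the sketch below shows the textbook Product of Experts rule for one-dimensional Gaussian experts next to a plain average. It is a generic illustration under stated assumptions, not DeepLatent's implementation: the function names are hypothetical, and the package may fuse modalities differently in detail.

```python
def product_of_experts(means, variances):
    """Fuse 1-D Gaussian experts N(mean_i, var_i) by multiplying their densities.

    The product of Gaussians is Gaussian: precisions (1 / variance) add up,
    and the fused mean is the precision-weighted average of the expert means.
    """
    precisions = [1.0 / v for v in variances]
    fused_var = 1.0 / sum(precisions)
    fused_mean = fused_var * sum(m * p for m, p in zip(means, precisions))
    return fused_mean, fused_var


def simple_average(means):
    """Average the expert means with equal weight, ignoring their uncertainty."""
    return sum(means) / len(means)


# Hypothetical example: a vague text expert and a confident vote expert.
text_mean, text_var = 0.0, 1.0
vote_mean, vote_var = 2.0, 0.25

poe_mean, poe_var = product_of_experts(
    [text_mean, vote_mean], [text_var, vote_var]
)
avg_mean = simple_average([text_mean, vote_mean])

print(poe_mean, poe_var)  # pulled toward the lower-variance expert
print(avg_mean)           # midpoint of the two means
```

The Product of Experts estimate leans toward whichever modality is more certain, while averaging treats all modalities equally. A gating approach would instead learn the weights per observation.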

🔍 Analysis and Utilities

📚 Topic Models (GTM)

  • get_topic_words() – top words per topic
  • get_covariate_words() – word shifts by content covariates
  • get_top_docs() – representative documents
  • get_topic_word_distribution() – topic-word matrix
  • get_covariate_word_distribution() – word shift matrix
  • plot_topic_word_distribution() – word clouds / bar plots
  • visualize_docs() – document embeddings (UMAP, t-SNE, PCA)
  • visualize_words() – word embeddings
  • visualize_topics() – topic embeddings

👤 Ideal Point Models (IdealPointNN)

  • get_ideal_points() – ℝⁿ latent space
  • get_predictions() – supervised output
  • get_modality_weights() – fusion weights (PoE or gating)

📁 Tutorials

Check out the example notebooks to get started.

Download sample data to run some notebooks: Congressional Speeches CSV


📖 References

  • Deep Latent Variable Models for Unstructured Data, Germain Gauthier, Philine Widmer, Elliott Ash (2025)

  • The Neural Ideal Point Model, Germain Gauthier, Hugo Subtil, Philine Widmer (2025)


⚠️ Disclaimer

This package is under active development 🚧. Feedback and contributions are welcome!

