A simple, easy-to-use toolkit for Dynamic Embedded Topic Models on temporal document collections.

easy-detm Package Documentation

Package Scope

This package provides a Python interface for training and visualizing the Dynamic Embedded Topic Model (DETM).

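The current data API is intentionally simple. It takes two parallel lists, with one whitespace-tokenized string and one integer time index per document: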

documents: List[str]
timestamps: List[int]

Main API

Model

from easy_detm import DETMModel

DETMModel is the high-level class for:

  • creating the DETM model,
  • fitting it to temporal documents,
  • extracting topics,
  • inferring document-topic distributions,
  • saving and loading checkpoints,
  • evaluating topic coherence and topic diversity.

Data

from easy_detm.data import create_dataset_from_list, DocumentCorpus

Use create_dataset_from_list() for most workflows. Use DocumentCorpus only when you need to manually control train/validation/test splits.

Visualization

from easy_detm import (
    configure_cjk_fonts,
    visualize_embeddings,
    visualize_embeddings_over_time,
    visualize_topic_evolution,
)

Visualization functions read the learned model parameters; they do not retrain or modify the model. configure_cjk_fonts() is called automatically by the visualization module. You can also call it manually to inspect or reset font support so that Korean, Japanese, Chinese, and English labels render correctly.

Topic Metrics

diversity = model.get_topic_diversity(num_words=10)
coherence = model.get_topic_coherence(data=train, num_words=10)

get_topic_diversity() uses only the trained topic-word distributions. get_topic_coherence() also needs a reference corpus in DETM format, usually the training split. If you call it on a model restored with load(), pass data=train because checkpoints store model parameters and vocabulary, not the original corpus.
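
To make the diversity score easier to read, here is a minimal sketch of the usual definition: the fraction of unique words among the top words of all topics. The function name topic_diversity and the toy topics are assumptions for illustration; easy-detm may compute the score differently, for example per time step.

```python
def topic_diversity(top_words_per_topic):
    """Return the fraction of unique words among all topics' top words."""
    all_words = [word for words in top_words_per_topic for word in words]
    if not all_words:
        return 0.0
    return len(set(all_words)) / len(all_words)


# Toy top-word lists (hypothetical, not produced by a trained model).
topics = [
    ["climate", "carbon", "emissions"],
    ["trade", "market", "finance"],
    ["market", "carbon", "policy"],
]

# "market" and "carbon" each appear twice, so 7 of the 9 words are unique.
print(topic_diversity(topics))
```

A value near 1 means the topics share few top words; a value near 0 means the topics are largely redundant.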

Input Requirements

Documents

Documents should be strings where tokens are separated by whitespace:

documents = [
    "climate carbon emissions",
    "trade market finance",
]

The package does not perform advanced NLP preprocessing, so clean the text before passing it in. Recommended steps:

  • lowercase text,
  • remove or normalize punctuation,
  • remove domain-specific noise,
  • tokenize consistently,
  • optionally remove stopwords,
  • optionally lemmatize or stem terms.
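
The first few steps above can be sketched with the standard library alone. The helper preprocess and the tiny STOPWORDS set are hypothetical and not part of easy-detm; lemmatization and domain-specific cleaning are left out and would need an NLP library of your choice.

```python
import re

# Hypothetical stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, replace punctuation with spaces, and drop stopwords."""
    text = text.lower()
    # Replace anything that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in stopwords]
    # Rejoin with single spaces so tokens stay whitespace-separated.
    return " ".join(tokens)


raw_documents = [
    "Climate: carbon emissions, and the market!",
    "Trade in the Market (finance).",
]
documents = [preprocess(doc) for doc in raw_documents]
print(documents)
```

Because \w is Unicode-aware in Python 3, this pattern keeps Korean, Japanese, and Chinese characters. Languages written without spaces still need a proper tokenizer before this step.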

Timestamps

Timestamps should be integers:

timestamps = [0, 0, 1, 1, 2, 2]

Recommended convention:

  • use zero-based indices,
  • keep time IDs contiguous,
  • make sure every document has one timestamp.
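
If your raw time labels are years or other non-contiguous values, they can be remapped to this convention before training. The sketch below uses a hypothetical helper, to_time_indices, that is not part of easy-detm.

```python
def to_time_indices(raw_times):
    """Map sortable time labels to zero-based, contiguous integer indices."""
    ordered = sorted(set(raw_times))
    index_of = {label: i for i, label in enumerate(ordered)}
    return [index_of[label] for label in raw_times], ordered


documents = [
    "climate carbon emissions",
    "carbon tax policy",
    "trade market finance",
    "market growth trade",
    "solar energy climate",
    "energy market policy",
]
years = [2019, 2019, 2021, 2021, 2024, 2024]

timestamps, periods = to_time_indices(years)

# Every document needs exactly one timestamp.
assert len(documents) == len(timestamps)

print(timestamps)  # indices passed to the model
print(periods)     # original label for each index, useful for plot axes
```

Note that this mapping discards the size of the gaps between periods: 2019, 2021, and 2024 become equally spaced steps 0, 1, and 2.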

Hyperparameter Notes

Important parameters:

  • num_topics: number of topics.
  • num_times: number of time periods.
  • rho_size: topic embedding dimension.
  • emb_size: word embedding dimension.
  • t_hidden_size: hidden size for the theta encoder.
  • eta_hidden_size: hidden size for the eta LSTM.
  • eta_nlayers: number of LSTM layers for eta.
  • delta: random-walk prior variance used by the original DETM implementation.
  • enc_drop: dropout in the theta encoder.
  • batch_size: minibatch size.
  • learning_rate: optimizer learning rate.
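
To build intuition for delta, the sketch below simulates a plain Gaussian random walk in which each step adds noise with variance delta. It is an illustration of the prior's general form under that assumption, not easy-detm's internal code; simulate_random_walk and total_drift are hypothetical names.

```python
import math
import random


def simulate_random_walk(num_times, dim, delta, seed=0):
    """Simulate x_t = x_{t-1} + noise, where noise has variance delta."""
    rng = random.Random(seed)
    std = math.sqrt(delta)  # delta is treated as a variance here
    state = [0.0] * dim
    path = [state]
    for _ in range(num_times - 1):
        state = [x + rng.gauss(0.0, std) for x in state]
        path.append(state)
    return path


def total_drift(path):
    """Euclidean distance between the first and last state."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(path[0], path[-1])))


smooth = simulate_random_walk(num_times=10, dim=4, delta=0.005)
volatile = simulate_random_walk(num_times=10, dim=4, delta=0.5)

print(total_drift(smooth))
print(total_drift(volatile))
```

A smaller delta keeps consecutive time steps close together, which corresponds to topics that change slowly; a larger delta allows faster change between periods.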

Output Interpretation

Topics

model.get_topics() returns top words from the learned topic-word distributions. Because DETM is dynamic, each topic can have different top words at different time points.
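
The idea of time-dependent top words can be sketched as follows. The layout beta_topic[t][v] (probability of word v at time t for one topic) and the helper top_words_over_time are assumptions for illustration; the actual return format of get_topics() is not shown here.

```python
def top_words_over_time(beta_topic, vocab, num_words=2):
    """Return the top words of one topic at each time step."""
    result = []
    for probs in beta_topic:
        # Rank vocabulary indices by probability, highest first.
        ranked = sorted(range(len(vocab)), key=lambda v: probs[v], reverse=True)
        result.append([vocab[v] for v in ranked[:num_words]])
    return result


# Toy vocabulary and probabilities (hypothetical, not from a trained model).
vocab = ["carbon", "climate", "solar", "coal"]
beta_topic = [
    [0.40, 0.30, 0.10, 0.20],  # time 0
    [0.35, 0.30, 0.20, 0.15],  # time 1
    [0.25, 0.30, 0.35, 0.10],  # time 2
]

for t, words in enumerate(top_words_over_time(beta_topic, vocab)):
    print(t, words)
```

In this toy example "solar" overtakes "carbon" by the last time step, which is the kind of shift a dynamic topic model is meant to surface.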

Document-Topic Matrix

model.get_document_topics() returns an array with shape:

num_documents x num_topics

Each row is a topic-proportion vector for one input document.
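
Two common ways to read such a matrix are sketched below: picking the dominant topic per document and averaging topic shares per time period. The matrix theta is a hand-written stand-in for the output of get_document_topics(), and both helper names are hypothetical.

```python
def dominant_topics(theta):
    """Return the index of the highest-proportion topic for each document."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in theta]


def mean_topic_share_by_time(theta, timestamps):
    """Average the topic proportions of all documents in each time period."""
    sums, counts = {}, {}
    for row, t in zip(theta, timestamps):
        acc = sums.setdefault(t, [0.0] * len(row))
        for k, p in enumerate(row):
            acc[k] += p
        counts[t] = counts.get(t, 0) + 1
    return {t: [s / counts[t] for s in sums[t]] for t in sorted(sums)}


# Toy document-topic matrix: 4 documents x 2 topics (rows sum to 1).
theta = [
    [0.8, 0.2],
    [0.6, 0.4],
    [0.3, 0.7],
    [0.1, 0.9],
]
timestamps = [0, 0, 1, 1]

print(dominant_topics(theta))
print(mean_topic_share_by_time(theta, timestamps))
```

The per-period averages give a quick view of how much attention each topic receives over time, independent of the plotting functions.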

Visualizations

  • Embedding plots show topics and words in a shared 2D projection.
  • Topic evolution plots show word probability changes over time for one topic.

Acknowledgements

The core DETM model implementation is adapted from the original DETM code by Adji Bousso Dieng, Francisco J. R. Ruiz, and David M. Blei: https://github.com/adjidieng/DETM

Please cite the original paper when using the DETM model: "The Dynamic Embedded Topic Model" (Dieng, Ruiz, and Blei, 2019).
