A simple, easy-to-use toolkit for Dynamic Embedded Topic Models on temporal document collections.
easy-detm Package Document
Package Scope
This package provides a Python interface for training and visualizing the Dynamic Embedded Topic Model (DETM).
The current data API is intentionally simple:
documents: List[str]
timestamps: List[int]
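Because the two lists are paired by position, it can help to check them before building a dataset. The helper below is a minimal sketch and not part of easy-detm; the name check_inputs is hypothetical.

```python
from typing import List


def check_inputs(documents: List[str], timestamps: List[int]) -> None:
    """Hypothetical helper: verify that documents and timestamps line up."""
    if len(documents) != len(timestamps):
        raise ValueError(
            f"{len(documents)} documents but {len(timestamps)} timestamps"
        )
    for i, doc in enumerate(documents):
        if not isinstance(doc, str):
            raise TypeError(f"documents[{i}] is not a string")
    for i, t in enumerate(timestamps):
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(t, bool) or not isinstance(t, int):
            raise TypeError(f"timestamps[{i}] is not an integer")


check_inputs(["climate carbon emissions", "trade market finance"], [0, 1])
```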
Main API
Model
from easy_detm import DETMModel
DETMModel is the high-level class for:
- creating the DETM model,
- fitting it to temporal documents,
- extracting topics,
- inferring document-topic distributions,
- saving and loading checkpoints,
- evaluating topic coherence and topic diversity.
Data
from easy_detm.data import create_dataset_from_list, DocumentCorpus
Use create_dataset_from_list() for most workflows. Use DocumentCorpus only
when you need to manually control train/validation/test splits.
Visualization
from easy_detm import (
configure_cjk_fonts,
visualize_embeddings,
visualize_embeddings_over_time,
visualize_topic_evolution,
)
Visualization functions use the learned model parameters. They do not retrain or
modify the model. configure_cjk_fonts() is called automatically by the
visualization module and can also be called manually to inspect or reset CJK font
support for Korean, Japanese, Chinese, and English labels.
Topic Metrics
diversity = model.get_topic_diversity(num_words=10)
coherence = model.get_topic_coherence(data=train, num_words=10)
get_topic_diversity() uses only the trained topic-word distributions.
get_topic_coherence() also needs a reference corpus in DETM format, usually
the training split. If you call it on a model restored with load(), pass
data=train because checkpoints store model parameters and vocabulary, not the
original corpus.
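To illustrate why diversity needs only the topics while coherence also needs a corpus, here is a minimal sketch of both metrics in plain Python. It assumes the definitions commonly used with embedded topic models (fraction of unique top words, and NPMI over document co-occurrence); the package's exact formulas may differ, and the function names are hypothetical.

```python
import math
from itertools import combinations


def topic_diversity(topics):
    """Fraction of unique words among the top words of all topics."""
    all_words = [word for topic in topics for word in topic]
    return len(set(all_words)) / len(all_words)


def npmi_coherence(topics, reference_docs):
    """Mean NPMI over word pairs within each topic (assumed definition)."""
    doc_sets = [set(doc.split()) for doc in reference_docs]
    n_docs = len(doc_sets)

    def doc_freq(*words):
        # Number of reference documents containing all given words
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for topic in topics:
        for w1, w2 in combinations(topic, 2):
            joint = doc_freq(w1, w2)
            if joint == 0:
                scores.append(-1.0)  # the pair never co-occurs
                continue
            p1 = doc_freq(w1) / n_docs
            p2 = doc_freq(w2) / n_docs
            p12 = joint / n_docs
            if p12 == 1.0:
                scores.append(0.0)  # NPMI is 0/0 here; this sketch scores it 0
                continue
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores)


topics = [["climate", "carbon"], ["trade", "market"]]
docs = ["climate carbon emissions", "trade market finance", "climate policy"]
print(topic_diversity(topics))
print(npmi_coherence(topics, docs))
```

The diversity function never touches docs, whereas the coherence function cannot run without them. That is consistent with passing data=train when calling get_topic_coherence() on a model restored with load().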
Input Requirements
Documents
Each document should be a single string of whitespace-separated tokens:
documents = [
"climate carbon emissions",
"trade market finance",
]
The current package does not perform advanced NLP preprocessing. Recommended preprocessing before calling the package:
- lowercase text,
- remove or normalize punctuation,
- remove domain-specific noise,
- tokenize consistently,
- optionally remove stopwords,
- optionally lemmatize or stem terms.
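The first few recommendations can be sketched with the standard library alone. This is an illustrative helper, not part of easy-detm; the name preprocess and the tiny stopword set are hypothetical, and lemmatization or stemming would need an external NLP library.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only
STOPWORDS = frozenset({"the", "a", "an", "of", "and"})


def preprocess(text, stopwords=frozenset()):
    """Lowercase, replace punctuation with spaces, and drop stopwords."""
    text = text.lower()
    # Anything that is neither a word character nor whitespace becomes a space
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in stopwords]
    # Re-join with single spaces so tokens are whitespace-separated
    return " ".join(tokens)


raw = ["Climate, Carbon & Emissions!", "The Trade of Markets"]
documents = [preprocess(doc, STOPWORDS) for doc in raw]
print(documents)
```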
Timestamps
Timestamps should be integers:
timestamps = [0, 0, 1, 1, 2, 2]
Recommended convention:
- use zero-based indices,
- keep time IDs contiguous,
- make sure every document has one timestamp.
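If the raw time labels are years or other arbitrary integers, they can be mapped to zero-based contiguous indices before training. The helper below is a minimal sketch, not part of easy-detm; to_time_indices is a hypothetical name. Note that it collapses gaps: a year with no documents gets no index of its own.

```python
def to_time_indices(raw_times):
    """Map arbitrary integer time labels to 0..T-1, preserving order."""
    ordered = sorted(set(raw_times))
    index_of = {value: i for i, value in enumerate(ordered)}
    # Return the per-document indices and the labels for mapping back later
    return [index_of[value] for value in raw_times], ordered


timestamps, labels = to_time_indices([2019, 2019, 2021, 2024])
print(timestamps)
print(labels)
```

Keeping the returned labels makes it possible to translate a time index back to the original year when labelling plots.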
Hyperparameter Notes
Important parameters:
- num_topics: number of topics.
- num_times: number of time periods.
- rho_size: topic embedding dimension.
- emb_size: word embedding dimension.
- t_hidden_size: hidden size for the theta encoder.
- eta_hidden_size: hidden size for the eta LSTM.
- eta_nlayers: number of LSTM layers for eta.
- delta: random-walk prior variance used by the original DETM implementation.
- enc_drop: dropout in the theta encoder.
- batch_size: minibatch size.
- learning_rate: optimizer learning rate.
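To build intuition for delta, the sketch below simulates a one-dimensional Gaussian random walk whose step variance is delta. It is an illustration of a random-walk prior in general, not the package's code; the function names and the values 0.005 and 0.5 are chosen only for the example.

```python
import math
import random


def simulate_random_walk(num_times, delta, seed=0):
    """Random walk: each step is drawn from N(0, delta), delta being a variance."""
    rng = random.Random(seed)
    step_std = math.sqrt(delta)  # standard deviation is the square root of the variance
    value = 0.0
    path = [value]
    for _ in range(num_times - 1):
        value += rng.gauss(0.0, step_std)
        path.append(value)
    return path


def total_variation(path):
    """Sum of absolute step sizes; larger means a less smooth trajectory."""
    return sum(abs(b - a) for a, b in zip(path, path[1:]))


smooth = simulate_random_walk(num_times=10, delta=0.005)
rough = simulate_random_walk(num_times=10, delta=0.5)
print(total_variation(smooth), total_variation(rough))
```

A smaller variance keeps consecutive time steps close together, so a smaller delta corresponds to a prior that favors slower change over time.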
Output Interpretation
Topics
model.get_topics() returns top words from the learned topic-word distributions.
Because DETM is dynamic, each topic can have different top words at different
time points.
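The idea of time-dependent top words can be sketched without the package. The example assumes a hypothetical nested list beta indexed as beta[topic][time][word] together with a vocab list; the actual return format of get_topics() is not specified here and may differ.

```python
def top_words_over_time(beta, vocab, num_words=3):
    """For each topic and time point, list the highest-probability words."""
    result = []
    for topic_rows in beta:
        per_time = []
        for probs in topic_rows:
            # Rank vocabulary indices by descending probability
            ranked = sorted(range(len(vocab)), key=lambda i: probs[i], reverse=True)
            per_time.append([vocab[i] for i in ranked[:num_words]])
        result.append(per_time)
    return result


vocab = ["carbon", "climate", "market", "trade"]
# One topic observed at two time points (made-up probabilities)
beta = [[[0.5, 0.3, 0.1, 0.1], [0.2, 0.6, 0.1, 0.1]]]
print(top_words_over_time(beta, vocab, num_words=2))
# → [[['carbon', 'climate'], ['climate', 'carbon']]]
```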
Document-Topic Matrix
model.get_document_topics() returns an array with shape:
num_documents x num_topics
Each row is a topic-proportion vector for one input document.
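Two common ways to read such a matrix are the dominant topic per document and the average topic share per time period. The sketch below uses a plain nested list in place of the returned array and assumes rows follow the input document order; both helper names are hypothetical.

```python
from collections import defaultdict


def dominant_topics(theta):
    """Index of the largest proportion in each row (first index wins ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in theta]


def mean_topic_share_by_time(theta, timestamps):
    """Average the topic-proportion rows of all documents in each time period."""
    sums = {}
    counts = defaultdict(int)
    for row, t in zip(theta, timestamps):
        if t not in sums:
            sums[t] = [0.0] * len(row)
        sums[t] = [s + p for s, p in zip(sums[t], row)]
        counts[t] += 1
    return {t: [s / counts[t] for s in sums[t]] for t in sorted(sums)}


# Made-up proportions for three documents and two topics
theta = [[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]]
timestamps = [0, 0, 1]
print(dominant_topics(theta))
print(mean_topic_share_by_time(theta, timestamps))
```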
Visualizations
- Embedding plots show topics and words in a shared 2D projection.
- Topic evolution plots show word probability changes over time for one topic.
Acknowledgements
The core DETM model implementation is adapted from the original DETM code by Adji Bousso Dieng, Francisco J. R. Ruiz, and David M. Blei: https://github.com/adjidieng/DETM
Please cite the original paper when using the DETM model: "The Dynamic Embedded Topic Model" (Dieng, Ruiz, and Blei, 2019).