Measure and visualize topic model stability across multiple runs
topic-stability
Measure and visualize the stability of topic models across multiple runs.
Topic models are stochastic: two runs with the same settings produce differently-labelled topics in a different order. topic-stability aligns topics across runs using sentence-embedding centroids and scores each topic by how consistently the same documents are assigned to it (Jensen-Shannon divergence). The result is a per-topic stability score in [0, 1] and a small-multiples UMAP visualization with stability annotated on each panel.
Works with any topic model that produces a document-topic matrix — LDA, NMF, BERTopic, and more.
Install
pip install topic-stability # core (numpy + scipy only)
pip install "topic-stability[embed]" # + sentence-transformers
pip install "topic-stability[umap,viz]" # + UMAP + matplotlib
pip install "topic-stability[all]" # everything
Quick start
sklearn (LDA, NMF, …)
from sklearn.decomposition import LatentDirichletAllocation
from topic_stability import TopicRun, StabilityAnalysis, DocumentEmbedder
# Embed documents once and cache to disk
embedder = DocumentEmbedder(cache_path="embeddings.npy")
embeddings = embedder.embed(texts, ids=doc_ids)
# Train several runs
runs = [
    TopicRun.from_sklearn(
        LatentDirichletAllocation(n_components=20).fit(X), X
    )
    for _ in range(5)
]
analysis = StabilityAnalysis(runs, embeddings=embeddings)
analysis.align()
print(analysis.topic_stability()) # array of shape (n_topics,)
print(analysis.overall_stability()) # scalar
analysis.visualize("topics.png") # requires topic-stability[umap,viz]
Pass precomputed embeddings (e.g. from BERTopic)
from topic_stability.integrations.bertopic import from_bertopic
run, embeddings = from_bertopic(model, embeddings=precomputed_embeddings)
See BERTopic notes below for important differences.
From files (Mallet / CSV pipeline)
runs = [
    TopicRun.from_csv(
        f"model_42_run{i}/doc_topic_avg.csv",
        word_topic_path=f"model_42_run{i}/word_topic_avg.csv",
    )
    for i in range(1, 6)
]
embedder = DocumentEmbedder(cache_path="embeddings.npy")
embeddings, _ = embedder.load()
analysis = StabilityAnalysis(runs, embeddings=embeddings)
analysis.align()
analysis.visualize("topics.png", umap_coords=precomputed_umap)
API
TopicRun
One run's topic distributions.
| Constructor | Use when |
|---|---|
| TopicRun.from_matrix(doc_topic, *, doc_ids, word_topic, vocab) | You have numpy arrays |
| TopicRun.from_sklearn(model, X, *, doc_ids, vocab) | sklearn transform() interface |
| TopicRun.from_csv(doc_topic_path, *, word_topic_path) | CSV files from the CLI pipeline |
| TopicRun.from_mallet_states(model_dir, *, iterations, tsv_path) | Mallet .gz state files |
DocumentEmbedder
embedder = DocumentEmbedder(model="all-MiniLM-L6-v2", cache_path="embeddings.npy")
embeddings = embedder.embed(texts, ids=doc_ids) # computes and caches
embeddings, ids = embedder.load() # load from cache
Pass the returned array directly to StabilityAnalysis(runs, embeddings=embeddings).
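The compute-once, cache-to-disk pattern can be sketched generically as follows. This is an illustration only, not how DocumentEmbedder is implemented: embed_with_cache and its embed_fn argument are hypothetical names, and unlike the real embedder this sketch does not store document ids alongside the array.

```python
import os

import numpy as np


def embed_with_cache(texts, embed_fn, cache_path):
    """Return cached embeddings if present; otherwise compute and save them.

    embed_fn is a hypothetical callable mapping a list of texts to an
    (n_docs, dim) array-like. cache_path should end in ".npy".
    """
    if os.path.exists(cache_path):
        return np.load(cache_path)
    embeddings = np.asarray(embed_fn(texts), dtype=float)
    np.save(cache_path, embeddings)
    return embeddings
```

Caching matters here because every run in a stability analysis is compared against the same document embeddings, so they only need to be computed once.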
StabilityAnalysis
analysis = StabilityAnalysis(runs, embeddings, *, doc_ids=None)
analysis.align(reference=0) # must call before scoring
analysis.topic_stability() # ndarray (n_topics,) in [0, 1]
analysis.overall_stability() # float
analysis.umap_projection(**kwargs) # ndarray (n_docs, 2)
analysis.visualize(path, *, reference_run=0, umap_coords=None)
Alignment uses cosine similarity of per-topic embedding centroids
(centroid_k = Σ_d θ_dk · e_d, normalised) matched with the Hungarian
algorithm. No shared vocabulary is required, so runs from different model
types can be compared.
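The alignment step described above can be sketched with numpy and scipy alone. This is a minimal illustration under stated assumptions, not the package's own code: topic_centroids and align_to_reference are hypothetical helper names, and the sketch assumes both runs cover the same documents and have the same number of topics.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def topic_centroids(doc_topic, embeddings):
    """Weighted embedding centroid per topic, scaled to unit length.

    doc_topic: (n_docs, n_topics) weights theta.
    embeddings: (n_docs, dim) document embeddings e_d.
    """
    centroids = doc_topic.T @ embeddings  # row k = sum_d theta_dk * e_d
    norms = np.linalg.norm(centroids, axis=1, keepdims=True)
    return centroids / np.clip(norms, 1e-12, None)


def align_to_reference(ref_doc_topic, other_doc_topic, embeddings):
    """Return perm so that column perm[k] of the other run matches reference topic k."""
    ref = topic_centroids(ref_doc_topic, embeddings)
    other = topic_centroids(other_doc_topic, embeddings)
    similarity = ref @ other.T  # cosine similarity, since rows are unit length
    # Hungarian algorithm: one-to-one matching that maximises total similarity.
    _, perm = linear_sum_assignment(similarity, maximize=True)
    return perm
```

Reordering the other run with other_doc_topic[:, perm] then puts its columns in the reference run's topic order. Because the matching relies only on document embeddings, no word-level information enters the comparison.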
Stability score for topic k: mean pairwise 1 − JS(p, q) where p and
q are the normalised document-profile columns θ[:,k] (treated as a
distribution over documents) from each pair of aligned runs.
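The score for a single topic can be sketched as below. This is an illustration rather than the package's implementation: topic_stability_score is a hypothetical name, the runs are assumed to be already aligned, and the divergence is assumed to use base-2 logarithms so that it lies in [0, 1]. Note that scipy's jensenshannon returns the Jensen-Shannon distance (the square root of the divergence), so it is squared here.

```python
from itertools import combinations

import numpy as np
from scipy.spatial.distance import jensenshannon


def topic_stability_score(aligned_runs, k):
    """Mean pairwise 1 - JS divergence of topic k's document profile.

    aligned_runs: list of (n_docs, n_topics) arrays with columns already aligned.
    Assumes column k has positive total weight in every run.
    """
    # Treat each column theta[:, k] as a distribution over documents.
    profiles = [run[:, k] / run[:, k].sum() for run in aligned_runs]
    scores = [
        1.0 - jensenshannon(p, q, base=2) ** 2  # squared distance = divergence
        for p, q in combinations(profiles, 2)
    ]
    return float(np.mean(scores))
```

A topic whose document profile is identical in every run scores 1, and a topic whose profiles never overlap scores 0.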
BERTopic
from topic_stability.integrations.bertopic import from_bertopic
run, embeddings = from_bertopic(model, docs=None, *, embeddings=None, doc_ids=None)
Returns (TopicRun, embeddings_array).
Key differences from LDA/NMF:
- BERTopic assigns each document to exactly one cluster (hard assignment). The doc_topic matrix is binary: 1 for the assigned topic, 0 elsewhere. Documents that HDBSCAN assigns to topic −1 (outliers) get an all-zero row.
- model.probabilities_ contains HDBSCAN soft-membership scores, not topic-weight distributions. We do not use them: they are a geometric property of the embedding space, not comparable to LDA posterior weights.
- Word representations come from c-TF-IDF scores, not a generative word distribution. Cross-model word-based comparison is not meaningful.
- Stability scores measure whether the same documents cluster together across runs, not whether the same word distributions recur.
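The binary doc_topic matrix described in the first point can be sketched as follows. This is an illustration, not the body of from_bertopic: hard_assignments_to_matrix is a hypothetical helper that takes one integer topic label per document, with −1 marking outliers.

```python
import numpy as np


def hard_assignments_to_matrix(topics, n_topics):
    """One-hot doc-topic matrix; label -1 (outlier) leaves the row all zero."""
    topics = np.asarray(topics)
    doc_topic = np.zeros((len(topics), n_topics))
    assigned = topics >= 0
    # Set a single 1 per assigned document, in the column of its topic.
    doc_topic[np.flatnonzero(assigned), topics[assigned]] = 1.0
    return doc_topic
```

With a matrix like this, each topic column is an indicator of cluster membership, which is why the resulting stability scores reflect document co-clustering rather than word distributions.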
CLI pipeline (Mallet / RustMallet)
The package includes CLI wrappers for a full file-based workflow:
# 1. Embed documents
topic-stability-embed corpus.tsv embeddings.npy
# 2. Project to 2D
topic-stability-project embeddings.npy umap_2d.csv
# 3. Estimate distributions from Mallet states
topic-stability-estimate model_42_run1/ 42 corpus.tsv
# 4. Visualize a single run
topic-stability-visualize umap_2d.csv model_42_run1/doc_topic_avg.csv \
model_42_run1/word_topic_avg.csv topics.png
License
MIT