Measure and visualize topic model stability across multiple runs

topic-stability

Measure and visualize the stability of topic models across multiple runs.

Topic models are stochastic: two runs with the same settings can produce different topics, with different labels and in a different order. topic-stability aligns topics across runs using sentence-embedding centroids, then scores each topic by how consistently the same documents are assigned to it, measured with Jensen-Shannon divergence. The result is a per-topic stability score in [0, 1], where higher means more stable, and a small-multiples UMAP visualization with the stability score annotated on each panel.

Works with any topic model that produces a document-topic matrix, including LDA, NMF, and BERTopic.

Install

pip install topic-stability                      # core (numpy + scipy only)
pip install "topic-stability[embed]"             # + sentence-transformers
pip install "topic-stability[umap,viz]"          # + UMAP + matplotlib
pip install "topic-stability[all]"               # everything

Quick start

sklearn (LDA, NMF, …)

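from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from topic_stability import TopicRun, StabilityAnalysis, DocumentEmbedder

# texts: list of document strings; doc_ids: matching list of identifiers
X = CountVectorizer().fit_transform(texts)   # document-term count matrix

# Embed documents once and cache to disk
embedder = DocumentEmbedder(cache_path="embeddings.npy")
embeddings = embedder.embed(texts, ids=doc_ids)

# Train several runs, each with a different random seed
runs = [TopicRun.from_sklearn(
            LatentDirichletAllocation(n_components=20, random_state=seed).fit(X), X
        ) for seed in range(5)]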

analysis = StabilityAnalysis(runs, embeddings=embeddings)
analysis.align()

print(analysis.topic_stability())   # array of shape (n_topics,)
print(analysis.overall_stability()) # scalar

analysis.visualize("topics.png")    # requires topic-stability[umap,viz]

Pass precomputed embeddings (e.g. from BERTopic)

from topic_stability.integrations.bertopic import from_bertopic

run, embeddings = from_bertopic(model, embeddings=precomputed_embeddings)

See the BERTopic section below for how its output differs from LDA/NMF.

From files (Mallet / CSV pipeline)

runs = [
    TopicRun.from_csv(
        f"model_42_run{i}/doc_topic_avg.csv",
        word_topic_path=f"model_42_run{i}/word_topic_avg.csv",
    )
    for i in range(1, 6)
]

embedder = DocumentEmbedder(cache_path="embeddings.npy")
embeddings, _ = embedder.load()

analysis = StabilityAnalysis(runs, embeddings=embeddings)
analysis.align()
analysis.visualize("topics.png", umap_coords=precomputed_umap)

API

TopicRun

One run's topic distributions.

Constructor | Use when
TopicRun.from_matrix(doc_topic, *, doc_ids, word_topic, vocab) | You have numpy arrays
TopicRun.from_sklearn(model, X, *, doc_ids, vocab) | sklearn transform() interface
TopicRun.from_csv(doc_topic_path, *, word_topic_path) | CSV files from the CLI pipeline
TopicRun.from_mallet_states(model_dir, *, iterations, tsv_path) | Mallet .gz state files

DocumentEmbedder

embedder = DocumentEmbedder(model="all-MiniLM-L6-v2", cache_path="embeddings.npy")
embeddings = embedder.embed(texts, ids=doc_ids)  # computes and caches
embeddings, ids = embedder.load()                # load from cache

Pass the returned array directly to StabilityAnalysis(runs, embeddings=embeddings).

StabilityAnalysis

analysis = StabilityAnalysis(runs, embeddings, *, doc_ids=None)
analysis.align(reference=0)         # must call before scoring
analysis.topic_stability()          # ndarray (n_topics,) in [0, 1]
analysis.overall_stability()        # float
analysis.umap_projection(**kwargs)  # ndarray (n_docs, 2)
analysis.visualize(path, *, reference_run=0, umap_coords=None)

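Alignment matches each run's topics to those of the reference run. Each topic is represented by its embedding centroid (centroid_k = Σ_d θ_dk · e_d, normalised, where θ_dk is document d's weight on topic k and e_d is its embedding), centroids are compared by cosine similarity, and the one-to-one matching is found with the Hungarian algorithm. No shared vocabulary is required, so runs from different model types can be compared.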
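The alignment step can be sketched with numpy and scipy alone. This is a minimal illustration of centroid matching under stated assumptions, not the package's actual code: the helper names `topic_centroids` and `align_to_reference` are hypothetical, and the data is synthetic.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def topic_centroids(theta, embeddings):
    """Weighted embedding centroid per topic, L2-normalised.

    theta: (n_docs, n_topics) document-topic weights.
    embeddings: (n_docs, dim) document embeddings.
    """
    centroids = theta.T @ embeddings                      # (n_topics, dim)
    norms = np.linalg.norm(centroids, axis=1, keepdims=True)
    return centroids / np.clip(norms, 1e-12, None)


def align_to_reference(theta_ref, theta_other, embeddings):
    """Return perm so that theta_other[:, perm[k]] matches reference topic k."""
    c_ref = topic_centroids(theta_ref, embeddings)
    c_other = topic_centroids(theta_other, embeddings)
    similarity = c_ref @ c_other.T                        # cosine similarity
    # The Hungarian algorithm minimises cost, so negate the similarity.
    _, cols = linear_sum_assignment(-similarity)
    return cols


# Synthetic check: a second "run" that is the reference with shuffled columns.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 16))
theta_ref = rng.dirichlet(np.full(5, 0.1), size=200)
shuffle = rng.permutation(5)
theta_other = theta_ref[:, shuffle]
perm = align_to_reference(theta_ref, theta_other, embeddings)
theta_aligned = theta_other[:, perm]                      # columns back in reference order
```

Because the matching depends only on document embeddings and document-topic weights, the same sketch applies whether the two matrices come from LDA, NMF, or any other model over the same documents.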

Stability score for topic k: the mean of 1 − JS(p, q) over all pairs of aligned runs, where p and q are the columns θ[:,k] from the two runs, each normalised to sum to 1 so that it can be treated as a distribution over documents.
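The score can be sketched as follows. This is an illustration under assumptions rather than the package's implementation: it assumes JS is the Jensen-Shannon divergence in base 2 (so it lies in [0, 1]), and the function name `pairwise_topic_stability` is hypothetical. Note that scipy's `jensenshannon` returns the JS distance, the square root of the divergence, so the sketch squares it.

```python
from itertools import combinations

import numpy as np
from scipy.spatial.distance import jensenshannon


def pairwise_topic_stability(aligned_runs):
    """Mean pairwise 1 - JS divergence per topic.

    aligned_runs: list of (n_docs, n_topics) arrays with columns already aligned.
    """
    n_topics = aligned_runs[0].shape[1]
    scores = np.zeros(n_topics)
    pairs = list(combinations(aligned_runs, 2))
    for a, b in pairs:
        for k in range(n_topics):
            # jensenshannon normalises each column to sum to 1 and returns the
            # JS distance; squaring recovers the divergence.
            scores[k] += 1.0 - jensenshannon(a[:, k], b[:, k], base=2) ** 2
    return scores / len(pairs)


# Synthetic check: three runs, where the third scrambles topic 3 only.
rng = np.random.default_rng(0)
base = rng.dirichlet(np.full(4, 0.5), size=100)
noisy = base.copy()
noisy[:, 3] = rng.permutation(noisy[:, 3])
scores = pairwise_topic_stability([base, base.copy(), noisy])
print(scores)  # topics 0-2 score 1.0; topic 3 scores lower
```

A column that is entirely zero cannot be normalised, so a real implementation would need to decide how to treat empty topics; the sketch does not handle that case.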

BERTopic

from topic_stability.integrations.bertopic import from_bertopic

run, embeddings = from_bertopic(model, docs=None, *, embeddings=None, doc_ids=None)

Returns (TopicRun, embeddings_array).

Key differences from LDA/NMF:

  • BERTopic assigns each document to exactly one cluster (hard assignment). The doc_topic matrix is binary: 1 for the assigned topic, 0 elsewhere. Documents that HDBSCAN assigns to topic −1 (outliers) get an all-zero row.
  • model.probabilities_ contains HDBSCAN soft-membership scores, not topic-weight distributions. We do not use them: they are a geometric property of the embedding space and are not comparable to LDA posterior weights.
  • Word representations come from c-TF-IDF scores, not a generative word distribution. Cross-model word-based comparison is not meaningful.
  • Stability scores measure whether the same documents cluster together across runs, not whether the same word distributions recur.
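The hard-assignment matrix described in the first bullet can be illustrated with plain numpy. This is a sketch of the idea only: `hard_assignment_matrix` is a hypothetical helper, not part of the package, and the labels are made up.

```python
import numpy as np


def hard_assignment_matrix(topic_labels, n_topics):
    """One-hot doc-topic matrix; label -1 (outlier) yields an all-zero row."""
    labels = np.asarray(topic_labels)
    doc_topic = np.zeros((labels.size, n_topics))
    assigned = labels >= 0
    # Set a single 1 in each assigned document's row, at its topic's column.
    doc_topic[np.flatnonzero(assigned), labels[assigned]] = 1.0
    return doc_topic


labels = [0, 2, -1, 1, 2]          # made-up assignments for five documents
doc_topic = hard_assignment_matrix(labels, n_topics=3)
print(doc_topic)
```

With a binary matrix, normalising a topic's column gives a uniform distribution over the documents assigned to it, which is why the resulting stability score reflects shared cluster membership rather than graded topic weights.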

CLI pipeline (Mallet / RustMallet)

The package includes CLI wrappers for a full file-based workflow:

# 1. Embed documents
topic-stability-embed corpus.tsv embeddings.npy

# 2. Project to 2D
topic-stability-project embeddings.npy umap_2d.csv

# 3. Estimate distributions from Mallet states
topic-stability-estimate model_42_run1/ 42 corpus.tsv

# 4. Visualize a single run
topic-stability-visualize umap_2d.csv model_42_run1/doc_topic_avg.csv \
    model_42_run1/word_topic_avg.csv topics.png

License

MIT

Download files

Source Distribution

topic_stability-0.1.0.tar.gz (161.8 kB)

Uploaded Source

Built Distribution

topic_stability-0.1.0-py3-none-any.whl (21.7 kB)

Uploaded Python 3

File details

Details for the file topic_stability-0.1.0.tar.gz.

File metadata

  • Download URL: topic_stability-0.1.0.tar.gz
  • Upload date:
  • Size: 161.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for topic_stability-0.1.0.tar.gz
Algorithm Hash digest
SHA256 ab5a780053be072b9270b58dbb4675a7ee0e670d9320fddbb0494a1727866eeb
MD5 75a01080e347b8cee76a1e8f646fae99
BLAKE2b-256 873ebcdf2dd044b2f2d655d13980380cbd03b52c2c5fb9a693f596c3a5cbbf2a

Provenance

The following attestation bundles were made for topic_stability-0.1.0.tar.gz:

Publisher: publish.yml on mimno/TopicStability

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file topic_stability-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: topic_stability-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 21.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for topic_stability-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 5bde09891fec06071be37694efa09f01b9b4b106ec32f98dce26a2740d4b1ff7
MD5 c3bd49f608979933445d979c81c2f092
BLAKE2b-256 6c46206e8bd2eefeb2026eb8688f0632a64305f1e9d7d4801abe9b4e2263dcfa

Provenance

The following attestation bundles were made for topic_stability-0.1.0-py3-none-any.whl:

Publisher: publish.yml on mimno/TopicStability

