Measure and visualize topic model stability across multiple runs

Maintainers

dmimno

topic-stability

Measure and visualize the stability of topic models across multiple runs.

Topic models are stochastic: two runs with the same settings generally produce topics in a different order, with no shared labels to match them up. topic-stability aligns topics across runs using sentence-embedding centroids and scores each topic by how consistently the same documents are assigned to it, measured as 1 minus the Jensen-Shannon divergence between runs. The result is a per-topic stability score in [0, 1] and a small-multiples UMAP visualization with stability annotated on each panel.

Works with any topic model that produces a document-topic matrix — LDA, NMF, BERTopic, and more.

Install

pip install topic-stability                      # core (numpy + scipy only)
pip install "topic-stability[embed]"             # + sentence-transformers
pip install "topic-stability[umap,viz]"          # + UMAP + matplotlib
pip install "topic-stability[all]"               # everything

Quick start

sklearn (LDA, NMF, …)

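from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from topic_stability import TopicRun, StabilityAnalysis, DocumentEmbedder

# texts: list of document strings; doc_ids: matching list of document IDs
X = CountVectorizer().fit_transform(texts)  # document-term matrix for LDA

# Embed documents once and cache to disk
embedder = DocumentEmbedder(cache_path="embeddings.npy")
embeddings = embedder.embed(texts, ids=doc_ids)

# Train several runs, one per random seed
runs = [TopicRun.from_sklearn(
            LatentDirichletAllocation(n_components=20, random_state=seed).fit(X), X
        ) for seed in range(5)]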

analysis = StabilityAnalysis(runs, embeddings=embeddings)
analysis.align()

print(analysis.topic_stability())   # array of shape (n_topics,)
print(analysis.overall_stability()) # scalar

analysis.visualize("topics.png")    # requires topic-stability[umap,viz]

Pass precomputed embeddings (e.g. from BERTopic)

from topic_stability.integrations.bertopic import from_bertopic

# model: a fitted BERTopic model; precomputed_embeddings: your document embedding array
run, embeddings = from_bertopic(model, embeddings=precomputed_embeddings)

See BERTopic notes below for important differences.

From files (Mallet / CSV pipeline)

runs = [
    TopicRun.from_csv(
        f"model_42_run{i}/doc_topic_avg.csv",
        word_topic_path=f"model_42_run{i}/word_topic_avg.csv",
    )
    for i in range(1, 6)
]

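# Load embeddings cached by an earlier embed() call
embedder = DocumentEmbedder(cache_path="embeddings.npy")
embeddings, _ = embedder.load()

analysis = StabilityAnalysis(runs, embeddings=embeddings)
analysis.align()
# precomputed_umap: (n_docs, 2) array of 2D coordinates computed beforehand
analysis.visualize("topics.png", umap_coords=precomputed_umap)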

API

`TopicRun`

One run's topic distributions.

| Constructor | Use when |
| --- | --- |
| `TopicRun.from_matrix(doc_topic, *, doc_ids, word_topic, vocab)` | You have numpy arrays |
| `TopicRun.from_sklearn(model, X, *, doc_ids, vocab)` | sklearn `transform()` interface |
| `TopicRun.from_csv(doc_topic_path, *, word_topic_path)` | CSV files from the CLI pipeline |
| `TopicRun.from_mallet_states(model_dir, *, iterations, tsv_path)` | Mallet `.gz` state files |

`DocumentEmbedder`

embedder = DocumentEmbedder(model="all-MiniLM-L6-v2", cache_path="embeddings.npy")
embeddings = embedder.embed(texts, ids=doc_ids)  # computes and caches
embeddings, ids = embedder.load()                # load from cache

Pass the returned array directly to StabilityAnalysis(runs, embeddings=embeddings).

`StabilityAnalysis`

analysis = StabilityAnalysis(runs, embeddings, *, doc_ids=None)
analysis.align(reference=0)         # must call before scoring
analysis.topic_stability()          # ndarray (n_topics,) in [0, 1]
analysis.overall_stability()        # float
analysis.umap_projection(**kwargs)  # ndarray (n_docs, 2)
analysis.visualize(path, *, reference_run=0, umap_coords=None)

Alignment uses cosine similarity of per-topic embedding centroids (centroid_k = Σ_d θ_dk · e_d, normalised) matched with the Hungarian algorithm. No shared vocabulary is required, so runs from different model types can be compared.
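The alignment step can be sketched with numpy and scipy alone. This is an illustrative sketch under assumptions, not the package's actual implementation: the helper names `topic_centroids` and `align_to_reference` are hypothetical, and the toy data simply reorders the columns of one run to check that the matching recovers the original order.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def topic_centroids(theta, embeddings):
    """Unit-length, theta-weighted embedding centroid for each topic.

    theta:      (n_docs, n_topics) document-topic weights
    embeddings: (n_docs, dim) document embeddings
    """
    centroids = theta.T @ embeddings                       # (n_topics, dim)
    norms = np.linalg.norm(centroids, axis=1, keepdims=True)
    return centroids / np.clip(norms, 1e-12, None)         # guard empty topics


def align_to_reference(theta_ref, theta_other, embeddings):
    """Return perm so that topic perm[k] of the other run matches reference topic k."""
    c_ref = topic_centroids(theta_ref, embeddings)
    c_other = topic_centroids(theta_other, embeddings)
    similarity = c_ref @ c_other.T                         # cosine similarity matrix
    # Hungarian algorithm: one-to-one matching that maximises total similarity.
    _, cols = linear_sum_assignment(similarity, maximize=True)
    return cols


# Toy check: the second run is the first with its topic columns reordered.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 8))
theta_ref = rng.dirichlet(np.full(4, 0.1), size=50)
shuffle = np.array([2, 0, 3, 1])
theta_other = theta_ref[:, shuffle]

perm = align_to_reference(theta_ref, theta_other, embeddings)
aligned = theta_other[:, perm]                             # columns back in reference order
```

Because the matching only looks at centroids in embedding space, the two matrices need the same documents in the same row order but not the same vocabulary.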

Stability score for topic k: mean pairwise 1 − JS(p, q) where p and q are the normalised document-profile columns θ[:,k] (treated as a distribution over documents) from each pair of aligned runs.
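As a rough sketch of the score described above, the following computes the mean pairwise 1 − JS per topic for runs that are already aligned. It assumes base-2 logarithms, which bound the Jensen-Shannon divergence by 1; the package's exact log base and edge-case handling are not stated here, and the function names are hypothetical.

```python
from itertools import combinations

import numpy as np


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the value lies in [0, 1]."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0                       # 0 * log(0) is taken as 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def topic_stability(aligned_thetas):
    """Mean pairwise 1 - JS per topic over aligned (n_docs, n_topics) arrays."""
    n_topics = aligned_thetas[0].shape[1]
    # Normalise each column so it is a distribution over documents.
    profiles = [
        t / np.clip(t.sum(axis=0, keepdims=True), 1e-12, None)
        for t in aligned_thetas
    ]
    pairs = list(combinations(range(len(profiles)), 2))
    scores = np.zeros(n_topics)
    for a, b in pairs:
        for k in range(n_topics):
            scores[k] += 1.0 - js_divergence(profiles[a][:, k], profiles[b][:, k])
    return scores / len(pairs)


# Topic 0 holds the same documents in both runs; topic 1 holds disjoint documents.
run_a = np.array([[1, 0], [1, 0], [0, 1], [0, 0]], dtype=float)
run_b = np.array([[1, 0], [1, 0], [0, 0], [0, 1]], dtype=float)
scores = topic_stability([run_a, run_b])
print(scores)  # topic 0 scores 1.0, topic 1 scores 0.0
```

The two extremes in the toy data show the scale: identical document profiles give a score of 1, and profiles with no documents in common give 0.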

BERTopic

from topic_stability.integrations.bertopic import from_bertopic

run, embeddings = from_bertopic(model, docs=None, *, embeddings=None, doc_ids=None)

Returns (TopicRun, embeddings_array).

Key differences from LDA/NMF:

- BERTopic assigns each document to exactly one cluster (hard assignment). The doc_topic matrix is binary: 1 for the assigned topic, 0 elsewhere. Documents that HDBSCAN assigns to topic −1 (outliers) get an all-zero row.
- model.probabilities_ contains HDBSCAN soft-membership scores, not topic-weight distributions. We do not use them: they are a geometric property of the embedding space, not comparable to LDA posterior weights.
- Word representations come from c-TF-IDF scores, not a generative word distribution. Cross-model word-based comparison is not meaningful.
- Stability scores measure whether the same documents cluster together across runs, not whether the same word distributions recur.
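The binary doc_topic matrix described in the first point can be illustrated as follows. This is a sketch of the idea, not the code inside `from_bertopic`: the helper `hard_assignments_to_matrix` is hypothetical, and the label list stands in for the per-document topic labels that a fitted BERTopic model exposes as `model.topics_`.

```python
import numpy as np


def hard_assignments_to_matrix(topics, n_topics=None):
    """One-hot (n_docs, n_topics) matrix from hard labels; label -1 gives an all-zero row."""
    labels = np.asarray(topics)
    if n_topics is None:
        n_topics = int(labels.max()) + 1
    doc_topic = np.zeros((labels.shape[0], n_topics))
    assigned = labels >= 0                                  # skip outliers (-1)
    doc_topic[np.flatnonzero(assigned), labels[assigned]] = 1.0
    return doc_topic


labels = [0, 2, -1, 1, 2]            # third document is an outlier
matrix = hard_assignments_to_matrix(labels)
print(matrix.sum(axis=1))            # [1. 1. 0. 1. 1.]
```

With a matrix of this form, each topic column normalises to a uniform distribution over the documents in that cluster, so the stability score compares cluster membership across runs.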

CLI pipeline (Mallet / RustMallet)

The package includes CLI wrappers for a full file-based workflow:

# 1. Embed documents
topic-stability-embed corpus.tsv embeddings.npy

# 2. Project to 2D
topic-stability-project embeddings.npy umap_2d.csv

# 3. Estimate distributions from Mallet states
topic-stability-estimate model_42_run1/ 42 corpus.tsv

# 4. Visualize a single run
topic-stability-visualize umap_2d.csv model_42_run1/doc_topic_avg.csv \
    model_42_run1/word_topic_avg.csv topics.png

License

MIT

Release history

0.1.0 (this version), released Jun 16, 2026

Download files

Source Distribution

topic_stability-0.1.0.tar.gz (161.8 kB), uploaded Jun 16, 2026, Source

Built Distribution

topic_stability-0.1.0-py3-none-any.whl (21.7 kB), uploaded Jun 16, 2026, Python 3

File details

Details for the file topic_stability-0.1.0.tar.gz.

File metadata

Download URL: topic_stability-0.1.0.tar.gz
Upload date: Jun 16, 2026
Size: 161.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for topic_stability-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`ab5a780053be072b9270b58dbb4675a7ee0e670d9320fddbb0494a1727866eeb`
MD5	`75a01080e347b8cee76a1e8f646fae99`
BLAKE2b-256	`873ebcdf2dd044b2f2d655d13980380cbd03b52c2c5fb9a693f596c3a5cbbf2a`

Provenance

The following attestation bundles were made for topic_stability-0.1.0.tar.gz:

Publisher: publish.yml on mimno/TopicStability

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: topic_stability-0.1.0.tar.gz
- Subject digest: ab5a780053be072b9270b58dbb4675a7ee0e670d9320fddbb0494a1727866eeb
- Sigstore transparency entry: 1841213757
- Sigstore integration time: Jun 16, 2026
Source repository:
- Permalink: mimno/TopicStability@e42573c459c3ca3513213858517fa9345cd1c7ad
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/mimno
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@e42573c459c3ca3513213858517fa9345cd1c7ad
- Trigger Event: push

File details

Details for the file topic_stability-0.1.0-py3-none-any.whl.

File metadata

Download URL: topic_stability-0.1.0-py3-none-any.whl
Upload date: Jun 16, 2026
Size: 21.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for topic_stability-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5bde09891fec06071be37694efa09f01b9b4b106ec32f98dce26a2740d4b1ff7`
MD5	`c3bd49f608979933445d979c81c2f092`
BLAKE2b-256	`6c46206e8bd2eefeb2026eb8688f0632a64305f1e9d7d4801abe9b4e2263dcfa`

Provenance

The following attestation bundles were made for topic_stability-0.1.0-py3-none-any.whl:

Publisher: publish.yml on mimno/TopicStability

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: topic_stability-0.1.0-py3-none-any.whl
- Subject digest: 5bde09891fec06071be37694efa09f01b9b4b106ec32f98dce26a2740d4b1ff7
- Sigstore transparency entry: 1841213787
- Sigstore integration time: Jun 16, 2026
Source repository:
- Permalink: mimno/TopicStability@e42573c459c3ca3513213858517fa9345cd1c7ad
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/mimno
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@e42573c459c3ca3513213858517fa9345cd1c7ad
- Trigger Event: push
