Fast, low-memory spectral topic modeling via on-the-fly rectification
anchortopics
A fast, low-memory Python implementation of On-the-Fly Rectification (OTFR) spectral topic modeling, introduced by Moontae Lee, Sungjun Cho, Kun Dong, David Mimno, and David Bindel in "On-the-Fly Rectification for Robust Large-Vocabulary Topic Inference" (ICML 2021). OTFR builds on the Joint-Stochastic Matrix Factorization / Anchor Word framework (Arora et al. 2013; Lee et al. 2015, JSMF / pyJSMF-RAW).
Spectral ("anchor word") topic models recover topics from low-order word
co-occurrence statistics in one shot — no sampling, no EM, just an
eigendecomposition and some linear algebra. The catch is that empirical
co-occurrence matrices are noisy, indefinite, and dense once rectified, so
the classical pipeline costs O(N^2) space and O(N^2 K) time in the
vocabulary size N. OTFR avoids ever forming the N x N matrix: it
maintains the rectified co-occurrence as an implicit low-rank-plus-sparse
operator (Y Y^T + E + r * 1 1^T) and recovers anchor words directly from
the low-rank factor Y, bringing the cost down to O(N K) space and
O(N K^2) time.
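To make the storage argument concrete, here is a minimal sketch of applying the implicit operator to a vector without forming the N x N matrix. It is an illustration only, not the package's internal code: the helper name `apply_rectified`, the toy sizes, and the random factors are all assumptions.

```python
import numpy as np
from scipy import sparse


def apply_rectified(Y, E, r, v):
    """Compute (Y @ Y.T + E + r * ones((N, N))) @ v without the N x N matrix.

    Hypothetical helper for illustration; not part of the anchortopics API.
    """
    low_rank = Y @ (Y.T @ v)                  # two thin products, O(N K)
    sparse_part = E @ v                       # O(nnz(E))
    rank_one = r * v.sum() * np.ones(len(v))  # (1 1^T) v = sum(v) * 1
    return low_rank + sparse_part + rank_one


rng = np.random.default_rng(0)
N, K = 400, 5
Y = rng.standard_normal((N, K))

# Toy stand-in for the sparse correction E: a few hundred symmetric nonzeros.
rows = rng.integers(0, N, size=300)
cols = rng.integers(0, N, size=300)
vals = rng.random(300)
E = sparse.coo_matrix((vals, (rows, cols)), shape=(N, N)).tocsr()
E = E + E.T

r = 1e-3
v = rng.standard_normal(N)
implicit = apply_rectified(Y, E, r, v)

# Dense reference, feasible only because N is tiny in this toy example.
dense = (Y @ Y.T + E.toarray() + r * np.ones((N, N))) @ v
print(np.allclose(implicit, dense))
```

The implicit product only ever touches Y (N x K), the nonzeros of E, and one scalar, which is where the O(N K) storage figure comes from.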
Install
uv add anchortopics # or: pip install anchortopics
Quickstart
from sklearn.feature_extraction.text import CountVectorizer
from anchortopics import OTFR
vectorizer = CountVectorizer(max_features=20000, stop_words="english")
X = vectorizer.fit_transform(documents) # (n_documents, n_vocab) sparse counts
model = OTFR(n_topics=20).fit(X)
for k, words in enumerate(model.top_words(vectorizer.get_feature_names_out(), n_words=10)):
    print(k, words)
model.components_ is the (n_topics, n_vocab) word-topic distribution
matrix (p(word | topic)), model.topic_correlation_ is the (n_topics, n_topics) topic-topic correlation matrix, and model.anchors_ holds the
vocabulary indices selected as anchor words.
If you already have a co-occurrence matrix C (dense or sparse, N x N),
use model.fit_cooccurrence(C) instead of fit(X).
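For reference, the kind of matrix C meant here can be illustrated with a small sketch. It assumes the usual unbiased estimator from the anchor-word literature, C = (1/M) * sum over documents of (x x^T - diag(x)) / (n (n - 1)), where x is a document's count vector and n its length. The package's exact normalization may differ, and `cooccurrence_matvec` is a hypothetical name, not part of the anchortopics API.

```python
import numpy as np
from scipy import sparse


def cooccurrence_matvec(X, v):
    """Return C @ v for the assumed estimator, using only sparse products with X."""
    X = sparse.csr_matrix(X, dtype=np.float64)
    n = np.asarray(X.sum(axis=1)).ravel()   # document lengths
    keep = n >= 2                           # a document needs two tokens to form a pair
    X, n = X[keep], n[keep]
    w = 1.0 / (n * (n - 1.0))               # per-document weight
    M = X.shape[0]
    outer = X.T @ (w * (X @ v))             # sum_m w_m x_m (x_m . v)
    diag = (X.T @ w) * v                    # sum_m w_m diag(x_m) v
    return (outer - diag) / M


rng = np.random.default_rng(0)
X_toy = rng.poisson(0.3, size=(50, 30))     # toy (n_documents, n_vocab) counts
v = rng.standard_normal(X_toy.shape[1])
Cv = cooccurrence_matvec(X_toy, v)

# Dense reference built document by document (fine for a toy vocabulary).
docs = [x.astype(float) for x in X_toy if x.sum() >= 2]
C_dense = sum(
    (np.outer(x, x) - np.diag(x)) / (x.sum() * (x.sum() - 1.0)) for x in docs
) / len(docs)
print(np.allclose(Cv, C_dense @ v), np.isclose(C_dense.sum(), 1.0))
```

Under this estimator the entries of C sum to one, which is the joint-stochastic property the rest of the pipeline relies on.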
For very large vocabularies/corpora, pass randomized_init=True to
initialize the rectification with a one-pass randomized eigendecomposition
computed directly from the word-document counts (Halko, Martinsson & Tropp
2011), rather than ARPACK.
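For orientation, the core idea behind a randomized eigendecomposition in the style of Halko, Martinsson & Tropp fits in a few lines. The sketch below is the generic range-finder with power iterations, which takes several passes over the operator; it is not the one-pass variant mentioned above and not the package's code. `randomized_eigh` and its parameters are hypothetical names.

```python
import numpy as np


def randomized_eigh(matmat, n, k, oversample=10, n_power_iter=2, seed=0):
    """Approximate top-k eigenpairs of a symmetric PSD operator given only matmat."""
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((n, k + oversample))   # random test matrix
    Q, _ = np.linalg.qr(matmat(omega))                 # basis for the sampled range
    for _ in range(n_power_iter):
        Q, _ = np.linalg.qr(matmat(Q))                 # sharpen the basis
    B = Q.T @ matmat(Q)                                # small projected problem
    B = (B + B.T) / 2                                  # symmetrize against round-off
    w, V = np.linalg.eigh(B)
    top = np.argsort(w)[::-1][:k]
    return w[top], Q @ V[:, top]


rng = np.random.default_rng(1)
n, rank, k = 300, 8, 5
H = rng.standard_normal((n, rank))


def matmat(V):
    # Applies H @ H.T to a block of vectors without forming the n x n matrix.
    return H @ (H.T @ V)


w, U = randomized_eigh(matmat, n, k)

# Dense check, feasible only at toy size.
exact = np.linalg.eigvalsh(H @ H.T)[::-1][:k]
print(np.allclose(w, exact))
```

Because the operator is only touched through `matmat`, the same pattern applies when the products are computed from sparse word-document counts.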
How it works
- CooccurrenceOperator (anchortopics.cooccurrence) builds the unbiased joint-stochastic co-occurrence estimator of Arora et al. as an implicit linear operator over a sparse word-document count matrix, with O(nnz(X)) matrix-vector products. The dense matrix is never formed.
- enn_rectify (anchortopics.enn) runs Epsilon Non-Negative (ENN) rectification: it iteratively projects toward the nearest joint-stochastic, rank-K, (epsilon-)non-negative matrix, representing the non-negativity correction as a sparse matrix E rather than a dense one.
- law (anchortopics.law) runs the Low-rank Anchor Word algorithm: it selects anchor words via column-pivoted QR performed in the K-dimensional compressed space (equivalent to pivoting on the full matrix, per Lemma 1 of the paper, but O(N K^2) instead of O(N^2 K)), then recovers the word-topic matrix B and the topic correlation matrix A.
anchortopics.model.OTFR wires these together behind a small, sklearn-style fit
API.
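The anchor-selection step can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's law implementation: it takes a factor Y with C = Y Y^T, ignores the sparse and rank-one terms, and uses a hypothetical helper name `select_anchors_compressed`. The toy data plants three anchor words so the result can be checked.

```python
import numpy as np
from scipy.linalg import qr


def select_anchors_compressed(Y, n_anchors):
    """Pick anchors by column-pivoted QR on K-dimensional row representations."""
    row_sums = Y @ Y.sum(axis=0)            # C @ 1 = Y (Y^T 1), no N x N matrix
    _, R = np.linalg.qr(Y)                  # Y = Q R with Q orthonormal, R is K x K
    # Row i of C equals (Y[i] @ R.T) @ Q.T, so Y @ R.T preserves norms and angles.
    Z = (Y @ R.T) / row_sums[:, None]       # compressed, row-normalized rows
    _, _, pivots = qr(Z.T, mode="economic", pivoting=True)
    return pivots[:n_anchors]


rng = np.random.default_rng(0)
N, K = 40, 3
B = rng.random((N, K)) + 0.1                # toy word-topic matrix
B[:K] = np.eye(K)                           # planted anchors: words 0, 1, 2
B /= B.sum(axis=0)                          # columns are p(word | topic)
M = np.eye(K) + 0.1 * rng.random((K, K))    # topic matrix A = M @ M.T
Y = B @ M                                   # so that Y @ Y.T = B A B^T

anchors = select_anchors_compressed(Y, K)

# Reference: pivot on the full row-normalized matrix (toy size only).
C_bar = Y @ Y.T
C_bar /= C_bar.sum(axis=1, keepdims=True)
_, _, full_pivots = qr(C_bar.T, pivoting=True)
print(sorted(anchors.tolist()), sorted(full_pivots[:K].tolist()))
```

The pivoted QR here runs on a K x N matrix, so its cost scales as N K^2 rather than N^2 K.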
Diagnostics
anchortopics.diagnostics computes everything below from the fitted model's own
attributes — no held-out data, human judgments, or extra passes over the
corpus required:
from anchortopics import diagnostics
vocab = vectorizer.get_feature_names_out()
report = diagnostics.summary(model, vocabulary=vocab)
print(report["specificity"], report["dissimilarity"])
for c in report["stopword_candidates"]:
    print(c["word"], c["topic_entropy"])
- stopword_candidates: ranks frequent words whose posterior over topics is close to uniform (high entropy) as candidates to add to a stoplist. On a real abstracts corpus this reliably surfaces academic boilerplate ("furthermore", "state-of-the-art", "leveraging", "address") rather than topical words.
- relative_approximation: ‖C − BAB^T‖_F / ‖C‖_F against the original (unrectified) co-occurrence, estimated in O(NK) via a Hutchinson trace estimator, so the dense matrix is never materialized.
- relative_recovery: how well the selected anchors reconstruct the rest of the normalized co-occurrence space.
- relative_dominancy: how concentrated the topic-correlation matrix A is on its diagonal (independence) vs. off-diagonal (correlation).
- specificity: average KL divergence of each topic from the corpus unigram distribution; low values flag generic, high-frequency-word-driven topics.
- dissimilarity: fraction of each topic's top words that don't recur in any other topic's top words; low values flag redundant topics.
- eigengap: relative gap between the K-th and (K+1)-th eigenvalues of the rectified spectrum; a shrinking gap as K grows is a sign you're past the number of topics the data actually supports.
These catch degenerate cases (redundant or generic topics, over-large K) rather than confirming topics are meaningful — they're a complement to, not a replacement for, reading the topics. (Not implemented: the paper's MST-Incoherence metric, which needs an NPMI graph over prominent + "characteristic" words per topic — a reasonable follow-up.)
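To show what three of these quantities measure, here is a small NumPy sketch written from the one-line descriptions above. The function bodies are assumptions: the package's own definitions may differ in details such as smoothing or restricting to frequent words. The toy `components` and `topic_weights` arrays are made up for illustration.

```python
import numpy as np


def specificity(components, unigram):
    """Mean KL(topic || unigram) over topics; rows of components are p(word | topic)."""
    p = np.clip(components, 1e-12, None)
    q = np.clip(unigram, 1e-12, None)
    return float(np.mean(np.sum(components * (np.log(p) - np.log(q)), axis=1)))


def dissimilarity(components, n_words=10):
    """Fraction of top words that appear in exactly one topic's top-word list."""
    top = np.argsort(-components, axis=1)[:, :n_words]
    counts = np.bincount(top.ravel(), minlength=components.shape[1])
    return float(np.mean(counts[top] == 1))


def topic_entropy(components, topic_weights):
    """Entropy of p(topic | word) for every word; high values suggest stopwords."""
    joint = components * topic_weights[:, None]            # p(topic, word)
    post = joint / np.clip(joint.sum(axis=0, keepdims=True), 1e-12, None)
    return -np.sum(post * np.log(np.clip(post, 1e-12, None)), axis=0)


# Two toy topics over a four-word vocabulary.
components = np.array([[0.6, 0.3, 0.1, 0.0],
                       [0.0, 0.1, 0.3, 0.6]])
topic_weights = np.array([0.5, 0.5])
unigram = topic_weights @ components

spec = specificity(components, unigram)
dis_top2 = dissimilarity(components, n_words=2)
dis_top3 = dissimilarity(components, n_words=3)
entropy = topic_entropy(components, topic_weights)
print(spec, dis_top2, dis_top3, entropy)
```

In this toy case the top-2 lists do not overlap, while the top-3 lists share two words, so the dissimilarity drops as the lists grow.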
Development
uv sync
uv run pytest
uv run python examples/synthetic_example.py